TIP: Tabular-Image Pre-training for multimodal classification with incomplete data
File(s)
2407.07582v1.pdf (15.68 MB)
Accepted version
Author(s)
Type
Chapter
Abstract
Images and structured tables are essential parts of real-world databases. Though tabular-image representation learning is promising for creating new insights, it remains a challenging task, as tabular data is typically heterogeneous and incomplete, presenting significant modality disparities with images. Earlier works have mainly focused on simple modality fusion strategies in complete data scenarios, without considering the missing data issue, and thus are limited in practice. In this paper, we propose TIP, a novel tabular-image pre-training framework for learning multimodal representations robust to incomplete tabular data. Specifically, TIP investigates a novel self-supervised learning (SSL) strategy, including a masked tabular reconstruction task to tackle data missingness, and image-tabular matching and contrastive learning objectives to capture multimodal information. Moreover, TIP proposes a versatile tabular encoder tailored for incomplete, heterogeneous tabular data and a multimodal interaction module for inter-modality representation learning. Experiments are performed on downstream multimodal classification tasks using both natural and medical image datasets. The results show that TIP outperforms state-of-the-art supervised/SSL image/multimodal methods in both complete and incomplete data scenarios. Our code is available at https://github.com/siyi-wind/TIP.
Editor(s)
Leonardis, A
Date Issued
2024-11-22
Citation
Computer Vision - ECCV 2024, 2024, 15073, pp.478-496
ISBN
978-3-031-72632-3
Publisher
Springer
Start Page
478
End Page
496
Journal / Book Title
Computer Vision - ECCV 2024
Volume
15073
Copyright Statement
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG.
Subjects
Computer Science
Computer Science, Artificial Intelligence
Computer Science, Interdisciplinary Applications
Computer Science, Theory & Methods
HEALTH INFORMATION-TECHNOLOGY
Image-tabular Representation Learning
Missing Data
MISSING DATA
Multimodal
MULTIPLE IMPUTATION
Science & Technology
Self-supervised Learning
Technology
Publication Status
Published