Recently, the paper "MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant," led by ZJUI Prof. Wang Hongwei of the Data Science and Knowledge Engineering Laboratory (DSKE LAB), was accepted by CVPR 2024, a top conference in computer vision. The work introduces a framework for medical multi-modal generation that aligns, extracts, and generates multiple medical modalities within a unified model. The first author is Zhan Chenlu, a doctoral student (Class of 2026) from DSKE LAB; Prof. Wang Hongwei and Assistant Prof. Wang Gaoang of ZJUI are the corresponding authors.
In recent years, advanced medical generative works based on denoising diffusion models have significantly improved the efficiency of medical diagnostic tasks. However, most of these medical generative models rely on separate single-flow pipelines for specialized generation tasks, which makes the process cumbersome and slow. In real-world medical scenarios that require integrating multiple medical modalities for analysis, this generative approach is difficult to extend. In addition, recent multi-modal generative works struggle to extract specific medical knowledge and to leverage the limited medical paired data available for cross-modal generation. These deficiencies motivate the construction of a unified medical generative model capable of handling tasks across multiple medical modalities. However, several non-trivial challenges remain:
(1) The substantial disparities among multiple medical modalities pose significant challenges to alignment and incur high costs.
(2) Unlike images in the general domain, the medical imaging modalities (CT, MRI, X-ray) each possess specific clinical properties. Conventional unified alignment methods often mix these modality-specific properties together.
(3) Unlike general multi-modal generative models pre-trained on large, well-matched cross-modal databases, the medical domain lacks cross-modal paired training datasets, which makes it difficult to train medical multi-modal generative capabilities.
Introduction
▲ Principal Framework of MedM2G
To address the above challenges, the paper proposes MedM2G, a unified Medical Multi-Modal Generative Model that aligns, extracts, and generates multiple medical modalities within a single model. MedM2G enables medical multi-modal generation through the interaction of multiple diffusion models. The paper is primarily motivated by the following goals:
1. MedM2G can generate paired data for arbitrary modalities. By using the generated data for pre-training, the performance of downstream tasks (classification, segmentation, detection, translation) can be improved.
2. MedM2G can compensate for scarce medical modalities through generation.
3. MedM2G can fuse and generate multiple modalities for comprehensive medical analysis.
4. MedM2G can handle multiple tasks within a unified model and achieves SOTA results.
The main contributions of this paper include the following:
1. Proposing MedM2G, the first unified medical multi-flow generative framework capable of aligning, extracting, and generating multiple medical modalities.
2. Presenting a multi-flow cross-guided diffusion strategy that uses adaptive parameters as conditions for efficient medical multi-modal generation, together with medical visual invariant preservation to maintain specific medical knowledge. Specifically, the team first proposes a central alignment applied efficiently in the shared input and output space: the embedding of each modality is simply aligned with the text embedding, which yields alignment across all modalities. Notably, to maintain the specific medical knowledge of the three medical imaging modalities during cross-modal concept generation, the team proposes medical visual invariant preservation, which minimizes the off-diagonal elements of the cross-correlation between two augmented views for better feature extraction (a minimal sketch of this objective is given after this list). Moreover, since boosting cross-modal interaction is crucial, the team conditions each cross-modal diffuser on the adaptive representation and a shareable cross-attention sub-layer. Combined with the proposed multi-flow training strategy, the model can seamlessly handle multiple medical generation tasks without cross-modal paired datasets.
3. MedM2G attains state-of-the-art results on 5 medical multi-modal generation tasks across 10 corresponding benchmarks. In addition, MedM2G can generate paired medical modality data for pre-training, effectively enhancing the performance of downstream tasks such as classification, segmentation, detection, and translation.
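To make the visual invariant preservation idea concrete, the sketch below shows one plausible Barlow-Twins-style form of such an objective: the cross-correlation matrix of the embeddings of two augmented views is driven toward the identity, so its off-diagonal elements are minimized. This is only an illustrative sketch under that assumption, not the paper's exact formulation; the function name, the standardization step, and the off-diagonal weight are hypothetical choices.

```python
import torch


def visual_invariant_loss(z1: torch.Tensor, z2: torch.Tensor,
                          off_diag_weight: float = 5e-3) -> torch.Tensor:
    """Illustrative invariance objective between two augmented views.

    z1, z2: (batch, dim) embeddings of two augmentations of the same
    medical image (e.g. a CT, MRI, or X-ray slice).
    """
    # Standardize each feature dimension across the batch.
    z1 = (z1 - z1.mean(dim=0)) / (z1.std(dim=0) + 1e-6)
    z2 = (z2 - z2.mean(dim=0)) / (z2.std(dim=0) + 1e-6)

    # Cross-correlation matrix between the two views, shape (dim, dim).
    c = (z1.T @ z2) / z1.shape[0]

    # Diagonal entries are pushed toward 1 (the two views agree on each
    # feature); off-diagonal entries are pushed toward 0, i.e. the
    # "minimize the off-diagonal elements" step described above.
    diag = torch.diagonal(c)
    on_diag = (diag - 1.0).pow(2).sum()
    off_diag = (c - torch.diag(diag)).pow(2).sum()
    return on_diag + off_diag_weight * off_diag


# Toy usage: two views of a batch of 32 embeddings of dimension 256.
z1, z2 = torch.randn(32, 256), torch.randn(32, 256)
loss = visual_invariant_loss(z1, z2)
```

Under this formulation, keeping the diagonal near 1 makes the representation invariant to the augmentations, while suppressing the off-diagonal terms decorrelates the feature dimensions, which is how such objectives preserve modality-specific visual information without cross-modal paired labels.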
▲ The multi-flow training strategy
▲ Generated results and qualitative analysis of medical modalities
▲ Comparison of the baseline and the SOTA medical vision-language pre-training model MGCA after adding data generated by MedM2G and by SOTA medical generative models
About the author
Zhan Chenlu is a doctoral student jointly supervised by ZJUI and the College of Computer Science and Technology, co-advised by Prof. Wang Hongwei, Assistant Prof. Wang Gaoang, and Assistant Prof. Lin Yu. Her research focuses on multi-modal learning and medical image processing.
About the paper
Article: MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
Article Link: https://arxiv.org/abs/2403.04290