Show simple item record

dc.contributor.author    Maleki, Danial
dc.description.abstract    In recent years, the exponential growth of data across various domains has necessitated advanced techniques for processing and analyzing multi-modal big data. This is particularly relevant in the medical domain, where data comes in diverse formats such as images, reports, and molecular data. Consequently, bidirectional cross-modal data retrieval has become crucial for numerous research disciplines and domains. Cross-modal retrieval seeks a shared latent space in which different modalities, such as image-text pairs, are closely related. Obtaining high-quality vision and text embeddings is vital for achieving this objective. Although training language models is feasible thanks to the availability of public data and the absence of labelling requirements, training vision models to generate effective embeddings is challenging because labelled data for supervised models is scarce. To address this challenge, an end-to-end approach to learning vision embeddings in a self-supervised manner, coined H-DINO+LILE, is introduced through a modification of the DINO model. The proposed modification replaces DINO's local and global patching scheme with a new harmonizing patching approach, termed H-DINO, in which the magnitude of the various augmentations is kept consistent. This method captures the contextual information of images more consistently, thereby improving feature representation and retrieval accuracy. Furthermore, a unique architecture is proposed that integrates self-supervised learning and cross-modal retrieval modules in a back-to-back configuration, enabling improved representation of cross-modal and individual modalities using self-attention and cross-attention modules. This architecture is trained end-to-end with a new loss term that facilitates image and text representation in the joint latent space.
The efficacy of the proposed framework is validated on various private and public datasets across diverse tasks, including patch-based (sub-image) and WSI-based (whole-slide image) retrieval as well as text retrieval. This thesis demonstrates that the proposed framework significantly improves cross-modal retrieval within the medical domain. Moreover, its applicability extends beyond the medical field to other domains whose methodologies require cross-modal retrieval and involve patching gigapixel images.
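The harmonizing patching idea described in the abstract, namely keeping the magnitude of the crop augmentations consistent rather than mixing DINO's large global views with small local ones, can be sketched as follows. This is an illustrative sketch only: the function name, crop count, and scale band are assumptions for exposition, not the thesis's actual H-DINO parameters.

```python
import random

def harmonized_crop_boxes(width, height, n_crops=8, scale=(0.4, 0.6)):
    """Sample n_crops (left, top, right, bottom) crop boxes whose area
    ratio is drawn from a single narrow band, so every augmented view
    retains a comparable field of view. This contrasts with DINO-style
    multi-crop, which mixes large global and small local crops.
    Illustrative sketch; the thesis's exact H-DINO settings may differ."""
    boxes = []
    for _ in range(n_crops):
        s = random.uniform(*scale)              # crop area / image area
        cw = max(1, int(width * s ** 0.5))      # crop width preserving aspect
        ch = max(1, int(height * s ** 0.5))     # crop height preserving aspect
        x = random.randint(0, width - cw)       # random top-left corner
        y = random.randint(0, height - ch)
        boxes.append((x, y, x + cw, y + ch))
    return boxes
```

Each box would then be resized to the network's input resolution; because all boxes cover a similar fraction of the image, the student and teacher views carry consistent contextual information.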
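A joint image-text latent space of the kind described is commonly trained with a symmetric contrastive (InfoNCE) objective, in which each image must retrieve its paired text and vice versa. The sketch below shows that generic objective in plain Python; it is an assumed stand-in for illustration and does not reproduce the thesis's full loss, which adds further terms.

```python
import math

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs: row i of img_emb
    is paired with row i of txt_emb, and the diagonal of the similarity
    matrix is the target in both retrieval directions."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    im = [normalize(v) for v in img_emb]
    tx = [normalize(v) for v in txt_emb]
    # cosine-similarity logits, scaled by the temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in tx]
              for i in im]

    def ce(rows):  # cross-entropy with the diagonal as the target class
        loss = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[k]
        return loss / len(rows)

    cols = [list(c) for c in zip(*logits)]   # text-to-image direction
    return 0.5 * (ce(logits) + ce(cols))
```

When matched pairs are close and mismatched pairs are far apart in the joint space, the loss approaches zero; swapping the pairings drives it up, which is what pushes the two encoders toward a shared latent space.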
dc.publisher    University of Waterloo
dc.subject    Machine Learning
dc.subject    Self-Supervised Learning
dc.subject    Cross-Modality Retrieval
dc.subject    Digital Pathology
dc.title    Harmonizing the Scale: An End-to-End Self-Supervised Approach for Cross-Modal Data Retrieval in Histopathology Archives
dc.type    Doctoral Thesis
dc.pending    false
uws-etd.degree.department    Design Engineering
uws-etd.degree.discipline    Design Engineering
uws-etd.degree.grantor    University of Waterloo
uws-etd.degree    Doctor of Philosophy
uws-etd.embargo.terms    1 year
uws.contributor.advisor    Tizhoosh, Hamid
uws.contributor.advisor    Rahnamayan, Shahryar
uws.contributor.affiliation1    Faculty of Engineering


