
dc.contributor.author: Millan Arias, Pablo
dc.date.accessioned: 2024-05-22 19:09:25 (GMT)
dc.date.available: 2024-05-22 19:09:25 (GMT)
dc.date.issued: 2024-05-22
dc.date.submitted: 2024-05-10
dc.identifier.uri: http://hdl.handle.net/10012/20581
dc.description.abstract: Amid the recent surge in next-generation sequencing technologies, alignment-free algorithms stand out as a promising alternative to traditional alignment-based methods in phylogenetic analyses. Specifically, the use of genomic signatures has enabled the success of supervised machine learning-based alignment-free methods in taxonomic classification. Motivated by this success, this dissertation investigates the potential of unsupervised learning-based alignment-free algorithms in genomic signature categorization. We conclude that meaningful information can be learned without reliance on labels, suggesting that supervision can be effectively eliminated from the learning process. First, we developed a Deep Learning-based Unsupervised Clustering method for DNA Sequences, DeLUCS. It trains a discriminative neural network to identify meaningful taxonomic clusters without supervision. In this process, we designed and conducted several proof-of-concept experiments to validate the effectiveness of our methodology in various datasets. Building on the contrastive nature of DeLUCS, we enhance it through self-supervised representation learning. We introduce iDeLUCS and its applicability in non-parametric clustering of DNA sequences, matching the performance of alignment-based and alignment-assisted clustering algorithms. In addition, we successfully apply unsupervised learning to categorize the genomic signatures of microbial extremophiles. We provide quantitative evidence suggesting that microbial extremophile genomes may contain information beyond ancestry or taxonomy. The evidence provided by our computational experiments led to the biological insight that a pervasive environmental component exists in the genomic signature of extremophilic organisms and could potentially redefine the concept of genomic signature. Finally, we introduce BarcodeBERT, a transformer-based encoder optimized for DNA barcodes. Since barcodes are short DNA fragments that contain enough information for the taxonomic identification of an organism, our model learns this taxonomy information and generates expressive embeddings that enable efficient classification of barcodes of novel specimens. We evaluate the quality of these embeddings through several downstream tasks, such as supervised fine-tuning and linear probing for species classification of known species and nearest neighbours probing for genus classification of unknown species. Additionally, the learned embeddings proved effective in a zero-shot classification framework for images of insects, underscoring the model's utility in integrating genomic and visual data for species identification. Our work attempts to connect the worlds of biodiversity and taxonomic identification with the world of deep unsupervised learning. Our findings reveal deep learning's untapped potential to capture taxonomic information, even without supervision. The methodologies presented in this dissertation can also be used to learn expressive DNA embeddings and test evolutionary hypotheses. [en]
dc.language.iso: en [en]
dc.publisher: University of Waterloo [en]
dc.subject: bioinformatics [en]
dc.subject: clustering [en]
dc.subject: contrastive learning [en]
dc.subject: genomic signatures [en]
dc.subject: DNA barcoding [en]
dc.title: Deep Unsupervised Learning for Biodiversity Analyses: Representation learning and clustering of bacterial, mitochondrial, and barcode DNA sequences [en]
dc.type: Doctoral Thesis [en]
dc.pending: false
uws-etd.degree.department: David R. Cheriton School of Computer Science [en]
uws-etd.degree.discipline: Computer Science [en]
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.degree: Doctor of Philosophy [en]
uws-etd.embargo.terms: 0 [en]
uws.contributor.advisor: Kari, Lila
uws.contributor.affiliation1: Faculty of Mathematics [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.typeOfResource: Text [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]


UWSpace

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.
