AfriBERTa: Towards Viable Multilingual Language Models for Low-resource Languages

Ogueji, Kelechi

dc.contributor.author	Ogueji, Kelechi
dc.date.accessioned	2022-08-29 17:13:45 (GMT)
dc.date.available	2022-08-29 17:13:45 (GMT)
dc.date.issued	2022-08-29
dc.date.submitted	2022-08-19
dc.identifier.uri	http://hdl.handle.net/10012/18662
dc.description.abstract	There are over 7,000 languages spoken on Earth, but many of these languages suffer from a dearth of natural language processing (NLP) tools. Multilingual pretrained language models have been introduced to help alleviate this problem. However, the largest pretrained multilingual models were trained on only hundreds of languages, a small number compared to the number of spoken languages. While these models have displayed impressive performance on several languages, including those they were not pretrained on, much ground remains to be covered. Many languages are left out because pretrained language models are assumed to require a lot of training data, which these languages do not have. Furthermore, a major motivation behind these models is that such lower-resource languages benefit from joint training with higher-resource languages. In this thesis, we challenge both of these assumptions and present the first attempt at training a multilingual language model on only low-resource languages. We show that it is possible to train competitive multilingual language models on less than one gigabyte of text data containing a selection of African languages. Our model, named AfriBERTa, covers 11 African languages and is the first language model for 4 of them. We evaluate this model on named entity recognition and text classification spanning 10 languages. Our evaluation results show that our model is very competitive with larger multilingual models (multilingual BERT and XLM-RoBERTa) on several languages. Results suggest that our “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages. Furthermore, we present a comprehensive discussion of the implications of our findings.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.relation.uri	https://huggingface.co/datasets/castorini/afriberta-corpus	en
dc.relation.uri	https://github.com/castorini/afriberta	en
dc.relation.uri	https://huggingface.co/castorini/afriberta_large	en
dc.subject	natural language processing	en
dc.subject	multilingual	en
dc.subject	language model	en
dc.subject	named entity recognition	en
dc.subject	pre-trained language model	en
dc.subject	text classification	en
dc.title	AfriBERTa: Towards Viable Multilingual Language Models for Low-resource Languages	en
dc.type	Master Thesis	en
dc.pending	false
uws-etd.degree.department	David R. Cheriton School of Computer Science	en
uws-etd.degree.discipline	Computer Science	en
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.degree	Master of Mathematics	en
uws-etd.embargo.terms	0	en
uws.contributor.advisor	Lin, Jimmy
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en

Files in this item

Name: Ogueji_Kelechi.pdf
Size: 1.318Mb
Format: PDF
Description: Main article

Related items

Syntactic Complexities of Six Classes of Star-Free Languages

Online Digital Game-Based Language Learning Environments: Opportunities for Second Language Development

The High German of Russian Mennonites in Ontario
