The LiRo (Limba Română) team, which includes researchers, professors, students, and others from the Technical University of Cluj-Napoca, Politehnica University of Bucharest, DeepMind, and other companies, last month released the first benchmark for Romanian language tasks, together with datasets that can be used to train machine learning (ML) models to perform these tasks. The benchmark will be published at NeurIPS 2021, one of the most important artificial intelligence (AI) conferences.
Creating the benchmarks
The benchmark contains datasets and metrics that make it possible to evaluate and compare language models, identify the best ones, and track how they improve over time. Using these benchmarks, researchers can pinpoint the strengths and weaknesses of existing models and then look for ways to improve them.
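As a rough illustration of how a benchmark compares models, the Python sketch below scores two toy classifiers on the same labeled test set and ranks them by accuracy. The test sentences, the two models, and the choice of accuracy as the metric are all assumptions made for this example; they are not taken from the LiRo benchmark.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled test set of (text, gold label) pairs, invented for illustration.
TEST_SET: List[Tuple[str, str]] = [
    ("Filmul a fost excelent", "positive"),
    ("Mâncarea a fost groaznică", "negative"),
    ("Un serviciu excelent", "positive"),
    ("O experiență groaznică", "negative"),
]


def keyword_model(text: str) -> str:
    """Toy model: predicts 'positive' when the word 'excelent' appears."""
    return "positive" if "excelent" in text.lower() else "negative"


def always_positive_model(text: str) -> str:
    """Toy baseline: predicts 'positive' for every input."""
    return "positive"


def accuracy(model: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    """Fraction of examples where the prediction matches the gold label."""
    correct = sum(1 for text, gold in dataset if model(text) == gold)
    return correct / len(dataset)


def leaderboard(
    models: Dict[str, Callable[[str], str]], dataset: List[Tuple[str, str]]
) -> List[Tuple[str, float]]:
    """Score every model on the same dataset and sort from best to worst."""
    scores = [(name, accuracy(model, dataset)) for name, model in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


MODELS = {"always_positive": always_positive_model, "keyword": keyword_model}
RANKING = leaderboard(MODELS, TEST_SET)
```

Because every model is scored on the same data with the same metric, the resulting numbers are directly comparable, which is what allows progress to be tracked over time.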
Similar benchmarks already exist for other languages, such as GLUE and SuperGLUE for English. They have led to the development of models such as BERT and GPT-3, which produce very good results, including automatically written news articles that are almost indistinguishable from those written by humans (coming soon™ to Transylvania Tech News Network too).
To create the LiRo benchmark, large quantities of text were collected from various sources, such as Wikipedia, Reddit, online forums, and newspapers. These documents then had to be annotated with labels, which vary from task to task. For the sentence similarity task, for example, human reviewers are given two sentences and have to score how similar they are.
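To make the annotation step more concrete, here is a minimal Python sketch of one common way to turn several reviewers' scores into a single reference score per sentence pair, and to measure how well a model's scores agree with those references. The 0 to 5 scale, the averaging step, the Pearson correlation metric, and the example sentences are assumptions for illustration; the article does not say how LiRo aggregates or evaluates its annotations.

```python
from math import sqrt
from statistics import mean
from typing import Dict, List, Sequence, Tuple

SentencePair = Tuple[str, str]

# Hypothetical annotations: three reviewers score each pair on an assumed 0-5 scale.
ANNOTATIONS: Dict[SentencePair, List[int]] = {
    ("Pisica doarme pe canapea.", "O pisică doarme pe canapea."): [5, 4, 5],
    ("Pisica doarme pe canapea.", "Trenul a întârziat o oră."): [0, 1, 0],
    ("Plouă în Cluj.", "La Cluj este vreme ploioasă."): [4, 4, 3],
}


def reference_scores(annotations: Dict[SentencePair, List[int]]) -> Dict[SentencePair, float]:
    """Average the reviewers' scores to get one reference score per pair."""
    return {pair: mean(scores) for pair, scores in annotations.items()}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


REFERENCE = reference_scores(ANNOTATIONS)
```

A model's predicted similarity scores for the same pairs could then be passed to `pearson` together with the reference scores; a value close to 1 would mean the model orders and spaces the pairs much as the human reviewers did.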
While this work is common when building datasets for any language, Romanian adds a further difficulty: every noun has a grammatical gender. Because of this, great care must be taken when collecting the data to ensure that the dataset is balanced and to avoid introducing bias.
The datasets released by the LiRo team are licensed for research use. Using them for commercial purposes requires an additional license.
Pretrained Romanian language models
The LiRo team also offers pretrained models for each of the tasks. These models are similar to the ones used for English, but they are trained on the Romanian datasets. An encouraging result is that the models perform similarly in both languages when trained on similar quantities of data. This suggests that obtaining more Romanian language data will lead to better Romanian language models, and the LiRo team hopes other researchers, specialists, and volunteers will join this effort.
Language models have a wide variety of applications. Semantic similarity models can be used to improve the search results in an e-commerce site. A text classification model can be used by attorneys to quickly identify relevant legislative changes. A customer support system can use these models to prioritize tickets based on their urgency and importance.
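The search use case can be sketched in a few lines of Python: represent the query and each product description as a vector, then sort products by cosine similarity to the query. The word-count vectors below are only a simple stand-in for the embeddings a trained semantic similarity model would produce, and the product list and function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import List


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (a real system would use a trained model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: List[str]) -> List[str]:
    """Return the documents sorted from most to least similar to the query."""
    query_vector = embed(query)
    return sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)


# Hypothetical product descriptions for an online shop.
PRODUCTS = ["cană de cafea", "pantofi eleganți", "pantofi sport pentru alergare"]
RESULTS = rank("pantofi sport", PRODUCTS)
```

Word counts only match identical words, so this sketch would miss synonyms and inflected forms; a model that captures meaning is what would let the same ranking step handle them.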
The authors of the benchmark plan to add more datasets and tasks in the future, including speech-to-text tasks, which would make it easier to create automatic transcription software for the Romanian language. They are looking for partners (such as news agencies, courts, and public institutions) who can provide them with such datasets.
The Transylvania Tech News Network team is curious to see what applications will come out of this research and is looking forward to writing articles about startups using it.
For more details about LiRo, you can contact the team at email@example.com.