Russian-Belarusian machine translation system now easier to build

Photo: gemeration.by

Most modern machine translation systems are trained on parallel corpora. A parallel corpus consists of a large number of texts in two languages; in our case, texts in Belarusian and their translations into Russian. A system can automatically analyze such a corpus and determine which word or phrase in one language corresponds to which word or phrase in the other. Software engineer Anton Bryl has created such a corpus from Euroradio’s bilingual texts:

"Taking something that is already available on the web in both languages and experimenting is not possible for all languages. I noticed that you have a lot of bilingual texts that could be turned into such a corpus. I spent a few weekends making one. It is now possible to create a machine translation system or experiment with automatic translation with its help. In fact, one can make a machine translation system from scratch using this corpus."

This is how it works: imagine a large number of parallel sentences in Russian and Belarusian. If you note which Belarusian word most often appears in the translations of sentences containing a given Russian word, you get a dictionary. If you do the same for phrases and constructions, you get a broader kind of dictionary: an automatically built table of equivalents between the two languages. This is the basic idea of statistical machine translation.
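The word-level idea can be illustrated with a minimal sketch in Python. It is not the method of any particular translation system: the function name `build_dictionary`, the toy sentence pairs, and the choice of the Dice coefficient as the score are all assumptions made for illustration. The Dice coefficient is used instead of raw counts so that very frequent words do not win every comparison.

```python
from collections import Counter
from itertools import product


def build_dictionary(pairs):
    """Map each source word to the target word it co-occurs with most strongly.

    `pairs` is a list of (source_sentence, target_sentence) tuples.
    Hypothetical helper for illustration only.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in pairs:
        # Use sets so each word is counted once per sentence pair.
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))

    best = {}
    for (s, t), c in joint.items():
        # Dice coefficient: 1.0 when the two words always appear together.
        score = 2 * c / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


# Toy Russian-Belarusian pairs, invented for this example.
toy_pairs = [
    ("я читаю книгу", "я чытаю кнігу"),
    ("я пишу письмо", "я пішу ліст"),
    ("он читает книгу", "ён чытае кнігу"),
    ("он пишет письмо", "ён піша ліст"),
]
dictionary = build_dictionary(toy_pairs)
print(dictionary["письмо"])
```

Real systems go well beyond this: they use proper tokenization, iterative alignment models, and phrase tables rather than a single best word per entry.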

For instance, documents of the European Parliament, which are translated into the languages of all member states, are often used to train machine translation systems for many Western European languages. Now there is such a corpus for the Belarusian language.

Euroradio's BE-RU corpus is a sentence-aligned Belarusian-Russian parallel corpus derived from the news published on euroradio.fm in 2016. Overall, it contains about 135,000 sentence pairs. The encoding is UTF-8. Feel free to use the corpus and report mistakes. The Belarusian language is now in a more comfortable position in the huge world of machine translation.
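For readers who want to experiment, here is a minimal sketch of loading a sentence-aligned corpus in Python. The file layout of the download is not described here, so the sketch assumes two UTF-8 text files with one sentence per line, where line N of one file is the translation of line N of the other. The function name `read_parallel` and the file names in the usage comment are hypothetical.

```python
def read_parallel(be_path, ru_path):
    """Read two line-aligned UTF-8 files into a list of (be, ru) sentence pairs.

    Assumes one sentence per line in each file (an assumption about the
    layout, not a documented property of the corpus).
    """
    with open(be_path, encoding="utf-8") as f:
        be_lines = [line.rstrip("\n") for line in f]
    with open(ru_path, encoding="utf-8") as f:
        ru_lines = [line.rstrip("\n") for line in f]

    # A length mismatch means the files are not aligned line by line.
    if len(be_lines) != len(ru_lines):
        raise ValueError(
            f"line counts differ: {len(be_lines)} vs {len(ru_lines)}"
        )

    # Skip pairs where either side is empty after trimming whitespace.
    return [
        (be.strip(), ru.strip())
        for be, ru in zip(be_lines, ru_lines)
        if be.strip() and ru.strip()
    ]


# Usage with hypothetical file names:
# pairs = read_parallel("corpus.be.txt", "corpus.ru.txt")
# print(len(pairs), pairs[0])
```

Checking the line counts up front is a cheap way to catch a broken alignment before any training starts.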

Permission is hereby granted to download, copy, and use this corpus free of charge for developing, evaluating, and exploring Machine Translation systems and algorithms, both for research and for commercial purposes.

DOWNLOAD