Facebook has unveiled machine learning software that can translate directly between languages without relying on English. According to a Facebook blog post, M2M-100 is the first multilingual machine translation (MMT) model that can translate between any pair of 100 languages without relying on English data. Facebook described breaking language barriers through machine translation (MT) as one of the most important ways to bring people together and to provide information on COVID-19. It said that the single multilingual model performs as well as traditional bilingual models and achieves a 10-point BLEU improvement over English-centric multilingual models.
According to the blog, Facebook used novel mining strategies to create translation data and built the first truly 'many-to-many' data set, with 7.5 billion sentences for 100 languages.
According to the post, Facebook used a number of scaling techniques to build a universal model with 15 billion parameters. The model captures information from related languages and reflects a more diverse range of scripts and morphology.
The post explained that one of the biggest obstacles in creating a many-to-many MMT model is gathering large volumes of quality sentence pairs for arbitrary translation directions that do not involve English. Facebook addressed this by combining complementary data mining resources that have been years in the making, including CCAligned, CCMatrix, and LASER.
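To give a sense of scale for "arbitrary translation directions", the short Python sketch below counts the ordered source-to-target pairs among 100 languages and how many of them involve English. It is plain arithmetic, not Facebook's code; the language codes are placeholders, and the function name `translation_directions` is hypothetical. It also says nothing about how many of these directions have mined training data.

```python
from itertools import permutations


def translation_directions(languages, pivot=None):
    """Return ordered (source, target) pairs among the given languages.

    If `pivot` is given, keep only the pairs that include that language
    on either side (the English-centric case when pivot="en").
    """
    pairs = list(permutations(languages, 2))
    if pivot is not None:
        pairs = [(src, tgt) for src, tgt in pairs if pivot in (src, tgt)]
    return pairs


# Placeholder codes: "en" plus 99 made-up identifiers (an assumption for
# illustration, not the actual M2M-100 language list).
languages = ["en"] + [f"lang{i:02d}" for i in range(1, 100)]

all_dirs = translation_directions(languages)
english_dirs = translation_directions(languages, pivot="en")
non_english_dirs = len(all_dirs) - len(english_dirs)

print(f"all directions:        {len(all_dirs)}")       # 100 * 99 = 9900
print(f"involving English:     {len(english_dirs)}")   # 2 * 99 = 198
print(f"not involving English: {non_english_dirs}")    # 9900 - 198 = 9702
```

Under these assumptions, only 198 of the 9,900 possible directions touch English, which is why sentence pairs for the remaining directions are the hard part to collect.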
Facebook also created a new LASER 2.0 and improved fastText language identification, which together improve the quality of mining and come with open-sourced training and evaluation scripts.
According to Facebook, deploying M2M-100 will improve the quality of translations for billions of people, especially those who speak low-resource languages.