Finland’s TurkuNLP group has achieved the highest aggregate ranking in the global natural language parsing Shared Task 2018, outperforming 25 other teams from top universities and industry research groups. Filip Ginter, the head of TurkuNLP, is an AI scientist in Silo.AI’s NLP team.
Read Filip Ginter’s description of the Universal Dependencies Shared Task 2018:
Automatic syntactic parsing is one of the major tasks (and challenges) of natural language processing. The objective is to split running text into words and sentences and, for every sentence, build its syntactic tree with a full morphological analysis of every word. The tree might look like this:
And, for the word “finds”, you would expect to be told that the base form is “find” and the morphological features are Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin. Text pre-processed using such a parser is a great source of features for downstream applications, as well as for search-and-extract tasks and linguistic research. That is why we care about having good parsers for as many languages as possible.
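In Universal Dependencies treebanks, such analyses are stored in the CoNLL-U format, with one word per line and the morphological features packed into a single pipe-separated string. As a minimal illustration (the parse_feats helper below is a sketch for this post, not part of any official tool), the feature string for “finds” can be unpacked like this:

```python
def parse_feats(feats):
    """Turn a pipe-separated Key=Value feature string, as used in
    CoNLL-U, into a plain dictionary."""
    if feats == "_":  # in CoNLL-U, "_" marks an empty feature column
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))

feats = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
# "finds" is a finite, third-person singular, present indicative verb form:
assert feats["Person"] == "3" and feats["Tense"] == "Pres"
```

Downstream applications can then filter or aggregate on individual features (say, all present-tense verbs) without re-parsing the raw string.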
Nowadays, these parsers are, for the most part, machine-learned systems, which need data to train on. Universal Dependencies (www.universaldependencies.org) is a large initiative to gather such training data (called treebanks), and currently covers 71 languages, ranging from just a few dozen sentences for some to well over a million words for others. Universal Dependencies is one of the most visible data initiatives in natural language processing, and, happily, Turku has been at its core since day one.
To drive the development and testing of syntactic parsers, and to counter the “developed on English, tested on English, works on nothing else” problem, the Universal Dependencies initiative organized a Shared Task in 2017 and 2018. A shared task is essentially a competition where everybody receives exactly the same data to train their parsers on and competes to achieve the best accuracy across the slightly more than 50 languages that have a sufficient amount of data to test (though not necessarily train) a parser. Participating in the shared task is a great test of one’s parsing methodology as well as one’s technical skills, as training dozens of parsers on a tight schedule is no small feat. These shared tasks tend to attract the well-known groups in the field, and their results constitute the state of the art in parsing.
In 2018, the TurkuNLP team did well, ranking 1st, 2nd and 2nd on the three primary metrics of the task, among 26 teams that included the universities of Stanford, Prague and Uppsala – the traditional parsing research heavyweights. The technical work was led by Jenna Kanerva, with contributions from Filip Ginter, Akseli Leino and Niko Miekka.
The parser was built from a combination of existing tools, in part used as is and in part retrained in new ways. In particular, we relied on the Stanford parser (https://github.com/tdozat/Parser-v2) and the OpenNMT neural machine translation system (http://opennmt.net/), which we creatively repurposed to lemmatise words. A special challenge were languages such as Breton and Thai, for which zero training data was available and knowledge from other languages had to be transferred.
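The lemmatisation-as-translation idea can be sketched as follows: the word is spelled out character by character, its morphological tags are appended, and a sequence-to-sequence model is trained to “translate” this into the characters of the lemma. The formatting below is an illustrative sketch under that assumption, not the exact input format of the TurkuNLP system:

```python
def to_seq2seq_source(word, feats):
    """Encode a word plus its morphological features as a character-level
    'source sentence' for a neural-machine-translation lemmatiser.
    (Illustrative encoding; the actual TurkuNLP format may differ.)"""
    chars = list(word)  # "finds" -> ['f', 'i', 'n', 'd', 's']
    tags = feats.split("|") if feats != "_" else []
    # Tags become extra source-side tokens, so the model can condition
    # on morphology when producing the lemma's characters.
    return " ".join(chars + tags)

src = to_seq2seq_source("finds", "Number=Sing|Person=3")
# src == "f i n d s Number=Sing Person=3"
# An OpenNMT-style model would then be trained to emit "f i n d",
# the characters of the lemma "find".
```

Framing lemmatisation this way lets an off-the-shelf translation toolkit handle it with no parser-specific code, which is exactly why an NMT system could be bent to the task.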
An important practical outcome of the shared task is the Turku Neural Parser Pipeline (https://turkunlp.github.io/Turku-neural-parser-pipeline/), which distributes the parser with its trained models for over 50 languages under an open license. The paper describing the pipeline will be published together with the other shared task systems at EMNLP’18 in Brussels in November.