Luiza Sayfullina is a senior machine learning expert with over seven years of experience in machine learning projects. She holds a PhD in neural networks and natural language processing (NLP) from Aalto University (2019) and has a deep understanding of NLP for the English and Finnish languages. In her work, Luiza helps companies find and implement AI solutions ranging from text classification, clustering, information extraction, content generation and summarization to speech-to-text applications.
AI solutions that deal with unstructured textual data
Luiza works as an NLP AI Scientist at Silo.AI, the largest AI lab in the Nordics, where she is in charge of developing solutions that deal with textual data. She uses both general machine learning approaches and language-specific ones: text preprocessing, parsing to understand grammatical structure, named entity recognition, neural networks, decision trees and other text processing algorithms used in the AI solutions we build for our clients.
Luiza’s current work isn’t so far from her prior career in academia: “Sometimes I joke that little changed in my move from academia to Silo.AI. In fact, I use the same programming language (Python) and deep learning libraries (PyTorch or TensorFlow), read scientific papers and immerse myself in deep thinking about how to approach the problem at hand,” Luiza describes.
However, client work brings her a needed balance, where she gets to apply her knowledge to the real world: “Theory and practice are more in balance now. I both read papers and work on concrete cases. I enjoy writing more code than I used to, and aim to improve my coding standards.”
Most knowledge we have at the moment is expressed in natural language, either in text or in audio
In NLP projects, the aim is to make unstructured textual data useful by putting it into a structured format. Once structured, specific types of events or information can be extracted from it. Unstructured data can include media, news articles, user reviews, customer feedback, reports, invoices, PDFs and other documentation. Typically, NLP projects focus on finding quantitative information in vast amounts of unstructured data.
“In one client case, we’ve been evaluating the sustainability of a company based on its sustainability reports. It is crucial to find the key numbers that indicate which activities led to a decrease in emissions, and by how much the emissions were reduced. There’s a lot of this kind of data available, but reading the texts manually and putting the information together takes time,” Luiza says.
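As a minimal sketch of this kind of quantity extraction — built on a single regular expression and invented example text, not Silo.AI’s actual pipeline, which combines parsing, named entity recognition and domain rules — one could pull emission-reduction percentages out of report sentences like this:

```python
import re

def extract_emission_figures(text):
    """Find percentage figures mentioned after emission-related verbs.

    A toy illustration only: a verb like 'reduced', up to 60 characters
    of context, then a number followed by '%' or 'percent'.
    """
    pattern = re.compile(
        r"(reduced|decreased|cut).{0,60}?(\d+(?:\.\d+)?)\s*(?:%|percent)",
        re.IGNORECASE,
    )
    return [(verb.lower(), float(number)) for verb, number in pattern.findall(text)]

sentence = ("By switching to renewable energy, the company reduced "
            "its CO2 emissions by 12.5% in 2019.")
print(extract_emission_figures(sentence))  # [('reduced', 12.5)]
```

A rule like this would of course miss many phrasings, which is exactly why real projects layer machine-learned models on top of such heuristics.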
For Luiza, NLP provides a new way of creating value: these days the value of many products comes from smart processing of unstructured data and producing the most relevant information that is worth reading.
“I believe that NLP can bring much value by making information more accessible to various groups of people, regardless of the language they speak or the difficulty of text they can handle. Machine translation has achieved tremendous success in making this possible,” Luiza explains.
If you don’t have labeled data, additional rules can help
For the majority of AI tasks, the data needs to be annotated so that the machine can “learn” what’s in the data. Annotating is the process of adding labels to the data: in other words, explaining what a piece of data contains. It isn’t that common to have annotated data ready, which is why we at Silo.AI have created annotation tools to speed up the process.
In one example, we were categorizing risks in financial reports into four categories. This kind of classification task requires labeled data, so we needed to provide annotation tools before we could build the categorization tool. Usually the client is the best annotator, as the process requires domain knowledge, but in some cases annotation can be outsourced.
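To illustrate why labels matter — with invented toy sentences and made-up categories, not the client’s actual four risk classes or method — even a minimal word-overlap classifier needs labeled examples to score against:

```python
from collections import Counter

# Invented toy training data: sentences labeled with a risk category.
# A real project would use far more data and a trained model.
train = [
    ("currency fluctuations may impact revenue", "market"),
    ("exchange rates pose a significant risk", "market"),
    ("new regulations could increase compliance costs", "regulatory"),
    ("changes in legislation affect our operations", "regulatory"),
]

def features(sentence):
    """Bag-of-words representation: word counts, lowercased."""
    return Counter(sentence.lower().split())

def predict(sentence):
    """Score each label by word overlap with its labeled training sentences."""
    words = features(sentence)
    scores = Counter()
    for text, label in train:
        scores[label] += sum((features(text) & words).values())
    return scores.most_common(1)[0][0]

print(predict("risk from exchange rate fluctuations"))  # market
```

Without the labels in `train`, there would be nothing to learn from — which is why annotation comes before model building.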
In Luiza’s opinion, you shouldn’t be afraid to support the algorithm with additional rules or logic: “It’s good to check whether an existing solution can adapt to your data. Sometimes, though, these trained solutions might not have encountered certain domain-specific samples, and there might not be enough labeled data to fine-tune the solution. In such cases it doesn’t hurt to improve the accuracy by incorporating additional logic.”
Responsibility and freedom in projects
Luiza enjoys her responsibility in each project. So far, every project has been different. During her two years at Silo.AI, Luiza has worked on creative content generation, clause recommendation in contracts, unemployment prediction and sentiment analysis of financial reports. She has also worked on a Finnish-language speech-to-text project.
At Silo.AI, we have project owners but no project managers. In her work, Luiza is responsible for teaching clients and presenting material in an easy-to-understand way, and she is constantly learning about client interaction and communication. AI Scientists also get a lot of autonomy: they are free to choose their ways of working and the approaches they try.
“I love the creativity in problem solving, including the freedom to explore different ways of solving the problem. I’m used to iterating on different approaches,” Luiza says.
“In one case, I work with environmental data and climate change, assessing how companies act to reduce emissions. I need to get familiar with the topic and its terminology, and then try to ask the client good questions. This domain knowledge helps me see possible opportunities and assess the ongoing project critically,” Luiza says.
The meaning behind the lines
In NLP, the scientific community took a large step forward in understanding the semantic relations between words with the invention of vector representations for words.
“In NLP, we try to teach the model the meanings of phrases. The machine learning model is like a foreigner who sometimes fails to understand the full meaning behind a sentence. This often happens with complicated logic, jokes or sarcasm. When we can approximate meaning, we are able to solve a variety of problems, including answering simple questions, finding relevant similar content and parsing the grammatical structure of text. Inferring the exact meaning behind the lines is not something we can yet delegate to algorithms, which lack a reasoning component,” Luiza explains.
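The idea behind vector representations can be sketched with toy numbers (the three-dimensional vectors below are invented for illustration; real embeddings such as word2vec or GloVe are learned from large corpora and have hundreds of dimensions):

```python
import math

# Invented 3-dimensional "embeddings" for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words end up closer together in the vector space.
print(cosine(vectors["king"], vectors["queen"]) >
      cosine(vectors["king"], vectors["apple"]))  # True
```

This geometric notion of closeness is what lets models approximate meaning well enough to find similar content or answer simple questions.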
Luiza appreciates language as data, as it’s intuitively understandable by humans:
“When it comes to data analysis, working with language is valuable because the text we analyze can be intuitively understood by humans too. Compared to vector data, where you have only a sequence of numbers, you don’t always get that intuitive understanding.”
“Studying languages has also been a priority and a fun pastime throughout my life. The beauty of working with text is its expressive power, which makes this field challenging,” she concludes.
Music and sports give energy
Luiza most enjoys reading books and doing sports, like cycling or badminton. From time to time, she also enjoys playing guitar and singing. Recently, she’s been combining her research on artificial intelligence with studying psychology and gaining a deeper understanding of the human mind at the University of Helsinki. She enjoys learning from this different but still academic perspective.
Favorite Silo.AI value?
“While my favorite value is Keep Learning, this year I would like to combine two of our values, Keep Learning and Build Bonds, by exchanging more knowledge with my colleagues from both the tech and operational teams. Understanding the tech side as well as the business side is vital when growing as an AI Scientist.”
Resources & Libraries from Luiza
- When coming up with alternative ways to solve a specific type of task, you can check the ranking of state-of-the-art algorithms by performance at https://nlpprogress.com/. Sometimes performance is crucial and 1% of accuracy matters; in those cases, state-of-the-art algorithms should be considered.
- My favorite NLP book, still a work in progress, is Speech and Language Processing by Dan Jurafsky and James Martin. Since the book is in progress, it is possible to contribute by sending comments and suggestions on the existing chapters, which I am planning to do.
- I also recommend the neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization developed by the Turku NLP group (https://turkunlp.org/Turku-neural-parser-pipeline/docker.html), which supports more than 50 languages!
- I use the spaCy library on a daily basis for text preprocessing and quick prototyping. For running pre-trained models such as BERT and GPT-2, I use Hugging Face libraries with PyTorch.
- I also recommend the DeepPavlov library, which contains a variety of pre-trained state-of-the-art models for English and Russian.
Would you like to join Silo.AI as Luiza’s colleague?
We are especially looking for NLP experts for our Helsinki and Turku offices to solve real-life cases using the latest NLP techniques.