
Europe’s open language model family Poro extends checkpoints, languages and modalities


Poro is a family of multilingual open source large language models (LLMs), with the aim of strengthening European digital sovereignty and democratizing access to LLMs. To ensure transparency and openness, and as part of the Poro Research Checkpoint program, we are today announcing new model checkpoints, as well as the next-generation models with additional languages and modalities.

  • Together with the University of Turku and HPLT, SiloGen launched an initiative to build a family of open multilingual LLMs with a world-class team, access to a record amount of compute and data, and a distinctive software layer to train LLMs.
  • Two months later, we are now releasing the next two checkpoint milestones, covering a total of 50% of training for Poro 34B. Model evaluations demonstrate strong performance for low-resource languages, with best-in-class performance for the Finnish language.
  • As a next step, the model family adds support for the Nordic languages, including Swedish, Norwegian, Danish and Icelandic. We are also announcing a partnership with LAION to add vision capabilities and commence the training of multimodal models.

In mid-November, we published the first three checkpoints of Poro 34B, a multilingual, open European language model showing evidence of strong performance on low-resource languages like Finnish, without compromising performance in English. We’re now publishing the next two checkpoints for Poro 34B, bringing the total to 50% of training completed. After five model checkpoints, the results show that Poro 34B already outperforms all existing open language models on the Finnish language, including FinGPT, Mistral, Llama and the BLUUMI 176 billion parameter model, among others. FinGPT is the first large generative Finnish language model (Luukkonen et al., forthcoming, EMNLP).

“I’m proud of the results we have already been able to achieve with the Poro models. Already at this stage, I believe it’s safe to say that Poro 34B is, to date, the best open Finnish language model available. It’s inspiring to see how we have been able to use some of the learnings from FinGPT and the BLUUMI 176 billion parameter model, improve on those, and now have an even better model. We expect to reach 100% of training Poro 34B in the coming weeks,” says Research Fellow Sampo Pyysalo from TurkuNLP.

Added languages and modalities

Following the strong initial results of the Poro model family, we are now excited to announce a set of new models with additional capabilities. We have commenced training a model family covering English, Finnish, Swedish, Norwegian, Danish, Icelandic and code. These models have an updated, more modern architecture and come in a variety of model sizes. This is an important step towards the aim of covering all European languages, and towards our vision of European digital sovereignty, with AI infrastructure that European companies can benefit from.

Language models with vision

While extending support to additional European languages, we are also announcing that the upcoming model generations will add vision to their capabilities. This is enabled through a partnership with LAION (Large-scale Artificial Intelligence Open Network) to build a set of multimodal models. LAION is a global non-profit organization that aims to make large-scale datasets, machine learning models and related code publicly available. They provide assets, such as the LAION-5B dataset and LAION-SAFETY, an open toolbox for NSFW and toxicity detection, for developing safe, trustworthy and reliable multimodal models. Their assets are, among others, behind the image generation tool Stable Diffusion. LAION and their collaborators have already made pivotal contributions to training, studying and open-sourcing multimodal foundation models and corresponding datasets with works like openCLIP, openFlamingo, CLAP and DataComp. This partnership will introduce vision capabilities to the Poro model family through a modular architecture that adds vision to existing models, and it opens up opportunities for additional multimodal architectures in the future.

“In line with the plan to cover all European languages, it’s a natural step to start with an extension to the Nordic languages. And it’s likewise natural to extend Poro with vision. Through a partnership with LAION, multimodal models help in expanding the potential use cases and possibilities for value creation. Models with vision capabilities will be able to interpret, summarize, and describe documents containing both text and images. Like textual data, we see an even larger potential for generative AI to consolidate large amounts of data of different modalities,” notes Peter Sarlin, Silo AI CEO and co-founder.

The collaboration with LAION brings together industry expertise and experience, strong and rigorous academic research, and an open source philosophy. This is a strong foundation for ensuring trustworthy, reliable and robust models. We hope the level of transparency enabled by our open source approach, in combination with the Poro Research Checkpoint program, will add to the trust we have been able to build with partners and clients alike.

Considerations for Use

The intended audience for Poro Research Checkpoints is academic and industry research. These checkpoints are not suitable for deployment in a production use case without further training, fine-tuning and testing.

Acknowledgments

We wish to thank the operators of the LUMI/EuroHPC supercomputer for computational resources and technical support, including AMD, HPE and CSC – the IT Center for Science, Finland. TurkuNLP researchers have received funding from the European Union’s Horizon Europe research and innovation programme High Performance Language Technologies (HPLT) under grant agreement No 101070350.

About

Silo AI

Silo AI is Europe’s largest private AI lab, on a mission to ensure Europe has a flagship AI company. We’re a trusted AI partner that brings competitive advantage to product R&D. We build AI-driven solutions and products to enable smart devices, autonomous vehicles, Industry 4.0, and smart cities. Silo AI provides its customers unique access to world-class AI models and expertise, as well as the Silo OS infrastructure to speed up AI development and deployment. With SiloGen, Silo AI is currently building market-leading open source LLMs, with the intent to ensure European digital sovereignty and democratize access to LLMs.
www.silo.ai

SiloGen

SiloGen is a large-scale initiative with the aim of building generative AI technology for Europe’s digital sovereignty. As Silo AI’s generative AI arm, SiloGen combines some of Europe’s leading generative AI and large language model (LLM) experts with access to data sources, powerful computational resources and infrastructure to train, run and operate LLMs. SiloGen has been operational since late 2022 and is currently working with clients like Allianz, Happeo, Sandvik and Tietoevry. As a trusted provider, SiloGen offers base and specialized models as well as a development platform to ensure accurate, trustworthy and robust downstream applications.

TurkuNLP

The TurkuNLP Group is a group of researchers at the University of Turku, with a research focus on various aspects of natural language processing, language technology and digital linguistics. TurkuNLP has contributed to a large number of open source NLP resources, such as FinBERT, WikiBERT, FinGPT, Turku Dependency Treebank, Universal Dependencies, Turku Neural Parsing Pipeline, large internet corpora, Turku Paraphrase Corpus, Turku Sentiment Corpus, Wikidata normalization and TurkuONE. The University of Turku is an international academic community of 25,000 students and staff and was ranked among the 301–400 best universities in the 2023 Shanghai Ranking.

LAION

LAION is a non-profit organization bringing together a diverse community passionate about advancing the field of machine learning for the greater good. Our mission is to democratize access to large-scale machine learning models, datasets, and code, fostering collaboration and innovation on a worldwide scale. We provide assets, such as the LAION-5B dataset and LAION-SAFETY, an open toolbox for NSFW and toxicity detection, for developing safe, trustworthy and reliable multimodal models. Our assets are, among others, behind the image generation tool Stable Diffusion. Together with our collaborators, we have already made pivotal contributions to training, studying and open-sourcing multimodal foundation models and corresponding datasets with works like openCLIP, openFlamingo, CLAP and DataComp. We invite you to be part of our movement. Join us and explore the possibilities of a future where machine learning is a force for positive change.
www.laion.ai
