Together with the University of Turku and HPLT, Silo AI, the largest private AI lab in Europe, has reached a significant milestone with the successful completion of training the Poro model. This marks an important step for SiloGen, the company's generative AI arm, and its efforts to strengthen European digital sovereignty and democratize access to large language models (LLMs) for all European languages. The model is evidence of the successful application of a novel method to train LLMs for low-resource languages.
Silo AI and TurkuNLP are building a family of multilingual open source LLMs, with the aim of strengthening European digital sovereignty and democratizing access to LLMs. The development of base models aligned with European values is crucial to this effort, ensuring they are built on data and information accurately representing the diverse languages, citizens, organizations and cultural landscape of the European Union. This approach not only aligns with European values, but also allows for sovereignty in how downstream applications and value creation happen.
A proven approach to building performant LLMs for low-resource languages
The completion of training Poro serves as a proof point for an innovative approach to developing AI models for languages with scarce data resources. Poro outperforms all existing open language models on Finnish-language benchmarks, including FinGPT, Mistral, Llama, and the BLUUMI 176 billion parameter model, among others.
This success is attributed to pairing the low-resource Finnish language with high-resource languages. The team has worked on determining optimal data reuse frequencies for low-resource languages during training and incorporated translated paired texts between English and Finnish. This strategy relies on a cross-lingual signal to enhance the model's understanding of the connections between languages, proving crucial in achieving superior performance for low-resource languages, without compromising performance in English.
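The paired-text idea above can be sketched in a few lines. The snippet below is a minimal illustration of how translated English–Finnish sentence pairs might be assembled into training samples so the model sees the cross-lingual signal in both directions; the function name, sample template, and swap probability are hypothetical choices for illustration, not Poro's actual training recipe.

```python
import random

def make_paired_samples(pairs, template="{src}\n{tgt}", p_swap=0.5, seed=0):
    """Turn (English, Finnish) sentence pairs into training texts.

    Each pair is emitted in one direction chosen at random, so the
    model learns the mapping both ways. Template and swap probability
    are illustrative, not the actual Poro data pipeline.
    """
    rng = random.Random(seed)
    samples = []
    for en, fi in pairs:
        if rng.random() < p_swap:
            en, fi = fi, en  # flip direction: Finnish first, English second
        samples.append(template.format(src=en, tgt=fi))
    return samples
```

In a real pipeline, samples like these would be interleaved with monolingual English and Finnish text at carefully tuned reuse frequencies, since low-resource data is repeated more often during training.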
The completion of Poro exemplifies Silo AI's commitment to advancing AI models for low-resource languages. Releasing Poro as an open-source model facilitates widespread access and collaborative improvement, particularly for underrepresented European languages. This approach enriches the AI community, offering a valuable resource for research and development, reflecting a deliberate effort to enhance linguistic diversity in AI applications.
The completion of Poro is the first step in SiloGen’s efforts to train state-of-the-art LLMs for all official EU languages.
Features of Poro 34B
Below is a summary of key features of Poro 34B. For transparency with respect to model architecture, data and other technical information, please refer to the official model card.
- Poro Research Checkpoints: Checkpoints for the model are released throughout the training process, providing external researchers with unprecedented access to investigate the model training process.
- Model architecture: Poro 34B has 34.2 billion parameters and uses a BLOOM architecture with ALiBi embeddings, allowing extrapolation beyond the training context window. While the architecture of this initial model has been kept simple, future models in progress will support additional capabilities, such as flash attention, rotary embeddings and grouped-query attention.
- Multilingual capabilities: Poro is designed to process English and Finnish and has proficiency with a variety of programming languages. Additionally, it can perform translation between English and Finnish.
- Open source: Poro is freely available under the Apache 2.0 License, making it suitable for both commercial and research use.
- Dataset: The model is trained with a dataset of 1 trillion tokens, with English, Finnish and a variety of programming languages represented.
- Training details: Poro is trained using 512 AMD MI250X GPUs on the LUMI supercomputer in Finland.
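To make the ALiBi mechanism mentioned above concrete, here is a minimal sketch of how ALiBi computes its attention biases: instead of positional embeddings, each attention head adds a linear distance penalty to its logits, with a per-head slope drawn from a geometric sequence. This is a generic illustration of the ALiBi technique, not code from Poro's training stack.

```python
def alibi_slopes(n_heads):
    # Per-head slopes form a geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (assumes n_heads is a power of two, the common case)
    ratio = 2 ** (-8.0 / n_heads)
    return [ratio ** (h + 1) for h in range(n_heads)]

def alibi_bias(slope, seq_len):
    # Causal bias matrix: logit (i, j) is penalized by slope * (i - j),
    # so attention to distant tokens decays linearly with distance.
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]
```

Because the penalty depends only on token distance, not absolute position, the same biases apply at any sequence length, which is why ALiBi models can be run on contexts longer than those seen in training.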
More information
A family of European open multilingual LLMs
- Together with the University of Turku and HPLT, SiloGen launched an initiative to build a family of open multilingual LLMs with a world-class team, access to a record amount of compute and data, and a distinctive software layer to train LLMs.
- In November, we published the first three checkpoints of Poro 34B, a multilingual, open European language model demonstrating strong performance on low-resource languages like Finnish, without compromising performance in English.
- Later, we released the next two checkpoint milestones, covering a total of 50% of training for Poro 34B. Model evaluations confirmed strong results for low-resource languages, with best-in-class performance for Finnish.
- Now, the model family is adding support for the Nordic languages, including Swedish, Norwegian, Danish and Icelandic, and a partnership with LAION has been announced, adding vision capabilities and commencing the training of multimodal models.
- Next, the expansion continues with the inclusion of all other official EU languages, broadening the linguistic scope and reinforcing its mission to democratize access to LLMs across the entire European Union.
Considerations for Use
The final model has been trained as a robust base model that can be fine-tuned for specific purposes. The intended audience for the Poro Research Checkpoints is academic and industry research. The checkpoints are not suitable for deployment in a production use case without further training, fine-tuning and testing. For more on Silo AI's SaaS-based custom LLMs, we invite you to familiarize yourself with the SiloGen platform.
Acknowledgments
We wish to thank the operators of the LUMI/EuroHPC supercomputer for computational resources and technical support, including AMD, HPE and CSC – the IT Center for Science, Finland. TurkuNLP researchers have received funding from the European Union’s Horizon Europe research and innovation programme High Performance Language Technologies (HPLT) under grant agreement No 101070350.