Last week, large language models (LLMs) landed firmly on the agenda for Europe's digital sovereignty, with reports emerging of a new initiative to develop a family of genuinely open-source LLMs covering every language spoken in the European Union.
That means the EU's 24 official languages, plus those of countries negotiating to join the bloc, such as Albania, as a way of future-proofing the effort.
OpenEuroLLM is a collaborative effort involving around 20 organizations, co-led by Jan Hajič, a computational linguist from Charles University in Prague, and Peter Sarlin, the CEO and co-founder of Silo AI, a Finnish AI lab that was acquired by AMD for $665 million last year.
This project fits into a larger narrative wherein Europe emphasizes digital sovereignty, working to make critical infrastructure and tools more accessible locally. Major cloud providers are investing in local setups to ensure that data remains within the EU, while the AI leader OpenAI recently launched a new solution allowing clients to manage data processing and storage within Europe.
Moreover, the EU has signed an $11 billion contract to establish a sovereign satellite network to rival Elon Musk's Starlink.
OpenEuroLLM, in other words, is a project very much of its time.
However, the declared budget for developing these models is €37.4 million, with approximately €20 million sourced from the EU's Digital Europe Programme — a small fraction of what the big hitters of corporate AI are investing. The total rises appreciably once additional funding earmarked for related work is counted, with compute the largest expense. Participants in the OpenEuroLLM project include EuroHPC supercomputer facilities in Spain, Italy, Finland, and the Netherlands, and the overall EuroHPC initiative has a budget of around €7 billion.
The diverse cast of organizations involved, spanning academia, research, and industry, has prompted some observers to question whether the project can deliver on its ambitious objectives. Anastasia Stasenko, co-founder of LLM company Pleias, has questioned whether a "massive consortium of over 20 organizations" can maintain the focused direction of a nimble private AI firm.
“Europe’s recent AI successes shine through smaller, focused teams like Mistral AI and LightOn — companies that take full ownership of what they build,” Stasenko said. “They are directly accountable for their financial decisions, market strategies, and reputations.”
Ready for the Challenge
The OpenEuroLLM initiative is either starting from the ground up or has a head start—depending on one’s perspective.
Since 2022, Hajič has led the High Performance Language Technologies (HPLT) project, which aims to produce free, reusable datasets, models, and workflows using high-performance computing (HPC). Scheduled for completion in late 2025, HPLT serves as something of a precursor to OpenEuroLLM: most of its participants (excluding those from the U.K.) are involved in the new effort.
“This [OpenEuroLLM] is essentially an expanded collaboration that is more concentrated on generative LLMs,” Hajič explained. “Thus, we are not starting from ground zero in relation to data, expertise, tools, and computational experience. We have gathered individuals who possess the necessary knowledge—we should be able to ramp up quickly.”
Hajič anticipates that the first iteration(s) will be available by mid-2026, with final versions expected by the project’s conclusion in 2028. However, achieving these benchmarks may seem ambitious considering that, thus far, there isn’t much to show aside from a basic GitHub profile.
“In that sense, we are starting anew—the project officially commenced on Saturday [February 1],” Hajič remarked. “But we have been laying the groundwork for a year [the tender process began in February 2024].”
On the academic and research side, OpenEuroLLM's participants come from Czechia, the Netherlands, Germany, Sweden, Finland, and Norway, working in collaboration with EuroHPC centers. Corporate partners include Finland's AMD-owned Silo AI, Aleph Alpha (Germany), Ellamind (Germany), Prompsit Language Engineering (Spain), and LightOn (France).
One notable absence from the partnership roster, however, is French AI startup Mistral, which has positioned itself as an open-source alternative to established competitors such as OpenAI. Mistral did not comment to TechCrunch, but Hajič confirmed that he tried to engage the company, though the contact never turned into a concrete discussion about its involvement.
“I reached out, but it didn’t lead to productive talks about their participation,” Hajič stated.
The project may attract new participants through the EU's open funding calls, but these are restricted to EU-based organizations, ruling out partners from the U.K. and Switzerland. That contrasts with the Horizon R&D program, which previously funded HPLT and which the U.K. rejoined in 2023 after protracted Brexit negotiations.
Progression
The project's primary goal, captured in its tagline, is to create "a suite of foundational models for transparent AI in Europe." These models are also intended to preserve the "linguistic and cultural diversity" of all current and future EU languages.
The exact deliverables are still being finalized, but they will likely involve a core multilingual model tailored for general applications where accuracy is critical, along with smaller “quantized” versions that may be designed for edge use cases where speed and efficiency are prioritized.
“We still need to devise a comprehensive plan for this,” Hajič noted. “Our aim is to create models that are both compact and of the utmost quality. We want to avoid releasing anything that is underdeveloped, especially considering that this initiative involves significant public funding from the European Commission.”
While the objective is maximal capability across all languages, achieving equal proficiency in every one of them is a challenge in itself.
“That is our target, but whether we get there, especially for languages with limited digital resources, remains to be seen,” Hajič explained. “This is precisely why we want accurate benchmarks for these languages, rather than relying on benchmarks that may not faithfully represent the languages and their associated cultures.”
On the data front, much of the foundational work done in the HPLT project will carry over: version 2.0 of its dataset, released four months ago, comprised 4.5 petabytes of web-crawled data and more than 20 billion documents. Hajič said additional material from Common Crawl (a public repository of web-crawled data) will also be incorporated.
Defining Open Source
In traditional software, the long-running debate between open source and proprietary often centers on what "open source" truly means. That question can generally be settled by referring to the official definition maintained by the Open Source Initiative (OSI), the steward of open source licensing.
Recently, the OSI developed a definition for "open source AI," although not everyone agrees with the outcome. Open source AI purists contend that everything should be freely available: not just the model weights, but also the training datasets and code — the full package. The OSI's definition, however, does not mandate the release of training data, since AI models are frequently developed using proprietary or restricted-access data.
Consequently, the OpenEuroLLM initiative grapples with similar dilemmas, and despite its ambition of becoming “truly open,” it may need to make certain compromises in order to meet its “quality” standards.
“The aim is to have everything accessible,” Hajič said. “Naturally, we face some restrictions. We aspire to deliver models of the highest possible quality, and under European copyright law we can use whatever resources we can acquire. While some data cannot be redistributed, it can be retained for future inspection.”
This implies that the OpenEuroLLM project may have to keep portions of the training data confidential but ensure it is accessible to auditors upon request—as mandated for high-risk AI systems under the EU AI Act.
“We hope that much of the data, especially that derived from Common Crawl, will be open,” Hajič remarked. “Our aspiration is to have it entirely open, but we will see. At any rate, we must adhere to AI regulations.”
One Too Many
Following the public presentation of OpenEuroLLM, a common critique has emerged: a very similar initiative, EuroLLM, was unveiled in Europe just a few months earlier. EuroLLM released its initial model in September and another in December, and is co-funded by the EU alongside a consortium of nine partners, including academic institutions like the University of Edinburgh and companies like Unbabel, which secured millions of GPU training hours on EU supercomputers last year.
EuroLLM shares parallel objectives with its near-namesake: “To construct an open-source European Large Language Model supporting 24 official European languages, along with several other strategically significant languages.”
Andre Martins, head of research at Unbabel, took to social media to highlight these parallels, suggesting that OpenEuroLLM is adopting a name that is already in use. “I hope the different communities can collaborate openly, share their knowledge, and refrain from reinventing the wheel with every new project funded,” wrote Martins.
Hajič characterized the scenario as “unfortunate,” expressing hope for potential collaboration, though he emphasized that due to the nature of its funding from the EU, OpenEuroLLM is restricted in its interactions with non-EU entities, such as U.K. universities.
The Finance Challenge
The emergence of China’s DeepSeek and its cost-performance ratio has sparked optimism that AI projects could achieve much more with significantly less funding than initially presumed. Nonetheless, recent weeks have witnessed skepticism surrounding the actual expenses connected to the development of DeepSeek.
“Regarding DeepSeek, we are largely unaware of the specifics involved in its build,” remarked Peter Sarlin, who serves as the technical co-lead for the OpenEuroLLM initiative.
Nevertheless, Sarlin believes OpenEuroLLM will have sufficient funding, since the budget mainly needs to cover personnel: compute, which accounts for a significant share of the cost of building AI systems, is largely provided through the partnerships with EuroHPC centers.
“OpenEuroLLM actually possesses a considerable budget,” Sarlin stated. “EuroHPC has invested billions into AI and computational infrastructure, with further commitments to expand significantly in the coming years.”
It’s crucial to highlight that the OpenEuroLLM project does not aim to develop a consumer or enterprise-grade product. The focus is purely on the models themselves, which is why Sarlin believes the funding is adequate.
“The objective here is not to create a chatbot or an AI assistant—that would require a product-focused strategy needing considerable effort, which is where ChatGPT excelled,” Sarlin noted. “We are delivering an open-source foundational model that serves as AI infrastructure for European companies to build upon. We understand what is essential to build these models, and it’s not something that requires billions.”
Sarlin has been at the helm of Silo AI since 2017. The company, alongside partners including the HPLT project, has developed a range of open models under the Poro and Viking names, which already cover several European languages, and it is now preparing the next generation of "Europa" models, intended to span all European languages.
This aligns with Hajič’s notion of “not starting from scratch,” as there is a solid foundation of knowledge and technology ready to leverage.
The Quest for Sovereignty
Critics have pointed out that OpenEuroLLM has a great many moving parts — a complexity Hajič acknowledges, while remaining hopeful.
“I have participated in numerous collaborative projects, and I believe this brings certain advantages in contrast to a single corporate entity,” he remarked. “Certainly, companies like OpenAI and Mistral have accomplished fantastic things, but I’m optimistic that the combined expertise of academia and corporate focus can create something innovative.”
Ultimately, this is not merely about outperforming Big Tech or multi-billion-dollar AI startups; the real goal is digital sovereignty: predominantly open foundational LLMs built by and for Europe.
“While I hope this won’t be the case, if we ultimately are not the leading model, but do have a competent model, it will still signify a positive outcome, as it will feature all core components based in Europe,” Hajič concluded.