19 Aug 2025 | The House
By filtering out potentially harmful knowledge during training, researchers from the University of Oxford, EleutherAI and the UK AI Security Institute have built models that resist subsequent malicious updates – a property especially valuable in sensitive domains such as biothreat research. The findings have been published as a preprint on arXiv.
Senior author Professor Yarin Gal, Associate Professor of Machine Learning at the University of Oxford and Tutor in Computer Science at Christ Church, said: ‘The research community has made great progress with AI safeguards over the past few years, but a remaining massive challenge is safeguarding open weight models – how do we build models that we can distribute to all without raising risks of misuse. Our study makes a significant stride in this direction.’
This work represents a shift in approach to AI safety: rather than retrofitting safeguards onto a finished model, safety is embedded from the start. The method reduces risk without sacrificing openness, enabling transparency and research without compromising security.
Open-weight models are a cornerstone of transparent, collaborative AI research. Their availability promotes red teaming, mitigates market concentration, and accelerates scientific progress. With the recent releases of prominent models like Kimi-K2, GLM-4.5 and gpt-oss, open-weight models are steadily growing in capability and influence, reportedly lagging the best closed models by just 6–12 months, according to researchers at Stanford and Epoch AI.
However, openness introduces risk. Just as open models can be refined for positive applications, they can also be modified for harm. Modified text models lacking safeguards are already widespread, while open image generators have become tools for producing illegal content. Because these models can be downloaded, altered and redistributed by anyone, developing robust protections against tampering is critical.
Instead of training a general-purpose model and then adding filters, this work builds safeguards throughout the entire training process by filtering unwanted knowledge from the training data. The team focused on a biothreat setting and filtered biology-related content from the model’s training data, aiming to deny the model this knowledge entirely rather than suppressing it post hoc, since post-hoc suppression can often be reversed easily.
The filtered model was able to resist training on up to 25,000 papers on biothreat-related topics (such as virology, bioweapons, reverse genetics, and viral vectors), proving over ten times more effective than prior state-of-the-art methods. Unlike traditional fine-tuning or access-limiting strategies, which can often be bypassed, filtering pretraining data proved resilient even under sustained adversarial attack – surviving 10,000 steps and over 300 million tokens of targeted fine-tuning.
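To make the notion of a targeted fine-tuning attack concrete, the sketch below shows how such a tamper-resistance test might be run in Python with PyTorch and Hugging Face Transformers. This is a minimal sketch under assumed settings: the model name, attack corpus, batch size and learning rate are illustrative and are not the authors' actual evaluation setup.

```python
# Hypothetical tamper-resistance probe: fine-tune a released open-weight model
# on held-out high-risk text and check whether harmful capability returns.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

def adversarial_finetune(model_name, attack_texts, steps=10_000, lr=2e-5):
    # Assumes the tokenizer defines a pad token; pad positions are not masked
    # out of the loss here, a simplification acceptable for a sketch.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(attack_texts, batch_size=8, shuffle=True)

    model.train()
    done = 0
    while done < steps:
        for batch in loader:  # batch is a list of raw strings
            enc = tok(list(batch), return_tensors="pt", padding=True,
                      truncation=True, max_length=512)
            loss = model(**enc, labels=enc["input_ids"]).loss  # causal-LM loss
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()
            done += 1
            if done >= steps:
                break
    return model  # afterwards, re-run biothreat and general benchmarks
```

In the study’s framing, a model resists tampering if its biothreat-related performance stays low even after such an attack, while its general capabilities remain intact.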
The team used a multi-stage filtering pipeline combining keyword blocklists and a machine-learning classifier trained to detect high-risk content. This allowed them to remove only the relevant materials – around 8–9% of the dataset – while preserving the breadth and depth of general information.
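The paper details the exact pipeline; purely as an illustration, a two-stage filter of this kind might look like the Python sketch below. The blocklist terms, the classifier configuration and the decision threshold are placeholder assumptions, not the authors' implementation.

```python
# Illustrative two-stage pretraining-data filter: a cheap keyword blocklist
# followed by a trained classifier for documents the blocklist misses.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stage 1: keyword blocklist (placeholder terms, not the study's list).
BLOCKLIST = re.compile(r"\b(reverse genetics|viral vector|virulence factor)\b",
                       re.IGNORECASE)

def train_risk_classifier(texts, labels):
    """Stage 2: classifier flagging high-risk documents (label 1 = high risk),
    trained on a labelled sample of risky vs. benign text."""
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf

def filter_corpus(documents, clf, threshold=0.5):
    """Drop documents caught by either stage; keep everything else."""
    kept, removed = [], []
    for doc in documents:
        if BLOCKLIST.search(doc) or clf.predict_proba([doc])[0][1] >= threshold:
            removed.append(doc)
        else:
            kept.append(doc)
    return kept, removed
```

Run over a pretraining corpus, a filter like this would be tuned so that only the targeted material is dropped – around 8–9% of the dataset in the study – leaving general scientific and everyday text untouched.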
They then trained AI models from scratch using this filtered data, benchmarking them against both unfiltered models and models using state-of-the-art safety fine-tuning methods. Across evaluations, the filtered models performed just as well as the baselines on standard tasks such as common-sense reasoning and scientific Q&A.
The findings come at a critical moment for global AI governance. Several recent AI safety reports from OpenAI, Anthropic and DeepMind have warned that frontier models may soon be able to assist with the creation of biological or chemical threats. Many governments have expressed concern about the lack of safeguards for openly available models, which cannot be recalled once released.
Study co-author Stephen Casper (UK AI Security Institute) said: ‘By removing the unwanted knowledge from the start, the resulting model had no basis for acquiring dangerous capabilities, even after further training attempts. Our study therefore shows that data filtration can be a powerful tool in helping developers balance safety and innovation in open-source AI.’