19 Aug 2025 | The House
By filtering out potentially harmful knowledge during training, researchers from the University of Oxford, EleutherAI and the UK AI Security Institute have built models that resist subsequent malicious updates – a property especially valuable in sensitive domains such as biothreat research. The findings have been published as a preprint on arXiv.
Senior author Professor Yarin Gal, Associate Professor of Machine Learning at the University of Oxford and Tutor in Computer Science at Christ Church, said: ‘The research community has made great progress with AI safeguards over the past few years, but a remaining massive challenge is safeguarding open weight models – how do we build models that we can distribute to all without raising risks of misuse. Our study makes a significant stride in this direction.’
This work represents a shift in approach to AI safety: rather than retrofitting safeguards onto a finished model, safety is embedded from the start. The method reduces risk without sacrificing openness, enabling transparency and research without compromising security.
Open-weight models are a cornerstone of transparent, collaborative AI research. Their availability promotes red teaming, mitigates market concentration, and accelerates scientific progress. With the recent releases of prominent models like Kimi-K2, GLM-4.5 and gpt-oss, open-weight models are steadily growing in capability and influence, reportedly lagging the best closed models by just 6–12 months, according to researchers at Stanford and Epoch AI.
However, openness introduces risk. Just as open models can be refined for positive applications, they can also be modified for harm. Modified text models lacking safeguards are already widespread, while open image generators have become tools for producing illegal content. Because these models can be downloaded, altered and redistributed by anyone, developing robust protections against tampering is critical.
Instead of training a general-purpose model and then adding filters, this work builds safeguards throughout the entire training process by filtering unwanted knowledge from the training data. The team focused on a biothreat setting and filtered biology-related content from the model’s training data, aiming to deny the model this knowledge entirely rather than suppressing it post hoc, since post-hoc suppression can often be reversed easily.
The filtered model was able to resist training on up to 25,000 papers on biothreat-related topics (such as virology, bioweapons, reverse genetics, and viral vectors), proving over ten times more effective than prior state-of-the-art methods. Unlike traditional fine-tuning or access-limiting strategies, which can often be bypassed, filtering pretraining data proved resilient even under sustained adversarial attack – surviving 10,000 steps and over 300 million tokens of targeted fine-tuning.
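To make the notion of a targeted fine-tuning attack concrete, the sketch below shows how such a tamper-resistance test might be run in Python with PyTorch and Hugging Face Transformers. This is a minimal sketch under assumed settings: the model name, attack corpus, batch size and learning rate are illustrative and are not the authors' actual evaluation setup.

```python
# Hypothetical tamper-resistance probe: fine-tune a released open-weight model
# on held-out high-risk text and check whether harmful capability returns.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

def adversarial_finetune(model_name, attack_texts, steps=10_000, lr=2e-5):
    # Assumes the tokenizer defines a pad token; pad positions are not masked
    # out of the loss here, a simplification acceptable for a sketch.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(attack_texts, batch_size=8, shuffle=True)

    model.train()
    done = 0
    while done < steps:
        for batch in loader:  # batch is a list of raw strings
            enc = tok(list(batch), return_tensors="pt", padding=True,
                      truncation=True, max_length=512)
            loss = model(**enc, labels=enc["input_ids"]).loss  # causal-LM loss
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()
            done += 1
            if done >= steps:
                break
    return model  # afterwards, re-run biothreat and general benchmarks
```

In the study’s framing, a model resists tampering if its biothreat-related performance stays low even after such an attack, while its general capabilities remain intact.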
The team used a multi-stage filtering pipeline combining keyword blocklists and a machine-learning classifier trained to detect high-risk content. This allowed them to remove only the relevant materials – around 8–9% of the dataset – while preserving the breadth and depth of general information.
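The paper details the exact pipeline; purely as an illustration, a two-stage filter of this kind might look like the Python sketch below. The blocklist terms, the classifier configuration and the decision threshold are placeholder assumptions, not the authors' implementation.

```python
# Illustrative two-stage pretraining-data filter: a cheap keyword blocklist
# followed by a trained classifier for documents the blocklist misses.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stage 1: keyword blocklist (placeholder terms, not the study's list).
BLOCKLIST = re.compile(r"\b(reverse genetics|viral vector|virulence factor)\b",
                       re.IGNORECASE)

def train_risk_classifier(texts, labels):
    """Stage 2: classifier flagging high-risk documents (label 1 = high risk),
    trained on a labelled sample of risky vs. benign text."""
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf

def filter_corpus(documents, clf, threshold=0.5):
    """Drop documents caught by either stage; keep everything else."""
    kept, removed = [], []
    for doc in documents:
        if BLOCKLIST.search(doc) or clf.predict_proba([doc])[0][1] >= threshold:
            removed.append(doc)
        else:
            kept.append(doc)
    return kept, removed
```

Run over a pretraining corpus, a filter like this would be tuned so that only the targeted material is dropped – around 8–9% of the dataset in the study – leaving general scientific and everyday text untouched.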
They then trained AI models from scratch using this filtered data, benchmarking them against both unfiltered models and models using state-of-the-art safety fine-tuning methods. Across evaluations, the filtered models performed just as well as the baselines on standard tasks such as common-sense reasoning and scientific Q&A.
The findings come at a critical moment for global AI governance. Several recent AI safety reports from OpenAI, Anthropic and DeepMind have warned that frontier models may soon be able to assist with the creation of biological or chemical threats. Many governments have expressed concern about the lack of safeguards for openly available models, which cannot be recalled once released.
Study co-author Stephen Casper (UK AI Security Institute) said: ‘By removing the unwanted knowledge from the start, the resulting model had no basis for acquiring dangerous capabilities, even after further training attempts. Our study therefore shows that data filtration can be a powerful tool in helping developers balance safety and innovation in open-source AI.’