Sandbag Detection through Model Degradation

Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahan

🏆 1st place by peer review

We propose a novel technique to detect sandbagging in LLMs by adding varying amounts of noise to model weights and monitoring performance.
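
The core loop behind the method is simple enough to sketch. Below is a minimal, illustrative version in PyTorch, assuming a Hugging Face causal LM; the model name, the noise scales, and the evaluate_accuracy benchmark function are placeholders rather than the authors' exact implementation.

import copy
import torch
from transformers import AutoModelForCausalLM

def add_weight_noise(model, std, seed=0):
    # Return a deep copy of the model with i.i.d. Gaussian noise of
    # scale `std` added to every parameter.
    noisy = copy.deepcopy(model)
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in noisy.parameters():
            noise = torch.randn(p.shape, generator=gen)
            p.add_(noise.to(device=p.device, dtype=p.dtype) * std)
    return noisy

def noise_sweep(model, evaluate_accuracy, stds=(0.0, 1e-4, 5e-4, 1e-3, 5e-3)):
    # Evaluate the model at increasing noise levels. A genuinely weak model
    # should degrade roughly monotonically; a sandbagging model may *improve*
    # at intermediate noise levels, because moderate noise can disrupt the
    # learned underperformance before it destroys the underlying capability.
    return {std: evaluate_accuracy(add_weight_noise(model, std)) for std in stds}

# Usage (placeholders): flag a model whose accuracy spikes above its baseline.
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# scores = noise_sweep(model, my_benchmark_accuracy)
# suspicious = any(acc > scores[0.0] for acc in scores.values())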

Reviewer's Comments

I love how the paper effectively showed that a relatively simple solution can have such a promising effect on evaluations. Well-explained conceptual framework and very exciting suggestions for further study.

Esben Kran

What a great proposal! A very simple solution building on clear conceptual thinking that directly improves the evaluation situation. It seems like tinyAI2_arc does not show the same spike effect, but maybe I'm wrong. It would be good to get a robustness evaluation of the method against other datasets as well. The method is super promising, so extending the project to more models and more datasets would be the obvious next step. These are the types of projects I'm really excited about, so this was great work.

David Matolcsi

I find the idea they use to detect sandbagging very clever, and I see a significant chance that something similar will one day actually be included in the anti-sandbagging measures used for frontier AIs when the risk emerges. The experiments also look well done. I am overall very impressed with this work.

Cite this work

@misc{tice2024sandbag,
  title={Sandbag Detection through Model Degradation},
  author={Cam Tice and Philipp Alexander Kreer and Fedor Ryzhenkov and Nathan Helm-Burger and Prithviraj Singh Shahan},
  date={7/1/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

Recent Projects

Jul 28, 2025

Local Learning Coefficients Predict Developmental Milestones During Group Relative Policy Optimization

In this work, we investigate the emergence of capabilities in reinforcement learning (RL) by framing them as developmental phase transitions. We propose that the individual components of the reward function can serve as direct observables for these transitions, avoiding the need for complex, derived metrics. To test this, we trained a language model on an arithmetic task using Group Relative Policy Optimization (GRPO) and analyzed its learning trajectory with the Local Learning Coefficient (LLC) from Singular Learning Theory. Our findings show a strong qualitative correlation between spikes in the LLC (indicating a phase transition) and significant shifts in the model's behavior, as reflected by changes in specific reward components for correctness and conciseness. This demonstrates a more direct and scalable method for monitoring capability acquisition, offering a valuable proof-of-concept for developmental interpretability and AI safety. To facilitate reproducibility, we make our code available at github.com/ilijalichkovski/apart-physics.
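
Since the LLC is the central diagnostic here, a rough sketch of how it can be estimated may help. The following is a bare-bones SGLD estimator of lambda_hat = n * beta * (E_w[L(w)] - L(w*)) around the trained parameters w*, written for a generic supervised model; maintained implementations exist (e.g., the devinterp library), and the step size, localization strength gamma, and chain length below are illustrative, not the paper's settings.

import copy
import math
import torch

def estimate_llc(model, loss_fn, xs, ys, steps=400, lr=1e-5, gamma=100.0):
    # Full-batch SGLD estimate of the Local Learning Coefficient at the
    # current parameters w*: lambda_hat = n * beta * (E_w[L(w)] - L(w*)),
    # with w sampled from a posterior localized around w* by a quadratic
    # penalty of strength gamma.
    model = copy.deepcopy(model)          # SGLD mutates weights in place
    n = len(xs)
    beta = 1.0 / math.log(n)              # standard inverse temperature
    w_star = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        loss_star = loss_fn(model(xs), ys).item()   # L(w*)

    chain_losses = []
    for _ in range(steps):
        loss = loss_fn(model(xs), ys)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(model.parameters(), w_star):
                # Gradient of the localized tempered posterior, plus noise.
                drift = n * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * lr * drift + math.sqrt(lr) * torch.randn_like(p))
        chain_losses.append(loss.item())

    warm = chain_losses[steps // 2:]      # discard burn-in
    return n * beta * (sum(warm) / len(warm) - loss_star)

Tracked across training checkpoints, spikes in this estimate mark candidate phase transitions that can be compared against shifts in the individual reward components.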

Jul 28, 2025

AI agentic system epidemiology

As AI systems scale into decentralized, multi-agent deployments, emergent vulnerabilities challenge our ability to evaluate and manage systemic risks. In this work, we adapt classical epidemiological modeling (specifically SEIR compartment models) to describe adversarial behavior propagation in AI agents. By solving the resulting systems of ODEs with physics-informed neural networks (PINNs), we analyze stable and unstable equilibria, bifurcation points, and the effectiveness of interventions. We estimate parameters from real-world data (e.g., adversarial success rates, detection latency, patching delays) and simulate attack propagation scenarios across eight sectors (enterprise, retail, trading, development, customer service, academia, medical, and critical infrastructure AI tools). Our results demonstrate how agent population dynamics interact with architectural and policy design interventions to stabilize the system. This framework bridges concepts from dynamical systems and cybersecurity to offer a proactive, quantitative toolbox for AI safety. We argue that epidemic-style monitoring and tools grounded in interpretable, physics-aligned dynamics can serve as early warning systems for cascading AI agentic failures.
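
To make the setup concrete, here is a minimal SEIR system for compromise propagation among agents, integrated with a classical ODE solver rather than the PINNs used in the work; all parameter values and population sizes are placeholders, not the estimates fitted from real-world data.

import numpy as np
from scipy.integrate import solve_ivp

def seir(t, y, beta, sigma, gamma):
    # S: susceptible agents, E: exposed (compromised but dormant),
    # I: actively adversarial, R: patched/removed.
    S, E, I, R = y
    N = S + E + I + R
    dS = -beta * S * I / N               # exposure via contact with adversarial agents
    dE = beta * S * I / N - sigma * E    # sigma: rate dormant compromise activates
    dI = sigma * E - gamma * I           # gamma: detection-and-patching rate
    dR = gamma * I
    return [dS, dE, dI, dR]

# Placeholder parameters: transmission, activation, and patching rates.
params = dict(beta=0.4, sigma=0.2, gamma=0.1)
y0 = [990, 5, 5, 0]                      # initial agent population
sol = solve_ivp(seir, (0, 160), y0, args=tuple(params.values()),
                t_eval=np.linspace(0, 160, 400))

# In standard SEIR theory, R0 = beta / gamma decides the stability of the
# attack-free equilibrium: R0 < 1 means the outbreak dies out, R0 > 1 it spreads.
print("R0 =", params["beta"] / params["gamma"])

The ratio beta/gamma plays the role of the basic reproduction number: driving it below 1, for instance by raising the detection-and-patching rate gamma, stabilizes the attack-free equilibrium, which is the kind of intervention question the bifurcation analysis formalizes.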

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.