Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
RewardHackWatch is a runtime detection system for identifying reward hacking and misalignment behaviors in LLM agents, using a multi-layer approach combining ML classification (89.7% F1), regex patterns, and LLM judges. It's designed for AI safety researchers and engineers who need to monitor agentic systems for behaviors like test manipulation, code patching, and deceptive reasoning.
How the donated funds are distributed
Kivach works on the Obyte network, and therefore you can track all donations.