aerosta/rewardhackwatch

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).

Language: Python

License: Apache License 2.0

RewardHackWatch detects reward hacking and misalignment behaviors in LLM agents at runtime. It layers three detectors: an ML classifier (89.7% F1), regex patterns, and LLM judges. It is aimed at AI safety researchers and engineers who need to monitor agentic systems for behaviors such as test manipulation, code patching, and deceptive reasoning.
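To make the multi-layer idea concrete, here is a minimal sketch of how a regex layer, a classifier score, and an optional LLM-judge score might be combined into one verdict. It is not RewardHackWatch's actual API: the names `PATTERNS`, `regex_layer`, `detect`, and `Verdict`, the example patterns, the simple averaging rule, and the 0.5 threshold are all assumptions made for illustration. The classifier and judge are passed in as plain callables that return a score in [0, 1].

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical patterns for illustration only; not the project's real rules.
PATTERNS = {
    "test_manipulation": re.compile(r"\b(?:pytest\.skip|assert\s+True\b|sys\.exit\(0\))"),
    "code_patching": re.compile(r"\b(?:monkeypatch|setattr)\b"),
}


@dataclass
class Verdict:
    score: float           # mean of the layer scores, in [0, 1]
    flagged: bool          # True when score >= threshold
    regex_hits: List[str]  # names of the patterns that matched


def regex_layer(trajectory: str) -> List[str]:
    """Return the sorted names of all patterns found in the trajectory text."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(trajectory))


def detect(
    trajectory: str,
    classifier: Callable[[str], float],
    judge: Optional[Callable[[str], float]] = None,
    threshold: float = 0.5,
) -> Verdict:
    """Average the available layer scores (an assumed combination rule)."""
    hits = regex_layer(trajectory)
    scores = [classifier(trajectory), 1.0 if hits else 0.0]
    if judge is not None:
        scores.append(judge(trajectory))
    score = sum(scores) / len(scores)
    return Verdict(score=score, flagged=score >= threshold, regex_hits=hits)


if __name__ == "__main__":
    # Stand-in classifier: a real one would be a trained model.
    verdict = detect("I will add sys.exit(0) so the tests pass", classifier=lambda t: 0.8)
    print(verdict)
```

Averaging keeps the sketch short; a real system might instead weight the layers or call the more expensive LLM judge only when the cheaper layers disagree.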

Top contributors

aerosta: 68 contributions

dependabot[bot]: 19 contributions
