aerosta

aerosta/rewardhackwatch

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).

Python
7
0
Apache License 2.0

RewardHackWatch is a runtime detection system for identifying reward hacking and misalignment behaviors in LLM agents, using a multi-layer approach combining ML classification (89.7% F1), regex patterns, and LLM judges. It's designed for AI safety researchers and engineers who need to monitor agentic systems for behaviors like test manipulation, code patching, and deceptive reasoning.

Total donated
Undistributed
Share with your subscribers:

Recipients

How the donated funds are distributed

Support the dependencies

Support the repos that depend on this repository

Top contributors

aerosta's profile
aerosta
54 contributions
dependabot[bot]'s profile
dependabot[bot]
5 contributions

Recent events

Kivach works on the Obyte network, and therefore you can track all donations.

No events yet