Evals

Run a golden dataset through the agent, grade each output with code-based checks and an LLM judge, and track the pass rate.
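
The loop described above might look roughly like the following sketch. Every name in it (`Case`, `code_grade`, `run_evals`, `fake_agent`, `fake_judge`, `GOLDEN`) is hypothetical and not taken from the tutorial. The agent and the judge are offline stand-ins, and requiring both graders to agree before a case counts as passed is an assumption; a real setup may weight or report them separately.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str  # substring the answer must contain (used by the code check)


def code_grade(output: str, case: Case) -> bool:
    """Deterministic check: the expected substring appears in the output."""
    return case.expected.lower() in output.lower()


def run_evals(
    agent: Callable[[str], str],
    judge: Callable[[str, str], bool],
    dataset: list[Case],
) -> float:
    """Return the fraction of cases that pass both the code check and the judge."""
    if not dataset:
        raise ValueError("golden dataset is empty")
    passed = 0
    for case in dataset:
        output = agent(case.prompt)
        # Assumption: a case passes only when both graders accept it.
        if code_grade(output, case) and judge(case.prompt, output):
            passed += 1
    return passed / len(dataset)


# Stand-ins so the sketch runs offline; swap in the real agent and judge.
def fake_agent(prompt: str) -> str:
    answers = {"Capital of France?": "The capital is Paris.", "2 + 2?": "5"}
    return answers.get(prompt, "")


def fake_judge(prompt: str, output: str) -> bool:
    # A real judge would send the prompt and output to a model with a rubric.
    return len(output.strip()) > 0


GOLDEN = [Case("Capital of France?", "Paris"), Case("2 + 2?", "4")]

pass_rate = run_evals(fake_agent, fake_judge, GOLDEN)
print(f"pass rate: {pass_rate:.0%}")  # prints "pass rate: 50%"
```

Passing the agent and judge in as plain callables keeps the harness independent of any particular model client, so the same `run_evals` call can be repeated after each change and the resulting pass rates compared.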

Section: testing-evaluation · scene id evals · tutorial 04-testing-evaluation/02-evals