Evals
Run a golden dataset through the agent, grade with code and an LLM judge, and track the pass rate.
Section: testing-evaluation · scene id evals · tutorial 04-testing-evaluation/02-evals
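
The loop summarized above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's actual harness: `Case`, `code_grade`, `run_evals`, `stub_agent`, and `stub_judge` are hypothetical names, the agent and the LLM judge are replaced with local stand-ins so the example runs offline, and it assumes a case passes only when both the code check and the judge agree.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One golden-dataset entry (hypothetical shape)."""
    prompt: str
    expected: str


def code_grade(output: str, case: Case) -> bool:
    # Deterministic check: the expected string must appear in the output.
    return case.expected.lower() in output.lower()


def run_evals(
    agent: Callable[[str], str],
    judge: Callable[[str, str], bool],
    dataset: Sequence[Case],
) -> float:
    """Run every case through the agent and return the pass rate in [0, 1]."""
    if not dataset:
        raise ValueError("golden dataset is empty")
    passed = 0
    for case in dataset:
        output = agent(case.prompt)
        # Assumption: a case passes only if both graders accept it.
        if code_grade(output, case) and judge(case.prompt, output):
            passed += 1
    return passed / len(dataset)


def stub_agent(prompt: str) -> str:
    # Stand-in for the real agent; returns canned answers.
    answers = {
        "What is 2 + 2?": "The answer is 4.",
        "What is the capital of France?": "I think it is Lyon.",
    }
    return answers.get(prompt, "")


def stub_judge(prompt: str, output: str) -> bool:
    # Stand-in for an LLM judge; a real one would call a model with a rubric.
    return len(output.strip()) > 0


GOLDEN = [
    Case("What is 2 + 2?", "4"),
    Case("What is the capital of France?", "Paris"),
]

pass_rate = run_evals(stub_agent, stub_judge, GOLDEN)
print(f"pass rate: {pass_rate:.0%}")
```

To track the pass rate over time, one option is to record the returned value per run (for example alongside a commit hash) and compare it against the previous run.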