Operational scorecards
Score task completion, tool discipline, evidence quality, safety, and communication.
Agent Evaluation Field Notes
Compact templates, replay checklists, and RAG guardrail smoke tests for builders who need repeatable agent evaluation workflows.
Debug agent runs by replaying decisions, tool calls, evidence, and recovery points.
Smoke-test prompt injection, vector poisoning, source grounding, and citation behavior.