Harvey LAB offers an open-source benchmark for evaluating LLM agents on realistic legal tasks using an all-pass rubric and LLM-as-judge scoring. It includes datasets, adapters, and dashboards.
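To make "all-pass rubric" concrete, here is a minimal sketch of how such scoring could be aggregated. It assumes an LLM judge has already produced one pass/fail verdict per rubric criterion. The names `CriterionResult`, `all_pass`, and `pass_rate` are hypothetical and are not taken from the Harvey LAB codebase.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One rubric criterion and the judge's verdict (hypothetical structure)."""
    name: str
    passed: bool


def all_pass(results: list[CriterionResult]) -> bool:
    # A task counts as solved only if every criterion passed.
    # An empty rubric is treated as a failure (an assumption of this sketch).
    return bool(results) and all(r.passed for r in results)


def pass_rate(tasks: dict[str, list[CriterionResult]]) -> float:
    # Fraction of tasks whose rubric was satisfied in full.
    if not tasks:
        return 0.0
    return sum(all_pass(results) for results in tasks.values()) / len(tasks)


# Illustrative verdicts only; not real benchmark data.
example_tasks = {
    "draft_nda": [
        CriterionResult("identifies parties", True),
        CriterionResult("includes term clause", True),
    ],
    "summarize_ruling": [
        CriterionResult("states holding", True),
        CriterionResult("cites correct court", False),
    ],
}
```

Under this scheme a single failed criterion fails the whole task, which makes it stricter than averaging per-criterion scores.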
Skill Conductor manages AI agent skills through a 5-mode lifecycle and enforced design patterns, steering authors away from common pitfalls such as the ‘description trap’ so that skills behave more reliably.
google/agents-cli gives coding assistants skills for building, evaluating, and deploying AI agents on Google Cloud’s ADK. Its modular CLI workflow spans the path from agent scaffolding through observability.
Agenta is an open-source TypeScript LLMOps platform offering prompt management, evaluation across 50+ models, and production observability with OpenTelemetry. Self-host via Docker Compose.