A lightweight, open-source tool for building "integration tests" to ensure LLMs are doing what you want.
Get started
Easy-to-use workflows explain how to create and run benchmarks and get the most out of LLM Bench.
Evaluate completions with strict string equality or custom functions, or flag them for human review.
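For example, a custom evaluation function might look something like the sketch below. The `EvalResult` shape and function signature are illustrative assumptions, not LLM Bench's actual evaluator API:

```typescript
// Illustrative sketch: LLM Bench's real evaluator interface may differ.
// A custom evaluator receives the model's completion and the expected
// value, and decides whether the test example passes.
type EvalResult = { pass: boolean; reason?: string };

function caseInsensitiveMatch(completion: string, expected: string): EvalResult {
  const pass = completion.trim().toLowerCase() === expected.trim().toLowerCase();
  return {
    pass,
    reason: pass ? undefined : `Expected "${expected}", got "${completion}"`,
  };
}
```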
LLM Bench provides basic boilerplate for common functionality, but can be customized to fit your use case.
Automatically prompt for and validate LLM outputs as type-checked JSON for easier evaluation.
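In practice, this kind of validation can be done with a schema library such as Zod. The schema below is a hypothetical example for a sentiment-classification task, shown only to illustrate the idea rather than LLM Bench's built-in mechanism:

```typescript
import { z } from "zod";

// Hypothetical output schema for a sentiment-classification task.
const Sentiment = z.object({
  label: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
});

// Parse the raw completion and validate its shape before scoring.
function parseOutput(completion: string) {
  try {
    const result = Sentiment.safeParse(JSON.parse(completion));
    return result.success ? result.data : null;
  } catch {
    return null; // the completion was not valid JSON at all
  }
}
```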
Leverage existing benchmarks or build test suites in a spreadsheet.
The LLM Bench dashboard is hosted on Interval, while your evals run locally by default.
Built-in support for OpenAI, Cohere, and Hugging Face models, and a framework for easily adding more.
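A new provider could be added by wrapping its completion endpoint behind a common interface. The `ModelProvider` interface and endpoint below are assumptions for illustration, not LLM Bench's actual extension API:

```typescript
// Illustrative shape for a model adapter; LLM Bench's real provider
// interface may differ.
interface ModelProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Example: wrapping a hypothetical HTTP completion endpoint.
const myProvider: ModelProvider = {
  name: "my-custom-model",
  async complete(prompt) {
    const res = await fetch("https://example.com/v1/complete", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });
    const data = await res.json();
    return data.completion as string;
  },
};
```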
Compare past runs across models and prompts, and review the results for each test example.
Re-run individual test examples as needed to rule out spurious errors.
LLM Bench is designed to be easy to use. The dashboard will walk you through testing your models, but here's a quick overview of how it works.
LLM Bench is written in TypeScript and runs on Interval.
Interval is a thoughtful, intuitive approach to quickly building powerful UIs that integrate seamlessly with your backend—no API endpoints, drag-and-drop, or frontend code required.