┳━┳ LLM Bench


Create and run test suites for your language models

A lightweight, open-source tool for building up "integration tests" to ensure LLMs are doing what you want.

Get started
Why bother?

LLMs are tough to wrangle 😫

Cherry-picked examples may dazzle, but the real challenge in deploying large language models (LLMs) is ensuring reliability across many inputs and scenarios. LLM Bench makes this easier.
  • Test across a wide spectrum of inputs to build confidence in LLM behavior
  • Compare models, and make informed trade-offs between performance, latency, and cost
  • Understand limitations in the models you use and where exactly they fall short
  • Improve your prompts systematically instead of fiddling with one-off tweaks
Demo

See it in action

Features

What’s in the box?

Friendly UI

Easy-to-use workflows walk you through creating and running benchmarks and getting the most out of LLM Bench.

Flexible evaluation

Evaluate completions with strict string equality, custom functions, or human review.
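
For example, a custom evaluator can be as small as a single function. The signature below is a hypothetical sketch, assuming an evaluator receives the expected output and the model's completion and returns pass/fail; the exact shape LLM Bench expects may differ.

```ts
// Hypothetical evaluator shape: expected output + model completion in, boolean out.
type Evaluator = (expected: string, completion: string) => boolean;

// Example: pass if the completion contains the expected answer,
// ignoring case and surrounding whitespace.
const containsAnswer: Evaluator = (expected, completion) =>
  completion.trim().toLowerCase().includes(expected.trim().toLowerCase());
```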

Open source and extensible

LLM Bench provides basic boilerplate for common functionality, but can be customized to fit your use case.

Structured completions

Automatically prompt for and validate LLM outputs as type-checked JSON for easier evaluation.
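
A rough sketch of what type-checked JSON validation can look like, using zod purely for illustration; LLM Bench's actual schema and validation mechanism may differ.

```ts
import { z } from "zod";

// Illustrative schema for a classification-style completion.
const Answer = z.object({
  label: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
});

// Parse the raw completion; malformed or off-schema output counts as a failure.
function parseCompletion(raw: string): z.infer<typeof Answer> | null {
  try {
    return Answer.parse(JSON.parse(raw));
  } catch {
    return null;
  }
}
```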

CSV upload to create examples

Leverage existing benchmarks or build up test suites in a spreadsheet.
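
For illustration, a test-suite CSV might look something like the snippet below; the column names here are hypothetical, so check the dashboard for the exact headers LLM Bench expects.

```csv
input,expected_output
"Translate to French: Hello, world!","Bonjour, le monde !"
"What is 12 * 7?","84"
```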

Hosted dashboard

The LLM Bench dashboard is hosted on Interval, while your evals run locally by default.

Test any model

Built-in support for OpenAI, Cohere, and Hugging Face models, plus a framework for easily adding more.
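
As a sketch of what adding a model could involve, the adapter below is an assumption for illustration rather than LLM Bench's actual API; the interface name, shape, and endpoint are all hypothetical.

```ts
// Hypothetical adapter shape: a name plus a prompt-in, completion-out function.
interface ModelAdapter {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Example: wiring up an assumed self-hosted HTTP completion endpoint.
const myLocalModel: ModelAdapter = {
  name: "my-local-model",
  async complete(prompt) {
    const res = await fetch("http://localhost:8080/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });
    const data = (await res.json()) as { text: string };
    return data.text;
  },
};
```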

Benchmark history

Compare past runs across models and prompts, and review the results for each test example.

Retry any test

Re-run individual test examples as needed to recover from spurious failures.

How it works

Create a benchmark in minutes

LLM Bench is designed to be easy to use. The dashboard will walk you through testing your models, but here's a quick overview of how it works.

1. Create a benchmark 🛠

2. Run it against a model 🏃

3. Compare results 👩‍🔬

Built with Interval

LLM Bench is written in TypeScript and runs on Interval.

Interval is a thoughtful, intuitive approach to quickly building powerful UIs that integrate seamlessly with your backend—no API endpoints, drag-and-drop, or frontend code required.