Evals

Each run evaluates a candidate model against a dataset. Datasets can be materialized from live product data; open a product page to start one.

| Run | Dataset | Model | Kind | Status | Progress | Accuracy | Cost | Avg latency |
|---|---|---|---|---|---|---|---|---|
| run_793ee525eb84… | viva — human adjudications (2026-04-21) | GPT-4o-mini | model_comparison | running | 29/10 | | $0.001564 | 974ms |
| run_1e1e3fe5d8d6… | viva — human adjudications (2026-04-21) | GPT-4o | model_comparison | running | 31/10 | | $0.0273 | 895ms |
| run_2d9dbf859a9e… | viva — human adjudications (2026-04-21) | Claude Sonnet 4.5 | model_comparison | running | 12/10 | | $0.0187 | 2430ms |
| run_5f6d411cfddd… | viva — human adjudications (2026-04-21) | Claude Haiku 4.5 | model_comparison | running | 26/10 | | $0.0140 | 1089ms |