Evals
Runs compare candidate models against a dataset. Datasets can be materialized from live product data; visit a product page to kick one off.
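A run of kind `model_comparison` pairs one dataset with one candidate model, so comparing several models against the same dataset means launching one run per model, as in the table below. The sketch that follows shows how such a batch might be kicked off programmatically; the endpoint path, payload fields, and identifiers are assumptions for illustration, not the product's actual API.

```python
# Hypothetical sketch: launch one model_comparison run per candidate model
# against an already-materialized dataset. The endpoint path, payload fields,
# and dataset/model identifiers below are assumptions, not the real API.
import requests

API_BASE = "https://evals.example.com/api"   # assumed base URL
DATASET_ID = "ds_viva_2026_04_21"            # assumed dataset identifier
CANDIDATE_MODELS = [
    "gpt-4o-mini",
    "gpt-4o",
    "claude-sonnet-4.5",
    "claude-haiku-4.5",
]

def launch_comparison_runs() -> list[str]:
    """Start one eval run per candidate model and return the run IDs."""
    run_ids = []
    for model in CANDIDATE_MODELS:
        resp = requests.post(
            f"{API_BASE}/eval-runs",
            json={
                "dataset_id": DATASET_ID,
                "model": model,
                "kind": "model_comparison",
            },
            timeout=30,
        )
        resp.raise_for_status()
        run_ids.append(resp.json()["id"])
    return run_ids

if __name__ == "__main__":
    for run_id in launch_comparison_runs():
        print(f"started {run_id}")
```

Each run then reports its own progress, accuracy, cost, and latency independently, which is why the table shows one row per model rather than a single combined row.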
| Run | Dataset | Model | Kind | Status | Progress | Accuracy | Cost | Avg latency |
|---|---|---|---|---|---|---|---|---|
| run_793ee525eb84… | viva — human adjudications (2026-04-21) | GPT-4o-mini | model_comparison | running | 29/10 | — | $0.001564 | 974ms |
| run_1e1e3fe5d8d6… | viva — human adjudications (2026-04-21) | GPT-4o | model_comparison | running | 31/10 | — | $0.0273 | 895ms |
| run_2d9dbf859a9e… | viva — human adjudications (2026-04-21) | Claude Sonnet 4.5 | model_comparison | running | 12/10 | — | $0.0187 | 2430ms |
| run_5f6d411cfddd… | viva — human adjudications (2026-04-21) | Claude Haiku 4.5 | model_comparison | running | 26/10 | — | $0.0140 | 1089ms |