Model Arena Leaderboard
Real-time model performance rankings and metrics
Summary cards: Total Models | Avg Arena Score | Top Model | Total Evaluations
Filter by Use-Case: All Use-Cases | Accounting | Coding | Research | Identity | General
Last updated timestamp shown; auto-refresh every 5 seconds (see the polling sketch below).
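A minimal polling sketch for the 5-second refresh, assuming a hypothetical /api/arena/leaderboard endpoint that accepts a use_case query parameter (neither the URL nor the parameter name comes from the dashboard itself):

```python
import time

import requests

LEADERBOARD_URL = "http://localhost:8000/api/arena/leaderboard"  # assumed endpoint


def poll_leaderboard(use_case: str = "all", interval_s: float = 5.0) -> None:
    """Fetch the leaderboard every `interval_s` seconds, filtered by use-case."""
    while True:
        resp = requests.get(LEADERBOARD_URL, params={"use_case": use_case}, timeout=10)
        resp.raise_for_status()
        rows = resp.json()
        print(f"fetched {len(rows)} models for use-case '{use_case}'")
        time.sleep(interval_s)
```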
Leaderboard columns: Rank | Model | Arena Score | Identity Acc. | Persona Pres. | Latency | Throughput | Perplexity | Judge Score | Last Evaluated
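Each leaderboard row carries the columns above. A minimal sketch of one row as a record; the field names are assumptions, not the dashboard's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LeaderboardRow:
    """One row of the Model Arena leaderboard (field names are illustrative)."""
    rank: int
    model: str
    arena_score: float           # weighted composite, 0-1
    identity_accuracy: float     # exact-match rate, 0-1
    persona_preservation: float  # keyword-match rate, 0-1
    latency_ms: float            # median (p50) response time
    throughput_tps: float        # tokens per second
    perplexity: float            # lower is better
    judge_score: float           # DeepEval judge score, 0-1
    last_evaluated: datetime
```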
Metrics Guide
Arena Score: Weighted composite score (0-1) including DeepEval quality metrics. Weighting: Identity (45%) + Persona (25%) + Latency (10%) + Perplexity (10%) + Judge (10%)
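The weights are fixed, but how latency and perplexity (both lower-is-better) are mapped onto a 0-1 scale is not spelled out here, so the normalization in this sketch is an assumption:

```python
ARENA_WEIGHTS = {
    "identity": 0.45,
    "persona": 0.25,
    "latency": 0.10,
    "perplexity": 0.10,
    "judge": 0.10,
}


def arena_score(identity: float, persona: float, judge: float,
                latency_ms: float, perplexity: float,
                max_latency_ms: float = 5000.0, max_perplexity: float = 50.0) -> float:
    """Weighted composite in [0, 1]. Latency and perplexity are lower-is-better,
    so they are inverted against assumed ceilings before weighting."""
    latency_norm = max(0.0, 1.0 - latency_ms / max_latency_ms)
    perplexity_norm = max(0.0, 1.0 - perplexity / max_perplexity)
    components = {
        "identity": identity,
        "persona": persona,
        "latency": latency_norm,
        "perplexity": perplexity_norm,
        "judge": judge,
    }
    return sum(ARENA_WEIGHTS[k] * v for k, v in components.items())
```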
Identity Acc.: Exact-match rate for identity queries ("Who are you?" → "Keiken"); sketched together with Persona Pres. below
Persona Pres.: Keyword-match rate for persona knowledge questions
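Both accuracy columns are simple pass rates over an evaluation set. A sketch of the two checks; whether the identity check is strict equality or containment, and whether persona questions require all or just some keywords, are assumptions:

```python
def identity_accuracy(responses: list[str], expected: str = "Keiken") -> float:
    """Exact-match rate for identity queries.
    Assumes a case-insensitive containment check; swap in strict equality
    if the evaluator requires it."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if expected.lower() in r.lower())
    return hits / len(responses)


def persona_preservation(responses: list[str], keyword_sets: list[list[str]]) -> float:
    """Keyword-match rate: a response counts as preserved if it mentions
    every expected keyword for its persona question (assumed 'all' semantics)."""
    if not responses:
        return 0.0
    hits = sum(
        1
        for response, keywords in zip(responses, keyword_sets)
        if all(kw.lower() in response.lower() for kw in keywords)
    )
    return hits / len(responses)
```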
Judge Score: DeepEval evaluation using local Ollama models (Helpfulness, Correctness, Coherence)
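A sketch using DeepEval's GEval metric, assuming DeepEval has already been configured to use a local Ollama model as its judge; the three criteria strings are illustrative, not the dashboard's exact prompts:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

CRITERIA = {
    "Helpfulness": "Does the response actually help with the user's request?",
    "Correctness": "Is the response factually and logically correct?",
    "Coherence": "Is the response well structured and easy to follow?",
}


def judge_score(prompt: str, response: str) -> float:
    """Average of three G-Eval criteria, each scored in [0, 1]."""
    test_case = LLMTestCase(input=prompt, actual_output=response)
    scores = []
    for name, criteria in CRITERIA.items():
        metric = GEval(
            name=name,
            criteria=criteria,
            evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        )
        metric.measure(test_case)
        scores.append(metric.score)
    return sum(scores) / len(scores)
```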
Latency: Median response time (p50) in milliseconds; sketched together with Throughput below
Throughput: Token generation speed (tokens/sec)
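Both runtime columns come from per-request measurements; a minimal sketch:

```python
import statistics


def p50_latency_ms(latencies_ms: list[float]) -> float:
    """Median (p50) response time in milliseconds."""
    return statistics.median(latencies_ms)


def throughput_tps(tokens_generated: int, elapsed_s: float) -> float:
    """Token generation speed in tokens per second."""
    return tokens_generated / elapsed_s if elapsed_s > 0 else 0.0
```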
Perplexity: Language model uncertainty (lower is better)
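Perplexity is the exponential of the mean negative log-likelihood the model assigns to the evaluation tokens. A sketch from per-token log-probabilities:

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood; lower means the model
    is less 'surprised' by the evaluation text."""
    if not token_logprobs:
        return float("inf")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```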
Click any row for detailed comparison