Model Arena Leaderboard

Real-time model performance rankings and metrics

Total Models: -
Avg Arena Score: -
Top Model: -
Total Evaluations: -
Last updated: -
Auto-refresh every 5 seconds
Rank | Model | Arena Score | Identity Acc. | Persona Pres. | Latency (ms) | Throughput (tokens/sec) | Perplexity | Judge Score | Last Evaluated
Metrics Guide
  • Arena Score: Weighted composite score (0-1) that includes DeepEval quality metrics. Weights sum to 100%:
    Identity (45%) + Persona (25%) + Latency (10%) + Perplexity (10%) + Judge (10%)
  • Identity Acc.: Exact match rate for identity queries ("Who are you?" → "Keiken")
  • Persona Pres.: Keyword match rate for persona knowledge questions
  • Judge Score: DeepEval evaluation run with local Ollama models, covering Helpfulness, Correctness, and Coherence
  • Latency: Median response time (p50) in milliseconds
  • Throughput: Token generation speed (tokens/sec)
  • Perplexity: Language model uncertainty, the exponential of the average per-token negative log-likelihood (lower is better)
  • Click any row for detailed comparison
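
The Arena Score weighting in the guide above can be sketched as a weighted sum. The guide does not say how latency and perplexity, which are lower-is-better, are mapped onto the 0-1 range, so this sketch assumes a simple linear scaling against fixed ceilings. The names `ModelMetrics`, `inverse_scale`, `arena_score`, `max_latency_ms`, and `max_perplexity` are hypothetical, and the default ceilings are illustrative only.

```python
from dataclasses import dataclass

# Weights as listed in the Metrics Guide (they sum to 1.0).
WEIGHTS = {
    "identity": 0.45,
    "persona": 0.25,
    "latency": 0.10,
    "perplexity": 0.10,
    "judge": 0.10,
}


@dataclass
class ModelMetrics:
    identity_acc: float    # 0-1, higher is better
    persona_pres: float    # 0-1, higher is better
    latency_p50_ms: float  # milliseconds, lower is better
    perplexity: float      # lower is better
    judge_score: float     # 0-1, higher is better (assumed range)


def inverse_scale(value: float, ceiling: float) -> float:
    """Map a lower-is-better value onto 0-1, where 1.0 is best.

    ASSUMPTION: linear scaling against a fixed ceiling, clamped to 0-1.
    The dashboard's actual normalization is not documented.
    """
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return min(1.0, max(0.0, 1.0 - value / ceiling))


def arena_score(
    m: ModelMetrics,
    max_latency_ms: float = 5000.0,  # hypothetical ceiling
    max_perplexity: float = 100.0,   # hypothetical ceiling
) -> float:
    """Weighted composite of the five components, in the 0-1 range."""
    components = {
        "identity": m.identity_acc,
        "persona": m.persona_pres,
        "latency": inverse_scale(m.latency_p50_ms, max_latency_ms),
        "perplexity": inverse_scale(m.perplexity, max_perplexity),
        "judge": m.judge_score,
    }
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = ModelMetrics(
        identity_acc=1.0,
        persona_pres=0.5,
        latency_p50_ms=2500.0,
        perplexity=50.0,
        judge_score=0.8,
    )
    print(round(arena_score(example), 3))
```

Because Identity and Persona together carry 70% of the weight, a model that answers identity and persona questions correctly will outrank a faster model that does not, whatever normalization is used for the remaining components.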
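
The individual metrics in the guide above could be computed roughly as follows. This is a minimal sketch, not the dashboard's implementation: the function names are hypothetical, the identity check assumes a case-sensitive comparison after trimming whitespace, and the persona check assumes an answer counts as a hit when it contains at least one expected keyword.

```python
import math
import statistics


def identity_accuracy(responses, expected="Keiken"):
    """Fraction of identity answers that exactly match the expected name.

    ASSUMPTION: whitespace is stripped and the comparison is case-sensitive.
    """
    if not responses:
        return 0.0
    return sum(r.strip() == expected for r in responses) / len(responses)


def persona_preservation(responses, expected_keywords):
    """Fraction of answers containing at least one expected keyword.

    ASSUMPTION: case-insensitive substring match; expected_keywords[i]
    is the keyword list for responses[i].
    """
    if not responses:
        return 0.0
    hits = 0
    for response, keywords in zip(responses, expected_keywords):
        text = response.lower()
        if any(keyword.lower() in text for keyword in keywords):
            hits += 1
    return hits / len(responses)


def latency_p50_ms(latencies_ms):
    """Median (p50) response time in milliseconds."""
    return statistics.median(latencies_ms)


def throughput(tokens_generated, elapsed_seconds):
    """Token generation speed in tokens per second."""
    return tokens_generated / elapsed_seconds


def perplexity(token_logprobs):
    """Exponential of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Under these assumptions, an answer such as "I am Keiken" would not count as an exact identity match; a looser matching rule would raise Identity Acc. without any change in the model.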