Methodology

How HalluRank evaluates LLM hallucination

Overview

HalluRank is powered by TruthAnchor, a multi-layer hallucination defense platform. We evaluate LLM responses across five orthogonal scoring dimensions using automated pipelines that combine NLI verification, semantic matching, and entropy-based uncertainty analysis. Each dimension captures a distinct failure mode of language model generation.

Scoring Dimensions

1. Factual Accuracy (weight: 35%)

Claims are extracted from LLM responses using our ClaimExtractor, then verified against ground-truth evidence using a cross-encoder NLI model (DeBERTa-v3). Each claim is classified as entailment, contradiction, or neutral. The factual accuracy score is the ratio of entailed claims to total verifiable claims.
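
The scoring step can be sketched as follows. This is an illustrative sketch, not the production code: the function name `factual_accuracy` and the assumption that claims labeled "neutral" are treated as unverifiable (and excluded from the denominator) are ours, not taken from the HalluRank pipeline.

```python
def factual_accuracy(nli_labels: list[str]) -> float:
    """Ratio of entailed claims to total verifiable claims.

    `nli_labels` holds one NLI verdict per extracted claim:
    "entailment", "contradiction", or "neutral".
    Assumption: "neutral" claims lack decisive evidence and are
    excluded from the denominator as unverifiable.
    """
    verifiable = [l for l in nli_labels if l in ("entailment", "contradiction")]
    if not verifiable:
        return 0.0  # no verifiable claims at all
    return sum(1 for l in verifiable if l == "entailment") / len(verifiable)
```

For example, a response with two entailed claims, one contradicted claim, and one neutral claim would score 2/3 under this sketch.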

2. Numerical Accuracy (weight: 20%)

Numerical claims (interest rates, amounts, percentages) are extracted and compared against known correct values with domain-specific tolerance thresholds. Financial calculations are independently verified.
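
A tolerance-based comparison can be sketched like this; the 1% relative tolerance is an illustrative default, not HalluRank's actual domain threshold, and the helper names are ours:

```python
def numbers_match(predicted: float, expected: float,
                  rel_tol: float = 0.01, abs_tol: float = 1e-9) -> bool:
    """True if an extracted number is within tolerance of the known value.

    rel_tol stands in for the domain-specific threshold (1% here is
    only an illustrative default).
    """
    return abs(predicted - expected) <= max(rel_tol * abs(expected), abs_tol)

def numerical_accuracy(pairs: list[tuple[float, float]],
                       rel_tol: float = 0.01) -> float:
    """Fraction of (predicted, expected) numeric claims within tolerance."""
    if not pairs:
        return 1.0  # assumption: no numeric claims means nothing to penalize
    hits = sum(numbers_match(p, e, rel_tol) for p, e in pairs)
    return hits / len(pairs)
```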

3. Citation Reliability (weight: 15%)

Our EvidenceMatcher uses sentence-transformers to compute semantic similarity between each claim and the source evidence documents. Higher similarity scores indicate better grounding in factual sources.
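
A minimal sketch of the matching step, assuming claim and evidence texts have already been embedded (in the real pipeline, by a sentence-transformers model). The max-then-mean aggregation, where each claim is paired with its best-matching evidence document, is our assumption, not a documented detail of EvidenceMatcher:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def citation_reliability(claim_vecs: list[list[float]],
                         evidence_vecs: list[list[float]]) -> float:
    """For each claim, take the best-matching evidence document;
    average those maxima (aggregation scheme is an assumption)."""
    if not claim_vecs or not evidence_vecs:
        return 0.0
    best = [max(cosine(c, e) for e in evidence_vecs) for c in claim_vecs]
    return sum(best) / len(best)
```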

4. Consistency (weight: 15%)

Each model is queried three times with the same prompt (temperature = 0.3). Pairwise cosine similarity between the response embeddings measures self-consistency; inconsistent models are more likely to hallucinate unpredictably.
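
The consistency score can be sketched as the mean pairwise cosine similarity over the response embeddings (three per prompt in this setup); the function name is ours:

```python
import math
from itertools import combinations

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def consistency(response_vecs: list[list[float]]) -> float:
    """Mean pairwise cosine similarity across repeated responses
    to the same prompt (3 samples in the described setup)."""
    pairs = list(combinations(response_vecs, 2))
    if not pairs:
        return 1.0  # a single response is trivially self-consistent
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```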

5. Uncertainty Calibration (weight: 15%)

When available, token-level logprobs are used to compute entropy-based uncertainty scores. Well-calibrated models express higher uncertainty on questions they are likely to get wrong. For models without logprob support, a text-based heuristic analysis is used as a fallback.
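
One common way to derive entropy from API-style top-k logprobs is sketched below; renormalizing over the returned alternatives is an approximation (the API truncates the full vocabulary), and these helper names are ours, not HalluRank's:

```python
import math

def token_entropy(top_logprobs: list[float]) -> float:
    """Shannon entropy (nats) of one token position, computed from the
    top-k alternative logprobs and renormalized over them
    (an approximation of the full-vocabulary entropy)."""
    probs = [math.exp(lp) for lp in top_logprobs]
    z = sum(probs)
    probs = [p / z for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_uncertainty(per_token_top_logprobs: list[list[float]]) -> float:
    """Average per-token entropy across the whole response."""
    ents = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(ents) / len(ents) if ents else 0.0
```

A fully confident token (probability 1) yields entropy 0; a uniform split over two alternatives yields ln 2.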

Evaluation Pipeline

1. Benchmark questions are loaded from our curated multi-domain dataset.
2. Each question is sent to every evaluated model via LiteLLM.
3. Responses are scored across all five dimensions in parallel.
4. Dimension scores are combined using weighted aggregation.
5. Results are aggregated per model, per domain, and per category.
6. Rankings are computed and published to the leaderboard.
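
The weighted aggregation in step 4 can be sketched directly from the dimension weights listed above (35/20/15/15/15); the dictionary keys and function name are our own labels:

```python
# Dimension weights as listed in the scoring section (sum to 1.0).
WEIGHTS = {
    "factual_accuracy": 0.35,
    "numerical_accuracy": 0.20,
    "citation_reliability": 0.15,
    "consistency": 0.15,
    "uncertainty_calibration": 0.15,
}

def overall_score(dim_scores: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores (each assumed in [0, 1])."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)
```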

Evaluation Domains

Finance: banking, insurance, securities, and investment
Healthcare: medical knowledge, drug information, and diagnosis
Legal: regulations, statutes, and legal procedures
IT: technology, cybersecurity, and software engineering
Education: academic knowledge, curriculum, and pedagogy
General: common knowledge, reasoning, and logic