Methodology
How HalluRank evaluates LLM hallucination
Overview
HalluRank is powered by TruthAnchor, a multi-layer hallucination defense platform. We evaluate LLM responses across five orthogonal scoring dimensions using automated pipelines that combine NLI-based verification, semantic matching, and entropy-based uncertainty analysis. Each dimension captures a distinct failure mode of language model generation.
Scoring Dimensions
1. Factual Accuracy
Weight: 35%. Claims are extracted from LLM responses using our ClaimExtractor, then verified against ground-truth evidence using a cross-encoder NLI model (DeBERTa-v3). Each claim is classified as entailment, contradiction, or neutral. The factual accuracy score is the ratio of entailed claims to total verifiable claims.
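Once each claim carries an NLI label, the dimension score reduces to a ratio. A minimal sketch of that final step (the function name is illustrative, and treating neutral claims as non-verifiable is our assumption, not a confirmed detail of HalluRank):

```python
def factual_accuracy(nli_labels):
    """Score = entailed claims / total verifiable claims.

    `nli_labels` holds one NLI verdict per extracted claim. We assume
    'neutral' claims cannot be verified against the evidence and are
    excluded from the denominator.
    """
    verifiable = [l for l in nli_labels if l in ("entailment", "contradiction")]
    if not verifiable:
        return None  # no verifiable claims; leave the dimension unscored
    return sum(l == "entailment" for l in verifiable) / len(verifiable)
```

The upstream claim extraction and DeBERTa-v3 classification are left out here; only the scoring arithmetic is shown.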
2. Numerical Accuracy
Weight: 20%. Numerical claims (interest rates, amounts, percentages) are extracted and compared against known correct values with domain-specific tolerance thresholds. Financial calculations are independently verified.
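A tolerance check of this kind can be sketched as a relative-error comparison. The threshold values and quantity names below are hypothetical placeholders, not HalluRank's actual configuration:

```python
def within_tolerance(claimed, truth, rel_tol):
    """True if an extracted numerical claim matches the known value
    within a relative tolerance. Exact match is required when the
    ground truth is zero (relative error is undefined there)."""
    if truth == 0:
        return claimed == 0
    return abs(claimed - truth) / abs(truth) <= rel_tol

# Hypothetical per-quantity tolerances for illustration only.
TOLERANCES = {"interest_rate": 0.001, "amount": 0.0, "percentage": 0.005}
```

Stricter tolerances for monetary amounts than for rates would reflect that a rounded rate is acceptable while a wrong balance is not; the actual thresholds are domain decisions.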
3. Citation Reliability
Weight: 15%. Our EvidenceMatcher uses sentence-transformers to compute semantic similarity between each claim and the source evidence documents. Higher relevance scores indicate better grounding in factual sources.
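With sentence embeddings in hand, grounding reduces to cosine similarity against the evidence set. A dependency-free sketch (taking the best-matching evidence document is our assumption; `grounding_score` is an illustrative name):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def grounding_score(claim_vec, evidence_vecs):
    """Similarity between a claim and its best-matching evidence document."""
    return max(cosine(claim_vec, e) for e in evidence_vecs)
```

In practice the vectors would come from a sentence-transformers model; the scoring logic is the same.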
4. Consistency
Weight: 15%. Each model is queried 3 times with the same prompt (temperature=0.3). Pairwise cosine similarity between responses measures self-consistency. Inconsistent models are more likely to hallucinate unpredictably.
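Given embeddings of the three sampled responses, self-consistency can be sketched as the mean pairwise cosine similarity (averaging the pairs is our assumption about how the pairwise scores are combined):

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def consistency(response_embeddings):
    """Mean cosine similarity over all response pairs (3 pairs for 3 samples)."""
    sims = [cosine(a, b) for a, b in combinations(response_embeddings, 2)]
    return sum(sims) / len(sims)
```

With three samples there are exactly three pairs, so a single divergent response pulls the score down sharply.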
5. Uncertainty Calibration
Weight: 15%. When available, token-level logprobs are used to compute entropy-based uncertainty scores. Well-calibrated models express higher uncertainty on questions they are likely to get wrong. For models that do not expose logprobs, text-based heuristic analysis is used as a fallback.
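An entropy score from token logprobs can be sketched as follows. Since APIs return only the top-k alternatives per token, the entropy is computed over the renormalized top-k distribution, which is an approximation of the true token entropy (our assumption about the method, not a confirmed detail):

```python
import math

def mean_token_entropy(token_top_logprobs):
    """Average Shannon entropy per generated token.

    `token_top_logprobs` holds, for each token, the log-probabilities of
    the top-k candidate tokens. Each top-k slice is renormalized to a
    proper distribution before computing entropy.
    """
    entropies = []
    for logprobs in token_top_logprobs:
        probs = [math.exp(lp) for lp in logprobs]
        z = sum(probs)
        probs = [p / z for p in probs]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)
```

Higher mean entropy means the model was less certain while generating; calibration then compares that uncertainty against actual correctness.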
Evaluation Pipeline
1. Benchmark questions are loaded from our curated multi-domain dataset.
2. Each question is sent to every evaluated model via LiteLLM.
3. Responses are scored across all five dimensions in parallel.
4. Dimension scores are combined using weighted aggregation.
5. Results are aggregated per model, per domain, and per category.
6. Rankings are computed and published to the leaderboard.
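The weighted-aggregation step can be sketched with the published weights. Renormalizing when a dimension is unscored (e.g. no numeric claims in a response) is our assumption, and the dimension keys are illustrative:

```python
# Published dimension weights (sum to 1.0).
WEIGHTS = {
    "factual": 0.35,
    "numerical": 0.20,
    "citation": 0.15,
    "consistency": 0.15,
    "calibration": 0.15,
}

def overall_score(dim_scores):
    """Weighted sum of available dimension scores.

    Dimensions scored as None are dropped and the remaining weights
    renormalized, so a model is not penalized for an inapplicable
    dimension (this handling is an assumption).
    """
    present = {k: v for k, v in dim_scores.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight
```

A model scoring 0.8 on every dimension gets 0.8 overall, regardless of which dimensions are present.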
Evaluation Domains
Finance: banking, insurance, securities, investment
Healthcare: medical knowledge, drug information, diagnosis
Legal: regulations, statutes, legal procedures
IT: technology, cybersecurity, software engineering
Education: academic knowledge, curriculum, pedagogy
General: common knowledge, reasoning, logic