The best frontier model tested still fabricated citations 49% of the time.
HalluciBench v1 evaluated 13 frontier models across 135 prompts and 9 high-stakes domains.
AI systems write confident claims, cite real-looking sources, and collapse uncertainty into answers. Hallucinaite audits that output before organizations stake legal, medical, financial, or regulatory decisions on it.
GPTZero asks whether AI wrote it. We ask whether AI got it right: source support, citation laundering, hallucination risk, and credit-rating-style model grades.
Audit Output
Source exists, but does not support the claimed magnitude or conclusion.
Model collapses conflicting source evidence into a single definitive claim.
Primary reference supports the architectural mechanism described.
HalluciBench v1
| Rank | Model | Grade | Hallucination Rate |
|---|---|---|---|
| 1 | Claude Sonnet 4.6 | CC | 49.1% |
| 2 | MiMo-V2-Pro | CC | 54.0% |
| 3 | Kimi K2.5 | CC | 54.4% |
| 4 | Qwen 3.6 Plus | CC | 55.4% |
| 5 | Gemini 3.1 Pro Preview | CC | 56.6% |
| 6 | Claude Opus 4.6 | C | 60.2% |
Claims enter, evidence gets inspected, and risk comes out as a structured signal an enterprise team can act on.
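A minimal sketch of what that structured signal could look like. The field names and verdict labels below are illustrative assumptions, not Hallucinaite's actual schema:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"      # source backs the claim as stated
    OVERREACH = "overreach"      # real source, stronger claim than it contains
    UNRESOLVED = "unresolved"    # cited source could not be found
    FABRICATED = "fabricated"    # cited source does not exist at all


@dataclass
class ClaimAudit:
    claim: str        # the atomic claim extracted from model output
    citation: str     # the source the model offered for it
    verdict: Verdict  # what inspecting the evidence found
    rationale: str    # one line a reviewer can act on


# The citation-laundering case from the audit examples above, as a record.
example = ClaimAudit(
    claim="The intervention sharply reduced review time",
    citation="(hypothetical) Smith et al., 2023",
    verdict=Verdict.OVERREACH,
    rationale="Source exists, but does not support the claimed magnitude.",
)
```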
Source supports the stated review-time reduction.
Cited case could not be resolved in legal source registry.
Real source is being used to support a stronger claim than it contains.
A citation can exist and still be used dishonestly. That is the hard failure mode: real sources laundering unsupported claims. Hallucinaite checks source support, not just source existence.
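One way to read "source support, not just source existence" is as a two-stage check. A minimal sketch, assuming a resolvable source registry and a pluggable claim-vs-evidence checker; none of these names come from the Hallucinaite pipeline:

```python
def source_exists(citation: str, registry: dict[str, str]) -> bool:
    """Stage 1: can the citation be resolved at all?"""
    return citation in registry


def source_supports(claim: str, source_text: str, entails) -> bool:
    """Stage 2: does the resolved source actually entail the claim?

    `entails` stands in for any claim-vs-evidence checker: an NLI
    model, an LLM judge, or a human annotator.
    """
    return entails(premise=source_text, hypothesis=claim)


def audit_citation(claim, citation, registry, entails) -> str:
    if not source_exists(citation, registry):
        return "unresolved"  # citation cannot be found anywhere
    if not source_supports(claim, registry[citation], entails):
        return "laundered"   # the hard failure mode: real source, unsupported claim
    return "supported"


def naive_entails(premise: str, hypothesis: str) -> bool:
    # Toy stand-in: a claim counts as supported only if it appears verbatim.
    return hypothesis in premise
```

The point of the split: an existence check alone passes the laundered case, because the citation resolves; only the support check catches the overreach.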
The benchmark spans 13 frontier models, 135 prompts, and 9 high-stakes domains.
Enterprises need risk language a GC, CRO, or CTO can use. A letter grade communicates risk better than an opaque benchmark score; one possible mapping is sketched below.
1,107 annotations from 7 annotators showed high inter-annotator reliability, supporting the evaluation taxonomy.
Legal research, healthcare documentation, financial analysis, and AI procurement all need independent evidence of model reliability.
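For illustration only, a credit-rating-style mapping from hallucination rate to grade. The bands are assumptions chosen to agree with the leaderboard above (rates of roughly 49–57% land at CC, 60% at C); they are not HalluciBench's actual cut-offs:

```python
# Illustrative grade bands only; the real HalluciBench boundaries are not
# published here. Chosen so the leaderboard above maps correctly.
GRADE_BANDS = [
    (10.0, "AAA"), (20.0, "AA"), (30.0, "A"), (40.0, "BBB"),
    (45.0, "BB"), (48.0, "B"), (58.0, "CC"), (70.0, "C"),
]


def grade(hallucination_rate_pct: float) -> str:
    for ceiling, letter in GRADE_BANDS:
        if hallucination_rate_pct < ceiling:
            return letter
    return "D"


assert grade(49.1) == "CC"  # Claude Sonnet 4.6, per the table above
assert grade(60.2) == "C"   # Claude Opus 4.6
```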
We are starting with public benchmarks and structured audits, then turning the same evaluation pipeline into enterprise API infrastructure.
An open reliability leaderboard that combines citation verification, a 4-axis rubric, an 8-type error taxonomy, and credit-rating-style model grades.
Board-ready reliability audits for organizations deploying AI into legal, medical, financial, and other high-stakes workflows.
A real-time evaluation endpoint for fabricated citations, overconfident claims, sycophancy, and broken reasoning before AI output reaches users.
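To make the endpoint concrete, a hypothetical call is sketched below; the URL, payload fields, and check names are placeholders, since no public API spec is published here:

```python
import requests

model_answer = "..."       # the AI output to gate before it reaches users
cited_documents = ["..."]  # the evidence that output cited

resp = requests.post(
    "https://api.example.com/v1/evaluate",  # placeholder URL, not a real endpoint
    json={
        "output": model_answer,
        "sources": cited_documents,
        "checks": [  # the failure modes named above
            "fabricated_citations",
            "overconfident_claims",
            "sycophancy",
            "broken_reasoning",
        ],
    },
    timeout=5,  # fail fast: this check sits in the serving path
)
risk = resp.json()  # a structured risk signal, like the audit record above
```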
Hallucinaite reports are designed for AI buyers, GCs, compliance teams, and technical leaders who need to understand where a model fails, how often it fails, and what risk that creates.
We are prioritizing AI labs, legal AI teams, healthcare AI teams, financial services, enterprise buyers, and investors who want to see the reliability layer before public launch.
Prefer email? alex@humansofai.xyz