Research Brief · TR-2025-25
Cognitive MMLU Methodology
Introduces Cognitive MMLU, a benchmark aligning reasoning quality with self-reported confidence for mechanism-first systems.
Dataset
27 domain-specific scenarios with human-graded rubrics and reviewer commentary, expanding to 300+ scenarios.
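For concreteness, a scenario record might look like the following Python sketch. The field names (scenario_id, domain, rubric, reviewer_commentary) are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    description: str   # behavior the human grader is scoring
    max_points: int    # points available for this criterion

@dataclass
class Scenario:
    scenario_id: str                        # illustrative identifier, e.g. "safety-007"
    domain: str                             # safety, governance, memory, or reasoning
    prompt: str                             # scenario text presented to the system
    rubric: list[RubricCriterion] = field(default_factory=list)
    reviewer_commentary: str = ""           # free-text notes from the human graders
```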
Methodology
Pairs each scenario with rubric-based scoring and self-reported confidence capture. Includes an analysis pipeline that tracks the Pearson correlation between confidence and rubric scores, along with calibration drift over time.
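A minimal sketch of the correlation step, assuming self-reported confidences and normalized rubric scores both sit on a 0-to-1 scale; the function names and sample values are illustrative, not output from the actual pipeline.

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom if denom else 0.0  # treat zero-variance input as uncorrelated

def drift(rounds: list[tuple[list[float], list[float]]]) -> list[float]:
    """Correlation per evaluation round; a falling trend signals calibration drift."""
    return [pearson(conf, score) for conf, score in rounds]

# Illustrative values only, not benchmark results.
confidences = [0.9, 0.6, 0.8, 0.4, 0.7]
scores = [0.85, 0.55, 0.90, 0.30, 0.60]
print(f"confidence-score correlation: {pearson(confidences, scores):.3f}")
```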
Abstract
Standard benchmarks ignore whether a system knows when it might be wrong. Cognitive MMLU pairs scenario-based questions with confidence elicitation to measure calibration alongside accuracy. This research brief explains the benchmark construction, scoring, and reviewer workflow.
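The brief does not name a specific calibration metric, but expected calibration error (ECE) is one common way to score stated confidence against accuracy; the sketch below is a generic implementation, not the benchmark's scoring rule.

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Bin predictions by stated confidence, then average the per-bin gap
    between mean confidence and empirical accuracy, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated system yields an ECE near zero: its stated confidence matches how often it is actually right.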
Benchmark Scope
Covers safety, governance, memory, and reasoning tasks that map to institutional requirements. Emphasizes explainability over raw score chasing.
Reviewer Workflow
Every scenario requires dual reviewer sign-off. The brief details scoring sheets, calibration drift monitoring, and how results feed into Evaluation Notes.
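The dual sign-off rule can be expressed as a simple invariant on the scoring sheet; the record fields here are hypothetical stand-ins for whatever the actual sheets capture.

```python
from dataclasses import dataclass

@dataclass
class ReviewerScore:
    reviewer_id: str     # hypothetical field names throughout
    rubric_points: int   # points awarded against the scenario rubric
    signed_off: bool     # reviewer attests the sheet is final

def fully_reviewed(reviews: list[ReviewerScore]) -> bool:
    """A scenario counts as scored only once two distinct reviewers sign off."""
    signed = {r.reviewer_id for r in reviews if r.signed_off}
    return len(signed) >= 2
```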
How to cite
Thynaptic Research. "Cognitive MMLU Methodology (TR-2025-25)." Thynaptic Technical Report Series, September 2025.