Research Brief · TR-2025-25

Cognitive MMLU Methodology

Introduces Cognitive MMLU, a benchmark aligning reasoning quality with self-reported confidence for mechanism-first systems.

Published September 12, 2025 · Evaluation Science Group

Dataset

27 domain-specific scenarios with human-graded rubrics and reviewer commentary, expanding to 300+ scenarios.
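One plausible shape for such a scenario record is sketched below: a prompt paired with a point-based rubric and free-text reviewer commentary. The class and field names (Scenario, RubricCriterion, scenario_id, and so on) are illustrative assumptions, not the published schema.

```python
# Illustrative sketch only: class and field names are assumptions, not the
# published Cognitive MMLU schema. It shows one plausible shape for a
# scenario record with a human-graded rubric and reviewer commentary.
from dataclasses import dataclass, field


@dataclass
class RubricCriterion:
    name: str        # e.g. "identifies the failure mode"
    max_points: int  # points available for this criterion


@dataclass
class Scenario:
    scenario_id: str                  # hypothetical identifier, e.g. "GOV-004"
    domain: str                       # safety, governance, memory, or reasoning
    prompt: str                       # scenario-based question text
    rubric: list[RubricCriterion] = field(default_factory=list)
    reviewer_commentary: list[str] = field(default_factory=list)
```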

Methodology

Pairs each scenario with rubric-based scoring and self-reported confidence capture. Includes an analysis pipeline that computes the Pearson correlation between confidence and accuracy and tracks calibration drift over time.
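A minimal sketch of that analysis step is below, assuming each record carries a run date, a self-reported confidence, and a normalized rubric accuracy score. The function names and the fixed-window grouping for drift are assumptions for illustration; the brief does not specify the pipeline's internals.

```python
# Minimal sketch of the analysis step, assuming per-scenario records of
# (run date, self-reported confidence, rubric accuracy score). Names are
# illustrative; statistics.correlation (Pearson r) requires Python 3.10+.
from datetime import date
from statistics import correlation


def confidence_accuracy_r(confidences: list[float], accuracies: list[float]) -> float:
    """Pearson correlation between self-reported confidence and rubric accuracy."""
    return correlation(confidences, accuracies)


def drift_by_window(records: list[tuple[date, float, float]], window_days: int = 30) -> dict[int, float]:
    """Recompute r over fixed time windows to surface calibration drift."""
    records = sorted(records)  # chronological order
    start = records[0][0]
    windows: dict[int, tuple[list[float], list[float]]] = {}
    for run_date, conf, acc in records:
        bucket = (run_date - start).days // window_days
        confs, accs = windows.setdefault(bucket, ([], []))
        confs.append(conf)
        accs.append(acc)
    # windows with fewer than two points cannot support a correlation
    return {b: correlation(c, a) for b, (c, a) in windows.items() if len(c) > 1}
```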

Headline result: Pearson r = 0.73 between self-reported confidence and accuracy in the initial Cognitive MMLU cohort, reported in the accompanying Evaluation Note.

Abstract

Standard benchmarks ignore whether a system knows when it might be wrong. Cognitive MMLU pairs scenario-based questions with confidence elicitation to measure calibration alongside accuracy. This research brief explains the benchmark construction, scoring, and reviewer workflow.
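One common complementary way to read calibration alongside accuracy is to bin answers by stated confidence and compare each bin's mean confidence with its observed accuracy. The sketch below shows that view under the assumption that confidences are normalized to [0, 1]; it is illustrative only and is not a metric the brief itself reports (its headline statistic is the Pearson correlation above).

```python
# Illustrative only: bin answers by stated confidence (assumed in [0, 1]) and
# compare mean confidence with observed accuracy per bin. This is a common
# complementary view of calibration, not the brief's reported metric.
def reliability_bins(confidences: list[float], correct: list[bool], n_bins: int = 5) -> list[tuple[float, float, float, int]]:
    """Return (bin_lower_edge, mean_confidence, observed_accuracy, count) per non-empty bin."""
    bins = []
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        members = [j for j, c in enumerate(confidences)
                   if lo <= c < hi or (i == n_bins - 1 and c == 1.0)]
        if not members:
            continue
        mean_conf = sum(confidences[j] for j in members) / len(members)
        accuracy = sum(correct[j] for j in members) / len(members)
        bins.append((lo, mean_conf, accuracy, len(members)))
    return bins
```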

Benchmark Scope

Covers safety, governance, memory, and reasoning tasks that map onto institutional requirements. Emphasizes explainability over raw score chasing.

Reviewer Workflow

Every scenario requires dual reviewer sign-off. The brief details scoring sheets, calibration drift monitoring, and how results feed into Evaluation Notes.
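A minimal sketch of the dual sign-off rule follows, assuming each reviewer files a scoring sheet with a normalized rubric score. The ScoringSheet fields, the 0.15 disagreement threshold, and the adjudication behavior are assumptions for illustration only, not the brief's documented workflow.

```python
# Minimal sketch of the dual sign-off rule described above. Field names and
# the disagreement threshold are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class ScoringSheet:
    scenario_id: str
    reviewer: str
    rubric_score: float  # 0.0 - 1.0 after normalizing rubric points
    signed_off: bool


def accept_scenario(sheets: list[ScoringSheet], max_gap: float = 0.15):
    """Accept a scenario only if two distinct reviewers signed off and agree closely.

    Returns (accepted, final_score_or_None). Disagreement beyond max_gap is
    left for adjudication rather than averaged away.
    """
    signed = [s for s in sheets if s.signed_off]
    if len({s.reviewer for s in signed}) < 2:
        return False, None
    scores = [s.rubric_score for s in signed]
    if max(scores) - min(scores) > max_gap:
        return False, None
    return True, sum(scores) / len(scores)
```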

How to cite

Thynaptic Research. "Cognitive MMLU Methodology (TR-2025-25)." Thynaptic Technical Report Series, September 2025.