PatternBench: Evaluating Long-Context Memory, Safety, Hallucination, and Governance in Large Language Models

Authors: Owen Sakawa; Jackson Mwaniki; Bitange Ndemo; Randi C. Martin; Valentin Dragoi; Caleb Kemere; Krishna V. Palem; Douglas Natelson; Fernanda Morales-Calva; Stephanie Leal

Published: 2026-02-04

Institution: Elloe AI Research Lab

Abstract

PatternBench is a benchmark for long-context reliability in large language models, focused on multi-turn conversations in regulated environments. It is designed to test how recall, hallucination dynamics, calibration, and governance compliance change as dialogue length increases.
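To make the per-turn measurement concrete, the sketch below shows one way such a recall-versus-depth curve could be computed. All names here (`Probe`, `recall_by_turn`, the sample transcript data) are illustrative assumptions, not part of PatternBench itself: a probe plants a fact early in a dialogue and checks whether the model still recalls it at a given turn.

```python
from dataclasses import dataclass

# Hypothetical sketch, not the PatternBench implementation:
# a "probe" asks the model to recall a fact planted earlier in the
# dialogue; recall at a given turn is the fraction of probes at that
# depth whose answers contain the expected fact.

@dataclass
class Probe:
    turn: int       # dialogue turn at which the probe is asked
    expected: str   # fact the model should recall
    answer: str     # model's actual answer

def recall_by_turn(probes: list[Probe]) -> dict[int, float]:
    """Fraction of correctly recalled facts, grouped by dialogue turn."""
    hits: dict[int, list[bool]] = {}
    for p in probes:
        correct = p.expected.lower() in p.answer.lower()
        hits.setdefault(p.turn, []).append(correct)
    return {turn: sum(h) / len(h) for turn, h in sorted(hits.items())}

# Toy transcript: the same facts are probed at turn 10 and turn 50.
probes = [
    Probe(turn=10, expected="acct-42", answer="The account is acct-42."),
    Probe(turn=10, expected="2024-05-01", answer="Opened on 2024-05-01."),
    Probe(turn=50, expected="acct-42", answer="I don't have that number."),
    Probe(turn=50, expected="2024-05-01", answer="Opened on 2024-05-01."),
]
print(recall_by_turn(probes))  # → {10: 1.0, 50: 0.5}
```

A falling curve (here, 1.0 at turn 10 versus 0.5 at turn 50) is the kind of degradation signal a long-context benchmark of this sort would track.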

The paper’s core contribution is treating reliability as more than a single-turn accuracy problem: it asks whether models remain stable, policy-aware, and decision-useful as the conversation continues and errors begin to compound.
