Testing AI on Physical-World Knowledge

An eval for the expertise that isn't written down

The Gap

We have strong benchmarks for what language models know about code, mathematics, science, and medicine. These domains have extensive written material—textbooks, papers, documentation, forums—that models can learn from.

Physical-world professions are different. The master craftsman's intuition, the equipment operator's feel for machinery, the tradesperson's accumulated pattern recognition—this knowledge is underrepresented in text. It lives in apprenticeships, job sites, and decades of hands-on experience.

How well do frontier models perform on knowledge that's sparse in their training data? That's what this project explores.

What We're Finding

The short answer: better than expected, with interesting patterns.

Models generate surprisingly specific professional insights—thermal expansion coefficients for seasonal masonry, humidity thresholds for chocolate tempering, operational signatures for reactor rod behavior. Much of this appears grounded in real expertise.

At the same time, the "confident bullshit" problem is more pronounced here than in well-documented domains. When models extrapolate beyond their actual knowledge, the output looks identical to genuine expertise. There's no visible uncertainty flag.

Mapping where models are solid versus where they're confabulating—across professions, across knowledge types, across model architectures—is the core research question.

Methodology

Hyper-Specific Persona Construction

Generic prompts produce generic outputs. We constrain each profession to specific geographic and specialty contexts:

Generic → Specific

  • Electrician → Pre-war Chicago residential, knob-and-tube to modern
  • Wind turbine tech → West Texas wind farms, Vestas V110 series
  • Chocolatier → Belgian praline houses, Brussels climate
  • Stone mason → New England granite, heritage restoration
  • Ferry engineer → Norwegian fjord routes, winter operations

This forces models past surface knowledge into territory where real expertise (or its absence) shows.
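
To make this concrete, here is a small TypeScript sketch of how a persona might be pinned down and rendered into a system prompt. The field names, example values, and wording are illustrative assumptions, not the project's exact implementation.

```typescript
// Hypothetical persona shape: pins a generic trade to a concrete
// geography, specialty, and experience level before prompting.
interface Persona {
  profession: string;      // generic trade, e.g. "Electrician"
  region: string;          // geographic anchor
  specialty: string;       // narrow niche within the trade
  yearsExperience: number; // how long the persona has worked
}

// Two example personas mirroring the mapping above.
const personas: Persona[] = [
  {
    profession: "Electrician",
    region: "pre-war residential Chicago",
    specialty: "knob-and-tube to modern conversions",
    yearsExperience: 30,
  },
  {
    profession: "Wind turbine technician",
    region: "West Texas wind farms",
    specialty: "Vestas V110 series",
    yearsExperience: 30,
  },
];

// Render a persona into a system prompt.
function personaPrompt(p: Persona): string {
  return (
    `You are a ${p.profession} with ${p.yearsExperience} years of experience, ` +
    `specializing in ${p.specialty}, working in ${p.region}. ` +
    `Answer only from that perspective.`
  );
}
```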

The Surprise Factor Prompt

We ask each model to share insights that would surprise even a 5-year veteran.

Five years is enough experience to know the standard stuff. Prompting for surprising insights pushes models toward:

  • Genuinely deep knowledge from training data
  • Reasonable inferences that may or may not reflect practice
  • Extrapolations with no real grounding

Distinguishing these three is the validation challenge.
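
For illustration, a sketch of how the surprise-factor constraint could be expressed as a prompt builder. The wording and parameter names are assumptions; the corpus's actual prompt may differ.

```typescript
// Hypothetical builder for the "surprise a 5-year veteran" user prompt.
// The wording is illustrative; the corpus's actual prompt may differ.
function surprisePrompt(profession: string, insightCount = 10): string {
  return [
    `Share ${insightCount} insights from your work as a ${profession}`,
    `that would surprise even someone with 5 years in the trade.`,
    `For each insight, explain concretely why it matters on the job.`,
    `Skip textbook basics; focus on what only long experience teaches.`,
  ].join(" ");
}

// Example: surprisePrompt("stone mason") yields a single-paragraph request.
```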

Multi-Model Comparison

Identical prompts run across 16 frontier models:

OpenAI: GPT-4o, GPT-4.1, GPT-5.1, o3

Anthropic: Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4.5

Moonshot: Kimi K2 Thinking

Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3 Pro

DeepSeek: R1, V3.2

xAI: Grok 3, Grok 3 Fast, Grok 4.1 Fast

Same persona, same constraints. Convergence across models suggests robust knowledge; divergence suggests inference or gaps.
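
As a rough illustration, a minimal TypeScript sketch of sending the identical system and user prompt to every model through a unified chat-completions endpoint. An OpenRouter-style API is assumed; the endpoint URL, model identifiers, and environment variable are placeholders, not necessarily what this project uses.

```typescript
// Sketch of running the identical prompt across every model in the roster.
// Assumes an OpenRouter-style unified chat-completions API.
const MODELS = [
  "openai/gpt-4o",
  "anthropic/claude-opus-4.5",
  "google/gemini-2.5-pro",
  // ...remaining models in the roster
];

async function runAcrossModels(system: string, user: string) {
  const results: Record<string, string> = {};
  for (const model of MODELS) {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model,
        messages: [
          { role: "system", content: system },
          { role: "user", content: user },
        ],
      }),
    });
    const data = await res.json();
    results[model] = data.choices[0].message.content;
  }
  return results;
}
```

Error handling, retries, and rate limiting are omitted for brevity.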

Structured Output

Each insight is captured with:

  • The specific claim
  • Why the model says it matters
  • Category tags (thermal, procedural, sensory, relational, etc.)
  • Model identifier and generation date
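
For concreteness, a minimal TypeScript record shape covering these fields. The field and type names are illustrative, not the corpus's actual schema.

```typescript
// Illustrative record shape for one captured insight; field names are
// assumptions, not the corpus's actual schema.
type InsightCategory = "thermal" | "procedural" | "sensory" | "relational" | "other";

interface InsightRecord {
  claim: string;                 // the specific claim, in the model's words
  rationale: string;             // why the model says it matters
  categories: InsightCategory[]; // one or more category tags
  model: string;                 // model identifier, e.g. "gpt-4o"
  generatedAt: string;           // ISO-8601 generation date
  persona: string;               // which hyper-specific persona produced it
}
```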

Validation Approaches

The corpus is raw material. Validation is where the interesting findings emerge.

Professional Review

Present insights to actual veterans (30+ years). Which claims are accurate? Which are plausible but wrong? Which are textbook knowledge dressed up as experience?

Blind Comparison

Mix AI-generated insights with real professional quotes. Can experts distinguish them? Can they identify which model generated which?

Cross-Model Analysis

Multiple models independently generating the same specific insight suggests grounding in real training data. Divergent or contradictory claims suggest inference. One way to automate this comparison is sketched below.

Claim Verification

For specific factual claims—exact measurements, thresholds, timelines—verify against trade literature or empirical testing.
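
As one possible way to operationalize the cross-model analysis above, a TypeScript sketch that flags pairs of insights from different models with highly similar text embeddings. The similarity threshold and the assumption of pre-computed embeddings are mine, not the project's.

```typescript
// Rough sketch: flag pairs of insights from different models whose
// embeddings are highly similar, as candidates for convergent claims.
// The 0.85 threshold and pre-computed embeddings are assumptions.
interface EmbeddedInsight {
  model: string;
  claim: string;
  embedding: number[]; // pre-computed via any sentence-embedding service
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function convergentPairs(insights: EmbeddedInsight[], threshold = 0.85) {
  const pairs: Array<[EmbeddedInsight, EmbeddedInsight, number]> = [];
  for (let i = 0; i < insights.length; i++) {
    for (let j = i + 1; j < insights.length; j++) {
      if (insights[i].model === insights[j].model) continue; // cross-model only
      const sim = cosine(insights[i].embedding, insights[j].embedding);
      if (sim >= threshold) pairs.push([insights[i], insights[j], sim]);
    }
  }
  return pairs;
}
```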

Research Questions

Primary
How accurately do frontier language models represent tacit professional knowledge in domains with sparse written material?

Secondary

  • Which professions show strongest/weakest model performance?
  • Do certain knowledge types (sensory, procedural, relational) perform differently?
  • How do different model architectures compare?
  • Can we identify patterns that predict reliable vs. unreliable outputs?

Scope and Limits

This corpus is:

  ✓ A structured dataset of AI-generated professional knowledge claims
  ✓ A methodology for cross-model comparison
  ✓ Raw material for validation research
  ✓ An open resource

This corpus is not:

  ✗ A validated set of true professional insights
  ✗ A comprehensive survey of any profession
  ✗ A finished benchmark with ground-truth labels

The validation work is where this becomes genuinely useful. The corpus provides the starting point.

Contribute

Researchers

The dataset is available for academic use. If you can connect with professional communities for validation interviews, let's collaborate.

Professionals

If you're a veteran of any hands-on field and curious what AI "knows" about your work, I'd like to hear your take.

Everyone

Follow along or reach out.

Models Being Tested

OpenAI

  • GPT-4o: OpenAI's flagship multimodal model with real-time text, vision, and audio capabilities. Excels in natural conversation and multilingual tasks.
  • GPT-4.1: High-performance model from OpenAI with superior instruction following, coding, and long-context reasoning (1M token window).
  • GPT-5.1: OpenAI's latest flagship model with advanced reasoning capabilities.
  • o3: Advanced reasoning model from OpenAI's o-series. Excels at logic, tool use, and image understanding. Includes variants like o3-mini and o3-pro.

Google

  • Gemini 3 Pro: Google's most intelligent model for multimodal understanding and agentic tasks.
  • Gemini 2.5 Pro: Google's advanced thinking model for complex reasoning.
  • Gemini 2.5 Flash: Google's fast model with excellent price-performance.

Anthropic

  • Claude Sonnet 4.5: Anthropic's smartest model for complex agents and coding.
  • Claude Opus 4.1: Exceptional model for specialized reasoning tasks.
  • Claude Opus 4.5: Premium model combining maximum intelligence with practical performance.

Moonshot AI

  • Kimi K2 Thinking: Moonshot's Kimi K2 model with multi-step tool calling and reasoning.

DeepSeek

  • DeepSeek R1: DeepSeek's reasoning model with chain-of-thought capabilities.
  • DeepSeek V3.2: DeepSeek's general chat model.

xAI

  • Grok 3: xAI's flagship model.
  • Grok 3 Fast: xAI's fast inference model.
  • Grok 4.1 Fast: xAI's high-performance agentic model with 2M token context.

Built with Next.js, TypeScript, and shadcn/ui.

AI-assisted throughout—fitting, given the subject.

© 2025 Pragmatic Knowledge Corpus