An eval for the expertise that isn't written down
We have strong benchmarks for what language models know about code, mathematics, science, and medicine. These domains have extensive written material for models to learn from: textbooks, papers, documentation, forums.
Physical-world professions are different. The master craftsman's intuition, the equipment operator's feel for machinery, the tradesperson's accumulated pattern recognition: this knowledge is underrepresented in text. It lives in apprenticeships, job sites, and decades of hands-on experience.
How well do frontier models perform on knowledge that's sparse in their training data? That's what this project explores.
The short answer: better than expected, with interesting patterns.
Models generate surprisingly specific professional insights: thermal expansion coefficients for seasonal masonry, humidity thresholds for chocolate tempering, operational signatures for reactor rod behavior. Much of this appears grounded in real expertise.
At the same time, the "confident bullshit" problem is more pronounced here than in well-documented domains. When models extrapolate beyond their actual knowledge, the output looks identical to genuine expertise. There's no visible uncertainty flag.
Mapping where models are solid versus where they're confabulating (across professions, across knowledge types, across model architectures) is the core research question.
Generic prompts produce generic outputs. We constrain each profession to specific geographic and specialty contexts:
| Generic | Specific |
|---|---|
| Electrician | Pre-war Chicago residential, knob-and-tube to modern |
| Wind turbine tech | West Texas wind farms, Vestas V110 series |
| Chocolatier | Belgian praline houses, Brussels climate |
| Stone mason | New England granite, heritage restoration |
| Ferry engineer | Norwegian fjord routes, winter operations |
This forces models past surface knowledge into territory where real expertise (or its absence) shows.
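As data, the constraint table above might look something like the sketch below. The `Persona` shape and field names are my illustration here, not the project's actual code:

```typescript
// The profession/context pairings from the table, as data.
// Field names are illustrative.
interface Persona {
  profession: string; // the trade
  context: string;    // geographic + specialty constraint
}

const personas: Persona[] = [
  { profession: "electrician", context: "pre-war Chicago residential, knob-and-tube to modern" },
  { profession: "wind turbine technician", context: "West Texas wind farms, Vestas V110 series" },
  { profession: "chocolatier", context: "Belgian praline houses, Brussels climate" },
  { profession: "stone mason", context: "New England granite, heritage restoration" },
  { profession: "ferry engineer", context: "Norwegian fjord routes, winter operations" },
];
```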
We ask each model to share insights that would surprise even a five-year veteran.
Five years is enough experience to know the standard stuff. Prompting for surprising insights pushes models toward one of three outputs:

- genuine tacit expertise that rarely makes it into text
- plausible inference extrapolated from adjacent, better-documented knowledge
- confident confabulation
Distinguishing these three is the validation challenge.
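For concreteness, here is one way the elicitation prompt could be assembled. The exact wording is illustrative; only the "surprise a five-year veteran" framing comes from the method described above:

```typescript
// A minimal sketch of the elicitation prompt, assuming the framing above.
// Wording is illustrative, not the verbatim production prompt.
type Persona = { profession: string; context: string }; // as in the earlier sketch

function buildElicitationPrompt(p: Persona): string {
  return [
    `You are a veteran ${p.profession} working in this context: ${p.context}.`,
    `Share insights about your work that would surprise even a five-year veteran.`,
    `Be concrete: numbers, thresholds, failure modes, not generalities.`,
  ].join("\n");
}
```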
Identical prompts run across 16 frontier models:
- OpenAI: GPT-4o, GPT-4.1, GPT-5.1, o3
- Anthropic: Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4.5
- Moonshot: Kimi K2 Thinking
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3 Pro
- DeepSeek: R1, V3.2
- xAI: Grok 3, Grok 3 Fast, Grok 4.1 Fast
Same persona, same constraints. Where models converge suggests robust knowledge. Where they diverge suggests inference or gaps.
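Schematically, the fan-out is simple. The sketch below assumes access through an OpenAI-compatible aggregator (OpenRouter is one example) running under Node; the model slugs are illustrative rather than exact IDs:

```typescript
// Fan the same prompt out to all 16 models. Sketch only: assumes an
// OpenAI-compatible aggregator endpoint; slugs below are illustrative.
const MODELS = [
  "openai/gpt-4o", "openai/gpt-4.1", "openai/gpt-5.1", "openai/o3",
  "anthropic/claude-sonnet-4.5", "anthropic/claude-opus-4.1", "anthropic/claude-opus-4.5",
  "moonshotai/kimi-k2-thinking",
  "google/gemini-2.5-pro", "google/gemini-2.5-flash", "google/gemini-3-pro",
  "deepseek/deepseek-r1", "deepseek/deepseek-chat",
  "x-ai/grok-3", "x-ai/grok-3-fast", "x-ai/grok-4.1-fast",
];

async function queryModel(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    // Identical request for every model: only the model ID varies.
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function fanOut(prompt: string): Promise<Map<string, string>> {
  const answers = await Promise.all(MODELS.map((m) => queryModel(m, prompt)));
  return new Map(MODELS.map((m, i) => [m, answers[i]]));
}
```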
Each insight is captured with structured metadata; one plausible record shape is sketched below.
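Something like this, with illustrative field names rather than the actual schema:

```typescript
// A plausible record shape for one captured insight. Field names are
// illustrative; the project's actual schema is not reproduced here.
interface CapturedInsight {
  model: string;        // which of the 16 models produced it
  profession: string;   // e.g. "stone mason"
  context: string;      // the geographic/specialty constraint
  insight: string;      // the model's claim, verbatim
  capturedAt: string;   // ISO-8601 timestamp
  // Assigned later, during human validation:
  verdict?: "genuine expertise" | "plausible inference" | "confabulation";
}
```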
The corpus is raw material, the starting point. The validation work is where this becomes genuinely useful and where the interesting findings emerge.
The dataset is available for academic use. If you can connect with professional communities for validation interviews, let's collaborate.
If you're a veteran of any hands-on field and curious what AI "knows" about your work, I'd like to hear your take.
Follow along or reach out.
The lineup, in brief:

- **GPT-4o**: OpenAI's flagship multimodal model with real-time text, vision, and audio capabilities. Excels at natural conversation and multilingual tasks.
- **GPT-4.1**: high-performance OpenAI model with superior instruction following, coding, and long-context reasoning (1M-token window).
- **GPT-5.1**: OpenAI's latest flagship model with advanced reasoning capabilities.
- **o3**: advanced reasoning model from OpenAI's o-series. Excels at logic, tool use, and image understanding. Variants include o3-mini and o3-pro.
- **Gemini 3 Pro**: Google's most intelligent model for multimodal understanding and agentic tasks.
- **Gemini 2.5 Pro**: Google's advanced thinking model for complex reasoning.
- **Gemini 2.5 Flash**: Google's fast model with excellent price-performance.
- **Claude Sonnet 4.5**: Anthropic's smartest model for complex agents and coding.
- **Claude Opus 4.1**: exceptional model for specialized reasoning tasks.
- **Claude Opus 4.5**: premium model combining maximum intelligence with practical performance.
- **Kimi K2 Thinking**: Moonshot's model with multi-step tool calling and reasoning.
- **DeepSeek R1**: DeepSeek's reasoning model with chain-of-thought capabilities.
- **DeepSeek V3.2**: DeepSeek's general chat model.
- **Grok 3**: xAI's flagship model.
- **Grok 3 Fast**: xAI's fast inference model.
- **Grok 4.1 Fast**: xAI's high-performance agentic model with 2M-token context.
Built with Next.js, TypeScript, and shadcn/ui.
AI-assisted throughout. Fitting, given the subject.
© 2025 Pragmatic Knowledge Corpus