What Do AI Models Know About Work They've Never Done?
Testing Frontier Models on Physical-World Expertise
Ask Claude to speak as a 30-year master stone mason, and it will. It'll tell you to cut granite joints 1/8" wider in November than in July. It'll explain how New England thermal cycles affect heritage restoration with surprising specificity. Whether any of it is right is a separate question.
We have pretty good evals for coding, math, and medical knowledge. We have almost nothing for the trades, crafts, and hands-on professions where knowledge lives in muscle memory and hard-won intuition—and where there's far less written material for models to learn from.
This project is building that eval. Forty professions. Sixteen frontier models. Let's see what they actually know.
Tacit Knowledge as a Benchmark
Tacit knowledge is what textbooks can't teach. The electrician who knows which Chicago building vintage means aluminum wiring behind the plaster. The chocolatier who can feel when Brussels humidity will ruin tomorrow's batch. The ferry captain who reads Norwegian fjord currents by watching seabirds.
This kind of expertise traditionally takes decades to acquire. It's rarely written down, which makes it an interesting test case: how well can language models perform on knowledge that's underrepresented in their training data?
The answer, it turns out, is better than you'd expect, with some fascinating gaps.
Our Approach
Specimens from the Archive
Some of these will check out. Some won't. Finding out which is the point.
An Underexplored Benchmark
Model evals tend to focus on domains with clear right answers: coding benchmarks, math olympiads, medical board exams. These matter, but they're also domains with extensive written material that models can learn from.
Physical-world professions are different. A master welder's intuition about heat distribution, a ferry engineer's feel for propeller cavitation, a stone mason's sense of seasonal expansion—this knowledge exists mostly in practitioners' heads, passed down through apprenticeship rather than textbooks.
Testing models here tells us something about their ability to synthesize sparse information into coherent expertise. It also produces a genuinely useful artifact: a structured dataset of what frontier AI "knows" about 40 professions, ready for validation.
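To make "structured dataset, ready for validation" a little more concrete, here is a minimal sketch of what one record might look like. The project's actual schema isn't described here, so every name below (Claim, Status, record_review, status_counts, the "model-a" labels) is hypothetical and chosen for illustration only. The example claims are taken from the text above and are deliberately left unverified.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Validation state of a single model-generated claim (hypothetical)."""
    UNVERIFIED = "unverified"
    CONFIRMED = "confirmed"
    REFUTED = "refuted"


@dataclass
class Claim:
    """One thing a model asserted while speaking as a practitioner."""
    profession: str
    model: str
    text: str
    status: Status = Status.UNVERIFIED


def record_review(claim: Claim, checks_out: bool) -> None:
    """Store a practitioner's verdict on a claim."""
    claim.status = Status.CONFIRMED if checks_out else Status.REFUTED


def status_counts(claims: list[Claim], profession: str) -> Counter:
    """Count claims per validation status for one profession."""
    return Counter(c.status for c in claims if c.profession == profession)


# Illustrative records only; nothing here has been validated.
claims = [
    Claim("stone mason", "model-a",
          "Cut granite joints 1/8 inch wider in November than in July."),
    Claim("electrician", "model-b",
          "Building vintage in Chicago predicts aluminum wiring behind the plaster."),
]

print(status_counts(claims, "stone mason"))
```

Keeping the validation status on each record, rather than in a separate table, is one simple choice among several. It makes "which claims checked out" a one-line query per profession or per model.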