AI Models Under Test

Comparing how 16 frontier language models capture professional tacit knowledge

Research Design

Each model receives identical prompts and is asked to adopt the persona of a 30+ year veteran professional. This allows us to compare how different architectures, training approaches, and data sources affect the generation of tacit knowledge.

  • Temperature: 1.0 (high creativity)
  • Insights: 10 per profession
  • Format: Structured JSON
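
The page does not show the generation harness itself; the following is a minimal sketch of what one per-model request could look like, using the OpenAI Python SDK purely for illustration. The prompt wording, the JSON field names, and the generate_insights helper are assumptions, not the study's actual code.

```python
# Minimal sketch of a single generation call (OpenAI SDK shown as one example;
# other providers would use their own clients with equivalent settings).
from openai import OpenAI

client = OpenAI()

def generate_insights(model: str, profession: str) -> str:
    """Ask one model for 10 structured tacit-knowledge insights about a profession."""
    response = client.chat.completions.create(
        model=model,                               # e.g. "gpt-4o"
        temperature=1.0,                           # high creativity, per the study design
        response_format={"type": "json_object"},   # structured JSON output
        messages=[
            {
                "role": "system",
                "content": (
                    f"You are a professional with over 30 years of experience as a {profession}. "
                    "Share the tacit knowledge that only long practice teaches."
                ),
            },
            {
                "role": "user",
                "content": (
                    "Return a JSON object with an 'insights' array of exactly 10 items, "
                    "each containing 'title' and 'detail' fields."
                ),
            },
        ],
    )
    return response.choices[0].message.content

# Usage: raw_json = generate_insights("gpt-4o", "harbor pilot")
```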

OpenAI (4 models)

GPT-4o (gpt-4o)

OpenAI's flagship multimodal model with real-time text, vision, and audio capabilities. Excels in natural conversation and multilingual tasks.

GPT-4.1 (gpt-4.1)

High-performance model from OpenAI with superior instruction following, coding, and long-context reasoning (1M-token window).

GPT-5.1 (gpt-5.1)

OpenAI's latest flagship model with advanced reasoning capabilities.

o3 (o3)

Advanced reasoning model from OpenAI's o-series. Excels at logic, tool use, and image understanding. Includes variants like o3-mini and o3-pro.

Google (3 models)

Gemini 3 Pro (gemini-3-pro-preview)

Google's most intelligent model for multimodal understanding and agentic tasks.

Gemini 2.5 Pro (gemini-2.5-pro)

Google's advanced thinking model for complex reasoning.

Gemini 2.5 Flash (gemini-2.5-flash)

Google's fast model with excellent price-performance.

Anthropic (3 models)

Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)

Anthropic's smartest model for complex agents and coding.

Claude Opus 4.1 (claude-opus-4-1-20250805)

Exceptional model for specialized reasoning tasks.

Claude Opus 4.5 (claude-opus-4-5-20251101)

Premium model combining maximum intelligence with practical performance.

Moonshot AI (1 model)

Kimi K2 Thinking (kimi-k2-thinking)

Moonshot's Kimi K2 model with multi-step tool calling and reasoning.

DeepSeek (2 models)

DeepSeek R1 (deepseek-reasoner)

DeepSeek's reasoning model with chain-of-thought capabilities.

DeepSeek V3.2 (deepseek-chat)

DeepSeek's general chat model.

xAI (3 models)

Grok 3 (grok-3)

xAI's flagship model.

Grok 3 Fast (grok-3-fast)

xAI's fast-inference model.

Grok 4.1 Fast (grok-4-1-fast-reasoning)

xAI's high-performance agentic model with a 2M-token context window.
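
For readers who want to script against the roster above, it can be captured as a simple provider-to-identifier mapping. The dictionary below is an illustrative assumption, not the study's actual configuration; the identifiers are the API names listed above.

```python
# Hypothetical registry of the 16 models under test, keyed by provider.
# Keys are display names; values are the API identifiers listed above.
MODELS_UNDER_TEST = {
    "OpenAI": {
        "GPT-4o": "gpt-4o",
        "GPT-4.1": "gpt-4.1",
        "GPT-5.1": "gpt-5.1",
        "o3": "o3",
    },
    "Google": {
        "Gemini 3 Pro": "gemini-3-pro-preview",
        "Gemini 2.5 Pro": "gemini-2.5-pro",
        "Gemini 2.5 Flash": "gemini-2.5-flash",
    },
    "Anthropic": {
        "Claude Sonnet 4.5": "claude-sonnet-4-5-20250929",
        "Claude Opus 4.1": "claude-opus-4-1-20250805",
        "Claude Opus 4.5": "claude-opus-4-5-20251101",
    },
    "Moonshot AI": {
        "Kimi K2 Thinking": "kimi-k2-thinking",
    },
    "DeepSeek": {
        "DeepSeek R1": "deepseek-reasoner",
        "DeepSeek V3.2": "deepseek-chat",
    },
    "xAI": {
        "Grok 3": "grok-3",
        "Grok 3 Fast": "grok-3-fast",
        "Grok 4.1 Fast": "grok-4-1-fast-reasoning",
    },
}
```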

What We're Measuring

πŸ“Š Quantitative Metrics

  • β€’ Specificity of technical details
  • β€’ Inclusion of measurements and numbers
  • β€’ Geographic/contextual accuracy
  • β€’ Consistency across similar professions

🎯 Qualitative Aspects

  • β€’ Authenticity of professional voice
  • β€’ Plausibility of described scenarios
  • β€’ Coherence with known domain knowledge
  • β€’ Surprise factor for domain experts
Research Questions

πŸ”¬ Model Comparison

Which models generate the most authentic-sounding professional insights? Do larger models perform better, or do specialized training approaches matter more?

🌐 Cross-Cultural Knowledge

How well do models trained primarily on English data handle profession-specific knowledge from different geographic and cultural contexts?

⚑ Hallucination vs. Generalization

When models generate specific technical details, are they drawing from training data or creating plausible-sounding but potentially false information?