FILE 01 · EPISTEMIC FOUNDATIONS

How to think under uncertainty with adversaries present

The methodology spine of the Ground Truth project. Heuer’s analytic tradecraft, the ODNI probability lexicon, Tetlock’s calibration research, the replication crisis, base-rate reasoning, and the operating system every other file assumes you have read.

00 · PREAMBLE

Why this file exists

This is the spine of the Ground Truth project. Every other file assumes you have read and internalized what follows. The orienting principle is symmetric skepticism with calibrated commitment: distrust mainstream-sanitized narratives and contrarian “they-don’t-want-you-to-know” narratives with equal vigor.

The institutional press is wrong often enough (Iraq WMD, the early COVID lab-leak dismissal, the opioid epidemic’s pharma-funded soft-pedal, the 2008 ratings-agency assurances) that its claims cannot be treated as ground truth. But the contrarian ecosystem (manosphere, anti-vaccine subcultures, crypto maximalism, the “elites are lizard people” cosmology) is wrong at least as often, with the additional pathology that being wrong is a marker of in-group loyalty.

The discipline is to apply the same scoring rubric to both sides. Anything else is partisan epistemics dressed up as skepticism.

01 · TWO SCAFFOLDS

Evidence tiers and truth categories

Evidence tiers

T1 — empirical, peer-reviewed, ideally meta-analytic; preregistered designs; large-N; well-replicated.

T2 — practitioner consensus or large-N industry data; insurance actuarial tables, audited trade datasets, government statistical series.

T3 — insider, anecdotal, single-expert testimony; valuable but not load-bearing on its own.

T4 — contested, speculative, ideologically motivated; advocacy organizations, partisan think tanks, single-source allegations.

Truth categories for contested topics

Category 1 — genuinely suppressed truths with empirical support.

Category 2 — uncomfortable truths well-documented but socially taboo.

Category 3 — contrarian takes popular in subcultures but empirically wrong.

Category 4 — conspiracy theories with no evidence base.

The most consequential category error in current discourse is treating Cat 1 facts (real) as if they were Cat 3 or Cat 4, and treating Cat 3/Cat 4 claims as if they had Cat 1 evidentiary status.

02 · HEUER FOUNDATION

Analysis of Competing Hypotheses

Richards J. Heuer Jr. spent forty-five years inside the Central Intelligence Agency. In 1999 the CIA’s Center for the Study of Intelligence published his collected essays as Psychology of Intelligence Analysis, still available free at cia.gov and still the single best primer ever produced on how to think under uncertainty with adversaries present.

The cognitive biases that produce confident wrong answers

Confirmation bias. Searching asymmetrically for evidence that supports the hypothesis already in mind.

Anchoring. Initial estimates contaminate all subsequent revision.

Availability heuristic. Vivid or recent events feel more probable than they statistically are.

Vividness criterion. Concrete anecdotes dominate abstract base rates.

Absence of evidence. Analysts systematically underweight what they have not seen.

Persistence of impressions based on discredited evidence. Once a frame is in mind, demolishing the supporting facts does not demolish the frame.

Hindsight bias. Once the outcome is known, it looks more predictable than it was.

Mirror-imaging. Assuming the adversary reasons the way you do.

The 8-step ACH procedure

Heuer’s deep insight is that human cognition is satisficing and confirmatory: we generate a leading hypothesis, then collect evidence that fits it, then declare victory. ACH inverts the process. You do not select the hypothesis with the most consistent evidence; you eliminate the hypotheses with the most inconsistent evidence and commit to whichever remains least falsified.

  1. Identify the possible hypotheses. Ideally ~7. Include “denial and deception is occurring.”
  2. List significant evidence and arguments for and against each. Include absence of expected indicators.
  3. Prepare a matrix. Rows = evidence; columns = hypotheses. This is the methodological trick: it forces simultaneous evaluation rather than sequential anchoring.
  4. Refine the matrix. Delete evidence with no diagnostic value.
  5. Score each cell: C (consistent), I (inconsistent), or N (not applicable).
  6. Draw tentative conclusions by counting Is, not Cs. The hypothesis with the fewest inconsistencies is the strongest.
  7. Analyze sensitivity. How dependent is the conclusion on a few critical items?
  8. Report the relative likelihood of all hypotheses. Identify future observables that would change the analysis.

The deep payoff: ACH makes deception detectable. When an adversary has planted false evidence, ACH highlights it as an outlier inconsistent with the hypothesis the deception is meant to obscure.
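The scoring steps of the procedure (refine, count inconsistencies, test sensitivity) can be sketched in a few lines of Python. This is a minimal illustration, not Heuer's own tooling: the hypotheses, evidence items, and cell scores are invented, and treating a row as non-diagnostic only when every cell is identical is an assumption.

```python
# Hypothetical example: three hypotheses and four evidence items.
# Cell codes follow the text: "C" consistent, "I" inconsistent, "N" not applicable.
HYPOTHESES = ["H1", "H2", "H3"]
MATRIX = {
    "E1": ["C", "I", "C"],
    "E2": ["C", "C", "C"],  # same score in every column: no diagnostic value
    "E3": ["I", "I", "C"],
    "E4": ["C", "I", "N"],
}

def diagnostic_rows(matrix):
    """Step 4: drop evidence that scores identically against every hypothesis."""
    return {name: row for name, row in matrix.items() if len(set(row)) > 1}

def inconsistency_counts(matrix, hypotheses):
    """Step 6: count the I cells in each hypothesis column."""
    counts = {h: 0 for h in hypotheses}
    for row in matrix.values():
        for h, cell in zip(hypotheses, row):
            if cell == "I":
                counts[h] += 1
    return counts

def least_falsified(matrix, hypotheses):
    """Return the hypotheses with the fewest inconsistencies, plus all counts."""
    counts = inconsistency_counts(diagnostic_rows(matrix), hypotheses)
    fewest = min(counts.values())
    return [h for h in hypotheses if counts[h] == fewest], counts

def critical_evidence(matrix, hypotheses):
    """Step 7: evidence items whose removal changes the leading hypotheses."""
    baseline, _ = least_falsified(matrix, hypotheses)
    critical = []
    for name in matrix:
        reduced = {k: v for k, v in matrix.items() if k != name}
        leaders, _ = least_falsified(reduced, hypotheses)
        if leaders != baseline:
            critical.append(name)
    return critical

print(least_falsified(MATRIX, HYPOTHESES))
print(critical_evidence(MATRIX, HYPOTHESES))
```

In this toy matrix H3 has the fewest inconsistencies, but the result hinges on a single item (E3), which is the kind of dependence the sensitivity step is meant to expose.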

03 · ODNI STANDARDS

The ICD 203 probability lexicon

After Iraq WMD, the U.S. Intelligence Community institutionalized analytic standards. ICD 203 governs every finished IC analytic product. Its single most operationally valuable deliverable is the standardized probability lexicon. Analysts must pick one row and stick to it.

TERM                  ALTERNATE            RANGE
almost no chance      remote               01–05%
very unlikely         highly improbable    05–20%
unlikely              improbable           20–45%
roughly even chance   roughly even odds    45–55%
likely                probable             55–80%
very likely           highly probable      80–95%
almost certain(ly)    nearly certain       95–99%

Two rules to memorize. Do not mix terms across rows in a single product without an explicit disclaimer. Never combine a confidence level (“high confidence”) with a probability term (“likely”) in the same sentence: they are orthogonal. Confidence is about the quality of the sourcing and reasoning behind a judgment. Likelihood is about the probability of the event.

Confidence is graded low / moderate / high based on: source quality (multiple independent corroborating sources vs. single fragmentary source), source access (direct vs. hearsay), whether the underlying logic is well-established or speculative, and whether prior judgments have proven reliable in this domain. Because the axes are independent, an analyst can hold high confidence in a judgment that an event is unlikely, but the confidence statement and the likelihood term belong in separate sentences; “highly likely with high confidence” conflates the two axes.
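As a rough illustration, the lexicon can be encoded as a lookup from a numeric probability to its row. The sketch below is not an official ODNI tool. The function name `lexicon_term` is hypothetical, and because adjacent rows share a boundary value (20% appears in two rows, for example), the tie-break toward the lower row is an assumption.

```python
# Upper bound (in percent) of each lexicon row, with its term and alternate.
# Assumption: a value sitting exactly on a shared boundary goes to the lower row;
# the lexicon itself does not say how to break such ties.
LEXICON = [
    (5, "almost no chance", "remote"),
    (20, "very unlikely", "highly improbable"),
    (45, "unlikely", "improbable"),
    (55, "roughly even chance", "roughly even odds"),
    (80, "likely", "probable"),
    (95, "very likely", "highly probable"),
    (99, "almost certain", "nearly certain"),
]

def lexicon_term(percent, alternate=False):
    """Map a probability in percent (1 to 99) to its lexicon term."""
    if not 1 <= percent <= 99:
        raise ValueError("the lexicon covers 1% to 99% only")
    for upper, term, alt in LEXICON:
        if percent <= upper:
            return alt if alternate else term

print(lexicon_term(70))                   # likely
print(lexicon_term(50, alternate=True))   # roughly even odds
```

Keeping the two term columns behind one flag makes it harder to mix rows or columns by accident within a single product.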

04 · SUPERFORECASTING

Tetlock and the Good Judgment Project

The Intelligence Advanced Research Projects Activity (IARPA) ran the Aggregative Contingent Estimation tournament from 2011 to 2015. All forecasts were scored against ground truth using Brier scores. Philip Tetlock and Barbara Mellers of the University of Pennsylvania ran the Good Judgment Project (GJP), recruiting thousands of volunteers (librarians, retired engineers, programmers) and giving them a 45-minute training in probabilistic reasoning.

The outcomes were not subtle. GJP beat every other research team by 35–72% by the end of year two, after which IARPA dropped the other teams and kept funding GJP alone. The top ~2% of forecasters were labeled “superforecasters.” These amateurs reportedly outperformed intelligence-community analysts, who had access to classified data, by roughly 30% on the unclassified questions GJP was scored against.

What superforecasters actually do

Reference class forecasting (the “outside view”). Before zooming into details: of cases like this one, how often does the outcome occur? Anchor on the base rate, then adjust.

Decompose the problem. Break a hard question into estimable sub-questions.

Small-step Bayesian updating. Update beliefs in small increments. Avoid both under-reaction and over-reaction. Superforecasters update more frequently and in smaller increments than ordinary forecasters.

Granular probabilities. Distinguish 63% from 65% from 68%. Rounding to the nearest 5% measurably worsens Brier scores.

Active open-mindedness. Treat beliefs as hypotheses to test, not identity to defend. Tetlock reports that the closely related trait he calls “perpetual beta,” a commitment to belief updating and self-improvement, is the strongest single predictor of forecasting accuracy, roughly three times as predictive as raw intelligence.

Foxes beat hedgehogs. Eclectic thinkers who hold many small models loosely substantially outperform those who view the world through one dominant theoretical lens. Hedgehogs are better on television. Foxes have better Brier scores.
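The first and third habits (anchor on a base rate, then update in small steps) can be made concrete with Bayes' rule in odds form. The sketch below is illustrative only: the base rate and the likelihood ratios are invented numbers, and nothing here implies that forecasters compute explicit ratios rather than approximating them by judgment.

```python
def update(probability, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = probability / (1.0 - probability)
    posterior_odds = odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

def forecast(base_rate, likelihood_ratios):
    """Start from the reference-class base rate, then apply each piece of evidence."""
    p = base_rate
    path = [p]
    for lr in likelihood_ratios:
        p = update(p, lr)
        path.append(p)
    return path

# Hypothetical inputs: a 20% base rate, then three modest pieces of evidence.
# A ratio above 1 favors the event; a ratio below 1 counts against it.
for step, p in enumerate(forecast(0.20, [1.5, 0.8, 1.25])):
    print(step, round(p, 3))
```

Likelihood ratios close to 1 move the estimate only a little, which is what small-step updating looks like in practice; a ratio of exactly 1 leaves the forecast unchanged.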

05 · SCOUT MINDSET

Julia Galef’s motivated-reasoning checks

Soldier mindset: reasoning as defensive combat. Beliefs are fortifications to defend. The internal question is “Can I believe this?” for things you want to believe, and “Must I believe this?” for things you don’t.

Scout mindset: reasoning as mapmaking. The internal question is “Is it true?” regardless of whether the answer is convenient.

The thought-experiment toolkit

Double standard test. Would I judge this behavior the same way if my own side did it?

Outsider test. If I were a third party with no stake, what would I conclude?

Selective skeptic test. If this evidence supported the opposite conclusion, would I find it as compelling?

Status quo bias test. If the current arrangement weren’t the default, would I choose to adopt it?

Conformity test. If everyone I respect disagreed with me on this, would I still believe it?

Ideological Turing Test. Can I state the strongest version of my opponent’s position so well that a neutral observer cannot tell which side I actually hold? Inability to pass this test is near-decisive evidence of soldier-mindset capture.

06 · REPLICATION CRISIS

Why a single peer-reviewed study is weak evidence

If you trust any single peer-reviewed result in social science you have not independently checked, you are doing it wrong. The replication crisis is the most important methodological development of the past two decades.

Ioannidis (2005)

John Ioannidis’s “Why Most Published Research Findings Are False” (PLoS Medicine, 2005) modeled how, under realistic assumptions about prior probability, study power, researcher flexibility, and publication bias, the positive predictive value of a typical published finding can fall below 50%.
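The core of that model, before any bias term is added, is a one-line formula: PPV = (1 − β)R / (R − βR + α), where R is the pre-study odds that a tested relationship is true, 1 − β is power, and α is the significance threshold. The sketch below implements only this simplest case; the input values are illustrative, not estimates for any particular field.

```python
def ppv(prior_odds, power, alpha=0.05):
    """Positive predictive value of a significant finding, ignoring bias.

    prior_odds: R, the pre-study odds that a tested relationship is true.
    power:      1 - beta, the chance that a true effect reaches significance.
    alpha:      the false-positive rate of the test.
    """
    true_positives = power * prior_odds   # proportional to (1 - beta) * R
    false_positives = alpha               # proportional to alpha * 1
    return true_positives / (true_positives + false_positives)

# Illustrative inputs: 1-in-5 hypotheses true (odds 0.25), varying power.
print(ppv(0.25, power=0.8))
print(ppv(0.25, power=0.2))
print(ppv(0.10, power=0.2))
```

With well-powered studies and decent priors the PPV stays high; lowering the power or the prior odds pulls it to or below one half, even before researcher flexibility and publication bias are considered.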

The empirical results that followed

Open Science Collaboration (Science, 2015): 270 collaborators attempted to replicate 100 studies from three top-tier psychology journals. Of the originals, 97% reported statistically significant effects. Of the replications, only ~36% did. Mean effect sizes in replications were about half the magnitude of originals.

Camerer et al. (Nature Human Behaviour, 2018): 21 high-profile experimental social-science findings in Nature and Science, replicated with sample sizes ~5× the originals. Result: 13 of 21 (62%) replicated; effect sizes averaged ~50% of originals.

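Reproducibility Project: Cancer Biology (eLife, 2017–2021): set out to repeat 193 experiments from 53 high-profile preclinical cancer papers but could complete only 50 experiments from 23 of them. About 46% of effects replicated on more criteria than they failed, and median effect sizes were 85% smaller than the originals.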

Operational consequence

For any single peer-reviewed finding in social science, your prior probability that it will replicate cleanly should sit around 40–65%, depending on field and study quality. The implication is not nihilism. Single studies are weak evidence; meta-analyses of preregistered well-powered studies are strong evidence; the apparatus of citation, press release, and TED Talk is essentially uncorrelated with truth.

07 · RESEARCHER DEGREES OF FREEDOM

p-hacking, HARKing, the garden of forking paths

Simmons, Nelson & Simonsohn (Psychological Science, 2011) showed that by exploiting standard “researcher degrees of freedom” (choosing when to stop data collection, which covariates to include, which dependent measures to report, how to handle outliers) the false-positive rate of a p < .05 test can be inflated from the nominal 5% to as high as ~60%. They demonstrated this with an actual experiment claiming, with p < .05, that listening to The Beatles’ “When I’m Sixty-Four” made participants chronologically younger.
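One of those degrees of freedom, choosing when to stop collecting data, is easy to simulate. The sketch below is not a reproduction of the paper's analysis. It assumes a true effect of zero, a z-test with known unit variance, and a researcher who checks for significance every 10 observations up to 50; all of those settings are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def is_significant(sample, z_crit):
    """Two-sided z-test of mean == 0, assuming known unit variance."""
    z = fmean(sample) * len(sample) ** 0.5
    return abs(z) > z_crit

def simulate(n_studies=2000, max_n=50, look_every=10, alpha=0.05, seed=0):
    """Simulate studies in which the null hypothesis is true.

    Returns (fixed_rate, peeking_rate): the share of studies declared
    significant when testing once at max_n versus testing at every look
    and stopping at the first significant result.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_hits = peeking_hits = 0
    for _ in range(n_studies):
        data = [rng.gauss(0.0, 1.0) for _ in range(max_n)]
        if is_significant(data, z_crit):
            fixed_hits += 1
        looks = range(look_every, max_n + 1, look_every)
        if any(is_significant(data[:n], z_crit) for n in looks):
            peeking_hits += 1
    return fixed_hits / n_studies, peeking_hits / n_studies

fixed, peeking = simulate()
print(f"single test at n=50: {fixed:.3f}")
print(f"peeking every 10:    {peeking:.3f}")
```

Because the final look uses the same data as the single test, the peeking rate can never be lower than the fixed rate; the gap between the two is the inflation attributable to optional stopping alone.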

HARKing (Hypothesizing After the Results are Known) is the practice of running an exploratory analysis, finding a pattern, then writing the paper as if that pattern had been the a priori hypothesis. HARKing collapses the distinction between exploratory and confirmatory research while presenting results with the false rigor of the latter.

Andrew Gelman and Eric Loken’s garden of forking paths: researchers need not consciously p-hack to inflate false positives. Given any complex dataset, there are dozens of legitimate-seeming analytic choices. Even a researcher with pure intent who makes those choices contingent on what the data look like will produce inflated false-positive rates. The defense is preregistration: committing the analysis plan publicly before seeing the data.

Operational rule

When you read a social-science finding, ask: (a) Was it preregistered? (b) Was there a replication? (c) What was the sample size? (d) Is the effect size implausibly large relative to the field’s typical effects? (e) Did the press coverage frame an exploratory finding as confirmatory? If (a) is no and (b) does not exist, treat the result as T3 dressed up as T1.

08 · BASE RATES

The Linda problem and Bayesian updating

The default human cognitive failure is to ignore prior probabilities and reason purely from case-specific features. Kahneman and Tversky’s research program from the 1970s onward shows this is robust, near-universal, and trainable but not eliminable.

The Linda problem

Subjects are told: “Linda is 31, single, outspoken, very bright. She majored in philosophy. As a student she was deeply concerned with discrimination and social justice.” Which is more probable: (a) Linda is a bank teller, or (b) Linda is a bank teller and is active in the feminist movement?

A majority (85% in the original study) pick (b). That ranking is mathematically impossible: a conjunction can never be more probable than either of its parts, P(A∧B) ≤ P(A). People are pattern-matching narrative coherence to probability.

The disease-testing worked example

A disease has population base rate 1%. The test is 99% sensitive, 99% specific. A patient tests positive. What is the probability they have the disease?

Most clinicians say 99% or 95%. The correct answer is ~50%.

Out of 10,000 people:
  100 have the disease (1%). 99 test positive (true positives).
  9,900 don’t. 1% of them = 99 test positive (false positives).
  Total positives: 99 + 99 = 198.
  True positives / all positives: 99/198 = 50%.

When the base rate is low and the test is imperfect, false positives can swamp true positives. The implication generalizes: any positive screening result for a rare condition should be interpreted with profound caution.
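The same arithmetic can be written as a small function so that other base rates are easy to try. This is a direct application of Bayes' rule to the worked example; the function and parameter names are illustrative.

```python
def posterior_given_positive(base_rate, sensitivity, specificity):
    """Probability of disease given a positive test, by Bayes' rule."""
    true_pos = base_rate * sensitivity                # sick and flagged
    false_pos = (1 - base_rate) * (1 - specificity)   # healthy but flagged
    return true_pos / (true_pos + false_pos)

# The worked example: 1% base rate, 99% sensitive, 99% specific (about 0.5).
print(posterior_given_positive(0.01, 0.99, 0.99))

# Same test, but a 10% base rate: the positive result is far more informative.
print(posterior_given_positive(0.10, 0.99, 0.99))
```

Holding the test fixed and varying only the base rate shows how much of the answer comes from the prior rather than from the test itself.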

09 · SKILL VS LUCK

Mauboussin and the paradox of skill

Michael Mauboussin’s The Success Equation: the skill-luck continuum runs from pure skill (chess) to pure luck (roulette). A diagnostic test for where an activity sits: how slowly does performance revert to the mean? Pure skill: no reversion. Pure luck: maximal reversion. Mauboussin’s data on mutual funds (top-quartile funds in the 2000s underperformed by ~7.8% in the subsequent decade while bottom-quartile outperformed by ~7.8%) places active investing far closer to the luck end than its fee structures suggest.

The paradox of skill: as the absolute skill of all competitors rises and the variance compresses, luck plays a larger role in determining who wins. When everyone is excellent, the marginal differences are small enough that random variation dominates outcomes.
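A simple way to see the paradox is to assume that an observed outcome is skill plus independent luck, so that the variances add. Under that assumption (a modeling choice for illustration, with invented numbers) the share of outcome variance explained by skill shrinks as competitors' skill levels converge, even if everyone is getting better in absolute terms.

```python
def skill_share(skill_sd, luck_sd):
    """Share of outcome variance due to skill when outcome = skill + luck.

    Assumes skill and luck are independent, so their variances add.
    Under this model the share also equals the expected persistence of
    performance from one period to the next (1.0 = no mean reversion).
    """
    skill_var, luck_var = skill_sd ** 2, luck_sd ** 2
    return skill_var / (skill_var + luck_var)

# Hypothetical field: luck stays the same while the spread of skill compresses.
for skill_sd in (10, 5, 2):
    print(skill_sd, round(skill_share(skill_sd, luck_sd=5), 3))
```

The luck term never changes in this example; only the spread of skill narrows, and that alone is enough to hand most of the outcome variance to luck.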

Phil Rosenzweig’s The Halo Effect systematically demolishes the business-bestseller genre: In Search of Excellence, Built to Last, Good to Great. These books select companies on outcome (sustained superior financial performance), then collect data about strategy, culture, and leadership, but those observations are contaminated by the outcome. The methodology generates a tautology. On follow-up, most companies celebrated in these books underperform the market in the subsequent decade.

10 · RANDOMNESS

Taleb: survivorship, narrative, fat tails

Survivorship bias. Datasets of “what worked” routinely omit the failures. The “10-year track record” is computed only from funds that survived the decade. The “successful entrepreneur drops out of college” pattern ignores the vastly larger population of dropouts who failed. The “habits of millionaires” books interview only millionaires. Until you have the denominator, you have not done analysis.

Narrative fallacy. Human cognition is a story-generating engine. We construct causal narratives from observed sequences whether or not causation is present. The corrective is asking, every time we encounter a coherent story about why something happened: would this narrative survive if the outcome had been reversed?

Black swans and fat tails. Domains with thin tails (human heights, IQ scores) are well-behaved under Gaussian statistics. Domains with fat tails (wealth distributions, financial returns, pandemic mortality, virality) are not. Any risk model built on Gaussian assumptions in a fat-tailed domain is not conservative; it is dangerous.
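To get a feel for the size of the error, compare the chance of a move beyond five standard deviations under a Gaussian with the same chance under a fat-tailed alternative. The Student-t distribution with 3 degrees of freedom used below is a stand-in chosen because its CDF has a closed form; it is an assumption for illustration, not a claim about any real dataset.

```python
from math import atan, pi
from statistics import NormalDist

def normal_tail(k):
    """P(X > k standard deviations) for a Gaussian."""
    return 1.0 - NormalDist().cdf(k)

def student_t3_tail(k):
    """P(X > k standard deviations) for a Student-t with 3 degrees of freedom.

    A t(3) variable has variance 3, so k standard deviations is x = k * sqrt(3).
    The closed-form CDF for 3 degrees of freedom is written in terms of
    u = x / sqrt(3), which here is simply k.
    """
    u = k
    cdf = 0.5 + (u / (1.0 + u * u) + atan(u)) / pi
    return 1.0 - cdf

k = 5
print(f"Gaussian tail beyond {k} sd: {normal_tail(k):.3g}")
print(f"t(3) tail beyond {k} sd:     {student_t3_tail(k):.3g}")
print(f"ratio: {student_t3_tail(k) / normal_tail(k):.0f}")
```

Both distributions are compared at the same number of standard deviations, so the gap reflects tail shape rather than scale. A model that assumes the Gaussian figure when the data behave more like the fat-tailed one understates extreme events by orders of magnitude.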

11 · CALIBRATION

Calibrated commitment and Brier scores

The amateur cognitive mode is binary. The trained mode is probabilistic. This is not mealy-mouthed hedging; it is a separate skill, more difficult, more honest, more accurate.

A calibrated forecaster’s stated probabilities match the empirical frequency of events. A resolute forecaster pushes probabilities toward extremes when justified. Both are valuable; neither is sufficient.

The Brier score

BS = (1/N) Σ (forecast_i − outcome_i)²

Here forecast_i is the stated probability and outcome_i is 1 if the event happened and 0 if it did not. A perfect Brier score is 0; “always 50%” forecasting yields 0.25. Tetlock’s superforecasters posted Brier scores around 0.15 on geopolitical questions; intelligence-community analysts on comparable unclassified questions posted scores ~30% worse.
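The formula translates directly into code. The sketch below assumes binary questions, with each outcome recorded as 1 if the event happened and 0 if it did not; the sample forecasts are invented.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    squared_errors = ((f - o) ** 2 for f, o in zip(forecasts, outcomes))
    return sum(squared_errors) / len(forecasts)

outcomes = [1, 0, 1, 0]
print(brier_score([0.5, 0.5, 0.5, 0.5], outcomes))  # always 50%: 0.25
print(brier_score([1.0, 0.0, 1.0, 0.0], outcomes))  # perfect foresight: 0.0
print(brier_score([0.9, 0.2, 0.7, 0.4], outcomes))  # hypothetical forecaster
```

Lower is better. A forecaster who is confidently wrong is punished more heavily than one who hedges, because the error is squared.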

Calibrated commitment is the combined discipline: hold beliefs at the appropriate confidence level, update incrementally as evidence arrives, and commit to an actual numerical or lexical probability rather than retreating to “we can’t really know.” “Can’t really know” is almost always a soldier-mindset evasion.

12 · THE OPERATING SYSTEM

An 8-step protocol for any question of consequence

  1. Identify the domain. Empirical / normative / definitional / forecasting / historical / mixed.
  2. Identify the evidence tier of every load-bearing claim. Label each T1/T2/T3/T4.
  3. If contested, identify the truth category. Cat 1/2/3/4.
  4. Surface base rates. If you don’t know, say so and estimate a range.
  5. Articulate competing hypotheses. Use ACH-style matrix if complex. Select hypothesis with fewest inconsistencies.
  6. Apply motivated-reasoning checks. Double-standard, outsider, selective-skeptic, Ideological Turing Test.
  7. Commit to a calibrated verdict with confidence level. State the likelihood using the ICD 203 lexicon. Give the confidence level in a separate sentence. Identify what would change the verdict.
  8. State residual uncertainty and next observables. No real-world conclusion is final.

The operating system has one meta-rule: apply it symmetrically. Anyone who applies the methodology to one side but not the other is doing partisan epistemics dressed up as skepticism.

This file is the methodology spine. The five domain files (Wealth, Power, Geopolitics, Health, Grift) apply these tools to specific contested terrain. They are in active development and will be published as drafts pass legal review.