Methodology

This page is the analytical backbone of the editorial. Everything else on the wiki summarises other people's numbers; this page lays out the model that lets us compute our own — for any combination of model, provider, hardware, and region — and explains how we calibrate it.

The companion implementation lives in analysis/ (Python, no dependencies, a few hundred lines). Each equation here maps directly to a function there.

The fundamental equation

For a single query served by a single model in a single data center on a single grid:

W_query  =  E_query  ×  ( WUE_direct  +  WUE_scope2 )

where:

Symbol	Units	Meaning
`W_query`	L	Water consumed per query (direct + indirect)
`E_query`	kWh	Electrical energy delivered to the data center per query
`WUE_direct`	L/kWh	On-site water per kWh of IT load (cooling)
`WUE_scope2`	L/kWh	Off-site water per kWh consumed at the generating plant(s)

We never report W_query as a single number. We report a low / mid / high band reflecting realistic uncertainty in each input.
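As a minimal sketch of the equation and the band, assuming each input carries its own (low, mid, high) triple: the function names `water_per_query_ml` and `water_band_ml` are illustrative and may not match analysis/calculator.py, and pairing all-low with all-low inputs is an assumption that gives a wider band than treating the errors as independent. The energy figures are placeholders; the WUE values are taken from the tables further down this page.

```python
def water_per_query_ml(e_query_kwh, wue_direct, wue_scope2):
    """W_query in millilitres: E_query [kWh] x (WUE_direct + WUE_scope2) [L/kWh]."""
    litres = e_query_kwh * (wue_direct + wue_scope2)
    return litres * 1000.0


def water_band_ml(e_query_kwh, wue_direct, wue_scope2):
    """Each argument is a (low, mid, high) triple; returns a (low, mid, high) triple."""
    return tuple(
        water_per_query_ml(e, d, s)
        for e, d, s in zip(e_query_kwh, wue_direct, wue_scope2)
    )


# Placeholder energy band (0.1 / 0.3 / 1.0 Wh); temperate evaporative cooling;
# scope-2 values for ERCOT, PJM, and the US average from the grid table.
band = water_band_ml(
    e_query_kwh=(0.0001, 0.0003, 0.001),
    wue_direct=(0.4, 0.6, 0.8),
    wue_scope2=(0.8, 1.6, 2.1),
)
print([round(x, 2) for x in band])
```

Reporting the triple rather than the mid value alone keeps the spread of the inputs visible in every downstream number.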

Energy per query

Inference energy dominates for popular models (training amortised over 10⁹–10¹¹ queries adds <10%). Inference itself splits into prefill (process input) and decode (generate output one token at a time):

E_query  =  ( E_prefill + E_decode )  ×  S_overhead  ×  PUE  ×  R_reasoning

E_prefill  =  ( 2 · P_active · N_input  )  /  ( FLOPS_hw · MFU_prefill )  ×  P_hw
E_decode   =  ( 2 · P_active · N_output )  /  ( FLOPS_hw · MFU_decode  )  ×  P_hw

Symbol	Typical range	Notes
`P_active`	3 B – 200 B parameters	For dense models = total params; for MoE = active expert params
`N_input`	100 – 10,000 tokens	Average input tokens per query, query-type dependent
`N_output`	100 – 5,000 tokens	Average output tokens; for reasoning models, the count before `R_reasoning` is applied
`FLOPS_hw`	1 – 4 PFLOPS BF16	Per-GPU peak (H100 ≈ 1 PFLOPS; B200 ≈ 2.25 PFLOPS dense BF16)
`MFU_prefill`	0.30 – 0.45	Compute-bound; well-batched
`MFU_decode`	0.05 – 0.15	Memory-bandwidth-bound; harder to batch (KV cache pressure)
`P_hw`	700 – 1,200 W	Per-GPU TDP including HBM (H100 700W, B200 1000W, GB200 ~1200W)
`S_overhead`	2× – 10×	Multiplier on the bare 2·P·N FLOPs floor for KV-cache reads, attention compute, multi-GPU communication, batch under-utilization, replica redundancy, safety/filter passes (modern stacks 3–6×; 2022-era ~5–10×)
`PUE`	1.10 – 1.40	Cooling/lighting overhead (modern hyperscalers 1.10–1.20)
`R_reasoning`	1× – 100×	Multiplier for o3 / Claude Opus thinking / R1; covers hidden CoT tokens

The S_overhead factor is what closes the gap between the textbook 2·P·N forward-pass floor and what real serving stacks measure. Goedecke and SemiAnalysis report production inference consistently lands 3–6× above the floor; 2022-era stacks were closer to 5–10×. Without it the calculator under-predicts every published anchor by roughly 5×.

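The 2 · P · N factor is the standard FLOPs estimate for a transformer forward pass (one multiply + one add per parameter per token). Dividing it by FLOPS_hw · MFU gives GPU-seconds; multiplying by P_hw (watts) gives joules, which are divided by 3.6 × 10⁶ to express E_query in kWh.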

For MoE models, P_active is what matters for FLOPs. Memory bandwidth — what dominates decode — also primarily reads only the active experts each step, so the MoE advantage is real both for compute and bandwidth in modern routed inference.
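The energy equations can be sketched as follows. This is an illustration under stated assumptions, not the code in analysis/calculator.py: the function names are hypothetical, the 500-token input and output counts are placeholders, and the other example values are picked from the ranges in the table above (with the ~17B active parameter count borrowed from the Goedecke anchor).

```python
JOULES_PER_KWH = 3.6e6


def phase_energy_j(p_active, n_tokens, flops_hw, mfu, p_hw_watts):
    """Joules for one phase: FLOPs / achieved FLOP/s = GPU-seconds, times watts."""
    gpu_seconds = (2.0 * p_active * n_tokens) / (flops_hw * mfu)
    return gpu_seconds * p_hw_watts


def energy_per_query_kwh(p_active, n_input, n_output, flops_hw, p_hw_watts,
                         mfu_prefill, mfu_decode, s_overhead, pue,
                         r_reasoning=1.0):
    """E_query = (E_prefill + E_decode) x S_overhead x PUE x R_reasoning, in kWh."""
    e_prefill = phase_energy_j(p_active, n_input, flops_hw, mfu_prefill, p_hw_watts)
    e_decode = phase_energy_j(p_active, n_output, flops_hw, mfu_decode, p_hw_watts)
    joules = (e_prefill + e_decode) * s_overhead * pue * r_reasoning
    return joules / JOULES_PER_KWH


# Placeholder query: 500 tokens in, 500 out, on one H100-class GPU.
e_kwh = energy_per_query_kwh(
    p_active=17e9, n_input=500, n_output=500,
    flops_hw=1e15, p_hw_watts=700,
    mfu_prefill=0.40, mfu_decode=0.10,
    s_overhead=4.0, pue=1.15,
)
print(f"{e_kwh * 1000:.3f} Wh per query")
```

With these inputs decode accounts for 119 J of the 148.75 J bare total, because MFU_decode is a quarter of MFU_prefill at equal token counts.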

Water from energy

WUE_direct   =  cooling_factor( cooling_tech, climate )
WUE_scope2   =  Σ ( grid_share_i  ×  WUE_source_i )

Direct (`WUE_direct`)

Empirical hyperscaler ranges by cooling type:

Cooling tech	WUE (L/kWh)	Notes
Evaporative towers (hot + arid)	1.0 – 1.8	Phoenix, Arizona, West Texas
Evaporative towers (temperate)	0.4 – 0.8	Northern Virginia, Iowa
Adiabatic / hybrid	0.1 – 0.4	Modern build, mixed climate
Closed-loop liquid	0.02 – 0.10	New B200/GB200 sites; Microsoft "zero-water"
Air-cooled (no evap assist)	0.00 – 0.02	Cold-climate sites; 10% energy penalty
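One way to express cooling_factor is a plain lookup over the table above. This is a sketch: the key labels are hypothetical, and taking the arithmetic midpoint of each range as the mid value is an assumption rather than something the table states.

```python
# (low, high) WUE_direct ranges in L/kWh, copied from the table above.
# Key labels are hypothetical names chosen for this sketch.
COOLING_WUE_RANGES = {
    ("evaporative", "hot_arid"): (1.0, 1.8),
    ("evaporative", "temperate"): (0.4, 0.8),
    ("adiabatic", "any"): (0.1, 0.4),
    ("closed_loop_liquid", "any"): (0.02, 0.10),
    ("air_cooled", "any"): (0.00, 0.02),
}


def cooling_factor(cooling_tech, climate="any"):
    """Return (low, mid, high) WUE_direct in L/kWh for a cooling tech and climate."""
    key = (cooling_tech, climate)
    if key not in COOLING_WUE_RANGES:
        key = (cooling_tech, "any")  # fall back to the climate-independent row
    if key not in COOLING_WUE_RANGES:
        raise KeyError(f"no WUE range for {cooling_tech!r} in climate {climate!r}")
    low, high = COOLING_WUE_RANGES[key]
    mid = (low + high) / 2.0  # assumption: arithmetic midpoint of the range
    return low, mid, high


print(cooling_factor("evaporative", "temperate"))
print(cooling_factor("closed_loop_liquid", "hot_arid"))
```

Raising on an unknown combination, rather than returning a default, keeps a mistyped region from silently picking up a plausible-looking number.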

Indirect (`WUE_scope2`)

Per-source operational water consumption (not withdrawal) for plants with recirculating cooling towers, from Macknick et al. 2012 (NREL TP-6A20-50900) median values:

Source	gal/MWh	L/kWh
Coal subcritical	479	1.81
Natural gas CC	205	0.78
Nuclear	672	2.54
Hydro (reservoir evap, median)	4,491	17.0
Solar PV utility	1	0.004
Wind	0	0.00

Withdrawal numbers (often ~10–50× higher, especially for once-through cooling) describe water cycled through the plant; consumption is what evaporates and leaves the watershed. The AI-water debate is about consumption, so consumption is what we use here. Hydro is reported with a wide range (0–18,000 gal/MWh) because how much reservoir evaporation gets allocated to electricity vs. flood control / recreation / irrigation varies by accounting convention.
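The L/kWh column follows from the gal/MWh column by a unit conversion, sketched here with the table's consumption medians. The dictionary keys are illustrative labels; the conversion assumes US gallons.

```python
LITRES_PER_US_GALLON = 3.785411784
KWH_PER_MWH = 1000.0


def gal_per_mwh_to_l_per_kwh(gal_per_mwh):
    """Convert a water intensity from gal/MWh to L/kWh."""
    return gal_per_mwh * LITRES_PER_US_GALLON / KWH_PER_MWH


# Consumption medians in gal/MWh, copied from the table above.
CONSUMPTION_GAL_PER_MWH = {
    "coal_subcritical": 479,
    "natural_gas_cc": 205,
    "nuclear": 672,
    "hydro": 4491,
    "solar_pv_utility": 1,
    "wind": 0,
}

for source, gal in CONSUMPTION_GAL_PER_MWH.items():
    print(f"{source:18s} {gal_per_mwh_to_l_per_kwh(gal):.3f} L/kWh")
```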

Weighted by grid mix:

Grid	Coal	Gas	Nuc	Hydro	Solar	Wind	Other	WUE_scope2 (L/kWh)
US national average (2024)	16%	43%	19%	6%	6%	10%	0%	~2.1
Texas (ERCOT)	14%	42%	7%	0%	8%	28%	1%	~0.8
Pacific Northwest	6%	11%	4%	56%	5%	14%	4%	~9.8 (hydro-driven)
Northern Virginia (PJM)	14%	38%	35%	1%	4%	4%	4%	~1.6
Hyperscaler PPA-matched (24/7 wind+solar)	0%	0%	0%	0%	60%	40%	0%	~0.002

"Other" is biomass + geothermal + petroleum + storage round-trip losses; ignored in the WUE calculation (treated as zero) since each share is small and intensities are mid-pack.

The PPA-matched row matters: it's the case where a data center buys 24/7-matched renewable power. Microsoft, Google, and Amazon have committed to this for new builds. For those facilities, scope-2 water is essentially zero, which collapses the indirect share of the per-query water footprint to almost nothing.

Hydro-driven grids (PNW) score badly here because reservoir evaporation is huge per kWh, but treating that as a marginal cost of additional load is contested — most reservoirs would evaporate at the same rate whether they were generating power or not.
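The grid weighting above is a share-weighted sum, which the sketch below reproduces from the two tables. Names such as `wue_scope2` and the grid keys are illustrative; "other" has no entry in the intensity table and so contributes zero, matching the convention described above.

```python
# Per-source consumption in L/kWh, from the per-source table.
WUE_SOURCE_L_PER_KWH = {
    "coal": 1.81, "gas": 0.78, "nuclear": 2.54,
    "hydro": 17.0, "solar": 0.004, "wind": 0.0,
}

# Generation shares as fractions, from the grid-mix table.
GRID_MIX = {
    "us_average_2024": {"coal": 0.16, "gas": 0.43, "nuclear": 0.19, "hydro": 0.06,
                        "solar": 0.06, "wind": 0.10, "other": 0.00},
    "ercot": {"coal": 0.14, "gas": 0.42, "nuclear": 0.07, "hydro": 0.00,
              "solar": 0.08, "wind": 0.28, "other": 0.01},
    "pacific_northwest": {"coal": 0.06, "gas": 0.11, "nuclear": 0.04, "hydro": 0.56,
                          "solar": 0.05, "wind": 0.14, "other": 0.04},
    "pjm_northern_virginia": {"coal": 0.14, "gas": 0.38, "nuclear": 0.35, "hydro": 0.01,
                              "solar": 0.04, "wind": 0.04, "other": 0.04},
    "ppa_matched": {"solar": 0.60, "wind": 0.40},
}


def wue_scope2(mix):
    """Share-weighted water intensity in L/kWh; unknown sources count as zero."""
    total_share = sum(mix.values())
    if abs(total_share - 1.0) > 1e-6:
        raise ValueError(f"grid shares sum to {total_share:.4f}, expected 1.0")
    return sum(share * WUE_SOURCE_L_PER_KWH.get(source, 0.0)
               for source, share in mix.items())


for grid, mix in GRID_MIX.items():
    print(f"{grid:24s} {wue_scope2(mix):.3f} L/kWh")
```

Setting the hydro entry in `WUE_SOURCE_L_PER_KWH` to zero is a quick way to see how much of a hydro-heavy grid's score depends on the contested allocation of reservoir evaporation.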

Calibration

The model has many free parameters but only one structural assumption: that energy per query is the bottleneck and scales as 2 · P · N / (efficiency). We test this against three published anchors that span nearly two orders of magnitude (0.32 to 25 mL per query):

Anchor	Configuration	Reported	Tolerance
Ren et al. 2023	GPT-3 (175B dense) on Azure US-West 2022, evap cooling, US-2022 grid	~25 mL/query (midpoint of 10-50 mL band)	within 1.5×
Goedecke Oct 2024	GPT-4o-class MoE (~17B active) on H100, modern Azure	~1 mL/query (5 mL per ~5-turn conversation; conversation length is unspecified in source)	within 2.5×
Altman, The Gentle Singularity (Jun 2025)	GPT-4o-class on OpenAI global mix, direct on-site only	0.32 mL/query	within 2×

Tolerances are expressed as factor-of, not ±%, because energy and water estimates have geometric uncertainty: an answer "5× too high" and "5× too low" are equally wrong, but in ±% terms they read as +400% and −80%. The published anchors themselves disagree by a factor of ~80 (25 mL vs 0.32 mL per query), which sets a floor on how tightly any single model can fit them all.
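A factor-of tolerance can be checked symmetrically by comparing the larger of ratio and inverse ratio against the tolerance. The sketch below uses the reported values and tolerances from the anchor table; the predicted values are made-up placeholders for illustration, not outputs of the model, and the function names are hypothetical.

```python
def factor_off(predicted, reported):
    """How many times too high or too low the prediction is (always >= 1)."""
    if predicted <= 0 or reported <= 0:
        raise ValueError("factor-of comparison needs positive values")
    ratio = predicted / reported
    return max(ratio, 1.0 / ratio)


def within_factor(predicted, reported, tolerance):
    """True if predicted is within `tolerance`x of reported, in either direction."""
    return factor_off(predicted, reported) <= tolerance


# (name, reported mL/query, tolerance) from the anchor table above.
ANCHORS = [
    ("Ren et al. 2023", 25.0, 1.5),
    ("Goedecke Oct 2024", 1.0, 2.5),
    ("Altman Jun 2025", 0.32, 2.0),
]

# Placeholder predictions in mL/query, invented for this example only.
EXAMPLE_PREDICTIONS = {
    "Ren et al. 2023": 30.0,
    "Goedecke Oct 2024": 0.5,
    "Altman Jun 2025": 0.12,
}

for name, reported, tolerance in ANCHORS:
    predicted = EXAMPLE_PREDICTIONS[name]
    status = "pass" if within_factor(predicted, reported, tolerance) else "FAIL"
    print(f"{name:20s} off by {factor_off(predicted, reported):.2f}x "
          f"(limit {tolerance}x) {status}")
```

Under this measure a prediction of 0.5 mL against a reported 1 mL is off by exactly as much as a prediction of 2 mL would be.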

Anchor regions are picked to match what each source actually measured: Ren on the legacy 2022 Azure stack, Goedecke on a modern US-East config, Altman on a synthetic global-average region with WUE_direct mid 0.95 L/kWh (since his 0.32 mL is averaged across all OpenAI-served sites, including evap-cooled regions, not US-East alone).

Passing all three within these tolerances establishes structural validity. The model lands on the low end of the published spread for the modern anchors — by construction, since it uses 2025-era inputs (modern serving stacks, current PUE, current cooling mix). This is the right behavior for an editorial that argues the popular numbers are inflated, but it should be reported transparently.

Per-provider aggregation

To go from per-query to per-day-per-provider:

W_provider_daily  =  Σ_models  Q_model  ×  W_query(model, hardware, region, profile)

Inputs needed per provider:

Field	Source
Daily query count	Public statements (OpenAI), traffic estimates (Similarweb), API logs
Model mix	Statements, pricing-tier usage data, defaults (e.g. ChatGPT default = 4o-mini for free, 4o for Plus)
Hardware fleet	Earnings calls, infrastructure announcements
Region distribution	Cloud provider published regions
Average query type	Educated estimates of input/output mix per surface (chat vs API)

These are inherently uncertain but they're uncertain in a way we can bound and report. The model lets us run sensitivity analysis: "if Grok serves 200 M queries/day on Memphis Colossus (gas-heavy grid, evap cooling), what's the daily total?"
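The aggregation is a volume-weighted sum over a provider's model mix, carried through for each end of the band. A minimal sketch follows; the `ModelTraffic` dataclass and its field names are hypothetical, and the query split and per-query bands are invented placeholders that only echo the 200 M queries/day scenario above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTraffic:
    name: str
    queries_per_day: float
    water_ml_per_query: tuple  # (low, mid, high) in mL


def provider_daily_litres(traffic):
    """Sum Q_model x W_query over models; returns (low, mid, high) in litres/day."""
    totals = [0.0, 0.0, 0.0]
    for entry in traffic:
        for i, ml in enumerate(entry.water_ml_per_query):
            totals[i] += entry.queries_per_day * ml / 1000.0
    return tuple(totals)


# Placeholder mix: 200 M queries/day split across two hypothetical model tiers.
example_traffic = [
    ModelTraffic("default-chat", 150e6, (0.2, 0.6, 2.0)),
    ModelTraffic("reasoning", 50e6, (2.0, 6.0, 20.0)),
]

low, mid, high = provider_daily_litres(example_traffic)
print(f"low {low:,.0f} L/day, mid {mid:,.0f} L/day, high {high:,.0f} L/day")
```

In this placeholder mix the reasoning tier is a quarter of the traffic but most of the water, which is the kind of effect the model mix field exists to capture.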

Sensitivity analysis

Once the per-query model is calibrated, the next analytical step is a tornado analysis: vary one input at a time across its range, hold the others at baseline, and rank the inputs by how much each moves the output. Hypotheses (to test):

Grid mix at the serving region dominates everything else (because scope-2 is 80%+ of the total)
Active parameter count is second (factor-of-10 across model sizes)
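Hardware generation is third (H100 → B200 is ~1.6× in peak BF16 FLOPS per watt on the table values above; a ~3× gain would need lower-precision formats or system-level improvements that FLOPS_hw and P_hw alone do not capture)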
Cooling tech is a small absolute mover but drives the direct-vs-indirect split — important for siting policy
Reasoning multiplier dominates for o-series / Opus-thinking workloads specifically
Token mix is a mild lever (factor of ~3 across query types)

Rank order matters because it tells you where to spend research effort. If grid mix dominates, then the most analytically valuable thing to nail down per provider is the regional hardware footprint, not the active param count.
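A one-at-a-time tornado can be sketched in a few lines. Everything here is illustrative: `toy_water_ml` is a decode-only simplification of the per-query model, the baseline and ranges are placeholders drawn loosely from the tables on this page, and the ranking it prints reflects those placeholders rather than any finding about the hypotheses above.

```python
def toy_water_ml(p_active, mfu_decode, wue_direct, wue_scope2,
                 n_output=500, flops_hw=1e15, p_hw_watts=700,
                 s_overhead=4.0, pue=1.15):
    """Decode-only simplification of the per-query model, in mL per query."""
    gpu_seconds = (2.0 * p_active * n_output) / (flops_hw * mfu_decode)
    kwh = gpu_seconds * p_hw_watts * s_overhead * pue / 3.6e6
    return kwh * (wue_direct + wue_scope2) * 1000.0


def tornado(model, baseline, ranges):
    """Vary one input at a time; return rows sorted by high/low output ratio."""
    rows = []
    for name, (low_in, high_in) in ranges.items():
        out_a = model(**{**baseline, name: low_in})
        out_b = model(**{**baseline, name: high_in})
        out_low, out_high = sorted((out_a, out_b))
        rows.append((name, out_low, out_high, out_high / out_low))
    return sorted(rows, key=lambda row: row[3], reverse=True)


BASELINE = {"p_active": 17e9, "mfu_decode": 0.10,
            "wue_direct": 0.6, "wue_scope2": 2.1}

# Placeholder ranges; swap in the calibrated low/high values before drawing conclusions.
RANGES = {
    "p_active": (17e9, 170e9),
    "mfu_decode": (0.05, 0.15),
    "wue_direct": (0.02, 1.8),
    "wue_scope2": (0.002, 2.1),
}

for name, out_low, out_high, swing in tornado(toy_water_ml, BASELINE, RANGES):
    print(f"{name:12s} {out_low:7.3f} to {out_high:7.3f} mL  ({swing:.1f}x swing)")
```

Ranking by the ratio of high to low output, rather than by absolute difference, is consistent with the factor-of view of uncertainty used in the calibration section.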

Things this model deliberately does not include

Training energy. ~5–10% of lifetime per-query energy if amortized; smaller for high-volume models. Worth noting; not worth modelling for now.
Network energy between user and data center. ~10⁻³ Wh per query. Negligible.
Embedding / vector search at retrieval time. Could matter for RAG-heavy products but not for the per-prompt baseline.
End-user device energy (your phone). Outside scope; would dominate any honest "carbon per query" accounting but not water.
Construction-phase water (e.g. Newton County, GA). Real impact but not assignable per-query.

These are honest exclusions, listed so the editorial doesn't get caught hand-waving them away.

Files

analysis/models.py — model / hardware / region database, dataclasses
analysis/calculator.py — implementation of the equations above
analysis/run.py — calibration check + per-model results table
analysis/results.md — generated output (re-run the script to refresh)

Sources cited on this page

Macknick et al. (NREL TP-6A20-50900) / IOPscience version — per-source water consumption table
Ren et al. — Making AI Less "Thirsty" — Ren calibration anchor
Goedecke — Talking to ChatGPT costs 5 mL of water — Goedecke calibration anchor
Altman — The Gentle Singularity — Altman calibration anchor
Full bibliography: sources.md