Methodology
This page is the analytical backbone of the editorial. Everything else on the wiki summarises other people's numbers; this page lays out the model that lets us compute our own — for any combination of model, provider, hardware, and region — and explains how we calibrate it.
The companion implementation lives in analysis/ (Python, no dependencies, a few hundred lines). Each equation on this page maps directly to a function there.
The fundamental equation
For a single query served by a single model in a single data center on a single grid:
W_query = E_query × ( WUE_direct + WUE_scope2 )
where:
| Symbol | Units | Meaning |
|---|---|---|
| W_query | L | Water consumed per query (direct + indirect) |
| E_query | kWh | Electrical energy delivered to the data center per query |
| WUE_direct | L/kWh | On-site water per kWh of IT load (cooling) |
| WUE_scope2 | L/kWh | Off-site water per kWh consumed at the generating plant(s) |
We never report W_query as a single number. We report a low / mid / high band reflecting realistic uncertainty in each input.
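A minimal sketch of how a low / mid / high band could be carried through the equation above. The `Band` class and `water_per_query` function are illustrative names, not necessarily those used in analysis/, and the input numbers are placeholders rather than calibrated values. The sketch pairs low with low and high with high, which assumes the inputs err in the same direction and therefore gives the widest band.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """A low / mid / high estimate of one quantity."""
    low: float
    mid: float
    high: float


def water_per_query(e_query_kwh: Band, wue_direct: Band, wue_scope2: Band) -> Band:
    """W_query = E_query x (WUE_direct + WUE_scope2), in litres per query."""
    return Band(
        low=e_query_kwh.low * (wue_direct.low + wue_scope2.low),
        mid=e_query_kwh.mid * (wue_direct.mid + wue_scope2.mid),
        high=e_query_kwh.high * (wue_direct.high + wue_scope2.high),
    )


# Placeholder inputs for illustration only (not calibrated values).
energy = Band(low=0.0001, mid=0.0003, high=0.001)  # kWh per query
direct = Band(low=0.1, mid=0.4, high=0.8)          # L/kWh
scope2 = Band(low=0.8, mid=1.6, high=2.1)          # L/kWh

w = water_per_query(energy, direct, scope2)
print(f"{w.low * 1000:.2f} / {w.mid * 1000:.2f} / {w.high * 1000:.2f} mL per query")
```

Because the equation is monotonic in every input, the low and high outputs bracket any mixed combination of the same inputs.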
Energy per query
Inference energy dominates for popular models (training amortised over 10⁹–10¹¹ queries adds <10%). Inference itself splits into prefill (process input) and decode (generate output one token at a time):
E_query = ( E_prefill + E_decode ) × S_overhead × PUE × R_reasoning
E_prefill = ( 2 · P_active · N_input ) / ( FLOPS_hw · MFU_prefill ) × P_hw
E_decode = ( 2 · P_active · N_output ) / ( FLOPS_hw · MFU_decode ) × P_hw
| Symbol | Typical range | Notes |
|---|---|---|
| P_active | 3 B – 200 B parameters | For dense models = total params; for MoE = active expert params |
| N_input | 100 – 10,000 tokens | Average input tokens per query, query-type dependent |
| N_output | 100 – 5,000 tokens | Average output tokens; for reasoning models, the count before the R_reasoning multiplier is applied |
| FLOPS_hw | 1 – 4 PFLOPS BF16 | Per-GPU peak (H100 ≈ 1 PFLOPS; B200 ≈ 2.25 PFLOPS dense BF16) |
| MFU_prefill | 0.30 – 0.45 | Compute-bound; well-batched |
| MFU_decode | 0.05 – 0.15 | Memory-bandwidth-bound; harder to batch (KV cache pressure) |
| P_hw | 700 – 1,200 W | Per-GPU TDP including HBM (H100 700 W, B200 1,000 W, GB200 ~1,200 W) |
| S_overhead | 2× – 10× | Multiplier on the bare 2·P·N FLOPs floor for KV-cache reads, attention compute, multi-GPU communication, batch under-utilization, replica redundancy, safety/filter passes (modern stacks 3–6×; 2022-era ~5–10×) |
| PUE | 1.10 – 1.40 | Cooling/lighting overhead (modern hyperscalers 1.10–1.20) |
| R_reasoning | 1× – 100× | Multiplier for o3 / Claude Opus thinking / R1; covers hidden CoT tokens |
The S_overhead factor is what closes the gap between the textbook 2·P·N forward-pass floor and what real serving stacks measure. Goedecke and SemiAnalysis report production inference consistently lands 3–6× above the floor; 2022-era stacks were closer to 5–10×. Without it the calculator under-predicts every published anchor by roughly 5×.
The 2 · P · N factor is the standard FLOPs estimate for a transformer forward pass (one multiply + one add per parameter per token).
For MoE models, P_active is what matters for FLOPs. Memory bandwidth — what dominates decode — also primarily reads only the active experts each step, so the MoE advantage is real both for compute and bandwidth in modern routed inference.
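The energy equations above can be sketched as follows. This is an illustration under stated assumptions, not the code in analysis/: the function names are hypothetical, and the example inputs are picked from within the typical ranges in the table. One step the equations leave implicit is the unit conversion: FLOPs divided by effective FLOP/s gives seconds, seconds times watts gives joules, and joules divided by 3.6 × 10⁶ gives kWh.

```python
JOULES_PER_KWH = 3.6e6


def phase_energy_joules(p_active, n_tokens, flops_hw, mfu, p_hw_watts):
    """Energy of one phase (prefill or decode) at the 2*P*N FLOPs floor."""
    seconds = (2 * p_active * n_tokens) / (flops_hw * mfu)  # FLOPs / effective FLOP/s
    return seconds * p_hw_watts                             # W * s = J


def energy_per_query_kwh(p_active, n_input, n_output, flops_hw, p_hw_watts,
                         mfu_prefill, mfu_decode, s_overhead, pue, r_reasoning=1.0):
    """E_query = (E_prefill + E_decode) * S_overhead * PUE * R_reasoning, in kWh."""
    e_prefill = phase_energy_joules(p_active, n_input, flops_hw, mfu_prefill, p_hw_watts)
    e_decode = phase_energy_joules(p_active, n_output, flops_hw, mfu_decode, p_hw_watts)
    joules = (e_prefill + e_decode) * s_overhead * pue * r_reasoning
    return joules / JOULES_PER_KWH


# Illustrative configuration (assumed, not a calibrated anchor):
# ~17B active params on an H100-class GPU, 1,000 input and 500 output tokens.
e_kwh = energy_per_query_kwh(
    p_active=17e9, n_input=1000, n_output=500,
    flops_hw=1e15, p_hw_watts=700,
    mfu_prefill=0.40, mfu_decode=0.10,
    s_overhead=4.0, pue=1.15,
)
print(f"{e_kwh * 1000:.3f} Wh per query")
```

With these inputs decode costs twice as much as prefill despite handling half the tokens, because MFU_decode is a quarter of MFU_prefill.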
Water from energy
WUE_direct = cooling_factor( cooling_tech, climate )
WUE_scope2 = Σ ( grid_share_i × WUE_source_i )
Direct (WUE_direct)
Empirical hyperscaler ranges by cooling type:
| Cooling tech | WUE (L/kWh) | Notes |
|---|---|---|
| Evaporative towers (hot + arid) | 1.0 – 1.8 | Phoenix (Arizona), West Texas |
| Evaporative towers (temperate) | 0.4 – 0.8 | Northern Virginia, Iowa |
| Adiabatic / hybrid | 0.1 – 0.4 | Modern build, mixed climate |
| Closed-loop liquid | 0.02 – 0.10 | New B200/GB200 sites; Microsoft "zero-water" |
| Air-cooled (no evap assist) | 0.00 – 0.02 | Cold-climate sites; 10% energy penalty |
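One way to turn the table above into the `cooling_factor( cooling_tech, climate )` lookup is sketched below. The key names, the fallback for technologies whose row is not split by climate, and the use of the arithmetic midpoint as the "mid" value are all assumptions made for illustration; the real implementation may choose differently.

```python
# (low, high) WUE_direct ranges in L/kWh, copied from the table above.
# A climate of None marks rows that the table does not split by climate.
WUE_DIRECT_RANGES = {
    ("evaporative", "hot_arid"): (1.0, 1.8),
    ("evaporative", "temperate"): (0.4, 0.8),
    ("adiabatic", None): (0.1, 0.4),
    ("closed_loop_liquid", None): (0.02, 0.10),
    ("air_cooled", None): (0.00, 0.02),
}


def cooling_factor(cooling_tech, climate=None):
    """Return (low, mid, high) WUE_direct in L/kWh.

    Raises KeyError if the technology (or technology/climate pair) is unknown.
    The mid value is the arithmetic midpoint of the range, an assumption.
    """
    key = (cooling_tech, climate)
    if key not in WUE_DIRECT_RANGES:
        key = (cooling_tech, None)  # fall back to the climate-independent row
    low, high = WUE_DIRECT_RANGES[key]
    return low, (low + high) / 2, high


print(cooling_factor("evaporative", "temperate"))
print(cooling_factor("closed_loop_liquid", "hot_arid"))
```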
Indirect (WUE_scope2)
Per-source operational water consumption (not withdrawal) for plants with recirculating cooling towers, from Macknick et al. 2012 (NREL TP-6A20-50900) median values:
| Source | gal/MWh | L/kWh |
|---|---|---|
| Coal subcritical | 479 | 1.81 |
| Natural gas CC | 205 | 0.78 |
| Nuclear | 672 | 2.54 |
| Hydro (reservoir evap, median) | 4,491 | 17.0 |
| Solar PV utility | 1 | 0.004 |
| Wind | 0 | 0.00 |
Withdrawal numbers (often ~10–50× higher, especially for once-through cooling) describe water cycled through the plant; consumption is what evaporates and leaves the watershed. The AI-water debate is about consumption, so consumption is what we use here. Hydro is reported with a wide range (0–18,000 gal/MWh) because how much reservoir evaporation gets allocated to electricity vs. flood control / recreation / irrigation varies by accounting convention.
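The L/kWh column in the per-source table follows from the gal/MWh column by a fixed unit conversion (1 US gallon = 3.785411784 L, 1 MWh = 1,000 kWh). A short check, with a hypothetical function name:

```python
LITRES_PER_US_GALLON = 3.785411784


def gal_per_mwh_to_l_per_kwh(gal_per_mwh):
    """Convert water intensity from US gallons per MWh to litres per kWh."""
    return gal_per_mwh * LITRES_PER_US_GALLON / 1000.0


# Median consumption values in gal/MWh, copied from the table above.
MEDIAN_GAL_PER_MWH = {
    "coal_subcritical": 479,
    "natural_gas_cc": 205,
    "nuclear": 672,
    "hydro": 4491,
    "solar_pv": 1,
    "wind": 0,
}

for source, gal in MEDIAN_GAL_PER_MWH.items():
    print(f"{source}: {gal_per_mwh_to_l_per_kwh(gal):.3f} L/kWh")
```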
Weighted by grid mix:
| Grid | Coal | Gas | Nuc | Hydro | Solar | Wind | Other | WUE_scope2 (L/kWh) |
|---|---|---|---|---|---|---|---|---|
| US national average (2024) | 16% | 43% | 19% | 6% | 6% | 10% | 0% | ~2.1 |
| Texas (ERCOT) | 14% | 42% | 7% | 0% | 8% | 28% | 1% | ~0.8 |
| Pacific Northwest | 6% | 11% | 4% | 56% | 5% | 14% | 4% | ~9.8 (hydro-driven) |
| Northern Virginia (PJM) | 14% | 38% | 35% | 1% | 4% | 4% | 4% | ~1.6 |
| Hyperscaler PPA-matched (24/7 wind+solar) | 0% | 0% | 0% | 0% | 60% | 40% | 0% | ~0.002 |
"Other" is biomass + geothermal + petroleum + storage round-trip losses; ignored in the WUE calculation (treated as zero) since each share is small and intensities are mid-pack.
The PPA-matched row matters: it's the case where a data center buys 24/7-matched renewable power. Microsoft, Google, and Amazon have committed to this for new builds. For those facilities, scope-2 water is essentially zero, which collapses the indirect share of the per-query water footprint to almost nothing.
Hydro-driven grids (PNW) score badly here because reservoir evaporation is huge per kWh, but treating that as a marginal cost of additional load is contested — most reservoirs would evaporate at the same rate whether they were generating power or not.
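The grid-weighted figures come from the share-weighted sum WUE_scope2 = Σ ( grid_share_i × WUE_source_i ). A minimal sketch that recomputes them from the shares and the rounded per-source intensities on this page; the dictionary layout and function name are assumptions, and "Other" is treated as zero as described above.

```python
# Per-source consumption intensities in L/kWh, from the per-source table.
WUE_SOURCE = {
    "coal": 1.81, "gas": 0.78, "nuclear": 2.54,
    "hydro": 17.0, "solar": 0.004, "wind": 0.0,
}

# Generation shares copied from the grid-mix table.
GRIDS = {
    "US national average (2024)": {"coal": 0.16, "gas": 0.43, "nuclear": 0.19, "hydro": 0.06, "solar": 0.06, "wind": 0.10, "other": 0.00},
    "Texas (ERCOT)": {"coal": 0.14, "gas": 0.42, "nuclear": 0.07, "hydro": 0.00, "solar": 0.08, "wind": 0.28, "other": 0.01},
    "Pacific Northwest": {"coal": 0.06, "gas": 0.11, "nuclear": 0.04, "hydro": 0.56, "solar": 0.05, "wind": 0.14, "other": 0.04},
    "Northern Virginia (PJM)": {"coal": 0.14, "gas": 0.38, "nuclear": 0.35, "hydro": 0.01, "solar": 0.04, "wind": 0.04, "other": 0.04},
    "Hyperscaler PPA-matched": {"solar": 0.60, "wind": 0.40},
}


def wue_scope2(shares, intensities=WUE_SOURCE):
    """Share-weighted sum in L/kWh; sources without an intensity count as zero."""
    return sum(share * intensities.get(source, 0.0) for source, share in shares.items())


for name, shares in GRIDS.items():
    # Shares should cover the whole mix.
    assert abs(sum(shares.values()) - 1.0) < 1e-9
    print(f"{name}: {wue_scope2(shares):.3f} L/kWh")
```

To explore the contested hydro allocation, one could pass a copy of `WUE_SOURCE` with the hydro intensity set to 0, which corresponds to the marginal-load view in which reservoir evaporation is not charged to additional demand.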
Calibration
The model has many free parameters but only one structural assumption: that energy per query is the bottleneck and scales as 2 · P · N / (efficiency). We test this against three published anchors that span two orders of magnitude:
| Anchor | Configuration | Reported | Tolerance |
|---|---|---|---|
| Ren et al. 2023 | GPT-3 (175B dense) on Azure US-West 2022, evap cooling, US-2022 grid | ~25 mL/query (midpoint of 10-50 mL band) | within 1.5× |
| Goedecke Oct 2024 | GPT-4o-class MoE (~17B active) on H100, modern Azure | ~1 mL/query (5 mL per ~5-turn conversation; conversation length is unspecified in source) | within 2.5× |
| Altman, The Gentle Singularity (Jun 2025) | GPT-4o-class on OpenAI global mix, direct on-site only | 0.32 mL/query | within 2× |
Tolerances are expressed as factor-of, not ±%, because energy and water estimates have geometric uncertainty: an answer "5× too high" and "5× too low" are equally wrong, but in ±% terms they read as +400% and −80%. The published anchors themselves disagree by a factor of ~80 (25 mL vs. 0.32 mL), which sets a floor on how tightly any single model can fit them all.
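A factor-of tolerance check can be sketched as below. The reported values and tolerances are taken from the anchor table; the function names and the example predictions are invented for illustration and are not output of analysis/run.py.

```python
def fold_error(predicted, reported):
    """Symmetric factor-of error: 5.0 means 5x too high or 5x too low."""
    if predicted <= 0 or reported <= 0:
        raise ValueError("fold error is only defined for positive values")
    ratio = predicted / reported
    return max(ratio, 1.0 / ratio)


# anchor -> (reported mL/query, tolerance factor), from the anchor table.
ANCHORS = {
    "Ren et al. 2023": (25.0, 1.5),
    "Goedecke Oct 2024": (1.0, 2.5),
    "Altman Jun 2025": (0.32, 2.0),
}


def check_anchors(predicted_ml):
    """Return {anchor: (fold_error, passed)} for predictions in mL/query."""
    results = {}
    for name, (reported, tolerance) in ANCHORS.items():
        err = fold_error(predicted_ml[name], reported)
        results[name] = (err, err <= tolerance)
    return results


# Placeholder predictions (hypothetical numbers, not model results).
example = {"Ren et al. 2023": 20.0, "Goedecke Oct 2024": 0.5, "Altman Jun 2025": 0.8}
for name, (err, ok) in check_anchors(example).items():
    print(f"{name}: {err:.2f}x {'pass' if ok else 'FAIL'}")
```

Taking the larger of the ratio and its reciprocal is what makes over- and under-prediction score the same.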
Anchor regions are picked to match what each source actually measured: Ren on the legacy 2022 Azure stack, Goedecke on a modern US-East config, Altman on a synthetic global-average region with WUE_direct mid 0.95 L/kWh (since his 0.32 mL is averaged across all OpenAI-served sites, including evap-cooled regions, not US-East alone).
Passing all three within these tolerances establishes structural validity. The model lands on the low end of the published spread for the modern anchors — by construction, since it uses 2025-era inputs (modern serving stacks, current PUE, current cooling mix). This is the right behavior for an editorial that argues the popular numbers are inflated, but it should be reported transparently.
Per-provider aggregation
To go from per-query to per-day-per-provider:
W_provider_daily = Σ_models Q_model × W_query(model, hardware, region, profile)
Inputs needed per provider:
| Field | Source |
|---|---|
| Daily query count | Public statements (OpenAI), traffic estimates (Similarweb), API logs |
| Model mix | Statements, pricing-tier usage data, defaults (e.g. ChatGPT default = 4o-mini for Free, 4o for Plus) |
| Hardware fleet | Earnings calls, infrastructure announcements |
| Region distribution | Cloud provider published regions |
| Average query type | Educated estimates of input/output mix per surface (chat vs API) |
These are inherently uncertain but they're uncertain in a way we can bound and report. The model lets us run sensitivity analysis: "if Grok serves 200 M queries/day on Memphis Colossus (gas-heavy grid, evap cooling), what's the daily total?"
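The aggregation is a plain sum over a provider's workloads. A minimal sketch, assuming each workload has already been reduced to a daily query count and a per-query water figure; the `Workload` class is a hypothetical name, and the per-query values are placeholders, not results from the calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    model: str
    queries_per_day: float
    water_ml_per_query: float  # W_query for this model/hardware/region/profile


def provider_daily_litres(workloads):
    """W_provider_daily = sum over models of Q_model * W_query, in litres."""
    total_ml = sum(w.queries_per_day * w.water_ml_per_query for w in workloads)
    return total_ml / 1000.0


# The 200 M queries/day figure is the what-if from the text; the three
# per-query values are placeholder low / mid / high inputs, not estimates.
for label, ml in [("low", 0.5), ("mid", 1.0), ("high", 3.0)]:
    daily = provider_daily_litres([Workload("hypothetical-model", 200e6, ml)])
    print(f"{label}: {daily:,.0f} L/day")
```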
Sensitivity analysis
Once the per-query model is calibrated, the next analytical step is a tornado analysis: which input drives the most variance in the output? Hypotheses (to test):
- Grid mix at the serving region dominates everything else (because scope-2 is 80%+ of the total)
- Active parameters is second (factor-of-10 across model sizes)
- Hardware generation is third (H100 → B200 is ~3× efficiency)
- Cooling tech is a small absolute mover but drives the direct-vs-indirect split — important for siting policy
- Reasoning multiplier dominates for o-series / Opus-thinking workloads specifically
- Token mix is a mild lever (factor of ~3 across query types)
Rank order matters because it tells you where to spend research effort. If grid mix dominates, then the most analytically valuable thing to nail down per provider is the regional hardware footprint, not the active param count.
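A one-at-a-time tornado can be sketched as follows: swing each input across its range while holding the others at baseline, then rank inputs by the resulting fold-swing in the output. The `toy_water` model, its baseline, and its ranges are placeholders chosen for illustration, so the ordering it prints is not a finding about the real ranking.

```python
def tornado(model, baseline, ranges):
    """Rank inputs by output swing (high / low), others held at baseline.

    Assumes the model returns strictly positive values.
    Returns a list of (input_name, low_output, high_output), largest swing first.
    """
    rows = []
    for name, (low, high) in ranges.items():
        out_a = model(**{**baseline, name: low})
        out_b = model(**{**baseline, name: high})
        rows.append((name, min(out_a, out_b), max(out_a, out_b)))
    rows.sort(key=lambda row: row[2] / row[1], reverse=True)
    return rows


def toy_water(p_active_b, r_reasoning, wue_direct, wue_scope2):
    """Stand-in for the per-query model: energy ~ P_active * R_reasoning (arbitrary units)."""
    return p_active_b * r_reasoning * (wue_direct + wue_scope2)


baseline = {"p_active_b": 17, "r_reasoning": 1, "wue_direct": 0.4, "wue_scope2": 1.6}
ranges = {
    "p_active_b": (3, 200),
    "r_reasoning": (1, 100),
    "wue_direct": (0.02, 1.8),
    "wue_scope2": (0.002, 2.1),
}

for name, low, high in tornado(toy_water, baseline, ranges):
    print(f"{name:12s} swing {high / low:6.1f}x")
```

Ranking by ratio rather than by absolute difference keeps the comparison consistent with the factor-of view of uncertainty used for calibration.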
Things this model deliberately does not include
- Training energy. ~5–10% of lifetime per-query energy if amortized; smaller for high-volume models. Worth noting; not worth modelling for now.
- Network energy between user and data center. ~10⁻³ Wh per query. Negligible.
- Embedding / vector search at retrieval time. Could matter for RAG-heavy products but not for the per-prompt baseline.
- End-user device energy (your phone). Outside scope; would dominate any honest "carbon per query" accounting but not water.
- Construction-phase water (e.g. Newton County, GA). Real impact but not assignable per-query.
These are honest exclusions, listed so the editorial doesn't get caught hand-waving them away.
Files
- analysis/models.py: model / hardware / region database, dataclasses
- analysis/calculator.py: implementation of the equations above
- analysis/run.py: calibration check + per-model results table
- analysis/results.md: generated output (re-run the script to refresh)
Sources cited on this page
- Macknick et al. (NREL TP-6A20-50900) / IOPscience version — per-source water consumption table
- Ren et al. — Making AI Less "Thirsty" — Ren calibration anchor
- Goedecke — Talking to ChatGPT costs 5 mL of water — Goedecke calibration anchor
- Altman — The Gentle Singularity — Altman calibration anchor
- Full bibliography: sources.md