Carbon-Aware 3D Parallelism for LLM Training

February 02, 2026

Training frontier LLMs is expensive. Like, thousands of GPUs running for weeks expensive. And all those GPUs are pulling power from the grid, which means carbon emissions. The standard approach is to crank everything to max parallelism and minimize wall-clock time. Makes sense from a "get the model out the door" perspective, but it completely ignores the carbon cost.

This project started from a pretty simple observation: throughput scales sublinearly with data parallelism, but carbon emissions scale nearly linearly with the number of active devices. That asymmetry is the whole game.

Quick Primer on 3D Parallelism

When you train a huge model across a GPU cluster, you're typically combining three types of parallelism:

  1. Data Parallelism (DP): replicate the model across devices, each processes a different chunk of data, then you sync gradients. This is the easiest to scale but communication overhead grows as you add more replicas.
  2. Tensor Parallelism (TP): split individual layers across devices. Great for large layers but requires fast interconnects since devices are constantly communicating within a single forward pass.
  3. Pipeline Parallelism (PP): split the model into stages and pipeline microbatches through them. Introduces "bubble" overhead where some stages are idle waiting for others.
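To make the three axes concrete, here's a minimal sketch of how a (DP, TP, PP) configuration carves up a cluster into a device grid. The rank-ordering convention (TP fastest-varying, DP slowest) is one common choice, not tied to any particular framework; the function name is invented for illustration.

```python
# Sketch: how a (DP, TP, PP) configuration partitions a GPU cluster.
# The three degrees must multiply to the total device count.

def build_device_grid(num_gpus: int, dp: int, tp: int, pp: int):
    """Assign each GPU rank a (dp, pp, tp) coordinate."""
    assert dp * tp * pp == num_gpus, "parallel degrees must cover the cluster"
    grid = {}
    for rank in range(num_gpus):
        tp_idx = rank % tp              # fastest-varying: tensor-parallel group
        pp_idx = (rank // tp) % pp      # then pipeline stage
        dp_idx = rank // (tp * pp)      # slowest-varying: data-parallel replica
        grid[rank] = (dp_idx, pp_idx, tp_idx)
    return grid

grid = build_device_grid(num_gpus=8, dp=2, tp=2, pp=2)
# Each data-parallel replica spans tp * pp = 4 GPUs.
```

Keeping TP groups on adjacent ranks matters in practice because TP traffic is the most latency-sensitive, so those ranks should share the fastest interconnect.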

Systems like Megatron-LM combine all three to train models at massive scale. The standard goal is to find the (DP, TP, PP) configuration that minimizes training time. Tools like Galvatron and InternEvo automate this search.

The Asymmetry

Here's where it gets interesting. When you double DP, you roughly double your power consumption because you're activating twice as many GPUs. But you don't double your throughput, because gradient synchronization overhead eats into your gains. The more DP replicas you add, the worse this ratio gets.

So you end up in a situation where the last chunk of data parallelism you add is giving you a small bump in tokens/sec but costing you a proportionally large bump in carbon. That's wasteful, especially during times when the electrical grid is running dirty (high carbon intensity).
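A toy model makes the asymmetry visible. The sync-overhead constant `alpha` below is invented for illustration, not a measured value; the point is the shape of the curves, not the numbers.

```python
# Toy model of the DP asymmetry: power scales ~linearly with replicas,
# but throughput is dented by gradient synchronization overhead.
# alpha is a made-up overhead constant, not a measurement.

def throughput(dp: int, alpha: float = 0.03) -> float:
    """Relative tokens/sec: ideal linear scaling, shrunk by a sync
    penalty that grows with the number of replicas."""
    return dp / (1.0 + alpha * (dp - 1))

def power(dp: int, per_replica_kw: float = 1.0) -> float:
    """Power draw grows linearly with active replicas."""
    return dp * per_replica_kw

for dp in (8, 16, 32, 64):
    eff = throughput(dp) / power(dp)  # tokens/sec per kW
    print(f"DP={dp:3d}  throughput={throughput(dp):5.1f}  tokens/sec/kW={eff:.2f}")
```

Under this model, going from DP=8 to DP=64 multiplies power by 8x but throughput by only ~3.3x, so the carbon cost per token climbs with every replica you add.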

The Idea

The carbon intensity of the grid isn't constant. It fluctuates throughout the day depending on the energy mix: solar, wind, natural gas, coal, etc. So the same computation emits very different amounts of carbon depending on when you run it.

Our approach is a carbon-aware controller that dynamically adjusts the (DP, TP, PP) configuration over time based on carbon intensity forecasts. The logic is straightforward:

When carbon intensity is high, we scale back DP to reduce the number of active GPUs and retune TP and PP to keep utilization reasonable on the remaining devices. When carbon intensity drops, we scale DP back up to recover throughput and make progress faster.

The optimization objective minimizes total emissions $\int_0^T P(\pi_t) \cdot CI(t) \, dt$, where $P(\pi_t)$ is the cluster power draw under configuration $\pi_t$ and $CI(t)$ is the grid carbon intensity at time $t$, subject to completing the required training workload $W$. At each decision epoch the controller observes a carbon intensity forecast, estimates throughput and power for candidate configurations, and solves a constrained optimization to pick the best one.
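Here's a minimal sketch of one decision epoch. It assumes the controller already has throughput and power estimates for a handful of candidate configurations, and it folds the "finish workload W by the deadline" constraint into a single Lagrange-style knob `lam` (the kg of CO2 we're willing to spend per token of progress); tuning `lam` is the constrained-optimization part glossed over here. All names and numbers are illustrative.

```python
# Sketch of one controller decision epoch. Candidate throughput/power
# estimates and the value of `lam` are invented for illustration.
from dataclasses import dataclass

@dataclass
class Config:
    dp: int
    tp: int
    pp: int
    tokens_per_s: float  # estimated throughput for this (DP, TP, PP)
    power_kw: float      # estimated cluster power draw

def pick_config(candidates, ci_kg_per_kwh, lam):
    """Minimize instantaneous emissions minus the value of training progress."""
    def score(c):
        emissions_rate = (c.power_kw / 3600.0) * ci_kg_per_kwh  # kg CO2 per second
        return emissions_rate - lam * c.tokens_per_s            # lower is better
    return min(candidates, key=score)

candidates = [
    Config(dp=64, tp=4, pp=4, tokens_per_s=9.0e5, power_kw=700.0),
    Config(dp=32, tp=4, pp=4, tokens_per_s=5.5e5, power_kw=360.0),
    Config(dp=16, tp=8, pp=4, tokens_per_s=3.0e5, power_kw=190.0),
]
lam = 1e-7  # kg CO2 per token of progress; set by the deadline constraint
dirty = pick_config(candidates, ci_kg_per_kwh=0.60, lam=lam)  # coal-heavy hours
clean = pick_config(candidates, ci_kg_per_kwh=0.05, lam=lam)  # solar/wind hours
# dirty -> the small low-power config; clean -> scale DP back up
```

With these numbers the controller picks the DP=16 configuration when the grid is dirty and the DP=64 configuration when it's clean, which is exactly the scale-down/scale-up behavior described above.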

What We Expect

The hypothesis is that we can cut emissions by 20-40% with less than 10% slowdown compared to time-optimized training. That's a massive win. You're giving up a tiny bit of wall-clock time in exchange for a huge reduction in carbon footprint.

We're benchmarking against three baselines: a fixed time-optimized 3D configuration, a simpler approach that only adjusts DP without retuning TP/PP, and carbon-aware workload shifting (GreenFlow style) that moves entire jobs to low-carbon periods without changing parallelism. The key metrics are total CO2 emissions, wall-clock training time, and carbon efficiency measured in tokens per kgCO2.
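The metrics above fall out directly from a power trace and a carbon-intensity trace. This is a sketch with placeholder numbers, not our evaluation harness; the emissions function is just a discretized version of the integral from the objective.

```python
# Sketch of the evaluation metrics. All numbers are placeholders, not results.

def emissions_kg(power_kw_trace, ci_trace_kg_per_kwh, dt_hours):
    """Discretized integral of P(t) * CI(t) dt over the training run."""
    return sum(p * ci * dt_hours
               for p, ci in zip(power_kw_trace, ci_trace_kg_per_kwh))

def carbon_efficiency(total_tokens, total_kg_co2):
    """Tokens trained per kg of CO2 emitted (higher is better)."""
    return total_tokens / total_kg_co2

# Three hourly samples: high power during clean hours, low power during dirty ones.
total = emissions_kg([700.0, 190.0, 700.0], [0.05, 0.60, 0.10], dt_hours=1.0)
eff = carbon_efficiency(1.0e9, total)
```

The interesting comparison is carbon efficiency at matched wall-clock time: a carbon-aware schedule should beat the fixed configuration on tokens per kg CO2 even after accounting for the small slowdown.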

Why This Matters

The conversation around AI sustainability is mostly focused on measuring emissions after the fact or buying carbon offsets. That's fine but it doesn't change anything about how we actually train models. This project intervenes at the systems level, right where the parallelism decisions are being made, and bakes carbon awareness directly into the training loop.

The nice thing is that it doesn't require any new hardware or changes to the model itself. It's purely a scheduling and configuration decision. Any team running large-scale distributed training could plug this in on top of their existing infrastructure.

I'll share results once we have the full evaluation done. Stay tuned.

References

  1. D. Narayanan et al. "Efficient large-scale language model training on GPU clusters using Megatron-LM," 2021. arXiv:2104.04473
  2. Z. Li et al. "Automatically planning optimal parallel strategy for large language models," 2024. arXiv:2501.00254
  3. X. Liu et al. "Galvatron: An automatic distributed system for efficient foundation model training," 2025. arXiv:2504.21411
  4. D. Patterson et al. "Carbon emissions and large neural network training," 2021. arXiv:2104.10350
  5. A. Lacoste et al. "Quantifying the carbon emissions of machine learning," 2019. arXiv:1910.09700
  6. L. F. W. Anthony, B. Kanding, and R. Selvan. "Carbontracker: Tracking and predicting the carbon footprint of training deep learning models," 2020. arXiv:2007.03051
  7. D. Gu, Y. Zhao, P. Sun, X. Jin, and X. Liu. "GreenFlow: A carbon-efficient scheduler for deep learning workloads." IEEE Transactions on Parallel and Distributed Systems, 36(2):168-184, 2025.
  8. Q. Chen et al. "InternEvo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding," 2024. arXiv:2401.09149