Office Hours — 24 April 2026

office-hours
training
nanogpt
speedrun
modal
ddp
fsdp
distributed-training
NanoGPT speedrun infrastructure: modded-nanogpt repo walkthrough, 8x H100 cost and cloud GPU options, Modal for containerized training, Accelerate and DDP vs FSDP, sharding explained, training script demo, HF and W&B secrets setup, and troubleshooting pip/PATH issues on Windows.
Published: April 24, 2026

First Break AI — Office Hours

Session 4 — 24 April 2026 (Cohort 01). From repo structure to a live speedrun: setting up the infrastructure to train a 124M GPT on 8x H100s, understanding DDP vs FSDP, and debugging the first-time setup on a learner’s machine.


What we covered

| # | Topic | Roadmap link |
|---|-------|--------------|
| 1 | Repository overview — modded-nanogpt as a container repo | Step 4 |
| 2 | Infrastructure and cost — 8x H100, cloud pricing | Step 4 |
| 3 | Modal as a training platform | Step 4 |
| 4 | Accelerate library — DDP, FSDP, DeepSpeed Zero | Step 4 |
| 5 | Sharding explained — parameters, gradients, optimizer states | Step 4 |
| 6 | Training script demo — speedrun execution and loss target | Step 4 |
| 7 | Setup prerequisites — HF token, W&B secret, Modal secrets | Step 4 |
| 8 | Troubleshooting — pip install, PATH issues, UV tool | Step 1 |

Topic 1: Repository overview — modded-nanogpt as a container repo

Roadmap connection: Step 4 — Training fundamentals

We walked through the private modded-nanogpt repository, which acts as a container repo for several sub-repos:

  • modded-nanogpt — the official speedrun repository, based on Keller Jordan’s work. This is the benchmark that has been optimized from a baseline of ~45 minutes down to approximately 1.4 minutes.
  • nanogpt-speedrun — related speedrun experiments

The goal: work with the speedrun and a minimal PyTorch framework to train a small 124M parameter GPT-2-scale model — small enough to iterate quickly, large enough to teach real distributed training concepts.


Topic 2: Infrastructure and cost — 8x H100, cloud pricing

Roadmap connection: Step 4 — Training fundamentals

The training script is designed to run on eight Nvidia H100 GPUs. Key cost context:

| Factor | Detail |
|--------|--------|
| Per-GPU rate | On the order of a few US dollars per GPU per hour (varies by provider and contract) |
| Approximate hourly rate | ~4,000–5,000 INR per hour for 8x H100, excluding data transfer |
| Run time | Under 2–3 minutes for the optimized speedrun |
| Per-run cost | Minimal — a few minutes of GPU time at cloud rates |

Why 8x H100? While the run can complete on fewer GPUs, 8x H100 is the meaningful minimum for this speedrun framework. It is also the standard hardware class for leaderboard entries, so your results are directly comparable.

Takeaway: cloud starter credits from providers often cover many experiments at this scale. A short multi-GPU run (minutes) is cheap enough to iterate compared with training a 1B+ parameter model for hours or days.
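The per-run arithmetic above can be sketched directly. The node rate below is an assumption derived from the INR figure quoted earlier, not a price list; substitute your provider's actual rate:

```python
# Back-of-envelope cost of one speedrun run.
# NODE_RATE_USD_PER_HOUR is an illustrative assumption (~4,000-5,000 INR/hr
# for an 8x H100 node); RUN_MINUTES matches the optimized speedrun.
NODE_RATE_USD_PER_HOUR = 55.0
RUN_MINUTES = 3.0

def run_cost_usd(rate_per_hour: float, minutes: float) -> float:
    """Cost of a single run at the given hourly node rate."""
    return rate_per_hour * minutes / 60.0

print(f"~${run_cost_usd(NODE_RATE_USD_PER_HOUR, RUN_MINUTES):.2f} per run")
```

At these assumed rates a full run costs a couple of dollars, which is why iterating on the speedrun is so much cheaper than iterating on a multi-hour 1B+ pretrain.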


Topic 3: Modal as a training platform

Roadmap connection: Step 4 — Training fundamentals

Modal simplifies the process of running multi-GPU training jobs by providing a containerized, plug-and-play environment. Instead of manually SSH-ing into an 8-GPU machine, configuring CUDA, installing dependencies, and setting up networking:

  • Modal provides pre-built containers with the right CUDA/PyTorch versions
  • The training script (train_gpt.py) runs via a wrapper command (modal run) that handles image setup, environment variables, and GPU allocation
  • Secrets (API keys) are injected into the container at runtime
# Typical execution flow
cd modded-nanogpt
modal run wrapper_script.py    # launches 8x H100, runs training

The wrapper handles: spinning up the container → mounting the repo → injecting secrets → launching the distributed training script → logging results to W&B.

Why Modal for learners: the infrastructure complexity of “get 8 GPUs talking to each other” is collapsed into a single command. You focus on the training script and the curves, not on DevOps.


Topic 4: Accelerate library — DDP, FSDP, DeepSpeed Zero

Roadmap connection: Step 4 — Training fundamentals

The Accelerate library simplifies multi-GPU and multi-node training by wrapping the distributed boilerplate: the same training script can run under DDP, FSDP, or DeepSpeed Zero with a configuration change rather than a rewrite.

The learning progression

The cohort’s strategy is to start with the simpler approach and build up:

DDP (start here)
 ↓  understand gradient all-reduce, batch splitting
FSDP / DeepSpeed Zero
 ↓  understand sharding, memory savings
Advanced (pipeline, tensor, expert parallelism)

What Accelerate supports

| Strategy | What it does | When you need it |
|----------|--------------|------------------|
| DDP (Distributed Data Parallel) | Replicates the model on each GPU, splits the batch, all-reduces gradients | Model fits on one GPU; you want to scale batch size |
| FSDP (Fully Sharded Data Parallel) | Shards model parameters across GPUs, gathers on demand | Model is too large for one GPU's memory |
| DeepSpeed Zero 1/2/3 | Progressive sharding of optimizer states, gradients, and parameters | Similar to FSDP with different trade-offs and configuration |

For the speedrun (124M model on 8x H100): DDP is sufficient — the model easily fits on one GPU. The value is learning the pattern (all-reduce, world size, rank) before hitting problems that require FSDP.
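The DDP pattern (replicate, split the batch, all-reduce gradients on backward) can be sketched with plain PyTorch. This is a single-process sketch on the CPU "gloo" backend so it runs anywhere; in the real speedrun the same pattern is launched across 8 ranks, and the tiny Linear model is a stand-in for the 124M GPT:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_step() -> float:
    """One training step in the DDP pattern, as a world_size=1 sketch."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 1)   # stand-in for the real model
    ddp_model = DDP(model)          # full replica lives on every rank
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)

    x, y = torch.randn(4, 8), torch.randn(4, 1)  # this rank's batch shard
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                 # DDP all-reduces gradients here
    opt.step()

    dist.destroy_process_group()
    return loss.item()

loss_val = ddp_step()  # runs the single-process sketch
```

The key vocabulary shows up even at world_size=1: the process group, the rank, and the fact that gradient synchronization is hidden inside `backward()`.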


Topic 5: Sharding explained — parameters, gradients, optimizer states

Roadmap connection: Step 4 — Training fundamentals

Sharding means splitting a model's memory components (parameters, gradients, optimizer states) across GPUs to reduce the footprint per device. This is crucial when the model does not fit on a single GPU.

The three pillars of model memory

flowchart LR
    subgraph Memory["What lives on each GPU"]
        P["Model Parameters\n(weights)"]
        G["Gradients\n(derivatives)"]
        O["Optimizer States\n(Adam moments, etc.)"]
    end

| Pillar | What it is | Memory cost (fp32, approximate) |
|--------|------------|---------------------------------|
| Parameters | The model weights themselves | 4 bytes per parameter |
| Gradients | Computed during the backward pass | Same size as parameters |
| Optimizer states | Adam's first and second moments | 2x parameter size (for Adam) |

For a 32B parameter model in fp32: parameters alone = ~128 GB. With gradients + Adam states, you need ~512 GB — far more than one GPU.
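The arithmetic behind those figures is worth making explicit: 4 bytes of weights + 4 bytes of gradients + 8 bytes of Adam moments = 16 bytes per parameter in fp32 (activations excluded, decimal GB):

```python
# fp32 training memory for an n-parameter model with standard Adam.
def fp32_training_gb(n_params: float) -> dict:
    params_gb = 4 * n_params / 1e9   # 4 bytes per fp32 weight
    grads_gb = 4 * n_params / 1e9    # gradients mirror the parameters
    optim_gb = 8 * n_params / 1e9    # Adam keeps two fp32 moments
    return {
        "params": params_gb,
        "grads": grads_gb,
        "optim": optim_gb,
        "total": params_gb + grads_gb + optim_gb,
    }

mem = fp32_training_gb(32e9)   # the 32B example: 128 GB params, 512 GB total
```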

How sharding strategies split these

| Strategy | Parameters | Gradients | Optimizer | Complexity |
|----------|------------|-----------|-----------|------------|
| DDP | Full copy | Full → all-reduced | Full copy | Lowest |
| Zero-1 / FSDP (stage 1) | Full copy | Full → all-reduced | Sharded | Low |
| Zero-2 | Full copy | Sharded | Sharded | Medium |
| Zero-3 / Full FSDP | Sharded | Sharded | Sharded | Highest |

The trade-off: more sharding = less memory per GPU, but more communication (gathering parameters before each forward/backward). You pay in latency for memory savings.
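The table above can be turned into a small per-GPU memory model. This is a simplification (activations and communication buffers are ignored, and sharded components are assumed to divide evenly), but it makes the memory/communication trade-off concrete for the 32B example:

```python
# Per-GPU fp32 memory (decimal GB) for an n-parameter model under each
# Zero stage, across n_gpus data-parallel ranks. Stage 0 is plain DDP.
def per_gpu_gb(n_params: float, n_gpus: int, stage: int) -> float:
    params = 4 * n_params    # bytes of weights
    grads = 4 * n_params     # bytes of gradients
    optim = 8 * n_params     # bytes of Adam moments
    if stage >= 1:
        optim /= n_gpus      # Zero-1 / FSDP stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus      # Zero-2: also shard gradients
    if stage >= 3:
        params /= n_gpus     # Zero-3 / full FSDP: also shard parameters
    return (params + grads + optim) / 1e9

for stage in range(4):
    print(f"stage {stage}: {per_gpu_gb(32e9, 8, stage):.0f} GB per GPU")
```

On 8 GPUs the 32B model goes from 512 GB per GPU under DDP (impossible) to 64 GB per GPU under Zero-3, which is what makes large-model training feasible at all.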


Topic 6: Training script demo — speedrun execution and loss target

Roadmap connection: Step 4 — Training fundamentals

We watched a live execution of the speedrun training script. The current script includes architectural improvements over the original nanoGPT — rotary embeddings, optimized attention, and the Muon optimizer — and completes in under three minutes on 8x H100 (compared with the original ~45 minute baseline).

Execution flow

modal run wrapper.py
  → Container spins up with 8x H100
  → Warm-up delay (CUDA/torch.compile)
  → Training loop begins
  → Loss checked each step against target (~3.28 val loss)
  → Target hit → training stops → logs pushed to W&B

What shows up in W&B after a run:

  • val_loss declining toward target
  • train_time_ms (cumulative) — should be roughly linear
  • step_avg_ms — per-step wall time (spike at start from compilation, then settles)
  • Token efficiency metrics
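The relationship between the two timing curves is simple: step_avg_ms is the first difference of the cumulative train_time_ms series. The numbers below are made up to show the shape, with the compile spike at step 1 settling into a flat per-step time:

```python
# Cumulative wall-clock per step (ms) -- hypothetical values showing a
# torch.compile warm-up spike on the first step, then steady steps.
train_time_ms = [0, 900, 1020, 1140, 1260, 1380]

# Per-step time is the difference between consecutive cumulative readings.
step_ms = [b - a for a, b in zip(train_time_ms, train_time_ms[1:])]
```

If train_time_ms is roughly linear after the spike, step_ms is roughly constant, which is exactly what a healthy run looks like in W&B.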

Key learning: the speedrun is not about “training a good model” — it is about training engineering: making the same learning happen in less wall-clock time. Reading these W&B curves is the same skill you need for reading a 32B pretrain.


Topic 7: Setup prerequisites — HF token, W&B secret, Modal secrets

Roadmap connection: Step 4 — Training fundamentals

Before running the speedrun, two API keys must be created and stored as Modal secrets:

Step-by-step

  1. Create a Hugging Face token:
    • Go to huggingface.co/settings/tokens
    • Create a token with read access to public/gated repos
    • This avoids rate limiting when downloading models and datasets
  2. Create a Weights & Biases API key:
    • Find your API key in your W&B account settings (wandb.ai/authorize)
  3. Store both as Modal secrets:
    • In the Modal dashboard, create two secrets:
      • HF_TOKEN — your Hugging Face token
      • WANDB_SECRET — your W&B API key
    • The training wrapper injects these into the container at runtime
  4. Clone the repository:
git clone <repo-url>   # the modded-nanogpt container repo
  5. Install and authenticate Modal:
pip install modal
modal token new          # authenticates your local machine with Modal

After this setup, modal run will have access to both HF (for downloading tokenizers/data) and W&B (for logging training curves).
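Inside the container the secrets surface as environment variables, so a quick sanity check at the top of the training script catches a missing secret before GPU time is spent. The variable names below match the secret names used in this setup; note the assumption that the wrapper maps WANDB_SECRET to whatever the wandb client expects:

```python
import os

def missing_secrets() -> list:
    """Return the names of required secrets absent from the environment."""
    required = ("HF_TOKEN", "WANDB_SECRET")
    return [name for name in required if not os.environ.get(name)]

missing = missing_secrets()
if missing:
    print(f"warning: secrets not injected: {missing} -- check the Modal dashboard")
```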


Topic 8: Troubleshooting — pip install, PATH issues, UV tool

Roadmap connection: Step 1 — First use of AI for coding

A significant portion of the session was spent debugging a Windows-specific issue where pip install modal succeeded but the modal command was not recognized in the terminal.

The problem

> pip install modal
Successfully installed modal-0.x.x

> modal token new
The term 'modal' is not recognized as the name of a cmdlet...

Common causes

| Cause | Fix |
|-------|-----|
| pip installs to a user directory not on PATH | Add Python's Scripts directory to your system PATH |
| Multiple Python installations | Run where python (Windows) or which python (Unix) to confirm which Python is active |
| Old pip version | Run pip install --upgrade pip |
| Shell session not refreshed | Close and reopen your terminal after installing |

Resolution steps tried

  1. Reinstalled modal with pip install modal — still not found
  2. Upgraded pip — still not found
  3. Investigated the uv tool as an alternative package manager — also hit recognition issues

Takeaway for learners: PATH issues are one of the most common blockers in first-time setup, especially on Windows. Before concluding “the package is broken,” always check:

  • python -m modal — does it work when invoked through Python directly?
  • pip show modal — where did pip install it?
  • Is that directory on your system PATH?
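The checklist above can be automated with a few lines run under the same interpreter that pip used. One caveat: for per-user installs on Windows the actual scripts directory can differ from the system one, so treat this as a first diagnostic rather than a definitive answer (pip show -f modal reports the real install location):

```python
import os
import sysconfig

# Where does this interpreter put console scripts like `modal`,
# and is that directory actually on PATH?
scripts_dir = sysconfig.get_path("scripts")
path_dirs = os.environ.get("PATH", "").split(os.pathsep)

status = "on PATH" if scripts_dir in path_dirs else "NOT on PATH"
print(f"{scripts_dir}: {status}")
```

If the directory is reported as NOT on PATH, adding it and reopening the terminal is usually the whole fix.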

Follow-ups from this session

  • Learner to complete the HF token, W&B key, and Modal secret setup
  • Investigate the PATH issue further (likely needs Python Scripts directory added to Windows PATH)
  • Once setup is complete: run the speedrun and read the W&B curves using the techniques from the Reading the Curves blog post