Office Hours — 10 April 2026

office-hours
training
ddp
rope
speedrun
distributed-training
benchmarking
claude-code
Learner project (LAN stock game), Claude Code harness leak and safe learning, benchmarking literacy, Gemma and Matryoshka architectures, nanoGPT speedrun and Tyler Romero’s worklog, DDP and GPU collectives, DeepSeek-style low-level optimization, RoPE/RoFormer in code, and walking through a distributed training script.
Published: April 10, 2026

First Break AI — Office Hours

Session 3 — 10 April 2026 (Cohort 01). From a learner’s shipped web game to the internals of distributed training: why DDP works, what RoPE does in a real speedrun commit, and how to read a training script end to end.


What we covered

| # | Topic | Roadmap link |
|---|-------|--------------|
| 1 | Learner project — LAN multiplayer stock game | Step 5 — Build an AI product |
| 2 | Claude Code harness leak — skills, sub-agents, and safe learning | Step 1 |
| 3 | KV cache video — caveats after the explainer | Step 2 |
| 4 | Benchmark literacy — reading model cards | Step 2–3 |
| 5 | Gemma 4 and Matryoshka-style sparsity | |
| 6 | Speedrun — nanoGPT, target loss, Tyler Romero’s worklog | Step 4 · Project Watch |
| 7 | Distributed Data Parallelism (DDP) — gradients and all-reduce | Step 4 |
| 8 | Communication bubble, NVLink, and DeepSeek-style constraints | Step 4 |
| 9 | Muon optimizer — from speedrun contest to production models | Step 4 |
| 10 | RoPE in practice — RoFormer, Tyler’s commits, training script walkthrough | Step 2 · Step 4 |

Topic 1: Learner project — LAN multiplayer stock game

Roadmap connection: Step 5 — Build an AI product (product thinking, scoping, iteration)

We started with a learner walkthrough of a web-based, multiplayer stock-market-style game they had already deployed.

Design choices

  • Web first — easier for user testing and feedback than a native app early on; an app build remains a possible later step.
  • No traditional backend — no always-on database. Sessions are LAN / room-code based: one machine hosts; others join with a code. When the host closes the window, session state is cleared (intentionally simple operationally).
  • Gameplay — inspired by a physical board game: listed companies, moving prices, cards per player, power cards (e.g. debenture, freeze), and reading what opponents buy to inform strategy.
  • Future direction — a peer-to-peer pattern (similar in spirit to small-group mobile games that use a local hotspot as the “parent” network) with temporary nicknames instead of a full account system, to keep friction low for playtests with friends.

Takeaway: the discussion was not “which framework is trendy” but what to ship first — web, minimal persistence, and honest scope — so you can learn from real users before investing in infra.


Topic 2: Claude Code harness leak — skills, sub-agents, and safe learning

Roadmap connection: Step 1 — First use of AI for coding

There was active discussion of public posts and repositories that surfaced internal details of how Claude Code (Anthropic’s coding harness) is structured. The session described it as a major leak of “nuts and bolts”: how context is compacted, how skills and sub-agents are orchestrated, and how multiple agents can work in parallel (e.g. inspecting different codebases and merging insights).

Why it matters for learners

  • Understanding harness design (memory, tooling, parallelism) is part of modern AI-assisted development — not just the model weights.
  • Reports suggest tiered models: heavier models for hard reasoning steps and smaller models for lighter sub-tasks to preserve context budget and token spend (exact behavior depends on product settings).

Safety note

The same moment produced many third-party repos claiming to replicate or port leaked internals. Treat unknown scripts as untrusted — malware has appeared in this wave. Prefer learning from descriptions of the official product behavior, primary docs, and code you can audit — not opaque “drop-in” ports from strangers.


Topic 3: KV cache — caveats after the explainer

Roadmap connection: Step 2 — Run a model locally

We reinforced a point from the roadmap’s KV cache explainer (video linked in Step 2):

  • The memory formula in many explainers is a useful upper-bound mental model; real KV memory depends on attention variant, precision, GQA/MQA, and other architecture choices — so measured footprint is often smaller than the worst-case story.
  • The Illustrated Transformer and Attention Is All You Need remain the conceptual backbone; newer models add layers (e.g. different position handling, sliding windows) on top.
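
The upper-bound mental model from the explainer can be written down directly. This is a minimal sketch, not tied to any specific model: the function name and the config numbers below are illustrative, and real footprints shrink further with sliding windows, paged caches, and quantized KV storage.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, dtype_bytes=2):
    """Upper-bound KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Toy GQA-style config (illustrative numbers, not a specific model):
# 32 layers, 8 KV heads, head_dim 128, 4096-token context, fp16 (2 bytes).
size = kv_cache_bytes(32, 8, 128, 4096)
print(f"{size / 2**20:.0f} MiB")  # 512 MiB
```

Note how GQA shows up in the formula: the same config with 32 KV heads (plain multi-head attention) would need 4× the memory, which is exactly why measured footprints are often smaller than the worst-case story.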

Topic 4: Benchmark literacy — reading model cards

Roadmap connection: Step 2–3

Model launches advertise benchmark scores; the cohort roadmap includes a reverse mapping: for major benchmarks, what is being measured and what a sample task looks like.

Examples called out in session:

| Benchmark | What it tends to measure | Flavor of task |
|---|---|---|
| HellaSwag | Commonsense / plausible continuation | Pick the most plausible next sentence for a short paragraph |
| ARC (Challenge) | Reasoning | Multi-step reasoning items |
| GSM8K | Math word problems | Grade-school style math |
| SWE-bench | Software engineering | Realistic coding / repair tasks |

How to use this: when you read a README for a new checkpoint, you can ask: which of these suites align with what I care about? A model can be strong on one axis (e.g. images) and middling on another (general reasoning benchmarks) — scores are not one number to rule them all.


Topic 5: Gemma 4 and Matryoshka-style sparsity

Context: brief discussion of Gemma 4 reception (benchmarks vs. strengths in areas like image generation) and Google’s broader model line (general vs. science/coding-focused variants).

Matryoshka-style idea (as discussed): some architectures pack nested or sparse structure so you can run a smaller effective model from the same family without always resorting to post-hoc quantization alone — useful when you care about cost–latency tradeoffs at serving time.
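
The nesting idea is easiest to see with embeddings. A minimal sketch, assuming a Matryoshka-style representation where prefixes of the vector were trained to remain usable on their own (the helper name and dimensions below are hypothetical):

```python
import numpy as np

def truncate_embedding(v, k):
    """Matryoshka-style: keep only the first k dims, renormalize to unit length."""
    t = v[:k]
    return t / np.linalg.norm(t)

rng = np.random.default_rng(0)
full = rng.normal(size=768)                 # stand-in for a trained embedding
small = truncate_embedding(full, 256)       # smaller effective representation
tiny = truncate_embedding(full, 64)         # even cheaper, lower fidelity
print(small.shape, tiny.shape)
```

The serving-time tradeoff is then a dial, not a separate model: cheaper comparisons at lower dimensions, without relying on post-hoc quantization alone.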

(Exact Gemma 4 specs evolve; treat papers and model cards as the source of truth.)


Topic 6: Speedrun — nanoGPT, target loss, Tyler Romero’s worklog

Roadmap connection: Step 4 — Training fundamentals · Project Watch

We connected back to Session 2 (27 March) and went deeper on the nanoGPT / modded-nanogpt “speedrun” challenge.

The challenge

  • Train a small GPT-2–scale model on a standard dataset until validation loss hits a fixed target (discussed in session: ~3.28 on the agreed task — a bar derived from strong public baselines).
  • Measure time to target on a fixed hardware class (e.g. 8 × H100 for leaderboard entries).
  • The leaderboard has pushed wall-clock times from many hours down to minutes — a live demonstration that training engineering matters as much as “big model” narratives.
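
The stopping rule above can be sketched in a few lines. This is a toy: `val_loss` below is a fake decaying curve standing in for a real evaluation pass, and only the stop-at-target structure mirrors the speedrun.

```python
import time

TARGET_VAL_LOSS = 3.28  # the bar discussed in session

def val_loss(step):
    """Stand-in for a real eval: a toy decaying curve, not real training."""
    return 3.0 + 2.0 * (0.99 ** step)

start = time.perf_counter()
step = 0
while val_loss(step) > TARGET_VAL_LOSS:
    step += 1                     # real code: one optimizer step per iteration
elapsed = time.perf_counter() - start
print(f"hit {TARGET_VAL_LOSS} at step {step}")
```

Every leaderboard entry is ultimately this loop plus engineering: anything that lets the loss curve cross the bar in fewer wall-clock seconds counts.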

Why run a toy job on serious GPUs?

  • A short multi-GPU run (minutes to a couple of hours) is cheap enough to iterate compared with training a 1B+ parameter model for many hours or days.
  • Cloud starter credits often cover many experiments at this scale — making it a pedagogical sandbox for real distributed training.

Tyler Romero’s worklog

We used Tyler Romero’s NanoGPT speedrun worklog and the tyler-romero/nanogpt-speedrun repo as concrete artifacts: baseline times, commits, and architectural tweaks documented in the open.

The related community project KellerJordan/modded-nanogpt tracks the evolving record times and changes.


Topic 7: Distributed Data Parallelism (DDP) — gradients and all-reduce

Roadmap connection: Step 4 — Training fundamentals

DDP is the first distributed strategy most learners meet: replicate the full model on each GPU and split the batch across devices.

Mental model

```mermaid
flowchart TD
    subgraph GPUs ["Same model copy on each GPU"]
        G0["GPU 0 — batch shard A"]
        G1["GPU 1 — batch shard B"]
    end
    G0 --> F["Forward + backward → local gradients"]
    G1 --> F
    F --> AR["All-reduce: sum gradients, divide by world size"]
    AR --> U["Optimizer step → same new weights everywhere"]
    U --> N["Next minibatch"]
```

  1. Each rank runs forward and backward on its shard of the data.
  2. Gradients are aggregated across ranks — typically all-reduce (sum then divide by world size) so every rank holds the average gradient.
  3. The optimizer updates weights; all ranks stay in lockstep with the same parameters before the next step.

Why average? Each GPU only saw part of the data; averaging combines evidence from all shards into one update rule consistent with a larger global batch.

Global vs local batch:
global_batch_size ≈ local_batch_size × num_gpus × gradient_accumulation_steps (exact layout depends on your script).
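
The "why average?" claim can be checked numerically without any GPUs. A minimal sketch, assuming a loss that is a mean over examples (here MSE on a toy linear model; all names are illustrative): averaged per-shard gradients equal the single-device gradient on the full batch.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of mean-squared error 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)

# "DDP": split the batch across 2 ranks (local batch 4, global batch 8)
# and compute local gradients...
g0 = grad(w, X[:4], y[:4])
g1 = grad(w, X[4:], y[4:])
# ...then all-reduce: sum and divide by world size.
g_avg = (g0 + g1) / 2

# Matches the single-GPU gradient on the full batch of 8.
assert np.allclose(g_avg, grad(w, X, y))
```

Because every rank holds the same averaged gradient, every rank takes the same optimizer step, which is what keeps the replicas in lockstep.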


Topic 8: Communication bubble, NVLink, and DeepSeek-style constraints

Roadmap connection: Step 4

The bubble

If networking between GPUs is slow (or misconfigured), ranks wait on collectives (all-reduce, broadcast, etc.). Idle GPU time is the real “loss” — not numerical error in the collective itself.

That is why serious multi-GPU rentals are sold as pre-wired nodes with NVLink / NVSwitch — you cannot always stitch arbitrary single-GPU instances into a high-bandwidth cluster.
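
A toy cost model makes the bubble concrete. This is an assumption-laden sketch, not a real profiler: it models communication that cannot be overlapped with backward compute as exposed idle time, with a made-up `overlap_frac` knob.

```python
def step_time_ms(compute_ms, comm_ms, overlap_frac=0.0):
    """Toy model: comm that isn't hidden under compute is exposed as idle
    time ('the bubble'); overlap_frac says how much compute can hide."""
    exposed = max(0.0, comm_ms - overlap_frac * compute_ms)
    return compute_ms + exposed

fast = step_time_ms(100, 20, overlap_frac=0.8)   # NVLink-class: comm hidden
slow = step_time_ms(100, 150, overlap_frac=0.8)  # slow interconnect: big bubble
print(fast, slow)
```

Same model, same math, but the slow-interconnect step is dominated by waiting, which is why the bandwidth of the wiring is part of the price you pay for.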

DeepSeek under hardware pressure (teaching story)

Export controls and hardware limits pushed teams to optimize communication paths, pipeline schedules, mixed precision, and even low-level GPU instructions — not only “more FLOPs.”

A readable starting point on the engineering story: DeepSeek’s Low Level Hardware Magic (and related analyses). Themes relevant to training at scale:

  • Bi-directional pipeline scheduling (reduce idle “bubbles”)
  • Mixed precision to fit more work in memory
  • Custom kernels and GEMM fusion — fusing operations (e.g. matmul chains) to keep SMs fed
  • PTX-level discoveries (instruction-level behavior) — illustrative of how far optimization can go

Parallelism beyond DDP: when the model does not fit one GPU, you need tensor, pipeline, expert (for MoE), or context parallelism — often combined (“4D / 5D” training). MoE routes tokens to experts and leans heavily on all-to-all communication patterns.


Topic 9: Muon optimizer — from speedrun contest to production models

Roadmap connection: Step 4

The speedrun ecosystem produced Muon — an optimizer variant that escaped the leaderboard into real checkpoints. In session we noted examples cited in passing (Kimi K2, GLM 4.5 — verify on each model card).

Takeaway: small, reproducible contests are where optimizer and architecture ideas get battle-tested before wide adoption.
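
For the curious, the distinctive piece of Muon is a Newton-Schulz iteration that approximately orthogonalizes the (momentum-smoothed) gradient matrix before applying it. The sketch below is heavily hedged: it is a NumPy transcription of the quintic iteration published in the modded-nanogpt repo (coefficients included), not production optimizer code — verify against the repo before relying on it.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Push the singular values of G toward 1 (approximate orthogonalization),
    the core transform inside the Muon update. Quintic coefficients follow
    the public modded-nanogpt implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

rng = np.random.default_rng(0)
O = newton_schulz_orthogonalize(rng.normal(size=(4, 6)))
print(np.linalg.svd(O, compute_uv=False))  # singular values roughly near 1
```

Intuition: each iteration applies the polynomial `a*s + b*s^3 + c*s^5` to every singular value, squashing them toward 1, so the update has roughly uniform "strength" in every direction.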


Topic 10: RoPE in practice — RoFormer, Tyler’s commits, training script walkthrough

Roadmap connection: Step 2 · Step 4

Absolute vs rotary position embeddings

Classic setups add a learned or sinusoidal position vector to token embeddings. Rotary Position Embedding (RoPE) — see the RoFormer / RoPE paper — injects relative position by rotating Q and K in attention instead of adding a separate position vector to the residual stream. Benefits discussed include relative distance behavior and efficiency when implemented with caching.
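
The "rotate instead of add" idea fits in a few lines. A minimal NumPy sketch (vector shapes and the `base` value follow the common convention; real implementations cache the cos/sin tables): consecutive pairs of dimensions are rotated by position-dependent angles, and the key property — attention scores depend only on relative offset — falls out of the rotation algebra.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dims of x by position-dependent angles."""
    d = len(x)
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Shifting both positions by the same amount leaves q·k unchanged:
# the score depends only on the relative offset (7 - 3 == 107 - 103).
s1 = rope(q, pos=3) @ rope(k, pos=7)
s2 = rope(q, pos=103) @ rope(k, pos=107)
assert np.isclose(s1, s2)
```

Rotations also preserve vector norms, so RoPE changes *where* a query points, never how "loud" it is — one reason it composes cleanly with KV caching.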

What changed in Tyler’s speedrun experiment (session summary)

Walking the diff / blog (not line-by-line in notes): adding RoPE, tuning the learning-rate schedule (e.g. trapezoidal with warmup), adjusting batching and gradient accumulation, and removing or changing gradient clipping together moved a reported baseline on the order of ~8.3 hours toward ~7.5 hours, while also affecting token efficiency (exact numbers: see Tyler’s posts and commits).

Hyperparameters vs parameters

  • Hyperparameters — set before/during the run: learning rate schedule, batch sizes, sequence length, warmup steps, accumulation steps, etc.
  • Parameters — learned weights of the model.

Warmup steps — early training often uses lower LR so optimization is stable while random initialization noise settles (exact schedule is experiment-specific).
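
Warmup and the trapezoidal shape mentioned above are easy to express as one function. A sketch only — the step counts and peak LR below are placeholders, and the exact schedule in any speedrun commit is experiment-specific:

```python
def trapezoidal_lr(step, peak_lr, warmup_steps, hold_steps, decay_steps):
    """Warm up linearly to peak, hold flat, then decay linearly toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # warmup ramp
    if step < warmup_steps + hold_steps:
        return peak_lr                                      # flat top
    decay_step = step - warmup_steps - hold_steps
    return peak_lr * max(0.0, 1.0 - decay_step / decay_steps)  # decay ramp

# Placeholder hyperparameters, purely illustrative.
schedule = [trapezoidal_lr(s, 6e-4, 100, 700, 200) for s in range(1000)]
```

Plotting `schedule` gives the trapezoid: a short ramp, a long plateau, and a linear cooldown — the "hyperparameter" is the whole shape, not a single number.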

Reading one DDP script

We stepped through a minimal DDP-style script (pattern common in nanoGPT-style code):

  • rank / local_rank / world_size — which process this is in the cluster.
  • DistributedSampler — each rank receives its slice of the dataset.
  • Loop — forward, backward, gradient accumulation windows, then all-reduce of gradients, optimizer.step(), logging to Weights & Biases (or similar), validation loss check, stop when target hit.

Collectives vocabulary (for later reading): all-reduce, all-gather, reduce-scatter, broadcast, all-to-all — different patterns appear as you move from DDP to MoE and large-scale parallelism.
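
That vocabulary is easier to retain with toy semantics in hand. These are single-process Python functions over plain lists — a sketch of what each collective *computes*, not real NCCL/torch.distributed calls (index 0 plays "rank 0", and so on):

```python
def all_reduce(ranks):
    """Every rank ends with the same reduction (here: sum) of all inputs."""
    total = sum(ranks)
    return [total for _ in ranks]

def all_gather(ranks):
    """Every rank ends with the full list of everyone's value."""
    return [list(ranks) for _ in ranks]

def reduce_scatter(rank_vectors):
    """Rank i ends with the sum of everyone's i-th chunk."""
    return [sum(v[i] for v in rank_vectors) for i in range(len(rank_vectors))]

def broadcast(ranks, root=0):
    """Every rank ends with the root's value."""
    return [ranks[root] for _ in ranks]

grads = [1.0, 2.0, 3.0, 4.0]       # one scalar "gradient" per rank
print(all_reduce(grads))           # [10.0, 10.0, 10.0, 10.0]
```

DDP lives on `all_reduce`; sharded optimizers lean on `reduce_scatter` + `all_gather`; MoE routing is the odd one out, built on all-to-all exchanges between ranks.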


Follow-ups mentioned in session

  • A deeper write-up on RoPE and the forward pass of the speedrun script — to be added to the First Break AI site as the material matures.
  • Learner to read RoFormer / RoPE with the cohort’s primer first to avoid getting lost in notation.