Step 2: Model Weight Formats — GGUF vs SafeTensors

First Break AI — Step 2: Run a Model Locally

```mermaid
flowchart LR
subgraph modelFile ["Model file on disk"]
META["Metadata\narchitecture, dims,\nvocab size, etc."]
T1["Tensor: embed_tokens.weight\nshape: 151936 x 1024\ndtype: float32"]
T2["Tensor: layers.0.self_attn.q_proj.weight\nshape: 1024 x 1024\ndtype: float32"]
T3["Tensor: layers.0.self_attn.k_proj.weight\nshape: 256 x 1024\ndtype: float32"]
TN["... hundreds more tensors ..."]
end
```
This post is part of the First Break AI cohort roadmap. Companion to the main Step 2 guide: Run Qwen3 0.6B in pure C. You do not need to read that guide first, but it helps.
Lesson 0: What is inside a model file?
Before comparing formats, you need to understand what every model file contains. It is simpler than you think.
A trained LLM is a collection of tensors — multi-dimensional arrays of floating-point numbers. Each tensor has:
- A name — like `model.layers.0.self_attn.q_proj.weight`
- A shape — like `[1024, 1024]` (a 1024 x 1024 matrix)
- A data type — like `float32` (4 bytes per number) or `float16` (2 bytes)
- The numbers themselves — millions or billions of them
That is it. A model file is a container that stores these tensors along with some metadata (architecture name, vocabulary size, number of layers, etc.).
Qwen3 0.6B has about 600 million parameters. At 4 bytes each (float32), that is ~2.4 GB of raw numbers. The rest of the file is metadata and tensor names.
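That back-of-the-envelope arithmetic is worth doing yourself. A quick sketch (the 600M figure is an approximation of Qwen3 0.6B's true parameter count):

```python
# Rough size of Qwen3 0.6B's raw weights at different precisions.
# 600M parameters is an approximation, not the exact count.
params = 600_000_000

for name, bytes_per_weight in [("float32", 4), ("float16", 2)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{name}: {gb:.1f} GB")  # float32: 2.4 GB, float16: 1.2 GB
```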
Every format we discuss stores exactly this information. The differences are how they store it — and that “how” has major implications for security, speed, compatibility, and quantization.
Lesson 1: PyTorch .bin — the legacy format
The original way PyTorch saves models.
How it works
PyTorch uses Python’s pickle module to serialize model state dictionaries. When you call:
```python
torch.save(model.state_dict(), "model.bin")
```

Python’s pickle serializes the entire dictionary — tensor names, shapes, dtypes, and the raw data — into a binary stream.
To load:
```python
state_dict = torch.load("model.bin")
model.load_state_dict(state_dict)
```

(Recent PyTorch releases default `torch.load` to `weights_only=True`, which restricts what the unpickler may construct, but older versions and code that explicitly passes `weights_only=False` remain exposed.)

The pickle problem
Pickle can serialize arbitrary Python objects, including executable code. This means a malicious .bin file can execute code on your machine when you load it:
```python
import pickle
import os

class Exploit:
    def __reduce__(self):
        return (os.system, ("rm -rf /",))

pickle.dumps(Exploit())
```

When pickle deserializes this object, it calls `os.system("rm -rf /")`. A model file could contain this payload hidden among the tensor data. You would not know until it runs.
This is not theoretical. Security researchers have demonstrated pickle-based attacks against ML model files. It was the primary motivation for creating SafeTensors.
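A harmless variant makes the mechanism concrete. The `Payload` class below is illustrative, not taken from any real model file; the point is that the expression runs during `pickle.loads`, before your code ever touches the resulting object:

```python
import pickle

class Payload:
    def __reduce__(self):
        # On unpickling, pickle calls eval("6 * 7") and hands back the result.
        # Swap eval for os.system and this becomes the attack above.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
obj = pickle.loads(blob)
print(obj)  # 42 -- the code ran at load time, not when the object was "used"
```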
```mermaid
flowchart TD
subgraph safe ["Safe file formats"]
ST["SafeTensors\nno code execution"]
GG["GGUF\nno code execution"]
end
subgraph unsafe ["Formats with code execution risk"]
PB["PyTorch .bin\npickle-based"]
PKL["Raw .pkl files"]
end
PB -->|"can contain\narbitrary code"| RISK["Code executes\non load"]
ST -->|"only contains\nnumbers + metadata"| SAFE["No code execution\npossible"]
GG -->|"only contains\nnumbers + metadata"| SAFE
```
Every time you download a model from the internet and call `torch.load()`, you are trusting that the file does not contain malicious code. With pickle-based formats, that trust is hard to verify: static pickle scanners exist but are regularly bypassed, and a plain load simply executes whatever the stream instructs. SafeTensors was created specifically to solve this problem.
Lesson 2: SafeTensors — the secure replacement
SafeTensors was created by HuggingFace as a direct response to the pickle security problem.
Design principles
- No code execution — the format can only store tensors and metadata. There is no mechanism to embed executable code.
- Zero-copy loading — tensors can be memory-mapped directly from disk without copying data into RAM.
- Format validation — the file structure can be fully validated before any data is read.
- Cross-framework — works with PyTorch, TensorFlow, JAX, Flax, and others.
File structure
A SafeTensors file has a dead-simple layout:

```
┌──────────────────────────────────────────┐
│ 8 bytes: header_size (little-endian u64) │
├──────────────────────────────────────────┤
│ JSON header                              │
│ {                                        │
│   "tensor_name": {                       │
│     "dtype": "F32",                      │
│     "shape": [1024, 1024],               │
│     "data_offsets": [0, 4194304]         │
│   },                                     │
│   ...                                    │
│ }                                        │
├──────────────────────────────────────────┤
│ Raw tensor data                          │
│ (contiguous bytes, no padding)           │
└──────────────────────────────────────────┘
```
The header is JSON — human-readable, parseable, and safe. It contains tensor names, shapes, dtypes, and byte offsets into the data section. The data section is just raw bytes — no structure, no code, no objects. Each tensor’s data starts at the offset specified in the header.
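To make the layout concrete, here is a minimal sketch that writes and reads this structure by hand, using only the standard library. It ignores details the real format specifies (the optional `__metadata__` key, header padding, dtype validation), so treat it as an illustration of the layout, not a replacement for the `safetensors` library:

```python
import json
import struct

def write_safetensors(path, tensors):
    """tensors: {name: (dtype_str, shape, raw_bytes)} -- sketch only."""
    header, blobs, offset = {}, [], 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        blobs.append(raw)
        offset += len(raw)
    header_json = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_json)))  # 8-byte LE header size
        f.write(header_json)                          # JSON header
        for raw in blobs:                             # contiguous tensor data
            f.write(raw)

def read_tensor(path, name):
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
        start, end = header[name]["data_offsets"]
        f.seek(8 + header_size + start)  # jump straight to this tensor
        return header[name]["shape"], f.read(end - start)

# Round-trip a 2x2 float32 tensor
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
write_safetensors("tiny.safetensors", {"w": ("F32", [2, 2], raw)})
shape, data = read_tensor("tiny.safetensors", "w")
print(shape, struct.unpack("<4f", data))  # [2, 2] (1.0, 2.0, 3.0, 4.0)
```

Note how `read_tensor` never reads past the one tensor it wants; that offset-seek is exactly what makes memory-mapped loading possible.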
Why “zero-copy” matters
Because the data section is contiguous raw bytes with known offsets, you can mmap the file and point directly at any tensor without reading the entire file into RAM:
```python
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    q_proj = f.get_tensor("model.layers.0.self_attn.q_proj.weight")
```

Only the pages containing that specific tensor get read from disk. On a 7B model (14 GB), loading one tensor is almost instant — the OS only reads the relevant pages.
How HuggingFace uses it
HuggingFace Hub now defaults to SafeTensors. When you call:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
```

The library downloads .safetensors files (not .bin) and loads them with zero-copy memory mapping. This is faster than pickle-based loading and eliminates the security risk.
Loading in Python
```python
from safetensors.torch import load_file

tensors = load_file("model.safetensors")
print(tensors.keys())
# dict_keys(['model.embed_tokens.weight', 'model.layers.0.self_attn.q_proj.weight', ...])
print(tensors["model.embed_tokens.weight"].shape)
# torch.Size([151936, 1024])
```

Lesson 3: GGUF — the local inference format
GGUF (GGML Unified Format) was created by Georgi Gerganov for llama.cpp. It is the standard format for running models locally without Python.
Why a separate format?
SafeTensors solved security. GGUF solves a different set of problems:
- Self-contained — a single GGUF file contains everything needed to run inference: weights, tokenizer vocabulary, architecture config, chat template. No extra files.
- Built-in quantization — GGUF natively supports dozens of quantization formats (Q4_0, Q4_K_M, Q8_0, etc.) that reduce model size and speed up inference.
- C/C++ native — designed to be read by C programs, not Python. No pickle, no JSON libraries, no framework dependencies.
- Memory-mappable — like SafeTensors, tensor data is laid out for direct `mmap`.
File structure
```
┌─────────────────────────────────────────┐
│ Magic number: "GGUF" (4 bytes)          │
│ Version: 3 (4 bytes)                    │
│ Tensor count (8 bytes)                  │
│ Metadata KV count (8 bytes)             │
├─────────────────────────────────────────┤
│ Metadata key-value pairs                │
│   "general.architecture": "qwen3"       │
│   "qwen3.block_count": 28               │
│   "qwen3.embedding_length": 1024        │
│   "tokenizer.ggml.tokens": [...]        │
│   "tokenizer.ggml.merges": [...]        │
│   "tokenizer.chat_template": "..."      │
│   ...                                   │
├─────────────────────────────────────────┤
│ Tensor info (names, shapes, offsets)    │
├─────────────────────────────────────────┤
│ Tensor data (aligned, memory-mappable)  │
└─────────────────────────────────────────┘
```
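The fixed-size portion of that header is easy to parse by hand. A sketch for GGUF version 3, with field widths as in the layout above (real files follow this with variable-length metadata that needs a full reader):

```python
import struct

def read_gguf_header(path):
    # GGUF v3 fixed header: 4-byte magic, u32 version,
    # u64 tensor count, u64 metadata KV count (all little-endian).
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: {magic!r}")
        (version,) = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return version, n_tensors, n_kv

# Fabricated 24-byte header, purely for demonstration
with open("fake.gguf", "wb") as f:
    f.write(b"GGUF" + struct.pack("<IQQ", 3, 310, 25))

print(read_gguf_header("fake.gguf"))  # (3, 310, 25)
```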
The metadata section
This is what makes GGUF self-contained. The metadata includes:
- Architecture — model family (llama, qwen3, mistral, etc.)
- Dimensions — embedding size, number of layers, head count, etc.
- Tokenizer — the full vocabulary, BPE merge rules, and special token IDs
- Chat template — the Jinja template for formatting messages
- Quantization info — what precision each tensor uses
In SafeTensors, this information lives in separate files (config.json, tokenizer.json, tokenizer_config.json, etc.). In GGUF, it is all in one file. You can load a GGUF and run inference without any other files.
Quantization support
This is GGUF’s killer feature. Each tensor in a GGUF file can use a different quantization format:
| Format | Bits per weight | Size for 7B model | Quality |
|---|---|---|---|
| F32 | 32 | ~28 GB | Full precision |
| F16 | 16 | ~14 GB | Near-lossless |
| Q8_0 | 8 | ~7 GB | Very good |
| Q4_K_M | ~4.5 | ~4.1 GB | Good for most uses |
| Q4_0 | 4 | ~3.8 GB | Acceptable |
| Q2_K | ~2.5 | ~2.7 GB | Noticeable degradation |
A 7B model that would be 28 GB in float32 can be 4 GB in Q4_K_M — small enough to run on a laptop with 8 GB RAM. The quantization is baked into the file format itself; the inference engine (llama.cpp, qwen3.c) knows how to read and dequantize each format.
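The idea behind these block formats fits in a few lines. The sketch below mimics the spirit of Q8_0 (one scale per block of 32 weights, values stored as int8); the actual llama.cpp bit layout, and especially the K-quants, are more involved:

```python
import numpy as np

def quantize_q8_blocks(x, block=32):
    # Q8_0-style sketch: one float scale per block of 32 weights,
    # values rounded to int8. Not the exact llama.cpp byte layout.
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.round(xb / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q8_blocks(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_q8_blocks(w)
err = np.abs(w - dequantize_q8_blocks(q, s)).max()
print(q.nbytes + s.nbytes, "bytes vs", w.nbytes)  # 1152 bytes vs 4096
```

Each weight costs 8 bits plus a shared per-block scale, so the storage drops to roughly 9 bits per weight, and the reconstruction error stays bounded by half a quantization step per value.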
SafeTensors does not have built-in quantization — you need separate libraries (GPTQ, AWQ, bitsandbytes) to quantize and load quantized models.
Lesson 4: Side-by-side comparison
Here is the complete comparison:
| Feature | PyTorch .bin | SafeTensors | GGUF |
|---|---|---|---|
| Security | Unsafe (pickle) | Safe (no code) | Safe (no code) |
| Memory mapping | No | Yes | Yes |
| Self-contained | No (needs config files) | No (needs config files) | Yes (everything in one file) |
| Quantization | External only | External only | Built-in (Q4, Q8, etc.) |
| Tokenizer | Separate file | Separate file | Embedded in metadata |
| Chat template | Separate file | Separate file | Embedded in metadata |
| Primary ecosystem | PyTorch | HuggingFace (multi-framework) | llama.cpp, local inference |
| Language | Python only | Multi-language | C/C++ native |
| Loading speed | Slow (deserialize) | Fast (mmap) | Fast (mmap) |
| File count | 1+ (often sharded) | 1+ (often sharded) | Usually 1 file |
```mermaid
flowchart TD
subgraph training ["Training / fine-tuning"]
HF["HuggingFace ecosystem"]
ST["SafeTensors files\n+ config.json\n+ tokenizer.json"]
end
subgraph conversion ["Conversion"]
CONV["convert script\npython convert.py"]
end
subgraph inference ["Local inference"]
GGUF_FILE["Single GGUF file\nweights + vocab + config\n+ quantization"]
LLAMA["llama.cpp / qwen3.c\nC/C++ inference"]
end
HF --> ST
ST --> CONV
CONV --> GGUF_FILE
GGUF_FILE --> LLAMA
```
When to use which
SafeTensors — when you are working in Python with HuggingFace Transformers, PyTorch, or any Python ML framework. This is the default for training, fine-tuning, and Python-based inference.
GGUF — when you want to run a model locally without Python. llama.cpp, Ollama, LM Studio, and other local inference tools use GGUF. Also when you need quantized models that fit in limited RAM.
PyTorch .bin — legacy. Avoid for new work. Use SafeTensors instead.
Lesson 5: How GGUF loading works in C
In the Step 2 blog, we ran Qwen3 0.6B using a single C binary (run.c). Here is how it loads the GGUF file.
Memory mapping — zero-copy loading
```c
*data = mmap(NULL, *file_size, PROT_READ, MAP_PRIVATE, *fd, 0);
void* weights_ptr = ((char*)*data) + 5951648; // skip header
memory_map_weights(weights, config, weights_ptr);
```

`mmap()` tells the operating system: “Map this file into my address space. Do not read it yet — just give me pointers.” The OS creates virtual memory pages that correspond to the file on disk. When the program accesses a pointer, the OS reads that page from disk on demand.
This means:
- Startup is instant — even for a 3 GB file, `mmap` returns in microseconds. No data is read yet.
- Only used pages are loaded — if the model only accesses certain layers, only those layers get read from disk.
- The OS page cache helps — on subsequent runs, the data is already cached in RAM.
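You can observe the same on-demand behavior from Python, whose `mmap` module wraps the identical syscall; slicing the mapping touches only the pages backing that range:

```python
import mmap
import os

# Create a sample file as a stand-in for a model (real GGUFs are gigabytes)
with open("weights.bin", "wb") as f:
    f.write(bytes(range(256)) * 4096)  # 1 MiB, repeating 0..255

fd = os.open("weights.bin", os.O_RDONLY)
size = os.fstat(fd).st_size
mm = mmap.mmap(fd, size, access=mmap.ACCESS_READ)  # returns immediately

# Only the pages covering this slice are faulted in from disk
chunk = mm[512 * 1024 : 512 * 1024 + 16]
print(len(chunk), size)  # 16 1048576
```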
Weight pointers are just offsets
```c
void memory_map_weights(TransformerWeights *w, Config *p, void *ptr) {
    float* fptr = (float*)ptr;
    w->token_embedding_table = fptr;
    fptr += p->vocab_size * p->dim;
    // ...
    w->wq = fptr;
    fptr += p->n_layers * p->dim * (p->n_heads * p->head_size);
    w->wk = fptr;
    fptr += p->n_layers * p->dim * (p->n_kv_heads * p->head_size);
    // ...
}
```

Each weight pointer (`w->wq`, `w->wk`, etc.) is just an offset from the start of the tensor data. No copying, no deserialization, no memory allocation. The pointers point directly into the memory-mapped file.
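The same pointer walk can be mimicked in Python with NumPy views into one flat buffer. The two-tensor layout here is a made-up miniature, not the real Qwen3 ordering; the point is that no element is ever copied:

```python
import numpy as np

def carve_weights(buf, vocab, dim):
    # Walk one flat float32 buffer and hand out views, mirroring
    # the C pointer arithmetic. Views share memory with the buffer.
    flat = np.frombuffer(buf, dtype=np.float32)
    off = 0
    emb = flat[off : off + vocab * dim].reshape(vocab, dim)
    off += vocab * dim
    wq = flat[off : off + dim * dim].reshape(dim, dim)
    return emb, wq

# 8x4 "embedding" followed by a 4x4 "wq", as one contiguous byte blob
buf = np.arange(8 * 4 + 4 * 4, dtype=np.float32).tobytes()
emb, wq = carve_weights(buf, vocab=8, dim=4)
print(emb.shape, wq.shape)   # (8, 4) (4, 4)
print(float(wq[0, 0]))       # 32.0 -- wq starts right after the 32 emb values
```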
```mermaid
flowchart LR
subgraph disk ["GGUF file on disk"]
H["Header + metadata\n5.9 MB"]
EMB["embed_tokens\n600 MB"]
WQ["wq tensors\n~900 MB"]
WK["wk tensors\n~225 MB"]
REST["... remaining tensors"]
end
subgraph memory ["C pointers in memory"]
PEMB["w->token_embedding_table"]
PWQ["w->wq"]
PWK["w->wk"]
end
EMB -.->|"direct pointer\nvia mmap"| PEMB
WQ -.->|"direct pointer"| PWQ
WK -.->|"direct pointer"| PWK
```
Compare this to PyTorch loading, which:
- Opens the file
- Deserializes pickle objects (security risk)
- Copies tensor data into new Python/CUDA tensors
- Allocates GPU memory and transfers data
The mmap approach skips all of that. This is why qwen3.c can start generating text almost instantly.
Lesson 6: Converting between formats
In practice, you will encounter models in different formats and need to convert between them.
SafeTensors to GGUF
This is the most common conversion — taking a HuggingFace model and making it runnable by llama.cpp.
Using llama.cpp’s conversion script:
```shell
python convert_hf_to_gguf.py \
  --outfile model-f16.gguf \
  --outtype f16 \
  ./Qwen3-0.6B/
```

This reads the SafeTensors files + config.json + tokenizer files, and packages everything into a single GGUF file.
Quantizing a GGUF
Once you have a GGUF in float16 or float32, you can quantize it to smaller sizes:
```shell
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

This converts from 16-bit to ~4.5-bit quantization. The file shrinks from ~1.2 GB to ~400 MB, and inference gets faster because less data needs to move through memory.
GGUF to SafeTensors
Going the other direction (local format back to HuggingFace):
```shell
python convert_gguf_to_safetensors.py model.gguf --output ./model-hf/
```

This extracts tensors from the GGUF and saves them as SafeTensors files with the standard HuggingFace directory structure.
```mermaid
flowchart LR
subgraph hf ["HuggingFace Hub"]
SF["model.safetensors\n+ config.json\n+ tokenizer.json"]
end
subgraph local ["Local inference"]
GF32["model-f32.gguf"]
GF16["model-f16.gguf"]
GQ4["model-q4_k_m.gguf"]
end
SF -->|"convert_hf_to_gguf.py"| GF16
GF16 -->|"llama-quantize"| GQ4
GF32 -->|"llama-quantize"| GQ4
GQ4 -->|"convert_gguf_to_safetensors.py"| SF
```
Lesson 7: Why this matters for the rest of the roadmap
Understanding model formats is not just trivia. It connects to every step ahead:
| Roadmap step | How formats matter |
|---|---|
| Step 2 (current) | You loaded a GGUF file with mmap in C — now you know why that was fast and safe |
| Step 3: Inference engines | vLLM loads SafeTensors; llama.cpp loads GGUF. Different engines, different formats, same weights. |
| Step 3: Quantization | GGUF’s Q4/Q8 quantization is why models fit on laptops. Understanding the format helps you choose the right quant. |
| Step 4: Training | PyTorch saves checkpoints as SafeTensors. You will convert to GGUF for deployment. |
| Project Watch: Unsloth | Unsloth works with SafeTensors models in Python. The optimization happens at the GPU level, not the file level. |
The key insight: the same model weights can exist in multiple formats. SafeTensors for training and Python inference, GGUF for local C/C++ inference. The numbers are identical — only the container changes.
Why First Break AI starts with pure C
A natural question: why does Step 2 use a raw C binary instead of Python with HuggingFace?
The pedagogical argument
When you run inference in Python:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
output = model.generate(inputs, max_new_tokens=100)
```

Three lines. It works. But you have no idea what happened. The tokenizer, the attention mechanism, the KV cache, the sampling — all hidden behind `generate()`.
When you run inference in C:
```c
float* logits = forward(transformer, token, pos);
int next = sample(sampler, logits);
```

Every operation is visible. You can read `forward()` and see the matrix multiplications, the RMSNorm, the RoPE rotation, the softmax. There is no abstraction to hide behind.
What pure C forces you to learn
Starting from C means you cannot skip understanding:
- Tokenization — you see the BPE merge algorithm as a loop, not a library call
- Chat templates — you see the exact string concatenation, the special token insertion
- Attention — you see Q @ K^T, softmax, @ V as explicit matrix operations
- KV cache — you see the cache arrays being filled and reused
- mmap — you see how the file becomes pointers into weight arrays
- Sampling — you see temperature scaling and top-p filtering as arithmetic
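As one example, the last bullet really is just arithmetic. Here is a NumPy sketch of temperature plus top-p (nucleus) sampling, structured the way a C loop would be; it is illustrative, not qwen3.c's exact code:

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    # Temperature: sharpen (<1) or flatten (>1) the distribution, then softmax
    z = (logits - logits.max()) / temperature
    probs = np.exp(z)
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p
    order = np.argsort(probs)[::-1]
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]
    renorm = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renorm))

logits = np.array([1.0, 4.0, 2.0, 0.5])
print(sample_top_p(logits, rng=np.random.default_rng(0)))
```

With one token overwhelmingly likely, the nucleus shrinks to that single token and the sampler becomes deterministic; with flat logits and high temperature, it behaves like a uniform draw over the kept set.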
Every concept lands differently when you have seen the code that implements it. When you later use HuggingFace, vLLM, or Unsloth, you know what they are abstracting over — because you built the raw version first.
The progression
```mermaid
flowchart LR
C["Step 2: Pure C\nSee every operation\nunderstand the math"] --> PY["Step 3: Python + engines\nvLLM, llama.cpp server\nunderstand the systems"]
PY --> OPT["Project Watch: Unsloth\nSee the optimizations\nunderstand WHY they're faster"]
```
C is the foundation. Once you understand what inference actually does at the lowest level, optimization and systems design make sense. If you start with model.generate(), you are building on sand — you do not know what you are optimizing or why.
This is the same approach Karpathy uses in llama2.c and llm.c — minimal C implementations that strip away all abstractions so you can see the math.
Summary table
| Format | Best for | Security | Quantization | Self-contained | Ecosystem |
|---|---|---|---|---|---|
| SafeTensors | Python ML workflows | Safe | External | No | HuggingFace, PyTorch, JAX |
| GGUF | Local inference, C/C++ | Safe | Built-in | Yes | llama.cpp, Ollama, LM Studio |
| PyTorch .bin | Legacy (avoid) | Unsafe (pickle) | External | No | PyTorch |
The model weights are just numbers. The format is just the container. Understanding the container helps you move fluently between the training world (SafeTensors) and the inference world (GGUF) — which is exactly what you will do as you progress through the roadmap.