the blog
Mark’s Blog
- 2026 · 06 · 10
  Constructing a Frontier Code Pretraining Dataset for Fun and Profit
  Reconstructing NVIDIA’s 747B-token code dataset from 623M rows of metadata, by way of a 50-billion-node traversal of the Software Heritage Archive.
Plate 4 · Dispatches
1 entry