Constructing a Frontier Code Pretraining Dataset for Fun and Profit
As part of their Nemotron open pretraining
efforts, NVIDIA released a “dataset” of 747.4
billion code tokens [1]. Or, more precisely, they “release metadata to
reproduce a 747.4B token curated code dataset.”
Once you take a look at the code pretraining sample
they’ve so generously provided, you’ll see
this metadata is more a white elephant than a generous
contribution to the open-source community. What they
provide is ~623 million rows of metadata with 3 columns:
the repository slug (e.g.
villemt/Coffeebone), the relative path of
the source file (e.g. app/Bootstrap.java),
and the first 7 characters of the commit hash where the
file can be found (e.g. 5d81d77). Finding
the original source files based on this information is
neither fun nor, as I am sure NVIDIA’s lawyers
would be happy to remind me pursuant to the NVIDIA Data
Agreement, profitable.
Why you can’t just scrape GitHub
NVIDIA describes the dataset as being curated from
GitHub—so surely that’s the right place to
go? Well, you’ll quickly learn that GitHub has
been tightening their rate limits for both
unauthenticated and authenticated users for the past few
years; and even if one felt brave enough to build a
botnet to scrape GitHub, their own downtime would
probably stop you first [2, 3, 4]. So what about publicly available clones of GitHub
like Google’s BigQuery public dataset or
gharchive [5, 6]? The former covers only 2.8 million repositories (what
are we, amateurs? come back to me with some real big
data); the latter, only events (e.g. commits, issues,
forks). However, gharchive’s limited scope is
helpful for highlighting that there are actually two
parts to this reconstruction problem. First, identifying
which exact artifact we’re looking for.
villemt/Coffeebone, commit beginning with
5d81d77, is actually not a very helpful
piece of information, for two separate reasons. The repo
slug isn’t unique over time: GitHub usernames can
be reclaimed after 90 days, so a slug like
johnsmith/website could plausibly point to
several different repos. And the short7 prefix (the first
seven hex characters of the commit hash) of course
isn’t unique either: the same seven
characters show up in many distinct full commit hashes.
So, we need to use both pieces of information together
to disambiguate. Second, once we’ve somehow
identified what artifact we’re looking for,
identifying where we can grab the exact file we’re
looking for—ideally at a very high speed.
gharchive might have the information to solve our first
problem, but we’ll be rate-limited by GitHub the
second we try to solve the second. So, is there some
dataset that’ll allow us to do both?
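To put a rough number on the short7 ambiguity, here is a back-of-the-envelope sketch. It assumes commit hashes are uniformly distributed over prefixes and borrows the ~5 billion revision count quoted in the notes; the function and constant names are illustrative only.

```python
# Back-of-the-envelope: how ambiguous is a 7-hex-character commit prefix?
# Assumption: full commit hashes are uniformly distributed over prefixes.

HEX_CHARS = 7
N_PREFIXES = 16 ** HEX_CHARS      # number of distinct short7 values
N_REVISIONS = 5_000_000_000       # ~5 billion revisions (figure from the notes)


def expected_revisions_per_prefix(n_revisions: int, n_prefixes: int) -> float:
    """Average number of archived revisions sharing any one short7."""
    return n_revisions / n_prefixes


if __name__ == "__main__":
    avg = expected_revisions_per_prefix(N_REVISIONS, N_PREFIXES)
    print(f"distinct short7 values: {N_PREFIXES:,}")
    print(f"expected revisions per short7: {avg:.1f}")
```

Under those assumptions a short7 alone matches well over a dozen archived revisions on average, which is why the slug has to be part of the lookup.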
Enter the Software Heritage Archive
Yes. Enter the Software Heritage Archive: the most awesome (both in the sense of awe-inspiring scale and the valuable work they’re doing) archival project I’ve had the pleasure/pain of learning about over the past few months [7]. Supported by Inria and UNESCO, they are working to “collect, preserve, and share [8] all software that is publicly available in source code form.” They work with GitHub (among many others) directly to get access to the code hosted on the platform, and transform it into a Merkle DAG representation with a few node types we care about a lot (see the simplified visual representation below or the official representation [9]). Going from the leaves up we have: contents (i.e. blobs, the raw content of files or more often their hashes), directories, revisions (i.e. commits), snapshots (the revisions of all the branches of the repo when archived), and origins (the place where code was snapshotted from, e.g. github.com/muchanem/flac_mini). Generally, these objects are identified with Software Heritage IDs (SWHIDs), usually derived from their Git SHA-1 hash.
At the repo level in this DAG, it helps to think of what looks like two object types as three. There are directories (the top level directory or subdirectory in a repo), contents (the SWHID—i.e. the hash—of a file, not the file itself), and directory entries, the labeled edges that can point at either a directory or a content. The plaintext file and directory names live on those edges/entries, not on the nodes/directories/contents (nice gun you’ve got there Chekhov…).
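The names-on-edges point can be made concrete with a toy model. This is a minimal sketch, not SWH's actual schema: the `Entry`, `Directory`, and `resolve` names and the placeholder directory IDs are invented for illustration. The one real detail is the blob hash, which follows Git's `blob <size>\0` header convention.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from typing import Optional


def sha1_git_blob(data: bytes) -> str:
    """Git's blob hash: SHA-1 over a 'blob <size>' header, a NUL byte, then the bytes."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


@dataclass(frozen=True)
class Entry:
    """A labeled edge. The plaintext name lives here, not on the target node."""
    name: str
    target_kind: str  # "dir" or "cnt"
    target_id: str    # ID of the target directory, or hash of the target content


@dataclass
class Directory:
    """A directory node: just an ID plus its outgoing labeled edges."""
    dir_id: str
    entries: list[Entry] = field(default_factory=list)


def resolve(dirs: dict[str, Directory], root_id: str, path: str) -> Optional[str]:
    """Follow a relative path edge by edge; return the content hash, or None."""
    kind, node = "dir", root_id
    for part in path.split("/"):
        directory = dirs.get(node) if kind == "dir" else None
        if directory is None:
            return None
        match = next((e for e in directory.entries if e.name == part), None)
        if match is None:
            return None
        kind, node = match.target_kind, match.target_id
    return node if kind == "cnt" else None


if __name__ == "__main__":
    blob_id = sha1_git_blob(b"")  # hash of an empty file
    dirs = {
        # "root" and "d1" are placeholder IDs; real ones would be hashes too.
        "root": Directory("root", [Entry("app", "dir", "d1")]),
        "d1": Directory("d1", [Entry("Bootstrap.java", "cnt", blob_id)]),
    }
    print(resolve(dirs, "root", "app/Bootstrap.java"))
```

Note that neither node knows its own name: rename `app` and only the edge changes, which is exactly why a path lookup has to read edge labels.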
The Software Heritage Archive provides us an API (no, the rate limits are not high enough to just use that), a ~14TB compressed Boldi-Vigna/WebGraph representation of this DAG, and tables of these node types as ORC files [10], where the tables for e.g. just directory entries can be dozens of TBs. All we need to do is… a complex traversal of this graph (a ~50-billion-node DAG) for a set of nodes that are extremely sparse in the graph and have no statistical structure to exploit because of hashing. That sounds like we’ll need 14TB of RAM/SSD to compute this in anything like a reasonable time [11], but rest assured: with a little clever engineering, we’ll only need 1–2TB of RAM, 1–2TB of SSD storage, and about a dozen TB of spinning disk storage. Got your compute in hand? Let’s build this dataset!
Fig. 1 — a simplified Merkle DAG of the Software Heritage Archive
Part 1: which revision do we even want?
The first part of this problem (identifying which exact repo + commit we want or, in SWH’s language from here on, which revision we want) is not so bad, given you have the hardware described previously. The naive solution would be to load (or really mmap) the graph, and BFS from each origin to its revisions to find the one with the short7 we’re looking for. This approach is wrong because it ignores an important principle repeated throughout this effort: always minimize the number of sparse graph ops to be executed [12]. The sparse hops here are irregular, low-locality, and (because everything is keyed by hash) effectively random, so caches miss and nearly every hop is a random read to disk; a sequential scan over all nodes of a given type sidesteps all of that, and will always beat the sparse graph operation even when you can fit everything you care about into RAM/SSD. Instead, we want to decompose this into two subparts. First, a sequential scan over all revision nodes to determine which nodes reference commits with short7s in our dataset [13]. Second, we traverse the transposed graph to the origin nodes and, among the candidates sharing that short7, determine which revision actually belongs to the repo we’re looking for [14]; call this the winning revision. With my compute, I could get this done in about an hour.
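The two subparts can be sketched on toy data. This is a minimal illustration under stated assumptions, not the released code: the graph is a plain dict of reversed edges, each slug maps to exactly one origin node, and every function name and node ID is made up. The bit-vector filter is the variant described in the notes: one bit per possible short7, so 16^7 bits, or 32 MiB.

```python
from __future__ import annotations

from collections import deque
from typing import Iterable, Iterator

BITS = 16 ** 7  # one bit per possible short7: 2**28 bits = 32 MiB


def build_bitvector(short7s: Iterable[str]) -> bytearray:
    """Set one bit for every short7 that appears in the metadata."""
    bv = bytearray(BITS // 8)
    for s in short7s:
        i = int(s, 16)
        bv[i >> 3] |= 1 << (i & 7)
    return bv


def scan_revisions(revs: Iterable[tuple[int, str]], bv: bytearray) -> Iterator[tuple[int, str]]:
    """Subpart 1: one sequential pass, keeping revisions whose prefix bit is set."""
    for node_id, full_hex in revs:
        i = int(full_hex[:7], 16)
        if bv[i >> 3] & (1 << (i & 7)):
            yield node_id, full_hex[:7]


def origins_of(transposed: dict[int, list[int]], origin_ids: set[int], start: int) -> set[int]:
    """Subpart 2: BFS over reversed edges from a candidate revision up to origins."""
    seen, queue, found = {start}, deque([start]), set()
    while queue:
        node = queue.popleft()
        if node in origin_ids:
            found.add(node)
            continue
        for parent in transposed.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


def winning_revisions(wanted, candidates, transposed, origin_of_slug):
    """Match each (slug, short7) request to a candidate that reaches that slug's origin."""
    slugs_by_short7: dict[str, list[str]] = {}
    for slug, s7 in wanted:
        slugs_by_short7.setdefault(s7, []).append(slug)
    origin_ids = set(origin_of_slug.values())
    winners: dict[tuple[str, str], int] = {}
    for rev_id, s7 in candidates:
        reached = origins_of(transposed, origin_ids, rev_id)
        for slug in slugs_by_short7.get(s7, ()):
            if origin_of_slug.get(slug) in reached:
                winners.setdefault((slug, s7), rev_id)  # keep the first match seen
    return winners


if __name__ == "__main__":
    wanted = [("villemt/Coffeebone", "5d81d77")]
    # Two revisions share the short7; the hash suffixes are invented.
    revs = [(300, "5d81d77" + "a" * 33), (301, "5d81d77" + "b" * 33)]
    transposed = {300: [200], 200: [100], 301: [201], 201: [101]}
    origin_of_slug = {"villemt/Coffeebone": 100, "someone/else": 101}
    candidates = scan_revisions(revs, build_bitvector(s7 for _, s7 in wanted))
    print(winning_revisions(wanted, candidates, transposed, origin_of_slug))
```

The sparse traversal still happens, but only from the handful of candidates that survive the scan rather than from every origin.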
Part 2: walking the path (the headache)
The second part of this problem is where things get
interesting, and by interesting I mean a massive
headache. That principle about minimizing sparse graph
operations really bites when it comes time to traverse
the forward graph, following our relative paths down to
the content nodes. To traverse by plaintext name, we
have to index into
graph-labelled.labels—a 4TB
file—for the entire subtree rooted at each winning
revision (since directory and file names live on the
labeled edges, not nodes). I only had 1.5TB of SSD
before falling back to spinning disks mounted over a
network, and my early estimates had the operation taking
a month of 24/7 compute: the random-access pattern got
me about 5 reads/sec over NFS (I think my slurm
fairshare would’ve been tanked until the heat
death of the universe if I did this).
The workaround is to turn all that random access into a single sequential pass over the labels file, using two small in-memory sets to filter it. First, BFS the entire subgraph rooted at each winning revision (to a maximum depth of the deepest relative path we’re looking for) and store every directory entry (file or subdirectory) by node ID; call this our reachable set. Second, compute all possible path-segment names we care about (e.g. “etc” and “passwd” for “etc/passwd”); call this our requested components. Then scan the 4TB labels file sequentially, keeping only node IDs in the reachable set and labels in the requested components. What survives is small enough to build into a new graph over just the labeled directory entries we care about, which you can finally traverse down to its content leaves. Note that the content leaves are just the hashes of the files, and you’ll need to grab the actual content from S3 [15] (which carries its own challenges), but I’ll leave that as an exercise for the reader :)
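The filter-then-walk idea can be sketched on toy data. Everything here is an assumption made for illustration: the labels file is modelled as an iterable of `(source node, name, target node)` tuples, the unlabeled graph as a dict of successors, and the roots as the root directories of the winning revisions. The real file formats and the released code will differ.

```python
from __future__ import annotations

from collections import deque
from typing import Iterable, Optional


def reachable_set(children: dict[int, list[int]], roots: Iterable[int], max_depth: int) -> set[int]:
    """Depth-limited BFS over the unlabeled graph; returns every node ID reached."""
    seen = set(roots)
    frontier = deque((r, 0) for r in seen)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                frontier.append((child, depth + 1))
    return seen


def requested_components(paths: Iterable[str]) -> set[str]:
    """Every path segment we might need, e.g. {'etc', 'passwd'} for 'etc/passwd'."""
    return {part for p in paths for part in p.split("/")}


def scan_labels(labels: Iterable[tuple[int, str, int]], reachable: set[int], components: set[str]):
    """The single sequential pass: keep an edge only if it passes both filters."""
    kept: dict[tuple[int, str], int] = {}
    for src, name, dst in labels:
        if src in reachable and name in components:
            kept[(src, name)] = dst
    return kept


def resolve(kept: dict[tuple[int, str], int], root: int, path: str) -> Optional[int]:
    """Walk the small filtered graph from a root directory down to a content leaf."""
    node: Optional[int] = root
    for part in path.split("/"):
        node = kept.get((node, part))
        if node is None:
            return None
    return node


if __name__ == "__main__":
    paths = ["app/Bootstrap.java"]
    children = {1: [2, 3], 2: [4], 3: [5]}  # toy unlabeled graph
    labels = [(1, "app", 2), (1, "docs", 3), (2, "Bootstrap.java", 4),
              (3, "README.md", 5), (9, "app", 10)]  # toy stand-in for the labels file
    depth = max(len(p.split("/")) for p in paths)
    kept = scan_labels(labels, reachable_set(children, [1], depth), requested_components(paths))
    print(resolve(kept, 1, paths[0]))
```

The design choice is that the only pass over the large file is sequential, while both filters are in-memory set lookups; the random access is deferred to the small surviving graph.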
In my own run, the path-walk resolved ~99.99% of the rows it reached, and it can easily run within a single day. I’ve open-sourced the code for doing this yourself here. Unfortunately, NVIDIA’s license doesn’t allow me to release the raw IDs publicly. Please do reach out with any ideas, questions, or feedback!
Notes & references
[1] huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample
[2] docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api
[3] github.blog/changelog/2025-05-08-updated-rate-limits-for-unauthenticated-requests
[4] mitchellh.com/writing/ghostty-leaving-github
[5] cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code
[6] gharchive.org
[7] softwareheritage.org
[8] Emphasis on the share part! An archive is only useful if it is held in common and arguably is most likely to survive when disseminated widely. More projects like this and less putting all your code in some glacier for PR please.
[9] docs.softwareheritage.org/devel/swh-model/data-model.html#data-model. I have some quibbles with how they define their data model because I think it is often obfuscatory of the structure of their artifacts, but that’s a rant for another day.
[10] The ugly, jealous, less supported and less successful contemporary of Parquet. Can you tell how much time I’ve spent fighting this file format?
[11] This problem took me places you wouldn’t go with a gun: a graduate course on systems architecture with a former Intel VP where I programmed a toy version of this problem for an experimental graph computer and left with an unreasonable amount of knowledge about graph computation on both commodity and non-commodity hardware. The structure that computers have converged on feels almost adversarially designed against this sort of sparse graph computation. The basic operation we’re doing here is irregular, data-dependent, low-locality, and is literally random because of the hashing. Caches don’t hit, huge cache lines are wasted, branch prediction is useless, prefetching doesn’t help, etc. Even though our units of work are tiny and duplicated (suggesting maybe a GPU is helpful), our data dependency rules it out. Since we’re getting no cache hits we’re bound by the random read latency of wherever our data lives: DRAM is already brutal (~100ns), but in reality storage hierarchies mean for a graph of this size you’re constantly going to HDD. The parallelism story is worse than just data dependence: real graphs are skewed with power law distributions in degree, meaning keeping work balanced is hard and the optimal algorithm can change throughout the problem. Scaling out beyond a single node hits even worse communication and synchronization issues. And not to worry, the programming model is almost as adversarial, with no way to express latency-hiding async patterns cleanly.
[12] In particular, you will end up traversing a large portion of the subtree rooted at the origin (treating revisions as leaves) for your ~75 million unique (repo, short7) pairs. Not all snapshots have all revisions, so this traversal is worse than having to read in expectation 50% of the revisions under one snapshot. Worse yet, there’s no locality in the data structure: nodes are hashed and ordered to optimize for compressibility, not BFS locality. If that wasn’t enough, you don’t have enough memory to keep the data structure to convert from node IDs to SWHIDs resident, so you’re constantly going to disk for your comparisons.
[13] There are two ways to go about this. You could go for a more map-reduce flavored approach by splitting the ~5 billion revision nodes into n shards (I used 256) and then doing a join over the shards. Alternatively, since you’ll generally be I/O bottlenecked while reading over all revision nodes and bucketing (in my case the data lives on a remote mount with at best 300MB/s of throughput, but this operation is so compute-light that I think this should generally be the case), you can build a bit-vector over your short7s on the Nemotron side, and just filter the SWHIDs as you scan over the data on disk.
[14] In practice there can be multiple commits with the same prefix in the same repo: you can be principled and keep both, or just keep the later/main branch/head commit/commit the code sees first and call it a day.
[15] The content leaf’s SWHID is a sha1_git (git’s blob hash), but SWH’s S3 object store is keyed by plain sha1, so you can’t fetch straight from the leaf; you first have to translate sha1_git → sha1 via SWH’s content table, which isn’t part of the graph.