I built a C++20 zero-copy graph engine to stream 50GB PyTorch datasets using mmap and nanobind.


Hi r/cpp,

I’m an undergrad CS student and I recently open-sourced GraphZero (v0.2). It's a zero-copy data engine designed to keep PyTorch from running out of memory when training massive Graph Neural Networks.

I wanted to share the architecture here because getting a C++20 extension compiling across Windows, Linux, and macOS in CI/CD was an absolute trial by fire.

The Architecture: To bypass Python's memory overhead, the engine compiles raw datasets into a custom binary format. It then uses POSIX mmap (and the Windows equivalents) to map the files directly from the SSD. Using nanobind, I expose the raw C++ pointers to PyTorch as zero-copy NumPy arrays, so the OS streams the data on demand via page faults while PyTorch trains the model.

Under the hood:

  • Template Dispatching: Used heavily in the feature store to enforce FLOAT32 and INT64 memory layouts at compile time.
  • Concurrency: Used OpenMP to multi-thread the graph traversal and neighbor sampling, releasing the Python GIL so the C++ side can saturate the SSD bandwidth.
  • The Apple Clang Trap: I used C++17's std::from_chars to parse CSVs without heap allocations. It worked perfectly on GCC and MSVC, but I discovered the hard way that Apple's libc++ still hasn't implemented from_chars for floating-point numbers, forcing me to write a compile-time fallback macro just to get the macOS runner to pass.

If anyone here has experience with high-performance C++ Python extensions, I would absolutely love a code review. Specifically, I'm looking for critiques on:

  1. The template dispatching implementation.
  2. How I handled the memory mapping abstraction.

GitHub Repo: repo

submitted by /u/Important-Trash-4868 to r/cpp