Original post

On January 12th 2020, reposurgeon performed a successful conversion of its biggest repository ever – the entire history of the GNU Compiler Collection, 280K commits with a history stretching back through 1987. Not only had some parts of that history lived in CVS; the earliest portions predated CVS and had been stored in RCS.

I waited this long to talk about it to give the dust time to settle on the conversion. But it’s been 5 weeks now and I’ve heard nary a peep from the GCC developers about any problems, so I think we can score this as reposurgeon’s biggest victory yet.

The Go port really proved itself. Those 280K commits can be handled on the 128GB Great Beast with a load time of about two hours. I have to tell the garbage collector to be really aggressive – set GOGC=30 – but that’s exactly what GOGC is for.

The Go language really proved itself too. The bet I made on it a year ago paid off handsomely – the increase in throughput from Python is pretty breathtaking, at least an order of magnitude, and it would have been far more if it weren’t constrained by the slowness of svnadmin dump. Some of that was improved optimization of the algorithms – we knocked out one O(n**2) after translation. More of it, I think, was the combined effect of machine-code speed and much smaller object sizes – that reduced the working set a great deal, meaning cache misses got less frequent.

We also got a lot of speedup out of various parallelization tricks. This deserves mention because Go made it so easy. I wrote – and Julien Rivaud later improved – a function that runs a specified functional hook on the entire set of repository events, either farming them out to a worker pool sized to your machine’s number of processors or (with the “serialize” debug switch on) running them serially.

That function is 35 lines of easily readable Go, and we got no fewer than 9 uses out of it in various parts of the code! I have never before used a language in which parallelism is so easy to manage – Go’s implementation of Communicating Sequential Processes is nothing short of genius and should be a model for how concurrency primitives are done in future languages.

Thanks where thanks are due: when word about the GCC translation deadline got out, some of my past reposurgeon contributors – notably Edward Cree, Daniel Brooks, and Julien Rivaud – showed up to help. These guys understood the stakes and put in months of steady, hard work along with me to make the Go port both correct and fast enough to be a practical tool for a 280K-commit translation. Particular thanks to Julien, without whose brilliance and painstaking attention to detail I might never have gotten the Subversion dump reader quite correct.

While I’m giving out applause I cannot omit my apprentice Ian Bruene, whose unobtrusively excellent work on Kommandant provided a replacement for the Cmd class I used in the original Python. The reposurgeon CLI wouldn’t work without it. I recommend it to anyone else who needs to build a CLI in Go.

These guys exemplify the best of what open-source collegiality can be. The success of the GCC lift is almost as much their victory as it is mine.