It’s a fairly well-known fact that Subversion uses storage poorly compared with newer version control systems. Disk space is cheap though, so I simply reserve plenty of it for this purpose. I never thought twice about the size of an SVN checkout or of any of my own repositories, knowing they would be large. This changed when I started a new project that required syncing the entire WordPress plugins SVN repository, and not just because I knew it was in the running for the largest SVN repository by number of commits.
When I first started syncing the plugins SVN repo, I had already used Mark Jaquith’s plugin directory slurper, which downloads the latest copy of every single plugin in the repository so they are readily available to scan for statistics about core API usage and general programming habits among plugin authors. This comes out to about 7GB uncompressed. However, this was just the latest copy of all plugins; it didn’t include the code history. Given the typical nature of SVN, and the 7GB figure I had to go on, I originally estimated that the synced mirror might be between 40GB and 60GB. This is pretty big, but certainly not a problem for today’s drives.
So I started the sync, and let it run for a couple days. I had the first 80,000 revisions done, so it was time to revise my estimates for a more accurate number. With the first 80,000 revisions, the total repository size on disk was right around 13GB. So, you figure at 8 times as many revisions (640,000 – close to the total revision count currently), you might have somewhere around 104GB. Well, this is certainly much higher than I had anticipated, but this still doesn’t pose a problem.
I let the sync run for a couple more days. I now had 160,000 revisions, and the total repository size on disk was 46GB. Wait a minute, shouldn’t that have been about 26GB? That’s not anywhere close to my estimates. I would expect some deviation to account for better graphics and some other minor things taking up more space, but not anything that big. At this rate, I knew I was looking at a total size probably over 200GB. This kind of behavior doesn’t happen with any of my other SVN repos. It was now time to do some investigation.
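The arithmetic behind that surprise is easy to check. The short Python sketch below uses only the figures quoted above; the variable names are mine, and the linear model is the assumption being tested, not a claim about how Subversion behaves.

```python
# Figures quoted in the post (sizes in GB).
first_revs, first_size = 80_000, 13.0
second_revs, second_size = 160_000, 46.0
target_revs = 640_000

# Linear assumption: size scales in proportion to the revision count.
gb_per_rev = first_size / first_revs
linear_total = gb_per_rev * target_revs
linear_at_second = gb_per_rev * second_revs

# How far the second measurement overshoots the linear prediction.
overshoot = second_size / linear_at_second

# Space used by the second 80,000 revisions relative to the first 80,000.
half_ratio = (second_size - first_size) / first_size

print(f"linear estimate at 640,000 revisions: {linear_total:.0f}GB")    # 104GB
print(f"linear estimate at 160,000 revisions: {linear_at_second:.0f}GB")  # 26GB
print(f"measured / predicted at 160,000: {overshoot:.2f}")               # 1.77
print(f"second half / first half: {half_ratio:.2f}")                     # 2.54
```

The second batch of 80,000 revisions took about two and a half times the space of the first batch, which is what rules out a simple linear model.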
Below you can see a graph of the total repo size over the course of adding in each new FSFS pack. One single pack represents 1,000 revisions, so pack 80 would put the repository at revision 80,000.
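For anyone who wants to reproduce this kind of measurement on their own mirror, here is a minimal Python sketch. It assumes a packed FSFS repository where each pack lives in a directory named `db/revs/<N>.pack`; that layout and the function names are my assumptions, and the script only counts the packed revision data, not the rest of the repository directory.

```python
from pathlib import Path


def pack_sizes(repo_path):
    """Return [(pack_number, bytes_on_disk), ...] sorted by pack number."""
    revs_dir = Path(repo_path) / "db" / "revs"
    sizes = []
    for pack_dir in revs_dir.glob("*.pack"):
        number = int(pack_dir.name.split(".")[0])
        # Add up every file inside the pack directory.
        total = sum(f.stat().st_size for f in pack_dir.iterdir() if f.is_file())
        sizes.append((number, total))
    return sorted(sizes)


def cumulative(sizes):
    """Turn per-pack sizes into a running total, one entry per pack."""
    running = 0
    totals = []
    for number, size in sizes:
        running += size
        totals.append((number, running))
    return totals
```

Plotting the output of `pack_sizes` gives the per-pack line, and `cumulative` gives the total size curve.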
This paints a much better picture of what is going on, but still doesn’t tell us why. The accelerating growth in repository size is very obvious now, and what’s more interesting is that it follows a very predictable and stable curve. This has nothing to do with crazy and reckless commits by destructive plugin authors (you do see that happen in pack 80, but it doesn’t even make a dent in the overall repo growth). Taking this growth into account, my new estimate for total repo size at 650,000 revisions was 450GB, roughly ten times what I had originally expected.
An experienced Subversion administrator should be able to tell you what’s wrong here. This graph clearly shows consistent growth of pack sizes where there should be little or none. The blue line should never break above maybe 50MB except for that reckless commit in pack 80, and if it ever does, as in pack 80, the next pack should not be affected in any way.
The graph also clearly shows that whatever is growing is contained in just about every single pack in the SVN repository. At this point it should be easy to find, since it accounts for more than 95% of the contents of every commit beyond revision 60,000 or so. So one way to find the cause is to pick out a single revision, preferably with just a one-line change, and identify what SVN is storing in that revision that accounts for 95% of its contents but isn’t related to the actual changes made.
So, I picked out revision 178,012. It’s a single-line change to bump the stable tag of the “infinite-scroll” plugin by its maintainer, paul.irish. The raw FSFS revision file contains 43,657 lines of data for a total of 576KB. About 10 lines, including one binary delta, are the stable tag bump to “/infinite-scroll/trunk/readme.txt”. About 30 lines are the directory listing for the “/infinite-scroll/trunk” node that contains this file, mostly identifying the FSFS node IDs of all files and directories in that node, including readme.txt. Another 20 lines are the listing for the “/infinite-scroll” node, with 3 entries identifying the FSFS node IDs of the branches, tags, and trunk nodes under the plugin node. Finally, the remaining 43,587 lines are the listing for the root repository node (“/”), containing the FSFS node ID of every directory in the root. That happens to be a list of every single plugin in the SVN repository, and it accounts for well over 99% of the entire revision.
It turns out that Subversion’s storage mechanism writes out the full listing of every changed node with every revision, sibling nodes included. Every single commit changes something beneath the root repository node, so every single commit contains the list of all plugins in the repo. Each time a new plugin is added to the repository, its name and FSFS node ID are added to that list for every commit from that point forward.
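To see why this produces a curve rather than a straight line, here is a toy model in Python. It is only a sketch: the function name and all of the parameters (one new plugin every 16 commits, 50 bytes per entry in the root listing, 1,000 bytes for the actual change) are made-up illustrative values, not measurements from the plugins repository.

```python
def simulate(total_revs, new_plugin_every, entry_bytes, change_bytes):
    """Return cumulative repository bytes, sampled every 1,000 revisions.

    Each commit is modeled as its own change plus a fresh copy of the root
    listing, which has one entry per plugin that exists at that point.
    """
    plugins = 0
    total = 0
    samples = []
    for rev in range(1, total_revs + 1):
        if rev % new_plugin_every == 0:
            plugins += 1  # hypothetical steady rate of new plugins
        total += change_bytes + plugins * entry_bytes
        if rev % 1000 == 0:
            samples.append(total)
    return samples


# Hypothetical parameters, chosen only to illustrate the shape of the curve.
samples = simulate(total_revs=160_000, new_plugin_every=16,
                   entry_bytes=50, change_bytes=1_000)
ratio = samples[159] / samples[79]
print(f"size at 160,000 revisions / size at 80,000 revisions: {ratio:.2f}")
```

Under these assumptions, doubling the number of revisions multiplies the total size by just under 4 instead of 2, because the size of each commit keeps growing along with the plugin list.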
How do we fix it?
Any solution to this problem is going to involve some painstaking changes to the way the plugins repository infrastructure works. So the short answer is that it’s going to take years to fix. However, if nothing is done in the next two years, the SVN repository will double in size to about 900GB, and its performance will quickly degrade as the server takes longer to read revisions and the filesystem cache can no longer be used effectively (which I suspect is already the case now). We can continue to throw new hardware at the problem. This is expensive though, and there is a point where even that can’t solve the problem anymore.
Thankfully, the svnadmin dump utility typically used to make backups does not have this storage problem. Revision 178,012 mentioned above, which takes up 576KB in the repo, is only 661 bytes in a dump. A dump of the entire repository should only be about 30GB. So performing backups does not pose a problem with this repository (other than the length of time it takes to dump such an inefficient repository).
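A comparison like this can be scripted. The Python sketch below shells out to `svnadmin dump` for a single revision and measures the output. I am assuming an incremental dump (`--incremental`) is the fair comparison, since it contains only the changes made in that revision; the function names and the repository path in the usage note are placeholders of mine.

```python
import subprocess


def dump_command(repo_path, revision):
    """Build an svnadmin command that dumps only the changes in one revision."""
    return ["svnadmin", "dump", "--quiet", "--incremental",
            "-r", str(revision), str(repo_path)]


def dump_size(repo_path, revision):
    """Return the size in bytes of the incremental dump of one revision."""
    result = subprocess.run(dump_command(repo_path, revision),
                            check=True, capture_output=True)
    return len(result.stdout)


# Figures from the post, taking 1KB as 1,024 bytes (my assumption).
overhead_factor = 576 * 1024 / 661
print(f"the stored revision is about {overhead_factor:.0f}x the size of its dump")
```

On a machine with the mirror available, `dump_size("/path/to/mirror", 178012)` would return the dump size for the revision discussed above.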
Our first step should at least be setting up new plugins in their own repositories, assuming WordPress continues to support SVN for plugins in the future. That way only commits to existing plugins would continue to be wasteful, and the per-commit overhead would stop growing as new plugins are added.
We could export existing plugins into new repositories. However, this would have to be a decision made by the plugin maintainer, since it requires a new checkout URL and might even require rewriting revision numbers (although the svndumpfilter utility can leave empty revisions in place to preserve revision numbers). We could probably do this automatically, without any complaints, for any plugin without a commit in the last 2 years, and with that option we could even go as far as using svndumpfilter to proactively remove those nodes from the legacy repo. That could easily cut the repository size in half, speeding it back up significantly.
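As a rough illustration of what such an export could look like, here is a Python sketch that pipes `svnadmin dump` through `svndumpfilter include` into `svnadmin load`. The function names and paths are hypothetical, the target repository is assumed to have been created already with `svnadmin create`, and I have not run this against the plugins repository. By default svndumpfilter keeps empty revisions, which preserves revision numbers; the `--drop-empty-revs` and `--renumber-revs` options do the opposite.

```python
import subprocess


def split_commands(legacy_repo, plugin, new_repo, keep_revision_numbers=True):
    """Build the dump, filter, and load commands for one plugin."""
    dump = ["svnadmin", "dump", "--quiet", str(legacy_repo)]
    filt = ["svndumpfilter", "include"]
    if not keep_revision_numbers:
        # Drop revisions that no longer touch anything, then renumber the rest.
        filt += ["--drop-empty-revs", "--renumber-revs"]
    filt.append(plugin)
    load = ["svnadmin", "load", "--quiet", str(new_repo)]
    return dump, filt, load


def run_split(legacy_repo, plugin, new_repo, keep_revision_numbers=True):
    """Run dump | filter | load and report whether every step succeeded."""
    dump, filt, load = split_commands(legacy_repo, plugin, new_repo,
                                      keep_revision_numbers)
    dumper = subprocess.Popen(dump, stdout=subprocess.PIPE)
    filterer = subprocess.Popen(filt, stdin=dumper.stdout,
                                stdout=subprocess.PIPE)
    dumper.stdout.close()  # the filter now owns the read end of the pipe
    loader = subprocess.run(load, stdin=filterer.stdout)
    filterer.stdout.close()
    codes = [dumper.wait(), filterer.wait(), loader.returncode]
    return all(code == 0 for code in codes)
```

Keeping the empty revisions makes the new repository as long as the old one in revision count, but each of those empty revisions is tiny, so the trade-off favors preserving the numbers people already reference.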
Just to quickly get this out of the way, I know someone is thinking “what about a migration to git?” Let me clarify that while I’m all for adding git support to Extend, it cannot be forced on everyone who already has plugins in the repo, and we certainly shouldn’t offer only git for new plugins either. You are right, it would help a little bit, but it doesn’t solve the problem. It would also take significantly longer to implement than any other solution here, and we don’t have a lot of time to solve this.