It’s great having loads of scanners to turn paper and ink into 1s and 0s, but you gotta process what is digitised before it can go up on the Internet Archive. Turning those massive 600dpi TIFFs into 300dpi JPEGs that go into a PDF for screen viewing, then compressing the original TIFFs as much as possible for long term archival requires a fair amount of CPU grunt.
My little HP ML10v2 with an i7-4770 couldn’t handle all the scanned images I was throwing at it. I’d spend a day scanning, end up with around 200GB of TIFFs and the poor thing would choke and become unresponsive unless I let it process the data in chunks instead of all at once. Sure, I could script it or manually process the chunks then let it go for a few days doing its thing, but the same machine was also my Plex box and file server so I had a great excuse to buy a new computer (I love buying computers) with the best bang for buck CPU performance I could find!
That quest led me to purchase a HP DL560 G8 rackmount server. It has 4x E5-4610 v2 CPUs in it, each with 8 cores, for a total of 32 cores and 64 threads. I got it with 256GB of RAM, dual 10GbE NICs, P420i 1GB RAID controller, dual 1200W PSUs and 5x 2.5″ genuine HP SAS/SATA caddies for a bee’s dick under $2,000 including shipping & GST from Bargain Hardware in the UK. I then added 5x 512GB Silicon Power SATA SSDs for $88 each. All up I spent just under $2500 on this server.
On a core for core basis, I couldn’t find anything better value than the HP DL560 G8. $2000 divided by 32 is $62.50/core. A 4th-gen quad-core i7 SFF box (e.g. Optiplex, EliteDesk) sells for around $300, so $75/core. Building an 8-core Ryzen 7 1700 box with a mix of new and used parts would be at least $600, also $75/core. Then I’d have to build the machine and it wouldn’t come with any sort of remote management. Not to mention the headache of managing multiple machines versus a single machine.
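For anyone wanting to run the same sums on other hardware, here is a minimal Python sketch of the cost-per-core comparison above. The prices and core counts are the ones quoted in this post; the `cost_per_core` helper and the `options` table are just illustrative names.

```python
# Cost per physical core for the three options compared above.
# Prices (AUD) and core counts are the figures quoted in the post.
options = {
    "HP DL560 G8 (4x E5-4610 v2)": (2000, 32),
    "4th-gen i7 SFF box": (300, 4),
    "Ryzen 7 1700 build": (600, 8),
}


def cost_per_core(price, cores):
    """Return the price divided by the number of physical cores."""
    return price / cores


if __name__ == "__main__":
    for name, (price, cores) in options.items():
        print(f"{name}: ${cost_per_core(price, cores):.2f}/core")
```

Swap in the full $2500 spend (server plus SSDs) and the DL560 comes out at roughly $78/core, so the comparison only favours it on the bare server price.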
$2500 is still a lot of money, but it’ll hopefully have solid resale value when I inevitably get bored of it. With all those CPUs and RAM it is a virtualisation beast, perfect for a homelab setup. Shit, I should be using this COVID-19 downtime to upskill and finally get some sort of formal computer qualifications that aren’t over 15 years old. This server could come in handy for that.
Luckily I don’t have buyer’s remorse over the $2500 outlay because the DL560 G8 performs better than I expected. Idle power isn’t that bad at around 200W, and under full load doing all the image processing it doesn’t hit more than 600W. I keep it in the garage where right now it’s nice and cold, so the fans idle very quietly and even under load they don’t crank too hard as the ambient temps are low this time of the year. Summer will be a very different experience.
56 is the most concurrent processing jobs I’ve thrown at it so far (this is the command I run to process the scans), which isn’t even one job for each thread this machine is capable of, so it’s unsurprising that it chewed through all the images easily. What would have taken 3-4 days on the i7-4770 can be done overnight on the DL560 G8. I’ve got some more scanners on the way that’ll generate about 1TB of data a day, so it’ll be interesting to see how many simultaneous commands the DL560 can handle before shitting the bed.
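The actual processing command isn’t reproduced here, so what follows is only a rough Python sketch of how a pile of scans could be fanned out across a fixed number of concurrent jobs. Everything in it is an assumption for illustration: `build_command`, `run_job` and `process_all` are made-up names, and the command is a placeholder that just echoes the file name rather than converting anything.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_JOBS = 56  # the highest concurrency mentioned in the post


def build_command(tiff_path):
    # Placeholder only: stands in for the real per-scan processing command.
    # It starts a child Python process that prints the path it was given.
    return [sys.executable, "-c", "import sys; print(sys.argv[1])", tiff_path]


def run_job(tiff_path):
    # Run one external process and report (path, exit code).
    result = subprocess.run(build_command(tiff_path), capture_output=True, text=True)
    return tiff_path, result.returncode


def process_all(tiff_paths, max_jobs=MAX_JOBS):
    # Threads are enough here because the heavy lifting happens in the
    # child processes; each thread just waits for its process to finish.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run_job, tiff_paths))


if __name__ == "__main__":
    scans = [f"scan_{i:04d}.tif" for i in range(8)]  # hypothetical file names
    for path, code in process_all(scans):
        print(path, "ok" if code == 0 else f"failed ({code})")
```

Capping `max_jobs` below the 64 available threads leaves a bit of headroom for the OS and disk I/O, which is the same idea as stopping at 56.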
As much as I like the DL560, the thing that shits me the most is that I can’t put an NVMe PCIe SSD in any of the PCIe slots without the server freaking out and ramping the fans up to 50% even if there’s no load. I used to have a DL380 G8 that had an SSD in the PCIe slot and it worked fine, stupid me for thinking the DL560 would work the same. It’s a shame as I would have loved to fill all its slots (eww) by installing 6x 1TB PCIe SSDs for a fucking beefcake RAID setup. Alas, I’m restricted to 5x SATA SSDs. I boot off a USB 3.0 drive and keep the SATA disks as a RAID-0 array just for processing data which I’m not a fan of and will whinge about later.
The other thing that pissed me off with the DL560 G8 is Intel’s fault. My original plan was to buy this server from the UK and then chuck in 4x E7-4880 v2 15-core CPUs for an awesome 60 core/120 thread setup. These CPUs are only $95 ea! Intel’s CPU product database says the E7-4880 v2 and E5-4610 v2 use the same LGA-2011 socket, so as long as HP haven’t done any BIOS fuckery, these super cheap 15-core CPUs should slide right in. No. No they don’t. The E7-4880 v2 is an LGA-2011-1 CPU (aka R1 socket), not an LGA-2011. They’re ever so slightly different, but Intel’s website doesn’t say that! Fucking pricks. This article on Anandtech explains the difference. Luckily I was able to return the CPUs so I wasn’t out of pocket, but the DL560 doesn’t have as many cores in it as I had hoped. Four E5-4657L v2 CPUs (12-core) are around $680 from Aliexpress with DHL shipping, which isn’t obscene, so I might be tempted one day.
There’s two minor bottlenecks with the DL560 in my workflow I’d like to remove. Getting files in and out of the DL560 is kinda slow due to me only having a gigabit network. I’m tempted to buy a little 10GbE switch so bandwidth between my ingestion PC, file server (where the bzip2 archives are copied to) and the DL560 is nice and fast, but then I’d also need to upgrade the disks in my file server as the solo 8TB SMR HDD is barely any faster than gigabit speeds anyway. Now the costs are mounting up and I’ve already spent enough money on this…
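To put rough numbers on the network bottleneck, here is a back-of-the-envelope Python sketch. It assumes ideal line rate with no protocol or disk overhead, so real transfers will be slower; the 200GB and 1TB payloads are the daily volumes mentioned earlier in this post, and `transfer_seconds` is just an illustrative helper.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Best-case transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


GIGABIT = 10**9        # 1 Gbit/s
TEN_GIGABIT = 10**10   # 10 Gbit/s

if __name__ == "__main__":
    for label, size in (("200GB", 200 * 10**9), ("1TB", 10**12)):
        for name, speed in (("1GbE", GIGABIT), ("10GbE", TEN_GIGABIT)):
            minutes = transfer_seconds(size, speed) / 60
            print(f"{label} over {name}: about {minutes:.0f} min at line rate")
```

Even in the best case a 1TB day is over two hours of copying on gigabit, which is why the 10GbE switch is tempting, provided the disks at the other end can keep up.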
The second bottleneck is my internet connection. I’m blessed (for Australia at least) to have NBN FTTP. I currently have a 250/100 connection which is nice, but when you’re uploading 40+ 5-10GB files simultaneously, even the Internet Archive’s notoriously slow servers sometimes reckon uploads have timed out. I would switch to Superloop’s 200/200 plan but I have heaps of Aussie Broadband credit via referrals I’ve earned. I could spend the $220/m and have two glorious internet connections going for uploads, but once again – “the costs are mounting up and I’ve already spent enough money on this”.
The final thing I should probably look into is some form of backup for the data sitting on the server before it gets processed. It’s usually not living there for more than 1-2 days so it’s not like I’d be losing months of work if something happened, but the disks are in RAID-0 and they are SSDs and it would still be a pain in the arse to lose a day’s work. But RAID-5 was so slow, and all the other RAID methods (e.g. RAID10) chew up too much disk space, not leaving enough for storage and processing of a day’s worth of work. I could buy bigger SSDs, 4x 1TB drives in RAID10 would still be 2TB (and I could use a 500GB as a boot drive), however “the costs are mounting up and I’ve already spent enough money on this”. Maybe I’ll use rsync with compression to copy everything across once an hour to an external drive or a different network drive or something, just for peace of mind.
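The space trade-off between those RAID levels is easy to sanity check. This is a minimal Python sketch using the textbook capacity formulas (RAID-0 keeps everything, RAID-5 loses one disk to parity, RAID-10 loses half to mirroring); it ignores controller and filesystem overhead, and `usable_gb` is an illustrative helper rather than anything the P420i exposes.

```python
def usable_gb(level, disks, size_gb):
    """Usable capacity in GB for equal-sized disks at a given RAID level."""
    if level == "0":
        return disks * size_gb            # striping only, no redundancy
    if level == "5":
        if disks < 3:
            raise ValueError("RAID-5 needs at least 3 disks")
        return (disks - 1) * size_gb      # one disk's worth of parity
    if level == "10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID-10 needs an even number of disks, 4 or more")
        return disks // 2 * size_gb       # every disk is mirrored
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    print("5x 512GB RAID-0: ", usable_gb("0", 5, 512), "GB")
    print("5x 512GB RAID-5: ", usable_gb("5", 5, 512), "GB")
    print("4x 512GB RAID-10:", usable_gb("10", 4, 512), "GB")
    print("4x 1TB RAID-10:  ", usable_gb("10", 4, 1000), "GB")
```

With the current 512GB drives, RAID-10 can only use four of the five disks and leaves about 1TB, which is the "chews up too much disk space" problem in numbers.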
There are some cheap Aussie barebones DL560 G8s on eBay right now that, if I could score them at the right price, could give me some additional processing power at only $30/core!! I’d need to get some centralised storage for all the servers, so that would mean a new file server just for the scans, maybe a DL380 G8 with a few SSDs in the PCIe slots and 3.5″ HDDs in the drive bays. No no no, bad Anthony, bad, “the costs are mounting up and I’ve already spent enough money on this”.