Cryptocurrency derivatives trading platform and exchange BitMEX, known for its transparency and stalwart trading engine, had a rare hiccup when a server issue forced it to suspend trading on July 5th from 23:30 UTC until 03:45 UTC, a total suspension of 4 hours and 15 minutes. This was the company’s longest downtime since its launch in November 2014.
The BitMEX team said:
“Those of you who trade with us know that we take our uptime very seriously, and the record shows it. Before this month, we had not had a single month with less than 99.9% uptime, with our longest 100% streak reaching nearly 300 days.”
What exactly happened?
While BitMEX has developed one of the most sophisticated trading engines in the industry, its focus has always been on exactitude (remargining positions continuously, auditing every trade) rather than speed. The team confirms this was a winning strategy from 2014 to 2016, and they have never lost an execution, but as trading volume hit record levels at the beginning of this year, requests started to back up. In response, BitMEX began optimizing the trading engine.
Its web layer had not had any issues up to the point of the decision to halt trading, but the engine (at that time) could not be horizontally scaled. As you may recall, BitMEX recently partnered with Kx, the makers of kdb+, which powers the engine.
BitMEX then began testing new storage subsystems and server configurations. The company settled on an upgrade plan, set for July 11th (five days away), and began testing the switchover. They simulated the switchover three times, timing each run to best estimate the downtime.
The plan was:
- Move to a larger instance with a faster local SSD, and
- Move from bcache + ext4 to ZFS.
Some more details on those actions can be found here.
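For readers unfamiliar with ZFS, the second step amounts to building a mirrored pool and copying the engine’s data onto it. Below is a minimal sketch of that kind of setup driven from Python; the device paths and the pool name “engine” are placeholders, not BitMEX’s actual configuration.

```python
# Minimal sketch of the planned storage change: build a mirrored zpool and
# enable compression. Device paths and the pool name "engine" are placeholders.
import subprocess

def run(cmd):
    """Run a command and fail loudly so a broken step cannot go unnoticed."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a mirrored pool across two block devices (the engine's future storage).
run(["zpool", "create", "-o", "ashift=12", "engine",
     "mirror", "/dev/xvdf", "/dev/xvdg"])

# lz4 compression is cheap and usually a win for table data; verify pool health
# before any production data is copied in.
run(["zfs", "set", "compression=lz4", "engine"])
run(["zpool", "status", "engine"])
```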
What went wrong? (timeline as relayed by the BitMEX team)
With the plan ready to go, checklists written, and the switchover simulated a few times, BitMEX started preparing a zpool for use with the production engine. Here’s where it went wrong…
– 19:47 UTC: BitMEX creates a mirrored target zpool that would become the engine’s new storage. To avoid influencing I/O performance on the running engine, the team takes a snapshot of the data storage drive and remounts it to the instance. This was not something that had been done during the test runs.
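In EBS terms, that snapshot-and-reattach step looks roughly like the following; the IDs, region, and device name are placeholders, and this is a sketch of the general AWS workflow rather than BitMEX’s actual tooling.

```python
# Sketch of the EBS snapshot-and-reattach workflow using boto3. All IDs, the
# region, and the device name are placeholders, not BitMEX's actual resources.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Snapshot the live data volume (does not pause I/O on the source).
snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                           Description="engine data copy for zpool migration")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# 2. Materialize the snapshot as a new volume in the same availability zone.
vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                        AvailabilityZone="us-east-1a")
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# 3. Attach the clone to the engine instance alongside the live volume.
ec2.attach_volume(VolumeId=vol["VolumeId"],
                  InstanceId="i-0123456789abcdef0", Device="/dev/sdf")
```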
What happens when you snapshot a volume – bcache superblock and all – and attach it?
Without any interaction, the kernel automatically picked up the drive, recognizing it as a backing device for the existing (running) bcache device, and appeared to start spreading writes randomly across both devices. This began to trash the filesystem minute by minute. It seemed odd that a bcache1 drive had appeared, but BitMEX was not immediately alarmed: no errors were thrown, and writes continued to succeed. The team started migrating data to the zpool.
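In hindsight, one signal was available the whole time: a backing device claimed by bcache exposes a bcache directory in sysfs, and composite devices show up as bcache0, bcache1, and so on. The check below is an illustrative sketch of how an alert for an unexpected bcache device might look, not BitMEX’s monitoring.

```python
# Illustrative check (not BitMEX's monitoring): list block devices the kernel
# has registered with bcache. A claimed backing device exposes a "bcache"
# directory in sysfs, and the composite devices appear as bcache0, bcache1, ...
from pathlib import Path

def bcache_devices():
    found = []
    for dev in Path("/sys/block").iterdir():
        if dev.name.startswith("bcache"):
            found.append(dev.name)                            # composite bcacheN device
        elif (dev / "bcache").is_dir():
            found.append(f"{dev.name} (claimed backing device)")
    return found

if __name__ == "__main__":
    # An unexpected bcache1 appearing right after attaching a snapshot clone
    # would have been the first hint that the kernel had claimed the new volume.
    print("\n".join(bcache_devices()) or "no bcache devices registered")
```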
– 22:09 UTC: A foreign-data scraper on the engine instance (BitMEX reads in pricing data from nearly every major exchange) throws an “overlap mismatch”: when writing new trades, the data on disk did not mesh perfectly with what was in memory. The team begins investigating and repairing the data from their redundant scrapers, not yet aware of the bcache issue.
– 23:02 UTC: A read of historical data from the quote table fails, causing the engine team serious concern. BitMEX begins verifying all tables on disk to ensure they match memory; several do not. The team realizes it can no longer trust the disk, but doesn’t yet know why.
The team then begins snapshotting the volume every minute to aid in a rebuild, and their engine developers start copying all in-memory data to a fresh volume.
– 23:05 UTC: An engine suspension is scheduled. To give traders time to react, BitMEX sets the downtime for 23:30 UTC and sends out a notice. The team initially assumes this is an EBS network issue and plans to migrate to a new volume.
– 23:30 UTC: The engine suspends and BitMEX begins shutting down processes, dumping all of their contents to disk. At this point, the team believes it has identified the cause of the corruption (the bcache disk mounting).
Satisfied that all data is on multiple disks, BitMEX shuts down the instance, flushing its contents to disk, and waits for it to come back up.
It doesn’t. The team proceeds to unmount the root volume, attach it to another instance, and check the logs. No visible errors.
– 23:50 UTC: The team decides to move up the timetable on the ZFS and instance migration; it has become very clear that they can’t trust bcache. BitMEX already has its migration script written – they clone the Testnet engine, which had already been migrated to ZFS, and begin copying data to it. The new instance has increased horsepower, with 2x the CPU, 4x the RAM, and a 1.7TB NVMe drive.
– 00:30 UTC: The team migrates all the init scripts and configuration, then mounts a recent backup. BitMEX has trouble getting the bcache volume to mount correctly as a regular ext4 filesystem; the key is recalling that bcache shifts the filesystem (and its superblock) 8kB forward on the backing device. They then mount it through a loopback device at that offset and start copying.
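That 8kB comes from bcache’s on-disk layout: the backing device’s own superblock occupies the first 8KiB, and the real ext4 filesystem begins after it. The sketch below, with a placeholder device path, confirms the ext4 magic at that offset and then mounts the volume read-only through a loop device.

```python
# Sketch: confirm an ext4 superblock sits 8 KiB into the cloned bcache backing
# device, then mount it read-only at that offset. /dev/xvdh is a placeholder.
import struct, subprocess

BCACHE_DATA_OFFSET = 8 * 1024   # default bcache backing-device data offset
EXT4_SB_OFFSET     = 1024       # the ext4 superblock starts 1 KiB into the fs
EXT4_MAGIC_OFFSET  = 0x38       # s_magic field within the superblock
EXT4_MAGIC         = 0xEF53

def looks_like_shifted_ext4(device):
    with open(device, "rb") as f:
        f.seek(BCACHE_DATA_OFFSET + EXT4_SB_OFFSET + EXT4_MAGIC_OFFSET)
        (magic,) = struct.unpack("<H", f.read(2))
    return magic == EXT4_MAGIC

if looks_like_shifted_ext4("/dev/xvdh"):
    # Mount through a loop device at the 8 KiB offset, read-only, so the
    # damaged filesystem can be copied off without any further writes.
    subprocess.run(["mount", "-o", f"ro,loop,offset={BCACHE_DATA_OFFSET}",
                    "/dev/xvdh", "/mnt/recovery"], check=True)
```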
The team also sets up an sshfs tunnel to Testnet to migrate any missing scraper data. The engine team begins recovering tables.
– 01:00 UTC: The pool is destroyed and remounted to work around an EBS issue. While the files copy, the team begins implementing the new ZFS-based backup scheme, replicating minutely snapshots to another instance as they work. This becomes valuable several times as data is verified.
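Minutely replication of this sort maps naturally onto ZFS’s snapshot and send/receive mechanism. Here is a rough sketch of the idea; the dataset and standby host names are invented for illustration.

```python
# Rough sketch of minutely ZFS snapshot replication to a standby host via
# zfs send/receive. Dataset and host names are invented for illustration.
import subprocess, time

DATASET = "engine/data"
REMOTE  = "root@standby-instance"

def snapshot_and_send(prev):
    name = f"{DATASET}@{int(time.time())}"
    subprocess.run(["zfs", "snapshot", name], check=True)

    # The first run sends a full stream; afterwards only the delta since the
    # previous snapshot is piped over SSH into zfs receive on the standby.
    send_cmd = ["zfs", "send", name] if prev is None else ["zfs", "send", "-i", prev, name]
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "receive", "-F", DATASET],
                   stdin=sender.stdout, check=True)
    sender.wait()
    return name

if __name__ == "__main__":
    last = None
    while True:
        last = snapshot_and_send(last)
        time.sleep(60)   # minutely, as in the new backup scheme
```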
– 02:00 UTC: The copy finishes and the zpool is ready to go. Bcache had trashed blocks all over the old disk, so the engine team begins recovering from backup.
– 03:00 UTC: The backfill is complete and the team works on verifying data. Everything is looking good: BitMEX didn’t lose a single execution. Relief starts flooding the room, and the team starts talking timetables. They partition the local NVMe drive into a 2GB ZIL and 1.7TB L2ARC and attach both to the pool to get ready for production trading.
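Adding a separate log device (ZIL) and cache device (L2ARC) to an existing pool is a standard zpool operation; the partition paths and pool name below are placeholders rather than BitMEX’s exact layout.

```python
# Sketch: attach the NVMe partitions to the pool as a log device (ZIL) and a
# cache device (L2ARC). Partition paths and the pool name are placeholders.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["zpool", "add", "engine", "log", "/dev/nvme0n1p1"])     # ~2GB ZIL
run(["zpool", "add", "engine", "cache", "/dev/nvme0n1p2"])   # ~1.7TB L2ARC
run(["zpool", "status", "engine"])                           # confirm both show up
```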
– 03:05 UTC: The team brings the site back online, scheduling the unsuspension for 03:45 UTC. The support team begins telling customers the new timeline, and chat comes back on.
– 03:45 UTC: The engine unsuspends and trading restarts. Fortunately for BitMEX, the bitcoin price has barely moved over those four hours.
The BitMEX team said:
“While we prepared for this event, actually experiencing it was quite different.”
“Over the next two days, the team was communicating constantly. We wrote lists of everything that went wrong: where our alerting failed, where we could introduce additional checksumming, how we might stream trade data to another instance and increase the frequency of backups. We introduced more fine-grained alerts up and down the stack, and began testing them.”
“To us, this particular episode was an example of an “unknown unknown”. Modern-day stacks are too large, too complicated, for any single person to fully understand every single aspect. We had tested this migration, but we had failed to adequately replicate the exact scenario. The best game to play is constant defense.”
“As we scale over the coming months, we will be implementing more systems toward this end, toward the eventual goal of having an infrastructure resilient to even multiple-node failures. We want to deploy a Simian Army.”
Improvements are already on the way, with some complete, including:
- Moving to ZFS itself, a long-planned and significant step that provides significantly improved data consistency guarantees, much more frequent snapshotting, and better performance.
- Developing automated tools to re-check data integrity at intervals (outside of the existing checks and ZFS checksumming) and to identify problems sooner; a rough sketch of what such a periodic check might look like follows this list.
- Reviewing every aspect of the alerting system, reworking several gaps in coverage and implementing many more fail-safes.
- Greatly expanding the number of jobs covered under Dead Man’s Snitch, a service that has proven invaluable over the last few years.
- Implementing additional backup destinations and re-testing them; data is now frequently replicated across continents and three cloud providers.
- Continuing to implement new techniques for increasing the repeatability of the architecture, so that major pieces can be torn down and rebuilt at will without significant developer knowledge.
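As an illustration of the periodic integrity re-checks mentioned above, a tool along these lines could hash table files on a schedule and compare them against digests recorded at write time. The paths and manifest format are invented; this is an illustrative sketch, not BitMEX’s implementation.

```python
# Illustrative periodic integrity checker: hash each table file and compare the
# digest against a manifest recorded at write time. Paths and the manifest
# format are invented; this is not BitMEX's implementation.
import hashlib, json, time
from pathlib import Path

DATA_DIR = Path("/engine/tables")           # hypothetical table directory
MANIFEST = Path("/engine/manifest.json")    # e.g. { "trade": "<sha256>", ... }

def sha256_of(path):
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_once():
    expected = json.loads(MANIFEST.read_text())
    return [name for name, digest in expected.items()
            if sha256_of(DATA_DIR / name) != digest]

if __name__ == "__main__":
    while True:
        mismatched = verify_once()
        if mismatched:
            # In production this would trigger an alert; the 22:09 "overlap
            # mismatch" was effectively the first such signal in this incident.
            print("integrity mismatch:", ", ".join(mismatched))
        time.sleep(300)   # re-check every five minutes
```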
In other developments, BitMEX will launch a minor update to the BitMEX API ratelimiter today at 14:00 UTC:
- This change should be non-breaking.
- Effective ratelimits are being raised across the board.
- Request “tokens” now refill continuously, rather than all at once every 5 minutes. For example, against the 300-request limit, if a user makes 10 requests in 5 seconds, roughly 5 tokens refill in that time, leaving an allowance of 300 – 10 + 5 = 295.
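The continuous refill described above is a classic token bucket. The sketch below reproduces the arithmetic from the example (a 300-token bucket refilled at one token per second, i.e. 300 per 5 minutes); the class and its naming are illustrative, not BitMEX’s actual ratelimiter code.

```python
# Token-bucket sketch reproducing the example above: a 300-token bucket refilled
# continuously at one token per second (300 per 5 minutes). The class is
# illustrative, not BitMEX's actual ratelimiter code.
import time

class TokenBucket:
    def __init__(self, capacity=300, refill_per_sec=1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now

    def allow(self):
        """Spend one token for a request, if one is available."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # the request would be ratelimited

    def remaining(self):
        """Current token count after applying the continuous refill."""
        self._refill()
        return self.tokens

bucket = TokenBucket()
for _ in range(10):          # 10 requests spread over roughly 5 seconds
    bucket.allow()
    time.sleep(0.5)
print(f"{bucket.remaining():.0f} tokens remaining")   # ~295, as in the example
```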