The Treacherous Tangle of Redundant Data: Resilience for Wallaroo
Introduction: we need data redundancy, but how, exactly?

You now have your distributed system in production. Congratulations! Your cluster starts at six machines, but it is expected to grow quickly as it is assigned more work. The cluster’s main application is stateful, and that’s a problem. What if you lose a local disk drive? Or a sysadmin runs rm -rf on the wrong directory? Or an entire machine cannot reboot, because of a power failure or an administrator error that destroys a whole virtual machine?…