So now it has finally happened…your server has crashed beyond repair. It won’t boot, or what it boots into bears little resemblance to what you expect. The remote console says a manual file system check is needed, grub cannot find a kernel, your root partition is gone, Windows says it cannot find any disks anymore, or you are facing some other nightmare you thought could happen to everyone else but not to you.
First of all: DON’T PANIC!
For those of you who have read Douglas Adams’s The Hitchhiker’s Guide to the Galaxy, this advice will sound more than familiar, and it is in fact the very first action to take. Panic will cloud your mind, and everything you do will take much longer than if you work calmly and take the time to think twice before you do anything at all.
- Assess the situation: Are you able to try fixing it yourself, or are you not familiar enough with the error displayed or the symptoms you are seeing?
- Do not mess with the system too much: Even if you do not have a managed server and therefore have to take a first look yourself, remember that you are not on site and do not have the means to do a hardware repair. The longer the remote actions take, and the more varied the attempts to salvage what is left, the greater the risk of making the damage worse.
- Ask your provider to step in: If you have a managed server, they will have to handle it anyway and, depending on the SLA in place, will provide you with a new machine in the meantime, a failover solution, etc. If you do not have a managed server, your provider is still your best bet for actual on-site operations, as they are the ones who have physical access to the server. If they are not proficient themselves, they should still be able to bring in someone who can have a look at the machine faster than you could. If you have hired your own sysadmin (who is not on site either, however), your ISP and your sysadmin can talk to each other to agree on the best course of action.
- In the meantime, have your provider (with or without a corresponding SLA) set up a new server, a replacement VPS, or a shared hosting account; in other words, anything that allows you to bring your site back with a notice saying “We are performing maintenance / crash recovery / you name it”, i.e. something that puts you back in touch with your customers so they know you are on top of the situation. Use Twitter, Facebook, your customer portal (if you have one on another machine), etc. to let your clients know which of them are affected, and why.
- Depending on the interim solution, get ready to bring your backups back online (you do have them, don’t you?). In a managed environment, your provider will most likely have them. Otherwise (and in fact no matter what), you should have an external backup somewhere as well.
- Once you have a new production server (or the old one repaired), and have set it up with its operating system, updated it to the latest patchset and security fixes, and brought it to a state that matches the environment before the crash, use your backups and perform data recovery. Do not go live again yet, however:
- Test, test, and test again that everything is working according to specs and expectations. Naturally, you will want to be back online as fast as possible. On the other hand, you want to avoid nasty surprises such as inconsistent databases, mismatched orders, invoices, etc. It is up to your judgement to find the right balance here.
- Once everything is back to normal, write up an Incident Report and send it out to all customers who were affected by the outage, and handle compensation as per your own SLA and TOS.
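One small part of the “test, test, and test again” step can be automated: checking that the files you restored from backup are actually the files you backed up. The following is only a minimal sketch under assumptions that are not part of the checklist above. It assumes you recorded a SHA-256 checksum per file at backup time and kept that manifest with the backup; the function names `sha256_of` and `verify_restore` and the manifest layout (relative path mapped to hex digest) are hypothetical and chosen purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest: dict[str, str], restore_root: Path) -> list[str]:
    """Compare restored files against a manifest of expected checksums.

    `manifest` maps a path relative to `restore_root` to the SHA-256 hex
    digest recorded at backup time (an assumed, illustrative format).
    Returns a list of human-readable problems; an empty list means every
    file listed in the manifest is present and unchanged.
    """
    problems = []
    for relative_name, expected in sorted(manifest.items()):
        candidate = restore_root / relative_name
        if not candidate.is_file():
            problems.append(f"missing: {relative_name}")
        elif sha256_of(candidate) != expected:
            problems.append(f"checksum mismatch: {relative_name}")
    return problems
```

An empty result only tells you the restored files match what was backed up. It says nothing about whether the application data is consistent (orders, invoices, database state), so it complements the functional tests rather than replacing them.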