Backups, backups, backups

Recent events have shown that awareness of one’s own data could still do with some improvement.

Everybody knows: backups are important. So how come so many people don’t have any and implicitly rely on their host to handle them? This assumption probably stems from everyday experience. If something breaks, it is usually under warranty, so it gets replaced. If we accidentally delete a file from our home laptop, we recover it from an external drive, and so on. So surely a company that hosts your data will have backups, right? Or not?

The answer is: NO.

Most ISPs and hosting companies will not back up your data. And again: your service provider will NOT (really, they won’t!) back up your data. They will not have those urgently needed copies of your dedicated server. They will not have point-in-time copies of your virtual private server. They simply won’t.

I think the main point has now come across. But why not? They are a host, after all, aren’t they? Yes, BUT: backups are not just snapshots of your VM. A proper backup takes a lot of thought, because it involves parameters such as:

  • what to back up (which files, databases, …)
  • how to back it up (archives? all files? database dumps?)
  • when to back it up (midnight? early morning? when is it best for you?)
  • how often to back it up (once a day? a week? every hour?)
  • how many generations of backups to keep (1? 3? 10?)
  • how long to keep your backups (a month? half a year? 5 years?)
  • HOW TO RESTORE FILES FROM BACKUP (why capitals? It’s nice to know your files are backed up, but do you know how to retrieve them? Where to place them? Are there any dependencies between the files that were lost and the files that are still there?)
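The interplay between “how many generations” and “how long to keep” is easy to get wrong, so here is a minimal sketch of a retention rule. It assumes one dated backup per run and a policy chosen purely for illustration (keep the newest N generations, drop anything older than a maximum age, but never delete the most recent backup); the function names are hypothetical and not part of any backup tool.

```python
from datetime import date, timedelta


def select_backups_to_keep(backup_dates, generations, max_age_days, today):
    """Return the set of backup dates to keep under a hypothetical policy.

    - keep at most `generations` of the newest backups,
    - drop any of those that are older than `max_age_days`,
    - but always keep the single newest backup, so that a backup job which
      silently stopped running never leads to everything being pruned.
    """
    if not backup_dates:
        return set()
    newest_first = sorted(set(backup_dates), reverse=True)
    cutoff = today - timedelta(days=max_age_days)
    keep = {d for d in newest_first[:generations] if d >= cutoff}
    keep.add(newest_first[0])  # safety net: never delete the latest backup
    return keep


def select_backups_to_delete(backup_dates, generations, max_age_days, today):
    """Everything not selected for keeping is a candidate for deletion."""
    keep = select_backups_to_keep(backup_dates, generations, max_age_days, today)
    return sorted(set(backup_dates) - keep)


if __name__ == "__main__":
    today = date(2024, 3, 31)
    nightly = [today - timedelta(days=n) for n in range(20)]  # 20 nightly runs
    keep = sorted(select_backups_to_keep(nightly, generations=10, max_age_days=14, today=today))
    print(len(keep), keep[0], keep[-1])  # 10 2024-03-22 2024-03-31
```

With 20 nightly backups, 10 generations and a 14-day limit, the ten most recent nights are kept and the other ten become deletion candidates. The safety net is a deliberate design choice: age-based pruning on its own would happily delete your last good backup if the backup job had been failing unnoticed for weeks.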

When you signed up with your ISP, how many of the above questions were addressed? Or maybe their TOS say “we are not responsible for your data”? Taking proper backups takes a lot of human effort, and considerable IT resources. There are hosts who sometimes do you a favour and take snapshots of your VMs, but their TOS will still say they are not responsible, and they will not know what they back up either.

The only way out of this dilemma is a managed contract and/or a special SLA that answers all of the above questions, and defines what responsibility the host bears if they do not live up to this agreement (i.e. if they DO lose your data).

Which brings us to the next point: YOU should always have backups as well – especially you. It is your data, after all. If you do not have the resources to back up your data at reasonable intervals, you should come to an agreement with your host for a managed backup service (and still try to keep at least occasional off-site backups: have your host send you a DVD, for example). An enterprise host will guarantee that your backups are securely stored on redundant and resilient hardware (a single drive attached to the USB port of a 10-year-old server that has trouble booting after every update is not a redundant and resilient backup service), and such a host will also regularly check that your backups are readable and can easily be restored at your convenience.
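Checking “whether your backups are readable” can be partly automated. The sketch below assumes tar archives (optionally gzip, bzip2 or xz compressed) and a manifest of SHA-256 checksums recorded when the backup was taken; the manifest format and the function names are hypothetical. It reads every file in the archive end to end and compares it with the manifest, so missing, altered or unreadable content shows up as a problem. It complements, but does not replace, an actual test restore.

```python
import hashlib
import tarfile
import zlib


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_backup_archive(archive_path, expected):
    """Check a tar backup against a manifest recorded when it was created.

    `expected` maps member names to SHA-256 hex digests (hypothetical manifest
    format). Returns a list of problems; an empty list means every expected
    file was found and read back with the expected checksum.
    """
    problems = []
    seen = set()
    try:
        # "r:*" lets tarfile detect gzip, bzip2 or xz compression by itself.
        with tarfile.open(archive_path, mode="r:*") as archive:
            for member in archive:
                if not member.isfile():
                    continue  # directories, links etc. carry no file data
                seen.add(member.name)
                with archive.extractfile(member) as stream:
                    actual = sha256_of_stream(stream)
                wanted = expected.get(member.name)
                if wanted is None:
                    problems.append(f"not in manifest: {member.name}")
                elif actual != wanted:
                    problems.append(f"checksum mismatch: {member.name}")
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        # Corrupt or truncated archives typically fail while being read.
        problems.append(f"archive unreadable: {exc!r}")
    # Files the manifest expects but the archive never delivered.
    for name in sorted(set(expected) - seen):
        problems.append(f"missing from archive: {name}")
    return problems
```

Comparing against a manifest matters because a damaged archive does not always fail loudly: a file that is simply absent only shows up when you know it should have been there.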

This isn’t anything that can be included in a single-digit-per-month contract for a run-of-the-mill virtual private server or a dedicated server bought during a blowout sale. Ultimately, if you lose data, it is your time and money that are at stake. If you have no arrangement with your host, you will bear the brunt of any data disaster (and even if you have an SLA that transfers responsibility to your host: if your host doesn’t live up to that SLA, you are still going to sweat a lot – along with your host, who will be facing a claim for damages).

How important is your data to you? Do you care so little about it that you will not make sure it can be retrieved at any time if force majeure or an accident wipes out your production setup? We are certain that this is not the case, so please:

Take responsibility for your data. Your data needs you, and vice versa. If you do not have the resources, find someone who has, and someone who can be blamed if they do not live up to the agreement made. Your host is not going to do anything for you unless you have it in writing (and by that we do not mean the shiny ad on a host’s homepage).

This is a rather fervent da capo of our post from 2011 (http://dedicatedservers.castlegem.co.uk/2011/10/backups/). Why this ardour? Because we care about your data. We want you to be able to lean back and enjoy the feeling of your data being reliably secure. Just keep in mind: it does not come for free, and not without asking for it and specifying it.

This post is not directed at any particular ISP. It is also about ourselves: we have the same or similar TOS, which prevent us from being held responsible for any data loss on clients’ servers unless we have a direct agreement that states otherwise. But we do offer enterprise solutions where we live up to every single letter of the agreement and regularly exceed it, just like many other hosts out there.

 

My server crashed…what now?

So now it has finally happened… your server has crashed beyond repair. It won’t boot, or what it boots bears little resemblance to what you expect; the remote console shows that a manual file system check is needed, GRUB cannot find a kernel, your root partition is gone, Windows says it cannot find any disks anymore, or any of the other nightmares you thought could only happen to everyone else, not to you.

Now what?

First of all: DON’T PANIC!

For those of you familiar with Douglas Adams’s Hitchhiker’s Guide to the Galaxy, this advice will sound more than familiar, and it is in fact the very first thing to do. Panic clouds your mind, and everything you do will take much longer than if you do it calmly and take the time to think twice before doing anything at all.

  1. Assess the situation: Are you able to try fixing it yourself, or are you not familiar enough with the error displayed, or the symptoms coming up?
  2. Do not mess with the system too much: Even if you do not have a managed server and therefore have to take a look yourself first, remember that you are not on site and have no means of carrying out a hardware repair. On top of that, the damage risks getting worse the longer remote actions take, and the more scattershot the attempts to salvage what is left become.
  3. Ask your provider to step in: If you have a managed server, they will have to handle it anyway, and depending on the SLA in place, will provide you with a replacement machine in the meantime, a failover solution, etc. If you do not have a managed server, your provider is still your best bet for actual on-site operations, as they are the ones with physical access to the server, and if they are not proficient themselves, they should be able to bring in someone to look at the machine faster than you could. If you have hired your own sysadmin (who, however, is not on site either), your ISP and your sysadmin can talk to each other to decide on the best course of action.
  4. In the meantime, have your provider – with or without a respective SLA – set up a new server, a replacement VPS, a shared hosting account, in other words, anything that lets you bring your site back up with a notice such as “We are performing maintenance / crash recovery / you name it”, i.e. something that puts you back in touch with your customers so they know you are dealing with the situation. Use Twitter, Facebook, your customer portal (if you have one on another machine), etc. to let your clients know which of them are affected, and why.
  5. Depending on the interim solution, get ready to bring your backups back online (you do have them, don’t you?). In a managed environment, your provider will most likely have them, otherwise (and in fact no matter what) you should have an external backup somewhere as well.
  6. Once you have a new production server (or the old one repaired), and have set it up with its operating system, updated it to the latest patchset and security fixes, and brought it to a state that matches the environment before the crash, use your backups and perform data recovery. Do not go live again yet, however:
  7. Test, test, and test again whether everything is working according to specs and expectations. Naturally, you will want to be back online as fast as possible. On the other hand, you want to avoid nasty surprises such as inconsistent databases, mismatched orders, invoices, etc. It is up to your judgement to find the right balance here.
  8. Once everything is back to normal, write up an Incident Report and send it out to all customers who were affected by the outage, and handle compensation as per your own SLA and TOS.
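For step 4, the interim “we are performing maintenance” page can be as small as the sketch below, written with Python’s standard http.server module. It answers every request with HTTP status 503 (Service Unavailable) and a Retry-After header, which tells browsers, monitoring and search engines that the outage is temporary. The page text, port 8080 and the one-hour Retry-After value are placeholder assumptions; in practice you may prefer to configure an equivalent response in whatever web server your interim machine already runs.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder page text: adapt it to your own announcement.
MAINTENANCE_PAGE = b"""<!DOCTYPE html>
<html><head><title>Maintenance</title></head>
<body><h1>We are performing maintenance</h1>
<p>We are recovering from a server outage and will be back shortly.</p>
</body></html>"""

RETRY_AFTER_SECONDS = 3600  # hypothetical estimate; adjust to your situation


class MaintenanceHandler(BaseHTTPRequestHandler):
    def _send_maintenance(self, include_body):
        # 503 marks the outage as temporary rather than as the site's real content.
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(MAINTENANCE_PAGE)))
        self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        if include_body:
            self.wfile.write(MAINTENANCE_PAGE)

    def do_GET(self):
        self._send_maintenance(include_body=True)

    def do_HEAD(self):
        self._send_maintenance(include_body=False)

    def do_POST(self):
        # Orders or form posts cannot be processed right now either.
        self._send_maintenance(include_body=True)


def make_server(host="0.0.0.0", port=8080):
    return HTTPServer((host, port), MaintenanceHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Serving 503 rather than a normal 200 page keeps search engines from indexing the maintenance notice in place of your real content while you restore from backup.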

 

Managed or not?

We published a similar post back in July 2011 (cf. here), so why are we bringing this up again? Recently, we have seen a large surge in two categories of orders: unmanaged low-end VPS (256MB memory and the like, for use as a DNS server, etc.) and fully managed servers.

Customers are increasingly aware of the need to run their sites on a well-managed server. However, the managed option typically extends only to managing the server’s operating system (and possibly its hardware), i.e. updating the operating system with the latest security patches (something an “intelligent” control panel such as cPanel can mostly handle by itself), applying the latest package upgrades, and generally making sure the server works as intended.

In most cases, however, managed does not cover application issues. And this is a crucial point: you as the customer need to be sure that the server administration side of your enterprise speaks the same language as the application development side. Nothing is worse than an eager sysadmin updating a software package without consulting the developers, who happen to depend on the older version for the entire site to run smoothly. With today’s globalisation, this can cause you additional grief: your developers are often from a different company than your ISP, and each side will (naturally) try to avoid taking the blame. This can leave you and your enterprise crippled or hindered.

What do we advise?

  1. Don’t save money on a sysadmin.
  2. Make sure your sysadmin talks to your developers and understands what they need.
  3. Make sure your sysadmin has a basic understanding of your application in case of emergencies.
  4. Make sure your staff (your sysadmin and developers) coordinate updates and upgrades.
  5. Make sure you have a working test environment where you can run updates and upgrades in a sandbox to check that everything still works as expected afterwards.
  6. Have a team leader coordinate your sysadmin(s) and developer(s), or take on this role yourself.

How much is it going to cost you?

Fully managed packages vary in cost: the normal sysadmin packages that deal with the operating system only will add anywhere between £20 and £200 per month to your budget. If you want the sysadmin to be an integral part of your team and support your application as well (in terms of coordinated server management), the price will be towards the higher end of that range, but may then also include some support for the application itself.

Who to hire?

Get someone with experience. There are sysadmins out there who have decades of experience and know the dos and don’ts, and there are sysadmins who consider themselves divine just because they have been “into Linux for 2 years”. A sysadmin is not someone who jumps at the first sight of an available package upgrade and yum installs 200 dependencies just to claim the system is up to date. A sysadmin is someone who understands the implications of a) upgrading and b) not upgrading. A sysadmin will weigh these pros and cons and explain them to you before suggesting what to do. A sysadmin is someone you trust to take this decision off your shoulders entirely, so you can run your business instead of worrying whether the next admin cowboy is going to blow up your server. A sysadmin is someone who knows not only how to keep a system alive, but also how to bring a failed system back to life.

These are just some general guidelines. Contact us for further advice; we are happy to help!