Recent downtime - jwills.co.uk

I would like to apologise to the users of this server (http://malus.exotica.org.uk) for the recent downtime. There have been two very unlucky incidents in the last few weeks that were unfortunately out of my control.

I manage this machine, but it is a virtual server hosted on someone else's hardware, and they manage that hardware.

We recently moved from a dedicated machine over to a virtual setup in a different location. This was due to the age of the dedicated hardware and some filesystem corruption issues we were having. As we were hosted on a Mac Mini there was only a single disk, and I was worried about it failing completely.

At the new location we were put onto a temporary machine while a new machine was being built. The move was to give us more memory, better performance and more reliable storage. Even on the temporary machine we had much better performance and more memory than previously.

Then a couple of weeks ago, during a routine upgrade of the XenServer host software, something went badly wrong and some LVM metadata was corrupted. This was manually recovered and nothing was lost, but it took a couple of days to restore everything.

Soon after that, the system was taken down while it was moved over onto the new hardware, which would give us more storage and better performance (SSD for root and 1TB of RAID 5 backed additional storage for data). We had a bit of planned downtime during this.

Then yesterday morning (10th August) something horrible happened. Three brand new Seagate drives (~160 power-on hours) that were part of the RAID 5 array on the host developed multiple read errors. Some of the read errors were also on the same stripe, meaning some data could be lost. The owner of the hardware started recovering everything over to another machine. Our root file system was known to be fine, but the state of our main storage was not known initially.
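For anyone wondering why errors on the same stripe matter so much: RAID 5 keeps one parity block per stripe, so it can rebuild a single unreadable block but not two. Below is a minimal sketch of that idea in Python. It assumes a hypothetical stripe of two data blocks plus one parity block, and the block contents and the `xor_blocks` helper are made up for illustration. It is not the actual layout of the array on the host, which I don't know.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical array: two data blocks and one parity block,
# each living on a different drive. The contents are made-up examples.
data_a = b"malus666"
data_b = b"exotica!"
parity = xor_blocks([data_a, data_b])

# One unreadable block in the stripe: rebuild it from the two survivors.
rebuilt_a = xor_blocks([data_b, parity])
assert rebuilt_a == data_a

# Two unreadable blocks in the same stripe: only the parity block is left,
# and a completely different pair of data blocks produces the same parity,
# so the original contents cannot be determined.
other_a = b"AAAAAAAA"
other_b = xor_blocks([other_a, parity])
assert xor_blocks([other_a, other_b]) == parity
```

Read errors on several drives are survivable as long as they land on different stripes. It is only where two fall on the same stripe that data is actually gone.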

When it came to restoring our virtual server, we got lucky: none of the damaged areas of the disks affected us, so there was no data loss. Even if there had been some, users' data on this server would not have been at risk, as an offsite backup was made just a few hours before the crisis. It just might have taken longer to recover.

Data is backed up from my server daily, and multiple snapshots are kept. I use Attic (https://attic-backup.org/), which allows me to store multiple encrypted backups with de-duplication. I have backups from yesterday going back to last year.

We are now up and running on a temporary machine again with a little less storage space than before – but everything is working.

Apologies again for any inconvenience caused. I would also like to thank the owner of the hardware who is still recovering other systems from the machine, and has been working without sleep for the last two days to fix it. That on top of having a summer cold. Thoughts are with you!