1 September 2006. If you have sent an urgent message to someone with an email account at FastMail and they haven’t replied, you might want to try the phone or snail mail. FastMail’s having mailserver trouble.
In fact, FastMail seems to have had problems throughout August.
Early on the 9th, IMAP6 took a powder. Elapsed time to replica switchover: Half an hour. But ticky tack problems with spam, filtering and forwarding rules ate the rest of the day.
In the wee hours of the 12th Server3 spontaneously rebooted. As Server3 isn’t a Windows machine, this was not considered normal. Users on Server3 experienced slow response times on the 16th. FastMail techs surmised that the sluggishness was caused by processes that were unable to finish the night before.
Four days later, Server3 again performed its “slo-mo” act. The day after, imap5 and imap6 shut down. Techs cut over to two different servers, and crossed their fingers.
It all came to a head when Server3 died, on 30 August. At first, the network manager thought it was simply a filesystem bug, meaning the beast would be online again within an hour. Then he reported that Server3 would need a full filesystem check, which would take another two hours. In the meantime, all the backend servers were restarted.
Mail delivery was suspended for most of the day, while incoming mail was being delayed “by around an hour.”
At 6:10 pm, techs figured that it would take 30 hours to finish checking the filesystem.
So let’s see… figuring in the upcoming long Labor Day weekend, I’ll see any messages you sent me on Thursday… sometime next Tuesday. And that’s OK.
After all the shots I’ve taken at freemail services like Gmail, Yahoo, MSN, et al., I have some serious crow munching to do… especially since I still like those FastMail guys. They’ve given me great service for years.
Sigh. Pass the ketchup, would yuh?
Update: A colleague here in my office can access his FastMail account, but I still get: “The account you are trying to login to is in the process of being recovered from a disk failure and is currently unavailable. This process will take around a day. Please see the Fastmail status blog for more details and updates. We apologise for the inconvenience.” Nuts.
Update: 3 September 2006, 2pm CST: If you’ve been waiting for a response from me, take heart. When I try to log in, FastMail responds, “Your account is already a priority restore request. These requests are processed in the order in which they were created by login attempts. There are 1837 requests in the queue before you.”
At this rate, I hope to be accessible again on Tuesday (after Labor Day).
Update: 4 September 2006, 7:45am CST: “Your account is already a priority restore request. These requests are processed in the order in which they were created by login attempts. There are 6 requests in the queue before you.”
Forensics: 6 September 2006: FastMail’s description of what will be known hereafter as “The Server3 Disaster”:
The problem began on Thursday, when a partition on server3 started giving errors. On finding that the affected partition failed to mount, we needed to run a filesystem check (‘fsck’). This would hopefully fix up any errors in the filesystem, and we’d then be able to continue to use that partition. In any case, we would certainly be unable to use the partition until the fsck was completed, so we started the fsck.
We set the status of the affected users to a special value (“Being moved”) in order to stop the mailboxes being used, and then changed the configuration on the server to allow normal access for all the other users on server3. Once this was done delivery of messages was able to proceed again.
When it became apparent that the fsck would take a long time (it ended up taking two and a half days), with no guarantee of fixing the problem, we decided to restore our latest backups of the affected users onto some spare disk space. So, in the worst case situation where the fsck failed to make the drive readable, we would have a significant head start in getting email up and running again. This wasn’t our first preference, though, because it would mean that any emails received since the latest backups would be lost.
Meanwhile we were monitoring the other servers, and making sure that all the messages banking up didn’t overwhelm our incoming mail servers. The best strategy turned out to be to move messages which were addressed to the affected users to a ‘hold’ queue, where they wouldn’t time out and generate bounce messages to the sender.
The fsck finally completed, and told us that there were no problems with the file system, and everything was clean. However, whenever we tried to mount the infamous partition, the filesystem driver in the kernel would give an error. Curses! After some investigation we found that we were able to mount the partition read-only, so at least we could read the most up to date version of the mailboxes.
So, the plan then became to synchronise the already restored backups with the read-only partition. This was relatively fast, because the lion’s share of the data was already in place, only modifications that had happened since the last daily backup would need to be copied.
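The reason this step is fast can be shown with a minimal sketch. This is not FastMail's tooling (the description does not say what they used); the function names and the "size differs or source is newer" test are assumptions made for illustration. It walks the read-only source tree and copies only the files that are missing or stale in the restored copy.

```python
import os
import shutil


def _differs(src, dst):
    """Assumed staleness test: dst is missing, differs in size, or is older than src."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) > int(d.st_mtime)


def sync_changed_files(source_root, restored_root):
    """Copy new or changed files from source_root into restored_root.

    Returns the sorted relative paths that were copied. Files deleted from
    the source since the backup are NOT removed from the restored copy;
    a real mailbox sync would have to deal with that as well.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(restored_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if _differs(src, dst):
                shutil.copy2(src, dst)  # copy contents and timestamps
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)
```

With most of the data already restored from backup, only the mail that arrived or changed after the last daily backup passes the staleness test, so the copy list stays short.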
We started this process, and set it up so that each user would become active as soon as that user was synchronised. We prioritised the users so that the users who were currently trying to log in and access their mail would be restored before users who were making no attempts to log in.
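That ordering, together with the "requests in the queue before you" message shown on login, can be modelled with a minimal sketch. The class and method names are hypothetical and this is not FastMail's code: it only assumes that users who attempt to log in are served first, in order of their first attempt, and everyone else afterwards.

```python
from collections import OrderedDict


class RestoreQueue:
    """Hypothetical model of the restore ordering described above."""

    def __init__(self, affected_users):
        # Users awaiting restore who have not tried to log in yet.
        self._background = OrderedDict((user, None) for user in affected_users)
        # Users who have tried to log in, in order of their first attempt.
        self._priority = OrderedDict()

    def login_attempt(self, user):
        """Record a login attempt.

        Returns the number of priority requests ahead of this user,
        or None if the user is not waiting for a restore.
        """
        if user in self._background:
            del self._background[user]
            self._priority[user] = None
        if user not in self._priority:
            return None
        return list(self._priority).index(user)

    def next_user(self):
        """Return the next user to restore, or None when nobody is left."""
        if self._priority:
            return self._priority.popitem(last=False)[0]
        if self._background:
            return self._background.popitem(last=False)[0]
        return None
```

Repeated login attempts do not move a user forward or back in this model, which matches the "already a priority restore request" wording.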
Once this process had begun, affected users began to have service restored to them. So the first users suffered an outage of a bit under 3 days.
After we verified that this was working, we began to release queued mail from the ‘hold’ queue on our incoming mail servers, where the destination user was now active.
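The hold-and-release behaviour can be sketched as follows. This is an illustration under assumptions, not a description of FastMail's mail servers: the class name, its methods, and the in-memory lists are all hypothetical. Mail for a user who is still being restored is parked instead of bounced, and it is released once that user is marked active.

```python
class HoldQueue:
    """Hypothetical sketch: park mail for inactive users, release it later."""

    def __init__(self, affected_users):
        self.inactive = set(affected_users)  # users still being restored
        self.held = []       # (recipient, message) pairs waiting in the hold queue
        self.delivered = []  # (recipient, message) pairs handed to mailboxes

    def accept(self, recipient, message):
        """Deliver immediately unless the recipient is still being restored."""
        if recipient in self.inactive:
            self.held.append((recipient, message))
        else:
            self.delivered.append((recipient, message))

    def mark_active(self, user):
        """Mark a user as restored and release their held mail in arrival order."""
        self.inactive.discard(user)
        still_held = []
        for recipient, message in self.held:
            if recipient == user:
                self.delivered.append((recipient, message))
            else:
                still_held.append((recipient, message))
        self.held = still_held
```

Mail for unaffected users never touches the hold queue, which is why delivery for everyone else could resume while the restore was still under way.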
At a bit over 5 days since the original problem, all users were on line again, apart from a handful (less than 10) that required manual intervention. All users are now online and working again.
Reliability and the future
We take reliability seriously, and we are taking steps to prevent something like this happening again.
We have been engaged for some time in a programme to move our users onto servers which are replicated in real time, so that in the event of a problem, we can just switch over to using the replica server with no loss of data, and only a very short interruption to service. Unfortunately, it’s not as easy as buying twice as many servers and announcing “We’re replicated”. There have been a number of software and hardware problems, and we obviously don’t want to move users onto a system which is less reliable than the current setup.
Some users were already on replicated servers, and we were in the process of setting up and testing another pair of servers when this problem happened. The quickest way to get the affected users running again was to restore them onto the new pair of servers. This forced us to spend a lot of time (while the fsck was running) making sure the new pair was configured, stable and ready to go. When users were restored, they were moved to these new servers.
So the upshot is that the restored users are now enjoying real time replication. We have more servers on order, and are waiting for them to arrive so that we can move all of our users to this replicated setup.
Moral: Back up your backups. Then back them up.

1 comment
September 21st, 2006 at 2:08 am
Owen
It seems that gmail is having delayed message delivery too. Two of my friends have been affected in the last few days, and others may not yet know. There have been 40 posts to a recent thread on the topic; I don’t know if this is new or if this is typical, but I hadn’t heard of it on gmail before.
See:
http://groups.google.com/group/Gmail-Problem-solving/browse_thread/thread/6389a2e559f05cd3/#