Last week saw both Microsoft's and Google's online office solutions go down.
Both outages were dealt with openly, ish, but whilst Microsoft are pointing the finger at network issues (also DNS), Google are pointing the finger at themselves.
Google Broked It
Google are always keen to point out that they use their own products internally and are affected by the same issues their customers see. They call it “Dog Fooding”.
But that is poor consolation when all your private or business documents are unavailable.
What is worse is that Google say the issue was caused by a change Google made to how realtime collaboration works. The blog lists the order of events, but essentially some of the servers in their environment had to reboot repeatedly as they ran out of memory processing the new load.
You have to give them credit for being so honest though.
The Big Question
Well, of course, “How did that happen?”
But I prefer “What kind of testing was done?”
Having used, and been fairly happy with, Google Docs for a while, I find that they (Google) are normally the source of the issue. Labs, updates and straight-to-live changes seem to be badly tested and/or poorly scoped and scaled.
It can’t be a lack of time; after all, they allow the devs one day a week of “Blue Sky Thinking” time to come up with the Next Big Thing. Gmail itself came from just such a session.
But, wait a moment. I think that is exactly the issue. I look at the apps available in the Google Multiverse and the array is staggering. This, I believe, has left them unable to complete any of the projects successfully: they focus on non-core issues and potentially under-scrutinise their changes and the impact or scope of those changes.
Personally, I love the products. The core elements anyway.
Please, please, please, though, concentrate – yes, that means stop work on other “stuff” – on getting the basics right.
Your update should have been better tested. The blog states that Google were able to roll out TWICE the capacity originally in place (for the failed process) in double-quick time. This implies there isn’t a lack of resources at Google. Perhaps more real-world modelling should be applied to the tests?
I haven’t even started on the continual ‘minor’ changes to the system that cause businesses such headaches and constant Tech Support calls.
The recent rollout of “Call Phone” with no opt out available at the domain level was another example.
These show an extremely poor approach to changes.
Businesses are way more understanding of SaaS outages caused by external factors than of those where the provider themselves took the system down through a lack of planning. How long will businesses stick it out? Depends on the contract, I suppose.
Microsoft’s service is still a babe in arms, so some teething pains are expected, but Google has been there a long, long time now.
Google Enterprise Blog (What Happened…)