Well, the official word is out from Skype, and it can be summarized like this: the reboots from Microsoft patches triggered a previously undetected condition and crashed the network.
Skype PR staffer Villu Arak writes in “What happened on August 16”:
On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.
The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.
Okay… I can buy that this type of thing could trigger some kind of chain reaction, but I don’t understand why this month was different from any other month. For… what? Two or three years now (more?), Microsoft patches have been coming out like clockwork on the second Tuesday of each month. Each second Tuesday or Wednesday, the millions of computers set to auto-update do so. All those zillions of computers restart automatically. Each and every month. What was so special about this August that was different from every other month? Was the number of restarts in a short period of time really that much different from other months? Why? Is the issue that there are so many more Windows Skype users than in previous months and years? Was this just the so-called “tipping point” when there were enough Windows Skype users that the normal restarts triggered this chain reaction?
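To make the “tipping point” idea concrete, here is a toy model in Python. To be clear, this is my own sketch and not Skype’s actual architecture: the notion that a fixed share of online peers handle log-ins (called supernodes here), and every number in it (`users_per_supernode`, `logins_per_supernode`, `overload_factor`), is an assumption I made up for illustration. All it shows is how a small change in the share of machines restarting at once could flip a network from quick recovery to a self-reinforcing collapse.

```python
def simulate(users, restart_percent, users_per_supernode=100,
             logins_per_supernode=50, overload_factor=4, max_ticks=200):
    """Toy model of a log-in flood after a mass restart.

    Returns the number of ticks until every user is back online,
    or None if the network has not recovered within max_ticks.
    All parameters are invented for illustration only.
    """
    offline = users * restart_percent // 100  # machines that just rebooted
    online = users - offline
    for tick in range(1, max_ticks + 1):
        # Assumption: one in every `users_per_supernode` online peers
        # helps other clients log in.
        supernodes = online // users_per_supernode
        if supernodes == 0:
            return None
        capacity = supernodes * logins_per_supernode
        if offline > overload_factor * capacity:
            # Chain reaction: swamped supernodes serve nobody, drop off
            # the network and join the queue of clients trying to log in.
            online -= supernodes
            offline += supernodes
        else:
            # Normal operation: serve as many pending log-ins as possible.
            served = min(offline, capacity)
            online += served
            offline -= served
        if offline == 0:
            return tick
    return None


if __name__ == "__main__":
    for percent in (30, 60, 66, 67):
        print(percent, simulate(1_000_000, percent))
```

With these made-up numbers, a restart wave hitting 66% of users recovers in a few ticks, while 67% never recovers. The real thresholds, if any exist, are something only Skype can tell us.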
The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk.
In other words, it was not a DDoS by Russian hackers, as one rumor had it (which had actually already been dismissed by every security researcher who looked at the alleged exploit code).
This disruption was unprecedented in terms of its impact and scope. We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions.
Fair enough statement – if you are looking at data or web technologies… but the PSTN, to which Skype would seem to like to be compared, is designed to operate without interruptions (or with as few as possible). You know, there is this wee little market for “carrier-grade” equipment/software/etc. that is designed to be highly available without downtime. If a carrier’s network were down for over 48 hours, there would be a zillion lawsuits, intense government inquiries and more. The carriers that make up what we call the “PSTN” put an incredible effort into ensuring availability. If Skype wants to play in that game, they have to be ready to play at the same level.
Skype has now identified and already introduced a number of improvements to its software to ensure that our users will not be similarly affected in the unlikely possibility of this combination of events recurring.
Good. We would expect that.
I appreciate that Skype has been as communicative as they have through their blog and heartbeat site. Thank you, Skype, for communicating – and leaving the comments open. However, to me the information provided today is still lacking one key piece:
Why were the mass restarts associated with the August 2007 Microsoft updates different from the mass restarts associated with any other month’s Microsoft updates?
(Cross-posted from my Disruptive Telephony blog where I’ve been tracking the Skype outage.)