What causes all these outages on the Web and Internet?
Another large service outage, this time Facebook for over two hours, has highlighted the fact that even the largest infrastructure can be either flawed or assaulted.
Techies like myself are constantly striving to prevent system and service outages. We pride ourselves, hopefully, on designing systems that are as resilient as we can make them from Day One. Most times the Admins stay ahead of the tide of failures, attacks, faults, etc.
But we do fail, like anyone else.
Firstly, I’d like to point out quite how complex a computer is, let alone the systems and software running on it. Someone once said that a computer is a tool like an oven: capable of producing and processing many things in many ways, but essentially reliant on similar items as input (food) and some external tools (pots, pans), and requiring the Operator (cook) to work within certain parameters.
It’s a reasonable metaphor but really a computer is a little more, rather like an entire kitchen. You have many elements (fridge, oven, microwave, table, sink etc) all working together and providing items to each other. But it absolutely and definitely has limitations and relies on the “cook”.
Then add in the complexity of a Data Centre, maybe a thousand servers all performing various tasks, like a big restaurant kitchen. Starters, soups, main courses, desserts and beverages all have to be prepared, stored and of course served. Similarly for audio, visuals, data processing, database storage, web pages, reports and so forth.
So be gentle with the Sys Admins, remember they probably love the system they work on even more than the users; it’s their baby!
So here we go with a rundown of the most common outages and failures.
Denial of Service. Well known now but getting rarer. DOS attacks start from a distinct source with the sole intention of flooding a system with either so many requests or so much information that the system cannot cope.
Every system has a finite number of resources and one of the most limited is an element called “sockets”. This is literally how many doors on to the system are available at one time, thus how many users can use the system concurrently.
Some systems are a push-and-pause kind of service and others are continuous streams. Static web pages (so-called brochure and informational sites) are the easy example. The Server receives a request, processes it and returns the appropriate information. The Socket is only open for the period from the beginning of the request to all the information being received by the requestor. Usually the user then has to read/process the returned information before they ask for more, if they do at all. Thus a pause ensues. The actual time that the socket is open depends largely on how complex the original request is to process and how long the response takes to deliver.
Dynamic content provides pages of information in the same manner but the same page contains different information at different times. They are usually populated from databases of information and thus can be more susceptible to assault as there is more processing involved.
Streamed content quite simply provides information constantly to the connected user. So the socket stays open whilst the user is still using the system. A lot of online games work like this, constantly updating the information the user receives as the environment in the game changes.
However, some items seem like streams but are really many static calls sent/received quickly. The BBC iPlayer is an example of this.
So if a malcontent can send enough requests to the service, eventually the service will run out of sockets. If the requests are complex then this is a bonus to the attacker: each request takes longer to process (processing ability is limited), so the sockets stay open longer and the attack can be smaller.
Similarly, streamed services are easier to perform DOS on as the socket stays open once the connection is made.
Note that the system’s ability to process and return information efficiently is essential to limiting the effectiveness of DOS. If you have a huge database that requires difficult queries to get information out of it, but have severely limited the processing power available to that DB system, then the whole system is prone to DOS. Very bad planning.
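To put rough numbers on the socket problem, here is a minimal sketch in Python. It leans on a standard queueing rule of thumb (Little's law: sockets in use = requests arriving per second × seconds each socket is held). The figure of 1,000 sockets and the two hold times are my own assumptions, purely for illustration, and not measurements from any real system.

```python
def requests_per_second_to_exhaust(max_sockets: int, hold_seconds: float) -> float:
    """Steady request rate at which every socket is busy.

    Little's law: sockets in use = arrival rate * time each socket is held,
    so the rate that fills the pool is max_sockets / hold_seconds.
    """
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    return max_sockets / hold_seconds


MAX_SOCKETS = 1000  # assumed pool size, for illustration only

# Assumed hold times: a quick static page versus a slow database-backed page.
for label, hold_seconds in [("simple static page", 0.05), ("heavy database query", 2.0)]:
    rate = requests_per_second_to_exhaust(MAX_SOCKETS, hold_seconds)
    print(f"{label}: roughly {rate:.0f} requests/second ties up all {MAX_SOCKETS} sockets")
```

The slower each request is to process, the fewer requests per second the attacker needs, which is exactly why an under-powered database makes the whole service easier to knock over.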
Distributed Denial Of Service. Quite simply (historically) a concerted effort by many individuals to perform DOS at the same time. DOS is less popular now as it is easy to trace the individual or source and simply block it. With DDOS this is extremely difficult, and usually impossible, because there are many more sources. The most common form of DDOS now is via malware, where a Trojan/Worm/Virus infects a perfectly innocent machine and simply sits there inactive until the controller/author sends the trigger and ALL the infected machines start the attack at the same time.
So this has turned into the most popular attack format. Terribly destructive when the hyper-intelligent hackers get involved (there are even communities of malcontents that organise and concentrate efforts) but even so-called Script Bunnies can cause massive problems.
The Internet is full of information and some people, in the name of openness, publish flaws and attack vectors for all sorts of systems. On the back of these unquestionably intelligent people’s work, others elect to turn the information into assaults on systems and servers.
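The difference between blocking a DOS and a DDOS can be shown with a small Python sketch. This is only an illustration under my own assumptions: the function name, the limit of 100 requests per source and the made-up source names are all hypothetical, and real defences are a good deal more involved than a simple count.

```python
from collections import Counter


def sources_to_block(request_sources, per_source_limit):
    """Return the set of sources that sent more requests than the limit."""
    counts = Counter(request_sources)
    return {source for source, count in counts.items() if count > per_source_limit}


# DOS: one source sends 5,000 requests (address from a documentation-only range).
dos_traffic = ["203.0.113.9"] * 5000

# DDOS: 5,000 infected machines send one request each (hypothetical names).
ddos_traffic = [f"infected-machine-{i}" for i in range(5000)]

print(sources_to_block(dos_traffic, per_source_limit=100))         # the lone source stands out
print(len(sources_to_block(ddos_traffic, per_source_limit=100)))   # no single source crosses the limit
```

Both floods put the same number of requests on the service, but in the distributed case no individual machine looks any busier than a normal user.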
Feedback loop. A simple flaw, this one, but it can be exploited in an attack.
Basically, feedback occurs when a request generates not only a response but also another, similar request. An apparently completed request can therefore hold resources over an extended period. The effect is similar to a stream, holding resources (semi) permanently.
As more requests are received so the resources diminish and are not returned.
If a feedback loop generates, even if not every time, more than one secondary request then a single request can cause a cascade that by itself can bring the system to a complete halt.
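A few lines of Python show how quickly such a cascade grows. This is a minimal sketch that only tracks the average number of requests per "generation" of the loop; the function name and the example averages are my own assumptions, and a real system would be far messier.

```python
def requests_per_generation(secondary_per_request, generations, initial=1.0):
    """Expected number of requests in each generation of a feedback loop.

    secondary_per_request is the average number of new requests that each
    request generates (it can be below 1 if the loop does not fire every time).
    """
    counts = [float(initial)]
    for _ in range(generations):
        counts.append(counts[-1] * secondary_per_request)
    return counts


# Assumed averages: a loop that fades, one that sustains itself, one that cascades.
for average in (0.5, 1.0, 1.5):
    final = requests_per_generation(average, generations=10)[-1]
    print(f"average of {average} secondary requests: {final:.2f} requests in generation 10")
```

An average below one fades away, exactly one keeps a request circulating forever, and anything above one multiplies until the system runs out of resources.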
A delayed feedback loop is similar but the secondary request is not generated for a moment or two, or possibly a few hours.
Think of a service that receives an email and actions it. If another email is generated as a result then it is possible that the second mail eventually (possibly through forwards and filters) gets returned to the originating system. It then generates another secondary mail. And so on. There could be a significant gap between the original and the next self-created mail being received. It can therefore be difficult to diagnose the root cause.
The majority of attacks are usually a result of the above mechanisms and a few others of a similar nature. Occasionally, however, we Sys Admins deliberately remove a service temporarily.
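One common guard against this kind of mail loop is to mark machine-generated mail and refuse to auto-respond to anything so marked. The Python sketch below uses the Auto-Submitted header from RFC 3834 and the standard library's email package. It is my own illustration rather than anything a particular mail service is known to do, and the two function names are hypothetical.

```python
from email.message import EmailMessage


def should_auto_reply(incoming: EmailMessage) -> bool:
    """Only auto-reply to mail that is not itself marked as machine generated."""
    auto_submitted = str(incoming.get("Auto-Submitted", "no")).strip().lower()
    return auto_submitted == "no"


def build_auto_reply(incoming: EmailMessage, body: str) -> EmailMessage:
    """Build a reply that is clearly marked as automatic (RFC 3834)."""
    reply = EmailMessage()
    reply["To"] = str(incoming.get("From", ""))
    reply["Subject"] = "Re: " + str(incoming.get("Subject", ""))
    reply["Auto-Submitted"] = "auto-replied"
    reply.set_content(body)
    return reply


# A normal mail from a person gets an automatic reply...
original = EmailMessage()
original["From"] = "user@example.com"
original["Subject"] = "Help"
reply = build_auto_reply(original, "We have received your mail.")

# ...but if that reply ever finds its way back, it is not answered again.
print(should_auto_reply(original), should_auto_reply(reply))
```

This only breaks the loop if every hop preserves the header, so forwards and filters that rewrite mail can still defeat it.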
Standard maintenance outage (SMO). Simply put, a prearranged point in time when the service will be significantly impacted. Usually required for significant security upgrades, called patching, or a major change to the service, such as an upgrade.
These never take place without significant testing but they still occasionally cause secondary outages if a change fails or causes performance issues in another related system.
Standard maintenance plan/window (SMP/SMW). Similar to the SMO except that these are specific times at which there may be an outage for minor changes or upgrades. Occasionally used for backing up a system too. Usually outside normal peak hours.
Principal period of maintenance (PPM). The same as an SMP except very formal. Usually used in service agreements for paid-for services, business connectivity and such. If a service is supposed to be available 99.9% of the time, the PPM is usually excluded from such calculations.
Emergency maintenance outage (EMO). Nasty. The service owners have basically decided that an event or flaw within the service is so bad that they have no choice but to make significant changes with little or no notice. Usually requires the knowledge and permission of the executives or their representatives. There is always a risk with an EMO of unforeseen secondary issues as the timescales usually do not allow as much testing as the SMO or SMW.
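To show what excluding the PPM does to the sums, here is a rough Python sketch. Agreements differ in how they word this, so treat the approach (taking the PPM hours out of the measured period) and all the figures (a 30-day month, a 4-hour PPM, half an hour of unplanned outage) as assumptions of mine for illustration only.

```python
def allowed_downtime_minutes(target_percent, period_hours, ppm_hours=0.0):
    """Minutes of unplanned downtime permitted while still meeting the target."""
    measured_hours = period_hours - ppm_hours  # PPM time is not measured at all
    return measured_hours * 60 * (1 - target_percent / 100)


def measured_availability(period_hours, unplanned_outage_hours, ppm_hours=0.0):
    """Availability as a percentage of the measured period (PPM excluded)."""
    measured_hours = period_hours - ppm_hours
    return 100 * (measured_hours - unplanned_outage_hours) / measured_hours


MONTH_HOURS = 30 * 24  # assumed 30-day month
PPM_HOURS = 4          # assumed maintenance period per month

budget = allowed_downtime_minutes(99.9, MONTH_HOURS, PPM_HOURS)
achieved = measured_availability(MONTH_HOURS, unplanned_outage_hours=0.5, ppm_hours=PPM_HOURS)
print(f"Unplanned downtime allowed at 99.9%: about {budget:.0f} minutes")
print(f"Availability with 30 minutes of unplanned outage: {achieved:.3f}%")
```

The point to take away is that the four hours of planned work never count against the 99.9%, only the unplanned half hour does.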
As I said, nasty.
Reduced service plan/window. Very similar to the SMO and SMW except that not all of the service is affected at the same time. Perhaps you can read old entries on a site but cannot post new ones. Usually required for a minor backend change, the addition of equipment or a system backup.
Outages cannot always be avoided. There are far too many malcontents and script bunnies, let alone the normal issue of hardware and software reliability. Be patient with us, we’ll get you going ASAP!
If you have any topic you want me to cover, let me know. Also, if you see any mistakes, inaccuracies or such, simply pop in a comment.