What causes data centers to go offline?

Date: 2013-03-26 12:01:41  Category: Industry Trends

At the end of 2012 the film Lost in Thailand broke Chinese box office records. In the IT world, meanwhile, frequent data center security incidents have shaken enterprise users' confidence again and again. Let us hope that data center security does not end up "lost" as well.
Cloud computing is being touted as the cure-all of this IT era, as if every service could be delivered from the "cloud". But many companies that were the first to try it found that they were often the most exposed. In recent years cloud service outages have come one after another, leaving the industry alarmed.

People are coming back from idealism to reason and seeing cloud computing for what it is. However lofty the vision, a cloud service ultimately has to be delivered from one data center to another, and that process still depends on people, computers, networks, power, storage and more working together. Errors and vulnerabilities somewhere in the chain can therefore hardly be avoided, quite apart from natural disasters and human mistakes. So before you start using a cloud service, have a backup solution ready.
Here the editor reviews the causes behind a series of outages that occurred from 2009 to 2012. You may conclude that even computer errors seem unavoidable, and that redundant safeguards can only keep security incidents within a small probability range.
Outage cause 1: system failure
Typical event 1: Amazon AWS Christmas Eve outage
Cause of failure: malfunction of the Elastic Load Balancing service
On December 24, 2012, Christmas Eve, Amazon gave its customers a rough night. A failure in the AWS US East 1 region data center interrupted its Elastic Load Balancing service, affecting Netflix, Heroku and other websites. Heroku had also been hit by earlier AWS service failures in the US East region. By coincidence, Amazon's own Amazon Prime Instant Video, a competitor of Netflix, was not affected by this failure.
The December 24 interruption was not the first AWS outage, and it will certainly not be the last.
On October 22, 2012, AWS in Northern Virginia was also interrupted, for a similar reason. The incident affected major sites such as Reddit and Pinterest. The services affected were Elastic Beanstalk, followed by the Elastic Beanstalk console, Relational Database Service, ElastiCache, EC2 (Elastic Compute Cloud), and CloudSearch. The incident led many to argue that Amazon should upgrade the infrastructure of its Northern Virginia data center.
On April 22, 2011, a large number of servers in Amazon's cloud data center went down, in what is considered the most serious security incident in the history of Amazon cloud computing. Because of the outage at Amazon's cloud computing center in Northern Virginia, sites including the question-and-answer service Quora, the news service Reddit, Hootsuite, and the location-tracking service Foursquare were affected. Amazon's official report attributed the incident to vulnerabilities and design flaws in its EC2 system, and said the company was continuing to fix these known vulnerabilities and defects to improve the competitiveness of EC2 (Amazon Elastic Compute Cloud).
In January 2010, nearly 68,000 Salesforce.com users experienced at least one hour of downtime. A "systemic error" in Salesforce.com's own data center briefly paralyzed all services, including backup. The incident also exposed a lock-in strategy that Salesforce.com would rather not publicize: its PaaS platform, Force.com, cannot be used outside Salesforce.com. So once Salesforce.com has a problem, Force.com has one too, and if the service is interrupted for a long time the problem becomes very tricky.

Outage cause 2: natural disasters
Typical event 1: Amazon's Dublin, Ireland data center goes down
Cause of failure: lightning struck a transformer at the Dublin data center
On August 6, 2011, lightning in Dublin, Ireland caused a power failure and extensive downtime at the European cloud computing data centers of Amazon and Microsoft. The lightning struck a transformer near the Dublin data centers and caused it to explode. The explosion started a fire that temporarily interrupted utility services, bringing the entire data center down.
This data center is Amazon's only data storage site in Europe. In other words, the EC2 cloud computing platform had no other data center to fall back on during the incident. Websites using Amazon's EC2 cloud platform were interrupted for as long as two days.
Typical event 2: Calgary data center fire
Cause of failure: a fire in the data center
On July 11, 2012, a fire broke out at the Calgary, Alberta data center of Canadian communications provider Shaw Communications Inc., delaying hundreds of surgeries at local hospitals. Because the data center hosted emergency management services, the fire affected the backup systems that support key public services. The incident sounded the alarm for a range of government agencies: they must ensure timely recovery, have failover systems in place, and combine these with a disaster management plan.
Typical event 3: Hurricane Sandy hits data centers
Cause of failure: storm and flooding stopped data centers from running
On October 29, 2012, Hurricane Sandy struck. Data centers in New York and New Jersey were affected: flooding in Lower Manhattan shut down some facilities, and data centers in the surrounding area operated abnormally. Sandy's impact went beyond a single interruption and brought an unprecedented disaster to the data center industry in the affected areas. Diesel became the lifeline of data center recovery. With standby power systems carrying the load for the whole area, special measures were needed to keep the generators fueled. As the immediate work shifts to post-disaster reconstruction, a long discussion of data center siting, engineering and disaster recovery is needed, and it may last for months or even years.

Outage cause 3: human factors
Typical event 1: Hosting.com service interruption
Cause of failure: a service provider operated the circuit breakers in the wrong sequence, shutting down the UPS
On July 28, 2012, Hosting.com suffered an outage. Human error is often considered one of the leading causes of data center downtime, and the July Hosting.com interruption, which cut off service to 1,100 customers, is an example. The outage occurred while preventive maintenance was being carried out on the UPS system at the company's data center in Newark, Delaware. "An incorrect breaker operation sequence carried out by the service provider shut down the UPS, and this was one of the key factors in the loss of power to the data center suite," said Hosting.com chief executive Art Zeile. "No critical power system or standby power system failed. This was caused by human error."
Typical event 2: Microsoft BPOS service interruptions
Cause of failure: an unspecified configuration error in Microsoft's data centers in the United States, Europe and Asia
In September 2010, Microsoft apologized to users for at least three hosted service interruptions within a few weeks in the western United States. This was Microsoft's first major cloud computing incident.
During the incident, customers who accessed BPOS (Business Productivity Online Suite) through Microsoft's North American facilities could run into problems, and the failure lasted for two hours. Although Microsoft engineers later claimed to have solved the problem, they had not fixed its root cause, which led to further service interruptions on September 3 and September 7.
Microsoft's Clint Patterson said the data breach was due to an unspecified configuration error in Microsoft's data centers in the US, Europe and Asia. The Offline Address Book in the BPOS software was, "in very specific circumstances," made available to unauthorized users. This address book contains enterprise contact information.
Microsoft said the error was fixed two hours after it was discovered. Microsoft also said it has tracking facilities that let it contact those who downloaded the data in error so that the data can be removed.

Outage cause 4: system failure
Typical event 1: GoDaddy DNS server outage
Cause of failure: corrupted data tables in a series of internal routers interrupted the network
On September 10, 2012, GoDaddy's DNS servers went down. Domain name giant GoDaddy is one of the most important DNS server providers: it hosts 5 million websites and manages more than 50 million domain names. That is why the September 10 outage may have been the most damaging incident of 2012.
Some speculated that the six-hour interruption was due to a denial-of-service attack, but GoDaddy later said it was caused by corrupted router table data. "The service outage was not caused by external influences," said GoDaddy's interim chief executive, Scott Wagner. "It was not a hack and it was not a denial-of-service attack (DDoS). We have determined the service outage was due to a series of internal network events that corrupted router data tables."
Typical event 2: Shanda Cloud storage outage
Cause of failure: disk damage on a physical server in a data center
At 8:10 on August 6, 2012, Shanda Cloud issued a public statement on its official microblog about user data lost through a cloud host failure. The statement said that on August 6 a physical server disk at Shanda Cloud's Wuxi data center had been damaged, resulting in data loss for "individual users", and that Shanda Cloud was doing its best to help users recover the data.
As to how "a damaged physical server disk" led to data loss for "individual users", Shanda Cloud's technical staff gave their own explanation. A virtual machine's disk can be provisioned in two ways. One uses the host's physical disk directly. In this case, if the host's physical disk fails, the cloud host will inevitably lose data, which is what happened in this incident. The other uses remote storage, which is Shanda's cloud disk product. This approach stores user data in a remote cluster and keeps multiple copies, so that even if the host fails the cloud host's data is not affected. Because damage to physical machines is hard to avoid, the staff advised users to back up their data outside the cloud host to avoid unexpected loss.
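The difference between the two disk modes described above can be sketched in a few lines of Python. This is only an illustration of the general idea of keeping multiple copies; the `Node` and `ReplicatedVolume` classes are hypothetical names invented here, and nothing below reflects how Shanda Cloud actually implements its storage.

```python
class Node:
    """A storage host with a single disk (hypothetical stand-in for a server)."""

    def __init__(self, name):
        self.name = name
        self.disk = {}
        self.failed = False

    def write(self, key, value):
        if self.failed:
            raise OSError(f"{self.name}: disk failed")
        self.disk[key] = value

    def read(self, key):
        if self.failed:
            raise OSError(f"{self.name}: disk failed")
        return self.disk[key]


class ReplicatedVolume:
    """Writes each block to every replica and reads from the first healthy one."""

    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        for node in self.nodes:
            node.write(key, value)

    def read(self, key):
        for node in self.nodes:
            try:
                return node.read(key)
            except (OSError, KeyError):
                continue  # try the next replica
        raise OSError("all replicas unavailable")
```

With a single `Node`, one disk failure makes the data unreadable. With a `ReplicatedVolume` built from three nodes, a read still succeeds after one or two of them fail, which is the property the remote storage mode relies on.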
Typical event 3: Google App Engine service interruption
Cause of failure: network latency
Google App Engine (GAE) is a platform for developing and hosting web applications in data centers managed by Google. The interruption occurred on October 26 and lasted 4 hours, during which the service suddenly became slow and started returning errors. As a result, 50% of GAE requests failed.
Google said that no data was lost and that application behavior could be restored from backup. By way of apology, Google told users in November that it was strengthening its network services to cope with network latency: "We have enhanced our traffic routing capabilities and adjusted the configuration, which will effectively prevent such problems from recurring."

Outage cause 5: system bugs
Typical event 1: Azure global service interruption
Cause of failure: a software bug miscalculated leap-year dates
On February 28, 2012, a "leap year bug" disrupted Microsoft Azure services across large parts of the world for more than 24 hours. Although Microsoft said the software bug was caused by an incorrect leap-year date calculation, the incident drew strong reactions from many users, many of whom asked Microsoft for a more reasonable and detailed explanation.
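The article gives no detail on the defect beyond an incorrect leap-year date calculation. As an illustration only, and not as Microsoft's actual code, one common form of leap-year bug is computing "the same day next year" by bumping the year number. The function names below are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Buggy approach: keep month and day, add one to the year.
    # For February 29 this raises ValueError, because the following
    # year has no February 29.
    return date(d.year + 1, d.month, d.day)


def safe_one_year_later(d: date) -> date:
    # Fall back to February 28 when the target year is not a leap year.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)
```

The naive version works on every day of the year except one, which is why this kind of defect can go unnoticed for up to four years. Whether to fall back to February 28 or roll forward to March 1 is a design choice that depends on what the date is used for.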
Typical event 2: Gmail global failure
Cause of failure: side effects of new code during routine data center maintenance
On February 24, 2009, Google's Gmail suffered a global failure, and the service was interrupted for up to 4 hours. Google explained the cause: during routine maintenance at a European data center, some new code (which tried to keep data geographically close to its owners) had side effects that overloaded another European data center. The effect then cascaded to other data centers and ultimately caused a global outage.
Typical event 3: the "5.19 outage"
Cause of failure: a client software bug made Internet terminals send frequent domain name resolution requests, triggering DNS congestion
At 21:50 on May 19, 2009, users in six provinces (Jiangsu, Anhui, Guangxi, Hainan, Gansu and Zhejiang) reported that websites were slow or inaccessible. A bulletin issued after an investigation by the Ministry of Industry and Information Technology and related bodies said the six-province outage arose from a defect in client software released by a domestic company. When that company's authoritative domain name servers behaved abnormally, Internet terminals with the software installed began sending frequent DNS requests. This caused DNS congestion, which left a large number of users with slow or unreachable websites.
Among those involved, DNSPod is a well-known domain name resolution provider that handles resolution for several well-known websites. The attack paralyzed six DNS servers belonging to DNSPod, which directly broke domain name resolution for many Internet services, including the Storm video player (Baofeng), and caused network congestion that left a large number of users unable to access the Internet normally. The Ministry of Industry and Information Technology said that domain name resolution had become a weak link in network security and instructed all organizations to strengthen the protection of their DNS services.
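A standard way for client software to avoid flooding DNS servers when lookups fail is to wait longer after each failure, with some randomness so that many clients do not retry in step. The sketch below shows capped exponential backoff with jitter. It is a general technique, not the fix applied by the company in this incident, and the names `backoff_delay` and `resolve_with_backoff` are hypothetical. The `resolve` and `sleep` callables are passed in so the sketch can run without real DNS traffic; in real use they could be `socket.gethostbyname` and `time.sleep`.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a randomized delay in seconds that grows with each attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def resolve_with_backoff(resolve, hostname, sleep, max_attempts=5,
                         base=0.5, cap=30.0, rng=random.random):
    """Call resolve(hostname), backing off between failed attempts.

    Re-raises the last OSError once max_attempts have been used.
    """
    for attempt in range(max_attempts):
        try:
            return resolve(hostname)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            sleep(backoff_delay(attempt, base, cap, rng))
```

A client that retries immediately and without limit multiplies its request rate exactly when the servers are least able to cope. Bounding the number of attempts and spacing them out keeps a failure at one server from turning into congestion for everyone.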
Summary
Companies mostly adopt cloud services because they are more convenient and cost-effective. But if those benefits come at the cost of reduced security, many companies would probably not accept the trade. The steady stream of cloud service outages has raised concerns about the security of the cloud.
For now, the problem can be approached from several angles. Enterprise customers should back up their cloud data regularly while using cloud services, and keep a second solution ready for contingencies. Cloud service providers, since outages of various kinds are unavoidable, must plan countermeasures that minimize users' losses and improve how quickly they respond to an outage.
Government departments are responsible for oversight and warning. Laws and regulations on cloud services should be introduced and steadily improved, and users should be reminded that a one hundred percent reliable cloud computing service does not yet exist.
