Why Aliyun's Hong Kong Data Center Went Down

2015-06-25    02'01''

Host: Caixin Financial English (财新金融英语)

Introduction:
Aliyun, the cloud-computing arm of Alibaba, saw its Hong Kong data center go offline for 14 hours on June 21 and could not give customers a clear explanation of the problem. At 9:30 a.m., Aliyun's corporate clients began noticing that they could not access their data. Shortly afterward, Aliyun announced via Sina Weibo, China's Twitter, that the problem was due to a power outage. Later, however, Aliyun reversed itself, stating that the cause was a severed telecommunications cable between Hong Kong and the mainland. When clients questioned whether there was really only one cable, Aliyun once again said that a power outage was to blame. The reason for the shifting explanations is unknown.

Aliyun, which launched the Hong Kong center in 2014, serves more than 1.4 million clients on the mainland, according to its most recent report. Customers can rent servers at Aliyun centers to store data.

In response to an inquiry by Caixin, Aliyun stated that it noticed access to servers at the Towngas Hong Kong facility had stopped at 9:37 a.m. on June 21, and that when contacted, employees at Towngas said the problem was due to a power outage that also set off a fire alarm. However, both Towngas and the Hong Kong fire department denied that a fire alarm went off at the facility. According to Aliyun, power returned at 9:22 p.m., and full data service resumed at 11:39 p.m.

An IT engineer at a large internet company said that even if external power supplies were disrupted, data centers should have generators as a backup. He explained that the Aliyun disruption could have resulted from a failure in the data center's electricity distribution system or from problems with the generator system, but also pointed out that, whatever the cause, a 12-hour shutdown is unacceptable.
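Zhang Guangbin, the founder and CEO of Dongli Technology Company Limited, which provides a directory of companies and their contact information, echoed this sentiment, saying that a disruption lasting several hours is a serious breach for a data center, especially one as large as Aliyun.

Aliyun's service in Hong Kong was interrupted for 14 hours.

The incident began between 9 and 10 a.m. on June 21, when some users of Aliyun's Hong Kong data center found that their services had stopped working. Aliyun then posted a notice on its blog saying that a power problem at its facility operator had caused the failure at the Hong Kong data center and that repairs were under way.

After that, Aliyun's customer service changed its explanation several times. One user told Caixin that a call to Aliyun's official customer service line after the incident produced the explanation that the fiber-optic cable linking Hong Kong and the mainland had been dug up and severed. When the user asked, "Is there really only one cable?", Aliyun replied that the cause was a power failure at the operator.

On June 23, Aliyun sent Caixin a statement explaining the incident. At 9:37 a.m. on June 21, Aliyun's monitoring system detected abnormal access at the facility of Towngas (名气通), the operator of Aliyun's Hong Kong IDC. Towngas reported that a failure in the power supply system had cut power to the entire data center building and triggered a fire alarm. Under local fire regulations, all hazards had to be thoroughly investigated and completely eliminated before anyone was allowed in to carry out electrical repairs. Stable power was formally restored at 9:22 p.m., and Aliyun immediately carried out its contingency plan to restore services one by one. The security protection service returned to normal at 9:32 p.m., other services followed, and all services were restored by 11:39 p.m.

By this account, the roughly 12-hour paralysis of Aliyun's Hong Kong service was mainly due to a power failure at Towngas, which built and operates the facility, and Aliyun was not able to enter the facility for repairs until nearly 12 hours after the power failed. According to Aliyun's explanation, the repairs could not be completed promptly because the whole building had lost power and could not switch to backup power, and because the fire-hazard inspection delayed the work. Aliyun needed only 10 minutes (9:22 p.m. to 9:32 p.m.) to restore the security protection service, and then about two hours to restore all services.

But that is not what happened. Towngas, Aliyun's partner operator in Hong Kong, was suddenly thrust into the spotlight. On the afternoon of June 23, after repeated inquiries from Caixin, Towngas sent an official response that acknowledged the power failure but made no mention of a building-wide outage or of any fire alarm.

A person who took part in the on-site repairs and asked not to be named told Caixin that there had been no fire alarm and no interference from fire-safety measures, that the building as a whole had not lost power, and that only some customers were affected. "It is a rumor that a fire-safety lockdown of the building kept customers from entering the data center to deal with the incident," the person said explicitly.

The head of IT infrastructure technology at a large Chinese internet company analyzed the possible causes of the outage for Caixin. He pointed out that the mainland facilities of Chinese internet companies all have backup diesel generators that can run for a fairly long time even if external power is cut off completely. In his view, the most likely cause of the Hong Kong outage was a failure of the power control equipment, so that the incoming high-power external DC supply could not be switched over, which cut power to servers on a large scale and brought them down. Repairs of this kind take a long time. Another possible cause is that the power supply module of the diesel generators was unavailable or faulty.

Whatever the cause, a 12-hour power outage is considered "too long" in the industry.

Zhang Guangbin, founder of Qishilu (企事录), told Caixin that the highest standard for data centers is generally Tier 4, which guarantees 99.995% availability, meaning that downtime must not exceed half an hour in a year. This is a very high standard, and not many data centers meet it.

Towngas's Hong Kong Data Center 2 is rated no lower than Tier 3, which in theory means that annual downtime should not exceed 1.6 hours.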
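The availability figures above can be checked with simple arithmetic. The following is a minimal sketch, not part of the original report: it converts an availability percentage into a yearly downtime allowance and compares it with the outage times Aliyun gave. The 99.995% figure for Tier 4 comes from the article. The 99.982% figure for Tier 3 is an assumption, the commonly cited value that matches the article's 1.6 hours. The function names are illustrative only.

```python
from datetime import datetime

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Maximum yearly downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two same-day timestamps in 24-hour HH:MM form."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


# Tier 4 availability as quoted in the article.
tier4_budget = allowed_downtime_hours(99.995)
# Tier 3 availability is an assumed value; the article only gives 1.6 hours.
tier3_budget = allowed_downtime_hours(99.982)

# Times from Aliyun's statement: fault detected 9:37 a.m.,
# power restored 9:22 p.m., all services restored 11:39 p.m.
power_outage = hours_between("09:37", "21:22")
full_outage = hours_between("09:37", "23:39")

print(f"Tier 4 allowance: {tier4_budget * 60:.1f} minutes per year")
print(f"Tier 3 allowance: {tier3_budget:.2f} hours per year")
print(f"Power outage:     {power_outage:.2f} hours")
print(f"Full outage:      {full_outage:.2f} hours")
print(f"Full outage / Tier 3 allowance: {full_outage / tier3_budget:.1f}x")
```

Under these assumptions, Tier 4 allows about 26 minutes of downtime per year and Tier 3 about 1.58 hours, while the single outage on June 21 lasted about 14 hours, roughly nine times a full year's Tier 3 allowance.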