Write a Blog >>
Wed 11 Nov 2020 01:35 - 01:36 at Virtual room 2 - Cloud / Services 2

In large-scale cloud systems, unplanned service interruptions and outages may cause severe degradation of service availability. Such incidents can occur in a bursty manner, which will deteriorate user satisfaction. Identifying incidents rapidly and accurately is critical to the operation and maintenance of a cloud system. In industrial practice, incidents are typically detected through analyzing the issue reports, which are generated over time by monitoring cloud services. Identifying incidents in a large number of issue reports is quite challenging. An issue report is typically multi-dimensional: it has many categorical attributes. It is difficult to identify a specific attribute combination that indicates an incident. Existing methods generally rely on pruning-based search, which is time-consuming given high-dimensional data, thus not practical to incident detection in large-scale cloud systems. In this paper, we propose MID (Multi-dimensional Incident Detection), a novel framework for identifying incidents from large-amount, multi-dimensional issue reports effectively and efficiently. Key to the MID design is encoding the problem into a combinatorial optimization problem. Then a specific-tailored meta-heuristic search method is designed, which can rapidly identify attribute combinations that indicate incidents. We evaluate MID with extensive experiments using both synthetic data and real-world data collected from a large-scale production cloud system. The experimental results show that MID significantly outperforms the current state-of-the-art methods in terms of effectiveness and efficiency. Additionally, MID has been successfully applied to Microsoft's cloud systems and helped greatly reduce manual maintenance effort.

Conference Day
Wed 11 Nov

Displayed time zone: (UTC) Coordinated Universal Time change

01:30 - 02:00
01:30
2m
Talk
A Principled Approach to GraphQL Query Cost AnalysisACM SIGSOFT Distinguished Paper Award
Research Papers
Alan ChaIBM Research, USA, Erik WitternIBM, USA, Guillaume BaudartIBM Research, USA, James C. DavisPurdue University, USA, Louis MandelIBM Research, USA, Jim A. LaredoIBM Research, USA
DOI Pre-print Media Attached
01:33
1m
Talk
Block Public Access: Trust Safety Verification of Access Control Policies
Research Papers
Malik BouchetAmazon, USA, Byron CookAmazon, Bryant CutlerAmazon, USA, Anna DruzkinaAmazon, USA, Andrew GacekAmazon, USA, Liana HadareanAmazon, Ranjit JhalaAmazon, USA, Brad MarshallAmazon, USA, Dan PeeblesAmazon, USA, Neha RungtaAmazon Web Services, Cole SchlesingerAmazon, USA, Chriss StephensAmazon, USA, Carsten VarmingAmazon, USA, Andy WarfieldAmazon, USA
DOI
01:35
1m
Talk
Efficient Incident Identification from Multi-dimensional Issue Reports via Meta-heuristic Search
Research Papers
Jiazhen GuFudan University, China, Chuan LuoMicrosoft Research, China, Si QinMicrosoft Research, n.n., Bo QiaoMicrosoft Research, China, Qingwei LinMicrosoft Research, China, Hongyu ZhangUniversity of Newcastle, Australia, Ze LiMicrosoft, USA, Yingnong DangMicrosoft, USA, Shaowei CaiInstitute of Software at Chinese Academy of Sciences, China, Wei-Cheng WuUniversity of Southern California, USA, Yangfan ZhouFudan University, China, Murali ChintalapatiMicrosoft, n.n., Dongmei ZhangMicrosoft Research, China
DOI
01:37
1m
Talk
Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis
Industry Papers
Xiaofeng GuoFudan University, China, Xin PengFudan University, China, Hanzhang WangeBay, Wanxue LieBay, USA, Huai JiangeBay, USA, Dan DingFudan University, China, Tao XiePeking University, Liangfei SueBay, USA
DOI
01:39
1m
Talk
Real-Time Incident Prediction for Online Service Systems
Research Papers
Nengwen ZhaoTsinghua University, Junjie ChenTianjin University, China, Zhou WangBizSeer, China, Xiao PengBeijing University of Posts and Telecommunications, China, Gang WangChina EverBright Bank, Yong WuChina EverBright Bank, Fang ZhouChina EverBright Bank, Zhen FengEverBright Bank, China, Xiaohui NieEverBright Bank, China, Wenchi ZhangTsinghua University, China, Kaixin SuiBizSeer, Dan PeiBizSeer, China
DOI
01:41
1m
Talk
Scaling Static Taint Analysis to Industrial SOA Applications: A Case Study at Alibaba
Industry Papers
Jie WangPeking University, China / Ant Group, China / Alibaba Group, China, Yunguang WuAnt Group, China, Gang ZhouAnt Group, China, Yiming YuAnt Group, China, Zhenyu GuoAnt Group, China, Yingfei XiongPeking University
DOI
01:43
17m
Talk
Conversations on Cloud / Services 2
Paper Presentations
Alan ChaIBM Research, USA, Andrew Gacek, Jiazhen Gu, Jie WangInstitute of Software, Chinese Academy of Sciences, Nengwen ZhaoTsinghua University, Xiaofeng GuoFudan University, China, M: Satish ChandraFacebook, USA