Wed 11 Nov 2020 01:13 - 01:14 at Virtual room 2 - Cloud / Services 1

The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. However, in practice, we still observe critical service incidents that occur unexpectedly and that orchestrated diagnosis workflows fail to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and to understand modern incident management systems, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate their underlying causes from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show the practical benefits it has brought to the cloud services of Microsoft.

Materials of ESEC/FSE 2020 industry paper "Towards Intelligent Incident Management: Why We Need It and How We Make It" (fse20ind-p61-p_materials.zip, 4.0 MiB)

Wed 11 Nov

Displayed time zone: (UTC) Coordinated Universal Time

01:00 - 01:30
01:00
2m
Talk
Beware the Evolving ‘Intelligent’ Web Service! An Integration Architecture Tactic to Guard AI-First Components
Research Papers
Alex Cummaudo Deakin University, Australia, Scott Barnett Deakin University, Australia, Rajesh Vasa Deakin University, Australia, John Grundy Monash University, Australia, Mohamed Abdelrazek Deakin University, Australia
01:03
1m
Talk
Efficient Customer Incident Triage via Linking with System Incidents
Industry Papers
Jiazhen Gu Fudan University, China, Jiaqi Wen Peking University, China, Zijian Wang Fudan University, China, Pu Zhao Microsoft Research, China, Chuan Luo Microsoft Research, China, Yu Kang Microsoft Research, China, Yangfan Zhou Fudan University, China, Li Yang Microsoft Azure, USA, Jeffrey Sun Microsoft Azure, USA, Zhangwei Xu Microsoft, China, Bo Qiao Microsoft Research, China, Liqun Li Microsoft Research, China, Qingwei Lin Microsoft Research, China, Dongmei Zhang Microsoft Research, China
01:05
1m
Talk
How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems
Industry Papers
Jiajun Jiang Tianjin University, China, Weihai Lu Peking University, China, Junjie Chen Tianjin University, China, Qingwei Lin Microsoft Research, China, Pu Zhao Microsoft Research, China, Yu Kang Microsoft Research, China, Hongyu Zhang University of Newcastle, Australia, Yingfei Xiong Peking University, Feng Gao Microsoft, China, Zhangwei Xu Microsoft, China, Yingnong Dang Microsoft, USA, Dongmei Zhang Microsoft Research, China
01:07
1m
Talk
Identifying Linked Incidents in Large-Scale Online Service Systems
Research Papers
Yujun Chen Microsoft Research, China, Xian Yang Hong Kong Baptist University, China, Hang Dong Microsoft Research, China, Xiaoting He Chinese Academy of Sciences, China, Hongyu Zhang University of Newcastle, Australia, Qingwei Lin Microsoft Research, China, Junjie Chen Tianjin University, China, Pu Zhao Microsoft Research, China, Yu Kang Microsoft Research, China, Feng Gao Microsoft, China, Zhangwei Xu Microsoft, China, Dongmei Zhang Microsoft Research, China
01:09
1m
Talk
Mono2Micro: An AI-Based Toolchain for Evolving Monolithic Enterprise Applications to a Microservice Architecture
Tool Demos
Anup K. Kalia IBM Research, USA, Jin Xiao IBM Research, USA, Chen Lin IBM Research, USA, Saurabh Sinha IBM Research, John Rofrano IBM Research, USA, Maja Vukovic IBM Research, USA, Debasish Banerjee IBM, n.n.
01:11
1m
Talk
Threshy: Supporting Safe Usage of Intelligent Web Services
Tool Demos
Alex Cummaudo Deakin University, Australia, Scott Barnett Deakin University, Australia, Rajesh Vasa Deakin University, Australia, John Grundy Monash University, Australia
01:13
1m
Talk
Towards Intelligent Incident Management: Why We Need It and How We Make It
Industry Papers
Zhuangbin Chen Chinese University of Hong Kong, China, Yu Kang Microsoft Research, China, Liqun Li Microsoft Research, China, Xu Zhang Microsoft Research, China, Hongyu Zhang University of Newcastle, Australia, Hui Xu Fudan University, China, Yangfan Zhou Fudan University, China, Li Yang Microsoft Azure, USA, Jeffrey Sun Microsoft Azure, USA, Zhangwei Xu Microsoft, China, Yingnong Dang Microsoft, USA, Feng Gao Microsoft, China, Pu Zhao Microsoft Research, China, Bo Qiao Microsoft Research, China, Qingwei Lin Microsoft Research, China, Dongmei Zhang Microsoft Research, China, Michael Lyu CUHK
01:15
15m
Talk
Conversations on Cloud / Services 1
Paper Presentations
Alex Cummaudo Deakin University, Australia, Anup K. Kalia IBM Research, USA, Jiajun Jiang Tianjin University, China, Zhuangbin Chen Chinese University of Hong Kong, China, Moderator: Satish Chandra Facebook, USA