How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems
In recent years, more and more traditional shrink-wrapped software has come to be provided as 24/7 online services. Incidents (events that lead to service disruptions or outages) can affect service availability and cause great financial loss. Mitigating incidents is therefore important and time-critical. In practice, a document describing a mitigation process, called a troubleshooting guide (TSG), is usually used to reduce the Time To Mitigate (TTM). To investigate the usage of TSGs in real-world online services, we conduct the first empirical study on 18 real-world, large-scale online service systems at Microsoft. We analyze the distribution and characteristics of TSGs among all incident records from the past two years. According to our study, 27.2% of incidents have TSG records, and 36.2% of them occurred at least twice. In addition, developers spend on average around 36.3% of the entire mitigation time on locating the desired TSGs.
Our study shows that incidents can occur repeatedly and that TSGs can be reused to facilitate incident mitigation. Motivated by our empirical study, we propose an automated TSG recommendation approach, DeepRmd, which uses deep learning techniques to leverage the textual similarity between an incident description and its corresponding TSG. We evaluate the effectiveness of DeepRmd on 18 online service systems. The results show that DeepRmd can recommend the correct TSG as the Top 1 returned result for 80.3% of incidents, which significantly outperforms two baseline approaches.
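The idea of recommending a TSG by textual similarity can be illustrated with a minimal sketch. This is not DeepRmd: the abstract does not describe the model, so the sketch swaps in plain bag-of-words cosine similarity in place of a learned deep model. The function names (`tokenize`, `cosine`, `recommend_tsgs`), the TSG identifiers, and the TSG texts are all hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_tsgs(incident_description, tsgs, top_k=1):
    """Rank TSGs by similarity to the incident text; return top_k (id, score) pairs."""
    query = Counter(tokenize(incident_description))
    scored = [
        (tsg_id, cosine(query, Counter(tokenize(text))))
        for tsg_id, text in tsgs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical TSG corpus (assumption: real TSGs are much longer documents).
TSGS = {
    "tsg-disk-full": "Disk usage alert: free disk space on the node by rotating logs",
    "tsg-cert-expired": "TLS certificate expired: renew the certificate and restart the gateway",
    "tsg-high-latency": "High request latency: check the load balancer and scale out the service",
}

# Hypothetical incident description used as the query.
top = recommend_tsgs("Gateway returns TLS errors, certificate expired", TSGS)
print(top[0][0])
```

A lexical measure like this only matches incidents and TSGs that share vocabulary; a learned model, as the abstract describes, is meant to capture similarity beyond exact word overlap.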
Wed 11 Nov. Displayed time zone: (UTC) Coordinated Universal Time
01:00 - 01:30 | Cloud / Services 1 | Industry Papers / Research Papers / Paper Presentations / Tool Demos at Virtual room 2
01:00 2m Talk | Beware the Evolving ‘Intelligent’ Web Service! An Integration Architecture Tactic to Guard AI-First Components | Research Papers | Alex Cummaudo Deakin University, Australia, Scott Barnett Deakin University, Australia, Rajesh Vasa Deakin University, Australia, John Grundy Monash University, Australia, Mohamed Abdelrazek Deakin University, Australia
01:03 1m Talk | Efficient Customer Incident Triage via Linking with System Incidents | Industry Papers | Jiazhen Gu Fudan University, China, Jiaqi Wen Peking University, China, Zijian Wang Fudan University, China, Pu Zhao Microsoft Research, China, Chuan Luo Microsoft Research, China, Yu Kang Microsoft Research, China, Yangfan Zhou Fudan University, China, Li Yang Microsoft Azure, USA, Jeffrey Sun Microsoft Azure, USA, Zhangwei Xu Microsoft, China, Bo Qiao Microsoft Research, China, Liqun Li Microsoft Research, China, Qingwei Lin Microsoft Research, China, Dongmei Zhang Microsoft Research, China
01:05 1m Talk | How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems | Industry Papers | Jiajun Jiang Tianjin University, China, Weihai Lu Peking University, China, Junjie Chen Tianjin University, China, Qingwei Lin Microsoft Research, China, Pu Zhao Microsoft Research, China, Yu Kang Microsoft Research, China, Hongyu Zhang University of Newcastle, Australia, Yingfei Xiong Peking University, Feng Gao Microsoft, China, Zhangwei Xu Microsoft, China, Yingnong Dang Microsoft, USA, Dongmei Zhang Microsoft Research, China
01:07 1m Talk | Identifying Linked Incidents in Large-Scale Online Service Systems | Research Papers | Yujun Chen Microsoft Research, China, Xian Yang Hong Kong Baptist University, China, Hang Dong Microsoft Research, China, Xiaoting He Chinese Academy of Sciences, China, Hongyu Zhang University of Newcastle, Australia, Qingwei Lin Microsoft Research, China, Junjie Chen Tianjin University, China, Pu Zhao Microsoft Research, China, Yu Kang Microsoft Research, China, Feng Gao Microsoft, China, Zhangwei Xu Microsoft, China, Dongmei Zhang Microsoft Research, China
01:09 1m Talk | Mono2Micro: An AI-Based Toolchain for Evolving Monolithic Enterprise Applications to a Microservice Architecture | Tool Demos | Anup K. Kalia IBM Research, USA, Jin Xiao IBM Research, USA, Chen Lin IBM Research, USA, Saurabh Sinha IBM Research, John Rofrano IBM Research, USA, Maja Vukovic IBM Research, USA, Debasish Banerjee IBM, n.n.
01:11 1m Talk | Threshy: Supporting Safe Usage of Intelligent Web Services | Tool Demos | Alex Cummaudo Deakin University, Australia, Scott Barnett Deakin University, Australia, Rajesh Vasa Deakin University, Australia, John Grundy Monash University, Australia
01:13 1m Talk | Towards Intelligent Incident Management: Why We Need It and How We Make It | Industry Papers | Zhuangbin Chen Chinese University of Hong Kong, China, Yu Kang Microsoft Research, China, Liqun Li Microsoft Research, China, Xu Zhang Microsoft Research, China, Hongyu Zhang University of Newcastle, Australia, Hui Xu Fudan University, China, Yangfan Zhou Fudan University, China, Li Yang Microsoft Azure, USA, Jeffrey Sun Microsoft Azure, USA, Zhangwei Xu Microsoft, China, Yingnong Dang Microsoft, USA, Feng Gao Microsoft, China, Pu Zhao Microsoft Research, China, Bo Qiao Microsoft Research, China, Qingwei Lin Microsoft Research, China, Dongmei Zhang Microsoft Research, China, Michael Lyu CUHK
01:15 15m Talk | Conversations on Cloud / Services 1 | Paper Presentations | Alex Cummaudo Deakin University, Australia, Anup K. Kalia IBM Research, USA, Jiajun Jiang Tianjin University, China, Zhuangbin Chen Chinese University of Hong Kong, China, M: Satish Chandra Facebook, USA