Microsoft Azure Products

Incident Management Process (IcM) describes the activities an organization undertakes to identify, analyze, and correct hazards so that they do not recur. Left unmanaged, an incident can escalate into an emergency, a crisis, or a disaster.

The first goal of IcM is to restore normal service operation as quickly as possible and to minimize the impact on business operations, so that the best possible levels of service quality and availability are maintained.

THE PROBLEM

Microsoft Azure has a few thousand services linked in a complex hierarchy across its global organization. These services are used both internally and externally, and an outage can cause major disruption for customers.

The current IcM solutions have a few issues:

Lack of context about the incident
No suggestions on how to fix the issues based on similar incidents
No prioritization or severity levels
No escalation of an incident to a crisis
No visibility for leadership teams

THE SOLUTION

Improving communications between teams
Managing on-call solutions across multiple teams
Enhanced analytics to identify bottlenecks in mitigating incidents
Setting up operational rules and patterns that monitor for and warn of potential outages

MY ROLE

Contextual walkthrough to learn how Directly Responsible Individuals (DRIs) resolve incidents
Evaluation of ITIL methodologies to identify best practices and UX opportunities
Collaboration with key stakeholders and input from SMEs to identify Microsoft service processes and third-party integration points
Execution of complex interactive prototypes to simulate system interactions, incident resolutions, and configuration
Work towards a new visual language (Fluent) informed by existing design patterns
Rigorous user research and testing to validate visual metaphors and conceptual frameworks