The Value of Automation
Problem statement: manual tasks, long troubleshooting times, increased downtime, and not enough information for root cause analysis (RCA).
It’s a long holiday weekend. Most of the support staff have taken time off and are unreachable, leaving a “skeleton” support team on call for emergencies. A major storm sweeps through the region, causing network outages that take many of your customers offline.
On a “normal” day the support staff would have seen the alerts and alarms coming in from the network and would have started troubleshooting the issues. But this is not that “normal” day.
The network team has SLA obligations to its customers to minimize downtime, so service must be restored quickly. What do you do? Along comes Operational Process Automation (OPA).
Operational Process Automation is about getting the right systems and workflows in place to automate repetitive operational tasks, like troubleshooting steps, to improve efficiency and ensure consistency in operations teams.
So how does OPA work?
- DETECT incidents that require action by processing data from the environment and looking for symptoms that indicate operational risk. Identify the alarms generated and reported through the event management tools that match the characteristics of the incident.
- DIAGNOSE the detected incidents by determining what steps are normally, and sometimes routinely, taken to troubleshoot them. Document the troubleshooting workflow based on the information collected, identifying the tasks necessary to examine and ultimately resolve the incident.
- Determine what ACTIONS are required to resolve the incident or implement workarounds. Identify the common tasks and steps to complete, based on the troubleshooting workflow, and use the Opmantek software to perform as many of them as possible (e.g. SSH into a router, issue “show” commands, and capture the output).
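The three steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the Opmantek software's API: the `Event` and `Incident` types, the `WORKFLOWS` table, the event names, and the `fake_run` stand-in for an SSH session are all hypothetical, and a real deployment would replace `fake_run` with whatever actually connects to the device.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    device: str
    name: str  # alarm name as reported by the event management tool


@dataclass
class Incident:
    device: str
    symptom: str
    diagnostics: Dict[str, str] = field(default_factory=dict)


# Hypothetical documented workflows: alarm name -> commands an engineer
# would normally run by hand. The command strings are placeholders.
WORKFLOWS: Dict[str, List[str]] = {
    "Interface Down": ["show interfaces", "show logging"],
    "BGP Peer Down": ["show ip bgp summary"],
}


def detect(events: List[Event]) -> List[Incident]:
    """DETECT: keep only events that match a documented workflow."""
    return [Incident(e.device, e.name) for e in events if e.name in WORKFLOWS]


def diagnose(incident: Incident, run: Callable[[str, str], str]) -> Incident:
    """DIAGNOSE: run each documented command and capture its output."""
    for command in WORKFLOWS[incident.symptom]:
        incident.diagnostics[command] = run(incident.device, command)
    return incident


def act(incident: Incident) -> str:
    """ACT: here, only summarize what was captured for the on-call engineer."""
    return (
        f"{incident.device}: {incident.symptom} "
        f"({len(incident.diagnostics)} outputs captured)"
    )


def fake_run(device: str, command: str) -> str:
    # Stand-in for an SSH session; returns canned text instead of connecting.
    return f"<output of '{command}' on {device}>"


events = [
    Event("router1", "Interface Down"),
    Event("router2", "Fan Speed Notice"),  # no workflow, so it is ignored
]
reports = [act(diagnose(incident, fake_run)) for incident in detect(events)]
```

Passing the command runner in as a parameter keeps the workflow logic testable without touching a live network, which is useful while the steps are still being reviewed.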
As you introduce automated remediation actions into the operational environment, technical reviews and change approvals will be required to ensure the accuracy and consistency of the steps taken. As the corrective actions mature and prove over time to be stable, accurate, and consistent, they can become more automatic and eventually be integrated with other processes and tools, like the incident ticketing system.
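One way to picture this progression is a simple gate that decides how much human sign-off an action still needs. The sketch below is an assumption-laden illustration: the `RemediationAction` class, the three mode names, and the `REQUIRED_CLEAN_RUNS` threshold are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass

# Hypothetical threshold: consecutive successful runs needed before an
# action may run without a per-incident change approval.
REQUIRED_CLEAN_RUNS = 5


@dataclass
class RemediationAction:
    name: str
    reviewed: bool = False  # has the technical review been completed?
    clean_runs: int = 0     # consecutive successful executions so far

    def record_run(self, succeeded: bool) -> None:
        # A failure resets the track record, so trust must be re-earned.
        self.clean_runs = self.clean_runs + 1 if succeeded else 0

    def mode(self) -> str:
        if not self.reviewed:
            return "blocked"         # never run unreviewed actions
        if self.clean_runs < REQUIRED_CLEAN_RUNS:
            return "needs-approval"  # run only after a human approves
        return "automatic"           # proven stable; run without sign-off


action = RemediationAction("restart-interface")
modes = [action.mode()]          # before technical review
action.reviewed = True
modes.append(action.mode())      # reviewed, but no track record yet
for _ in range(REQUIRED_CLEAN_RUNS):
    action.record_run(True)
modes.append(action.mode())      # enough clean runs
action.record_run(False)
modes.append(action.mode())      # one failure sends it back for approval
```

Resetting the counter on failure is a deliberately conservative choice for the sketch; a team could just as reasonably use a success rate over a rolling window.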
Operational Process Automation not only improves Mean Time to Restore Service, but it also drives down the cost of service delivery and fosters higher network availability. Automating routine tasks increases productivity in the technical and operations teams, produces consistent and accurate results, increases staff job satisfaction, educates less experienced support staff, and ultimately increases customer satisfaction.
So back to our major storm outage. Thank goodness the technical teams put OPA in place a few weeks ago. Automating the necessary troubleshooting and remediation steps led to quicker resolution of the network incidents, minimizing downtime and keeping within the SLA guidelines established with the customers.
Put the knowledge of the engineers and operations teams into the “hands” of the machines.