When we settle into that time of year when we reflect on what we’re grateful for, we tend to focus on important basics like health, family, and friends.
But on a professional level, IT operations (ITOps) people are grateful for avoiding catastrophic outages that can cause confusion, frustration, loss of revenue, and reputational damage. The very last thing that ITOps, network operations center (NOC) or site reliability engineering (SRE) teams want while eating turkey and enjoying family time is to receive notification of an outage. These can be extremely expensive: $12,913 per minute in practice, and up to $1.5 million per hour for larger organizations.
To understand the peace of mind of avoiding downtime, however, you must first endure the pain and anxiety of an outage. Here are some horror stories that ITOps experts are grateful to avoid this season.
A case of janky command structure
A longtime IT professional was in the middle of a shift with three others when 7 p.m. rolled around. The team received an alert about a problem affecting the user interface for its global traffic management device. Thankfully, there was already a runbook for it in the database, so it looked like the problem would be resolved quickly. One of the team members saw two things to enter: A command and an extra input. He entered the command and, based on the runbook’s layout, waited for the command line to ask for input, such as “What do you want to restart?”
The way the command structure was set up, if you didn’t provide an input, the device itself would restart. He entered the command he thought was correct, “big boot, reboot,” and the entire front-end global traffic manager was taken down.
Again, this took place early in the evening. The client was a financial company, and the system went down at the exact moment businesses were closing and trying to complete their books and other finance-related tasks. Terrible timing, to say the least.
Five minutes into the outage, the ITOps team realized what had happened: The tool they used for their runbooks wrapped text by default, so what looked like two separate commands was actually just one. Although the outage was relatively brief, it came at a critical time and created a chain reaction that gave users headaches. Lesson learned? Make sure your command structure is optimized.
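One way to picture an "optimized" command structure is a parser that fails closed: a disruptive action entered without a target is rejected rather than applied to the whole device. The sketch below is illustrative only. The action names, `MissingTargetError` and `parse_command` are hypothetical and do not correspond to any real device's command line.

```python
import shlex
from typing import Optional, Tuple

# Hypothetical action names for illustration; not taken from any real device CLI.
TARGET_REQUIRED = {"restart", "reboot", "stop"}


class MissingTargetError(ValueError):
    """Raised when a disruptive action is entered without an explicit target."""


def parse_command(line: str) -> Tuple[str, Optional[str]]:
    """Split one entered line into (action, target).

    Disruptive actions must name their target on the same line, so a
    half-entered command can never default to restarting everything.
    """
    parts = shlex.split(line)
    if not parts:
        raise ValueError("empty command")
    action, args = parts[0], parts[1:]
    if action in TARGET_REQUIRED and not args:
        # Fail closed instead of falling back to "the whole device".
        raise MissingTargetError(
            f"'{action}' needs an explicit target, e.g. '{action} <service>'"
        )
    return action, (" ".join(args) if args else None)
```

With a guard like this, `parse_command("restart webui")` returns `("restart", "webui")`, while a bare `restart` raises an error before anything is touched.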
When Google is your best friend in the middle of the night
For an IT veteran of more than 15 years, what seemed like a quiet night shift quickly turned into an anxious nightmare. “I have never found myself panicking as quickly as when the remote terminal I was using suddenly went blank,” he said.
What he was trying to do was restart a service while working on a remote machine, but he accidentally disabled the network adapter in the process. Calling someone and waking them up in the middle of the night to tell them he had disabled a network adapter wasn’t ideal, so he and his teammates set out to do some digging.
After what he called a “negligible amount of Googling,” he was able to find his way into the Dell server and restart the network adapter from there. It took longer than necessary to fix, but the problem was finally solved.
His pro tip: “Don’t turn off the network adapter on a machine you’re controlling remotely in the middle of the night.” That may sound obvious, but the basic lesson is to have a backup plan in place should something go wrong.
ITOps: Relying on email is great — until it’s gone
Back when email was the primary way NOC teams received notifications, one longtime IT professional recalls having a teammate whose sole job was essentially triage: Tracking emails and creating tickets, some for issues that needed attention now and others for problems the team could handle later. The system worked fine, but at a large multinational corporation it was really a ticking time bomb.
That fear was realized when the company’s entire data center went down.
That was a set of problems on its own, but the outage generated so many email alerts that it also crashed the company’s Outlook server. “At that point, you were really blind,” the IT professional recalls.
The event happened in the middle of the night, so the crew reluctantly had to wake their teammates up. After the problem was finally resolved, the team developed a sense of humor about it. As they recall: “We used to joke that we DDoSed ourselves with our own alert noise. Good times!”
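A common guard against this kind of alert storm is to coalesce repeated alerts into a single incident, so that a failing data center produces a handful of notifications rather than thousands of emails. The sketch below is a minimal illustration, not a description of any product. The `AlertCoalescer` class, its five-minute default window, and the `(source, check)` key are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Incident:
    """One open incident that absorbs repeated alerts for the same key."""
    first_seen: float
    last_seen: float
    count: int = 1


class AlertCoalescer:
    """Hypothetical helper: notify once per (source, check) per quiet window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._open: Dict[Tuple[str, str], Incident] = {}

    def ingest(self, source: str, check: str, timestamp: float) -> bool:
        """Record an alert; return True only if it should trigger a notification."""
        key = (source, check)
        incident = self._open.get(key)
        if incident is not None and timestamp - incident.last_seen <= self.window_seconds:
            # Same problem, still ongoing: count it but stay quiet.
            incident.last_seen = timestamp
            incident.count += 1
            return False
        # New problem, or the old one went quiet long enough to count as new.
        self._open[key] = Incident(first_seen=timestamp, last_seen=timestamp)
        return True

    def incident(self, source: str, check: str) -> Optional[Incident]:
        """Return the current incident for a key, if one exists."""
        return self._open.get((source, check))
```

In this sketch, only the alerts for which `ingest` returns `True` would be forwarded to email or a ticketing system. The rest are kept as a count on the open incident.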
Ultimately, the overarching moral of the story is this: Whenever a hand touches the keyboard, there is a risk of something going wrong. Of course, sometimes this is unavoidable, but teams that automate and simplify their IT operations processes as much as possible give themselves the best chance of avoiding costly downtime, so they can enjoy their Thanksgiving without interruption.
Mohan Kompella is vice president of product marketing at BigPanda.