Network outages have become a dreaded reality, disrupting businesses, personal lives, and communication channels. No network is immune, and the recent Australian telecom outage is a stark reminder of the impact such disruptions can have. The outage lasted for several hours and caused nationwide disruptions to Australian businesses, essential services, and daily life.
This example highlights the complex nature of modern telecommunications networks and the potential for disruptions to occur. Even with the most sophisticated infrastructure and robust redundancy measures in place, unforeseen events such as software glitches, hardware failures, or natural disasters can bring networks down.
Network disruptions can happen to the best of us. So, here’s a look into what causes such outages and how to safeguard your network from them.
Understanding the root cause of the Australian telecom outage
The root cause of the outage was a complex interplay of technical issues, primarily centered around a software upgrade and the excessive routing information it introduced.
- Excessive routing information destabilizes the Border Gateway Protocol (BGP)
The outage stemmed from changes made during a routine software upgrade. These changes inadvertently disconnected a core router, which introduced an excessive amount of routing information into the telecom network. That excess routing information caused BGP to become unstable.
- Overwhelmed routers and safety thresholds
The routing issue placed an immense load on key routers within the telecom provider’s network. These routers, tasked with processing and managing the vast amounts of routing data, became overwhelmed and exceeded preset safety thresholds. These thresholds define acceptable limits for the amount of routing data that can be processed by the network’s routers.
- Router’s default configuration and protective mechanisms
In response to the exceeded safety thresholds, around 90 affected provider edge (PE) routers activated a vendor default protective mechanism, disconnecting themselves from the telecom provider’s IP core network. This self-isolation mechanism effectively severed the routers’ ability to participate in routing data, causing a disruption in network connectivity.
- Cascading failure impacts the entire network infrastructure
The disconnection of these critical routers, particularly those responsible for core network routing, triggered a cascading failure, causing widespread disruption across the entire telecom infrastructure.
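The difference between a router that isolates itself and one that only logs a warning can be illustrated with a toy model. The sketch below is a simplification for illustration, not the vendor's actual logic: the `PeRouter` class, the `warning_only` flag, and all prefix counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeRouter:
    """Toy model of a provider edge router with a prefix safety threshold."""
    name: str
    max_prefixes: int            # hypothetical safety threshold
    warning_only: bool = False   # True: log a warning instead of disconnecting
    connected: bool = True

    def receive_routes(self, prefix_count: int) -> str:
        """Apply the threshold policy to an incoming routing table size."""
        if prefix_count <= self.max_prefixes:
            return "ok"
        if self.warning_only:
            return "warning"      # threshold exceeded, router stays in service
        self.connected = False    # default in this model: self-isolate
        return "isolated"


def simulate_flood(routers: list[PeRouter], prefix_count: int) -> dict[str, str]:
    """Send the same routing load to every router and collect the outcomes."""
    return {router.name: router.receive_routes(prefix_count) for router in routers}


if __name__ == "__main__":
    # 90 routers echoes the figure in the text; the prefix numbers are made up.
    fleet = [PeRouter(f"pe-{i}", max_prefixes=1_000_000) for i in range(90)]
    fleet.append(PeRouter("pe-warn", max_prefixes=1_000_000, warning_only=True))
    outcome = simulate_flood(fleet, prefix_count=1_500_000)
    isolated = sum(1 for state in outcome.values() if state == "isolated")
    print(f"{isolated} routers isolated; pe-warn is {outcome['pe-warn']}")
```

In this model, every router left on the default policy drops out at the same moment, which is the cascading pattern described above. The router with the warning-only policy stays connected.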
What prolongs network downtime?
Restoring service after a massive network outage can be complex and time consuming. Key factors that can exacerbate situations like the Australian telecom outage and prolong the restoration process include:
- Lack of robustness: Networks without sufficient safeguards cannot stop a large influx of routing information from overloading their routers, as happened in the IP routing issue above.
- Inadequate monitoring: Without effective network monitoring systems to detect the issue promptly, network admins can face delays in identifying the root cause and initiating corrective actions.
- Manual restoration: Without configuration management tools, the restoration process might involve manually reconfiguring the affected routers, which is time consuming and labor intensive.
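To make the monitoring point concrete, here is a minimal sketch of one way to flag a sudden jump in the number of routes a router receives. It assumes you already collect periodic prefix counts from your devices. The function name, window size, and multiplier are illustrative choices rather than recommended values.

```python
from statistics import median


def find_route_spikes(samples: list[int], window: int = 5, factor: float = 1.5) -> list[int]:
    """Return indexes of samples that exceed `factor` times the recent median.

    `samples` holds periodic prefix counts for one router, oldest first.
    The median of the previous `window` samples serves as the baseline,
    so a single earlier outlier does not shift it much.
    """
    alerts = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    # Made-up counts: steady around 1,000 prefixes, then a sudden flood.
    counts = [1000, 1010, 990, 1005, 1000, 1002, 5000, 5100]
    print(find_route_spikes(counts))
```

A check like this only detects the symptom. It still needs to feed an alerting workflow so that admins can act before routers hit their safety thresholds.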
7 best practices to safeguard your network from outage mishaps
While network outages are an unfortunate reality, there are steps that individuals and organizations can take to minimize their impact. Here are seven key considerations:
- Implement a robust network monitoring system: A comprehensive network monitoring system provides centralized visibility and control over your network infrastructure. It enables you to monitor network performance, identify potential issues, and take corrective actions promptly.
- Establish clear configuration management procedures: This includes version control, change management, and documentation. Proper configuration management helps prevent unauthorized changes and ensures that configurations are consistent across the network.
Be mindful of the default vendor configurations of your routers, and account for them before updates are deployed in your network infrastructure. For instance, to avoid router self-isolation, network admins can create compliance rules in ManageEngine Network Configuration Manager to ensure that the maximum prefix configuration (i.e., safety threshold) logs only a warning message and does not completely isolate the router.
- Traffic engineering and capacity planning: Employ traffic engineering techniques to manage network traffic effectively and ensure routers can handle peak loads and unexpected spikes in data traffic. This involves analyzing traffic patterns, identifying potential bottlenecks, and implementing congestion control mechanisms. Conduct capacity planning exercises to ensure network infrastructure can support anticipated growth and traffic demands.
- Implement a comprehensive backup and recovery plan: A backup and recovery plan ensures that you can quickly restore your network to a working state in the event of an outage or disaster. The plan should include regular backups of critical data, procedures for restoring network configurations and automation, and a process for testing your recovery procedures.
- BGP configuration and troubleshooting: Implement rigorous configuration management practices for BGP, ensuring proper route redistribution, loop prevention, and community filtering. Maintain up-to-date knowledge of BGP vulnerabilities and implement appropriate mitigation measures to protect against routing attacks.
- Redundant network infrastructure: Design and implement redundant network infrastructure, including multiple core routers, to provide resilience against failures and allow for quicker recovery in the event of outages. This includes redundancy at the device level, link level, and path level to ensure continuous connectivity in the face of hardware or network disruptions. Network admins should also enable diverse communication carrier options for network management and communication.
- Conduct regular network assessments and vulnerability scans: Regularly scheduled network assessments and vulnerability scans can help identify weaknesses and vulnerabilities in your network infrastructure that could be exploited by attackers or lead to accidental outages. These assessments should cover both physical and logical security aspects of your network.
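The maximum prefix compliance rule mentioned under configuration management can be approximated with a small script. This is a rough sketch that assumes Cisco IOS-style `neighbor ... maximum-prefix` lines; other vendors use different syntax. It is not how ManageEngine Network Configuration Manager implements its rules, and the function name and sample configuration are hypothetical.

```python
import re

# Matches lines such as: neighbor 10.0.0.1 maximum-prefix 1000 warning-only
# (IOS-style syntax is an assumption; adapt the pattern to your platform.)
MAX_PREFIX_RE = re.compile(r"^\s*neighbor\s+(\S+)\s+maximum-prefix\s+(\d+)(.*)$")


def find_isolating_neighbors(config_text: str) -> list[str]:
    """List neighbors whose maximum-prefix limit is not set to warning-only."""
    violations = []
    for line in config_text.splitlines():
        match = MAX_PREFIX_RE.match(line)
        if match and "warning-only" not in match.group(3).split():
            violations.append(match.group(1))
    return violations


if __name__ == "__main__":
    # Invented sample configuration using documentation address ranges.
    sample = """
    router bgp 64500
     neighbor 192.0.2.1 remote-as 64501
     neighbor 192.0.2.1 maximum-prefix 1000
     neighbor 192.0.2.2 maximum-prefix 1000 80 warning-only
    """
    print(find_isolating_neighbors(sample))
```

Whether warning-only is the right policy depends on your network. A limit that tears down the session protects the router's resources, so weigh that protection against the risk of self-isolation before changing it.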
Even best-in-class networks can fall victim to routing and configuration issues, and the Australian telecom outage stands as a sobering example. The vulnerabilities within modern network infrastructures make it imperative for businesses to fortify their networks against mishaps. A comprehensive network monitoring system, clear configuration management procedures, traffic engineering, and capacity planning are paramount.
One powerful solution to enhance network resilience and mitigate risks is ManageEngine OpManager Plus. Ensure uninterrupted connectivity and swift recovery from unexpected challenges. Get in touch with our product experts for a quick capability walk-through today.