Brief Summary of the Rogers Outage RFI
Late last night the CRTC posted Rogers' reply to the request for information (RFI) it sent last week. In the RFI, the CRTC asked Rogers to respond to 55 questions about the multi-day Canada-wide service outage that began early in the morning of July 8th. Beyond the 55 questions, the Commission clearly stated to Rogers that "In light of the immense public interest in understanding what happened, Commission staff expects Rogers to disclose information on the public record to the maximum extent possible." However, reading the RFI response, it's clear Rogers didn't understand the assignment. The level of redaction in the document is mind-boggling - there are pages and pages of nothing but redacted text - hardly disclosure to the maximum extent possible. As one example, when it came to the question of the impact to 911 services, Rogers redacted the call volumes for 911 calls - hardly something any reasonable person would consider a trade secret.
From the non-redacted portions of the document there are three key findings we can take away from the limited data on the public record so far: Rogers had a massive internal routing failure, it failed to have a proper out-of-band management network, and we now know why 911 calls failed. There is also the curious statement that at 6 AM on the day of the outage the CTO informed Bell and Telus of a potential cyber attack.
This tells us that for at least the first 90 minutes of the outage Rogers didn't know what was going on. Had they known during this window that the problem was a route filter, they would not have warned Bell and Telus of a cyber attack - it clearly wasn't an attack, but rather a misconfiguration of core equipment by Rogers staff.
The Root Cause - Removal of a Routing Filter
The root cause of the outage appears to be the removal of a routing filter in the Rogers core network, as a result of which "Certain network routing equipment became flooded". This flood in turn meant downstream network elements "exceeded their capacity levels and were then unable to route traffic, causing the common core network to stop processing traffic". However, when it comes to the details of how this routing filter was removed and how the subsequent network failure unfolded, it is all redacted from the CRTC filing.
For those who are not technical, route filtering is a very common practice in networks. The global routing table is a list of every reachable IP prefix on the planet and the "route" used to reach it over the internet. As you can imagine, this table is very large and always changing. Because of its size, many routers do not have enough memory to hold the full global BGP table. A simple and common work-around is to perform input filtering, limiting the local route database to a subset of the global table.
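To make the idea concrete, here is a minimal Python sketch of input filtering. The prefixes, example routes, and the accept_route helper are purely illustrative assumptions, not anything from the filing: a router only installs advertised routes that fall inside its allow-list, so the local table stays a small subset of the global one.

```python
# Minimal sketch of BGP-style input filtering: only routes matching an
# allow-list of prefixes are installed, keeping the local table small.
# All prefixes here are illustrative, not Rogers' actual configuration.
from ipaddress import ip_network

# Hypothetical allow-list: the only prefixes this downstream router needs.
ALLOWED_PREFIXES = [ip_network("10.0.0.0/8"), ip_network("192.0.2.0/24")]

def accept_route(prefix: str) -> bool:
    """Return True if an advertised prefix falls inside an allowed block."""
    candidate = ip_network(prefix)
    return any(candidate.subnet_of(allowed) for allowed in ALLOWED_PREFIXES)

# Routes "advertised" to this router; only the matching ones are installed.
advertised = ["10.1.0.0/16", "198.51.100.0/24", "192.0.2.0/25"]
local_table = [p for p in advertised if accept_route(p)]
print(local_table)  # ['10.1.0.0/16', '192.0.2.0/25']
```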
From what is on the record, it's clear Rogers was using internal route filtering to limit which internet routes were sent from the core to downstream routers - a common practice in the industry to prevent memory exhaustion and other issues on downstream devices that don't need a full global BGP table. However, what they didn't do on the downstream devices was limit which routes to accept. RFC 7454, BGP Operations and Security, clearly covers this case for leaf networks such as Rogers: "From peers, it is RECOMMENDED to have a limit lower than the number of routes in the Internet. This will shut down the BGP peering if the peer suddenly advertises the full table. Network administrators can also configure different limits for each peer, according to the number of routes they are supposed to advertise, plus some headroom to permit growth." Had Rogers properly implemented route filtering and prefix limits on the downstream routers, those devices would simply have shut down their BGP sessions instead of exceeding their capacity levels and failing to pass traffic. While this still would have caused an outage, having devices online but without BGP would have been far faster to recover from.
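The maximum-prefix safeguard RFC 7454 recommends is easy to illustrate. The following Python sketch uses made-up limits and route counts (not Rogers' actual values) to show the intended behaviour: once a peer advertises more prefixes than expected, the session is torn down rather than letting the device exhaust its capacity.

```python
# Toy model of a BGP session guarded by a maximum-prefix limit, per RFC 7454.
# The limit and route counts are illustrative, not Rogers' configuration.
class BgpPeer:
    def __init__(self, max_prefixes: int):
        self.max_prefixes = max_prefixes   # expected routes plus headroom
        self.routes_received = 0
        self.session_up = True

    def receive_update(self, new_prefixes: int) -> None:
        """Accept routes from the peer, shutting the session if the limit is hit."""
        if not self.session_up:
            return
        self.routes_received += new_prefixes
        if self.routes_received > self.max_prefixes:
            # Drop the peering rather than exhaust the device's capacity.
            self.session_up = False
            print("maximum-prefix exceeded: BGP session shut down")

peer = BgpPeer(max_prefixes=50_000)   # illustrative limit for this peer
peer.receive_update(40_000)           # normal advertisement, accepted
peer.receive_update(900_000)          # a leaked full table trips the limit
```

The point is the failure mode: a device that shuts down a BGP session is still reachable and recoverable, while a device driven past its capacity limits is not.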
No True "Out of band" Management Network
When outages happen (and they do in any network), an "out-of-band" (OOB) network is used to connect to the management interfaces of the impacted devices and restore service. In most carrier-grade networks I've worked on, this OOB network is 100% separate from the production network and often provided by a third-party carrier, so that access is still possible in the event of a network failure like the one Rogers experienced. However, it appears Rogers did not have this type of OOB network in place - in the RFI response Rogers states, "At the early stage of the outage, many Rogers’ network employees were impacted and could not connect to our IT and network systems. This impeded initial triage and restoration efforts as teams needed to travel to centralized locations where management network access was established." If Rogers had a true OOB management network from a third-party carrier, staff would not have needed to travel to centralized locations to access it - they could have simply used the internet. If I were the CRTC or INDU I'd be asking Rogers some very pointed questions about how it handles OOB device management.
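To illustrate why that separation matters, here is a hypothetical Python sketch - the hostnames, port, and reachable helper are all assumptions, not Rogers' tooling - of a management-access check that falls back from an in-band address to an OOB address carried by a different provider. If the OOB path truly rides another carrier, it remains usable even when the production network is down.

```python
# Sketch of in-band vs. out-of-band management access. Hostnames and port
# are hypothetical; this is an illustration of the architecture, not a tool.
import socket

def reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a management (SSH) port answers within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

IN_BAND_MGMT = "core1.mgmt.example.net"  # rides the production network
OOB_MGMT = "core1.oob.example.net"       # rides a separate third-party carrier

if reachable(IN_BAND_MGMT):
    print("in-band management reachable")
elif reachable(OOB_MGMT):
    print("falling back to out-of-band management")
else:
    print("no remote path: staff must travel to the device")
```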
911 Call Failures
On a positive note, Rogers did confirm why 911 calls failed during the outage, which reaffirmed what I suspected: because wireless devices could still see the Rogers network, they did not attempt to fail over to another carrier's network when a customer dialled 911.
Fixing this failure scenario will not be easy and will require work with handset manufacturers to change how a handset handles 911 calls when the primary network is "visible" but not functional. A sketch of the behaviour follows below.
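As an illustration of the behaviour described above (a simplified Python sketch, not actual handset firmware or any standardized algorithm), the decision a handset makes today looks roughly like this: it only falls back to another carrier for 911 when its home network is not visible at all, so a network that is visible but unable to complete calls never triggers failover.

```python
# Simplified sketch of handset 911 behaviour during the outage.
# Not real firmware logic; names and structure are illustrative only.
from dataclasses import dataclass

@dataclass
class Network:
    name: str
    visible: bool         # radio signal can be seen by the handset
    can_place_call: bool  # the core network actually completes calls

def place_911_call(home: Network, roaming_candidates: list[Network]) -> str:
    if home.visible:
        # The handset assumes a visible home network is usable and never fails over.
        outcome = "" if home.can_place_call else " -- call fails, no failover attempted"
        return f"call attempted on {home.name}{outcome}"
    for net in roaming_candidates:
        if net.visible:
            return f"emergency call placed on {net.name}"
    return "no network available"

rogers = Network("Rogers", visible=True, can_place_call=False)
bell = Network("Bell", visible=True, can_place_call=True)
print(place_911_call(rogers, [bell]))  # call attempted on Rogers -- call fails...
```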
Conclusions
So what can we conclude from all of this? Not much, based on the data presented to the public so far. We know it was a BGP failure, but we don't really know why it took so long to restore, nor why there was no downstream route filtering in place. We also don't know why there wasn't a better OOB management network in place, or what (if any) steps Rogers is taking to correct this in the future. And we need to better understand why it took so long to identify the issue - all of the information on the timeline of events is redacted from public view. Hopefully the CRTC reviews the redacted portions and puts more data on the public record - until then we are left with more questions than answers.