Last week wasn't great for the Canadian Public Switched Telephone Network (PSTN). Two of Canada's largest competitive carriers suffered outages - Iristel was the victim of a DDoS attack that lasted over 24 hours, between November 15th and 16th. At the same time, Thinktel experienced issues related to fiber cuts caused by the extreme weather events and flooding in BC. Other carriers were impacted as well - Bell, Rogers, and Telus were all affected by the events in BC, and Telus continues to have outages there, impacting both Telus and the LECs that rely upon them.

The impact of these two outages wasn't just on customers of Iristel and Thinktel - because of the way the PSTN is designed, downstream carriers such as Microsoft, Bandwidth.com, Telnyx, and others were also impacted as those providers rely upon the underlying services of Thinktel and Iristel to deliver service.  

While invisible to most end users, calls between two numbers on the PSTN generally don't go from A->B, they often go from A->B->C or A->B->C->D, as shown below:

As a quick aside, you may notice the terms "LEC" and "TSP" used above - without getting into all the details of how the PSTN functions, you can think of a LEC as a carrier with direct access to the PSTN and the underlying legacy SS7 network, whereas someone who is just a TSP doesn't have their own connection to the legacy PSTN network and instead relies upon a LEC to provide that connection for them.

So why was Microsoft impacted by an Iristel outage?

In the case of Microsoft, they are not a LEC - they are a TSP who relies on carriers like Iristel and Thinktel to provide the LEC services for them. So when the LEC has an issue, like they did last week, calls can't complete even though Microsoft themselves were not having an issue.

In the internet world, we solve this with protocols like BGP, which make dynamic, real-time changes to packet flows to route around failures. But that doesn't exist in the PSTN world - because phone numbers belong to local exchange carriers (LECs), calls to a number must always traverse the LEC associated with that number, regardless of the TSP that is actually servicing the number.

RFC 2916 (E.164 numbers and DNS) - Kinda the solution

Over 20 years ago, RFC 2916 was released, which defined the use of the Domain Name System (DNS) for storage of phone numbers and the available services associated with each number, through a system that is now commonly known as ENUM. When a phone number is queried against an ENUM database, the resulting lookup returns a set of records referred to as Naming Authority Pointer (NAPTR) records. For example, if you wanted to publish a record for 416-867-5309 it would look something like this:

$ORIGIN 9.0.3.5.7.6.8.6.1.4.1.e164.arpa.

NAPTR 100 10 "u" "sip+E2U" "!^.*$!sip:\\[email protected]!"
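As a rough sketch of the mechanics behind a record like this, the two steps are turning the phone number into a DNS name and then applying the NAPTR regular-expression field to the dialed number. The Python below is illustrative only: the function names, the `example.com` host, and the sample regexp are assumptions rather than values taken from the record above, and Python's `re` syntax stands in for the POSIX extended regular expressions that ENUM specifies.

```python
import re
from typing import Optional


def e164_to_enum_domain(number: str, suffix: str = "e164.arpa") -> str:
    """Reverse the digits of an E.164 number and append the ENUM suffix."""
    digits = [ch for ch in number if ch.isdigit()]
    if not digits:
        raise ValueError("no digits found in number")
    return ".".join(reversed(digits)) + "." + suffix + "."


def apply_naptr_regexp(regexp_field: str, dialed: str) -> Optional[str]:
    """Apply a NAPTR regexp field: delimiter, pattern, replacement, flags.

    Simplification: escaped delimiters inside the pattern are not handled.
    """
    delim = regexp_field[0]
    parts = regexp_field[1:].split(delim)
    if len(parts) != 3:
        raise ValueError("expected pattern, replacement, and flags")
    pattern, replacement, flags = parts
    match = re.search(pattern, dialed, re.IGNORECASE if "i" in flags else 0)
    if match is None:
        return None
    # expand() substitutes backreferences (for example, group 1) in the replacement.
    return match.expand(replacement)


# The +1 country code is assumed for the example number from the text.
print(e164_to_enum_domain("+1 416-867-5309"))
# Hypothetical regexp that maps the national number onto a SIP host.
print(apply_naptr_regexp(r"!^\+1(.*)$!sip:\1@example.com!", "+14168675309"))
```

A real resolver would also sort the returned NAPTR records by their order and preference values before applying any of them; that step is left out here.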

Once ENUM is widely deployed, the idea is that you should be able to dial any phone number, and have your SIP infrastructure look it up using public DNS.  If a SIP URI exists for that phone number, the call could be routed directly via IP.  If no URI exists, the call would have to be routed over the normal PSTN.  Taking the example call flow from above, if ENUM was deployed a call flow might now look like this:
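The lookup-then-fallback decision described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular SIP platform does it: the `ENUM_RECORDS` dictionary is a hypothetical stand-in for a public DNS query, and the `choose_route` name and the `example.com` URI are assumptions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for public ENUM: maps E.164 numbers to SIP URIs.
ENUM_RECORDS: Dict[str, str] = {
    "+14168675309": "sip:4168675309@example.com",
}


def lookup_sip_uri(number: str, records: Dict[str, str]) -> Optional[str]:
    """Return the SIP URI published for a number, or None if there is none."""
    return records.get(number)


def choose_route(number: str, records: Dict[str, str]) -> Tuple[str, str]:
    """Route directly over IP when a SIP URI exists, otherwise use the PSTN."""
    uri = lookup_sip_uri(number, records)
    if uri is not None:
        return ("ip", uri)
    return ("pstn", number)


print(choose_route("+14168675309", ENUM_RECORDS))  # number with a published URI
print(choose_route("+16045550100", ENUM_RECORDS))  # number with no record
```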

On the surface, this seems perfect - the calls now go directly from A->D, bypassing the intermediary LECs and increasing the resiliency of the network.  But while ENUM sounds like a perfect solution it comes with its own host of problems.

First - The ITU-T has traditionally had control over the use of telephone numbers, and this control is delegated to national administrators. North America and the Caribbean (17 countries) belong to country code 1, so before we could even begin setting up and delegating the root zone for North America, it would require agreement among the 17 countries about how that root DNS zone for +1 would be managed.

Second - Service Providers don't want to trust public DNS and they don't want the information about their SIP endpoints publicly published.

Third - Pure ENUM using Public DNS leads to "Spam over Internet Telephony" or SPIT, which is unsolicited, automatically dialed telephone calls using VoIP.  If the public SIP address is in public DNS it makes it trivial for spammers to bypass the traditional PSTN and call end-users directly which is highly undesirable.

A Hybrid Approach Is Key

The key is to create a national "Next Generation Network Registry" (NGN Registry) built using technologies like ENUM, but as a closed rather than open system. This registry would be a centralized system, like the one we have today for number portability, but rather than being published in public ENUM DNS, it would be open only to participating carriers. An NGN Registry for telephony routing ensures that routing information is current and in sync, that appropriate business processes have been followed, and that records exist to show the history of changes. If we take our existing call flow diagram, we add a new element for the NGN Registry and get this:

In this example, data from the NGN Registry is synchronized with a local ENUM server at the originating TSP and then used for lookups before placing calls. If no record is found, the call takes the traditional A->B->C->D path, but if a record is found, then direct A->D routing can still be achieved. There may be other flows - for example, if the terminating TSP wasn't participating in the registry but the terminating LEC was, then you would see a call flow like this:

In this case, because the originating TSP and the terminating LEC are participants in the NGN Registry, the call will route directly between them and then take traditional routing to get to the terminating TSP.
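The three flows discussed so far (fully direct, direct to the terminating LEC, and fully traditional) can be captured in a small path-selection sketch. Everything named here is hypothetical: the `RegistryEntry` fields, the `plan_path` function, and the mapping of A, B, C, and D to originating TSP, originating LEC, terminating LEC, and terminating TSP are assumptions made for illustration, and the local dictionary stands in for the synchronized local ENUM server.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical registry record for one telephone number."""
    sip_uri: str
    published_by: str  # "TSP" or "LEC": who registered the endpoint


def plan_path(number: str, local_copy: Dict[str, RegistryEntry]) -> List[str]:
    """Pick the hops for a call using a locally synchronized registry copy."""
    entry = local_copy.get(number)
    if entry is None:
        # No registry record: traditional A->B->C->D routing.
        return ["originating TSP", "originating LEC",
                "terminating LEC", "terminating TSP"]
    if entry.published_by == "TSP":
        # The terminating TSP participates: direct A->D routing.
        return ["originating TSP", "terminating TSP"]
    # Only the terminating LEC participates: go direct to the LEC,
    # which then delivers the call to the terminating TSP as it does today.
    return ["originating TSP", "terminating LEC", "terminating TSP"]


local_copy = {
    "+14168675309": RegistryEntry("sip:4168675309@tsp.example.com", "TSP"),
    "+16045550100": RegistryEntry("sip:6045550100@lec.example.com", "LEC"),
}
for number in ("+14168675309", "+16045550100", "+19025550123"):
    print(number, " -> ".join(plan_path(number, local_copy)))
```

Because a missing record simply falls through to the traditional path, a carrier that never joins the registry sees no change in behavior.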

The benefits of this model are:

  • Increased network resiliency - calls will take the most efficient path to get from A->B;
  • Lower regulatory and business costs;
  • Increased industry agility; and
  • Putting all players in the telephony ecosystem on equal footing, be they ILEC, CLEC, TSP, WSP, or other.

Conclusion

The benefits of a solution such as an NGN Registry are clear - it isn't a "rip and replace" solution but rather an overlay on top of the existing legacy processes. When you introduce an NGN Registry the current system keeps working - if carrier A doesn't want to participate, calls go over traditional routing as they do today. If both the originating and terminating TSPs want to participate, then calls will go direct. Adoption can be market driven - carriers can choose to participate to lower costs and increase reliability, and consumers and businesses can push for it to lessen the impact of outages. It also resolves many of the issues found with SHAKEN/STIR and "number ownership", since number ownership could now be driven by the NGN Registry and not by legacy systems that don't really have a way for TSPs to participate on an equal footing, but that's a whole blog post on its own.

While the proposed NGN Registry solution may not be the final solution, we need to start a conversation about how we modernize the Canadian PSTN, remove its dependencies on legacy technologies, business models, and processes, and bring it in line with the 21st century.