Nue Routing: fast, 100% fault-tolerant, 100% applicable, 100% deadlock-free

By blog

The OFA just released a new Open Subnet Manager version (v3.3.21) for InfiniBand, including many interesting features:

  • Support for HDR link speed and 2x link width
  • New routing algorithm: Nue routing
  • Support for ignoring of throttled links for Nue [1,2] and (DF)SSSP [3,4] routing
  • …and many more internal enhancements to OpenSM.

Nue Routing

Deadlock-freedom in general, but also the limited amount of virtual channels provided in modern interconnects, has been a long-standing problem for network researchers and engineers.
Nue routing is not just yet another new algorithm for statically routed high-performance interconnects, but a revolutionary step with respect to deadlock-freedom and fault-tolerance.

Our goal was to combine advantages of existing routing algorithms, primarily the flexibility of Up/Down routing and outstanding global path balancing of SSSP routing [5], while guaranteeing deadlock-freedom regardless of number of virtual channels/lanes or network type or size.
The incarnation of this effort, called Nue routing, derived from the legendary Japanese chimera, is the first algorithm capable of delivering high throughput, low latency, fast path calculation, and 100% guaranteed deadlock-freedom for any type of topology and network size.
All of this is enabled by the fundamental switch from calculating the routing within a graph representing the network to a new graph representation: the complete channel dependency graph.

Without going into detail about the inner workings, which can be found in our HPDC’16 publication [1] and Jens’ dissertation [2; Chapter 6], we will highlight Nue’s capabilities with the next two figures.

The figure below compares many existing routing algorithms of the OpenSM (we excluded MinHop and DOR, since these are only deadlock-free under certain constraints) to our Nue routing for a variety of network topologies, hosting roughly between 1000 and 2000 compute nodes each. We have been using a cycle-accurate InfiniBand simulator to obtain these results. Each bar represents the simulated communication throughput for a MPI_Alltoall operation (2KB payload per node) executed on all compute nodes of the topology, and hence a pretty accurate estimate of the capabilities of the network and how well the routing is able to utilize the available resources. For many subgraphs only a subset of OpenSM’s routing engines are shown alongside Nue, because we filtered instances where the routing engine was not able to create valid routing tables. Above each bar we list the amount of virtual channels this routing will consume to achieve a deadlock-free routing configuration.

Furthermore, the achievable network throughput under the given traffic pattern is shown for Nue routing with different numbers of virtual channels, ranging from 1 (equivalent to the absense of VCs) to 8.

nue-perf

In summary, the figure shows that Nue routing is competitive to the best performing routing for each individual topology, and offers between 84% for the 10-ary 3-tree and 121% throughput for the Cascade network in comparison. Occasionally, depending on the given number of virtual channels, Nue is able to outperform the best competitor. While our original design goals never included the ambition to beat each and every other routing on its home turf, we are glad to see that we can outperform most of them given a sufficient number of channels. However, this figure also demonstrates the high flexibility w.r.t the given number of channels.

Take for example the Kautz network (left; middle row), were Nue can create a decent deadlock-free routing configuration without virtual channels, while DFSSSP needs 8 VCs and LASH needs at least 5 VCs, but Nue is also able to outperform both with just 5 VCs.

The next figure demonstrates Nue’s fault-tolerance as well as the relatively fast path calculation in comparison to other topology-agnostic routing engines (DFSSSP/LASH) and the topology-aware Torus2QOS engine. For this test we used regular 3D tori networks of different sizes and randomly injected 1% switch-to-switch link failures into each topology.

The runtime for calculating all n-to-n paths in the network was measured for each routing engine and plotted, but only in cases where the engine was capable of producing a valid routing within the realistic 8VC constraint.

nue-runtime

Thanks to its O(n2 * log n) runtime complexity and efficient implementation, Nue is starting to outperform DFSSSP and LASH with respect to runtime already for relatively small tori.
But more importantly, Nue can always create deadlock-free routing tables, while all other engines (even the semi-fault-tolerant and topology-aware Torus2QOS) eventually fail for larger networks.

Overall the advantages of Nue routing are manifold:

  • Allowing “fire-and-forget” approach for network administration, i.e., works 100% regardless of network failures which is ideal for fail-in-place networks
  • Low runtime and memory complexity (O(n2 * log n) and O(n2), respectively)
  • Guaranteed deadlock-freedom and highly configurable in terms of VC usage
  • VCs not necessary for deadlock-freedom, which extends possible application to NoC and other interconnects which don’t support virtual channels
  • Completely topology-agnostic and yet very good path balancing under the given deadlock-freedom constraint
  • Support for QoS and deadlock-freedom simultaneously (both realized in InfiniBand via VCs)
  • Theoretically applicable to other (HPC) interconnects: RoCEv2, NoC, OPA, …

and everyone can now test and use Nue routing with the opensm v3.3.21 release by either choosing it via command line option:

--routing_engine nue   [and optionally: --nue_max_num_vls <given #VCs>]

or via OpenSM configuration file:

routing_engine nue
nue_max_num_vls <given #VCs>

The default nue_max_num_vls for Nue is assumed to be equal to 1 to enforce deadlock-freedom even if QoS is not enabled.

For less advantageous admins ☺, or systems with specifically optimized routing, we still recommend to always use Nue as fallback (in case the primary routing fails) via:

routing_engine <primary>,nue

to ensure maximum fault-tolerance and uninterrupted operation of the system until the hardware failures are fixed (which is definitely better than the default fallback behavior to the deadlock-prone MinHop by OpenSM).

A more detailed description of OpenSM’s options for Nue is provided in the documentation and for more fine-grained control over the virtual channel configuration we recommend to read our previous blog post for the DFSSSP routing engine.
(Note: it is HIGHLY advised to install/use the METIS library with OpenSM (enforced via --enable-metis configure flag when building OpenSM) for improved path balancing in Nue.)

Avoiding throttled links

Our second new feature, we were able to push upstream, is designed to ease the job of system admins in case of temporary or long-term link degradation.

More often than one would wish, one or multiple links in large-scale InfiniBand installations get throttled from their intended speed (eg. 100Gbps EDR) to much lower speeds, like 8Gbps SDR. While this IB feature is designed to keep the fabric and connectivity up, we argue that such a throttled link will be a major bottleneck to all application and storage traffic, and hence should be avoided. Usually, HPC networks, especially fat-trees, have enough path-redundancy, such that moving all paths from the affected link(s) and distributing them to other links should have less performance degradation effects than keeping the link in low speed.

However, identifying, disabling, and ultimately replacing “bad” cables takes time.

So, we added a check to the SSSP, DFSSSP, and Nue routing engines to identify such degraded links, which prevents these routings from placing any path onto the links, essentially instantly “disabling” the link and issuing a warning in the logs for the system admin.
This feature can be turned on or off in the configuration file of the subnet manager by switching the avoid_throttled_links parameter to TRUE or FALSE, respectively.

Nue and DFSSSP were developed in collaboration between the main developer Jens Domke at the Matsuoka Laboratory, Tokio Institute of Technology, and Torsten Hoefler of the Scalable Parallel Computing Lab at ETH Zurich.
We would like to acknowledge Hal Rosenstock, the maintainer of OpenSM, who is always supportive of new ideas, and we greatly appreciated his comments and help during the integration of Nue into the official OpenSM.

[1]: J. Domke, T. Hoefler and S. Matsuoka: Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing
[2]: J. Domke: Routing on the Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks (Dissertation)
[3]: J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies
[4]: Our prev. DFSSSP blog post: DFSSSP: Fast (high-bandwidth) Deadlock-Free Routing for InfiniBand Networks
[5]: T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks