Title: Low jitter guaranteed-rate communications for cluster computing systems
Authors: Ted H. Szymanski, Dave Gilbert
Addresses: Department of Electrical & Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada; Department of Computing and Software, McMaster University, Hamilton, ON L8S 4K1, Canada
Abstract: Low-latency, high-bandwidth networks are key components of large-scale computing systems. Existing systems use dynamic algorithms to route and schedule cell transmissions through switches. Because of stringent time constraints, these dynamic algorithms perform suboptimally, which limits throughput to well below peak capacity. It is shown that Guaranteed-Rate communications can be supported over switch-based networks with 100% throughput and very low delay jitter, provided that each switch has the capacity to buffer a small number of cells per flow. An algorithm is used to reserve guaranteed bandwidth and buffer space in the switches, resulting in the specification of a doubly stochastic traffic rate matrix for each switch. Each switch schedules the Guaranteed-Rate traffic for transmission according to a resource reservation algorithm based on Recursive Fair Stochastic Matrix Decomposition. Very low delay jitter can be achieved for all simultaneous flows while sustaining 100% throughput in each switch. When receive buffers of bounded depth are used to filter residual network jitter at the destinations, end-to-end traffic flows can be delivered with essentially zero delay jitter. The algorithm is suitable for the switch-based networks, such as Fat Trees, found in commercial supercomputing systems, and for silicon Networks-on-a-Chip.
Keywords: low latency networks; high bandwidth networks; switching; scheduling; low jitter; guaranteed rate communications; cluster computing; stochastic matrix decomposition; quality of service; QoS; guaranteed bandwidth; buffer space; supercomputing; network-on-a-chip.
DOI: 10.1504/IJCNDS.2008.020258
International Journal of Communication Networks and Distributed Systems, 2008 Vol.1 No.2, pp.140 - 160
Published online: 10 Sep 2008
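
Illustrative sketch: The abstract describes reserving bandwidth so that each switch is given a doubly stochastic traffic rate matrix, which is then decomposed into a transmission schedule. The Python sketch below shows the general idea only. It is not the Recursive Fair Stochastic Matrix Decomposition named in the abstract; it substitutes a plain Birkhoff-von Neumann style decomposition using bipartite matching, and it makes no attempt to reproduce the low-jitter property. It assumes the rates have been scaled to integers, so that every row and column of the matrix sums to a frame length in time slots. The names `perfect_matching` and `decompose` are hypothetical and chosen for this example.

```python
from typing import List

Matrix = List[List[int]]


def perfect_matching(m: Matrix) -> List[int]:
    """Return match[row] = col, using only entries with m[row][col] > 0.

    Uses simple augmenting-path bipartite matching (Kuhn's algorithm).
    """
    n = len(m)
    col_owner = [-1] * n  # col_owner[col] = row currently matched to col

    def try_assign(row: int, seen: List[bool]) -> bool:
        for col in range(n):
            if m[row][col] > 0 and not seen[col]:
                seen[col] = True
                # Take a free column, or re-route the row that holds it.
                if col_owner[col] == -1 or try_assign(col_owner[col], seen):
                    col_owner[col] = row
                    return True
        return False

    for row in range(n):
        if not try_assign(row, [False] * n):
            raise ValueError("no perfect matching: matrix is not admissible")

    match = [0] * n
    for col, row in enumerate(col_owner):
        match[row] = col
    return match


def decompose(rates: Matrix) -> List[List[int]]:
    """Split an integer rate matrix into one switch configuration per slot.

    Assumption: rates is square and non-negative, and every row and column
    sums to the same frame length F. The result is a list of F permutations;
    schedule[t][i] is the output port served by input port i in slot t.
    """
    n = len(rates)
    if any(len(row) != n for row in rates):
        raise ValueError("rate matrix must be square")
    if any(x < 0 for row in rates for x in row):
        raise ValueError("rates must be non-negative")
    frame = sum(rates[0])
    row_ok = all(sum(row) == frame for row in rates)
    col_ok = all(sum(rates[i][j] for i in range(n)) == frame for j in range(n))
    if not (row_ok and col_ok):
        raise ValueError("every row and column must sum to the frame length")

    remaining = [row[:] for row in rates]  # work on a copy
    schedule = []
    for _ in range(frame):
        match = perfect_matching(remaining)
        for row, col in enumerate(match):
            remaining[row][col] -= 1  # one cell of this flow is now scheduled
        schedule.append(match)
    return schedule


if __name__ == "__main__":
    # Hypothetical 3x3 switch with a frame of 4 time slots.
    example = [[2, 1, 1],
               [1, 2, 1],
               [1, 1, 2]]
    for slot, config in enumerate(decompose(example)):
        print(slot, config)
```

Each permutation is one conflict-free crossbar configuration, and flow (i, j) is served exactly `rates[i][j]` times per frame. This greedy ordering can cluster a flow's slots together, which is the kind of jitter the fair decomposition described in the abstract is meant to avoid.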