Adventures in HPC: RDMA and Erlang

I recently attended the SC13 conference where one of my goals was to learn about InfiniBand. I attended a full day tutorial session on the subject, which did a good job of introducing most of the concepts but didn’t delve as deeply as I had hoped. That’s not really the fault of the class; InfiniBand, and the larger subject of remote direct memory access (RDMA), is incredibly complex. I wanted to learn more.

Now, I’ve been an Erlang enthusiast for a few years, and I’ve always wondered why it doesn’t have a larger following in the HPC community. I’ll grant you that Erlang doesn’t have the best reputation for performance, but in terms of concurrency, distribution, and fault tolerance, it is unmatched. And areas where performance is critical can be offloaded to other languages or, better yet, to GPGPUs and MICs with OpenCL.

But there are areas where Erlang lags its competition. Distributed message passing is one: inter-node traffic still runs over TCP/IP. So, in an effort to learn more about RDMA and in hopes of making Erlang a little more attractive to the HPC community, I set out to write an RDMA distribution driver for Erlang.
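For context, Erlang’s distribution layer is pluggable: start a node with the `-proto_dist` flag and the runtime looks for a module named `<Proto>_dist` that exports a small set of callbacks. This is the same mechanism behind the built-in TCP carrier (`inet_tcp_dist`) and the TLS carrier. The skeleton below is only a sketch of that interface; the module name `rdma_dist` and the stub bodies are my illustrative assumptions, not the actual driver source.

```erlang
%% Skeleton of an alternative distribution carrier, modeled on the callbacks
%% that inet_tcp_dist exports. The module name and the stub bodies are
%% illustrative assumptions, not the actual driver source.
-module(rdma_dist).
-export([listen/1, accept/1, accept_connection/5,
         setup/5, close/1, select/1]).

%% Decide whether this carrier can reach the given node name.
select(_Node) ->
    true.

%% Open a listening endpoint for incoming distribution connections.
listen(_Name) ->
    {error, not_implemented}.

%% Spawn and return a process that accepts incoming connections.
accept(_Listen) ->
    erlang:error(not_implemented).

%% Finish the distribution handshake for an accepted connection.
accept_connection(_AcceptPid, _Socket, _MyNode, _Allowed, _SetupTime) ->
    erlang:error(not_implemented).

%% Establish an outgoing connection to the named node.
setup(_Node, _Type, _MyNode, _LongOrShortNames, _SetupTime) ->
    erlang:error(not_implemented).

%% Tear down a connection.
close(_Socket) ->
    ok.
```

In the real driver these callbacks are backed by the RDMA port driver described below; the skeleton only shows the shape OTP expects from a carrier module.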

RDMA is a surprisingly tough nut to crack for its maturity. Documentation is scarce. Examples are even more so. Compared to TCP/IP, there is a lot more micro-management: you have to set up the connection; you have to decide how to allocate memory, queues, and buffers; you have to control how to send and receive; and you have to do your own flow control, among other complications. But for all that, you get the ability to move data between systems without involving the kernel, which promises significant performance gains over TCP/IP.

In addition, it would almost seem like RDMA was made for Erlang. RDMA is highly asynchronous and event-driven, which is a nearly perfect match for Erlang’s asynchronous message passing model. Once I got my head around some Erlang port driver idiosyncrasies, things sort of fell into place, and here is the result:

RDMA Ping Pong

pong. I’ve never been happier to see such a silly word.

Of course, the driver works for more than just pinging. It works for all distributed Erlang messages. In theory, you can drop it into any Erlang application and it should just work.
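In practice, “dropping it in” amounts to starting each node with something like `erl -proto_dist rdma -pa <path to the driver's compiled modules>`. The `-proto_dist` flag is the stock OTP mechanism for selecting an alternative distribution carrier (it is also how TLS distribution is enabled), so no application code has to change; the exact flag value here assumes the carrier module follows the usual `<Proto>_dist` naming convention.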

The question is: how well does it work? Is it any better than the default TCP/IP distribution driver? For that, I devised a simple benchmark.

RDMA Benchmark Diagram

For each node in a given set, the program spawns a hundred processes that sit in a tight loop performing RPCs. The number of completed RPCs is counted and can be compared across network implementations; a sketch of the idea follows.
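Here is a minimal sketch of that loop, written from the description above. The module name, the use of `rpc:call/4` as the round trip, and the stop protocol are my assumptions, not necessarily what the original program does.

```erlang
%% Sketch of the benchmark described above. The module name, the use of
%% rpc:call/4 as the round trip, and the stop protocol are assumptions.
-module(rpc_bench).
-export([run/2]).

%% For each node in Nodes, spawn 100 workers that hammer that node with
%% RPCs, let them run for Seconds seconds, then sum the per-worker counts.
run(Nodes, Seconds) ->
    Parent = self(),
    Workers = [spawn_link(fun() -> worker(Parent, Node, 0) end)
               || Node <- Nodes, _ <- lists:seq(1, 100)],
    timer:sleep(Seconds * 1000),
    lists:foreach(fun(W) -> W ! stop end, Workers),
    lists:sum([receive {done, W, Count} -> Count end || W <- Workers]).

%% Tight loop: one RPC round trip per iteration, counting completed calls
%% until told to stop.
worker(Parent, Node, Count) ->
    receive
        stop ->
            Parent ! {done, self(), Count}
    after 0 ->
        Node = rpc:call(Node, erlang, node, []),
        worker(Parent, Node, Count + 1)
    end.
```

Running `run/2` against the same set of nodes once over the TCP carrier and once over the RDMA carrier yields totals that can be compared directly.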

The program was tested on four nodes of a cluster, each with:

  • 2 x Intel Xeon X5560 Quad Core @ 2.80 GHz
  • 48 GB memory
  • Mellanox ConnectX QDR PCI Gen2 Channel Adapter
  • Red Hat Enterprise Linux 5.9 64-bit
  • Erlang/OTP R16B03
  • Elixir 0.12.0
  • OFED 1.5.4

The results are summarized as follows:

RDMA Benchmark

The RDMA implementation offers around a 50% increase in messaging performance over the default TCP/IP driver in this test. I believe this is primarily explained by the reduction in context switching. Where the TCP implementation has to issue a system call for every send and receive operation, requiring a context switch to the kernel, the RDMA implementation only calls into the kernel to be notified of incoming packets. And if packets are coming in fast enough, as they are in this test, then the driver can process many packets per context switch. The RDMA driver stays completely in user-space for send operations.

You may be wondering why the TCP driver performed about the same over the Ethernet and InfiniBand interfaces. These RPC operations involve very small messages, on the order of tens of bytes being passed back and forth, so this test really highlights the overhead of the network stacks, which is what I intended. I would imagine increasing the message size would make the InfiniBand interfaces take off, but I’ll leave that for a future test. Indeed, there are many more benchmarks I should perform.

Also, for now I’m avoiding the obvious comparison between Erlang and MPI. MPI libraries tend to have very mature, sophisticated RDMA implementations that I know I can’t compete against yet. I’d rather focus on improving the driver. I’ve started a to-do list. Feel free to pitch in and send me some pull requests on GitHub!

One last thing: thank you to The Geek in the Corner for your basic RDMA examples, and thank you to the Erlang/OTP community and Ericsson for your awesome documentation. As for my goal of learning about InfiniBand, I’d say goal accomplished.