[knem-devel] The unexpected performance of DMA memcopy with knem

Brice Goglin Brice.Goglin at inria.fr
Mon Jul 8 18:37:12 CEST 2013

On 08/07/2013 17:36, Zheng Da wrote:
> Hello,
> I just started to use knem and am interested in DMA memory copy with
> knem. To test its performance, I copied some code from
> tools/knem_status_test.c and it copies 64MB each time.
> I measure its performance on a NUMA machine with 4 Xeon E5-4620
> processors and I use knem v1.0.0. The best performance I got is about
> 5GB/s for both local memory copy and remote memory copy. On the same
> machine, the rate of memory copy (with CPU involved) is 9.5GB/s and
> 3.7GB/s for local copy and remote copy respectively. The result is
> very different from what I expected: I thought DMA memory copy would
> be faster than CPU memory copy in both cases (local and remote), and
> that DMA memory copy would also perform differently for local and
> remote copies. Does this result sound reasonable?


Such results depend heavily on the architecture. I have never tested on
an E5-46xx, but I am not very surprised.

> In my Linux box, each DMA device has only one channel for memory copy.

I see below in your mail that you have 16 devices. The E5-4620 seems to
have 8 cores, so it looks like you have twice as many cores as devices?
Interesting, I thought Intel was trying to have one device per core.
That seems true on 2-socket systems, but apparently not on 4-socket
ones. Can you send the output of lstopo (from hwloc) so we can see how
these channels are connected to the sockets?

> Is it possible to enable 4 channels for memory copy?

KNEM doesn't support this natively. All available channels are currently
distributed among the processors, and a single channel is used by each
KNEM context (the channel assigned to the processor that the process is
running on when it opens /dev/knem).

A process could open multiple KNEM contexts in order to use multiple
channels. However, making sure that you get different hardware channels
may be hard.

I have never seen anybody combine multiple channels, so I don't know
whether we can expect large benefits from it.

> Another problem I found is that the per-CPU channel table doesn't
> always contain the DMA channels of the local processor. For example,
> on my machine (16 DMA devices with 1 channel on each device), half of
> the channels in processor 0's per-CPU channel table are on a remote
> processor. Performance seems much worse with a channel on a remote
> processor. I don't know whether this is caused by some wrong
> configuration on my machine. Has anyone encountered a similar
> problem?

This is caused by the way the Linux kernel distributes the channels
among the processors. KNEM just sits on top of that and cannot do much
about it with the current stack. Indeed, I have seen bad performance
when doing a local copy through a channel on a remote processor.

The way to fix this is to stop using the net_dma channels and directly
acquire physical channels from KNEM. That will require a large rework of
the KNEM driver code, but it would enable other optimizations such as
getting interrupts on copy completion. Unfortunately, I don't expect to
have much time for this in the near future. If I were sure that I/OAT
DMA will be useful in the future, I would do my best to find some time
to implement all this. However, it's not clear whether Intel is trying
to make I/OAT DMA useful again, or whether it may again become almost
useless performance-wise, as it did a couple of years ago on some
generations of Xeons.

