[knem-devel] The unexpected performance of DMA memcopy with knem
zhengda1936 at gmail.com
Mon Jul 8 20:57:57 CEST 2013
Thank you for your reply. It makes a few things much clearer now.
>> In my Linux box, each DMA device has only one channel for memory copy.
> I see below in your mail that you have 16 devices. E5-4620 seems to have
> 8 cores. So it looks like you have twice more cores than devices?
> Interesting, I thought Intel was trying to have one device per core.
> That looks true on 2-sockets system. Not on 4-socket then. Can you send
> the output of lstopo (from hwloc) to see how these channels are
> connected to sockets?
lstopo doesn't show the DMA channels, but cat
/sys/class/dma/*/device/numa_node shows that all devices are in NUMA
nodes 0 and 1 (each processor is a NUMA node).
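
To make the sysfs check above easier to repeat, here is a minimal Python sketch that groups the DMA channels by the NUMA node of their parent device. It only reads the same /sys/class/dma/*/device/numa_node files; the function name and the sysfs_root parameter are my own, chosen for illustration, and a channel whose numa_node file is missing or unreadable is simply skipped.

```python
import glob
import os
from collections import defaultdict


def dma_channels_by_node(sysfs_root="/sys/class/dma"):
    """Group DMA channel names (e.g. dma0chan0) by NUMA node.

    sysfs_root is a parameter only so the function can be pointed at a
    different directory; on a real system the default should be used.
    """
    by_node = defaultdict(list)
    for chan_dir in sorted(glob.glob(os.path.join(sysfs_root, "*"))):
        node_file = os.path.join(chan_dir, "device", "numa_node")
        try:
            with open(node_file) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            # No numa_node attribute for this entry: skip it.
            continue
        by_node[node].append(os.path.basename(chan_dir))
    return dict(by_node)


if __name__ == "__main__":
    for node, chans in sorted(dma_channels_by_node().items()):
        print(f"NUMA node {node}: {len(chans)} channel(s): {' '.join(chans)}")
```

A node value of -1 means the kernel does not know the locality of that device.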
>> Is it possible to enable 4 channels for memory copy?
> KNEM doesn't support this natively. All available channels are currently
> distributed among all processors. A single channel is used by each KNEM
> context (the one attributed to the processor that runs the process when
> it opens /dev/knem).
> A process could open multiple KNEM contexts in order to use multiple
> channels. However, making sure that you get different hardware channels
> may be hard.
> I have never seen anybody combine multiple channels, I don't know if we
> can expect large benefits from this.
I tried copying with one channel and with two channels, and two
channels (both from the local processor) seem to deliver slightly higher
throughput.
The reason I ask is that the paper "Designing Efficient
Asynchronous Memory Operations Using Hardware Copy Engine: A Case
Study with I/OAT" compares performance with one channel and with four
channels, and four channels perform much better. But I'm not sure
whether they mean four channels in one device or in four devices.
Unfortunately, I couldn't reach the authors.
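
For reference, one way a single large copy could be divided among several channels is to give each channel one contiguous, page-aligned chunk. The Python sketch below only computes the (offset, size) pairs; it is not how KNEM or the paper does it, the function name and the 4096-byte alignment are assumptions of mine, and submitting each chunk to a separate KNEM context is not shown.

```python
def split_copy(length, nchannels, align=4096):
    """Split a copy of `length` bytes into at most `nchannels` chunks.

    Each chunk except possibly the last has a size that is a multiple of
    `align`. Returns a list of (offset, size) tuples covering
    [0, length) without gaps or overlap.
    """
    if length < 0 or nchannels < 1 or align < 1:
        raise ValueError("length must be >= 0, nchannels and align >= 1")
    # Ceiling of length / nchannels, then rounded up to the alignment.
    per_channel = -(-length // nchannels)
    per_channel = -(-per_channel // align) * align
    chunks = []
    offset = 0
    while offset < length:
        size = min(per_channel, length - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


if __name__ == "__main__":
    for offset, size in split_copy(1 << 20, 4):
        print(f"offset={offset} size={size}")
```

With small buffers the rounding means fewer than nchannels chunks come out, which seems reasonable since tiny DMA copies are unlikely to pay off anyway.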
>> Another problem I found is that per-CPU channel table doesn't always
>> contain the DMA channels in the local processor. For example, in my
>> machine (16 DMA devices and 1 channel on each device), a half of
>> channels in the per-CPU channel table in processor 0 are in a remote
>> processor. It seems the performance is much worse with a channel in
>> the remote processor. I don't know if it is caused by some wrong
>> configurations on my machine. Does anyone encounter a similar problem?
> This is caused by the way the Linux kernel distributes the channels
> among the processors. KNEM just sits on top of it and cannot do much
> about it with the current stack. I've seen bad performance when doing a
> local copy with a remote processor, indeed.
> The way to fix this is to stop using the netdma channels and directly
> acquire physical channels from KNEM. That will require a large rework of
How can physical channels be acquired? Currently, KNEM uses
dma_find_channel() to get a DMA channel. Does it return a netdma
channel?
> the KNEM driver code, but it will enable other optimizations such as
> getting interrupts on copy completion. Unfortunately, I don't expect
> having much time to do this in the near future. If I were sure that I/O
> AT DMA will be useful in the future, I would do my best to find some
> time to implement all this. However, it's not clear if Intel is trying
> to make I/O AT DMA useful again, or if it may become almost useless
> performance-wise as it did a couple years ago in some generations of Xeons.
The IOAT DMA engine is meant to offload memory copies from the CPUs, so
its benefit is reduced CPU utilization. Do you mean that you expect the
DMA engine to deliver better memory copy throughput than the CPU? Is
IOAT DMA a dead technology?