[knem-devel] The unexpected performance of DMA memcopy with knem

Brice Goglin Brice.Goglin at inria.fr
Mon Jul 8 22:00:43 CEST 2013

Le 08/07/2013 20:57, Zheng Da a écrit :
> lstopo doesn't show DMA channels

(I assume you don't have PCI devices at all then, DMA channels are
supposed to show up as soon as lstopo has PCI support)

> but cat /sys/class/dma/*/device/numa_node shows all devices are in
> NUMA node 0 and 1 (each processor is a NUMA node). 

Interesting. I wonder why Intel didn't put any device in the other
sockets. Or it could be the BIOS failing to initialize these devices
properly (I've seen some Dell boxes with no IOAT device at all until the
BIOS was upgraded).
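In case it helps, the sysfs check above can be scripted. This is only a
sketch, not something KNEM ships: it assumes the usual dmaengine sysfs
layout (channels under /sys/class/dma once a driver like ioatdma is
loaded), and the SYSFS_DMA override is my own addition to make it easy
to try against a mock directory.

```shell
#!/bin/sh
# Sketch: enumerate dmaengine channels via sysfs and print the NUMA node
# of each channel's parent device. SYSFS_DMA can be overridden for
# testing; on a real machine the path is /sys/class/dma.
list_dma_numa_nodes() {
    sysfs_dma=${SYSFS_DMA:-/sys/class/dma}
    for chan in "$sysfs_dma"/*/; do
        node_file="${chan}device/numa_node"
        # Channels only show up when a dmaengine driver is loaded.
        [ -r "$node_file" ] || continue
        printf '%s -> NUMA node %s\n' \
            "$(basename "$chan")" "$(cat "$node_file")"
    done
}

list_dma_numa_nodes
```

Channels sharing a numa_node value belong to devices on the same socket,
which is exactly the all-on-node-0-and-1 pattern you are seeing.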

> I tried to copy with one channel and two channels, and it seems two
> channels (both from the local processor) can deliver slightly higher
> throughput.
> The reason I ask this is that the paper "Designing Efficient
> Asynchronous Memory Operations Using Hardware Copy Engine: A Case
> Study with I/OAT" compares performance with one channel and four
> channels and four channels can perform much better. But I'm not sure
> if they mean 4 channels in one device or in four devices.

That paper used very old hardware, basically first-generation I/OAT. I
am pretty sure there was a single device with multiple channels back
then. These devices have evolved quite a lot since, so it's hard to know
whether the same results would hold today.

> How to acquire physical channels? Currently, KNEM use
> dma_find_channel() to get a DMA channel. Does it return a netdma
> channel?

Sorry, I confused it with the even older DMA API in Linux. netdma is
what KNEM used before 2.6.29. Now we acquire channels from
dma_find_channel(), which returns the DMA channel associated with the
current CPU. But we could also look at all available channels and pick a
more appropriate one, for instance one that is actually close to our
NUMA node.

> The IOAT DMA engine is to offload memory copy from CPUs, so its
> benefit is to reduce CPU utilization. Do you mean you expect the DMA
> engine to deliver better memory copy throughput than CPU? Is IOAT DMA
> a dead technique?

Ideally people would indeed look at overlap. Unfortunately many people
still only look at raw pingpong performance when deciding what to enable
and/or what to work on. So they don't use IOAT DMA much, so people
developing MPI libs don't work on improving/testing that case much, and
I don't get many DMA users in KNEM. Chicken-and-egg problem. The other
problem is deciding when to start using DMA in your MPI lib. How do you
know whether the application really wants overlap and therefore should
use DMA even if its performance may be lower than a CPU copy? Hard
question.

IOAT is not a dead technique. It was promising when introduced, but
almost killed in Nehalem because Intel dramatically increased CPU copy
performance without improving the DMA engine much. Things are better
balanced on today's processors. We'll see.

