[knem-devel] The unexpected performance of DMA memcopy with knem

Zheng Da zhengda1936 at gmail.com
Mon Jul 8 22:29:00 CEST 2013


Hello,

On Mon, Jul 8, 2013 at 4:00 PM, Brice Goglin <Brice.Goglin at inria.fr> wrote:
> Le 08/07/2013 20:57, Zheng Da a écrit :
>> lstopo doesn't show DMA channels
>
> (I assume you don't have any PCI devices, then; DMA channels are
> supposed to show up as soon as lstopo has PCI support.)
There are PCI devices. Here is the output of lstopo on my machine.

zhengda at ubuntu:~$ lstopo
Machine (512GB)
  NUMANode L#0 (P#0 128GB)
    Socket L#0 + L3 L#0 (16MB)
      L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#4)
      L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#8)
      L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#12)
      L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#16)
      L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#20)
      L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#24)
      L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#28)
    HostBridge L#0
      PCIBridge
        PCI 14e4:165f
          Net L#0 "eth2"
        PCI 14e4:165f
          Net L#1 "eth3"
      PCIBridge
        PCI 14e4:165f
          Net L#2 "eth0"
        PCI 14e4:165f
          Net L#3 "eth1"
      PCIBridge
        PCI 1000:005b
          Block L#4 "sda"
      PCIBridge
        PCI 1000:0087
      PCIBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 102b:0534
      PCI 8086:1d02
        Block L#5 "sr0"
  NUMANode L#1 (P#1 128GB)
    Socket L#1 + L3 L#1 (16MB)
      L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#1)
      L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#5)
      L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#9)
      L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#13)
      L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#17)
      L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#21)
      L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#25)
      L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#29)
    HostBridge L#9
      PCIBridge
        PCI 1000:0087
          Block L#6 "sdb"
          Block L#7 "sdc"
          Block L#8 "sdd"
          Block L#9 "sde"
          Block L#10 "sdf"
          Block L#11 "sdg"
          Block L#12 "sdh"
          Block L#13 "sdi"
      PCIBridge
        PCI 1000:0087
          Block L#14 "sdj"
          Block L#15 "sdk"
          Block L#16 "sdl"
          Block L#17 "sdm"
          Block L#18 "sdn"
          Block L#19 "sdo"
          Block L#20 "sdp"
          Block L#21 "sdq"
  NUMANode L#2 (P#2 128GB) + Socket L#2 + L3 L#2 (16MB)
    L2 L#16 (256KB) + L1 L#16 (32KB) + Core L#16 + PU L#16 (P#2)
    L2 L#17 (256KB) + L1 L#17 (32KB) + Core L#17 + PU L#17 (P#6)
    L2 L#18 (256KB) + L1 L#18 (32KB) + Core L#18 + PU L#18 (P#10)
    L2 L#19 (256KB) + L1 L#19 (32KB) + Core L#19 + PU L#19 (P#14)
    L2 L#20 (256KB) + L1 L#20 (32KB) + Core L#20 + PU L#20 (P#18)
    L2 L#21 (256KB) + L1 L#21 (32KB) + Core L#21 + PU L#21 (P#22)
    L2 L#22 (256KB) + L1 L#22 (32KB) + Core L#22 + PU L#22 (P#26)
    L2 L#23 (256KB) + L1 L#23 (32KB) + Core L#23 + PU L#23 (P#30)
  NUMANode L#3 (P#3 128GB) + Socket L#3 + L3 L#3 (16MB)
    L2 L#24 (256KB) + L1 L#24 (32KB) + Core L#24 + PU L#24 (P#3)
    L2 L#25 (256KB) + L1 L#25 (32KB) + Core L#25 + PU L#25 (P#7)
    L2 L#26 (256KB) + L1 L#26 (32KB) + Core L#26 + PU L#26 (P#11)
    L2 L#27 (256KB) + L1 L#27 (32KB) + Core L#27 + PU L#27 (P#15)
    L2 L#28 (256KB) + L1 L#28 (32KB) + Core L#28 + PU L#28 (P#19)
    L2 L#29 (256KB) + L1 L#29 (32KB) + Core L#29 + PU L#29 (P#23)
    L2 L#30 (256KB) + L1 L#30 (32KB) + Core L#30 + PU L#30 (P#27)
    L2 L#31 (256KB) + L1 L#31 (32KB) + Core L#31 + PU L#31 (P#31)

>
>> but cat /sys/class/dma/*/device/numa_node shows all devices are in
>> NUMA node 0 and 1 (each processor is a NUMA node).
>
> Interesting. I wonder why Intel didn't put any devices in the other
> sockets. Or it could be the BIOS failing to initialize these devices
> properly (I've seen some Dell boxes with no IOAT device at all until
> the BIOS was fixed).
It's a Dell box. After I explicitly enabled IOAT, I see 16 DMA
devices. I don't know whether there is another setting that would
expose 32 DMA devices.
>
>> The IOAT DMA engine is to offload memory copy from CPUs, so its
>> benefit is to reduce CPU utilization. Do you mean you expect the DMA
>> engine to deliver better memory copy throughput than CPU? Is IOAT DMA
>> a dead technique?
>
> Ideally people would indeed look at overlap. Unfortunately many people
> still only look at raw pingpong performance when deciding what to enable
> and/or what to work on. So they don't use IOAT DMA much, so people
> developing MPI libs don't work on improving/testing that case much, and
> I don't get many DMA users in KNEM. Chicken-and-egg problem. The other
> problem is how to decide when to start using DMA in your MPI lib. How
> do you know whether the application really wants overlap and therefore
> should rather use DMA even if its performance may be lower than a CPU
> copy? Hard question.
>
> IOAT is not a dead technique. It was promising when introduced, but
> almost killed in Nehalem because Intel dramatically increased CPU
> performance without modifying DMA much. Things are better balanced on
> today's processors. We'll see.
Thanks for explaining this. My original motivation for using the DMA
engine was to reduce CPU utilization and hopefully improve the
performance of inter-processor memory copy. On my machine the DMA
engine does increase inter-processor copy throughput, so it seems to
work for me.
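
For reference, this is roughly how I ask knem for the offloaded copy on
the receiving side. It is only a minimal sketch based on the knem_io.h
ioctl interface as I understand it (written from memory, so please check
the installed header for the exact field names); the sending process is
assumed to have declared its buffer with KNEM_CMD_CREATE_REGION and
passed the resulting cookie to us over an existing channel.

/* Minimal sketch (from memory, check knem_io.h for exact names):
 * read 'len' bytes from a remote knem region into 'dst', asking the
 * kernel to offload the copy to the I/OAT DMA engine.
 * 'knem_fd' comes from open("/dev/knem", O_RDWR); 'cookie' was created
 * by the sending process with KNEM_CMD_CREATE_REGION. */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include "knem_io.h"

static int knem_dma_read(int knem_fd, knem_cookie_t cookie,
                         void *dst, size_t len)
{
    struct knem_cmd_param_iovec iov = {
        .base = (uintptr_t) dst,
        .len  = len,
    };
    struct knem_cmd_inline_copy icopy = {
        .local_iovec_array = (uintptr_t) &iov,
        .local_iovec_nr    = 1,
        .remote_cookie     = cookie,
        .remote_offset     = 0,
        .write             = 0,             /* remote region -> local iovec */
        .flags             = KNEM_FLAG_DMA, /* offload to the DMA engine */
    };

    if (ioctl(knem_fd, KNEM_CMD_INLINE_COPY, &icopy) < 0) {
        perror("KNEM_CMD_INLINE_COPY");
        return -1;
    }
    /* Synchronous completion status of the copy. */
    return icopy.current_status == KNEM_STATUS_SUCCESS ? 0 : -1;
}

Without KNEM_FLAG_DMA the same ioctl performs the copy with the CPU
inside the kernel, which is what I compared against; if I remember
right, KNEM_CMD_GET_INFO can be used first to check that
KNEM_FEATURE_DMA is actually reported.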

But I'm not sure whether lowering CPU utilization can actually improve
performance in a real application. I assume the DMA engine and the
CPUs share the same path to memory, so if the DMA engine saturates the
memory bandwidth, the CPUs cannot access memory efficiently either. If
that is the case, the benefit of the DMA engine may be limited in many
applications even though it lowers CPU utilization. One way to check
would be to measure CPU copy bandwidth while DMA copies are in flight,
as in the sketch below.
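
Something like the following is what I have in mind for the CPU side
(just a rough sketch using libnuma to place one buffer on each socket;
the node numbers, buffer size, and the missing thread pinning are all
things to adjust, e.g. by running it under numactl --cpunodebind=0).
Running it alone and then while knem DMA copies are in flight should
show whether the two really compete for the same memory bandwidth.

/* Rough sketch: measure CPU memcpy bandwidth between buffers placed on
 * two NUMA nodes.  Build with:  gcc -O2 bw.c -lnuma  */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SIZE   (64UL * 1024 * 1024)   /* 64 MB per buffer */
#define ITERS  20

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    /* Source on node 0, destination on node 1 (adjust as needed). */
    char *src = numa_alloc_onnode(SIZE, 0);
    char *dst = numa_alloc_onnode(SIZE, 1);
    memset(src, 1, SIZE);   /* touch the pages so they are really */
    memset(dst, 0, SIZE);   /* allocated on the requested nodes   */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("CPU copy bandwidth: %.2f GB/s\n",
           ITERS * (double) SIZE / sec / 1e9);

    numa_free(src, SIZE);
    numa_free(dst, SIZE);
    return 0;
}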

Thanks,
Da


