Problems with GM

We have a performance problem that depends on the number and size of blocks used in the RU, as seen, for example, in the performance vs. fragment size plots for different block sizes [eps] [gif]. (These plots were made on the Grand Champion machine, while the following tests were done on the Dell.)

The fact that for 2500 blocks of 32k the problem is only present for fragment sizes above 28k suggests that the critical factor is (related to) the number of pages actually used within the 32k blocks.

I have eliminated the fedkit and all custom hardware components and observed that the problem appears only for a large number of blocks. For small block sizes the problem sets in at a larger number of blocks: about 17000 blocks of 4k, compared to about 2000 blocks of 32k.

I then tried tests that eliminate GM by not sending the data. In the first test the memory was only allocated and the headers were set up, but the message was dropped instead of being sent. This showed an XDAQ overhead of about 3 microseconds, with a slight dependence on the number of blocks involved, but nothing like a sudden strong drop.

In a second test without GM I actually touched the data (memset) instead of sending it, and arrived at a bandwidth of about 230 MB/s, independent of the number of blocks.
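For illustration, here is a minimal standalone sketch of these two tests. This is not the actual XDAQ code; the block count, block size, fragment size and message count are example values only. Depending on a command-line flag it either only sets up a dummy header and drops the message, or additionally touches the full fragment with memset, and it reports the time per message and (for the memset variant) the resulting memory bandwidth.

/*
 * Minimal sketch of the two "no-GM" tests: cycle through NBLOCKS
 * blocks, set up a small dummy header in each, and either drop the
 * message (overhead test) or memset the payload (bandwidth test).
 * All sizes and counts are example values, not the real RU settings.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NBLOCKS   2500          /* number of blocks cycled through   */
#define BLOCKSIZE (32 * 1024)   /* block size in bytes               */
#define FRAGSIZE  (28 * 1024)   /* fragment (message) size in bytes  */
#define NMESSAGES 100000        /* messages per measurement          */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(int argc, char **argv)
{
    int touch_data = (argc > 1);          /* any argument: memset variant */
    char **block = malloc(NBLOCKS * sizeof(char *));
    int i;

    for (i = 0; i < NBLOCKS; i++)
        block[i] = malloc(BLOCKSIZE);

    double t0 = now_sec();
    for (i = 0; i < NMESSAGES; i++) {
        char *buf = block[i % NBLOCKS];   /* cycle through the blocks   */
        ((unsigned int *)buf)[0] = i;     /* set up a dummy header      */
        if (touch_data)
            memset(buf + sizeof(unsigned int), 0, FRAGSIZE);
        /* here the real application would hand the fragment to GM;
           in this test the message is simply dropped */
    }
    double dt = now_sec() - t0;

    printf("%.2f us/message\n", 1e6 * dt / NMESSAGES);
    if (touch_data)
        printf("%.1f MB/s touched\n", NMESSAGES * (double)FRAGSIZE / dt / 1e6);
    return 0;
}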

To test GM alone, I modified the GM test program gm_allsize to cycle through many blocks. (The program basically sends messages as fast as it can in one direction and repeats this with different message sizes but the same block size.) The original version used 8 blocks in LIFO order.

My modification consists of two changes: increasing the parameter SEND_BUFFERS, and changing the order in which the buffers are reused from LIFO to FIFO. I also had to limit the number of concurrent sends to 8, corresponding to the default case, to avoid a restriction in the program.
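The following sketch shows the structure of the modified send loop. The GM specifics are hidden behind the two hypothetical helpers start_send() and wait_for_completion(), which in the real program would map to gm_send_with_callback() and to handling the send-completion events from the GM receive loop; the buffers would be allocated with gm_dma_malloc() rather than malloc(). Counts and sizes are example values.

/*
 * Sketch of the modified gm_allsize send loop: many send buffers,
 * reused in FIFO order, with at most 8 sends outstanding at a time.
 */
#include <stdio.h>
#include <stdlib.h>

#define SEND_BUFFERS   2500        /* increased from the original 8   */
#define BLOCKSIZE      (32 * 1024)
#define MAX_CONCURRENT 8           /* keep the GM default of 8 sends  */
#define NMESSAGES      100000

static int sends_outstanding = 0;

/* hypothetical stand-in for gm_send_with_callback() */
static void start_send(char *buf, int len)
{
    (void)buf; (void)len;
    sends_outstanding++;
}

/* hypothetical stand-in for handling one send-completion event */
static void wait_for_completion(void)
{
    sends_outstanding--;           /* in reality: poll the GM event loop */
}

int main(void)
{
    char **buffer = malloc(SEND_BUFFERS * sizeof(char *));
    int i;

    /* in the real program these would be DMA-able GM buffers */
    for (i = 0; i < SEND_BUFFERS; i++)
        buffer[i] = malloc(BLOCKSIZE);

    for (i = 0; i < NMESSAGES; i++) {
        /* FIFO reuse: cycle through all buffers in order instead of
           taking the most recently completed one (LIFO) */
        char *buf = buffer[i % SEND_BUFFERS];

        /* never have more than MAX_CONCURRENT sends in flight */
        while (sends_outstanding >= MAX_CONCURRENT)
            wait_for_completion();

        start_send(buf, BLOCKSIZE);
    }

    while (sends_outstanding > 0)  /* drain the remaining sends */
        wait_for_completion();

    printf("done: %d messages over %d buffers\n", NMESSAGES, SEND_BUFFERS);
    for (i = 0; i < SEND_BUFFERS; i++)
        free(buffer[i]);
    free(buffer);
    return 0;
}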

I see a sudden performance drop at 17000 blocks of 4k [pdf], and a message-size-dependent behaviour [pdf], with hysteresis in some cases, for more than 2000 blocks of 32k, with jumps at multiples of 4k. Some selected data points are given in the table below.
blocks   drop at message size       corresponding 4k pages
2200     28k                        17600
2400     24k-28k (hysteresis)       16800-19200
2500     24k                        17500
2800     20k-24k (hysteresis)       16800-19600
3300     20k (slight hysteresis)    19800
3400     16k                        17000
The data seem to favor the hypothesis that the drop is related to the total memory (number of 4k pages) actually involved in the transfer, with the critical amount of memory somewhere above 64 MB. This value reminds me of the 64 MB PCI address space of the Lanai9 card. Can this be at the root of the problem?
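As a quick cross-check of this hypothesis, the page counts from the table above (lower end where a hysteresis range is given) can be converted to megabytes and compared with the 64 MB limit:

/* Convert the page counts from the table above into megabytes and
   compare with the 64 MB PCI address space of the Lanai9 card. */
#include <stdio.h>

int main(void)
{
    int pages[] = { 17600, 16800, 17500, 16800, 19800, 17000 };
    int i;

    for (i = 0; i < (int)(sizeof(pages) / sizeof(pages[0])); i++) {
        double mb = pages[i] * 4.0 * 1024 / (1024 * 1024);
        printf("%5d pages of 4k = %5.1f MB %s 64 MB\n",
               pages[i], mb, mb > 64.0 ? ">" : "<=");
    }
    return 0;
}

All rows come out between roughly 66 and 77 MB, i.e. just above 64 MB.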

The requirement that the RU has to buffer on the order of one second of data means that on the order of 200 MB of memory will be used, so this problem clearly needs a solution.

We cannot perform a measurement with more than 4000 blocks of 32k, because the size of the executable becomes larger than the available physical memory (512 MB).