The fact that for 2500 32k blocks the problem is only present for fragment sizes above 28k suggests that the critical factor is (related to) the number of 32k pages actually used.
I have eliminated the fedkit and all the custom hardware components, and noticed that the problem appears only for a large number of blocks. For smaller block sizes the problem appears at a larger number of blocks: about 17000 for 4k blocks, compared to about 2000 for 32k blocks.
I tried a test eliminating GM by not sending the data: in the first test the memory was only allocated and the headers set up, but the message was dropped instead of being sent. This showed an XDAQ overhead of about 3 microseconds, with a slight dependence on the number of blocks involved, but nothing like a sudden strong drop.
In a second test without GM I actually touched the data (memset) instead of sending it, and measured a bandwidth of about 230 MB/s, independent of the number of blocks.
To test GM alone, I modified the GM test program gm_allsize to cycle through many blocks. (The program basically sends messages as fast as it can in one direction and repeats this with different message sizes but the same block size.) The original version used 8 blocks in LIFO order.
My modification consists of two changes: increasing the parameter SEND_BUFFERS, and changing the order in which the buffers are reused from LIFO to FIFO. I also had to limit the number of concurrent sends to 8, corresponding to the default case, to avoid a restriction in the program.
I see a sudden performance drop at 17000 blocks of 4k [pdf], and a message-size-dependent behaviour [pdf], with hysteresis in some cases, for more than 2000 blocks of 32k, with jumps at multiples of 4k. Some selected data points are given in the table below.
blocks | drop at message size    | corresponding 4k pages
-------|-------------------------|-----------------------
2200   | 28k                     | 17600
2400   | 24k-28k hysteresis      | 16800-19200
2500   | 24k                     | 17500
2800   | 20k-24k hysteresis      | 16800-19600
3300   | 20k (slight hysteresis) | 19800
3400   | 16k                     | 17000
The requirement that the RU has to buffer on the order of one second of data means that on the order of 200 MB of memory will be used, so this problem clearly needs a solution.
We cannot perform a measurement with more than 4000 blocks of 32k, because the size of the executable becomes larger than the available physical memory (512 MB).