Experience with the DAQ Column

Author: Akos Csilling
Start date: 24.01.2003
Version 1.03, 17.02.2003
Added Xerces memory leak and pthreads error. Changed halwish status.
Version 1.04, 19.02.2003
Added general description, run control. Cosmetic changes.
Version 1.05, 19.02.2003
Changed halwish status.
Version 1.06, 24.02.2003
Added references to sourceforge support requests.

This is a summary of the most important experience obtained with the column, selected on the basis of interest for the group as a whole.

Working information about the column can be found on the web: http://csilling.home.cern.ch/csilling/column/

The working files can be found in /users/daqcolumn/V1_2/TriDAS with the specific code under daq/CRUDE/. (CVS root: :kserver:cmscvs.cern.ch:/cvs_server/repositories/DAQColumn )

General description

The DAQ Column integrates one instance of each DAQ component. Each component used is described below in detail.

We use the indirect mode, where messages are sent in a loop:
BU -> EVM: available resources;
EVM -> RU: request fragment;
RU -> BU: fragment data.

In this simplified scheme there is no need for an event ID.
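
As a minimal illustration, here is a compilable sketch of this loop. The function names are hypothetical stand-ins; the real messages are of course i2o frames exchanged between separate applications:

    #include <cstdio>

    static int fragmentsInRu = 3;      // pretend the RU holds 3 fragments

    void ruToBu()  { std::printf("RU -> BU: fragment data\n"); }
    void evmToRu() { std::printf("EVM -> RU: request fragment\n"); ruToBu(); }
    void buToEvm() { std::printf("BU -> EVM: available resources\n"); evmToRu(); }

    int main()
    {
      // No event ID is needed: each free resource triggers exactly one
      // request and one fragment, so the loop stays matched by itself.
      while (fragmentsInRu-- > 0) buToEvm();
      return 0;
    }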

Main results

Trigger control

We have a hardware trigger generator, distributed through a TTCvi card. This card is controlled by the 'Trigger' application, which can start and stop the HW trigger distribution. This application is not directly involved in the DAQ, but it is essential for controlling the column.

The specific hardware-related parts are in the TTCviCard class, reasonably well separated.

Trigger readout (FLT)

The trigger information is read by a TTCrx readout card, which is controlled by the 'FLT' application. This application reads the trigger words from the card as fast as it can and passes the information to the EVM in bundles, via i2o messages. If we cannot read fast enough, the HW fifo fills up and HW backpressure is generated.

When reading from the HW, we read the content of the HW fifo into a SW fifo, but never more than the max. bundling parameter at once.

When sending, we send what we have, up to the max. bundling parameter. The i2o message from the FLT to the EVM is defined in evbim/include/i2oimevmmsg.h. We send a variable number of triggers. (In our implementation no check is made on the bundling parameter to ensure that the message fits in a single block, as it should.)
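
A minimal sketch of this readout and bundling logic, assuming hypothetical hwFifoPop()/hwFifoEmpty() calls for the TTCrx access and a postI2oMessage() stand-in for the actual i2o send:

    #include <deque>
    #include <vector>

    static unsigned int hwWords = 40;                        // fake HW fifo
    bool hwFifoEmpty() { return hwWords == 0; }              // assumed HW access
    unsigned int hwFifoPop() { return --hwWords; }
    void postI2oMessage(const std::vector<unsigned int>& b) { (void)b; }

    std::deque<unsigned int> swFifo;           // SW fifo of trigger words
    const unsigned int maxBundle = 16;         // max. bundling parameter

    // Move at most maxBundle words from the HW fifo to the SW fifo.
    void readFromHw()
    {
      for (unsigned int i = 0; i < maxBundle && !hwFifoEmpty(); ++i)
        swFifo.push_back(hwFifoPop());
    }

    // Send what we have, up to maxBundle triggers per i2o message.
    void sendToEvm()
    {
      std::vector<unsigned int> bundle;
      while (!swFifo.empty() && bundle.size() < maxBundle) {
        bundle.push_back(swFifo.front());
        swFifo.pop_front();
      }
      if (!bundle.empty()) postI2oMessage(bundle);
    }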

On a single-processor machine the HW fifo is not long enough to hold all events during a timeslice used by another thread, which limits the performance; therefore a dual-CPU machine is needed.

A credit-based system could perhaps be implemented by only reading out as many events as we have credits; backpressure would then be generated automatically once the HW fifo is full. The depth of the HW fifo must be taken into account. In the column we do not need this.
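
A sketch of that credit-limited readout, reusing the hypothetical hwFifo/swFifo names from the sketch above:

    #include <deque>

    extern std::deque<unsigned int> swFifo;   // from the sketch above
    bool hwFifoEmpty();                       // assumed HW access
    unsigned int hwFifoPop();

    static unsigned int credits = 0;          // granted by the EVM (hypothetical)

    void readWithCredits()
    {
      // Backpressure follows automatically once the HW fifo (of known
      // depth) fills up, because we stop draining it.
      while (credits > 0 && !hwFifoEmpty()) {
        swFifo.push_back(hwFifoPop());
        --credits;
      }
    }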

A separate class, TTCrxCard, does the HW access. Perhaps some of the functionality of the FLT class should have been moved here, to make FLT independent of the HW.

Event manager

The routine matching triggers to resources has been rewritten for better configurability.

We introduced many exported parameters that change the strategy (e.g. alternate between two destinations to balance the load of two BUs; send partial bundles when no more triggers or requests are pending; block the thread if another thread is currently matching; recurse if last-minute requests arrive when exiting; etc.).

When using two BUs, the EVM had to be hacked to provide a balanced load on them (send events alternately to the two); otherwise all the resources of one BU arrived before those of the other, effectively using the BUs one at a time. In the future the strategy of the EVM for selecting one of the available resources (and BUs) must be defined.
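
A minimal sketch of such an alternating selection; the Resource type and the pool are illustrative stand-ins for the real EVM bookkeeping:

    #include <cstddef>
    #include <vector>

    struct Resource { int bu; };                    // owning BU (0 or 1)

    // Pick the next resource, preferring the BU not used last time.
    std::size_t pickResource(const std::vector<Resource>& pool, int& lastBu)
    {
      for (std::size_t i = 0; i < pool.size(); ++i)
        if (pool[i].bu != lastBu) { lastBu = pool[i].bu; return i; }
      if (!pool.empty()) lastBu = pool[0].bu;       // only one BU has resources left
      return 0;
    }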

The bundling has been isolated into a separate set of functions.

Fedkit interface (RUI)

We use the fedkit in the buffer-loaning scheme, now with the native driver. At the same time we access some column-specific registers via HAL.

The RuiDriver class manages the fedkit, with a separate class, GIIIFed, doing the direct access to our special hardware. It reads the fedkit and sends the data to the RU via i2o messages. It also builds the chain of blocks.

RuiDriver is the biggest class we have, mostly because of all the debugging we did in it. Part of its functionality is controlling the data generation in the GIII card; this could or should be moved to a separate class.

We had some confusion about which header is included in which size. This is not yet completely cleaned up in the code, but it is understood.

We spend a lot of CPU time setting up the i2o header. This could perhaps be optimised by doing a memcpy of a predefined header and then changing only what is different.
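
A sketch of that idea; the I2oHeader layout below is a simplified stand-in, not the real i2o header:

    #include <cstring>

    struct I2oHeader {               // simplified stand-in only
      unsigned int targetAddress;
      unsigned int messageSize;
      unsigned int blockNumber;
    };

    static const I2oHeader headerTemplate = { 0x42, 0, 0 };

    void fillHeader(I2oHeader* h, unsigned int size, unsigned int block)
    {
      std::memcpy(h, &headerTemplate, sizeof(I2oHeader));
      h->messageSize = size;         // patch only what differs per block
      h->blockNumber = block;
    }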

RU, BU

In the RU only parts of the header are updated when the same block is sent on to the BU.

When sending to two destinations, having two sending threads does not improve performance (with either TCP or GM).

The BU was basically not modified. It checks the trigger number and discards the data, optionally checking the content. (Not the slink header and trailer.)

Run control

We mainly used xdaqWin, with scripting for batch measurements, and edited the XML configuration files by hand.

We tried RCMS; it worked fine, but we did not spend enough time with it to compare ease of use.

RCMS tried to create a lockfile in the current working directory, which failed on our NFS filesystem, generating errors but with no other consequences.

Scripting in xdaqWin

The only major problem is that scripts cannot be stopped. Our workaround: check whether the trigger count is increasing and generate an error if not. This way stopping the trigger stops the script. (We read the trigger counter periodically anyway, so that we know whether we have enough statistics.)

Reading out performance parameters with ParameterGet worked best to get measurement results. This way they can be written to a file from within the script.

Memory management

I played a lot with the block-request scheme used to keep track of how many blocks the fedkit wants. Finally I replaced the fifo with a counter, to count how many blocks we should give to the fedkit when we have them available (below the threshold).

The way available(size) works is (I think) pointless. We always use overThreshold(size).
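
A sketch of the counter-based scheme; overThreshold() and giveBlockToFedkit() are assumed stand-ins for the real pool query and hand-over:

    static unsigned int blocksOwed = 0;   // blocks the fedkit still wants

    bool overThreshold();                 // assumed pool query
    void giveBlockToFedkit();             // assumed hand-over to the fedkit

    void onBlockRequested() { ++blocksOwed; }

    // Called whenever blocks are returned to the pool.
    void onBlocksAvailable()
    {
      while (blocksOwed > 0 && overThreshold()) {
        giveBlockToFedkit();
        --blocksOwed;
      }
    }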

Myrinet

When using more than one card, 'unit' must be defined in the XML configuration. This is not present in the examples. (There is no documentation, either.)

A bug in ptGM affecting running with more than one card was found, reported and corrected, but the fix is not yet checked in. Sourceforge support request 653224.

Alias is undocumented.

Switching between bigphys and nobigphys needs a different makefile for ptGM. This is not documented.

There is a performance problem when using more than 64 MB of memory. (This has been traced to GM and reported to Myricom.)

The situation with the 8 bytes internally used by GM is unclear. We never check the slink header and trailer, so we don't know whether they get corrupted or not. We also do not check for writing outside the block.

Organization

We moved code from .h to .cc and sometimes renamed .h to .hh so that it is recognised by editors as C++, not C.

We implemented several checks for multiple-thread interference, by incrementing/decrementing a counter at the beginning/end of functions. No surprises were found.
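
A minimal sketch of such a check (as it would have been written at the time, with a plain volatile counter rather than atomics):

    #include <cstdio>

    static volatile int inside = 0;     // >1 means two threads are inside

    void checkedFunction()
    {
      if (++inside != 1)
        std::printf("WARNING: concurrent entry detected\n");
      // ... the function body under suspicion ...
      --inside;
    }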

Reconfiguring

Reconfiguring is not always possible. Most things should be moved from the constructors to Configure, then undone in Halt. (But remember that Halt can be called without Configure.) Instead we restarted the executables whenever changing such a parameter (mainly the blocksize).
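
A minimal sketch of the suggested structure, with the blocksize as the reconfigurable parameter; note that Halt stays safe even without a preceding Configure:

    class Application {
      char* buffer;                          // 0 when not configured
    public:
      Application() : buffer(0) {}           // nothing allocated here
      void Configure(unsigned int blocksize)
      {
        Halt();                              // re-Configure is a fresh start
        buffer = new char[blocksize];
      }
      void Halt()
      {
        delete [] buffer;                    // safe even if buffer is 0
        buffer = 0;
      }
    };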

Multiple threads

Care must be taken to avoid race conditions when stopping multiple threads.

The best solution we found was to set a variable in 'Halt', which is checked in the polling loop, to exit the loop. When done, the loop sets another variable, which is in turn polled in 'Halt' to do the final cleanup. While this was not implemented consistently everywhere, I believe it can be done cleanly.

We tried to check the state of the application in the polling loop, but the state is changed only after the return from Enable, so the polling loop exited immediately. Now we have private variables for this.
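
A sketch of this two-flag handshake with such private variables; the polling body is elided:

    #include <sched.h>

    static volatile bool stopRequested = false;
    static volatile bool stopped = true;

    void pollingLoop()                  // runs in the worker thread
    {
      stopped = false;
      while (!stopRequested) {
        // ... poll the hardware; when there is nothing better to do:
        sched_yield();
      }
      stopped = true;                   // tell Halt we left the loop
    }

    void Halt()
    {
      stopRequested = true;
      while (!stopped) sched_yield();   // wait for the thread to exit
      // ... final cleanup goes here ...
    }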

PTHREAD_CREATE ERROR was observed when running long batch measurements. Each point requires a Halt, Configure, Enable sequence, which involves stopping a thread and starting a new one. Starting fails after about 1000 such cycles. Sourceforge support request 692243.

Polling

In every polling loop, do a sched_yield() when there is nothing better to do.

The possibility of running everything in one thread was not investigated. My private opinion is that since a P4 effectively contains two processors, using only one thread will reduce performance.

Destroying things

Set pointers to 0 after delete, delete[], close, etc. Check for 0 when trying to use them. (E.g. in parameterSet/Get, which may be called in any state.)
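
A minimal sketch of the pattern; the Holder class and its members are illustrative:

    class Holder {
      int* data;
    public:
      Holder() : data(0) {}
      void destroy() { delete data; data = 0; }       // null after delete
      void create()  { destroy(); data = new int(42); }
      int  get() const { return data ? *data : 0; }   // safe in any state
    };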

Pay attention to match new[] with delete[]. (Valgrind helps.) Sourceforge support request 653230.

The STL containers may call the default constructor, the copy constructor and the assignment operator, so these must be implemented for all non-trivial classes.
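
A sketch of such a non-trivial class, with the three members the containers may call spelled out:

    #include <cstring>

    class Buffer {
      char* data;
      unsigned int size;
    public:
      Buffer() : data(0), size(0) {}                       // default ctor
      Buffer(const Buffer& b)                              // copy ctor
        : data(b.size ? new char[b.size] : 0), size(b.size)
      { if (size) std::memcpy(data, b.data, size); }
      Buffer& operator=(const Buffer& b)                   // assignment
      {
        if (this != &b) {
          char* copy = b.size ? new char[b.size] : 0;
          if (b.size) std::memcpy(copy, b.data, b.size);
          delete [] data;                                  // matches new[]
          data = copy;
          size = b.size;
        }
        return *this;
      }
      ~Buffer() { delete [] data; }
    };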

User SOAP messages

We implemented a few at the beginning (like single trigger), but we never use them now.

Memory leak in Xerces

There seems to be a memory leak in Xerces, leading to a loss of 4 kB of memory per SOAP message under certain conditions. It may have been observed on VxWorks already. It was never investigated in detail.

Halwish

I wrote an extension to 'wish' called 'halwish', to use the HAL to control VME cards from Tcl/Tk scripts. There is an example program to control the TTCvi, which we sometimes use in parallel with the Trigger application. Halwish is used extensively in the muon trigger test software.

This is now checked in to the CVS repository under HAL. PCI support is not implemented; it can be added if needed.