Posts by Pierre Chatelier

    I have a performance problem with a Shading applet on the MA-VCX-QP when adding CoefficientBuffer operators (inspired by the Shading example in the VA install folder).
    I have taken care of the bandwidth and there should be no problem, but my design runs into a much lower limit in practice.


    [Context]
    The board is a Marathon MA-VCX-QP.

    The camera is a Mono8 5120x5120@80fps, but I only target 50 fps for this board.


    The targeted camera bandwidth is 1250 MB/s ≈ 1.22 GB/s.
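    (That is: 5120 × 5120 × 1 B × 50 fps = 1,310,720,000 B/s = 1250 MiB/s ≈ 1.22 GiB/s.)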
    After the CXPQuadCamera operator, the native 8b@32x stream is reduced to 8b@16x right after the input ImageBuffer (InfiniteSource).
    Then the VA design contains a shading algorithm.

    If I submit dummy constants to the shading algorithm instead of reading CoefficientBuffers, it works @50fps.

    Now I want to read the shading coefficients from a CoefficientBuffer.
    According to the "Shared memory" documentation of the MA-VCX-QP, the optimal access width is 256 bits, so I first configured the CoefficientBuffer to output 64b@4x, with the proper CastToParallel/ImageFifo/ParallelDn chain to transform the input file from 16-bit TIFF into 16b@16x shading information.

    In that case, the design won't run at more than 20 fps, which is far below what the MA-VCX-QP RAM bandwidth should sustain, even with shared memory.

    I tried several variants: boosting the CoefficientBuffer output to 64b@8x, or reducing the data to 8b@16x shading information. I tried adding extra ImageBuffers as FIFOs. I tried many things, but I always hit that ~20 fps limit.

    I will soon post screenshots of the different failing designs.

    I definitely have the same problem as in this thread: Bandwidth statistics not available

    I have here a design that should work easily, even with shared memory. I am currently building a dozen different designs to show you that I can never reach correct performance as soon as I add a CoefficientBuffer.
    I will post screen captures of all the different tries on Monday.
    I hope there will be a solution.


    I think you can close this thread; a similar problem is now reported in MA-VCX-QP performance problem when using CoefficientBuffer.

    I have a 5120x5120@80fps Mono8 camera.
    With the standard Acq_SingleCXP6x4AreaGray applet, I can run it without overflow at ~65 fps, which is consistent with the MA-VCX-QP bandwidth.
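    (Sanity check: 5120 × 5120 × 65 fps ≈ 1.70 GB/s, which is roughly what I would expect a PCIe Gen2 x4 DMA to sustain in practice; that is my assumption, not a datasheet figure.)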
    But if I try to build the simplest VisualApplets design, even though simulation and compilation are OK, I only get less than 1 fps of corrupted output.

    Is there a specific trick to handle the fact that the MA-VCX-QP won't output more than 8b@16x, while the camera has a minimum parallelism of 20x?


    Link to the applet CXP-Mono8-5120x5120.VA


    CXP-Mono8-5120x5120.JPG


    In the above example, the pixels are split into 2×4 bits, and an ImageFifo has been added before the ParallelDn.

    According to the sample code for such an applet, they should not be necessary, but I tried them anyway, since I have no clue yet.

    Does your answer also cover the following question:

    In that design, the bandwidth analysis reports over 3000 MB/s (for the part where the information is available), which is quite good.

    However, when I use it with a 4672x3416@148fps camera, the output is OK at low frame rates (~100 fps) and becomes corrupted near 122 fps.

    That limit of ~120 fps corresponds to ~1900 MB/s, which is well under 3000 MB/s.
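    For reference: 4672 × 3416 = 15,959,552 pixels per frame, so 122 fps ≈ 1947 MB/s, while the full 148 fps would need ≈ 2362 MB/s.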

    Do I have any tool to understand whether there is a bottleneck? How can I find a strategy to fix that?


    The limitation does not come from the PC, since the simplest design (just camera->image buffer->DmaOutput) runs OK at 148fps.

    Do you mean that when using a camera ROI, all the downstream operators anywhere in the design must be adapted manually at run-time?

    -SelectROI.XLength/SelectROI.YLength

    -ImageBuffer.XLength/ImageBuffer.YLength

    It is a pain that this cannot be factored into a single variable.
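    (For reference, this is how I would script those updates through the Fg SDK; just a sketch, where the applet file name and the Device1_Process0_... parameter strings are placeholders to be replaced by the names exported by the compiled applet:)

    #include <fgrab_struct.h>
    #include <fgrab_define.h>
    #include <fgrab_prototyp.h>

    /* Sketch: adapt the ROI-dependent parameters at run-time.
       Applet file name and parameter strings are placeholders. */
    static int set_int_param(Fg_Struct *fg, const char *name, int value)
    {
        int id = Fg_getParameterIdByName(fg, name);
        if (id < 0)
            return id;                              /* parameter not found */
        return Fg_setParameter(fg, id, &value, 0);  /* DMA index 0 */
    }

    int main(void)
    {
        Fg_Struct *fg = Fg_Init("MyApplet.hap", 0); /* placeholder applet */
        if (fg == NULL)
            return 1;

        /* New ROI: 1024x1000, applied to every length-dependent operator */
        set_int_param(fg, "Device1_Process0_SelectROI_XLength", 1024);
        set_int_param(fg, "Device1_Process0_SelectROI_YLength", 1000);
        set_int_param(fg, "Device1_Process0_ImageBuffer_XLength", 1024);
        set_int_param(fg, "Device1_Process0_ImageBuffer_YLength", 1000);

        Fg_FreeGrabber(fg);
        return 0;
    }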


    I don't understand that limitation. Even if the ImageBuffer has an XLength greater than the real ROI, since it is smart enough to handle EOL and keep the correct frame dimensions, why does it limit the performance?

    Hi,
    Yes, I am using MicroDisplay.
    Here is the doc: "This parameter is used to start loading of coefficient images into the buffer before the image acquisition starts. The loading is triggered by a write cycle of value one to this parameter. Writing value 0 does not cause the loading of the coefficient files."

    It was not clear to me that the loading is triggered by the write of value 1 itself. I just thought that the parameter had to be 1 all the time, and that 0 had no meaning.
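    (For future reference, in Fg SDK terms the trigger is then just a write of 1; a sketch, with a placeholder parameter name:)

    #include <fgrab_struct.h>
    #include <fgrab_prototyp.h>

    /* Trigger coefficient loading: writing 1 starts the load,
       writing 0 does nothing. The parameter name is a placeholder. */
    int reload_coefficients(Fg_Struct *fg)
    {
        int id = Fg_getParameterIdByName(fg,
                     "Device1_Process0_CoefficientBuffer_LoadCoefficientFiles");
        int one = 1;
        if (id < 0)
            return id;
        return Fg_setParameter(fg, id, &one, 0);  /* this write starts the load */
    }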

    Apart from that, case solved.

    I have a working design for a CXP camera, Mono8, 4672x3416@148fps, externally synchronized through CoaXPress via an OptoTrigger.

    This is a classic design: camera -> split into low/high nibbles (4 bits each) -> store in two infinite ImageBuffers -> merge pixels -> SelectROI operator -> DmaOutput


    https://seefastechnologies.com…416-synchro%2Btrigger.zip


    The problem occurs when I set the camera to 1024x1000@1000fps. I also set the SelectROI Width/Height to 1024x1000.

    In that case, which has a *lower* data rate than 4672x3416@148fps, the image buffers fill to 75% and there are overflows. I have to reduce the sync signal to ~830 fps to get flawless behaviour.
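    For comparison: 1024 × 1000 × 1000 fps ≈ 1024 MB/s, versus 4672 × 3416 × 148 fps ≈ 2362 MB/s, so the small-ROI case should leave plenty of headroom.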


    But I can make it work if I also manually set both image buffers' XLength and YLength to 1024x1000.

    I would like to get rid of this manual operation, since I expect Image Buffers to automatically adapt themselves to smaller images.


    I tried several other strategies, without success:

    -put the SelectROI before the Image buffers (it's even worse)

    -split the pixels across 4 image buffers of 2 bits instead of 2 image buffers of 4 bits (it does not change anything)

    -limit the camera parallelism to x20 instead of x32 (it's obviously worse, but I tried)

    Hello,

    I do not understand how to use the CoefficientBuffer properly to allow the applet user to dynamically load *any* file at run-time. It seems to me that:

    -a hard-coded path must be set as CoefficientFile0 and/or CoefficientFile1 before compiling the applet

    -the applet will only load those paths when used in MicroDisplay RT (after the applet has been flashed onto the board, of course)

    -changing the CoefficientFile0 and CoefficientFile1 path values through MicroDisplay does nothing; the frame grabber still uses the original hard-coded paths


    Is there anything special to do to allow dynamic reloading of coefficient files in a compiled applet?
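    To make the question concrete, this is the run-time sequence I am trying to achieve with the Fg SDK (a sketch: the parameter names are placeholders from my applet, and whether CoefficientFile0 is actually writable after compilation is exactly what I am asking):

    #include <fgrab_struct.h>
    #include <fgrab_define.h>
    #include <fgrab_prototyp.h>

    /* Sketch of the intended run-time sequence; names are placeholders. */
    int load_new_coefficients(Fg_Struct *fg, const char *path)
    {
        int fileId = Fg_getParameterIdByName(fg,
                         "Device1_Process0_CoefficientBuffer_CoefficientFile0");
        int loadId = Fg_getParameterIdByName(fg,
                         "Device1_Process0_CoefficientBuffer_LoadCoefficientFiles");
        int one = 1;
        if (fileId < 0 || loadId < 0)
            return -1;

        /* String parameter: the char* itself is passed as the value; depending
           on the SDK version, Fg_setParameterWithType(...,
           FG_PARAM_TYPE_CHAR_PTR) may be required instead. */
        if (Fg_setParameter(fg, fileId, path, 0) != FG_OK)
            return -1;
        return Fg_setParameter(fg, loadId, &one, 0); /* writing 1 triggers the load */
    }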

    Thanks for the detailed explanations. It will be useful.


    But for the current Otsu design, I only compared the figures given by the bandwidth test of VA itself, not a real measurement. For this test, without the "WaitBuffer" I can set Mean and Peak to 800 MB/s and get a green light (801 gives a red light). With the "WaitBuffer", 640 MB/s for both Mean and Peak is the max.

    I wondered why adding that ImageBuffer had such a huge effect on the simulated bandwidth.

    Hi,

    Here is the new design: Otsu-optimized.zip

    With the "Waitbuffer" (that avoids the deadlock), the simulation goes up to 640MB/s.

    Without the "WaitBuffer" (but that did not work on a real board), the simulation went up to 800MB/s.

    I would like to understand why, so that I can take it into account in future designs and in the expected performance when dimensioning the required resources.

    This is the first time I have used SyncToMax; now I understand better what it does!


    As you mentioned, I have replaced the "pixel replicator" instances by IsFirstPixel+Register after a SyncToMax.

    The algorithm still works, but the simulation shows the exact same bandwidth.

    It does not seem to optimize the bandwidth. It even consumes a little more FPGA resources.

    I tried to split the WaitBuffer into "low" and "high" bits, but it is still the same; it does not help increase the bandwidth.

    I have added an ImageBuffer on the "original image" branch before the last sync operator. It seems to remove the deadlock that was not shown by the simulation.
    However, the bandwidth test now drops from 800 MB/s to 600 MB/s.

    I finally succeeded.


    For instance, in the example of the previous message, where I wanted to multiply each pixel of the histogram by the sum of all values in the histogram, I found out that I can:

    -on an output branch of the histogram, perform a RowSum

    -on that branch, use "RemovePixel" for pixels 0-254, and keep pixel 255

    -replicate that last pixel 256 times

    -sync with the regular histogram output

    It seems to do exactly what I want.
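    In plain C terms, the result this chain produces is simply the following (a reference sketch of the intent, not of the VA implementation):

    #include <stdint.h>

    /* Reference for what the VA chain computes: every histogram bin
       multiplied by the sum of all bins. */
    void scale_histogram_by_total(const uint32_t hist[256], uint64_t out[256])
    {
        uint64_t sum = 0;
        for (int i = 0; i < 256; ++i)
            sum += hist[i];                    /* the RowSum branch */
        for (int i = 0; i < 256; ++i)
            out[i] = (uint64_t)hist[i] * sum;  /* sync + multiply */
    }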


    Now I have a design that works (I have implemented an adaptive binarization based on histogram analysis).
    Simulation works, the visual result on dummy images is OK, and the bandwidth analysis passes up to 800 MB/s (for both max and mean).

    But guess what? Once compiled on the board, with a real camera, I just get no images. The DmaToPC never gets a frameIndex > 0.


    The board is a MicroEnable5 VD8-PoCL

    The camera runs at 1280x1024@50fps


    How can I investigate where the problem occurs? Since the simulation works perfectly, I am very surprised.

    Thanks.

    However, I still don't understand something about synchronization.

    Imagine (for the sake of simplicity, this is just an example) that I want to multiply each "pixel" of the histogram by the sum of all the values in the histogram.

    I can make a FIFO of 256 values to store the incoming histogram values, while computing a "framesum" on a parallel link to get the sum of all histogram values.
    But then, how can I apply the final sum to all the FIFO values? I can't find which operator to use to "block" the FIFO output 255 times, so that once unblocked, the total sum is up to date and gets properly applied to the 256 stored values.
    I guess I have to use IsLastPixel and RemovePixel, but I can't make it work.