Detect min/max position

  • Hello,
    I can't find a way to get the location of the min/max value in a data1D or data2D.

    For instance, imagine I want to implement Otsu binarization on 8-bit images. The idea is to compute a histogram, then find the threshold that best separates the pixels into two classes. I could compute the 256 possible threshold separations in parallel and select the one that maximizes some formula. But I would need to detect the maximum and use its position to select a branch. Is that possible?
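
    As a software reference for the search described above, here is a minimal Python sketch (not VisualApplets code). It assumes the common between-class-variance form of the Otsu criterion; the post itself only says "some formula". The function name `otsu_threshold` is made up for illustration.

```python
def otsu_threshold(hist):
    """Return the gray level t that maximizes the between-class variance.

    Class 0 holds gray levels <= t, class 1 holds gray levels > t.
    Assumption: the score is w0 * w1 * (mean0 - mean1)^2 (unnormalized).
    """
    total = sum(hist)
    weighted_total = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # sum of gray levels in class 0
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, nothing to score
        mean0 = sum0 / w0
        mean1 = (weighted_total - sum0) / w1
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:  # plays the role of a "new maximum found" flag
            best_score = score
            best_t = t          # latch the position of the maximum
    return best_t
```

    The `if score > best_score` test is the part that needs a hardware equivalent: a flag that fires on a new maximum, plus something that latches the current position.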

    Another use case: I perform a convolution with a kernel and want to find the location of the maximum of the convolution result, so that I can do some work around that location.

    Is it possible with standard operators?

  • Hello Pierre,

    (attached image: MaxPos.png)


    To get the location of a global/local Max or Min value you can use RowMax, ColMax, or FrameMax. These operators have a second output that is binary 1 (true) whenever a new extreme value is found. Use Register to keep the X/Y coordinate of the new extreme value.


    If you want to get a Max peak in a histogram simply use the Histogram(preset: 1 line) output as input for FrameMax.


    Whenever a new extreme value is found, use the operators Coordinate_X/Y together with Register to latch the coordinates of that value.

    Some additional details will be necessary for further processing: please have a look at the attached VA example design.


    Same approach for Col- and Row-Max/Min.
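
    The behaviour of that flag-plus-Register combination can be modelled in software. The following Python sketch is only an illustration of the idea, not the operators' actual implementation; in particular, treating a tie as "no new maximum" is an assumption, and `frame_max_position` is a made-up name.

```python
def frame_max_position(frame):
    """Scan a frame pixel by pixel and latch the coordinates of the maximum.

    `frame` is a list of rows. A new-maximum flag is raised when the current
    pixel exceeds the best value so far (assumption: strictly greater), and
    two "registers" then capture the current x and y coordinates.
    """
    best_value = None
    reg_x = reg_y = None  # models two Register operators
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            is_new_max = best_value is None or value > best_value
            if is_new_max:
                best_value = value
                reg_x, reg_y = x, y  # capture Coordinate_X / Coordinate_Y
    return best_value, reg_x, reg_y


print(frame_max_position([[1, 5, 3], [7, 2, 7]]))
```

    Note that the registers only hold the final position once the whole frame has passed, which is why later processing has to be synchronized to the end of the frame.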

  • Thanks.

    However, I still don't understand something about synchronization.

    Imagine (for the sake of simplicity, this is just an example) that I want to multiply each "pixel" of the histogram by the sum of all the values in the histogram.

    I can make a FIFO of 256 values to store the incoming histogram values, while making a "framesum" on a parallel link to compute the sum of all histogram values.
    But then, how can I apply the final sum to all the FIFO values? I can't find which operator to use to "block" the FIFO output 255 times, so that once it is unblocked, the total sum is up to date and is properly applied to the 256 stored values.
    I guess I have to use IsLastPixel and RemovePixel, but I can't make it work.

  • I finally succeeded.


    For instance, in the example of the previous message, where I wanted to multiply each pixel of the histogram by the sum of all values in the histogram, I found out that I can:

    -on an out branch of the histogram, perform RowSum

    -on that branch, use "RemovePixel" for pixels 0-254, and keep pixel 255

    -replicate that last pixel 256 times

    -sync with the regular histogram output

    It seems to do exactly what I want.
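
    For reference, the four steps above can be modelled in a few lines of Python. This is a software sketch of the data flow only; it assumes RowSum delivers a running sum, so that the last pixel of the line carries the total. The function name is made up for illustration.

```python
def scale_histogram_by_sum(hist):
    """Multiply every histogram bin by the sum of all bins."""
    # Step 1: running sum along the line (assumed RowSum behaviour).
    running = []
    acc = 0
    for value in hist:
        acc += value
        running.append(acc)

    # Step 2: remove every pixel except the last one (RemovePixel).
    last = running[-1:]

    # Step 3: replicate that single pixel to the full line length.
    replicated = last * len(hist)

    # Step 4: sync with the regular histogram output and multiply.
    return [h * s for h, s in zip(hist, replicated)]


print(scale_histogram_by_sum([1, 2, 3]))  # [6, 12, 18]
```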


    Now, I have a design that works (I have implemented an adaptive binarization based on histogram analysis).
    Simulation works, visual result on dummy images is OK, bandwidth analysis works up to 800 MB/s (for max and mean)

    But guess what? Once compiled on the board, with a real camera, I get no images at all. The DmaToPC never gets a frameIndex > 0.


    The board is a MicroEnable5 VD8-PoCL

    The camera runs at 1280x1024@50fps


    How can I investigate where the problem occurs? Since the simulation works perfectly, I am very surprised.

  • This is the first time I have used SyncToMax; now I understand better what it does!


    As you mention, I have replaced the "pixel replicator" instances by isFirstPixel+Register after a SyncToMax.
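
    A rough software model of that replacement, as a Python sketch. It assumes that SyncToMax pads the shorter stream with dummy pixels up to the length of the longer one, and that the Register captures its input when isFirstPixel is true and holds it afterwards. Both points are my reading, not something confirmed here, and the names are made up.

```python
def sync_to_max_and_hold(short_stream, long_stream, dummy=0):
    """Pair every pixel of long_stream with the first pixel of short_stream.

    Assumption: the sync step pads short_stream with `dummy` pixels so both
    streams have the same length; the register then hides those dummies.
    """
    padding = [dummy] * (len(long_stream) - len(short_stream))
    padded = list(short_stream) + padding

    held = None  # models the Register content
    out = []
    for index, (s, l) in enumerate(zip(padded, long_stream)):
        is_first_pixel = index == 0
        if is_first_pixel:
            held = s  # capture once, then hold for the rest of the line
        out.append((held, l))
    return out


print(sync_to_max_and_hold([6], [1, 2, 3]))  # [(6, 1), (6, 2), (6, 3)]
```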

    The algorithm still works, but the simulation shows the exact same bandwidth.

    It does not seem to improve the bandwidth. It even consumes slightly more FPGA resources.

    I tried to split the Wait Buffer into "low" and "high" bits, but the result is still the same: it does not help increase the bandwidth.

  • In certain streams of the VA design the operator ParallelDN is/was used.

    That operator is by definition a bottleneck.

    SelectFromParallel is sometimes a different/better solution.


    Please post the relevant parts of the modified/new VA design here and I will have a look at it.

  • Hi,

    Here is the new design : Otsu-optimized.zip

    With the "Waitbuffer" (that avoids the deadlock), the simulation goes up to 640MB/s.

    Without the "WaitBuffer" (but that did not work on a real board), the simulation went up to 800MB/s.

    I would like to understand why, so that I can take it into account in future designs and in the expected performance when dimensioning the required resources.

  • Hi,


    The design itself is connected by parallelism 8 (P=8) at 8 bit (B=8) with the DMA using a system clock of 125 MHz (S).

    In theory this provides up to 1 GB/s = P * B * S = 8 * 8 bit * 125 MHz.
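
    The same arithmetic as a small Python helper, in case it is useful for checking other link configurations. The function name is made up for illustration; the result is the theoretical peak only.

```python
def link_bandwidth_bytes_per_s(parallelism, bit_depth, clock_hz):
    """Theoretical link bandwidth: P * B * S bits per second, in bytes/s."""
    bits_per_second = parallelism * bit_depth * clock_hz
    return bits_per_second / 8  # 8 bits per byte


# P = 8, B = 8 bit, S = 125 MHz, as in the design discussed here
print(link_bandwidth_bytes_per_s(8, 8, 125_000_000))  # 1000000000.0, i.e. 1 GB/s
```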

    In practice the DMA will work at a lower bandwidth due to certain overhead and the used mainboard chipset.


    Here the number of connected PCIe lanes, the PCIe generation in use, and the trained PCIe payload size are of interest.

    Our runtime includes a driver that supports our DMA900 or DMA3600 engine, which in general tries to deliver the maximum possible bandwidth.


    In case the PCIe payload size is >= 256 the bandwidth should be around 900 MB/s in case 4 PCIe Gen1 lanes are used.

    If the PCIe payload size is < 256 the bandwidth will be less than 900 MB/s in case 4 PCIe Gen1 lanes are used.

    A lot of additional and chipset related details would come into the discussion.

    But to make this post short: A bandwidth of 640 MB/s may be related to a PCIe payload size < 256.


    If you want to check this at runtime, please consult microDiagnostics and look into the log on the start page (see DMA.png).

    In this case we can see:


    PCIe Performance: PCIe is highspeed capable


    That is an indicator that the DMA engine can work at up to full performance:

    PCIe x4 Gen1 = DMA900 ok = up to 900 MB/s possible

    PCIe x4 Gen2 = DMA1800 ok = up to 1800 MB/s possible

    PCIe x8 Gen2 = DMA3600 ok = up to 3600 MB/s possible


    If the PCIe Performance is stated NOT highspeed capable the performance will be less.

    PCIe x4 Gen1 = no DMA900 = less than 900 MB/s possible

    PCIe x4 Gen2 = no DMA1800 = less than 1800 MB/s possible

    PCIe x8 Gen2 = no DMA3600 = less than 3600 MB/s possible


    I do not have typical PCIe performance measurements for this case,

    but a bandwidth of 640 MB/s could be caused by a PCIe payload that is < 256.


    But we have a performance test you can use in microDiagnostics:

    Details on microDiagnostics : PCIe Performance Test

    Those tests are related to different image sizes. Please check this for your case!


    Additional reasons could be multiplexed/shared PCIe lanes, or fewer connected lanes than expected.

    A PCIe slot may support fewer PCIe lanes than its physical connector suggests.

    Or the design expects PCIe Gen2 but the mainboard only supports Gen1.

  • Thanks for the detailed explanations. It will be useful.


    But for the current OTSU design, I only compared the figures given by the bandwidth test of VA itself, not a real measurement. For this test, without the "WaitBuffer" I can set Mean and Peak to 800 MB/s and get a green light (801 will give red light). With the "WaitBuffer", 640 for both Avg and Peak is the max.

    I wondered why adding that ImageBuffer had such a large effect on the simulated bandwidth.

  • Hi Pierre,


    Please calculate the expected performance of a link on the basis of its properties like parallelism 8 (P=8) and bit depth of 8 bit (B=8) and the defined system clock of 125 MHz (S).


    In theory such a link provides up to 1 GB/s = P * B * S = 8 * 8 bit * 125 MHz.

    How to calculate easily ...


    All O-type, all P-type, and most M-type operators perform at an efficiency of 100%.


    Exceptions are:

    Memory (ImageBuffer and friends) : in case of non-linear addressing

    Memory (ImageBuffer and friends) for mE5 marathon : shared memory

    DMAtoPc : mainboard/chipset related

    All camera operators can only deliver at the used interface maximum speed.


    Side effects of SYNC, InsertImage, and InsertLine can cause limitations and even deadlocks when these operators are used incorrectly.

    The VA "Bandwidth Analysis" is only useful for very simple designs that represent a single straight stream.