Detect min/max position

Pierre Chatelier · Feb 8th 2019

Hello,
I can't find a way to get the location of the min/max value in a data1D or data2D stream.

For instance, imagine I want to implement an Otsu binarization on 8-bit images. The idea is to compute a histogram, then find the threshold that best separates the pixels into two classes. I could compute the 256 possible threshold separations in parallel and select the one maximizing some formula. But I would need to detect the maximum and use its position to select a branch. Is that possible?
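For reference, the threshold search described above can be sketched in software. This is only a minimal Python model, not VisualApplets code. It assumes the "formula" is the usual Otsu criterion (between-class variance, here left unnormalized), and the function name `otsu_threshold` is made up for the illustration.

```python
def otsu_threshold(histogram):
    """Return the threshold t maximizing the between-class variance.

    Pixels with value <= t form class 0, all others form class 1.
    The score is the between-class variance up to a constant factor.
    """
    total = sum(histogram)
    weighted_total = sum(i * h for i, h in enumerate(histogram))
    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels in class 0
    sum0 = 0  # sum of grey values in class 0
    for t, h in enumerate(histogram):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no separation to score
        mean0 = sum0 / w0
        mean1 = (weighted_total - sum0) / w1
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:  # keep the first maximum on ties
            best_t, best_score = t, score
    return best_t


# Two clearly separated grey levels: 100 pixels at 50, 100 pixels at 200.
hist = [0] * 256
hist[50] = 100
hist[200] = 100
threshold = otsu_threshold(hist)
```

The final step, picking the position of the best score, is exactly the "location of the maximum" problem asked about in this post.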

Another use case: I perform a convolution with a kernel and want to find the location of the maximum of the convolution result, to perform some work around that location.

Is this possible with standard operators?

B.Ru · Feb 11th 2019

MaxPos.png

Hello Pierre,

To get the location of a global/local max or min value you can use RowMax, ColMax or FrameMax. These operators have a second output that is binary 1 (true) whenever a new extreme value is found. Use a Register to keep the X/Y coordinate of the new extreme value.

If you want to get a Max peak in a histogram simply use the Histogram(preset: 1 line) output as input for FrameMax.

Whenever a new extreme value is found, use the operators Coordinate_X/Y and a Register to latch the current coordinates.

Some additional details will be necessary for further processing: please have a look at the attached VA example design.

Same approach for Col- and Row-Max/Min.
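The "new extreme flag plus Register" idea can be modelled in software as follows. This is a minimal Python sketch of the principle only, not of the actual operators. The name `frame_max_position` is hypothetical, and treating ties as "not a new maximum" is an assumption about the flag.

```python
def frame_max_position(frame):
    """Scan a frame pixel by pixel and latch the position of the maximum.

    frame: list of rows, each row a list of integers.
    Returns (max_value, x, y); ties keep the first occurrence.
    """
    current_max = None
    reg_x = reg_y = None  # play the role of the two coordinate registers
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            # Second output of the max operator: 1 when a new extreme appears.
            is_new_max = current_max is None or value > current_max
            if is_new_max:
                current_max = value
                reg_x, reg_y = x, y  # registers capture the coordinates
    return current_max, reg_x, reg_y


result = frame_max_position([[1, 5, 3],
                             [9, 2, 9]])
```

After the last pixel of the frame, the registers hold the coordinates of the global maximum, which is why the result is only valid at the end of the frame.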

Pierre Chatelier · Feb 11th 2019

Thanks.

However, I still don't understand something about synchronization.

Imagine (for the sake of simplicity, this is just an example) that I want to multiply each "pixel" of the histogram by the sum of all the values in the histogram.

I can make a FIFO of 256 values to store the incoming histogram values, while making a "framesum" on a parallel link to compute the sum of all histogram values.
But then, how can I apply the final sum to all the FIFO values ? I can't find which operator to use to "block" the FIFO output 255 times, so that once unblocked, the total sum is up-to-date and will be properly applied to the 256 stored values.
I guess I have to use IsLastPixel and RemovePixel, but I can't make it work.

Pierre Chatelier · Feb 12th 2019

I finally succeeded.

For instance, in the example of the previous message, where I wanted to multiply each pixel of the histogram by the sum of all values in the histogram, I found out that I can:

- on an output branch of the histogram, perform RowSum

- on that branch, use "RemovePixel" for pixels 0-254, and keep pixel 255

- replicate that last pixel 256 times

- sync with the regular histogram output

It seems to do exactly what I want.
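The steps above can be mirrored in a small software model. This is only a Python sketch of the data flow, assuming RowSum delivers a running sum so that the last pixel carries the total. The function name `scale_histogram_by_sum` is made up for the illustration.

```python
def scale_histogram_by_sum(histogram):
    """Multiply every histogram bin by the sum of all bins."""
    # Step 1: running sum along the line (assumed behaviour of RowSum).
    running = []
    acc = 0
    for value in histogram:
        acc += value
        running.append(acc)
    # Step 2: drop every pixel except the last one, which holds the total.
    last = running[-1:]
    # Step 3: replicate that single pixel to the histogram length.
    replicated = last * len(histogram)
    # Step 4: sync with the original histogram and multiply pixel by pixel.
    return [h * s for h, s in zip(histogram, replicated)]


scaled = scale_histogram_by_sum([1, 2, 3])
```

Note that the total is only known after the last bin has arrived, so the original histogram branch has to be buffered until then.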

Now, I have a design that works (I have implemented an adaptive binarization based on histogram analysis).
Simulation works, visual result on dummy images is OK, bandwidth analysis works up to 800 MB/s (for max and mean)

But guess what ? Once compiled on the board, with a real camera, I just get no images. The DmaToPC never gets a frameIndex > 0.

The board is a MicroEnable5 VD8-PoCL

The camera runs at 1280x1024@50fps

How can I investigate where the problem occurs ? Since the simulation works perfectly, I am very surprised.

B.Ru · Feb 13th 2019

The observed behaviour (no frames) may be caused by a deadlock in the VA design.

If possible attach the VA design here and I will do a review of it.

Pierre Chatelier · Feb 13th 2019

Here is the design

Otsu.zip

Pierre Chatelier · Feb 14th 2019

I have added an ImageBuffer on the "original image" branch before the last sync operator. It seems to remove the deadlock that was not shown by the simulation.
However, the bandwidth test now drops from 800MB/s to 600MB/s.

B.Ru · Feb 19th 2019

Attached you can find the design version without dead-lock.

The simulation itself does not detect deadlocks.

While the histogram is generated and the "Otsu" result is calculated, the image needs to wait in the second RAM buffer.
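Why the image has to wait can be illustrated with a toy software model. This is a deliberately simplified Python sketch and not a description of how the hardware or the simulator works. It assumes the source feeds both branches in lock step, the threshold only becomes available after the last pixel of the frame, and the image branch can hold at most `fifo_capacity` pixels. All names are hypothetical.

```python
def frame_completes(n_pixels, fifo_capacity):
    """Return True if a frame can pass the final sync, False on deadlock."""
    fifo = 0    # pixels currently waiting on the image branch
    pushed = 0  # pixels delivered by the source so far
    popped = 0  # pixels that have left through the sync
    while popped < n_pixels:
        threshold_ready = pushed == n_pixels
        if threshold_ready and fifo > 0:
            # Sync can combine one waiting pixel with the threshold.
            fifo -= 1
            popped += 1
        elif pushed < n_pixels and fifo < fifo_capacity:
            # Source can deliver one more pixel to both branches.
            fifo += 1
            pushed += 1
        else:
            # Image branch is full and the threshold is not ready yet:
            # nothing can move any more.
            return False
    return True


too_small = frame_completes(n_pixels=256, fifo_capacity=255)
large_enough = frame_completes(n_pixels=256, fifo_capacity=256)
```

In this model the frame only gets through when the image branch can store a complete frame, which matches the role of the additional buffer.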

B.Ru · Feb 19th 2019

The reduced bandwidth is related to the selected approach, which enlarges the threshold image to the full image size before syncing it back with the original image.

I would recommend syncing only the resulting threshold value, as a single-pixel image, and stretching it after the SYNC (SyncToMax) by using the Register and IsFirstPixel operators.
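The "stretch after the sync" idea can be pictured with a short software model. This is a Python sketch under assumptions: it supposes that the sync extends the single-pixel threshold stream to the image length with dummy values, and that a register capturing on the first pixel then holds the threshold for the rest of the frame. The names `stretch_first_pixel` and `binarize` are hypothetical.

```python
def stretch_first_pixel(padded_stream):
    """Hold the first value of a stream for all following pixels."""
    held = None
    out = []
    for index, value in enumerate(padded_stream):
        is_first_pixel = index == 0
        if is_first_pixel:
            held = value  # register captures only on the first pixel
        out.append(held)
    return out


def binarize(image_pixels, threshold):
    """Binarize a pixel stream with a threshold that arrives as one pixel."""
    # Assumed effect of the sync: one valid pixel followed by dummy values.
    padded = [threshold] + [0] * (len(image_pixels) - 1)
    stretched = stretch_first_pixel(padded)
    return [1 if p > t else 0 for p, t in zip(image_pixels, stretched)]


binary = binarize([10, 200, 128, 129], threshold=128)
```

Only one threshold pixel crosses the sync here; the replication happens afterwards, next to the comparison.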

Pierre Chatelier · Feb 19th 2019

This is the first time I use SyncToMax, now I understand better what it does !

As you suggested, I have replaced the "pixel replicator" instances by IsFirstPixel+Register after a SyncToMax.

The algorithm still works, but the simulation shows the exact same bandwidth.

It does not seem to improve the bandwidth. It even consumes a little more FPGA resources.

I tried to split the Wait Buffer into "low" and "high" bits, but the result is still the same: it does not help increase the bandwidth.

B.Ru · Feb 25th 2019

In certain streams of the VA design the operator ParallelDN is/was used.

That operator is, by definition, a bottleneck.

SelectFromParallel is sometimes a different/better solution.

Please post the modified/new parts of the VA design here and I will have a look at them.

Pierre Chatelier · Feb 25th 2019

Hi,

Here is the new design : Otsu-optimized.zip

With the "Waitbuffer" (that avoids the deadlock), the simulation goes up to 640MB/s.

Without the "WaitBuffer" (but that did not work on a real board), the simulation went up to 800MB/s.

I would like to understand why, so that I can take it into account in future designs and in the expected performance when dimensioning the required resources.

B.Ru · Feb 25th 2019

Hi,

The design itself is connected by parallelism 8 (P=8) at 8 bit (B=8) with the DMA using a system clock of 125 MHz (S).

In theory this provides up to 1 GB/s = P * B * S = 8 * 8 bit * 125 MHz.
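As a quick check of this arithmetic, here is a small Python sketch. The function names are hypothetical, and the camera figure assumes 8 bits per pixel for the 1280x1024@50fps camera mentioned earlier in the thread.

```python
def link_bandwidth_bytes_per_s(parallelism, bit_depth, clock_hz):
    """Theoretical link bandwidth: P * B * S, converted from bits to bytes."""
    return parallelism * bit_depth * clock_hz // 8


def camera_bandwidth_bytes_per_s(width, height, fps, bit_depth=8):
    """Raw data rate of a camera stream (bit_depth of 8 is an assumption)."""
    return width * height * fps * bit_depth // 8


link = link_bandwidth_bytes_per_s(parallelism=8, bit_depth=8,
                                  clock_hz=125_000_000)
camera = camera_bandwidth_bytes_per_s(width=1280, height=1024, fps=50)
print(f"link:   {link / 1e6:.1f} MB/s")
print(f"camera: {camera / 1e6:.1f} MB/s")
```

Under these assumptions the camera stream needs about 65.5 MB/s, well below the theoretical 1000 MB/s of the link.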

In practice the DMA will work at a lower bandwidth due to certain overhead and the used mainboard chipset.

Here the number of connected PCIe lanes, the PCIe generation in use and the trained PCIe payload size are of interest.

Our runtime includes a driver that supports our DMA900 or DMA3600 engine, which in general tries to deliver the maximum possible bandwidth.

If the PCIe payload size is >= 256, the bandwidth should be around 900 MB/s when 4 PCIe Gen1 lanes are used.

If the PCIe payload size is < 256, the bandwidth will be less than 900 MB/s when 4 PCIe Gen1 lanes are used.

A lot of additional and chipset related details would come into the discussion.

But to make this post short: A bandwidth of 640 MB/s may be related to a PCIe payload size < 256.

If you want to check this at runtime, please open microDiagnostics and look at the log on the start page: DMA.png

In this case we can see:

PCIe Performance: PCIe is highspeed capable

That is an indicator that the DMA engine can work at up to full performance:

PCIe x4 Gen1 = DMA900 ok = up to 900 MB/s possible

PCIe x4 Gen2 = DMA1800 ok = up to 1800 MB/s possible

PCIe x8 Gen2 = DMA3600 ok = up to 3600 MB/s possible

If the PCIe Performance is stated NOT highspeed capable the performance will be less.

PCIe x4 Gen1 = no DMA900 = less than 900 MB/s possible

PCIe x4 Gen2 = no DMA1800 = less than 1800 MB/s possible

PCIe x8 Gen2 = no DMA3600 = less than 3600 MB/s possible

I do not have typical PCIe performance measurements for this case, but a bandwidth of 640 MB/s could be caused by a PCIe payload size that is < 256.

But we have a performance test you can use in microDiagnostics:

Details on microDiagnostics : PCIe Performance Test

Those tests are related to different image sizes. Please check this for your case!

Additional reasons could be multiplexed/shared PCIe lanes or fewer connected lanes.

A PCIe slot may support fewer PCIe lanes than the physical connector suggests.

Or the design expects PCIe Gen2 but the mainboard only supports Gen1.

Pierre Chatelier · Feb 25th 2019

Thanks for the detailed explanations. It will be useful.

But for the current OTSU design, I only compared the figures given by the bandwidth test of VA itself, not a real measurement. For this test, without the "WaitBuffer" I can set Mean and Peak to 800 MB/s and get a green light (801 will give red light). With the "WaitBuffer", 640 for both Avg and Peak is the max.

I wondered why adding that ImageBuffer had such a huge effect on the simulated bandwidth.

B.Ru · Feb 25th 2019

Hi Pierre,

Please calculate the expected performance of a link on the basis of its properties like parallelism 8 (P=8) and bit depth of 8 bit (B=8) and the defined system clock of 125 MHz (S).

In theory such a link provides up to 1 GB/s = P * B * S = 8 * 8 bit * 125 MHz.

How to calculate easily ...

All O-type, all P-type and most M-type operators perform at an efficiency of 100%.

Exceptions are:

Memory (ImageBuffer and friends) : in case of non-linear addressing

Memory (ImageBuffer and friends) for mE5 marathon : shared memory

DMAtoPc : mainboard/chipset related

All camera operators can only deliver at the used interface maximum speed.

Side effects of SYNC, InsertImage, InsertLine cause limitations and even deadlocks in case of wrong usage.

The VA "Bandwidth Analysis" is only useful for very simple designs representing a single straight stream.

Pierre Chatelier · Feb 25th 2019

Ok, thanks.
If you don't detect any major mistake in my design, case closed for me.

B.Ru · Mar 1st 2019

Ok, thank you

kevinh · Aug 6th 2020

Dear Pierre,

I am very interested in an OTSU-Thresholding Method, sadly the last link: Otsu-optimized is broken.

If it isn't too much trouble would you be able to reupload your final design?

Greetings,

Kevin

B.Ru · Aug 7th 2020

Dear Kevin,

Maybe this will give you some good hints until you receive the final design from Pierre:

Otsu_NoDeadLock.va

>> Taken from previous post..

Best regards,

Björn
