MA-VCX-QP performance problem when using CoefficientBuffer

Pierre Chatelier · May 17th 2019

I have a performance problem with a Shading applet on the MA-VCX-QP when adding CoefficientBuffer operators (inspired by the Shading example in the VA install folder).
I took care of the bandwidth and there should be no problem, but my design hits a very low frame-rate limit.

[Context]
The board is a Marathon MA-VCX-QP.

The camera is a Mono8 5120x5120@80fps, but I only target 50 fps for this board.

The targeted camera bandwidth is 5120 x 5120 x 50 bytes/s = 1250 MB/s ~ 1.22 GB/s.
After the CXPQuadCamera operator, the native 8b@32x stream is reduced to 8b@16x downstream of the input InfiniteSource ImageBuffer.
Then the VA design contains a shading algorithm.
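
As a rough sanity check of these figures, the camera data rate can be compared with the peak capacity of an 8b@16x link. This is only a sketch: the 125 MHz design clock is an assumption (that frequency is quoted later in this thread for the test applet, not for this design), and the function names are purely illustrative.

```python
# Sanity check: camera data rate vs. peak capacity of an 8b@16x link.
# ASSUMPTION: 125 MHz design clock; the thread does not state it for this design.

def camera_rate_bytes(width, height, bits_per_pixel, fps):
    """Payload data rate of the camera in bytes per second."""
    return width * height * bits_per_pixel // 8 * fps

def link_capacity_bytes(bit_width, parallelism, clock_hz):
    """Peak bytes per second of a link, assuming one word per clock tick."""
    return bit_width * parallelism // 8 * clock_hz

required = camera_rate_bytes(5120, 5120, 8, 50)
capacity = link_capacity_bytes(8, 16, 125_000_000)

print(required / 2**20)      # 1250.0 (MiB/s)
print(capacity / 2**20)      # roughly 1907 (MiB/s)
print(required <= capacity)  # True
```

Under that assumption the link itself has headroom, which is consistent with the design running at 50 fps when constants replace the coefficient reads.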

If I submit dummy constants to the shading algorithm instead of reading CoefficientBuffers, it works @50fps.

Now, I want to read shading coefficients in a CoefficientBuffer.
According to the "Shared memory" documentation of the MA-VCX-QP, the optimal bandwidth should be of width 256b, so I first configured the CoefficientBuffer to output 64b@4x, with proper CastToParallel/ImageFifo/ParallelDn to transform the input file from TIFF 16b to 16b@16x shading information.

In that case, the design won't run at more than 20fps, which is far from what the MA-VCX-QP RAM bandwidth could sustain, even with shared memory.

I tried several variants, boosting the CoefficientBuffer output to 64b@8x, or reducing the data to 8b@16x shading information. I tried adding ImageBuffers as FIFOs. I tried many things, but I always hit that ~20fps limit.

I will soon post screenshots of the different failing designs.

Pierre Chatelier · May 20th 2019

Here are the screen shots.

1) Reference design

2) Different attempts (only a selection; I must have tried something like 20 different variations)

CXP-Mono8-5120x5120+shading+bpr+gpi.jpg

CXP-Mono8-5120x5120+shading+bpr+gpi16.jpg

CXP-Mono8-5120x5120+shading+bpr+gpi2.jpg

CXP-Mono8-5120x5120+shading+bpr+gpi13a.jpg

CXP-Mono8-5120x5120+shading+bpr+gpi13b.jpg

CXP-Mono8-5120x5120+shading+bpr+gpi14.jpg

CXP-Mono8-5120x5120+shading+bpr+gpi17.jpg

B.Ru · May 20th 2019

Dear Pierre,

Thank you for your precise description.

Below you can see a stripped down design that is showing the aspects of the shared memory setup for shading correction:

pasted-from-clipboard.png

Download VA design sources: QuadCXP_Shading_BRudde.va (please download the fixed version in the post below)

Here the design has to take care of shared memory.

So we increase to maximum bandwidth for ALL buffers:

64 * 8 bit = 512 bit

As soon as we use 512 bit (mE5-MA-VCX-QP) for all buffers, we can receive maximum performance.

The 512 bit are based on the memory interface and can be found in the appendix of the VA documentation.
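
To make the "64 * 8 bit = 512 bit" rule concrete, here is a small sketch that computes the parallelism a link of a given bit width needs in order to fill the RAM data width. The 512-bit default is the value quoted in this thread for the mE5-MA-VCX-QP; the function name is illustrative only.

```python
# Parallelism needed for a link to fill the full RAM data width.
# 512 bit is the value quoted in this thread for the mE5-MA-VCX-QP;
# other boards may differ (see the appendix of the VA documentation).

def parallelism_for_full_width(bit_width, ram_data_width=512):
    """Return the parallelism P such that bit_width * P == ram_data_width."""
    if ram_data_width % bit_width != 0:
        raise ValueError("bit width does not divide the RAM data width")
    return ram_data_width // bit_width

for bits in (8, 16, 64):
    print(f"{bits}b@{parallelism_for_full_width(bits)}x")
# 8b@64x
# 16b@32x
# 64b@8x
```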

If we do not take care of this for even one buffer, that buffer will slow down all the other memory buffers,

even if those are connected with a higher-bandwidth option.

Around the ImageBuffer:

ParallelUp raises the link to full width to get the maximum performance...

ParallelDn then brings the link back down to the mean target bandwidth.

So far I did not check this in hardware, but the synthesis is running right now.

I will update my measurement results in the next post...

B.Ru · May 20th 2019

Dear Pierre,

After the synthesis a runtime investigation was done by using our Runtime/SDK 5.7.0 x64 Windows.

Some more details on the used hardware can be seen in the screenshot below:

pasted-from-clipboard.png

The DMA/PCIe details are of particular interest, in order to see whether all conditions for maximum DMA performance are met. If the DMA performance were limited, this could restrict the overall performance.

In microDiagnostics the Performance dialog shows the maximum bandwidths for several image resolutions:

pasted-from-clipboard.png

Since the test design is based on 1024 x 1024 x 8 bit = 1 MB images I would expect a theoretical maximum of 1800 frames/s in the real test.

After flashing the synthesized applet into the hardware, activating it and loading in microDisplay a simple test acquisition was started. For the tests a "fake" camera (that was designed into this applet you can download below) was used.

During the first testing attempts the number of frames per second received did not match the expectations discussed above.

So a quick review was done and a good reason for this limitation could be found:

The coefficient images were much too large, and that was slowing down the overall performance.

Because the large images need to be read out of the RAM, wasting the bandwidth...

So the VA design default preset was modified, the SYNC changed "toMax", so that this issue will not arise without notification again.

Then the performance reached at least 1600 Hz, i.e. 1600 Hz * 1024 * 1024 * 8 bit = 1677.7216 MB/s
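
The figure above can be reproduced with a few lines. This sketch also shows that the result is expressed in decimal megabytes, whereas the "1 MB" image size is really 1 MiB; the function name is illustrative.

```python
# Payload throughput for a given frame rate and image size.

def throughput_bytes(fps, width, height, bits_per_pixel=8):
    """Bytes per second transferred at the given frame rate."""
    return fps * width * height * bits_per_pixel // 8

rate = throughput_bytes(1600, 1024, 1024)
print(rate / 1e6)    # 1677.7216 (decimal MB/s, the figure quoted in the post)
print(rate / 2**20)  # 1600.0    (MiB/s, i.e. one 1 MiB image per frame)
```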

pasted-from-clipboard.png

Reaching the maximum frame rate of 1800 Hz at a resolution of 1024 * 1024 * 8 bit = 1 MB per image would require a slight trick.

Please append several images to each other by using AppendImage, since "small" DMA transfers reduce the efficiency.

Please have a look at the DMA bandwidth performance dialog (resolution and 48bit per pixel), where you can see what transfer size you require for reaching 1800 MB/s.

Thank you for your patience...

Here you can download the IMPROVED/FIXED design.

QuadCXP_Shading_BRudde_FakeCAM.va

Pierre Chatelier · May 21st 2019

Hi,

While the compilation is running, I can already give you some information:

First, the performance of my board

MA-VCX-QP-performance.JPG

Second, the test of your applet:

QuadCXP_Shading_BRudde_FakeCAM.JPG

As you can see, it is around 1500fps rather than 1600 fps.

I don't understand two things :

-I couldn't run your applet under MicroDisplay without my camera being detected (if no camera is found, I cannot start the applet, the buttons are disabled)

-The ROI(ImageBuffer) > FillLevel is at 75%, so the FIFO is full?!

I have checked that just setting the input Imagebuffer to 512b/clock is not enough.

CXP-Mono8-5120x5120+shading+bpr+gpi22.jpg

CXP-Mono8-5120x5120+shading+bpr+gpi21.jpg

B.Ru · May 21st 2019

Hi Pierre,

Concerning your first question. If you want to run mDisplay without camera,

mDisplay -> Tools -> Settings -> Check "Ignore Camclock status" and apply:

pasted-from-clipboard.png

By this the connection to a camera is not checked before starting the acquisition.

Concerning your second question:

The ImageBuffer FillLevel needs to be below 75% (better 0%) to investigate the maximum bandwidth supported.

Otherwise the delivered data can not be processed/transferred fast enough.

Please use the way the Coefficient buffer in my VA design QuadCXP_Shading_BRudde_FakeCAM.va is used.

The reason for this is the internal memory bandwidth handling of that operator.

More details on that can be found in a different thread.

We need full performance of 12.2 GB/s here, so the link needs to be:

8 Link, Par 2

12201 MB/s

512 MiB

B.Ru · May 21st 2019

Please do not hesitate to contact me in case of further difficulties.

Maybe an online/remote session on your system with audio assistance would be useful.

Pierre Chatelier · May 21st 2019

The design with 8x(64b@2x) coefficients is not yet compiled, I just reported a screen capture of your applet while the compilation is on.

This is just an informative report showing that for now I have similar figures as yours.

To avoid overflow, I found that the speed limit is a period of 83950 clock ticks, which is ~1488 Hz for the 125 MHz clock.
On your machine you seem to achieve 1600Hz.
As soon as I have the Shading applet compiled, I will report the figures here.
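
The conversion between the period and the frame rate can be sketched as follows. The 125 MHz clock is the one stated in the post; the helper names are illustrative.

```python
import math

CLOCK_HZ = 125_000_000  # 125 MHz clock, as stated in the post

def period_to_hz(period_ticks, clock_hz=CLOCK_HZ):
    """Frame rate obtained with one frame every `period_ticks` clock ticks."""
    return clock_hz / period_ticks

def hz_to_period(rate_hz, clock_hz=CLOCK_HZ):
    """Period in clock ticks, rounded up so that the rate is not exceeded."""
    return math.ceil(clock_hz / rate_hz)

print(round(period_to_hz(83950), 2))  # 1488.98
print(hz_to_period(1600))             # 78125
```

So reaching 1600 Hz would correspond to a period of 78125 ticks, against the 83950 ticks needed here to stay free of overflow.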

QuadCXP_Shading_BRudde_FakeCAM2.JPG

Pierre Chatelier · May 21st 2019

Success !

Setting the CoefficientBuffers to 8x(64b@2x) is OK. Now I have my 50 fps. (and I have already made and tested the program to split my Coefficient .tif files into 128b block parts).

But I still don't understand why SyncToMax is needed instead of SyncToMin since all image sizes are identical.

Some other compilations are still going on, I will update the thread with other results to make an exhaustive report.

CXP-Mono8-5120x5120+shading+bpr+gpi23.jpg

B.Ru · May 21st 2019

Hi Pierre,

Congratulations, that is fantastic!

Quote from Pierre Chatelier

Success !

Setting the CoefficientBuffers to 8x(64b@2x) is OK. Now I have my 50 fps. (and I have already made and tested the program to split my Coefficient .tif files into 128b block parts).

But I still don't understand why SyncToMax is needed instead of SyncToMin since all image sizes are identical.

SyncToMin / SyncToMax is identical in case of images of same size, but:

If one of the image sources delivers a larger image, the output link with SyncToMin only carries the minimum image dimension required, so the mismatch goes unnoticed. That is how I ran into my testing issue yesterday: I did not see that the CoefficientImage was too large...

So I always prefer SyncToMax in case of identical sizes...

Pierre Chatelier · May 21st 2019

Thanks for your help.
I summarize here the different information from this thread :

-The current version of the Appendix.Device Resources states:

"Due to the shared bandwidth architecture, the applet developer should utilize all 256 bits of the operator’s memory interface (RAM Data Width)"

But the "256" here is just a specific case, the real "RAM Data Width" may be different (and found in the same documentation for each board model type), and it is 512 for the MA-VCX-QP. (that's why I tried 64b@4x at first instead of 64b@8x for the output of the CoefficientBuffers)
I suggest that you modify a little that sentence of the doc.

-For a board using shared memory, the doc does say that all RAM operators should use the same data width for performance, but I did not realize that this implied "artificially" parallelizing my 8b@32x camera stream up to 8b@64x for the InfiniteSource ImageBuffer storage. It is perfectly logical in hindsight, but not obvious at first. You may also want to stress that point in the documentation.

-The current version of CoefficientBuffer (VA 3.1.2) has technical limitations that make it tricky to use at full bandwidth. Your discussion thread "CoefficientBuffer: Maximum performance..." is really important.
I suggest that you include it in the VA documentation (or work on a new, less tricky, CoefficientBuffer operator!)

-In the shared memory model, no performance gain will be observed by splitting a CoefficientBuffer into several CoefficientBuffer operators, as long as the full data width is properly used.

B.Ru · May 21st 2019

That is an excellent summary of this thread, and it has already been forwarded in order to improve our documentation!

Thank you for your suggestions.

Pierre Chatelier · May 22nd 2019

I have got an additional question.

If I want to set an ROI, I dynamically set the width and height of the:

-camera Genicam parameters (mandatory)

-applet ImageBuffer width and height (optional, but better bandwidth)

-applet SelectROI width and height (optional, but better bandwidth)

-applet CoefficientBuffer XLength and YLength (and offset) (required for Shading to be meaningful)

and it kind of works, but...

Regarding the bandwidth, since the "real" BufferWidth and BufferHeight of the CoefficientBuffer operator are static and set to 5120x5120 (actually 320x5120 because of the tricky use of the width when using 4x(64b@2x) per file), I have observed that my fps is limited by that hard-coded "full frame" dimensions.

For instance, in 3520x2000, while it should be ~250fps (that is what I observe in an applet without shading), I am limited to 75fps here.
I think that it would work if I built a specific version of the applet with a CoefficientBuffer operator adapted to 3520x2000 (i.e. 220x2000), but it is not very handy.
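
For reference, the buffer widths quoted above follow from a simple division, sketched below. The factor of 16 is an assumption inferred from the two figures in this post (5120 -> 320 and 3520 -> 220); it depends on how the coefficients are packed, and the function name is illustrative.

```python
# Buffer width as used in this post: image width divided by 16.
# ASSUMPTION: the factor 16 is inferred from the figures quoted in the post
# (5120 -> 320 and 3520 -> 220); it depends on the coefficient packing.

PIXELS_PER_WORD = 16

def coefficient_buffer_width(image_width, pixels_per_word=PIXELS_PER_WORD):
    """Width of the coefficient buffer for a given image width."""
    if image_width % pixels_per_word:
        raise ValueError("image width must be a multiple of pixels_per_word")
    return image_width // pixels_per_word

print(coefficient_buffer_width(5120))  # 320
print(coefficient_buffer_width(3520))  # 220

# How much more coefficient data a full-frame readout moves than the ROI needs.
print(round((5120 * 5120) / (3520 * 2000), 2))  # 3.72
```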

Am I right in my investigation ? Is there another solution to use a customizable ROI at higher FPS when using a CoefficientBuffer operator ?

B.Ru · May 27th 2019

Hi Pierre,

When using the CoefficientBuffer, please make sure that only the required image dimension is read out of it before synchronizing the coefficients with the real image data. If more data is read than required, bandwidth is "wasted".

You need to write a 1 into its parameter UpdateROI to apply the changes during runtime.

In case you want to optimize or simplify the parameter access and the configuration itself feel free to think of using the Parameter Library. By this you can implement a single parameter that handles all width/height for example.

Only the maximum buffer width and height are static. The ROI values can be either static or dynamic, depending on your preset at design time. If you set them to dynamic, they can be changed during runtime. You can then apply the required ROI changes at runtime and reach the requested bandwidth without rebuilding.

B.Ru · May 27th 2019

Below you can see where to select the Static/Dynamic parameter type.

pasted-from-clipboard.png

Pierre Chatelier · May 27th 2019

But unlike LoadCoefficients, "UpdateROI" cannot be set to 0, so it is always set at 1.

I think that here again, this is a documentation problem :

I have just checked that the CoefficientBuffer ROI does indeed seem to be taken into account (regarding the fps) after writing 1 into UpdateROI *even if the value is already 1*

Can you confirm?

If so, I also strongly suggest that the documentation of UpdateROI be updated to explain that!

[edit]

For instance, in my own GUI, since UpdateROI has only a single possible value, no GUI event is raised if I rewrite "1" in the associated NumericUpDown control, and thus I do not propagate to the Sgc_setIntegerValue(). That's why I got tricked !

It means that I should call Sgc_setxxxValue() even if the new value is the same as the current one.
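
The pitfall can be illustrated with a minimal sketch of a GUI binding that always forwards writes for trigger-like parameters. Everything here is hypothetical: `write_param` stands in for whatever SDK call actually sets the parameter, and the class and its names are not part of any SDK.

```python
# Sketch: a GUI binding that can forward a parameter write even when the value
# is unchanged. `write_param` is a HYPOTHETICAL stand-in for the real SDK call.

class ParamBinding:
    def __init__(self, name, write_param, always_write=False):
        self.name = name
        self.write_param = write_param
        self.always_write = always_write  # True for trigger-like parameters
        self.value = None

    def set(self, value):
        """Forward the write unless it is redundant and not a trigger."""
        if value == self.value and not self.always_write:
            return False  # typical "value unchanged, no event" GUI behaviour
        self.write_param(self.name, value)
        self.value = value
        return True

calls = []

def record(name, value):
    """Fake SDK setter that only records what would have been written."""
    calls.append((name, value))

width = ParamBinding("XLength", record)
update = ParamBinding("UpdateROI", record, always_write=True)

width.set(220)
width.set(220)   # redundant write is dropped
update.set(1)
update.set(1)    # still forwarded, so the ROI update is triggered again

print(calls)
# [('XLength', 220), ('UpdateROI', 1), ('UpdateROI', 1)]
```

Marking UpdateROI as a trigger rather than relying on change events keeps the usual "skip redundant writes" behaviour for ordinary parameters.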

B.Ru · May 28th 2019

Hi Pierre,

Thank you for your input.

Quote from Pierre Chatelier

I think that here again, this is a documentation problem :

I have just checked that the CoefficientBuffer ROI does indeed seem to be taken into account (regarding the fps) after writing 1 into UpdateROI *even if the value is already 1*

Can you confirm?

If so, I also strongly suggest that the documentation of UpdateROI be updated to explain that!

I forwarded your hint concerning the CoefficientBuffer in order to improve and extend our documentation.