VQ8-CXP6D-DepthFromFocus

IhShin · Jun 25th 2020

Hi.

I am designing a DepthFromFocus example for the VQ8-CXP6D, but the designed applet shows a buffer overflow in microDisplay. The maximum frame rate of the camera is 239 fps, but the overflow occurs at 96 fps or higher.

Please check the problem in the designed applet.

* Test environment

- Camera: Adimec S-25A70

- Frame Grabber: microEnable 5 VQ8-CXP6D

- Image ROI: 2000x2000 @8Bit(mono8)

Thanks.

B.Ru · Jun 25th 2020

Dear IhShin,

I will look into the VA design now and try to identify the bandwidth issue.

Best regards,

B.Ru · Jun 25th 2020

Dear IhShin,

In your design I can see that you are using the camera with 2 RAMs on the acquisition side.

There you have chosen parallelism 32, whereas the processing itself is targeting a parallelism of 8.

Please check if your design works fine with a lower frame rate.

For example 100 Hz:

(2000 * 2000 * (100 Hz)) / (8 * (125 MHz)) = 40 %

Step by step you can check if the applet runs fine.

At a certain frame rate you will start observing a buffer overflow.
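The utilization estimate above can be sketched in a few lines of Python. This is only a rough model, assuming the processing path consumes one pixel per parallel lane per design clock cycle; the function names are made up for illustration and are not part of VisualApplets.

```python
# Sketch: utilization of the processing path for a given frame rate.
# Assumption: one pixel per parallel lane per design clock cycle.

DESIGN_CLOCK_HZ = 125e6  # base design clock of the mE5 VQ8-CXP6D


def utilization(width, height, fps, parallelism, clock_hz=DESIGN_CLOCK_HZ):
    """Fraction of the available pixel throughput that the stream needs."""
    pixels_per_second = width * height * fps
    return pixels_per_second / (parallelism * clock_hz)


def max_fps(width, height, parallelism, clock_hz=DESIGN_CLOCK_HZ):
    """Frame rate at which the utilization reaches 100 %."""
    return parallelism * clock_hz / (width * height)


if __name__ == "__main__":
    # Step through frame rates the same way you would on the real system.
    for fps in (50, 96, 100, 150, 200, 239):
        u = utilization(2000, 2000, fps, parallelism=8)
        print(f"{fps:>3} Hz -> {u:.1%}")
    print("limit at parallelism 8:", max_fps(2000, 2000, 8), "Hz")
```

Under this model a 2000 x 2000 image at parallelism 8 stays below 100 % up to 250 Hz, so an overflow at a much lower frame rate points to a bottleneck somewhere else.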

You can use the same RAM approach as in the acquisition to increase the bandwidth of the EDoF calculation.

That approach will use 2 RAM modules.

Please make sure that you utilize the maximum possible parallelism for each image buffer (RAM) module to get the maximum possible bandwidth.

Best regards,

B.Ru · Jun 25th 2020

Dear IhShin,

Please let me know if the mentioned approach is useful for you and what the target bandwidth looks like:

Width * Height * BitPerPixel * Frequency
2000 * 2000 * 8 bit * 100 Hz = 400 MB/s

That will define the required parallelism representing the bandwidth.
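As a rough sketch, the bandwidth formula and the resulting parallelism can be computed like this. The helper names are hypothetical, and the rounding to a power of two is an assumption about how the parallelism is usually chosen, not something stated in this thread.

```python
import math

DESIGN_CLOCK_HZ = 125e6  # base design clock quoted for the grabber


def bandwidth_mb_per_s(width, height, bits_per_pixel, fps):
    """Image data rate in MB/s (1 MB = 10**6 bytes)."""
    return width * height * bits_per_pixel * fps / 8 / 1e6


def min_parallelism(width, height, fps, clock_hz=DESIGN_CLOCK_HZ):
    """Smallest power-of-two parallelism whose pixel rate covers the stream.

    Assumption: the parallelism is picked from powers of two.
    """
    needed = width * height * fps / clock_hz  # pixels per clock cycle
    return 1 << max(0, math.ceil(math.log2(needed)))


if __name__ == "__main__":
    for fps in (100, 239):
        print(fps, "Hz:",
              bandwidth_mb_per_s(2000, 2000, 8, fps), "MB/s,",
              "parallelism >=", min_parallelism(2000, 2000, fps))
```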

Best regards,

IhShin · Jul 1st 2020

Dear B.Ru,

When I check the operation at lower frame rates, the applet operates normally up to 96 Hz.

target bandwidth:

Width * Height * BitPerPixel * Frequency

-> 2000 x 2000 x 8 x 239 = 956 MB/s (about 912 MiB/s)

If the EDoF calculation uses the 2-RAM-module approach and the processing parallelism is changed to 16, the build does not complete, even after a long time.

Thanks.

pasted-from-clipboard.png

B.Ru · Jul 1st 2020

Dear IhShin,

Your question asks for a bandwidth of:

2000 * 2000 * (8 bit) * (239 Hz) = 956 MB / s

coming from the camera, while reporting a bandwidth limit at:

2000 * 2000 * (8 bit) * (96 Hz) = 384 MB / s

While your initial buffer is using 2 RAMs for 8 bit at parallelism 32 (receiving up to 20), the buffer approach inside the EDoF is receiving 16 bit at parallelism 32.

Looking at the RAM

Hardware Configuration microEnable 5 ironman:

Resource	mE5VQ8-CXP6B / mE5VQ8-CXP6D
Vision Processor	Xilinx Virtex6 XC6VLX240T FPGA
LUT	150720
FlipFlop	301440
Block RAM	832 x 18432 bit
Embedded Arithmetic Logic Unit (DSP48)	768
RAM	4 x 256 MiB DDR3
Data Width per RAM	128 bit
Bandwidth per RAM	4 GB/s
Base Design Clock	125 MHz
Host Interface	PCIe x8 Gen2
Host Interface (PCIe x8 Gen2) Bandwidth (theor.)	4 GB/s per direction on PCIe bus
Host Interface (PCIe x8 Gen2) Bandwidth (typ./max.)	up to 3.6 GB/s on PCIe bus

Table 55. Hardware Configuration microEnable 5 ironman (Source)

Let's look into the details:

To get the maximum performance of each RAM module, we have to use the full data width:

128 bit

Using the full interface means that all 128 bit are occupied (128 bit = bit depth * parallelism). This gives 4 GB/s of bandwidth per RAM:

4 GB/s = (1 + 1) * 2 GB/s, one part for writing the input and one for reading the output.

Above we ended up at a bandwidth of

956 MB/s

which here equals 956 MPixel/s. With each pixel having 16 bit inside the EDoF, this becomes 2000 * 2000 * (16 bit) * (239 Hz) = 1912 MB/s.

Explanation: 16 bit consist of 2 intermediate values per pixel...

Two RAM blocks are used within the EDoF:

pasted-from-clipboard.png

Each RAM in here uses 4 bit at parallelism 32: 4 bit * 32 = 128 bit.
The same choice applies to the acquisition RAM modules.

So RAM interfacing in EDoF is fine.
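The RAM budget can be cross-checked with a small sketch. It assumes that one full-width 128-bit word is moved per design clock cycle in each direction, which reproduces the 2 GB/s per direction quoted above, and that the stream is split evenly across the RAM modules. DDR3 timing is not modelled, and the function names are illustrative only.

```python
def ram_port_mb_per_s(data_width_bits=128, clock_hz=125e6):
    """One direction (write or read) of a single RAM module, in MB/s.

    Assumption: one full-width word is transferred per design clock cycle.
    """
    return data_width_bits / 8 * clock_hz / 1e6


def stream_mb_per_s(width, height, bits_per_pixel, fps):
    """Data rate of an image stream in MB/s (1 MB = 10**6 bytes)."""
    return width * height * bits_per_pixel * fps / 8 / 1e6


def ram_load(width, height, bits_per_pixel, fps, ram_modules):
    """Fraction of one RAM port used when the stream is split evenly."""
    per_module = stream_mb_per_s(width, height, bits_per_pixel, fps) / ram_modules
    return per_module / ram_port_mb_per_s()


if __name__ == "__main__":
    print(ram_port_mb_per_s())                   # write or read budget per RAM
    print(stream_mb_per_s(2000, 2000, 16, 239))  # intermediate EDoF stream
    print(f"{ram_load(2000, 2000, 16, 239, ram_modules=2):.1%}")
```

With two RAM modules the 16-bit intermediate stream at 239 Hz uses a bit less than half of each port in this model, which is in line with the conclusion that the RAM interfacing is not the limit.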

Since your intended bandwidth of 956 MPixel/s is less than parallelism * system clock = 8 * 125 MHz = 1 GPixel/s, the selected parallelism of 16 is a safe choice. There is no problem with this either.

From my point of view there is no issue within the VA design.

That is good to know, but it is not yet a solution or an answer to your question.

There are two other details we have to look at now:

DMA-performance and camera interface CXP.

I guess and hope that the DMA performance of the ironman grabber in your system is not limited, but please double check that.

A limited PCIe performance could be a reason for the overflow, because it would propagate stops into the design's data flow.

The ironman provides PCIe x8 Gen2 with 4 GB/s in theory and 3.6 GB/s in practice, but it is possible that the mainboard's PCIe slot only supports Gen1 and/or fewer lanes than x8.

One PCIe lane at Gen1 would deliver 250 MB/s in theory and close to 200 MB/s in practice.

One PCIe lane at Gen2 would deliver 500 MB/s in theory and close to 400 MB/s in practice.

Your design is correctly configured for PCIe x8 Gen2:
pasted-from-clipboard.png

This is shown in the Applet Properties operator.

You reported:

2000 * 2000 * (8 bit) * (96 Hz) = 384 MB / s

and 1 of 50 images (2 %) is the additional output bandwidth of the second DMA:

(2000 * 2000 * (8 bit) * (96 Hz)) * (1 + (1 / 50)) = 391.68 MB / s
That is pretty close to 400 MB/s, being an indicator for 1 PCIe lane at Gen2.
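The comparison against the PCIe lane figures can be sketched as follows. The per-lane values are the practical numbers quoted in this thread, and scaling them linearly with the lane count is a simplifying assumption (the x8 Gen2 figure of 3.6 GB/s shows that real links do not scale exactly this way). All names are illustrative.

```python
# Practical per-lane throughput as quoted in this thread, in MB/s.
PRACTICAL_MB_PER_LANE = {"Gen1": 200, "Gen2": 400}


def dma_load_mb_per_s(width, height, bits_per_pixel, fps, extra_fraction=0.0):
    """Image data rate over DMA in MB/s, plus a fractional second stream."""
    base = width * height * bits_per_pixel * fps / 8 / 1e6
    return base * (1 + extra_fraction)


def matching_links(load, tolerance=0.05, lanes=(1, 2, 4, 8)):
    """Link configurations whose estimated budget is within `tolerance` of load.

    Assumption: throughput scales linearly with the lane count.
    """
    matches = []
    for gen, per_lane in PRACTICAL_MB_PER_LANE.items():
        for n in lanes:
            budget = per_lane * n
            if abs(budget - load) / budget <= tolerance:
                matches.append((gen, n))
    return matches


if __name__ == "__main__":
    load = dma_load_mb_per_s(2000, 2000, 8, 96, extra_fraction=1 / 50)
    print(round(load, 2), matching_links(load))
```

In this simple model two Gen1 lanes land on the same 400 MB/s figure as one Gen2 lane, so the DMA performance test is still needed to tell which link the slot actually negotiates.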

In microDiagnostics you can double-check the possible bandwidth of the grabber:

Linked here you have a description on how to perform this.

Output will look like:

The test is carried out for the applet that is available (flashed) on the selected frame grabber.

Since VA designs are not fully supported, please flash the related acquisition applet.

In your case it will be: Acq_QuadCXP6x1AreaGray8.dll

In the performance diagram for the Acq_* applet at FG_GRAY (8 bit per pixel), you need to see a peak at or above 1000 MB/s at 2048 on the X axis to reach your target bandwidth.

What to do if DMA performance is fine?

The camera may be a second external issue.

In practice you are using CXP and the used operator supports 4 CXP6 links.

That means up to 6.25 Gbit/s per link.

Due to 8b/10b encoding this represents per link:

(8 / 10) * (6.25 (Gbit / s)) = 625 MB / s

For a quad link this would be:

((4 * 8) / 10) * (6.25 (Gbit / s)) = 2500 MB / s

Since the interface is protocol based you will not get the full bandwidth for image data.

But at least it is the possible peak performance for CXP6.

There are more options being supported by the VA design:

CXP-1	1.25 Gbit/s	up to 212 m
CXP-2	2.5 Gbit/s	up to 185 m
CXP-3	3.125 Gbit/s	up to 169 m
CXP-5	5 Gbit/s	up to 102 m
CXP-6	6.25 Gbit/s	up to 60 m
~~CXP-10~~	~~10 Gbit/s~~	~~up to 40 m~~
~~CXP-12~~	~~12.5 Gbit/s~~	~~up to 30 m~~

Source of table: Wikipedia

Some math again:

((4 * 8) / 10) * (1.25 (Gbit / s)) = 500 MB / s, where this peak bandwidth is very likely transporting 400 MB/s of image data.

Same for this configuration:

((1 * 8) / 10) * (5 (Gbit / s)) = 500 MB / s
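The CXP math above generalizes to any speed and link count. The following sketch uses the line rates from the table above and only accounts for 8b/10b encoding, so protocol overhead still has to be subtracted from the result. The function names are hypothetical.

```python
# Line rates per link in Gbit/s, taken from the table above.
CXP_LINE_RATE_GBIT = {
    "CXP-1": 1.25, "CXP-2": 2.5, "CXP-3": 3.125, "CXP-5": 5.0, "CXP-6": 6.25,
}


def cxp_peak_mb_per_s(speed, links):
    """Peak bandwidth in MB/s after 8b/10b encoding, before protocol overhead."""
    payload_bits = CXP_LINE_RATE_GBIT[speed] * 1e9 * 8 / 10
    return links * payload_bits / 8 / 1e6


def topologies_below(target_mb_per_s, max_links=4):
    """All (speed, links) combinations whose peak is below the target."""
    return [
        (speed, n)
        for speed in CXP_LINE_RATE_GBIT
        for n in range(1, max_links + 1)
        if cxp_peak_mb_per_s(speed, n) < target_mb_per_s
    ]


if __name__ == "__main__":
    print(cxp_peak_mb_per_s("CXP-6", 4))  # quad CXP-6
    print(cxp_peak_mb_per_s("CXP-1", 4))  # quad CXP-1
    print(topologies_below(956))          # cannot carry 2000 x 2000 @ 239 Hz
```

Any topology returned by `topologies_below(956)` cannot carry the full 239 Hz stream, which is why the negotiated link speed and link count are worth checking.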

To end this already pretty long story:

Check the CXP configuration using the hardware dialog in GenICam Explorer or microDisplayX.

The link topology will tell you what you are using precisely for CXP.

This links to the documentation of the link topology dialog. The screenshots there are for GEV, but the steps are the same for CXP.

The End is a Summary

Your VA design is correct, you should see the expected bandwidth.

From my perspective it is possible that the PCIe connection or CXP topology is causing this.

To me it is very likely that the PCIe Gen2 slot provides a single lane only.
Please let me know what the DMA performance test (microDiagnostics) shows...

If you need some help in interpreting your test results and relating them to the observed performance:
I and all the other people in the VA forum community will help.

Thanks and best regards,
