Synchronization issues when flipping an image twice

Jesse Lin · Oct 6th 2018

Porj_A(mE5_MA_VCX-QP_Single_proj_I) test video [Link]

Porj_B(mE5_MA_VCX-QP_Single_proj) test video [Link]

FFC_factor image [Link]

Porj_A works at a line rate of 50000.

When I add the Segment HierarchicalBox after the EdgexFilter HierarchicalBox (Porj_B), the design no longer works at a line rate of 50000.

How can I calculate how many buffers I need to add in Porj_B for synchronization?

Thanks.

Jesse

B.Ru · Oct 8th 2018

Dear Jesse,

I looked at both of your designs.

Porj_B / mE5_MA_VCX-QP_Single_proj.va includes the segmentation H-Box.

Inside of this H-Box an additional RAM buffer is used.

You are asking for the number or size of this buffer that is required to run the design at the expected bandwidth.

In this case it is not related to the size.

You are using 3 RAM modules on a marathon VCX-QP.

The memory bandwidth inside the VCX-QP is following a shared memory concept.

http://www.siliconsoftware.de/…t/device%20resources.html

The implemented RAM operators work at:

RAM Operator name	bit depth	parallelism (shared * 4)
RAM1	8	8 ( -> 32 )
RAM3	9	8 ( -> 32 )
ffc_factor	64	1 ( -> 4 )

Since all operators together share now:

64 / 72 / 64 bit

the design will not meet the expected performance.

Simply increase the parallelism around the RAM operators to enable a higher bandwidth.

Example:

If 2 RAMs need to handle 8 bit @ parallelism of 8, connect both with double parallelism:

8 bit @ parallelism 16

Then the two RAMs would split the parallelism of 16 into 8 each.

What you need to do now:

Use parallelism of 32 for RAM1 and RAM3.

Use parallelism of 4 for ffc_factor.

Then the bandwidth would be handled correctly by the shared memory concept.

Since 3 RAMs need to share the same Pixel-rate in that design we use a factor of 4 for the parallelism to speed-up accordingly. A factor of 4 is not affecting the image dimension.

LinkBandwidth = bit-depth * pixel-clock * parallelism

Example:

500 MB/s = 8 bit * 125 MHz * 4
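As a quick check of the formula above, here is a minimal Python sketch. The function name `link_bandwidth_mb_per_s` is only illustrative and not part of VisualApplets; it assumes 1 MB = 10^6 bytes.

```python
def link_bandwidth_mb_per_s(bit_depth: int, pixel_clock_mhz: float, parallelism: int) -> float:
    """LinkBandwidth = bit-depth * pixel-clock * parallelism, returned in MB/s."""
    bits_per_second = bit_depth * pixel_clock_mhz * 1e6 * parallelism
    # 8 bits per byte, 10**6 bytes per MB (assumed unit convention)
    return bits_per_second / 8 / 1e6


# The example from the text: 8 bit at 125 MHz with parallelism 4
print(link_bandwidth_mb_per_s(8, 125, 4))  # 500.0
```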

In case of a shared memory RAM operator:

PARALLELup -> RAM -> PARALLELdn

Increase the parallelism before the RAM and reduce it after the RAM again.

The increase depends on the number of RAMs and their bandwidth needs.
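The speed-up used in this thread (3 RAMs, factor of 4; 2 RAMs, factor of 2) can be sketched as below. The rule "smallest power of two that is at least the number of RAMs" is an assumption that merely matches these two examples, not a documented VisualApplets rule, and `speed_up_factor` is a hypothetical helper name.

```python
def speed_up_factor(num_rams: int) -> int:
    """Smallest power of two >= num_rams (assumed rule: 2 RAMs -> 2, 3 RAMs -> 4)."""
    factor = 1
    while factor < num_rams:
        factor *= 2
    return factor


# Original parallelism at each RAM operator in this design
rams = {"RAM1": 8, "RAM3": 8, "ffc_factor": 1}
factor = speed_up_factor(len(rams))

# Parallelism to set with PARALLELup before each RAM (and undo with PARALLELdn after)
sped_up = {name: parallelism * factor for name, parallelism in rams.items()}
for name, parallelism in sped_up.items():
    print(f"{name}: {rams[name]} -> {parallelism}")
```

With the three RAMs of this design, this reproduces the values given above: 32 for RAM1 and RAM3, and 4 for ffc_factor.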

The correspondingly changed VA design is attached here:

mE5_MA_VCX-QP_Single_Porj_AB_SpeedUp_BRudde.va

Best regards,

Jesse Lin · Oct 9th 2018

Dear Bjorn

I am not quite sure about the structure of the shared memory concept.

Quote

You are using 3 RAM modules on a marathon VCX-QP.

The memory bandwidth inside the VCX-QP is following a shared memory concept.

http://www.siliconsoftware.de/…t/device%20resources.html

The implemented RAM operators work at:

RAM Operator name	bit depth	parallelism (shared * 4)
RAM1	8	8 ( -> 32 )
RAM3	9	8 ( -> 32 )
ffc_factor	64	1 ( -> 4 )

Since all operators together share now:

64 / 72 / 64 bit

the design will not meet the expected performance.


The Shared Memory Concept section states:

When a design utilizes all 4 RAM resources, each of the 4 RAM based operators can have up to 1.6 GB/s exclusive bandwidth, minus the efficiency factor of that particular operator.

So a RAM's maximum bandwidth is 1.6 GB/s?

RAM Operator name	bit depth	parallelism	LinkBandwidth(125MHz)
RAM1	8	8	1 GB/s
RAM3	9	8	1.125 GB/s
ffc_factor	64	1	1 GB/s

Why will the design not meet the expected performance?

Quote

What you need to do now:

Use parallelism of 32 for RAM1 and RAM3.

Use parallelism of 4 for ffc_factor.

Then the bandwidth would be handled correctly by the shared memory concept.

Since 3 RAMs need to share the same Pixel-rate in that design we use a factor of 4 for the parallelism to speed-up accordingly. A factor of 4 is not affecting the image dimension.


The Shared Memory Concept section also states:

Due to the shared bandwidth architecture, the applet developer should utilize all 256 bits of the operator’s memory interface (RAM Data Width) to achieve maximal throughput through the memory interface when using multiple RAM based operators even though the single RAM operator needs less bandwidth on its input.

RAM Operator name	bit depth	parallelism	bandwidth
RAM1	8	32	256 bits
RAM3	9	32	288 bits
ffc_factor	64	4	256 bits

mE5 marathon VCX-QP maximum RAM Data Width is 512 bits.

So should I set each RAM's bandwidth to 256 bits?

Thank you.

Jesse

Jesse Lin · Oct 9th 2018

Dear Bjorn

I tried to reduce the image height to 512 and changed the following parameters in mE5_MA_VCX-QP_Single_Porj_AB_SpeedUp_BRudde.va:

module	Parameter Name	Value
Process0/Capture/TrgBoxLine	YLength	512
Process0/EdgexFilter/RAM1	YLength	512
Process0/EdgexFilter/projection_v/get_last_line/value	Number	511
Process0/EdgexFilter/ffc_factor	YLength	512
Process0/EdgexFilter/RAM3	YLength	512
Process0/Segment/get_last_line/value	Number	511
Process0/DMA_Source	Height	512
Process0/DMA_Filter	Height	512

The output data is not synchronized. Test video [Link]

Can this design reduce the image height dynamically? [Height range: 512~1024]

The attachment is the final version in the project.

The line rate will be up to 76923.

Thanks.

Jesse

Jesse Lin · Oct 11th 2018

Dear Bjorn

I tried modifying the ffc_factor (RAM) buffer height to 512 and loading an image with a height of 512.

Then I synchronized the ffc_factor height to the maximum. Now the image height can be modified dynamically!

But I needed many attempts to get this result. That takes a lot of time, because a single build takes about an hour.

This is why I want to ask how to calculate this for the design.

Jesse

B.Ru · Oct 12th 2018

Hi Jesse,

To make things easier to understand:

The mE5-MA-VCX-QP is using a 512 bit wide shared memory concept.

This supports up to 12.8 GB/s for the mentioned platform.

(VA documentation, Appendix: Resource table: RAM Bandwidth total (shared))

If you use the maximum possible link width around all RAM operators, the bandwidth will be at its maximum:

parallelism * bit-depth <= 512 bit

You can try this once and check if everything is fine.

Then you do not need to do a re-synthesis.
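To avoid trial-and-error builds, the link width can be checked on paper first. The sketch below reads the relation above as "parallelism * bit-depth should fill, but not exceed, the 512-bit RAM data width". Restricting the parallelism to powers of two is an assumption made for this illustration, and the helper names are hypothetical.

```python
RAM_DATA_WIDTH_BITS = 512  # shared RAM data width of the mE5-MA-VCX-QP, as stated above


def link_width_bits(bit_depth: int, parallelism: int) -> int:
    """Link width around a RAM operator in bits."""
    return bit_depth * parallelism


def max_parallelism(bit_depth: int, ram_width_bits: int = RAM_DATA_WIDTH_BITS) -> int:
    """Largest power-of-two parallelism whose link width still fits (assumed restriction)."""
    parallelism = 1
    while link_width_bits(bit_depth, parallelism * 2) <= ram_width_bits:
        parallelism *= 2
    return parallelism


# Bit depths of the three RAM operators discussed in this thread
for name, bit_depth in [("RAM1", 8), ("RAM3", 9), ("ffc_factor", 64)]:
    p = max_parallelism(bit_depth)
    print(f"{name}: parallelism {p}, link width {link_width_bits(bit_depth, p)} bit")
```

Whether a given operator accepts the resulting parallelism still has to be checked in the design itself.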

Please consider that additional factors may influence your system bandwidth:

- Limited DMA performance due to mainboard-specification

- There is load balancing between the memory buffers:

  - Write has higher priority than read.

  - All RAMs have the same priority.