RAM data width confirm with mE5_MA_VCX-QP and mE5VQ8-CXP6

  • Hi Sir


    I use 3 RAMs in VA design(mE5_MA_VCX-QP_Dual.va)

    RAM operator name   Bit depth   Parallelism
    RAM1                8           8
    RAM2                9           8
    ffc_factor          64          1



    The microEnable 5 marathon platform (mE5_VCX-QP) uses a shared memory concept.

    So the combined data width of these RAMs needs to be more than 64*1 (ffc_factor) + 8*8 (RAM1) + 8*9 (RAM2) = 200 bit.

    I then increase the parallelism to 32 before RAM1 and RAM2, and set the ffc_factor parallelism to 4.

    The RAMs will then share 256 bit, so all of them have enough bandwidth. Is that right?


    Suppose I move this design to the mE5VQ8-CXP6B. The mE5VQ8-CXP6B does not have a shared memory concept.

    The RAMs have independent data widths. So I would not need to change the parallelism to 32, and all RAMs would still have enough bandwidth?



    This design uses 96% of the LUTs on the mE5_MA_VCX-QP. We may need to add to the applet in the future.

    I tried to reduce the parallelism to 4 and change the board frequency to 250 MHz (mE5_MA_VCX-QP_Dual_250MHz.va), but compilation fails (CmopileError.PNG).

    If I want to increase the board frequency, is there anything in the design that needs special attention?



    Thanks.


    Jesse

  • Hello Jesse


    Thank you for your post. You have obtained a very good understanding of VisualApplets.

    The microEnable 5 marathon platform (mE5_VCX-QP) uses a shared memory concept.

    So the combined data width of these RAMs needs to be more than 64*1 (ffc_factor) + 8*8 (RAM1) + 8*9 (RAM2) = 200 bit.

    I then increase the parallelism to 32 before RAM1 and RAM2, and set the ffc_factor parallelism to 4.

    The RAMs will then share 256 bit, so all of them have enough bandwidth. Is that right?

    You are using an mE5-MA-VCX-QP. The total bandwidth of this platform is 12.8 GB/s. To use the full bandwidth, you need to use all 512 bit = 64 byte of each DRAM.

    Now let's have a look at your configuration:

    - RAM1:

    Required 1200 MPixel/s (Max CXP6x2 Speed)

    Use parallelism = 32

    --> 1200 MPixel/s / 32 * 64 byte * 2 = 4.8 GByte/s used in RAM1 (* 2 because of read and write)

    If you use parallelism = 64 instead, you will only use 2.4 GByte/s in this RAM.

    - RAM2:

    Same as RAM1: 4.8 GByte/s. Because you are using 9 bit per pixel, you cannot use parallelism 64 (64 * 9 bit = 576 bit, which exceeds the 512 bit DRAM width).

    - ffc_factor:

    Required: 1200 MPixel/s. Because of 16Bit --> 2400 MB/s
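
The per-buffer figures above all follow the same arithmetic. As a rough Python sketch: it assumes that every transfer of `parallelism` pixels occupies one full 512 bit DRAM word, and that the RAM2 limit comes from its 9 bit pixels. The function names are purely illustrative and are not part of VisualApplets.

```python
DRAM_WORD_BITS = 512  # DRAM data width on mE5-MA-VCX-QP, as stated above


def buffer_bandwidth_mb_s(pixel_rate_mpix_s, parallelism,
                          word_bits=DRAM_WORD_BITS, accesses=2):
    """DRAM bandwidth in MB/s consumed by one buffer.

    Assumption: each group of `parallelism` pixels occupies one full DRAM
    word. `accesses=2` counts one write plus one read per word.
    """
    words_per_second_m = pixel_rate_mpix_s / parallelism
    return words_per_second_m * (word_bits // 8) * accesses


def max_parallelism(bit_depth, word_bits=DRAM_WORD_BITS):
    """Largest number of pixels of `bit_depth` bits that fit in one word."""
    return word_bits // bit_depth


print(buffer_bandwidth_mb_s(1200, 32))  # RAM1 at parallelism 32
print(buffer_bandwidth_mb_s(1200, 64))  # RAM1 at parallelism 64
print(max_parallelism(8), max_parallelism(9))
```

Under these assumptions an 8 bit link fits 64 pixels per word, while a 9 bit link fits at most 56, which is why parallelism 64 is not available for RAM2.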


    Unfortunately, the CoefficientBuffer operator is inefficient in a configuration with only one link. See the explanation in this post: "CoefficientBuffer: Maximum memory size and bandwidth on marathon frame grabbers"

    [Image: pasted-from-clipboard.png]


    So you need to change it to a configuration with 8 output links:


    Therefore the total required RAM bandwidth is

    RAM1: 2400MB/s (at parallelism 64)

    RAM2: 4800MB/s

    ffc_factor: 2400 MB/s


    Total = 9600 MB/s, which is less than the theoretical maximum of 12800 MB/s. Therefore the memory bandwidth is sufficient.
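
The budget check can be written down in a few lines of Python. The numbers are taken directly from the list above; the dictionary layout and variable names are only an illustration.

```python
PLATFORM_MAX_MB_S = 12800  # theoretical maximum quoted for mE5-MA-VCX-QP

required_mb_s = {
    "RAM1": 2400,        # at parallelism 64
    "RAM2": 4800,        # same configuration as RAM1 at parallelism 32
    "ffc_factor": 2400,  # 16 bit data, configuration with 8 output links
}

total = sum(required_mb_s.values())
utilization = total / PLATFORM_MAX_MB_S
headroom = PLATFORM_MAX_MB_S - total

print(f"total = {total} MB/s ({utilization:.0%} of the theoretical maximum)")
print(f"headroom = {headroom} MB/s")
```

Re-running the check after any change of parallelism or bit depth shows quickly whether the design still fits within the theoretical limit.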


    Suppose I move this design to the mE5VQ8-CXP6B. The mE5VQ8-CXP6B does not have a shared memory concept.

    The RAMs have independent data widths. So I would not need to change the parallelism to 32, and all RAMs would still have enough bandwidth?

    On the mE5VQ8-CXP6D you also have a bandwidth of 3.2 GB/s for each of the four individual DRAMs, but the data width is only 128 bit.

    RAM1: Parallelism 16

    RAM2: Parallelism 16 -> you will need to use two ImageBuffer operators in parallel

    ffc_factor: 4 outputs at parallelism 2
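
The three recommendations above are consistent with fitting each link into a 128 bit DRAM word. The following Python sketch assumes that this width limit is the reason RAM2 needs two ImageBuffer operators; the helper name is hypothetical.

```python
import math

DRAM_WORD_BITS = 128  # per-DRAM data width on mE5VQ8-CXP6D, as stated above


def operators_needed(bit_depth, parallelism, word_bits=DRAM_WORD_BITS):
    """Number of parallel buffer operators needed to carry one link.

    Assumption: one operator can store at most `word_bits` per transfer.
    """
    return math.ceil(bit_depth * parallelism / word_bits)


print(operators_needed(8, 16))  # RAM1: 8 bit * 16 = 128 bit
print(operators_needed(9, 16))  # RAM2: 9 bit * 16 = 144 bit
print(4 * 2 * 16)               # ffc_factor: 4 outputs * parallelism 2 * 16 bit
```

RAM1 and the ffc_factor configuration each fill exactly 128 bit, whereas the 144 bit RAM2 link has to be split across two operators.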


    This design uses 96% of the LUTs on the mE5_MA_VCX-QP. We may need to add to the applet in the future.

    I tried to reduce the parallelism to 4 and change the board frequency to 250 MHz (mE5_MA_VCX-QP_Dual_250MHz.va), but compilation fails (CmopileError.PNG).

    If I want to increase the board frequency, is there anything in the design that needs special attention?

    You can change the FPGA clock for marathon frame grabbers, but we cannot guarantee that you will meet the timing requirements of the FPGA during the build process. In practice it will always work with 125 MHz. Up to 160 MHz you have a good chance of meeting the timing. Anything above that will most likely not work correctly.

    The DRAM will not get faster when you change the FPGA clock. It will only affect the speed of processing between the operators, i.e. less parallelism is required.
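
The trade-off between FPGA clock and parallelism can be estimated as follows. This is a minimal sketch assuming a link carries `parallelism` pixels per FPGA clock cycle; the function name is illustrative, and the result may have to be rounded up further to a parallelism value that the operators actually support.

```python
import math


def min_parallelism(pixel_rate_mpix_s, fpga_clock_mhz):
    """Smallest link parallelism that sustains the pixel rate at this clock.

    Assumption: link throughput = parallelism * FPGA clock (pixels/s).
    """
    return math.ceil(pixel_rate_mpix_s / fpga_clock_mhz)


for clock_mhz in (125, 160):
    print(clock_mhz, "MHz ->", min_parallelism(1200, clock_mhz))
```

For the 1200 MPixel/s case, raising the clock from 125 MHz to 160 MHz lowers the minimum parallelism from 10 to 8 in this model, while the DRAM bandwidth figures stay unchanged.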


    In your case you need to save some FPGA resources. You are using 96% LUT but only a few of the embedded ALU types.

    Here are some tricks to reduce LUT and use ALU:

    1. Use ADD instead of the FIR operator, for example for the mean filter:

    [Image: pasted-from-clipboard.png]


    2. Use the same idea for the Gauss and Laplace filters:

    [Image: pasted-from-clipboard.png]


    3. Replace SCALE with CONST and a Mult operator. Mult will use ALU resources, whereas SCALE will use LUTs.


    4. Use DIV at low parallelism
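
Tricks 1 and 2 rely on the fact that kernels with small integer coefficients can be built from additions alone. The screenshots of the original designs are not available here, so the Python sketch below only illustrates the arithmetic idea in one dimension. The functions are hypothetical stand-ins and not VisualApplets operators.

```python
def fir(row, coeffs):
    """Reference 1-D FIR: multiply each neighbour by a coefficient and sum."""
    k = len(coeffs)
    return [sum(c * row[i + j] for j, c in enumerate(coeffs))
            for i in range(len(row) - k + 1)]


def box3_adds(row):
    """3-tap box sum (mean filter before normalisation), additions only."""
    return [row[i] + row[i + 1] + row[i + 2] for i in range(len(row) - 2)]


def gauss3_adds(row):
    """3-tap [1, 2, 1] kernel built from two cascaded pairwise additions."""
    pair = [row[i] + row[i + 1] for i in range(len(row) - 1)]     # kernel [1, 1]
    return [pair[i] + pair[i + 1] for i in range(len(pair) - 1)]  # [1, 1] twice = [1, 2, 1]


row = [3, 7, 1, 9, 4, 6]
# The addition-only versions match the multiply-based reference.
assert box3_adds(row) == fir(row, [1, 1, 1])
assert gauss3_adds(row) == fir(row, [1, 2, 1])
```

The same cascading idea extends to two dimensions by applying the additions first along rows and then along columns.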


    I hope my information will help you to solve this project.


    BR

    Johannes

  • Hi Jesse

    Are CoefficientBuffer's maximum memory size and bandwidth different with marathon?

    We have to say that the CoefficientBuffer is not very easy to use. It needs an update.

    However, it can be used with the full bandwidth and performance. So on the mE5VQ8-CXP6D, one operator can use 256 MiB (mebibytes, i.e. 256 * 2^20 bytes) at a theoretical speed of 3.2 GByte/s.


    Johannes Trein
    Group Leader R&D
    frame grabber

    Basler AG