Posts by Johannes Trein

    Hello Jayasuriya,


    We completely revised the C# SDK interface in the current runtime versions. I recommend using the current Runtime 5.6, which includes the C# interface. See subdirectory \SDKWrapper\CSharpWrapper

    See the documentation at http://www.siliconsoftware.de/…per/doc/CSharpWrapper.pdf


    You can find the full setup at https://silicon.software/file-…-v5-6-win64-with-applets/


    Let me know if you prefer the old version instead.


    BR

    Johannes

    Hello Jesse


    Thank you for your post. You have gained a very good understanding of VisualApplets.

    The platform microEnable 5 marathon (mE5-VCX-QP) has a shared memory concept.

    So these RAMs need a data width of more than 64*1 (RAM1) + 8*8 (RAM2) + 8*9 (RAM3) = 200 bit.

    Then I increase the parallelism to 32 before RAM1 and RAM2, and set the ffc_factor parallelism to 4.

    So these RAMs will share 256 bit. Then all RAMs have enough bandwidth. Is that right?

    You are using a mE5-MA-VCX-QP. The total bandwidth of this platform is 12.8 GB/s. To use the full bandwidth you need to use the full 512 bit = 64 bytes of each DRAM.

    Now let's have a look at your configuration:

    - RAM1:

    Required: 1200 MPixel/s (max. CXP6x2 speed)

    Use parallelism = 32

    --> 1200 MP/s / 32 * 64 byte * 2 = 4.8 GByte/s used in RAM1 (*2 because of read and write)

    If you use parallelism = 64 instead, you will only use 2.4 GByte/s in this RAM.

    - RAM2:

    Same as RAM1: 4.8 GByte/s. Because you are using 8 bit per pixel here, you cannot use parallelism 64.

    - ffc_factor:

    Required: 1200 MPixel/s. Because of 16 bit per pixel --> 2400 MB/s


    Unfortunately, operator CoefficientBuffer is inefficient in a configuration with only one link. See the explanations in this post: CoefficientBuffer: Maximum memory size and bandwidth on marathon frame grabbers

    pasted-from-clipboard.png


    So you need to change it to a configuration with 8 output links:


    Therefore the total required RAM bandwidth is

    RAM1: 2400 MB/s (at parallelism 64)

    RAM2: 4800 MB/s

    ffc_factor: 2400 MB/s


    Total = 9600 MB/s, which is less than the theoretical maximum of 12800 MB/s. Therefore the memory bandwidth is sufficient.
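    If you want to redo this calculation for other configurations, here is a minimal sketch of the arithmetic in C. The helper name and structure are mine; the formula itself (pixel rate / parallelism * DRAM word size, times 2 for read and write) is just the calculation from above.

        #include <stdio.h>

        /* Bandwidth used by one memory operator on the shared 512 bit
         * (= 64 byte) DRAM interface of the mE5-MA-VCX-QP: every access
         * transfers a full DRAM word, so the used bandwidth is
         * pixel_rate / parallelism * word_bytes, doubled for buffers
         * that are both written and read. */
        static double ram_bandwidth_mb(double mpixel_per_s, int parallelism,
                                       int word_bytes, int read_and_write)
        {
            return mpixel_per_s / parallelism * word_bytes *
                   (read_and_write ? 2 : 1);
        }

        int main(void)
        {
            double ram1 = ram_bandwidth_mb(1200, 64, 64, 1); /* 2400 MB/s */
            double ram2 = ram_bandwidth_mb(1200, 32, 64, 1); /* 4800 MB/s */
            double ffc  = 1200 * 2;            /* 16 bit = 2 byte/pixel    */
            printf("total: %.0f MB/s (limit: 12800 MB/s)\n", ram1 + ram2 + ffc);
            return 0;
        }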


    What if I move this design to the mE5VQ8-CXP6B? The mE5VQ8-CXP6B does not have the shared memory concept.

    The RAMs have independent data widths. So I do not need to modify the parallelism to 32, and all RAMs have enough bandwidth?

    On the mE5VQ8-CXP6B you also have a bandwidth of 3.2 GB/s for each of the four individual DRAMs. The data width is only 128 bit.

    RAM1: Parallelism 16

    RAM2: Parallelism 16 -> you will need to use two ImageBuffer operators in parallel

    ffc_factor: 4 outputs at parallelism 2


    This design uses 96% of the LUTs on the mE5_MA_VCX-QP. I want to be able to extend the applet in the future.

    I tried to reduce the parallelism to 4 and change the board frequency to 250 MHz (mE5_MA_VCX-QP_Dual_250MHz.va), but the compilation fails (CmopileError.PNG).

    If I want to increase the board frequency, are there any design details that need attention?

    You can change the FPGA clock for marathon frame grabbers, but we cannot guarantee that you will meet the timing requirements of the FPGA during the build process. In practice it will always work at 125 MHz. Up to 160 MHz you have a good chance of meeting the timing. Anything above that will most likely not work correctly.

    The DRAM will not get faster when you change the FPGA clock. It will only affect the processing speed between the operators, i.e. less parallelism is required.
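    As a rule of thumb (my own sketch, not an official formula): the parallelism a link needs scales inversely with the design clock, because one parallel word is transported per clock cycle.

        #include <math.h>
        #include <stdio.h>

        /* Minimum parallelism needed so that a link can carry a given
         * pixel rate at a given design clock (one parallel word per
         * clock cycle). */
        static int min_parallelism(double mpixel_per_s, double clock_mhz)
        {
            return (int)ceil(mpixel_per_s / clock_mhz);
        }

        int main(void)
        {
            /* 1200 MPixel/s, e.g. full CXP6x2 speed: */
            printf("125 MHz -> parallelism >= %d\n", min_parallelism(1200, 125));
            printf("160 MHz -> parallelism >= %d\n", min_parallelism(1200, 160));
            return 0;   /* prints 10 and 8 */
        }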


    In your case you need to save some FPGA resources. You are using 96% of the LUTs but only a few of the embedded ALUs.

    Here are some tricks to reduce LUT usage and make use of the ALUs instead:

    1. Use ADD operators instead of a FIR operator, e.g. for the mean filter (see the sketch after this list):

    pasted-from-clipboard.png


    2. Use the same idea for the Gauss and Laplace filters.

    pasted-from-clipboard.png


    3. Replace SCALE by CONST and MULT operators. MULT will use ALUs, SCALE will use LUTs.


    4. Use DIV at low parallelism
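    To illustrate trick 1: with power-of-two normalization, a mean filter is just a few additions and a shift, which maps to the embedded ALUs instead of LUT-based FIR logic. A small host-side reference model of the idea (my sketch, not VA code):

        #include <stdint.h>

        /* Reference model of a 4-tap mean filter built from ADD operators:
         * sum four neighbours, then divide by 4 with a shift. A generic FIR
         * operator with arbitrary coefficients costs far more LUTs than this
         * add-and-shift structure. */
        static uint8_t mean4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
        {
            return (uint8_t)(((uint16_t)a + b + c + d) >> 2);
        }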


    I hope this information will help you with your project.


    BR

    Johannes

    Hi Mike,


    today I was able to test the applet on real frame grabber hardware. The "rotate90_simple.va" runs fast enough for your application.

    See the following screenshot.

    We get 1024 x 1224 * 47 fps = 58 MPixel/s, which is already the theoretical maximum at parallelism 1. As you only need 20 fps, the applet is much faster than your requirements.

    pasted-from-clipboard.png


    Applet "rotate90_fast.va" can be used for faster inputs like non-downsampling or Camera Link inputs. I needed to increase the parallelism after the buffer to use the fast speed. So the "fast" file above is incomplete. I will update this file for others having the same request.


    Johannes

    Hi Mike


    if your camera runs at 2448 x 2048 * 20 fps with the 12 bit packed format, the bandwidth will be 150 MByte/s, which is more than the theoretical maximum of GigE Vision.
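    For reference, the arithmetic behind that number (12 bit packed means 1.5 bytes per pixel; 1 Gbit/s corresponds to 125 MByte/s raw):

        #include <stdio.h>

        int main(void)
        {
            /* 12 bit packed: two pixels share three bytes -> 1.5 byte/pixel */
            double mb_per_s = 2448.0 * 2048 * 20 * 1.5 / 1e6; /* ~150.4 MB/s */
            printf("required: %.1f MB/s, GigE raw limit: 125 MB/s\n", mb_per_s);
            return 0;
        }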

    Anyway I made two designs which fulfill your requirements.

    You can fully simulate the design and test the maximum bandwidth in hardware using the built-in pattern generator. You will need a Silicon Software mE4-VQ4GE FPGA frame grabber together with VisualApplets for testing. The applet can be adapted to other frame grabbers but needs some modifications because of their shared DDR3 memory, compared to the DDR2 on the microEnable IV.


    Note: DRAM becomes slow for non-linear write or read access. That's why such a seemingly simple task as a 90° rotation is difficult for both FPGA frame grabbers and standard PC systems. The "fast" implementation uses all four DRAMs to increase the bandwidth.

    At the moment I have no access to frame grabber hardware, so I could not measure the resulting bandwidth. Once I have access to the hardware I will do the measurement.


    pasted-from-clipboard.png


    BR

    Johannes

    Hi Jayasuriya


    Yes, the applet you have provided is working well! Let's consider a scenario: if I generate 200 images and one of them has a pixel with value 1 at its 100th row, then in the simulation this is separated into two images of height 100 each, where the pixel with value 1 occurs only in the first image. I want to set that particular pixel value to 1 for the next 10 images, i.e. for lines 101 to 110 of the 2nd image in the simulation, but in the output I get the pixel value 1 at the 100th row only. Could you please give me a solution where this SetToSequence also works between sets of images?

    It will work correctly in hardware. In the simulation the 1D protocol is treated as 2D images. Therefore, if you want to simulate a sequence of 200 frames, you need to set SetToSequence_To1D_LinesToSimulate = 200 to get a single 1D image.


    I'd like to mention one more thing: if the number of past images is much larger than 10, or dynamic, you should consider a loop operation. Check the rolling average examples in the VisualApplets documentation for this.


    Johannes

    Hi Jayasuriya

    the attached VA design should solve your task.

    It generates a random pattern representing your input images. Next, all small images are appended into a single image of infinite height, i.e. a 1D image. Now we can check whether one of the pixels in the same column of the last 10 rows was 1 and output 1 in that case.
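    If you want to verify the simulation output on the host, here is a minimal reference model of that check (my sketch, not the VA implementation; the image width and whether the current row counts toward the window are assumptions):

        #define W   64   /* image width (example value)        */
        #define WIN 10   /* number of rows taken into account  */

        /* out[c] of the current row is 1 if any of the last WIN rows
         * (including the current one) had a 1 in column c. history[]
         * holds a per-column count of rows remaining in the window. */
        static void process_row(const unsigned char in[W], unsigned char out[W],
                                int history[W])
        {
            for (int c = 0; c < W; c++) {
                if (in[c])
                    history[c] = WIN;     /* refresh the window     */
                out[c] = history[c] > 0;
                if (history[c] > 0)
                    history[c]--;         /* window ages by one row */
            }
        }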


    To simulate the design you need to set the simulation cycles to 100, which is the same value as in SetToSequence_To1D_LinesToSimulate.


    pasted-from-clipboard.png

    pasted-from-clipboard.png


    BR

    Johannes

    Hi Mike,


    image rotation is a task that cannot be solved with a single FPGA algorithm. Factors are the bandwidth, the FPGA and board generation, as well as the exact specifications of the rotation.

    Do you need a fixed rotation by 90, 180 or 270°? Or a variable rotation by e.g. 15°? Should the rotation angle be dynamic or static?


    We have different examples included in the VisualApplets example list. See the examples in http://www.siliconsoftware.de/…ric%20Transformation.html


    These examples can be adapted to rotation-only use. Depending on the bandwidth requirements, they work well for small rotation angles.

    For 90° rotations the implementation will always depend on the image dimensions and bandwidth requirements.


    I hope my post will give you some ideas for your further work on this.


    BR

    Johannes

    Hi Simon

    welcome to the forum. Before searching for a solution based on TCL, maybe another feature will help you: did you notice the new setting for the location of simulation image files?

    pasted-from-clipboard.png


    It is available from VA 3.1.0

    Maybe that'll help you.


    Other than that, you need to build absolute paths for all simulation images when adding them to the simulation. Note that you can use regular expressions within TCL.


    Johannes

    Hi Theo,


    we have a pretty elegant solution that uses restart markers within the JPEG stream to split it across multiple operators. We can fit up to 6 JPEG encoders into a design, which results in a data rate of 6 * 300 MB/s = 1800 MB/s on a mE5-MA-VCX-QP.


    The design is quite tricky but I am sure you'll understand it. We'll send you a VA design and an SDK example. Please give us a few days to finish it.


    Johannes

    Hi Mike

    welcome to the forum and thank you for the question. Others might have this question, too. I don't have my own example right now, but let me share the example of a colleague. Maybe he can add some more information.


    Files

    • GrabToCV.zip

      (5.84 kB)

    Hi Theo

    welcome to the forum and thank you for the question. You are right: a process without a DMA channel is started immediately after loading the applet (Fg_Init function) and cannot be reset or restarted. For processes including one or more DMA channels you can stop and start the acquisition (Fg_StopAcquire(Ex) and Fg_StartAcquire(Ex)) or use the "Play" and "Stop" buttons in microDisplay. See the documentation section Processes without DMAs / Trigger Processes.


    I can only imagine two solutions:

    1st: Add a dummy DMA operator to the process. It does not need to transfer any data if you keep the timeout long enough. However, in this solution you "waste" FPGA resources on the extra DMA.

    2nd: You mentioned this solution already: add reset signals to the inputs of the signal operators. You can conveniently do this by adding a TxSignalLink operator before each of the reset inputs. However, this solution only works for signal operators and won't work for operators in the 0D, 1D or 2D domains.


    In addition, you can wait until the pipeline has run empty, or transfer signals between processes using TxSignalLink and RxSignalLink.
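    For the DMA-based processes, a stop/restart sequence in the C interface looks roughly like the sketch below. It uses Fg_stopAcquire and Fg_AcquireEx as declared in fgrab_prototyp.h; treat the exact signatures as an assumption and check them against your runtime version.

        #include "fgrab_prototyp.h"   /* Silicon Software runtime header */

        /* Sketch: restart the processing of a DMA-based process by
         * stopping and restarting the acquisition on its DMA channel. */
        int restart_process(Fg_Struct *fg, unsigned int dmaIndex, dma_mem *mem)
        {
            if (Fg_stopAcquire(fg, dmaIndex) < 0)
                return -1;
            /* ... reconfigure applet parameters here if needed ... */
            if (Fg_AcquireEx(fg, dmaIndex, GRAB_INFINITE, ACQ_STANDARD, mem) < 0)
                return -1;
            return 0;
        }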


    We have not had this request before, but I will create a feature request so that it can be analyzed by our product management.


    Johannes

    The attached VA design is a simple example of color plane separation for RGB+IR input with separated outputs. The example is made for CL Medium 4-tap 8-bit cameras sending the data interleaved, i.e. R + G + B + IR in the four taps. It works for the JAI Sweep+ series or other RGB-IR cameras on the Silicon Software mE5-MA-VCL FPGA frame grabber but can easily be adapted to others.

    The example uses the straightforward solution. There are many other options, for example an implementation using only a single DRAM operator. The best solution depends on the actual requirements.
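    As a plain-C reference for what "interleaved in the four taps" means here (my sketch; the tap order R, G, B, IR is an assumption and depends on the camera):

        /* De-interleave one CL Medium 4-tap line: the camera delivers
         * R, G, B, IR for pixel 0, then R, G, B, IR for pixel 1, ... */
        static void split_rgbi(const unsigned char *taps, int width,
                               unsigned char *r, unsigned char *g,
                               unsigned char *b, unsigned char *ir)
        {
            for (int x = 0; x < width; x++) {
                r[x]  = taps[4 * x + 0];
                g[x]  = taps[4 * x + 1];
                b[x]  = taps[4 * x + 2];
                ir[x] = taps[4 * x + 3];
            }
        }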


    In this example we stretch a histogram. The histogram is calculated for the input image. The lowest and highest populated points of the histogram are considered as black and white. The input image is shifted by the black offset and scaled by the white gain so that the histogram covers the range 0 to 255. To eliminate noise and single pixels from the calculation, a minimum-number-of-pixels threshold is used.
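    A host-side reference model of this algorithm (my sketch based on the description above; the function name, the clamping details and the exact threshold comparison are assumptions):

        #include <stddef.h>
        #include <stdint.h>

        /* Histogram stretching: find the lowest and highest grey values
         * whose histogram count exceeds min_count (the noise threshold),
         * treat them as black and white, then shift by the black offset
         * and scale by the white gain into the range 0..255. */
        static void stretch_histogram(uint8_t *img, size_t n, uint32_t min_count)
        {
            uint32_t hist[256] = {0};
            for (size_t i = 0; i < n; i++)
                hist[img[i]]++;

            int black = 0, white = 255;
            while (black < 255 && hist[black] <= min_count) black++;
            while (white > 0   && hist[white] <= min_count) white--;
            if (white <= black)
                return;                     /* degenerate histogram */

            double gain = 255.0 / (white - black);
            for (size_t i = 0; i < n; i++) {
                int v = (int)((img[i] - black) * gain + 0.5);
                img[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
            }
        }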


    H-Box MakeBadImage deliberately creates a picture with a bad histogram.


    pasted-from-clipboard.png

    Operator RemovePixel requires many resources at high parallelisms. This is because a pixel to be removed can be at any position within the parallel word, which leads to a complex implementation full of barrel shifters.


    However, in many cases the number of remaining pixels is very low. If you want to remove 90% of the pixels anyway, you can implement a two-stage solution: first, remove all parallel words in which no pixel is left; second, remove the remaining unwanted pixels.

    The output is exactly the same as with a single RemovePixel operator, but it requires far fewer resources. The only difference is that the two-stage solution will use a new output parallelism.
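    A host-side reference model of the two-stage idea (my sketch; the parallelism, pixel type and mask representation are assumptions):

        #include <stdint.h>

        #define PAR 16   /* parallelism: pixels per parallel word (example) */

        /* Stage 1 drops parallel words whose keep-mask is all zero (cheap);
         * stage 2 compacts the surviving words, so the expensive
         * barrel-shifter logic only runs on the few words that are left. */
        static int compact(const uint16_t *pix, const uint8_t *keep,
                           int n_words, uint16_t *out)
        {
            int n_out = 0;
            for (int w = 0; w < n_words; w++) {
                int any = 0;                  /* stage 1: word empty?   */
                for (int i = 0; i < PAR; i++)
                    any |= keep[w * PAR + i];
                if (!any)
                    continue;                 /* drop the whole word    */
                for (int i = 0; i < PAR; i++) /* stage 2: compact       */
                    if (keep[w * PAR + i])
                        out[n_out++] = pix[w * PAR + i];
            }
            return n_out;                     /* number of kept pixels  */
        }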


    The attached design shows a little example with simulation data.

    pasted-from-clipboard.png

    The Silicon Software AcquisitionApplets have a built-in trigger sequencer and queue.

    If trigger inputs arrive faster than the minimum allowed output period permits, pulses are queued and delayed. This distorts the timing but keeps the trigger pulses synchronized, i.e. no pulse is lost. The example is made for a Silicon Software mE5-MA-VCL frame grabber but can easily be adapted to any other FPGA frame grabber.


    The attached example has the following functions:

    - external trigger input or software trigger

    - trigger pulse multiplication 1:N

    - trigger queue

    - trigger output period limitation

    - exsync output


    pasted-from-clipboard.png

    pasted-from-clipboard.png


    The implementation was tested with a logic analyzer. The following screenshot shows an input period of 5 ms. The applet is configured for a multiplication of 2 pulses with a minimum output period of 2.5 ms.

    pasted-from-clipboard.png


    If we increase the input pulse rate, the output will be delayed and the queue will fill with pulses.

    pasted-from-clipboard.png


    Once the input pulses stop or the gaps between them become sufficiently large, the queue compensates for the delay. In the example, 5 input pulses with a period of 2.9 ms generate 10 output pulses with a period of 2.5 ms.

    pasted-from-clipboard.png
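    The queue behaviour is easy to model on the host. A minimal discrete-time sketch (mine, not the applet code) that reproduces the logic-analyzer numbers above:

        #include <stdio.h>

        #define MULT       2     /* pulses generated per input trigger */
        #define MIN_PERIOD 2.5   /* minimum output period in ms        */

        /* Every input pulse is multiplied by MULT; each output pulse is
         * released as soon as the minimum output period has elapsed
         * since the previous one, otherwise it waits in the queue. */
        int main(void)
        {
            double in[] = {0.0, 2.9, 5.8, 8.7, 11.6}; /* 5 pulses, 2.9 ms */
            double next_allowed = 0.0;

            for (int p = 0; p < 5; p++) {
                for (int m = 0; m < MULT; m++) {
                    double t = in[p] > next_allowed ? in[p] : next_allowed;
                    printf("output pulse at %5.1f ms\n", t);
                    next_allowed = t + MIN_PERIOD;
                }
            }
            return 0;   /* prints 10 pulses spaced 2.5 ms apart */
        }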


    NOTE: This example does not implement trigger scaling.

    If you want to scale an input trigger, e.g. an encoder signal, by a multiplication and division factor, you need to measure the input period, scale it and generate the pulses accordingly. That is not the purpose of this example.

    To get the width of each line as a pixel value, use the attached sample. It counts the length of each line and outputs the width as a single pixel.

    Extension: if you only need the width of a frame, remove all lines except the one used for the measurement.

    Debugging library operators: if you just need to read the width of an image via a parameter, you can use the operators of the Debugging library instead.

    pasted-from-clipboard.png


    See attached VA design file.
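    As a minimal host-side reference of what the sample outputs (my sketch; the VA design does this in hardware on the streaming data):

        #include <stdio.h>

        /* For every input line, emit its pixel count as a single
         * output pixel value. */
        static void emit_line_widths(const int *line_lengths, int n_lines)
        {
            for (int l = 0; l < n_lines; l++)
                printf("line %d -> width pixel = %d\n", l, line_lengths[l]);
        }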