The calculation of the next pixel directly depends on the calculation result of the previous pixel

  • As mentioned in the subject: is it possible to provide a design example in which the calculation of an operation on the next pixel directly depends on the result of the same operation on the previous pixel?

    This is quite different from a cascade of two FIR filters, or DILATE followed by ERODE, because an operator such as FIRkernel only sees the result of the previous operation. What I want is for the result of one operation to be immediately available to the same operation on the next pixel.

  • sgm.JPG

    Let me explain in detail. Suppose I want to calculate an operation op for pixel P, but this calculation depends on the results of the same operation op for pixels A, B, C and D. How can I deal with this problem?

    Operators similar to FIRkernel only provide the result of the previous operation on pixels A, B, C and D, not the result of op.

  • Hi

    What you are requesting is an IIR filter instead of an FIR filter. The problem here is that the current step has to be fully calculated before the next pixel can be processed. Any parallelism in a pipeline stage is therefore impossible, which makes this operation very slow.

    A direct approach in VisualApplets does not exist but you can use loops for your requirement. Instead of the existing loop examples where lines or frames are processed you will need to process a single pixel inside the loop.

    Before going into detail you should be aware of the bandwidth limitation: only a fraction of the FPGA clock rate will be achievable as pixel rate.
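
    The FIR/IIR distinction can be illustrated with a minimal software sketch. It is not a VisualApplets design; the function names and the two-tap mean are assumptions chosen for illustration only. The FIR version reads only input samples, so every output could be computed independently, while the IIR version reads its own previous output and therefore has to run strictly in sequence.

```python
def fir_mean2(samples):
    """FIR: each output depends only on INPUT samples (current and previous)."""
    out = []
    for i, s in enumerate(samples):
        prev_in = samples[i - 1] if i > 0 else s  # assumed border: repeat first sample
        out.append((s + prev_in) / 2)
    return out


def iir_mean2(samples):
    """IIR: each output depends on the previous OUTPUT, so evaluation is sequential."""
    out = []
    for i, s in enumerate(samples):
        prev_out = out[i - 1] if i > 0 else s  # assumed border: seed with first sample
        out.append((s + prev_out) / 2)
    return out


if __name__ == "__main__":
    data = [0, 8, 0, 0]
    print(fir_mean2(data))  # impulse response ends after two samples
    print(iir_mean2(data))  # impulse response keeps decaying
```

    In the IIR case the loop body for sample i cannot start before sample i-1 is finished, which is the dependency that prevents pipelining across pixels.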


    Johannes Trein
    Teamleader Applications and Development
    SiliconSoftware GmbH

  • Hi,

    it is possible to use a loop for single lines for B, C and D, as Johannes mentioned. But your image dimensions have to be constant. For A I would use a Pixel-Neighbours-Operator to do the calculation once again. Just pass a Kernel with all used arguments to that operator.

    As far as I can see, parallelism > 1 could work, depending on the operations you have to do.



    Edit: A kernel could be difficult - but Pixel-Neighbours is an O-Type, so you can use multiple in Parallel.

    Edited once, last by SWe.

  • I thought about using the loop method to solve this problem, but there are two difficulties.

    The first is that the granularity of the loop is a pixel, not an image. How can the synchronization between the current pixel and the previous pixel be solved?

    The other is that when calculating P, I need not only pixel A, but also pixels B, C, and D. Since the line buffer cannot provide a definite delay, how can I locate pixels B, C, and D?

  • To simplify the problem and make it more concrete, assume that the operation op is the average of the upper-left neighborhood of pixel P, that is, op(P)=average(A,B,C,D).

    However, the values of pixels A, B, C and D are themselves the mean values of their respective upper-left neighborhoods, rather than the original image gray values.

    Is it possible to provide such a simple VA program?

  • So you want to implement something that provides P_{0,0} = mean(P_{0,-1}, P_{-1,-1}, P_{-1,0}, P_{-1,1}), where P_{y-relative,x-relative} is the "recursive" mean at its position?

    If you want to, I can provide you a basic design.

    From a system theoretic point of view this should converge to something like a (masked) Gaussian mask. Maybe you can evaluate this and create a single filter mask?
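
    As a software reference for the recursive mean discussed above, the following is a minimal sketch, not a VisualApplets design. The function name and the border handling are assumptions: the thread does not say how pixels without a full neighborhood are treated, so here the first row, first column and last column simply keep the input gray value and act as the seed of the recursion. The assignment A = left, B = upper left, C = upper, D = upper right follows the formula above.

```python
def recursive_mean(image):
    """Raster-scan reference model of op(P) = mean(op(A), op(B), op(C), op(D)).

    image: list of rows, each row a list of gray values.
    Border rule (assumption): first row, first column and last column
    keep the input value, because they lack a full neighborhood.
    """
    height = len(image)
    width = len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if y == 0 or x == 0 or x == width - 1:
                out[y][x] = float(image[y][x])  # assumed border seed
            else:
                a = out[y][x - 1]      # A: result of the previous pixel, same line
                b = out[y - 1][x - 1]  # B: result upper left
                c = out[y - 1][x]      # C: result above
                d = out[y - 1][x + 1]  # D: result upper right
                out[y][x] = (a + b + c + d) / 4.0
    return out


if __name__ == "__main__":
    test = [[4, 4, 4],
            [0, 0, 0],
            [0, 0, 0]]
    for row in recursive_mean(test):
        print(row)
```

    Note that with this border rule the interior input pixels never enter the result; only the seed values propagate, which matches the definition op(P)=average(A,B,C,D) given earlier in the thread.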



  • "Recursive" is exactly what I want. It differs from ordinary template filtering, because operators such as firkernel or pixelneighbor can only access the original image pixels, whereas I need the filtering result of the current pixel to be immediately available to the next pixel.

    I would be very grateful if you can provide a basic design like this.

  • Hi,

    I have to admit that I made a mistake in my reasoning. Doing calculations on the results of the previous line (B, C, D) is no problem.

    But I haven't found a good way to calculate A yet.

    I will keep thinking about it, but this could take some time, and there is no guarantee of a solution.


  • Dear,

    I've built a basic example which applies that filtering. It is based on a double loop approach. The outer loop feeds back the previous line (B, C, D); the inner loop is pixel based and feeds back the last value (A).
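
    The data flow of such a double loop can be modelled in software as follows. This is only a sketch of the feedback structure, not the attached VisualApplets design; the function name and the border rule (first line, first and last pixel of each line keep the input value) are assumptions. The variable prev_out plays the role of the outer loop (previous output line, giving B, C, D) and last plays the role of the inner loop (previous output pixel, giving A).

```python
def recursive_mean_stream(lines):
    """Consume input lines one at a time and yield the filtered lines."""
    prev_out = None  # outer-loop feedback: previous OUTPUT line (B, C, D)
    for line in lines:
        width = len(line)
        cur_out = []
        last = None  # inner-loop feedback: previous OUTPUT pixel (A)
        for x, pixel in enumerate(line):
            if prev_out is None or x == 0 or x == width - 1:
                value = float(pixel)  # assumed border seed
            else:
                value = (last + prev_out[x - 1] + prev_out[x] + prev_out[x + 1]) / 4.0
            cur_out.append(value)
            last = value  # available to the very next pixel
        yield cur_out
        prev_out = cur_out  # available to the next line


if __name__ == "__main__":
    frame = [[4, 4, 4], [0, 0, 0], [0, 0, 0]]
    for out_line in recursive_mean_stream(frame):
        print(out_line)
```

    Only one output line and one output pixel are kept as state, so the positions of B, C and D are fixed offsets into the fed-back line rather than delays in a line buffer.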

    On Hardware (ME5-VCL, 125MHz) I only achieved 3.2fps with a (1024x1024)px image. But maybe it's fast enough for your application.

    How to test on hardware:

    1. Build, flash, and load in MicroDisplay
    2. Set the timeout of your acquisition to a really high value
    3. Search for Source/SimDimension and Source/P32ROI and set the test image dimensions
    4. PixelsToLineImage and LineImageToOneLine have to be set to the image width, Output/AppendTo2D has to be your image height
    5. Start continuous grabbing
    6. Search for Source/Inject, insert the file name of your test image into the parameter "ImageFile", and change "InjectFromFile" to yes
    7. --> Now you have one test image in DRAM
    8. Search for Source/Cam0_Loop1_Inject2 and set the parameter "SelectSource" to 1
    9. Search for Source/RUN and set the parameter "Mode" to High for continuous operation, or to Pulse for a single shot (toggle this for more single shots).
    10. Look at the output :)

    I hope this helps.

    Best regards,


  • Thank you for the exquisite double loop reference design.

    Basically, you decompose the two-dimensional image into line images first, so that the processing results of the previous image lines can be fed back to the current line through a loop; then, the line image is decomposed into individual pixels to provide a pixel-level loop.

    I will try to integrate the double loop method into my own design. Thank you very much!

    In addition, since the loop is at the pixel level, the design can only work under one pixel parallelism.

    You only achieved a processing speed of 3.2fps for a 1024x1024 image at a clock frequency of 125MHz. Since each pixel is executed sequentially and there is no pipelining between pixels, the processing time of an image is the sum of the individual processing times of all pixels. By this calculation, the operation on each pixel (calculating the mean value) requires about 37 clock cycles (125MHz / (3.2fps x 1024 x 1024 pixels)). Is this reasonable?

    The processing speed of 3.2fps is not enough for my application. Do you have any additional optimization suggestions, for example how the clock fraction mentioned by Johannes Trein in the third post is implemented in VisualApplets?
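
    The cycles-per-pixel estimate can be checked with a few lines of arithmetic. The sketch below assumes, as stated above, that pixels are processed strictly one after another with no overlap; the function names are hypothetical.

```python
def cycles_per_pixel(clock_hz, fps, width, height):
    """Clock cycles spent per pixel when pixels are processed sequentially."""
    return clock_hz / (fps * width * height)


def fps_for_latency(clock_hz, cycles, width, height):
    """Frame rate reachable if each pixel occupies the loop for `cycles` clocks."""
    return clock_hz / (cycles * width * height)


if __name__ == "__main__":
    cpp = cycles_per_pixel(125e6, 3.2, 1024, 1024)
    print(f"cycles per pixel: {cpp:.2f}")
    # Effect of halving the loop latency on the frame rate
    print(f"fps at half latency: {fps_for_latency(125e6, cpp / 2, 1024, 1024):.2f}")
```

    With 1024 x 1024 = 1,048,576 pixels the estimate is about 37 cycles per pixel; rounding the image to 10^6 pixels gives about 39. Under this model the frame rate scales inversely with the loop latency, so halving the latency would double the frame rate.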

  • Dear,

    the only thing that came to my mind was to decompose the single lines into a 0D stream, which would dramatically improve the inner loop latency (and therefore the bandwidth). Right now, every pixel doubles each operator's latency, because an EoL and an EoF flag follow every pixel.

    The only problem I have not been able to solve so far is that I would need a multiplexer like InsertImage at 0D level.

    Maybe you or someone else has an idea for that problem?

    Best regards,


  • Anyway, thank you very much.

    It is worth mentioning that my question comes from the semi-global matching algorithm[1] in stereo matching, which may be of interest to you. This is a very common algorithm in stereo vision.

    I also found a problem when designing a stereo matching algorithm. When a design contains a large number of operators (for example several thousand, which is common when the disparity level of stereo matching is 128 or greater), VisualApplets becomes very unresponsive or even exits abnormally. Is this a known problem?

    [1] Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 30(2): 328-341.

  • Thank you for mentioning that algorithm. I have heard of it several times.

    I have a big design right now, too. VA is stable in my case. Please make sure that you use the newest version, which is 3.2.1 right now. I have noticed that modifying links takes more time in big designs.

    A workaround I use is to develop individual parts of the big design in smaller designs, which speeds up development.

    Aside from that: I wish you success with the implementation!