Posts by SWe

    Thank you for mentioning that algorithm. I have heard of it several times.


    I have a big design right now, too. VA is stable in my case; please make sure that you use the newest version, which is 3.2.1 right now. I have noticed that modifying links takes more time in big designs.


    A workaround I use is to develop single parts of the big design in smaller designs, which speeds up the development.


    Aside from that: I wish you success with the implementation!

    Dear,


    the only thing that came to my mind was to decompose the single lines into a 0D stream, which would dramatically improve the inner loop latency (and therefore the bandwidth). Right now every pixel doubles every operator's latency, as the subsequent EoL and EoF flags are present.


    The only problem I couldn't solve so far is that I would need a multiplexer like InsertImage on the 0D level.

    Maybe you or someone else has an idea for that problem?


    Best regards,

    Simon

    Dear,


    I've built a basic example which applies that filtering. It is based on a double loop approach. The outer loop is for the feedback of the old line (B, C, D); the inner loop is pixel based and feeds back the last value (A).


    On hardware (ME5-VCL, 125 MHz) I only achieved 3.2 fps with a 1024x1024 px image. But maybe it's fast enough for your application.


    How to test on hardware:

    1. Build, Flash, Load in MicroDisplay
    2. Set the timeout of your acquisition to a really high value
    3. Search for Source/SimDimension and Source/P32ROI and set the test image dimensions
    4. PixelsToLineImage and LineImageToOneLine have to be set to the image width, Output/AppendTo2D has to be your image height
    5. Start continuous grabbing
    6. Search for Source/Inject, insert the file name of your test image into the parameter "ImageFile", and change "InjectFromFile" to yes
    7. --> Now you have one test image in DRAM
    8. Search for Source/Cam0_Loop1_Inject2 and set the parameter "SelectSource" to 1
    9. Search for Source/RUN and set the parameter "Mode" to High for continuous operation, or to Pulse for a single shot (toggle this for more single shots).
    10. Look at the output :)

    I hope this helps.

    Best regards,

    Simon

    Dear Lucy,


    please have a look at the modified version.

    CarmenZ and I modified it to achieve more bandwidth.

    Most of the magic is done in "Par8_ROI_Extract". There are 2 LUTs with the pixel coordinates to extract from the image. The box "FrameBufferRandomRd_Par8" is a special approach to get more bandwidth while still being able to do random reads in DRAM. You can find this in the examples folder of VA: "Examples/Processing/Geometry/GeometricTransformation/GeometricTransformation_PixelReplicator.va". It's just a little bit stripped down. It is also really expensive; I hope the rest of your design fits into the FPGA. If not, please tell me.


    Some changes are in "subtract_bkgnd/ImageSequence", where the three ImageFIFOs and the InsertImage are replaced by a FrameMemoryRandomRd with an address generator which repeats the input image three times.
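
    As a rough software illustration of what such an address generator does, the sketch below emits the read addresses of one stored frame three times in a row. The function name and the linear, row-major address layout are assumptions for illustration only; they are not taken from the actual design.

```python
def repeat_image_addresses(width, height, repeats=3):
    """Yield linear read addresses so that one stored frame is read `repeats` times.

    Assumption: the frame is stored row-major, starting at address 0.
    """
    for _ in range(repeats):
        for address in range(width * height):
            yield address


# A 2x2 frame is read out three times in sequence.
addresses = list(repeat_image_addresses(2, 2))
```

    Reading the same buffer several times replaces the three separate FIFOs, at the cost of one random-read memory operator.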


    I hope this helps.


    Best regards,

    Simon

    Dear,


    you are appending 4x4 frames, which results in an image height of 16. The simulation then stops:

    Quote

    (BRANCH) Process0\subtract_bkgnd\ImageSequence\CopySequence\I: Image height(16) exceeds the maximal link height(4)!

    At runtime, the behaviour when link dimensions are exceeded is undefined! So maybe that's your main problem.


    Aside from this, I don't see an obvious deadlock problem. All of your FIFOs are big enough.

    I see a bandwidth problem, as you are using a DRAM-operator at parallelism 1.

    What is the frame rate of your camera?

    On ME5 platforms the maximum possible parallelism should be used, as stated in the manual (Appendix. Device Resources/Shared Memory Concept):

    Quote

    Due to the shared bandwidth architecture, the applet developer should utilize all 256 bits of the operator’s memory interface (RAM Data Width) to achieve maximal throughput through the memory interface when using multiple RAM based operators even though the single RAM operator needs less bandwidth on its input.



    Best regards,

    Simon

    So you want to implement something which provides P_{0,0} = mean(P_{0,-1}, P_{-1,-1}, P_{-1,0}, P_{-1,1}), where P_{y-relative,x-relative} is the "recursive" mean at its position?


    If you want to, I can provide you a basic design.

    From a system-theoretic point of view this should converge to something like a (masked) Gaussian mask. Maybe you can evaluate this and create a single filter mask?
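
    A minimal software sketch of that recursion, as I understand the question: each output pixel is the mean of the already computed outputs to the left (A) and in the previous line (B, C, D). How the borders are handled is an assumption here (first line and first/last column are simply copied from the input), and the function name is hypothetical.

```python
def recursive_mean(image):
    """Recursive 4-neighbour mean over already computed output pixels.

    A = out[y][x-1]                      (same line, previous pixel)
    B, C, D = out[y-1][x-1 .. x+1]       (previous line)
    Assumption: first line and first/last column are copied from the input.
    """
    height, width = len(image), len(image[0])
    out = [list(row) for row in image]
    for y in range(1, height):            # outer loop: feedback of the old line
        prev = out[y - 1]
        for x in range(1, width - 1):     # inner loop: feedback of the last pixel
            a = out[y][x - 1]
            out[y][x] = (a + prev[x - 1] + prev[x] + prev[x + 1]) / 4.0
    return out


flat = recursive_mean([[8.0] * 5 for _ in range(4)])
small = recursive_mean([[0.0, 4.0, 8.0, 4.0],
                        [4.0, 0.0, 0.0, 0.0]])
```

    Feeding a single bright pixel into such a model and looking at the result is one way to evaluate which fixed filter mask the recursion approximates.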


    Greetings

    Simon

    Hi,


    it is possible to use a loop for single lines for B, C and D, as Johannes mentioned. But your image dimensions have to be constant. For A I would use a Pixel-Neighbours-Operator to do the calculation once again. Just pass a Kernel with all used arguments to that operator.

    As far as I can see, parallelism > 1 could work; it depends on the operations you have to do.


    Greetings,

    Simon


    Edit: A kernel could be difficult - but Pixel-Neighbours is an O-Type, so you can use multiple in parallel.

    Dear Pier,


    I would suggest using the "SignalGate" operator.

    For the "Gate" input I would use an "RS-FlipFlop", which is set/reset by your GPIOs.


    Best regards

    Simon


    Edit: For a defined number of Frames, use a "PulseCounter" and count it via "FrameStartToSignal"

    Hello Community,


    similar to my big Sqrt problem (Thread), I encountered the need for wide multiplications in one of my designs.

    In this post I want to show you the solution I worked out.


    My solution is to split the multiplication into two multiplications and add their results afterwards:

    BigMult.PNG


    For signed values, I take the absolute value and do a manual complement afterwards:

    BigMult2.PNG

    In this example only the lower link is signed. If both links are signed, take the absolute value and do the negative check for the upper link, too. Feed "neg A xor neg B" into the lower port of "Complement".
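
    For readers without access to the screenshots, here is a minimal Python sketch of the same idea. The split width of 16 bits, the choice of which operand is split, and the function names are assumptions for illustration; the actual design may differ.

```python
def wide_mul_unsigned(a, b, low_bits=16):
    """Multiply two non-negative integers using two narrower multiplications.

    b is split into a high and a low part (assumed split width: low_bits).
    """
    mask = (1 << low_bits) - 1
    b_lo = b & mask
    b_hi = b >> low_bits
    # Two partial products; the high one is shifted back before the addition.
    return ((a * b_hi) << low_bits) + a * b_lo


def wide_mul_signed(a, b, low_bits=16):
    """Signed variant: multiply absolute values, then complement if needed."""
    neg_a, neg_b = a < 0, b < 0
    magnitude = wide_mul_unsigned(abs(a), abs(b), low_bits)
    # "neg A xor neg B" decides whether the result is complemented.
    return -magnitude if (neg_a ^ neg_b) else magnitude


product = wide_mul_signed(-123456789, 987654321)
```

    Working on absolute values keeps both partial multiplications unsigned, so the sign handling stays in one place at the output.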


    I added a test design for you. Other options are possible: you may use the same approach as in the Sqrt example to work with pseudo floating point data.


    Best regards

    Simon

    Hello,


    I did some optimization:

    • As the sqrt operation only allows shift-increments by two, the adaptive shift operations can be done with 16 cases and not 32
    • The "highest bit set" logic can be done via combinational logic

    This leads to the following improvements:

                 Old     New
    LUT          1218    789
    Flip Flop    218     105


    Best regards

    Simon


    Hello Community,


    in my recent work I ran into a problem: Sqrt for values wider than 32 bits.

    Just shifting right by a hard-coded number of bits is not a good idea, since small values may occur, too.


    I want to share the solution I came up with.


    What the design does:

    1. Determine the position of the most significant set bit
    2. Shift dynamically to the right by the "anti fractional bits"; their number has to be even, as the sqrt operation divides it by two
    3. Perform the sqrt operation on the shifted value
    4. Shift left by the number of "anti fractional bits" divided by two
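
    The four steps can be sketched in Python as follows. `big_sqrt` and `sqrt_width` are hypothetical names, and `math.isqrt` merely stands in for the sqrt operator, which is assumed here to accept at most 32 bits.

```python
import math


def big_sqrt(value, sqrt_width=32):
    """Approximate integer square root for inputs wider than the sqrt operator.

    Assumption: the sqrt operator accepts at most `sqrt_width` bits.
    """
    msb = value.bit_length()            # 1. position of the highest set bit
    shift = max(0, msb - sqrt_width)    # 2. bits that do not fit into the operator
    shift += shift & 1                  #    round up to an even number
    root = math.isqrt(value >> shift)   # 3. sqrt on the shifted value
    return root << (shift // 2)         # 4. undo half of the shift


print(big_sqrt(1 << 40))  # 1048576
```

    Small inputs are not shifted at all and keep full precision; for wide inputs the result can fall short of the exact integer root by less than 2^(shift/2).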


    Pro:

    • High precision

    Con:

    • Extremely expensive

    It's just like a poor man's floating point.

    Please feel free to comment - maybe there is room for improvement.


    Best regards

    Simon

    Dear Pier,


    could you please post an image with example raw data of that camera's output?
    I've read the document, but couldn't find the specification of how the data is packed (and transferred).


    Do you know what pixel data format is used?

    What ROI dimensions do you use for the triangulation?



    Best regards

    Simon

    Thank you!

    In my opinion, you should do some preprocessing (maybe a closing) to make sure you pass only one (and the right!) blob to the subtraction loop. You could also select the right blob based on other parameters, like area or bounding box.


    Besides that, the design should work.


    Greetings

    Simon

    Hey Mich,


    I made some changes to your design. It should work now. You had created a deadlock with "InsertImage". To fix that, "CreateBlankImage" is used to create an independent input for the dummy values. The loop now calculates the X- and Y-difference in parallel, as I moved it to a kernel.


    Do you use blob analysis more often in your applications? I have a question online here, too - maybe you could help me with that? Link: BLOB-Features


    If you have further questions: don't hesitate to ask :)


    Thanks and Greetings

    Simon