Abhijit S. Ogale
Senior Engineer, Computer Vision and Graphics
Research

Low and intermediate-level vision
Summary: Low and
intermediate-level vision consists of several problems, such as the
computation of stereo disparity, optical flow, depth, shape,
occlusions, 3D motion, and various segmentation problems based on
modalities such as depth, motion, texture and color. These problems
depend on each other in a chicken-and-egg fashion. For ease of
formulation and solution, they are often treated independently, which
leads to sub-optimal and sometimes even incorrect solutions. As
part of my doctoral work, I have shown that problems such as image
correspondence, segmentation and shape are inseparable and must
be solved together. This work has led to new compositional
stereo and optical flow algorithms, which succeed in cases where other
approaches fail.
Summary: Visual motion
analysis includes problems such as camera motion estimation, 3D
structure from video, and the detection of independently moving objects
in a video. In my doctoral work, I classify independently moving
objects into three groups, including a previously unknown group of
moving objects which is found using occlusions. I have created
algorithms which automatically discover ordinal depth relations in a
video using occlusions to find new moving objects. These techniques can
be used as building blocks for applications such as semi-autonomous
(e.g., driver assistance technology for cars) and autonomous navigation
(e.g., unmanned ground vehicles), video compression, surveillance and
graphics.
There are three classes of independently moving objects: (a) Class 1: those detected using motion alone; (b) Class 2: those detected using motion and occlusions, by looking for ordinal depth violations; and (c) Class 3: those detected by comparing depth from motion with depth from another source (such as stereo). Toy examples of these three classes are shown below. In each case, the red object moves independently, and the arrows indicate the optical flow. Dashed regions indicate regions which will soon be occluded due to the movement.
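The three classes above can be read as a simple decision rule. The following is a minimal sketch, not the method used in this work: the names `MovingObjectClass` and `classify_region` are hypothetical, the three boolean cues are assumed to be computed elsewhere, and the order in which the cues are checked is an assumption.

```python
from enum import Enum


class MovingObjectClass(Enum):
    NONE = 0      # no evidence of independent motion
    CLASS_1 = 1   # detected using motion alone
    CLASS_2 = 2   # detected through an ordinal depth violation
    CLASS_3 = 3   # detected by comparing two depth sources


def classify_region(motion_differs, ordinal_conflict, cardinal_conflict):
    """Map three boolean cues for one region to a class (hypothetical helper).

    motion_differs:    the region's flow does not fit the background motion.
    ordinal_conflict:  occlusions and depth from motion disagree on depth order.
    cardinal_conflict: depth from motion disagrees with depth from another
                       source, such as stereo.
    """
    if motion_differs:
        return MovingObjectClass.CLASS_1
    if ordinal_conflict:
        return MovingObjectClass.CLASS_2
    if cardinal_conflict:
        return MovingObjectClass.CLASS_3
    return MovingObjectClass.NONE
```

For example, `classify_region(False, True, False)` returns `MovingObjectClass.CLASS_2`, which corresponds to an object whose motion alone looks consistent with the background but whose occlusions contradict the depth order implied by its motion.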
The first column shows a situation in which the background objects (which are not independently moving) are translating horizontally, while the red object is moving vertically. In this scenario, motion-based clustering approaches will succeed, and such Class 1 objects can be detected using motion alone.

The second column shows a situation in which the background objects are translating horizontally to the right, and the red object also moves towards the right. In this scenario, motion clustering will fail, and we also need the occlusions to find such objects. The occlusions tell us that the red object is behind the black object. However, if we compute depth from motion, then because the motion is predominantly a translation and the red object moves faster, the result would indicate that the red object is in front of the black object. This conflict signals Class 2 moving objects.

The third column shows a situation similar to the second column, except that the black object which was in front of the red object has been removed. The ordinal depth conflict of the earlier case is no longer present, and we must employ cardinal comparisons between structure from motion and structure from another source (such as stereo) to identify Class 3 moving objects.

Finding ordinal depth using occlusions: Given two
frames from a video, occlusions are points in one frame which have no
corresponding point in the other frame. However, merely knowing the
occluded regions is not sufficient to deduce ordinal depth. We also
need to know `who occluded what' as opposed to merely knowing `what was
occluded'. Thus, we have to group occluded regions with their
neighboring visible regions. If we find (say) that region R1 disappeared under region R2, we can say that R1 is behind R2.

The figure below demonstrates the idea behind finding ordinal depth by filling in occlusions found in the optical flow. The top portion (a) shows three frames of a video sequence. The yellow region which is visible in Frame 1 and Frame 2 disappears behind the tree (i.e., becomes occluded) in Frame 3. The next row (b) shows the reverse optical flow (Frame 2 to Frame 1) and the forward flow (Frame 2 to Frame 3). Only the x-components are shown. Occlusions are colored white. In (c), occlusions in the forward flow u23 are filled using the segmentation from the reverse flow u21. After filling, we can find ordinal depth relations as shown in (d), where the tree (marked in green) is found to be in front of the region on its left. The advantage of this technique is that it uses purely optical flow information and is directly applicable even when independently moving objects are present.
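The "who occluded what" step can be illustrated with a small sketch. It assumes that the occluded pixels have already been filled with region labels (for instance from a segmentation of the reverse flow) and that an occlusion mask is available; the function name `ordinal_depth_relations` and the 4-neighbor adjacency test are illustrative assumptions, not the implementation behind the figure.

```python
import numpy as np


def ordinal_depth_relations(labels, occluded):
    """Return a set of (behind, in_front) region label pairs.

    labels:   2D integer array of region labels, with occluded pixels
              already assigned to the region they belong to.
    occluded: 2D boolean array, True where a pixel has no corresponding
              point in the other frame.
    """
    labels = np.asarray(labels)
    occluded = np.asarray(occluded, dtype=bool)
    height, width = labels.shape
    relations = set()
    for row, col in np.argwhere(occluded):
        hidden = int(labels[row, col])
        # Inspect the 4-connected neighbors of each occluded pixel.
        for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + d_row, col + d_col
            if not (0 <= r < height and 0 <= c < width):
                continue
            if occluded[r, c]:
                continue
            neighbor = int(labels[r, c])
            # A visible neighbor from a different region is taken to be
            # the occluder, so the hidden region lies behind it.
            if neighbor != hidden:
                relations.add((hidden, neighbor))
    return relations
```

With region 1 on the left, region 2 on the right, and the occluded strip belonging to region 1 at their shared boundary, the function reports that region 1 is behind region 2. The result is only as reliable as the labels assigned to the occluded pixels.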
Summary: Human activity, like human speech, requires mechanisms which can be used for both recognition and generation. The relationship runs even deeper, since human speech is mostly used to describe actions. Hence, it makes sense to examine whether the computational models for speech can also be applied to the problem of recognizing and generating human actions. Adopting this viewpoint, my current research seeks to model human activity using grammars. Loosely speaking, the alphabet of this language consists of body poses (which include motion data), the words can be thought of as actions (such as jump or kneel), and sentences describe activity. Sequences of simple actions can be parsed to discover more abstract descriptions of activity.

The training videos feature many actors and multiple viewpoints, with each actor performing a given set of basic actions. The figure below shows a dataset with 8 views, 10 actors, and several sample keyposes.
Key poses are detected by extracting keyframes, which are extreme pose or movement configurations. These are found using the optical flow. Here is an example with the sit and stand actions. Optical flow: the video below shows the action (left), the X-component of the optical flow (middle) and the Y-component (right). Blue indicates negative flow and red indicates positive flow.
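The text does not state the exact criterion used to pick keyframes from the optical flow. One plausible reading, shown here purely as an assumption, is to take frames where a per-frame motion measure (such as the mean flow magnitude) reaches a local minimum or maximum. The name `keyframe_indices` is hypothetical.

```python
import numpy as np


def keyframe_indices(flow_magnitude):
    """Return indices of strict local extrema in a per-frame motion signal.

    flow_magnitude: 1D sequence with one value per frame, for example
    the mean optical flow magnitude (an assumed choice of measure).
    """
    signal = np.asarray(flow_magnitude, dtype=float)
    keyframes = []
    for i in range(1, len(signal) - 1):
        is_minimum = signal[i] < signal[i - 1] and signal[i] < signal[i + 1]
        is_maximum = signal[i] > signal[i - 1] and signal[i] > signal[i + 1]
        if is_minimum or is_maximum:
            keyframes.append(i)
    return keyframes
```

Under this assumption, a minimum would correspond to an extreme pose where motion momentarily stops (for example, fully seated), and a maximum to the fastest part of a movement. Real flow signals are noisy, so some smoothing before the extremum test would likely be needed.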
This training data is used to create a model (a probabilistic context-free grammar), which is then used for recognizing (parsing) actions and viewpoints within a new video. The figure below shows an example of viewpoint and 3D pose recognition using this system. The input video (in the leftmost column) shows a person walking in a circle, then picking something up from the floor. On the right side, each row shows the identified multi-viewpoint 3D pose from the database, and the orange cells denote the identified view.
The figure
below shows the most probable parse tree returned by the system for a
novel sequence involving four actions performed in sequence (walk, turn, kick, kneel).
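To make the idea of a most probable parse concrete, here is a minimal sketch of Viterbi-style CYK parsing for a probabilistic context-free grammar in Chomsky normal form. The grammar, its symbol names (`Activity`, `Locomotion`, `Gesture`, `Tail`) and its probabilities are invented for illustration; they are not the grammar learned by the system, which works from poses rather than from action labels.

```python
# Hypothetical toy grammar in Chomsky normal form.
LEXICAL = {  # (nonterminal, terminal): probability
    ("Walk", "walk"): 1.0,
    ("Turn", "turn"): 1.0,
    ("Kick", "kick"): 1.0,
    ("Kneel", "kneel"): 1.0,
}
BINARY = {  # (parent, left child, right child): probability
    ("Locomotion", "Walk", "Turn"): 1.0,
    ("Gesture", "Kick", "Kneel"): 1.0,
    ("Tail", "Turn", "Gesture"): 1.0,
    ("Activity", "Locomotion", "Gesture"): 0.5,
    ("Activity", "Gesture", "Locomotion"): 0.3,
    ("Activity", "Walk", "Tail"): 0.2,
}


def viterbi_parse(tokens, lexical=LEXICAL, binary=BINARY, start="Activity"):
    """Return (probability, tree) of the best parse, or None if none exists."""
    n = len(tokens)
    if n == 0:
        return None
    # best[(i, j)][symbol] = (probability, tree) for tokens[i:j]
    best = {}
    for i, token in enumerate(tokens):
        cell = {}
        for (parent, terminal), p in lexical.items():
            if terminal == token and p > cell.get(parent, (0.0, None))[0]:
                cell[parent] = (p, (parent, token))
        best[(i, i + 1)] = cell
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):  # split point between the two children
                left_cell, right_cell = best[(i, k)], best[(k, j)]
                for (parent, left, right), p in binary.items():
                    if left in left_cell and right in right_cell:
                        left_p, left_tree = left_cell[left]
                        right_p, right_tree = right_cell[right]
                        prob = p * left_p * right_p
                        if prob > cell.get(parent, (0.0, None))[0]:
                            cell[parent] = (prob, (parent, left_tree, right_tree))
            best[(i, j)] = cell
    return best[(0, n)].get(start)
```

Calling `viterbi_parse(["walk", "turn", "kick", "kneel"])` considers two competing derivations under this toy grammar and keeps the one that groups the sequence into a `Locomotion` part followed by a `Gesture` part, since it has the higher probability.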
Summary: At the
This system was used for several purposes. The photo shows the cameras in an omnidirectional Argus Eye configuration, with cameras mounted on a wooden octahedron, which is carried around by a person. The papers related to the Argus Eye configuration are given below.

Video stabilization and tracking

Summary: Video stabilization and moving object detection and tracking software for the Video Verification of Identity (VIVID) project.

Some stabilization results: The clip shows
the original input video on top, and the stabilized video at the
bottom. The stabilization is reset every sixty frames.

Here is another example where the same technique is applied to a video with plenty of depth variation and large independent object motion. The video shows the stabilized sequence, and the inset on the top-left shows the original video.
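The periodic reset mentioned above can be sketched as follows. This assumes that the motion between consecutive frames has already been estimated as a 3x3 matrix in homogeneous coordinates; the motion model, the function name `stabilizing_transforms` and the matrix convention are assumptions, since the page only states that the stabilization is reset every sixty frames.

```python
import numpy as np


def stabilizing_transforms(interframe, reset_every=60):
    """Accumulate inter-frame motion into per-frame stabilizing transforms.

    interframe[i] is assumed to be a 3x3 matrix mapping coordinates in
    frame i + 1 to coordinates in frame i. The returned list holds, for
    each frame, the matrix mapping that frame into the current reference
    frame. Every `reset_every` frames the current frame becomes the new
    reference, so accumulated drift does not grow without bound.
    """
    accumulated = np.eye(3)
    transforms = [accumulated]
    for index, step in enumerate(interframe, start=1):
        if index % reset_every == 0:
            accumulated = np.eye(3)  # reset: this frame is the new reference
        else:
            accumulated = accumulated @ np.asarray(step, dtype=float)
        transforms.append(accumulated)
    return transforms
```

Warping each frame with its transform would then produce the stabilized sequence. Resetting trades a visible jump at each reset for bounded accumulation of estimation error.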
Last updated: Feb 2008