How depth information is obtained using stereo vision


Making it work with the stereo-matching technique and texture projection

General stereo vision configuration
Depth perception from stereo vision is based on the triangulation principle. Two cameras with projective optics, as in the integrated camera/projector unit Ensenso N10 for example, are arranged side by side so that their fields of view overlap at the desired object distance. By taking a picture with each camera, the scene is captured from two different viewpoints.

(Figure 1, right, illustrates an example from the Middlebury Stereo Dataset "cones", where a scene with paper cones is imaged with a stereo camera. The projection rays of two cone tips into both camera images are drawn as an example.)

For each surface point visible in both images, there is a ray in 3D space connecting the surface point with each camera’s centre of projection. In order to obtain the 3D position of the whole captured scene, two tasks need to be accomplished. Firstly, we need to identify where each surface point that is visible in the left image is located in the right image. Secondly, the exact camera geometry must be known to compute the ray intersection point for associated pixels of the left and right camera. Assuming that the cameras are firmly attached to each other, the geometry only needs to be computed once, during the calibration process.
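
A rough, purely illustrative sketch of the triangulation principle is given below (Python with NumPy; not part of the original article). It assumes an idealised, already-rectified side-by-side setup, and the focal length, baseline and principal-point values in the example call are placeholders.

import numpy as np

def triangulate_point(x_left, x_right, y, f_px, baseline_m, cx, cy):
    """Recover the 3D position of a point seen in both rectified images.

    x_left, x_right : pixel column of the same surface point in each image
    y               : pixel row (identical in both images after rectification)
    f_px            : focal length in pixels (from calibration)
    baseline_m      : distance between the two camera centres in metres
    cx, cy          : principal point of the left camera in pixels
    """
    disparity = x_left - x_right          # horizontal pixel shift between the views
    Z = f_px * baseline_m / disparity     # depth: closer objects give larger disparity
    X = (x_left - cx) * Z / f_px          # back-project the left-image pixel
    Y = (y - cy) * Z / f_px
    return np.array([X, Y, Z])

# Example with made-up numbers: a 10-pixel disparity with a 10 cm baseline and a
# 700-pixel focal length corresponds to a depth of 7 m.
print(triangulate_point(650, 640, 360, f_px=700.0, baseline_m=0.10, cx=640, cy=360))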

Application-specific processing
A three-step process, described below, needs to be carried out on the stereo image pair in order to obtain the full 3D point cloud of the scene. The point cloud then needs to be processed further to realize a specific application.

The point cloud can be used to match the surface of the scene against a known object, either learned from a previous point cloud or taken from a CAD model. If the part can be located uniquely in the captured scene surface, the complete position and rotation of the object can be computed and it could, for example, be picked up by a robot.
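
As one possible sketch of this kind of downstream processing (not taken from the article), the snippet below uses the Open3D library's ICP registration to align a captured point cloud with a reference model. The file names are placeholders, and a rough initial pose is assumed, since ICP only refines an approximate alignment.

import numpy as np
import open3d as o3d

# Placeholder file names: a point cloud captured by the stereo camera and a
# reference model of the part (e.g. sampled from a CAD file or a previous scan).
scene = o3d.io.read_point_cloud("captured_scene.ply")
model = o3d.io.read_point_cloud("reference_part.ply")

# ICP refines an approximate alignment, so a rough initial pose is assumed here
# (identity, i.e. the part is presented close to its reference position).
initial_pose = np.eye(4)
max_correspondence_dist = 0.01  # metres; tune to the noise level of the sensor

result = o3d.pipelines.registration.registration_icp(
    model, scene, max_correspondence_dist, initial_pose,
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

# 4x4 homogeneous transform: the rotation and translation of the part in the
# camera frame, which could then be handed to a robot for picking.
print(result.transformation)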

Calibration
The geometry of the two-camera system is determined in the stereo calibration process. Firstly, a calibration object is required; this is usually a planar calibration plate with a checkerboard or dot pattern of known size. Secondly, synchronous image pairs are captured from both cameras, showing the pattern in different positions, orientations and distances. The pixel locations of the pattern dots in each image pair, together with their known positions on the calibration plate, are then used to compute both the 3D poses of all observed patterns and an accurate model of the stereo camera. The model consists of the so-called intrinsic parameters of each camera (the camera’s focal length and distortion) and the extrinsic parameters (the rotation and shift in three dimensions between the left and right camera). This calibration data is used to triangulate corresponding points that have been identified in both images and recover their metric 3D co-ordinates with respect to the camera.
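
The article does not name a particular software toolkit; as an illustrative sketch only, the OpenCV calls below carry out such a calibration from chessboard image pairs. The plate layout, square size and the synchronised_pairs image list are assumptions introduced here.

import cv2
import numpy as np

# Placeholder plate layout: 9x6 inner chessboard corners, 25 mm squares.
pattern_size = (9, 6)
square_size = 0.025  # metres
board = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
board[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

obj_points, left_points, right_points = [], [], []
for left_img, right_img in synchronised_pairs:  # assumed list of grayscale image pairs
    ok_l, corners_l = cv2.findChessboardCorners(left_img, pattern_size)
    ok_r, corners_r = cv2.findChessboardCorners(right_img, pattern_size)
    if ok_l and ok_r:  # keep only pairs in which both cameras see the whole plate
        obj_points.append(board)
        left_points.append(corners_l)
        right_points.append(corners_r)

image_size = left_img.shape[::-1]  # (width, height)

# Intrinsic parameters of each camera (camera matrix K, distortion coefficients d).
_, K1, d1, _, _ = cv2.calibrateCamera(obj_points, left_points, image_size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_points, right_points, image_size, None, None)

# Extrinsic parameters: rotation R and translation T between the two cameras.
rms, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_points, left_points, right_points, K1, d1, K2, d2, image_size,
    flags=cv2.CALIB_FIX_INTRINSIC)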

Figure 2: The search space for matching image locations is only one-dimensional. Left: the epipolar lines are curved in the distorted raw images. Middle: removing image distortions results in straight epipolar lines. Right: rectification aligns the epipolar lines with the image axes, so the correspondence search can be carried out along image scan lines.

Processing steps for depth computation
Three steps are now necessary to compute the 3D location of each pixel of an image pair. These steps must be performed in real time for each captured stereo image pair to obtain a 3D point cloud or surface of the scene.
1 Rectification
 

In order to triangulate the imaged points we need to identify corresponding image parts in the left and right image. Considering a small image patch from the left image, the entire right image could be searched for a sufficiently good match, but this is computationally intensive and too time consuming to be done in real time. Consider the stereo image pair example in Figure 3 (right), with the cone tip visible in the top of the left image of the pair. Where do we have to search for the cone tip in the second image of the pair? Intuitively, it does not seem necessary to search for the tip of the cone in the bottom half of the right image when the cameras are mounted side by side. In fact, the geometry of the two projective cameras allows us to restrict the search to a one-dimensional line in the right image, the so-called epipolar line.

Figure 2 (above left) shows a few hand-marked point correspondences and their epipolar lines. In the raw camera images the epipolar lines are curved due to distortions caused by the camera optics. Searching for correspondences along these curved lines would be slow and complicated, but the image distortions can be removed by applying the inverse of the distortion model learnt during the calibration process. The resulting undistorted images have straight epipolar lines, depicted in Figure 2 (above middle).

Although straight, the epipolar lines will have different orientations in different parts of each image. This is caused by the image planes (i.e. the camera sensors) neither being perfectly co-planar nor identically oriented. To further accelerate the correspondence search we can use the camera geometry from the calibration and apply an additional perspective transformation to the images such that the epipolar lines are aligned with the image scan lines*. This step is called rectification.

The search for the tip of the white cone can now be carried out by simply looking at the same scan line in the right image and finding the best matching position. All further processing takes place in the rectified images only; the resulting images are shown in Figure 2 (above right).
*Mathematically speaking, the rays captured by the same scan line of the rectified left and right image are all contained within a single plane.
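
As a continuation of the OpenCV sketch above (again not from the original article), rectification can be implemented by computing the rectification maps once from the calibration results and then applying them to every captured pair. K1, d1, K2, d2, R, T and image_size carry over from the calibration snippet, and raw_left/raw_right stand for a freshly captured image pair.

import cv2

# Rectifying rotations (R1, R2), new projection matrices (P1, P2) and the
# 4x4 disparity-to-depth matrix Q, all derived once from the calibration data.
R1, R2, P1, P2, Q, roi_left, roi_right = cv2.stereoRectify(K1, d1, K2, d2, image_size, R, T)

# Per-camera lookup maps that undistort and rectify in one step; also computed once.
map_lx, map_ly = cv2.initUndistortRectifyMap(K1, d1, R1, P1, image_size, cv2.CV_16SC2)
map_rx, map_ry = cv2.initUndistortRectifyMap(K2, d2, R2, P2, image_size, cv2.CV_16SC2)

# Applied to every captured frame; afterwards, corresponding points lie on the
# same scan line in the left and right image.
rect_left = cv2.remap(raw_left, map_lx, map_ly, cv2.INTER_LINEAR)
rect_right = cv2.remap(raw_right, map_rx, map_ry, cv2.INTER_LINEAR)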

2 Stereo matching
  For each pixel in the left image, we can now search for the pixel on the same scan line in the right image which captured the same object point.

Because a single pixel value is typically not discriminative enough to reliably find the corresponding pixel, it is more effective to match small windows (e.g. 7x7 pixels) around each pixel against all possible windows on the same row of the right image. The entire row need not be searched; only a limited number of pixels to the left of the left image pixel’s x-coordinate, corresponding to the slightly cross-eyed gaze necessary to focus on near objects. This accelerates the matching and restricts the depth range in which points can be triangulated. If a sufficiently good and unique match is found, the left image pixel is associated with the corresponding right image pixel. The association is stored in the disparity map in the form of an offset between the pixels’ x-positions (Figure 4 below).
Figure 4: Result of image matching. The disparity map represents depth information in the form of pixel shifts between the left and right image. A special value (black) is used to indicate that a pixel could not be identified in the right image. This happens for occluded areas or for reflections on the object, which appear differently in the two cameras.
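
A deliberately naive sketch of this local window matching is given below (plain NumPy, far too slow for real-time use and not taken from the article); OpenCV's StereoBM implements the same idea efficiently. The window size and disparity range are arbitrary example values.

import numpy as np

def local_stereo_match(left, right, window=7, max_disparity=64):
    """Naive local (block) matching: for each left-image pixel, compare a small
    window against windows shifted to the left along the same scan line in the
    right image and keep the offset with the smallest sum of absolute differences."""
    half = window // 2
    h, w = left.shape
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    disparity = np.zeros((h, w), np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disparity, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            for d in range(max_disparity):          # search only to the left
                candidate = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost = np.abs(patch - candidate).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d                # pixel offset between the two views
    return disparity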

This matching technique is called local stereo matching, as it only uses local information around each pixel. It works when a region in the left and right images can be matched because it is sufficiently distinct from other image parts on the same scan line. Local stereo matching will fail on objects and in regions where poor or repetitive texture prevents a unique match from being found.

Other methods, known as global stereo matching, also exploit neighbouring information: instead of considering each pixel (or image patch) individually when searching for a matching partner, they try to find an assignment for all left and right image pixels at once. This global assignment also takes into account that surfaces are mostly smooth and thus neighbouring pixels will often have similar depths. Global methods are more complex and need more processing power than the local approach, but they require less texture on the surfaces and deliver more accurate results, especially at object boundaries.
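
As a sketch of one widely available middle ground between the two families (our choice of method, not one named in the article), OpenCV's StereoSGBM applies semi-global matching: window-based costs combined with smoothness penalties aggregated along several image directions. The parameter values below are illustrative rather than tuned.

import cv2

# Semi-global matching: window-based costs plus smoothness penalties aggregated
# along several directions through the image. (The P1/P2 penalties here are
# unrelated to the projection matrices of the rectification step.)
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,     # must be a multiple of 16; limits the triangulated depth range
    blockSize=7,           # matching window size, as in the local method above
    P1=8 * 7 * 7,          # penalty for small disparity changes between neighbours
    P2=32 * 7 * 7,         # larger penalty for big jumps, favouring smooth surfaces
    uniquenessRatio=10)    # reject matches not clearly better than the runner-up

# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(rect_left, rect_right).astype("float32") / 16.0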
   
3 Reprojection
  Regardless of which matching technique is used, the result is always an association between pixels of the left and right image, stored in the disparity map. The values in the disparity map encode the offset in pixels at which the corresponding location was found in the right image. Figure 4 (above) illustrates the disparity notion. The camera geometry obtained during calibration can then be reused to convert the pixel-based disparity values into actual metric X, Y and Z co-ordinates for every pixel. This conversion is called reprojection.

We can simply intersect the two rays of each associated left and right image pixel, as illustrated earlier in Figure 1 (above). The resulting XYZ data is called a point cloud and is often stored as a three-channel image so that the points’ neighbourhood information from the image’s pixel grid is also retained. A visualization of the point cloud is shown in Figure 5 (right): the 3D surface generated from the disparity map and the camera calibration data, textured with the left camera image (here converted to gray scale).
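
Continuing the OpenCV sketch, the disparity map and the Q matrix from the rectification step can be reprojected into a three-channel XYZ image; the variable names (disparity, Q, rect_left) carry over from the earlier snippets and are our assumptions, not part of the article.

import cv2

# Reproject every pixel's disparity into metric X, Y, Z using the 4x4
# disparity-to-depth matrix Q obtained from cv2.stereoRectify.
points_xyz = cv2.reprojectImageTo3D(disparity, Q)   # shape: (height, width, 3)

# Keep only pixels with a valid match (disparity > 0) and pair each 3D point
# with its grey value from the rectified left image as texture.
valid = disparity > 0
point_cloud = points_xyz[valid]
colours = rect_left[valid]
print(point_cloud.shape, colours.shape)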

Adept Turnkey Pty Ltd are "The Machine Vision and Imaging Specialists" and distributors of Machine Vision products in Australia and New Zealand. To find out more about any machine vision product, please call us at Perth (08) 9242 5411 / Sydney (02) 9905 5551 / Melbourne (03) 9384 1775 or contact us online.

 

 
