16.485: Visual Navigation for Autonomous Vehicles (VNAV) — Fall 2021

Lecture 14: 2-view Geometry

Lecturer: Luca Carlone    Scribes: -

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor(s).

This lecture discusses:

• basic principles of 2-view geometry for calibrated cameras (Epipolar constraint, Essential matrix),

• how to compute the essential matrix from pixel correspondences in 2 camera images,

• how to estimate the relative pose (up to scale) between the two cameras from the Essential matrix.

The presentation mostly follows Chapter 5 in [1, Sections 5.1-5.3].

14.1 Epipolar constraint and Essential matrix

From Lecture 12, we know how to compute keypoint correspondences in two images using feature tracking or descriptor-based feature matching. In other words, given a pixel $x_1$ in image $I_1$, we are able to compute the corresponding pixel $x_2$ in image $I_2$ (assuming the two images are picturing the same scene). Note that "corresponding pixels" refers to the fact that the two pixels picture the same 3D point.

In this lecture our goal is to compute the geometry of the two cameras taking images $I_1$ and $I_2$ (i.e., the relative pose between the cameras) given a number of pixel correspondences. Towards this goal, we make two main assumptions:

• the pixel correspondences are correct, i.e., for every correspondence $(x_1, x_2)$, the two pixels do represent the same 3D point. We will relax this assumption during the next lecture, since in practice many of the correspondences may be wrong. We also assume that the 3D point does not move.

• the cameras are calibrated, i.e., the calibration matrices

$$K_1 = \begin{bmatrix} s_x f_1 & s_\theta f_1 & o_{x_1} \\ 0 & s_y f_1 & o_{y_1} \\ 0 & 0 & 1 \end{bmatrix} \qquad K_2 = \begin{bmatrix} s_x f_2 & s_\theta f_2 & o_{x_2} \\ 0 & s_y f_2 & o_{y_2} \\ 0 & 0 & 1 \end{bmatrix} \tag{14.1}$$

of each camera are known.
This is an acceptable assumption in robotics, where we can typically calibrate the cameras on our robots and compute $K_1$ and $K_2$ before deployment (note: for a mobile robot $K_1 = K_2$, i.e., both images are collected by the same camera at different time instants).

The perspective projection of the 3D point $p^w$ onto the two cameras (Fig. 14.1) can be written as:

$$p_z^{c_1} \tilde{x}_1 = K_1 \left[ R_w^{c_1} \;\; t_w^{c_1} \right] \tilde{p}^w \qquad p_z^{c_2} \tilde{x}_2 = K_2 \left[ R_w^{c_2} \;\; t_w^{c_2} \right] \tilde{p}^w \tag{14.2}$$

where $(R_w^{c_i}, t_w^{c_i})$ is the (inverse of the) pose of camera $i$ in the world frame, $p_z^{c_i}$ is the depth of the point with respect to camera $i$, and $\tilde{p}^w$ contains the point coordinates with respect to the world frame (note: homogeneous coordinates).

Figure 14.1: 2-view geometry.

Since the point is unknown anyway and we only attempt to compute the relative pose between the two cameras, we may simply assume that $c_1$ is the $w$ frame, which allows simplifying the expressions above as:

$$d_1 \tilde{x}_1 = K_1 \left[ I_3 \;\; 0_3 \right] \tilde{p}^{c_1} = K_1 p^{c_1} \qquad d_2 \tilde{x}_2 = K_2 \left[ R_{c_1}^{c_2} \;\; t_{c_1}^{c_2} \right] \tilde{p}^{c_1} \tag{14.3}$$

where to simplify the notation we also defined $d_1 = p_z^{c_1}$ and $d_2 = p_z^{c_2}$.

Since the calibration matrices are known, we can pre-multiply both members of the equation on the left by $K_1^{-1}$ and both members of the equation on the right by $K_2^{-1}$:

$$d_1 \tilde{y}_1 = p^{c_1} \qquad d_2 \tilde{y}_2 = R_{c_1}^{c_2} p^{c_1} + t_{c_1}^{c_2} \tag{14.4}$$

where $\tilde{y}_1 = K_1^{-1} \tilde{x}_1$ and $\tilde{y}_2 = K_2^{-1} \tilde{x}_2$ (both still expressed in homogeneous coordinates). $\tilde{y}_1$ and $\tilde{y}_2$ are often called "calibrated" pixel coordinates.

Substituting $p^{c_1}$ from the expression on the left into the one on the right:

$$d_2 \tilde{y}_2 = d_1 R_{c_1}^{c_2} \tilde{y}_1 + t_{c_1}^{c_2} \tag{14.5}$$

To simplify notation further, we drop the super- and subscripts for the rotation and translation and write $t$ (instead of $t_{c_1}^{c_2}$) and $R$ (instead of $R_{c_1}^{c_2}$). This should not cause confusion since these are the only translation and rotation we are going to deal with in this lecture.
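The mapping from raw pixels to calibrated coordinates, $\tilde{y} = K^{-1}\tilde{x}$, can be sketched as follows. This is a minimal illustration with NumPy; the focal length and principal point values are made up for the example and do not come from the lecture:

```python
import numpy as np

# Hypothetical calibration matrix: focal length 500 px, zero skew,
# principal point (320, 240). These values are illustrative only.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def calibrated_coords(x, K):
    """Return y~ = K^{-1} x~, the 'calibrated' pixel coordinates of (14.4)."""
    x_h = np.array([x[0], x[1], 1.0])   # homogeneous pixel coordinates
    return np.linalg.solve(K, x_h)      # avoids forming K^{-1} explicitly

y = calibrated_coords((320.0, 240.0), K)
```

As a sanity check, the principal point maps to $(0, 0, 1)^T$, i.e., the direction of the optical axis.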
Therefore, (14.5) becomes:

$$d_2 \tilde{y}_2 = d_1 R \tilde{y}_1 + t \tag{14.6}$$

Pre-multiplying both members by $[t]_\times$:

$$d_2 [t]_\times \tilde{y}_2 = d_1 [t]_\times R \tilde{y}_1 \tag{14.7}$$

where we noticed that $[t]_\times t = 0_3$. Pre-multiplying both members by $\tilde{y}_2^T$:

$$d_2 \tilde{y}_2^T [t]_\times \tilde{y}_2 = d_1 \tilde{y}_2^T [t]_\times R \tilde{y}_1 \tag{14.8}$$

However, $\tilde{y}_2^T [t]_\times \tilde{y}_2 = 0$ (since $[t]_\times \tilde{y}_2$ is orthogonal to $\tilde{y}_2$). Therefore:

$$d_1 \tilde{y}_2^T [t]_\times R \tilde{y}_1 = 0 \tag{14.9}$$

Since $d_1$ is non-zero, this leads to the Epipolar constraint:

$$\tilde{y}_2^T [t]_\times R \, \tilde{y}_1 = 0 \tag{14.10}$$

which relates corresponding pixels in two images. The matrix

$$E = [t]_\times R \tag{14.11}$$

is known as the Essential matrix.

Definitions:

• epipolar plane: plane passing through the optical centers and the point $p$

• epipoles: intersection between the segment connecting the optical centers and the image planes

• epipolar line: line corresponding to the set of pixels $\tilde{y}_2$ in the second image that satisfy the epipolar constraint for a given pixel $\tilde{y}_1$ in the first image.

Geometric interpretation. The epipolar constraint can be written as (see slide 12):

$$(t_{c_1}^{c_2} \times \tilde{y}_2) \perp (R_{c_1}^{c_2} \tilde{y}_1) \iff (t_{c_1}^{c_2} \times \tilde{y}_2)^T R_{c_1}^{c_2} \tilde{y}_1 = 0 \iff ([t_{c_1}^{c_2}]_\times \tilde{y}_2)^T R_{c_1}^{c_2} \tilde{y}_1 = 0 \iff \tilde{y}_2^T [t_{c_1}^{c_2}]_\times R_{c_1}^{c_2} \tilde{y}_1 = 0$$

(the last step uses $([t]_\times \tilde{y}_2)^T = -\tilde{y}_2^T [t]_\times$, where the sign is immaterial since the expression equals zero).

Stereo example (slide 14). For a rectified stereo pair we have $R = I_3$ and $t = (-b, 0, 0)^T$, where $b$ is the baseline; writing $\tilde{y}_i = (u_i, v_i, 1)^T$:

$$\tilde{y}_2^T [t]_\times R \tilde{y}_1 = 0 \iff \tilde{y}_2^T \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & b \\ 0 & -b & 0 \end{bmatrix} \tilde{y}_1 = 0 \iff \tilde{y}_2^T \begin{bmatrix} 0 \\ b \\ -b v_1 \end{bmatrix} = 0 \iff b v_2 - b v_1 = 0 \iff v_2 = v_1$$

i.e., corresponding pixels lie on the same image row.

Properties of the essential matrix:

• The epipolar constraint does not constrain the scale of the translation.

• A matrix is an essential matrix if and only if it has singular values $\{\sigma, \sigma, 0\}$ (in particular $\sigma = \|t\|$), see Theorem 5.5 in [1].

Proof. We only prove that the largest singular value of $E = [t]_\times R$ is $\sigma_{\max}(E) = \|t\|$ and the smallest singular value is zero. A complete (but more involved) proof can be found in [1, Thm 5.5].
$$\sigma_{\max}^2(E) = \max_{\|d\|=1} \|Ed\|^2 = \max_{\|d\|=1} d^T E^T E d = \max_{\|d\|=1} d^T R^T [t]_\times^T [t]_\times R d \tag{14.12}$$

$$= \max_{\|d\|=1} d^T [t]_\times^T [t]_\times d = \max_{\|d\|=1} \|[t]_\times d\|^2 = \max_{\|d\|=1} \|t \times d\|^2 = \|t\|^2 \tag{14.13}$$

where the second line substitutes $d \leftarrow Rd$, which preserves $\|d\| = 1$ since $R$ is orthogonal; the maximum is attained for $d$ orthogonal to $t$. Similarly:

$$\sigma_{\min}^2(E) = \min_{\|d\|=1} \|Ed\|^2 = \min_{\|d\|=1} \|t \times d\|^2 = 0 \tag{14.14}$$

where the minimum is attained for $d$ parallel to $t$.

• The space of the essential matrices is called the Essential space $\mathcal{E}$ (i.e., the space of $3 \times 3$ matrices that can be written as $[t]_\times R$ for some $R \in SO(3)$ and $t \in \mathbb{R}^3$). The projection of a matrix $M$ onto the Essential space can be computed as prescribed in [1, Thm 5.9]:

$$\arg\min_{E \in \mathcal{E}} \|E - M\|_F^2 = U \begin{bmatrix} \frac{\lambda_1+\lambda_2}{2} & 0 & 0 \\ 0 & \frac{\lambda_1+\lambda_2}{2} & 0 \\ 0 & 0 & 0 \end{bmatrix} V^T \tag{14.15}$$

where $M = U \operatorname{diag}(\lambda_1, \lambda_2, \lambda_3) V^T$ is a singular value decomposition of $M$.

14.2 How to estimate the Camera Poses from Correspondences?

In this section we address the following problem: given $N$ (calibrated) pixel correspondences, compute the relative pose (up to scale) between the cameras. This is typically done in 2 steps, discussed below:

• use the pixel correspondences and the epipolar constraint to estimate the essential matrix $E$

• retrieve the relative pose $(R, t)$ between the cameras from the essential matrix $E$

14.2.1 Compute the Essential Matrix from Pixel Correspondences

Assume that we are given $N$ (calibrated) pixel correspondences $(\tilde{y}_{1,k}, \tilde{y}_{2,k})$ for $k = 1, \ldots, N$. As mentioned at the beginning of this document, we assume that there are no outliers (i.e., we do not have wrong correspondences). Each of these correspondences needs to satisfy the epipolar constraint (14.10):

$$\tilde{y}_{2,k}^T E \, \tilde{y}_{1,k} = 0 \qquad k = 1, \ldots, N \tag{14.16}$$

Noticing that $(\tilde{y}_{1,k}, \tilde{y}_{2,k})$ are known pixel values, we realize that these are simply linear equalities in the entries of $E$. Note that the essential matrix can only be computed up to scale, since we can multiply (14.16) by an arbitrary constant without altering the equality. Recalling that $E = [t]_\times R$, this means that we can apply an arbitrary scaling to the vector $t$ without altering the epipolar constraint.
This is consistent with the geometric intuition that from a set of camera images we are not able to resolve the scale of the scene, i.e., we are not able to understand whether we are observing a small scene from a close distance or a large scene from a far distance. Therefore, it is customary to assume $\|t\| = 1$, which means we only try to estimate the direction of the translation rather than its scale.

Eight-point algorithm. In the absence of noise, we can compute the essential matrix by solving a linear system. Rearranging the entries of $E$ in a vector $e \in \mathbb{R}^9$, the set of linear equations (14.16) can be written as:

$$a_k^T e = 0 \qquad k = 1, \ldots, N \tag{14.17}$$

where the $a_k$ are known vectors whose entries are only a function of the pixel correspondences $(\tilde{y}_{1,k}, \tilde{y}_{2,k})$. Stacking the vectors $a_k^T$ as rows of a matrix $A$, the linear equations become:

$$A e = 0 \tag{14.18}$$

For this linear system to admit a unique solution (up to scale), $A \in \mathbb{R}^{N \times 9}$ should have rank 8; therefore, we need $N = 8$ point correspondences to compute the essential matrix using the linear system (14.18).
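The pipeline above (form the epipolar constraints, solve $Ae = 0$, project the result onto the essential space) can be sketched as follows. This is a minimal NumPy illustration on noise-free synthetic data, not the full pipeline: the relative pose and 3D points are made up, and the rows $a_k = \tilde{y}_{1,k} \otimes \tilde{y}_{2,k}$ assume $e$ stacks the columns of $E$, which is one common vectorization convention:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2],  0.0, -t[0]],
                     [-t[1], t[0],  0.0]])

def eight_point(y1s, y2s):
    """Estimate E (up to scale) from N >= 8 calibrated correspondences.

    Row k of A is a_k^T = (y1_k kron y2_k)^T, so that A e = 0 holds with
    e = vec(E) stacked column-wise, cf. (14.17)-(14.18).
    """
    A = np.array([np.kron(y1, y2) for y1, y2 in zip(y1s, y2s)])
    _, _, Vt = np.linalg.svd(A)
    e = Vt[-1]                          # null direction of A
    E = e.reshape(3, 3, order="F")
    # Project onto the essential space as in (14.15); with noise-free
    # data this leaves E unchanged up to round-off.
    U, lam, Vt = np.linalg.svd(E)
    s = (lam[0] + lam[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt

# Made-up ground-truth pose: 0.2 rad rotation about y, unit-norm translation.
th = 0.2
R = np.array([[ np.cos(th), 0.0, np.sin(th)],
              [ 0.0,        1.0, 0.0       ],
              [-np.sin(th), 0.0, np.cos(th)]])
t = np.array([1.0, 0.2, 0.1])
t /= np.linalg.norm(t)                  # customary normalization ||t|| = 1
E_true = skew(t) @ R                    # essential matrix (14.11)

# Random 3D points in front of camera 1, projected into both views (14.4).
rng = np.random.default_rng(0)
pts = rng.uniform([-1.0, -1.0, 3.0], [1.0, 1.0, 6.0], size=(8, 3))
y1s = [p / p[2] for p in pts]
y2s = [(R @ p + t) / (R @ p + t)[2] for p in pts]

# Every correspondence satisfies the epipolar constraint (14.10) ...
residuals = [y2 @ E_true @ y1 for y1, y2 in zip(y1s, y2s)]
# ... and the 8-point estimate recovers E_true up to scale and sign.
E_est = eight_point(y1s, y2s)
```

As predicted by Theorem 5.5, the singular values of `E_true` are $\{\|t\|, \|t\|, 0\} = \{1, 1, 0\}$; and since the data are noise-free and the 8 points are in general position, `E_est` agrees with `E_true` up to a global scale factor and sign. With noisy correspondences one would use $N > 8$ points, take the right singular vector of the smallest singular value of $A$, and rely on the projection step (14.15).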