Skip to main content

ARCore and Arkit, What is under the hood: SLAM (Part 2)

In our last blog post (part 1), we took a look at how algorithms detect keypoints in camera images. These form the basis of our world tracking and environment recognition. But for Mixed Reality, that alone is not enough. We have to be able to calculate the 3d position in the real world. It is often calculated by the spatial distance between itself and multiple keypoints. This is often called Simultaneous Localization and Mapping (SLAM). And this is what is responsible for all the world tracking we see in ARCore/ARKit.

What we will cover today:

  • How ARCore and ARKit does it's SLAM/Visual Inertia Odometry
  • Can we D.I.Y our own SLAM with reasonable accuracy to understand the process better

Sensing the world: as a computer

When we start any augmented reality application in mobile or elsewhere, the first thing it tries to do is to detect a plane. When you first start any MR app in ARKit, ARCore, the system doesn't know anything about the surroundings. It starts processing data from camera and pairs it up with other sensors.
Once it has those data it tries to do the following two things
  1. Build a point cloud mesh of the environment by building a map
  2. Assign a relative position of the device within that perceived environment
From our previous article, we know it's not always easy to build this map from unique feature points and maintain that. However, that becomes easy in certain scenarios if you have the freedom to place beacons at different known locations. Something we did at Mozfest 2016 when Mozilla still had the Magnets project which we had utilized as our beacons. A similar approach is used in a few museums for providing turn by turn navigation to point of interests as their indoor navigation system. However Augmented Reality systems don't have this luxury.

A little saga about relationships

We will start with a map.....about relationships. Or rather "A Stochastic Map For Uncertain Spatial Relationships" by Smith et al. 
In the real world, you have precise and correct information about the exact location of every object. However in AR world that is not the case. For understanding the case lets assume we are in an empty room and our mobile has detected a reliable unique anchor (A) (or that can be a stationary beacon) and our position is at (B). 
In a perfect situation, we know the distance between A and B, and if we want to move towards C we can infer exactly how we need to move.

Unfortunately, in the world of AR and SLAM we need to work with imprecise knowledge about the position of A and C. This results in uncertainties and the need to continually correct the locations. 

The points have a relative spatial relationship with each other and that allows us to get a probability distribution of every possible position. Some of the common methods to deal with the uncertainty and correct positioning errors are Kalman Filter (this is what we used in Mozfest), Maximum Posteriori Estimation or Bundle Adjustment. 
Since these estimations are not perfect, every new sensor update also has to update the estimation model.

Aligning the Virtual World

To map our surroundings reliably in Augmented Reality, we need to continually update our measurement data. The assumptions are, every sensory input we get contains some inaccuracies. We can take help from Milios et al in their paper "Globally Consistent Range Scan Alignment for Environment Mapping" to understand the issue. 
Image credits: Lu, F., & Milios, E. (1997). Globally consistent range scan alignment for environment mapping
Here in figure a, we see how going from position P1....Pn accumulates little measurement errors over time until the resulting environment map is wrong. But when we align the scan sin fig b, the result is considerably improved. To do that, the algorithm keeps track of all local frame data and network spatial relations among those.
A common problem at this point is how much data to store to keep doing the above correctly. Often to reduce complexity level the algorithm reduces the keyframes it stores.

Let's build the map a.k.a SLAM

To make Mixed Reality feasible, SLAM has the following challenges to handle
  1. Monocular Camera input
  2. Real-time
  3. Drift

Skeleton of SLAM

How do we deal with these in a Mixed Reality scene?
We start with the principles by Cadena et. al in their "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age" paper. From that paper, we can see the standard architecture of SLAM to be something like
Image Credit: Cadena et al
If we deconstruct the diagram we get the following four modules
  1. Sensor: On mobiles, this is primarily Camera, augmented by accelerometer, gyroscope and depending on the device light sensor. Apart from Project Tango enabled phones, nobody ahd depth sensor for Android.
  2. Front End: The feature extraction and anchor identification happens here as we described in previous post.
  3. Back End: Does error correction to compensate for the drift and also takes care of localizing pose model and overall geometric reconstruction.
  4. SLAM estimate: This is the result containing the tracked features and locations.
To better understand it, we can take a look at one of the open source implementations of SLAM.

D.I.Y SlAM: Taking a peek at ORB-SLAM

To try our hands on to understand how SLAM works let's take a look at a recent algorithm by Montiel et al called ORB-SLAM. We will use the code of its successor ORB-SLAM2. The algorithm is available in Github under GPL3 and I found this excellent blog which goes into nifty details on how we can run ORB-SLAM2 in our computer. I highly encourage you to read that to avoid encountering problems at the setup.
His talk is also available here to see and is very interesting


ORB-SLAM just uses the camera and doesn't utilize any other gyroscope or accelerometer inputs. But the result is still impressive.

  1. Detecting Features: ORB-SLAM, as the name suggests uses ORB to find keypoint and generate binary descriptors. Internally ORB is based on the same method to find keypoint and generating binary descriptors as we discussed in part 1 for BRISK. In short ORB-SLAM analyzes each picture to find keyframe and then store it with a reference to the keyframe in a map. These are utilized in future to correct historical data.
  2. Keypoint > 3d landmark: The algorithm looks for new frames from the image and when it finds one it performs keypoint detection on it. These are then matched with the previous frame to get a spatial distance. This so far provides a good idea on where it can find the same key points again in a new frame. This provides the initial camera pose estimation.
  3. Refine Camera Pose: The algorithm repeats Step 2 by projecting the estimated initial camera pose into next camera frame to search for more keypoint which corresponds to the one it already knows. If it is certain it can find them, it uses the additional data to refine the pose and correct any spatial measurement error.
green squares  = tracked keypoints. Blue boxes: keyframes. Red box = camera view. Red points = local map points.
Image credits: ORB-SLAM video by Raúl Mur Artal


Returning home a.k.a Loop Closing

One of the goals of MR is when you walk back to your starting point it should understand you have returned. The inherent inefficiency and the induced error make it hard to accurately predict this. This is called loop closing for SLAM. ORB-SLAM handles it by defining a threshold. It tries to match keypoints in a frame with next frames and if the previously detected frames matching percentage exceeds a threshold then it knows you have returned.
Loop Closing performed by the ORB-SLAM algorithm.
Image credits: Mur-Artal, R., Montiel
To account for the error, the algorithm has to propagate coordinate correction throughout the whole frame with updated knowledge to know the loop should be closed
The reconstructed map before (up) and after (down) loop closure.
Image credits: Mur-Artal, R., Montiel

SLAM today:

Google: ARCore's documentation describes it's tracking method as "concurrent odometry and mapping" which is essentially SLAM+sensor inputs. Their patent also indicates they have included inertial sensors into the design.

Apple: Apple also is using Visual Interial Odometry which they acquired by buying Metaio and FlyBy. I learned a lot about what they are doing by having a look at this video at WWDC18.

Additional Read: I found this "A comparative analysis of tightly-coupled monocular, binocular, and stereo VINS" paper to be a nice read to see how different IMU's are used and compared. IMU's are the devices that provide all this sensory data to our devices today. And their calibration is supposed to be crazy difficult. 

I hope this post along with the previous one provides a better understanding of how our world is tracked inside ARCore/ARKit.

In a few days, I will start another blog series on how to build Mixed Reality applications and use experimental as well as some stable WebXR api's to build Mixed Reality application demos.
As always feedbacks are welcome.

References/Interesting Reads:

Comments

Popular posts from this blog

FirefoxOS, A keyboard and prediction: Story of my first contribution

Returning to my cubical holding a hot cup of coffee and with a head loaded with frustration and panic over a system codebase that I managed to break with no sufficient time to fix it before the next morning.  This was at IBM, New York where I was interning and working on the TJ Watson project. I returned back to my desk, turned on my dual monitors, started reading some blogs and engaging on Mozilla IRC (a new found and pretty short lived hobby). Just a few days before that, FirefoxOS was launched in India in the form of an Intex phone with a $35 price tag. It was making waves all around, because of its hefty price and poor performance . The OS struggle was showing up in the super low cost hardware. I was personally furious about some of the shortcomings, primarily the keyboard which at that time didn’t support prediction in any language other than English and also did not learn new words. Coincidentally, I came upon Dietrich Ayala in the FirefoxOS IRC channel, who...

Curious case of Cisco AnyConnect and WSL2

One thing Covid has taught me is the importance of VPN. Also one other thing COVID has taught me while I work from home  is that your Windows Machine can be brilliant  as long as you have WSL2 configured in it. So imagine my dismay when I realized I cannot access my University resources while being inside the University provided VPN client. Both of the institutions I have affiliation with, requires me to use VPN software which messes up WSL2 configuration (which of course I realized at 1:30 AM). Don't get me wrong, I have faced this multiple times last two years (when I was stuck in India), and mostly I have been lazy and bypassed the actual problem by side-stepping with my not-so-noble  alternatives, which mostly include one of the following: Connect to a physical machine exposed to the internet and do an ssh tunnel from there (not so reliable since this is my actual box sitting at lab desk, also not secure enough) Create a poor man's socks proxy in that same box to have...

April Fool and Google Part 2: A Round Up of ALL of Google’s April Fools Jokes

Ok....this post I think will contain all of the pranks I could find  for today. After my last post here http://rkrants.blogspot.com/2012/04/april-fool-and-google-my-favorite.html Last Time I reported Only a handful of the pranks.. Understandable, as it was only the morning. After that I stumbled upon more of them Which I am gonna round up here. Now staring with the list. The very first one is obviously our favourite Google Maps Quest The above is their official video. In a post in Google Plus they say about it as follows  Today  + Google Maps  announced Google Maps 8-bit for NES. With #8bitmaps , you can do everything you'd normally do in Maps—search for famous landmarks and sites around the world, get directions and even use Street View. Just in time for April Fool's Day, Google has introduced Google Maps Quest, a retro 8-bit version of its mapping tool that is... totally awesome. In a characteristically whimsical video, availabl...