
ARCore and ARKit, What is under the hood: SLAM (Part 2)

In our last blog post (part 1), we took a look at how algorithms detect keypoints in camera images. These form the basis of our world tracking and environment recognition. But for Mixed Reality, that alone is not enough. We also have to be able to calculate the device's 3D position in the real world, which is typically inferred from the spatial distances between the device and multiple keypoints. This process is called Simultaneous Localization and Mapping (SLAM), and it is responsible for all the world tracking we see in ARCore and ARKit.

What we will cover today:

  • How ARCore and ARKit do their SLAM/Visual Inertial Odometry
  • Can we D.I.Y. our own SLAM with reasonable accuracy to understand the process better?

Sensing the world: as a computer

When we start an augmented reality application, on mobile or elsewhere, the first thing it tries to do is detect a plane. When you first start an MR app built on ARKit or ARCore, the system doesn't know anything about its surroundings. It starts processing data from the camera and pairing it up with data from other sensors.
Once it has that data, it tries to do the following two things:
  1. Build a point-cloud map of the environment
  2. Assign a relative position of the device within that perceived environment
From our previous article, we know it's not always easy to build this map from unique feature points and maintain it. However, that becomes easier in certain scenarios if you have the freedom to place beacons at different known locations. This is something we did at MozFest 2016, when Mozilla still had the Magnets project, which we used as our beacons. A similar approach is used in a few museums to provide turn-by-turn navigation to points of interest as their indoor navigation system. Augmented Reality systems, however, don't have this luxury.

A little saga about relationships

We will start with a map... about relationships. Or rather, "A Stochastic Map for Uncertain Spatial Relationships" by Smith et al.
In the real world, you have precise and correct information about the exact location of every object. In the AR world, that is not the case. To understand the problem, let's assume we are in an empty room, our mobile device has detected a reliable unique anchor (A) (or a stationary beacon), and our position is at (B).
In a perfect situation, we know the distance between A and B, and if we want to move towards C we can infer exactly how we need to move.

Unfortunately, in the world of AR and SLAM we need to work with imprecise knowledge about the position of A and C. This results in uncertainties and the need to continually correct the locations. 

The points have relative spatial relationships with each other, and that allows us to get a probability distribution of every possible position. Some of the common methods to deal with the uncertainty and correct positioning errors are the Kalman Filter (this is what we used at MozFest), Maximum a Posteriori (MAP) estimation, and Bundle Adjustment.
Since these estimations are not perfect, every new sensor update also has to update the estimation model.
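To build some intuition for how such a filter fuses noisy readings, here is a minimal one-dimensional Kalman filter sketch in Python. Real SLAM systems filter full 6-DoF poses and landmark positions rather than a single distance, and every name and number below is made up purely for illustration.

```python
class SimpleKalman1D:
    """Toy 1D Kalman filter: fuse noisy distance readings to an anchor."""

    def __init__(self, initial_estimate, initial_variance, process_var, measurement_var):
        self.x = initial_estimate   # current best estimate of the distance
        self.p = initial_variance   # uncertainty of that estimate
        self.q = process_var        # how much the true value may drift per step
        self.r = measurement_var    # noise of each individual sensor reading

    def update(self, measurement):
        # Predict: uncertainty grows while the device moves
        self.p += self.q
        # Correct: blend prediction and measurement, weighted by their uncertainties
        k = self.p / (self.p + self.r)          # Kalman gain
        self.x += k * (measurement - self.x)
        self.p *= (1 - k)
        return self.x

# Noisy readings of the distance to anchor A (true value around 2.0 m)
kf = SimpleKalman1D(initial_estimate=2.5, initial_variance=1.0,
                    process_var=0.01, measurement_var=0.25)
for z in [2.3, 1.8, 2.1, 1.9, 2.05]:
    print(round(kf.update(z), 3))
```

Each print shows the estimate moving closer to the true distance as new imperfect measurements arrive, which is exactly the "update the estimation model on every sensor update" loop described above.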

Aligning the Virtual World

To map our surroundings reliably in Augmented Reality, we need to continually update our measurement data. The assumption is that every sensory input we get contains some inaccuracy. We can take help from Lu and Milios and their paper "Globally Consistent Range Scan Alignment for Environment Mapping" to understand the issue.
Image credits: Lu, F., & Milios, E. (1997). Globally consistent range scan alignment for environment mapping
Here in figure a, we see how going from positions P1...Pn accumulates little measurement errors over time until the resulting environment map is wrong. But when we align the scans in figure b, the result is considerably improved. To do that, the algorithm keeps track of all local frame data and a network of spatial relations among them.
A common problem at this point is how much data to store to keep doing the above correctly. Often, to reduce complexity, the algorithm limits the number of keyframes it stores, as sketched below.
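As a purely hypothetical illustration of such a reduction, the snippet below keeps a new keyframe only when the current frame sees mostly new content compared to the last stored keyframe. The function name, thresholds and overlap criterion are assumptions for illustration, not the exact rule used by any particular SLAM system.

```python
def should_add_keyframe(tracked_ids, last_keyframe_ids,
                        max_overlap=0.9, min_features=50):
    """Store a frame as a keyframe only if it still tracks enough features
    but overlaps little enough with the previous keyframe to add new information."""
    if len(tracked_ids) < min_features:
        return False                                  # too few features to be reliable
    overlap = len(tracked_ids & last_keyframe_ids) / len(tracked_ids)
    return overlap < max_overlap                      # mostly new content -> keep it

# Example: 60% of the current features were already seen in the last keyframe
current_frame_features = set(range(100))
last_keyframe_features = set(range(40, 140))
print(should_add_keyframe(current_frame_features, last_keyframe_features))  # True
```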

Let's build the map a.k.a SLAM

To make Mixed Reality feasible, SLAM has the following challenges to handle:
  1. Monocular camera input
  2. Real-time performance
  3. Drift

Skeleton of SLAM

How do we deal with these in a Mixed Reality scene?
We start with the principles laid out by Cadena et al. in their paper "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age". From that paper, we can see that the standard architecture of SLAM looks something like this:
Image Credit: Cadena et al
If we deconstruct the diagram, we get the following four modules:
  1. Sensor: On mobiles, this is primarily the camera, augmented by the accelerometer, gyroscope and, depending on the device, a light sensor. Apart from Project Tango enabled phones, no Android device had a depth sensor.
  2. Front End: The feature extraction and anchor identification happens here, as we described in the previous post.
  3. Back End: Does error correction to compensate for drift, and also takes care of localizing the pose model and the overall geometric reconstruction.
  4. SLAM estimate: This is the result containing the tracked features and locations.
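To make the data flow between these four modules concrete, here is a minimal structural sketch in Python. The class and function names (Sensor, FrontEnd, BackEnd, run_slam) are hypothetical placeholders rather than any real API, and the bodies only fake the data so the skeleton runs end to end.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SlamEstimate:
    poses: List[Tuple[float, float, float]] = field(default_factory=list)  # device trajectory
    landmarks: list = field(default_factory=list)                          # tracked map points

class Sensor:
    """1. Sensor: on a phone this wraps the camera, accelerometer and gyroscope."""
    def frames(self):
        for i in range(3):
            yield {"id": i, "image": None}           # placeholder camera frames

class FrontEnd:
    """2. Front end: feature extraction and data association (see part 1)."""
    def process(self, frame):
        return [(frame["id"], k) for k in range(5)]  # fake keypoint observations

class BackEnd:
    """3. Back end: drift correction and geometric reconstruction."""
    def __init__(self):
        self.estimate = SlamEstimate()
    def optimize(self, observations):
        self.estimate.poses.append((float(len(self.estimate.poses)), 0.0, 0.0))
        self.estimate.landmarks.extend(observations)
        return self.estimate

def run_slam():
    sensor, front_end, back_end = Sensor(), FrontEnd(), BackEnd()
    estimate = None
    for frame in sensor.frames():
        observations = front_end.process(frame)      # front end
        estimate = back_end.optimize(observations)   # back end
    return estimate                                  # 4. the SLAM estimate

print(run_slam())
```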
To better understand it, we can take a look at one of the open source implementations of SLAM.

D.I.Y. SLAM: Taking a peek at ORB-SLAM

To try our hand at understanding how SLAM works, let's take a look at a recent algorithm by Mur-Artal, Montiel et al. called ORB-SLAM. We will use the code of its successor, ORB-SLAM2. The algorithm is available on GitHub under GPLv3, and I found this excellent blog post which goes into the nifty details of how to run ORB-SLAM2 on our own computer. I highly encourage you to read it to avoid problems during setup.
His talk is also available to watch here and is very interesting.


ORB-SLAM uses just the camera and doesn't utilize any gyroscope or accelerometer inputs, but the result is still impressive.

  1. Detecting Features: ORB-SLAM, as the name suggests, uses ORB to find keypoints and generate binary descriptors. Internally, ORB is based on the same approach to finding keypoints and generating binary descriptors that we discussed in part 1 for BRISK. In short, ORB-SLAM analyzes each picture for keypoints and stores them in a map, together with a reference to the keyframe they came from. These are used later to correct historical data.
  2. Keypoint > 3D landmark: The algorithm looks for new frames from the camera, and when it finds one it performs keypoint detection on it. The keypoints are then matched against the previous frame to get their spatial relationship. This gives a good idea of where the same keypoints can be found again in a new frame and provides the initial camera pose estimation.
  3. Refine Camera Pose: The algorithm repeats step 2 by projecting the estimated initial camera pose into the next camera frame to search for more keypoints which correspond to the ones it already knows. If it is certain it can find them, it uses the additional data to refine the pose and correct any spatial measurement error. (A minimal sketch of these steps follows this list.)
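Here is a minimal sketch of steps 1-3 using OpenCV's ORB implementation. The image file names and the intrinsic matrix K are placeholders, and the pose comes from a plain essential-matrix decomposition between two frames rather than ORB-SLAM's full tracking pipeline.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (placeholder file names)
frame1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# 1. Detect ORB keypoints and compute binary descriptors in both frames
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(frame1, None)
kp2, des2 = orb.detectAndCompute(frame2, None)

# 2. Match descriptors between the frames (Hamming distance suits binary descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 3. Estimate the relative camera pose from the matched points.
# K is an assumed pinhole intrinsic matrix; a real app reads it from device calibration.
K = np.array([[700, 0, 320], [0, 700, 240], [0, 0, 1]], dtype=np.float64)
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
print("Rotation:\n", R, "\nTranslation direction:\n", t.ravel())
```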
Green squares = tracked keypoints. Blue boxes = keyframes. Red box = camera view. Red points = local map points.
Image credits: ORB-SLAM video by Raúl Mur Artal


Returning home a.k.a Loop Closing

One of the goals of MR is that when you walk back to your starting point, the system should understand you have returned. The inherent imprecision and the accumulated error make this hard to detect accurately. In SLAM this is called loop closing. ORB-SLAM handles it by defining a threshold: it tries to match the keypoints of the current frame against previously seen keyframes, and if the matching percentage with a previously detected frame exceeds the threshold, it knows you have returned.
Loop Closing performed by the ORB-SLAM algorithm.
Image credits: Mur-Artal, R., Montiel
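A heavily simplified version of that threshold check could look like the function below. The real ORB-SLAM performs place recognition with a bag-of-words model (DBoW2), so the function name, distance cut-off and ratio threshold here are assumptions purely for illustration.

```python
def detect_loop(current_des, keyframe_descriptors, matcher,
                match_ratio_threshold=0.6, max_hamming_distance=50):
    """Return the index of an older keyframe we appear to have returned to, or None.

    `matcher` is e.g. cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).
    """
    for idx, kf_des in enumerate(keyframe_descriptors):
        matches = matcher.match(current_des, kf_des)
        good = [m for m in matches if m.distance < max_hamming_distance]
        ratio = len(good) / max(len(current_des), 1)
        if ratio > match_ratio_threshold:
            return idx        # enough of the current view matches this old keyframe
    return None
```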
To account for the error, the algorithm then has to propagate the coordinate correction throughout the whole map with this updated knowledge, so that the loop can actually be closed.
The reconstructed map before (top) and after (bottom) loop closure.
Image credits: Mur-Artal, R., Montiel
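To get an intuition for what "propagating the correction" means, here is a toy sketch that spreads the closure error linearly along the loop of estimated poses. Real systems solve this with pose-graph optimization over all keyframes, so the linear weighting below is a deliberate simplification.

```python
import numpy as np

def distribute_loop_error(poses, loop_error):
    """Spread the drift accumulated from P1..Pn over the whole loop.

    `poses` are estimated 2D positions; `loop_error` is the gap discovered at closure.
    Later poses absorbed more drift, so they receive a larger share of the correction."""
    n = len(poses)
    return [p - (i / (n - 1)) * loop_error for i, p in enumerate(poses)]

# Toy example: a square walk that should end exactly where it started
poses = [np.array(p, dtype=float) for p in
         [(0, 0), (1, 0), (1, 1), (0, 1), (0.2, 0.1)]]   # last pose should be (0, 0)
loop_error = poses[-1] - poses[0]
for corrected in distribute_loop_error(poses, loop_error):
    print(corrected)
```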

SLAM today:

Google: ARCore's documentation describes its tracking method as "concurrent odometry and mapping", which is essentially SLAM plus sensor inputs. Their patent also indicates they have incorporated inertial sensors into the design.

Apple: Apple is also using Visual Inertial Odometry, technology it acquired by buying Metaio and FlyBy. I learned a lot about what they are doing by watching this video from WWDC18.

Additional read: I found the paper "A comparative analysis of tightly-coupled monocular, binocular, and stereo VINS" to be a nice read on how different IMUs are used and compared. IMUs are the units that provide all this inertial sensor data to our devices today, and their calibration is supposed to be crazy difficult.

I hope this post along with the previous one provides a better understanding of how our world is tracked inside ARCore/ARKit.

In a few days, I will start another blog series on how to build Mixed Reality applications, using experimental as well as some stable WebXR APIs to build Mixed Reality application demos.
As always, feedback is welcome.

