Skip to main content

ARCore and Arkit: What is under the hood : Anchors and World Mapping (Part 1)

Reading Time: 7 MIn
Some of you know I have been recently experimenting a bit more with WebXR than a WebVR and when we talk about mobile Mixed Reality, ARkit and ARCore is something which plays a pivotal role to map and understand the environment inside our applications.

I am planning to write a series of blog posts on how you can start developing WebXR applications now and play with them starting with the basics and then going on to using different features of it. But before that, I planned to pen down this series of how actually the "world mapping" works in arcore and arkit. So that we have a better understanding of the Mixed Reality capabilities of the devices we will be working with.

Mapping: feature detection and anchors

Creating apps that work seamlessly with arcore/kit requires a little bit of knowledge about the algorithms that work in the back and that involves knowing about Anchors.

What are anchors:

Anchors are your virtual markers in the real world. As a developer, you anchor any virtual object to the real world and that 3d model will stay glued into the physical location of the real world. Anchors also get updated over time depending on the new information that the system learns. For example, if you anchor a pikcahu 2 ft away from you and then you actually walk towards your pikachu, it realises the actual distance is 2.1ft, then it will compensate that. In real life we have a static coordinate system where every object has it's own x,y,z coordinates. Anchors in devices like HoloLens override the rotation and position of the transform component.

How Anchors stick?

If we follow the documentation in Google ARCore then we see it is attached to something called "trackables", which are feature points and planes in an image. Planes are essentially clustered feature points. You can have a more in-depth look at what Google says ARCore anchor does by reading their really nice Fundamentals. But for our purposes, we first need to understand what exactly are these Feature Points.

Feature Points: through the eyes of computer vision

Feature points are distinctive markers on an image that an algorithm can use to track specific things in that image. Normally any distinctive pattern, such as T-junctions, corners are a good candidate. They lone are not too useful to distinguish between each other and reliable place marker on an image so the neighbouring pixels of that image are also analyzed and saved as a descriptor.
Now a good anchor should have reliable feature points attached to it. The algorithm must be able to find the same physical space under different viewpoints. It should be able to accommodate the change in 
  • Camera Perspective
  • Rotation
  • Scale
  • Lightning
  • Motion blur and noise

Reliable Feature Points

This is an open research problem with multiple solutions on board. Each with their own set of problems though. One of the most popular algorithms stem from this paper by David G. Lowe at IJCV is called Scale Invariant Feature Transform (SIFT). Another follow up work which claims to have even better speed was published in ECCV in 2006 called Speeded Up Robust Features (SURF) by Bay et al. Thought both of them are patented at this point. 

Microsoft Hololens doesn't really need to do all this heavy lifting since it can rely on an extensive amount of sensor data. Especially it gets aid from the depth sensor data from its Infrared Sensors. However, ARCore and ARkit doesn't enjoy those privileges and has to work with 2d images. Though we cannot say for sure which of the algorithm is actually used for ARKit or ARCore we can try to replicate the procecss with a patent-free algorithm to understand how the process actually works.

D.I.Y Feature Detection and keypoint descriptors

To understand the process we will use an algorithm by Leutenegger et al called BRISK. To detect feature we must follow a multi-step process. A typical algorithm would adhere to the following two steps
  1. Keypoint Detection: Detecting keypoint can be as simple as just detecting corners. Which essentially evaluates the contrast between neighbouring pixels. A common way to do that in a large-scale image is to blur the image to smooth out pixel contrast variation and then do edge detection on it. The rationale for this is that you would normally want to detect a tree and a house as a whole to achieve reliable tracking instead of every single twig or window. SIFT and SURF adhere to this approach. However, for real-time scenario blurring adds a compute penalty which we don't want. In their paper "Machine Learning for High-Speed Corner Detection" Rosten and Drummond proposed a method called FAST which analyzes the circular surrounding if each pixel p. If the neighbouring pixels brightness is lower/higher than  and a certain number of connected pixels fall into this category then the algorithm found a corner. 
    Image credits: Rosten E., Drummond T. (2006) Machine Learning for High-Speed Corner Detection.. Computer Vision – ECCV 2006. 
    Now back to BRISK, for it out of the 16-pixel circle, 9 consecutive pixels must be brighter or darker than the central one. Also BRISk uses down-sized images allowing it to achieve better invariance to scale.
  2. Keypoint Descriptor: The primary property of the detected keypoints should be that they are unique. The algorithm should be able to find the feature in a different image with a different viewpoint, lightning. BRISK concatenates the brightness comparison results between different pixels surrounding the centre keypoint and concatenates them to a 512 bit string.
    Image credits: S. Leutenegger, M. Chli and R. Y. Siegwart, “BRISK: Binary Robust invariant scalable keypoints”, 2011 International Conference on Computer Vision
    As we can see from the sample figure, the blue dots create the concentric circles and the red circles indicate the individually sampled areas. Based on these, the results of the brightness comparison are determined. BRISK also ensures rotational invariance. This is calculated by the largest gradients between two samples with a long distance from each other.

Test out the code

To test out the algorithms we will use the reference implementations available in OpenCV. We can use a pip install to install it with python bindings

As test image, I used two test images I had previously captured in my talks at All Things Open and TechSpeaker meetup. One is an evening shot of Louvre, which should have enough corners as well as overlapping edges, and another a picture of my friend in a beer garden in portrait mode. To see how it fares against the already existing blur on the image.

Original Image Link

Original Image Link

Visualizing the Feature Points

We use the below small code snippet to use BRISK on the two images above to visualize the feature points

What the code does:
  1. Loads the jpeg into the variable and converts into grayscale
  2. Initialize BRISK and run it. We use the paper suggested 4 octaves and an increased threshold of 70. This ensures we get a low number but highly reliable key points. As we will see below we still got a lot of key points
  3. We used the detectAndCompute() to get two arrays from the algo holding both the keypoints and their descriptors
  4. We draw the keypoints at their detected positions indicating keypoint size and orientation through circle diameter and angle, The "DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS" does that.

As you can see most of the key points are visible at the castle edges and all visible edges for Louvre and almost none on the floor. With the portrait mode pic of my friend it's more diverse but also shows some false positives like the reflection on the glass. Considering how BRISK works, this is normal.


In this first part of understanding what power ARCore/Kit we understood how basic feature detection works behind the scenes and tried our hands on to replicate that. These spatial anchors are vital to the virtual objects "glueing" to the real world. This whole demo and hands-on now should help you understand why designing your apps so that users place objects in areas where the device has a chance to create anchors is a good idea.
Placing a lot of objects in a single non-textured smooth plane may produce inconsistent experience since its just too hard to detect keypoints in a plane surface (now you know why). As a result, the objects may sometimes drift away if tracking is not good enough.
If your app design encourages placing objects near corners of floor or table then the app has a much better chance of working reliably. Also, Google's guide on anchor placement is an excellent read.

Their short recommendation is:
  • Release spatial anchors if you don’t need them. Each anchor costs CPU cycles which you can save
  • Keep the object close to the anchor. 
On our next post, we will see how we can use this knowledge to do basic SLAM.

Update: The next post lives here:


Popular posts from this blog

FirefoxOS, A keyboard and prediction: Story of my first contribution

Returning to my cubical holding a hot cup of coffee and with a head loaded with frustration and panic over a system codebase that I managed to break with no sufficient time to fix it before the next morning.  This was at IBM, New York where I was interning and working on the TJ Watson project. I returned back to my desk, turned on my dual monitors, started reading some blogs and engaging on Mozilla IRC (a new found and pretty short lived hobby). Just a few days before that, FirefoxOS was launched in India in the form of an Intex phone with a $35 price tag. It was making waves all around, because of its hefty price and poor performance . The OS struggle was showing up in the super low cost hardware. I was personally furious about some of the shortcomings, primarily the keyboard which at that time didn’t support prediction in any language other than English and also did not learn new words. Coincidentally, I came upon Dietrich Ayala in the FirefoxOS IRC channel, who at

April Fool and Google Part 2: A Round Up of ALL of Google’s April Fools Jokes

Ok....this post I think will contain all of the pranks I could find  for today. After my last post here Last Time I reported Only a handful of the pranks.. Understandable, as it was only the morning. After that I stumbled upon more of them Which I am gonna round up here. Now staring with the list. The very first one is obviously our favourite Google Maps Quest The above is their official video. In a post in Google Plus they say about it as follows  Today  + Google Maps  announced Google Maps 8-bit for NES. With #8bitmaps , you can do everything you'd normally do in Maps—search for famous landmarks and sites around the world, get directions and even use Street View. Just in time for April Fool's Day, Google has introduced Google Maps Quest, a retro 8-bit version of its mapping tool that is... totally awesome. In a characteristically whimsical video, available above, Google emplo

Curious case of Cisco AnyConnect and WSL2

One thing Covid has taught me is the importance of VPN. Also one other thing COVID has taught me while I work from home  is that your Windows Machine can be brilliant  as long as you have WSL2 configured in it. So imagine my dismay when I realized I cannot access my University resources while being inside the University provided VPN client. Both of the institutions I have affiliation with, requires me to use VPN software which messes up WSL2 configuration (which of course I realized at 1:30 AM). Don't get me wrong, I have faced this multiple times last two years (when I was stuck in India), and mostly I have been lazy and bypassed the actual problem by side-stepping with my not-so-noble  alternatives, which mostly include one of the following: Connect to a physical machine exposed to the internet and do an ssh tunnel from there (not so reliable since this is my actual box sitting at lab desk, also not secure enough) Create a poor man's socks proxy in that same box to have my ow