
Visualizing large scale Uber Movement Data

New York's cab data visualization from Uber's Engineering blog
Last month one of my acquaintances on LinkedIn pointed me to a very interesting dataset: Uber's Movement Dataset. It was fascinating to explore their awesome GUI and to play with the data. However, their UI for exploring the dataset leaves much to be desired, especially because we always have to specify a source and a destination to get relevant data and can't play with the whole dataset. Another limitation was that the dataset doesn't include any time component, which immediately ruled out a lot of things I wanted to explore.
When I started looking for another publicly available dataset, I found one on Kaggle, and then quite a few more on Kaggle. But none of them seemed official. Then I found one released by NYC TLC, which looked pretty official, and I was hooked.

To explore the data I wanted to try out OmniSci. I recently saw a video of a JupyterCon talk by Randy Zwitch in which he demos exploring an NYC cab dataset using OmniSci. Since my dataset was very similar, I thought I would give it a try.

You can find the Jupyter Notebook here:


Just as a toy experiment, I tried to answer and visualize the following:

Can we visualize the number of Uber trips in a period?

Distribution per hour, week and month

Estimated Monthly base revenue 

Distribution of traffic between months

Distribution based on weekdays and weekends on short and long trips

Distribution of Trip Duration

Trip Duration vs Trip Distance

And also a bunch of interesting facts we can glean from this dataset.
However, while trying to do this, I realized it's pretty hard to work on a huge dataset directly in Jupyter if you load the whole dataset into a dataframe.
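One way around the memory problem is to stream the rows once and keep only small counters, rather than holding every trip in a dataframe. The snippet below is a minimal sketch of that idea for the per-hour, weekday/weekend and per-month counts from the list above. It is not what the notebook does; the column name `pickup_datetime` and the sample rows are assumptions made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample standing in for a large trips CSV; the column name
# "pickup_datetime" is an assumption, not the actual dataset schema.
SAMPLE_CSV = """pickup_datetime
2015-01-03 08:15:00
2015-01-03 08:45:00
2015-01-05 17:30:00
2015-01-05 23:05:00
2015-02-01 08:10:00
"""

def count_trips(rows):
    """Stream rows once, keeping only small counters in memory."""
    per_hour = Counter()
    per_day_type = Counter()
    per_month = Counter()
    for row in rows:
        ts = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        per_hour[ts.hour] += 1
        # weekday() returns 0 for Monday through 6 for Sunday
        per_day_type["weekend" if ts.weekday() >= 5 else "weekday"] += 1
        per_month[ts.strftime("%Y-%m")] += 1
    return per_hour, per_day_type, per_month

# With a real file, replace io.StringIO(...) with open("trips.csv", newline="")
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
per_hour, per_day_type, per_month = count_trips(reader)
print(per_hour.most_common(3))
print(dict(per_day_type))
print(dict(per_month))
```

Because only the counters live in memory, the same loop should work on a file far larger than RAM, at the cost of one full pass over the data.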

I used OmniSci's Cloud interface to load my data and then connected to that dataset using pymapd to read the data over SQL.
What I did not do was be smart and use OmniSci's super powerful MapD Core to slice and dice the dataset in the cloud itself, which could have saved me a lot of time. For example, the query I was running on one-sixth of the whole dataset was taking 25 minutes.
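To make "slice and dice in the database" concrete, here is a small sketch of pushing the aggregation into SQL so that only a summary comes back, instead of fetching every row and grouping in Python. It uses the standard library's `sqlite3` purely as a local stand-in so it runs anywhere; with OmniSci the query would go through a pymapd connection instead, and the SQL dialect may differ. The table name, column names and rows are all hypothetical.

```python
import sqlite3

# sqlite3 is only a local stand-in here so the sketch runs anywhere;
# it is NOT OmniSci. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_datetime TEXT, trip_distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2015-01-03 08:15:00", 1.2),
        ("2015-01-03 08:45:00", 3.4),
        ("2015-01-05 17:30:00", 10.0),
        ("2015-02-01 08:10:00", 2.0),
    ],
)

# Slow pattern: SELECT * and aggregate every row on the client.
# Faster pattern: let the database group and count, return only the summary.
query = """
    SELECT substr(pickup_datetime, 1, 7) AS month,
           COUNT(*) AS trips,
           AVG(trip_distance) AS avg_distance
    FROM trips
    GROUP BY month
    ORDER BY month
"""
summary = conn.execute(query).fetchall()
for month, trips, avg_distance in summary:
    print(month, trips, round(avg_distance, 2))
conn.close()
```

The result has one row per month no matter how many trips are stored, so the client only ever handles a handful of numbers.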

You can take a look at some of my rough ideas, tries and more graphs in this Jupyter Notebook.

However, it turns out OmniSci also has a super helpful visualization web interface called Immerse as part of OmniSci Cloud, and I was able to cook up these dashboards in less than 5 minutes.

Immerse was able to crunch through the whole dataset (not one-sixth) almost instantly and produce these charts for me. I am pretty impressed with it so far. And it seems that with the help of pymapd and some crafted SQL queries, I should be able to harness this speed as well.
That will probably be my next try.

What's Next:

Now that I have realized how powerful OmniSci Immerse can be and have started to play with pymapd, my next pet project is merging Uber Movement's yearly data with ward-based time series data, so that we can recreate the whole dataset and analyze some of its interesting aspects as we did above. I am mostly interested in seeing (preferably in the Bengaluru data):
  • Uber's growth through time (and specific activity growth in different wards)
  • Figuring out from the historical time series which wards and routes have the most traffic in which hour (this should also let us predict which areas may face surge pricing)
  • Seeing whether growth has saturated in any specific place (this should give us an upper threshold for that area)
  • Whether an increase in Uber demand directly correlates with travel time (maybe the increased demand is causing traffic?)
  • Can we load it up in (more specifically using this demo as a template) and have a nice timeseries visualization?
Should be a fun project!
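For the demand versus travel time question in the list above, a first pass could be as simple as a Pearson correlation between trip counts and mean travel times over the same periods. This is only a sketch of that check: the numbers below are made up for illustration and are not Uber Movement data, and the variable names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative numbers, NOT real Uber Movement data:
# trips per quarter in one ward, and mean travel time (minutes) on one route.
trips_per_quarter = [100, 120, 150, 170, 200]
mean_travel_minutes = [20.0, 23.0, 24.0, 28.0, 30.0]
r = pearson(trips_per_quarter, mean_travel_minutes)
print(round(r, 3))
```

A value close to 1 would only say the two series move together. It would not by itself show that Uber demand causes the extra traffic, since both could be driven by something else such as overall growth of the city.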


Since I wrote this post, I have also been playing with what we can do if we visualize this data in VR. And I have a preliminary answer :D

