
Visualizing large scale Uber Movement Data

New York's cab data visualization from Uber's Engineering blog
Last month one of my LinkedIn acquaintances pointed me to a very interesting dataset: Uber's Movement Dataset. It was fascinating to explore their awesome GUI and play with the data. However, their UI for exploring the dataset leaves a lot to be desired, especially because we always have to specify a source and a destination to get relevant data and can't play with the whole dataset. Another limitation was that the dataset doesn't include any time component, which immediately ruled out a lot of things I wanted to explore.
When I started looking for another publicly available dataset, I found one on Kaggle, and then quite a few more. But none of them seemed official. Then I found one released by the NYC TLC, which looked pretty official, and I was hooked.

To explore the data I wanted to try out OmniSci. I recently saw a video of a talk at JupyterCon by Randy Zwitch in which he demos exploring an NYC cab dataset using OmniSci. Since my dataset was very similar, I thought I'd give it a try.

You can find the Jupyter Notebook here:


Just as a toy experiment, I tried to answer and visualize the following:
  • Can we visualize the number of Uber trips in a period?
  • Distribution per hour, week and month
  • Estimated monthly base revenue
  • Distribution of traffic between months
  • Distribution of short and long trips on weekdays and weekends
  • Distribution of trip duration
  • Trip duration vs trip distance

And also a bunch of other interesting facts we can glean from this dataset.
However, while trying to do this, I realized it's pretty hard to work on a huge dataset directly in Jupyter if you load the whole dataset into a dataframe.
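One way around the "load everything into a dataframe" problem is to stream the file once and keep only running counts. Below is a minimal sketch of that idea, not what the notebook does. The column name `pickup_datetime`, the timestamp format and the four sample rows are all assumptions made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Tiny made-up sample standing in for a large trip file.
# The column names and timestamp format are assumptions, not the real schema.
SAMPLE_CSV = """pickup_datetime,trip_distance
2015-01-01 08:15:00,2.1
2015-01-01 08:47:00,0.9
2015-01-03 23:05:00,5.4
2015-01-05 17:30:00,3.3
"""

def summarize_trips(lines, column="pickup_datetime"):
    """Stream rows once, counting trips per hour and per weekday/weekend."""
    per_hour = Counter()
    per_day_type = Counter()
    for row in csv.DictReader(lines):
        pickup = datetime.strptime(row[column], "%Y-%m-%d %H:%M:%S")
        per_hour[pickup.hour] += 1
        # weekday() returns 5 for Saturday and 6 for Sunday
        per_day_type["weekend" if pickup.weekday() >= 5 else "weekday"] += 1
    return per_hour, per_day_type

per_hour, per_day_type = summarize_trips(io.StringIO(SAMPLE_CSV))
print(dict(per_hour))
print(dict(per_day_type))
```

With a real file you would pass an open file handle instead of the `io.StringIO` sample. Only the counters stay in memory, so the size of the file matters much less.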

I used OmniSci's Cloud interface to load my data and then connected to that dataset using pymapd to read the data with SQL.
What I did not do was be smart and use OmniSci's super powerful MapD Core to slice and dice the dataset in the cloud itself, which could have saved me a lot of time. For example, the query I was running on one-sixth of the whole dataset was taking 25 minutes.
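To make the "slice and dice on the server" idea concrete, here is a rough sketch of letting the database do the grouping and returning only the small aggregated result. It uses Python's built-in `sqlite3` purely as a stand-in so the snippet runs anywhere. It is not OmniSci, and the `trips` table, its columns and the rows are hypothetical. As far as I know pymapd exposes a similar DB-API style connection and cursor, so the same pattern should carry over, but treat that as an assumption.

```python
import sqlite3

# Stand-in database: sqlite3 in memory, NOT OmniSci.
# The table name, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_datetime TEXT, trip_distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2015-01-01 08:15:00", 2.1),
        ("2015-01-03 23:05:00", 5.4),
        ("2015-02-10 17:30:00", 3.3),
    ],
)

def monthly_trip_counts(connection):
    """Aggregate inside the database and fetch only one row per month."""
    cursor = connection.execute(
        "SELECT substr(pickup_datetime, 1, 7) AS month, COUNT(*) AS trips "
        "FROM trips "
        "GROUP BY substr(pickup_datetime, 1, 7) "
        "ORDER BY month"
    )
    return cursor.fetchall()

monthly = monthly_trip_counts(conn)
print(monthly)
```

The point is that only one row per month crosses the wire instead of every trip, which is where most of my 25 minutes probably went.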

You can take a look at some of my rough ideas, tries and more graphs in this Jupyter Notebook.

However, it turns out OmniSci also has a super helpful visualization web interface called Immerse as part of OmniSci Cloud, and I was able to cook up these dashboards in less than 5 minutes.

Immerse was able to crunch through the whole dataset (not just one-sixth) almost instantly and produce these charts for me. I am pretty impressed with it so far. It seems that with the help of pymapd and some crafted SQL queries, I should be able to harness this speed as well.
That will probably be my next try.

What's Next:

Now that I have realized how powerful OmniSci Immerse can be and have started to play with pymapd, my next pet project is merging Uber Movement's yearly data with ward-based time series data, so that we can recreate the whole dataset and analyze some of the interesting aspects of it as we did above. I am mostly interested in seeing (preferably in the Bengaluru data):
  • Uber's growth through time (and specific activity growth in different wards)
  • Which wards and routes have the most traffic in which hour, based on the historical time series (this should also let us predict which areas may face surge pricing)
  • Whether the growth has saturated in any specific place (should give us an upper threshold for that area)
  • Whether an increase in Uber demand directly correlates with travel time (maybe the increased demand is causing traffic?)
  • Can we load it up in (more specifically using this demo as a template) and have a nice timeseries visualization?
Should be a fun project!
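For the demand vs travel time question, a first rough check could be a plain Pearson correlation between the two series. This is only a sketch of that calculation, and I have not run it on any Uber data yet. The two lists below are invented numbers for illustration, and a high correlation would of course not prove that demand causes traffic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Invented example values, NOT real measurements:
# trips requested per hour in one ward and mean travel time in minutes.
hourly_demand = [120, 150, 180, 240, 300]
mean_travel_minutes = [18.0, 19.5, 21.0, 26.0, 31.0]

r = pearson(hourly_demand, mean_travel_minutes)
print(round(r, 3))
```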


Since I wrote this post I have also been playing with what we can do if we visualize this data in VR. And I have a preliminary answer :D

