New York's cab data visualization from Uber's Engineering blog
When I started looking for other publicly available datasets, I found one on Kaggle, and then quite a few more there. But none of them seemed official, until I found one released by the NYC TLC that looked pretty authoritative, and I was hooked.
You can find the Jupyter Notebook here: https://github.com/rabimba/uber_analysis_mapd/blob/master/uberFinal1.ipynb
Analysis
Just as a toy experiment, I tried to answer and visualize the following:
Can we visualize the number of Uber trips in a given period?
Distribution by hour, week, and month
Estimated monthly base revenue
Distribution of traffic across months
Distribution of short and long trips on weekdays vs. weekends
Distribution of trip duration
Trip duration vs. trip distance
And also a bunch of interesting facts we can glean from this dataset.
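The hour-of-day and weekday distributions above boil down to bucketing pickup timestamps. A minimal pure-Python sketch, with a made-up toy sample standing in for the real TLC rows (the actual dataset's column names differ):

```python
from collections import Counter
from datetime import datetime

def trip_distributions(pickup_times):
    """Count trips per hour of day and per weekday from parsed pickup datetimes."""
    by_hour = Counter(t.hour for t in pickup_times)
    by_weekday = Counter(t.strftime("%A") for t in pickup_times)
    return by_hour, by_weekday

# Toy sample standing in for real trip records.
sample = [
    datetime(2014, 4, 1, 8, 15),   # a Tuesday, morning rush
    datetime(2014, 4, 1, 8, 47),
    datetime(2014, 4, 5, 23, 5),   # a Saturday night
]
hours, weekdays = trip_distributions(sample)
print(hours[8], weekdays["Saturday"])  # 2 1
```

On the full dataset you'd feed this millions of rows, which is exactly where the memory problem below kicks in.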
However, while trying to do this, I realized it's pretty hard to work with a huge dataset directly in Jupyter if you load the whole thing into a dataframe.
I used OmniSci's Cloud interface to load up my data, then connected to that dataset using pymapd to query it with SQL.
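The connection flow looks roughly like this. This is a sketch only: the host, credentials, and the `uber_trips` table and column names are placeholders, not the real values from my notebook.

```python
QUERY = "SELECT pickup_datetime, trip_distance FROM uber_trips LIMIT 100000"

def load_sample(user, password, host, dbname="mapd"):
    """Connect to an OmniSci instance and pull a sample of rows as a DataFrame."""
    from pymapd import connect  # pip install pymapd
    con = connect(user=user, password=password, host=host, dbname=dbname)
    # select_ipc ships the result set over Apache Arrow straight into pandas,
    # which is much faster than row-by-row fetching for large results.
    return con.select_ipc(QUERY)
```

Note the `LIMIT`: without it you are back to dragging the whole dataset into local memory.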
What I did not do was be smart and use OmniSci's super powerful MapD Core to slice and dice the dataset in the cloud itself, which could have saved me a lot of time. For example, the query I was running on one-sixth of the whole dataset was taking 25 minutes.
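The smarter approach is to push the aggregation into the SQL itself, so the server does the heavy lifting and only 24 summary rows come back instead of millions of raw trips. A sketch (again, the table and column names are assumptions):

```python
def hourly_trip_counts_sql(table="uber_trips", ts_col="pickup_datetime"):
    """Build a query that aggregates server-side: one row per hour of day,
    instead of shipping every raw trip back to the notebook."""
    return (
        f"SELECT EXTRACT(HOUR FROM {ts_col}) AS hr, COUNT(*) AS trips "
        f"FROM {table} GROUP BY hr ORDER BY hr"
    )

print(hourly_trip_counts_sql())
```

Running a query like this through pymapd returns a tiny result set that plots instantly, which is essentially what Immerse is doing under the hood.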
You can take a look at some of my rough ideas, tries and more graphs in this Jupyter Notebook.
However, it turns out OmniSci also has a super helpful visualization web interface, called Immerse, as part of OmniSci Cloud. I was able to cook up these dashboards in less than 5 minutes.
And Immerse was able to crunch through the whole dataset (not just one-sixth) almost instantly and produce these charts for me. I am pretty impressed with it so far. And it seems that with the help of pymapd and some crafted SQL queries, I should be able to harness this speed as well.
That will probably be my next experiment.
What's Next:
Since I realized how powerful OmniSci Immerse can be, and have started playing with pymapd, my next pet project is merging Uber Movement's yearly data with ward-based time series data, so that we can recreate the whole dataset and analyze some of its interesting aspects as we did above. I am mostly interested in seeing (preferably with Bengaluru data):
- Uber's growth through time (and specific activity growth in different wards)
- Figuring out from the historical time series which wards and routes have the most traffic in which hour (this should also let us predict which areas may face surge pricing)
- Whether growth has saturated in any specific place (this should give us an upper threshold for that area)
- Whether an increase in Uber demand directly correlates with travel time (maybe the increased demand is causing traffic?)
- Whether we can load it up in kepler.gl (more specifically, using this demo as a template) and have a nice time series visualization
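The demand-vs-travel-time question above is just a correlation check once the two series are aligned by hour. A minimal sketch with made-up toy numbers (the real Uber Movement series would replace them):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy numbers: hourly trip counts vs. mean travel time (minutes) on one route.
demand = [120, 180, 260, 340, 410]
travel_time = [14, 16, 21, 25, 30]
print(pearson_r(demand, travel_time))  # strongly positive for this toy data
```

A strong positive r here would only show association, not causation; the "demand causes traffic" direction would need more careful analysis.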
Should be a fun project!
Update:
Since I wrote this post, I have also been playing with what we can do if we visualize this data in VR. And I have a preliminary answer :D
And here I am back. Took a very very small amount of data (as a toy example), connected omnisci cloud with nodejs and with #WebVR in Oculus Rift here we are :D pic.twitter.com/d8gIDnoWiV
— Rabimba Karanjai (@rabimba) November 29, 2018
(I know the labelling is all messed up, but will fix that later)
Meanwhile I realized it might be fun to see this data in a more visual way in #WebVR. This is not yet complete. But you can already see a small amount in an Oculus device (or mobile) pic.twitter.com/MTUbw8wvMI
— Rabimba Karanjai (@rabimba) November 29, 2018
(I know the labelling is messed up. Will fix that later)