Impact analysis of COVID lockdowns on Airbnb Listings in Los Angeles
A Udacity CRISP-DM project
This blog is part of the project submission for the Udacity Data Scientist Nanodegree program.
This project follows the CRISP-DM process. Briefly, CRISP-DM stands for “Cross-Industry Standard Process for Data Mining” and consists of the following steps:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
Airbnb’s business model is a multi-sided marketplace that connects travelers with hosts and experience providers. The company makes money from booking fees on stays and experiences.
Scope and project description:
For this project, I use only the Los Angeles data scraped each month from January to June in both 2019 and 2020, which allows a like-for-like comparison of the two years. The dataset is open source and can be found on the Inside Airbnb website.
Business impact analysis examines the financial and operational impact of a disruptive event on a business’s functions and processes.
In this case, I chose to analyze the loss of sales/bookings on the Airbnb platform caused by the COVID-related lockdowns in the city of Los Angeles. The city announced its first lockdowns, or stay-at-home orders, around mid-March 2020, so this analysis places special emphasis on the months around that date.
Details, Assumptions, and Limitations:
- The Inside Airbnb website contains three major datasets: calendars, listings, and reviews.
- These datasets are fairly large: each monthly dataset consists of millions of rows, and feature engineering expanded it to ~600 features. I used the free tier of Google Cloud Platform for these computations, but because of computing limitations I restricted the analysis to the relevant months of 2019 and 2020. Anyone who wishes to improve on this analysis and has access to the infrastructure needed for such heavy computations can consider including the additional historical data available on the website.
- Throughout the analysis, I refer to two variables: (1) ‘Data collected on Month’ and (2) ‘Month of Booking’ (or a variant). The first is the month in which a particular dataset of bookings was collected, and the second is the month the booking was made for. A dataset collected in a particular month contains bookings for roughly the following year. For example, the dataset collected in Jan 2020 has the booking details of each listing for every single day until Jan 2021. Bookings can be made or canceled from month to month, so this distinction is crucial to understanding the analysis.
- The detailed code is available in my GitHub repository.
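To make the distinction between the two month variables concrete, here is a minimal sketch. It assumes each monthly calendar file is loaded into a pandas DataFrame with `listing_id`, `date`, and `available` columns; the helper `tag_snapshot` and the column names `collected_month` and `booking_month` are hypothetical names chosen for illustration, and the rows are made up.

```python
import pandas as pd

def tag_snapshot(calendar: pd.DataFrame, collected_on: str) -> pd.DataFrame:
    """Attach both month variables to one calendar snapshot."""
    out = calendar.copy()
    out["date"] = pd.to_datetime(out["date"])
    # 'Data collected on Month': the month this snapshot was scraped.
    out["collected_month"] = collected_on
    # 'Month of Booking': the month the calendar date itself falls in.
    out["booking_month"] = out["date"].dt.strftime("%Y-%m")
    return out

# Made-up rows standing in for the snapshot scraped in Jan 2020.
jan_2020 = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2020-01-15", "2020-06-01", "2021-01-10"],
    "available": ["f", "t", "f"],
})
tagged = tag_snapshot(jan_2020, "2020-01")
print(tagged[["listing_id", "collected_month", "booking_month"]])
```

Every row of a snapshot shares one `collected_month`, while `booking_month` can run up to a year ahead of it.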
Problem statements:
To conduct this impact analysis, I decided to try and answer the below three questions based on my business understanding.
1. Is there a major change in the trend of bookings made from Jan to June in 2020 when compared to 2019?
Every industry usually shows some seasonality, with sales affected by holiday seasons, off-seasons, and so on. Comparing the same set of months can help us understand whether the 2020 lockdowns led to a major spike in cancellations (as expected) and by how much bookings dropped on average.
2. Estimated impact on total daily bookings and revenue (sales):
Although it may not be possible to determine the exact financial impact of lost business and cancellations, it can be estimated from the average difference in bookings and sales between 2019 and 2020.
3. Features/Characteristics of the listings that were most important in 2019 vs 2020 datasets
The datasets provide an elaborate list of features for each listing. I thought it would be interesting to build a machine learning model for each year and compare which features showed up as most important for getting more bookings. This would help explore whether certain features were valued more during the lockdown and whether others led to more cancellations.
Exploratory Data Analysis (EDA):
To answer the first question, I first conducted an EDA on the calendars dataset to compare the booking trends from Jan to June in 2019 vs 2020.
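As a rough sketch of how a daily booking count can be derived from one calendar snapshot: the code below assumes that a calendar row with `available == "f"` represents a booked night. That is an assumption, since a host can also block dates without a booking. The function name `daily_bookings` and the rows are made up for illustration.

```python
import pandas as pd

def daily_bookings(calendar: pd.DataFrame) -> pd.Series:
    """Count the distinct listings marked unavailable on each date."""
    # Assumption: unavailable ('f') is treated as booked.
    booked = calendar[calendar["available"] == "f"]
    return booked.groupby("date")["listing_id"].nunique().rename("bookings")

# Made-up calendar rows: three listings over two dates.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2, 3],
    "date": ["2020-04-01"] * 3 + ["2020-04-02"] * 3,
    "available": ["f", "f", "t", "f", "t", "t"],
})
counts = daily_bookings(calendar)
print(counts)
```

Running this once per monthly snapshot gives one booking curve per collection month, which is the kind of series the trend comparison relies on.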
From the above trends, it is clear that in 2019 bookings were somewhat higher when the data was collected closer to the booking date, and they remained fairly stable for bookings further in the future. In 2020, however, bookings dropped (or cancellations increased) closer to the booking date. This is consistent with the timeline of the lockdown announcements in the state of California.
To get a sense of the dip in bookings in the COVID era vs the pre-COVID era, we can compare the average monthly bookings for both years, especially for the months that are common to all the datasets.
The bar charts show no consistent pattern in 2019: average monthly bookings do not reliably go higher or lower from one dataset to the next, and the differences may simply reflect seasonality. In 2020, however, bookings generally dropped or cancellations increased in the later datasets. For example, the June 2020 dataset has a lower average than the other datasets for almost all months, which is most likely the impact of the lockdowns announced during COVID in Los Angeles.
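The per-dataset averages behind such bar charts can be sketched as a group-by followed by a pivot. The table layout (`collected_month`, `date`, `bookings`) and all numbers below are made up for illustration and are not results from the actual datasets.

```python
import pandas as pd

# Made-up daily booking counts from two snapshots covering the same dates.
daily = pd.DataFrame({
    "collected_month": ["2020-03", "2020-03", "2020-06", "2020-06"],
    "date": pd.to_datetime(
        ["2020-07-01", "2020-07-02", "2020-07-01", "2020-07-02"]
    ),
    "bookings": [100, 120, 60, 80],
})
daily["booking_month"] = daily["date"].dt.strftime("%Y-%m")

# Rows: month of booking. Columns: month the data was collected.
avg_monthly = (
    daily.groupby(["booking_month", "collected_month"])["bookings"]
    .mean()
    .unstack("collected_month")
)
print(avg_monthly)
```

Reading across a row shows how the average for one booking month changes as later snapshots are collected.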
To answer the second question, I plotted the average difference in bookings made per month in 2019 vs 2020.
The above difference/delta chart shows that average bookings per day dropped by roughly 3,500 to 5,000 in 2020, when averaged over the datasets collected from Jan to June of 2019 and 2020.
Clarification: The June to Jan months depicted in the graph above are the actual months of booking, not the months in which the data was collected. The Jan bar belongs to the following year, i.e. in the 2019 dataset it shows the Jan 2020 bookings difference, and in the 2020 dataset it shows Jan 2021. The same logic applies to the graph below.
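The year-over-year delta can be sketched by aligning the two years on calendar month number, so that Jan 2020 (from the 2019 datasets) lines up with Jan 2021 (from the 2020 datasets). The column names and the toy values below are hypothetical and only illustrate the alignment.

```python
import pandas as pd

# Toy averages, not real results: one row per snapshot year and booking month.
avg = pd.DataFrame({
    "snapshot_year": [2019, 2019, 2020, 2020],
    "booking_month": ["2019-06", "2020-01", "2020-06", "2021-01"],
    "avg_daily_bookings": [10.0, 8.0, 6.0, 5.0],
})

# Align on calendar month so June pairs with June and Jan with next-year Jan.
avg["month"] = avg["booking_month"].str[5:7].astype(int)
wide = avg.pivot(index="month", columns="snapshot_year",
                 values="avg_daily_bookings")
wide["delta"] = wide[2020] - wide[2019]
print(wide)
```

A negative `delta` means fewer average daily bookings in the 2020 datasets than in the 2019 datasets for that month.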
To further explore the drop in revenue, I plotted the average difference in sales (price per day for booked days).
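A minimal sketch of the sales measure, assuming the calendar `price` column holds strings such as `"$1,250.00"` and, as before, that `available == "f"` marks a booked night. The function name `daily_sales` and the rows are made up for illustration.

```python
import pandas as pd

def daily_sales(calendar: pd.DataFrame) -> pd.Series:
    """Sum the nightly price of booked listings for each date."""
    # Strip the currency symbol and thousands separator before converting.
    price = (
        calendar["price"]
        .str.replace(r"[$,]", "", regex=True)
        .astype(float)
    )
    priced = calendar.assign(price_usd=price)
    # Assumption: unavailable ('f') is treated as booked.
    booked = priced[priced["available"] == "f"]
    return booked.groupby("date")["price_usd"].sum().rename("sales")

# Made-up calendar rows.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 1],
    "date": ["2020-04-01", "2020-04-01", "2020-04-01", "2020-04-02"],
    "available": ["f", "f", "t", "f"],
    "price": ["$100.00", "$1,250.00", "$80.00", "$100.00"],
})
sales = daily_sales(calendar)
print(sales)
```

The available listing on 2020-04-01 contributes nothing, because only booked days count toward sales.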
An interesting observation from the above charts is that although overall bookings have gone down, average sales for advance bookings into the next year are, based on current bookings, still higher in 2020 than in 2019.
To explore the reason for this, I decided to identify the features that most affected bookings in the two years.
This also brings me to my third and final question. To answer it, I built random forest models for the 2019 and 2020 datasets and plotted the top 15 important features for each model as below.
Comparing the 2019 and 2020 model top features:
- One thing strikingly common to the two models is that time-related features such as year and month are the most important for predicting bookings in both years.
- In comparison, however, in the 2020 model the future-year variable year_2021 has lower importance than the current-year variable year_2020, whereas in the 2019 model the future and current year variables, year_2020 and year_2019, have almost the same importance.
- It is also interesting that although the month_ features showed up as important in both models, months 3, 4, 5, and 6 had a steadily increasing importance in the 2020 model. This is notable given that lockdowns started around March 2020 (month_3) in Los Angeles.
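The feature ranking discussed above can be sketched with scikit-learn. The post does not say whether the models were regressors or classifiers, so the use of `RandomForestRegressor` here is an assumption, and the data is synthetic; only the feature names `month_3` and `year_2020` echo the ones mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic one-hot style features; 'noise' has no relation to the target.
X = pd.DataFrame({
    "month_3": rng.integers(0, 2, 500),
    "year_2020": rng.integers(0, 2, 500),
    "noise": rng.normal(size=500),
})
# Synthetic target driven mostly by month_3.
y = 5.0 * X["month_3"] + 0.5 * X["year_2020"] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances, sorted so the top features come first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
top_features = importances.head(15)
print(top_features)
```

Fitting one such model per year and comparing the two `top_features` tables side by side gives the kind of comparison described in the bullets above.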
Next steps and improvements:
Working with a large dataset had its own challenges (mainly long model-building times and limited RAM), even though I used Google Cloud’s Compute Engine on the free tier for this analysis. To improve on this analysis in the future, I would look into the following approaches:
- Use the historical Airbnb datasets available on the Inside Airbnb website for a more thorough historical analysis
- Reduce the dataset size by using PCA to keep only the principal components that capture most of the variance in the features
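The PCA idea could look roughly like the sketch below. It uses synthetic data with 20 correlated features, and the 95% variance threshold is an arbitrary illustrative choice rather than something taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 20 observed features generated from 3 hidden factors.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

The reduced matrix can then be fed to the model in place of the original feature set, which cuts memory use and training time when many features are correlated.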