Customer Churn Prediction with Spark

Rahul Das
Apr 12, 2020

Project Overview

The data used in this project resembles the event logs of a music streaming service such as Spotify. Millions of users stream their favorite songs every day on such a platform. Two subscription levels are available: a free tier and a premium tier. Users can upgrade, downgrade, or cancel their subscription at any time, so it is important to make sure they are satisfied with the service. Every time a user interacts with the platform, an event such as liking a song, logging in, logging out, or playing a song is recorded. This data contains key insights that can help us understand consumer behaviour and boost business. The goal of this project is to analyse the data and predict which users are likely to churn, either by downgrading from premium to free or by cancelling their subscription.

The steps needed to build a machine learning model that can predict customer churn are as follows:

  • Data Exploration: Clean and analyze the data, define churn and label data
    based on churn definition.
  • Feature Engineering: Create features for the users; these features will be used as input for the model.
  • Data transformation, data splitting and model training:
    - Transform the features data.
    - Split the data into train and test sets.
    - Build a machine learning model to train using the training data.

Data Exploration

The features of the data-set are shown in the schema printed below. The data-set has 18 features and 286500 records in total.

# Assumes an active SparkSession is available as `spark`
data_path = "mini_sparkify_event_data.json"
user_event = spark.read.json(data_path)
user_event.printSchema()

By looking at the schema we can make educated guesses as to which features are good indicators that a customer might walk away. Here are some of the features I thought fall into that category:

  • song : a song played by each user
  • registration: user registration timestamp
  • page : all the pages visited by a user
  • level : free or paid

Now let’s have a look at the distinct values of the page feature

user_event.select('page').distinct().show()

Further analysis indicates that the Cancel page is followed by the Cancellation Confirmation page. Both indicate that a user is cancelling their subscription, unless they change their mind once they are on the Cancel page, which seems unlikely, at least in this data-set. So we will use both of these events to define churn. We have also seen that about 23% of the customers who downgraded also cancelled their subscription. However, for this analysis we will exclude those who downgraded but have not cancelled yet, even though they seem likely to cancel eventually.
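The labeling rule described above can be illustrated on a handful of toy events. This is a minimal sketch in plain Python rather than Spark, so the logic is easy to follow; the event dictionaries, the helper name `label_churn`, and the exact page strings are assumptions made for illustration and should be checked against the distinct `page` values in the real data.

```python
# Toy events; only userId and page matter for the churn label.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancel"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Submit Downgrade"},
    {"userId": "3", "page": "Thumbs Up"},
]

# Assumed page names for the two cancellation events.
CHURN_PAGES = {"Cancel", "Cancellation Confirmation"}


def label_churn(events):
    """Return {userId: 1 if the user ever hit a churn page, else 0}."""
    labels = {}
    for event in events:
        user = event["userId"]
        hit = 1 if event["page"] in CHURN_PAGES else 0
        # A user is labeled as churned if any of their events is a churn event.
        labels[user] = max(labels.get(user, 0), hit)
    return labels


labels = label_churn(events)
print(labels)
```

User 2 downgraded but never cancelled, so under this definition they are not labeled as churned.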

Below is the distribution after labeling our data-set

import matplotlib.pyplot as plt

# One row per (userId, level, label) combination
use_level_count = labeled_df.groupby('userId', 'level', 'label').count()
use_level_count_pd = use_level_count.select('userId', 'level', 'label').toPandas()
# Count users per level and churn label, then plot the counts as grouped bars
use_level_count_pd.groupby(['level', 'label']).size().unstack().plot(kind='bar')
plt.title('churned-level customer count comparison')
plt.ylabel('customer count')

Feature Engineering

Which of the above features can be useful to predict churn?
- ts: we can derive information such as the total number of hours a customer spent playing songs
- song: we can get the number of songs a customer played
- page: we can find the number of Thumbs Up, Thumbs Down, and Downgrade events

The final features were generated based on the above assumptions.
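As a rough illustration of how such per-user counts could be built, here is a small plain-Python sketch over toy events. The feature names, the page strings in `PAGE_FEATURES`, and the helper `build_features` are hypothetical; the listening-hours feature is left out because the post does not say how it was derived from `ts`.

```python
from collections import Counter, defaultdict

# Toy events; song is None for events that are not song plays.
events = [
    {"userId": "1", "page": "NextSong", "song": "A"},
    {"userId": "1", "page": "NextSong", "song": "B"},
    {"userId": "1", "page": "Thumbs Up", "song": None},
    {"userId": "2", "page": "NextSong", "song": "A"},
    {"userId": "2", "page": "Thumbs Down", "song": None},
    {"userId": "2", "page": "Submit Downgrade", "song": None},
]

# Hypothetical feature name -> assumed page value it counts.
PAGE_FEATURES = {
    "thumbs_up": "Thumbs Up",
    "thumbs_down": "Thumbs Down",
    "downgrades": "Submit Downgrade",
}


def build_features(events):
    """Aggregate raw events into one feature row per user."""
    page_counts = defaultdict(Counter)
    song_counts = Counter()
    for event in events:
        user = event["userId"]
        page_counts[user][event["page"]] += 1
        if event.get("song") is not None:
            song_counts[user] += 1

    features = {}
    for user, counts in page_counts.items():
        row = {"num_songs": song_counts[user]}
        for name, page in PAGE_FEATURES.items():
            row[name] = counts[page]  # Counter returns 0 for pages never visited
        features[user] = row
    return features


features = build_features(events)
for user, row in sorted(features.items()):
    print(user, row)
```

The result is one row per user, which is the shape a classifier expects once the churn label is joined on.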

Data transformation, data splitting and model training

Steps in this section are:

  • Split the full data-set into train, validation and test set.
  • Tune selected machine learning algorithm using the validation data-set.
  • Score tuned machine learning model on the test data-set to verify it generalizes well.
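The three-way split in the first step can be sketched as follows. This is a plain-Python illustration on a list of user ids; the 60/20/20 proportions, the seed, and the helper name `split_users` are assumptions, since the post does not state the ratios it used. On a Spark DataFrame, `randomSplit` serves the same purpose.

```python
import random


def split_users(user_ids, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle user ids reproducibly and cut them into train/validation/test."""
    ids = sorted(user_ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * fractions[0])
    n_val = round(len(ids) * fractions[1])
    train = ids[:n_train]
    validation = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains goes to the test set
    return train, validation, test


user_ids = [str(i) for i in range(100)]
train, validation, test = split_users(user_ids)
print(len(train), len(validation), len(test))
```

Because each user contributes exactly one feature row, splitting by user id keeps all of a user's information in a single set.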

We could also select several models, tune each of them using the validation data-set, score them on the test data-set, and pick the one with the best performance. However, for this post I have picked only the Random Forest classifier, as it is widely regarded as a strong off-the-shelf model.

Results and Conclusion

We analysed the music data-set and came up with new features to predict churn. We then created a machine learning model and tuned it to improve its performance. We achieved an accuracy of 87% and an F1 score of 84% on the test data-set.
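For readers unfamiliar with the two metrics, the sketch below shows how accuracy and F1 follow from confusion-matrix counts. The counts are made up for illustration and are not the project's results; the post does not say whether its F1 is the churn-class F1 or a class-weighted average, and this sketch computes the former.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and the positive-class F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, f1


# Made-up counts: 10 churners and 40 non-churners in a toy test set.
accuracy, f1 = accuracy_and_f1(tp=8, fp=2, fn=2, tn=38)
print(f"model:    accuracy={accuracy:.2f} f1={f1:.2f}")

# A model that never predicts churn still looks decent on accuracy alone.
baseline_accuracy, baseline_f1 = accuracy_and_f1(tp=0, fp=0, fn=10, tn=40)
print(f"baseline: accuracy={baseline_accuracy:.2f} f1={baseline_f1:.2f}")
```

The baseline line shows why F1 is worth reporting next to accuracy when churners are a minority of users.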

Model performance could be further improved by creating additional features and by including some of the features that I left out of this analysis. The model should also be tested on samples from the larger data-set that was not used for this analysis. We could also try other models to see whether any of them perform better. Once we are satisfied with the results, the model can be deployed at scale on the cloud.

GitHub repo with the detailed code for this project:
https://github.com/rahul81/Data-Scientist-Nanodegree/tree/master/Churn_Prediction_Spark
