Data preprocessing
- put column names in a consistent format (lowercase + replace ' ' with _)
- put categorical values in the same format (lowercase + replace ' ' with _)
- to_numeric → convert a column to numeric format
errors='coerce' → values that fail to parse become NaN
- convert the target variable to numeric (if the original data is categorical)
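
A minimal pandas sketch of the steps above. The toy data, the column names, and the "Yes"/"No" target labels are made up for illustration, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical toy data (names and values are assumptions for this sketch).
df = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "Payment Method": ["Electronic Check", "Mailed Check", "Bank Transfer"],
    "Total Charges": ["29.85", " ", "108.15"],  # " " cannot be parsed as a number
    "Churn": ["No", "Yes", "No"],
})

# Column names: lowercase + replace spaces with underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)

# to_numeric with errors="coerce": values that fail to parse become NaN.
df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")

# Target variable: categorical "Yes"/"No" → 1/0.
df["churn"] = (df["churn"] == "Yes").astype(int)

# Categorical values: same normalisation as the column names.
categorical = ["customer_id", "payment_method"]  # listed by hand in this sketch
for col in categorical:
    df[col] = df[col].str.lower().str.replace(" ", "_", regex=False)

print(df)
```

The categorical columns are listed by hand here. Selecting them by dtype is also possible, but the dtype used for strings depends on the pandas version.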
EDA
- view target variable distribution
- view churn rate
- view which columns contain NaN
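
A short sketch of these EDA checks in pandas, assuming a DataFrame with a 0/1 churn column. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric feature with a missing value, plus the target.
df = pd.DataFrame({
    "total_charges": [29.85, np.nan, 108.15, 56.0],
    "churn": [0, 1, 0, 0],
})

# Target distribution as proportions; the share for churn = 1 is the churn rate.
target_dist = df["churn"].value_counts(normalize=True)

# Number of NaN values per column.
nan_counts = df.isnull().sum()

print(target_dist)
print(nan_counts)
```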
Feature importance
churn rate
- mean of churn
churn = 1 → customer who churned
- mean of churn = customers who churned / all samples = churn rate
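
Why the mean works: with a 0/1 target, summing the column counts the customers who churned, so the mean is that count divided by the number of samples. A small check on made-up values:

```python
import pandas as pd

# Hypothetical 0/1 target: 2 of 5 customers churned.
churn = pd.Series([0, 1, 0, 0, 1])

churned = (churn == 1).sum()       # customers who churned
total = len(churn)                 # all samples
rate_by_count = churned / total    # churned / all samples

rate_by_mean = churn.mean()        # same quantity via the mean

print(rate_by_count, rate_by_mean)
```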
Risk ratio (group churn rate / global churn rate) → relative measure
- churn rate within one group of the feature in focus / global churn rate
>1 → more likely to churn
<1 → less likely to churn
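
A sketch of the risk ratio per group using groupby. The feature name "partner" and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical feature and the 0/1 target.
df = pd.DataFrame({
    "partner": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "churn":   [0,     0,     0,     1,     1,    1,    0,    0],
})

# Global churn rate = mean of the 0/1 target over all samples.
global_rate = df["churn"].mean()

# Churn rate and sample count within each group of the feature.
groups = df.groupby("partner")["churn"].agg(churn_rate="mean", n="count")

# Risk ratio = group churn rate / global churn rate.
groups["risk"] = groups["churn_rate"] / global_rate

print(groups)
```

In this toy data the "no" group has a risk ratio above 1 (more likely to churn than average) and the "yes" group is below 1. With so few rows per group the ratios are only illustrative, which is why the count is kept next to the rate.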