HR Analytics: Understanding & Controlling Employee Attrition using Predictive Modeling

By: Vikash Singh
Introduction
Employee attrition is a challenge for most organizations. More often than not, it is difficult to identify the variables or factors that increase the probability of attrition, as many of them are qualitative in nature. Even if one identifies these triggers, it is hard to quantify them and create a High / Medium / Low risk profile of employees. This is where predictive modeling can be of great help, and that is what this article covers.
Data
The dataset consists of 1470 observations of 15 variables which are described below:
- Attrition: whether attrition happened (Yes=1) or not (No=0). This is the dependent variable we are trying to predict.
- Age
- BusinessTravel: relates to travel frequency of the employee
- Satisfaction_level: relates to environment satisfaction level of the employee (1-Low, 2-Medium, 3-High, 4-Very high)
- Sex: gender of the employee
- JobInvolvement: relates to job involvement level of the employee (1-Low, 2-Medium, 3-High, 4-Very high)
- JobSatisfaction: relates to job satisfaction level of the employee (1-Low, 2-Medium, 3-High, 4-Very high)
- MaritalStatus
- MonthlyIncome
- NumCompaniesWorked: number of companies the employee has worked for
- OverTime: whether the employee works overtime (Yes / No)
- TrainingTimesLastYear: number of trainings received by the employee last year
- YearsInCurrentRole: number of years in current role
- YearsSinceLastPromotion: number of years since last promotion
- YearsWithCurrManager: number of years with current manager
Loading the Data
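The loading step itself is not shown in the original post; a minimal sketch, assuming the data sit in a CSV file (the file name attrition_data.csv is only a placeholder for wherever your copy of the data lives):
#Load the attrition data from CSV; "attrition_data.csv" is a placeholder path
attrition = read.csv("attrition_data.csv", stringsAsFactors = TRUE)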
Data Exploration
1) Structure of the Data
str(attrition)
## 'data.frame': 1470 obs. of 15 variables:
## $ Age : int 40 48 36 32 26 31 58 29 37 35 ...
## $ Attrition : int 1 0 1 0 0 0 0 0 0 0 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ Satisfaction_level : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5394 4617 1881 2618 3121 2761 2403 2424 8573 4713 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
#Converting categorical variables which are coded as numbers into factors
attrition$Attrition = as.factor(attrition$Attrition)
attrition$Satisfaction_level = as.factor(attrition$Satisfaction_level)
attrition$JobInvolvement = as.factor(attrition$JobInvolvement)
attrition$JobSatisfaction = as.factor(attrition$JobSatisfaction)
2) Missing Values
table(is.na(attrition))
##
## FALSE
## 22050
There are no missing values in the data set.
3) Visualisation of Independent Variables
For numerical variables we will use histograms, whereas for categorical variables we will use bar charts of frequency counts.
Distribution of Categorical Variables
par(mfrow=c(4,2))
par(mar = rep(2, 4))
barplot(table(attrition$Attrition), main = "Attrition Distribution")
barplot(table(attrition$Satisfaction_level), main = "Satisfaction Level")
barplot(table(attrition$BusinessTravel), main = "BusinessTravel")
barplot(table(attrition$Sex), main = "Sex")
barplot(table(attrition$JobInvolvement), main = "JobInvolvement")
barplot(table(attrition$MaritalStatus), main = "MaritalStatus")
barplot(table(attrition$OverTime), main = "OverTime")
barplot(table(attrition$JobSatisfaction), main = "JobSatisfaction")
Distribution of Continuous Variables
par(mfrow=c(4,2))
par(mar = rep(2, 4))
hist(attrition$Age)
hist(attrition$MonthlyIncome)
hist(attrition$NumCompaniesWorked)
hist(attrition$TrainingTimesLastYear)
hist(attrition$YearsInCurrentRole)
hist(attrition$YearsSinceLastPromotion)
hist(attrition$YearsWithCurrManager)
Building Predictive Models
Baseline Accuracy
table(attrition$Attrition)
##
## 0 1
## 1233 237
237 out of the total 1470 observations left the job, so the baseline accuracy of always predicting the majority class is 1233/1470 = 84%, without building any model. But this naive approach would classify every employee who churned as a non-churner, which defeats the purpose of creating a response plan to reduce employee attrition.
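To make this concrete, here is a minimal sketch of the naive majority-class "model" (naivePred is just an illustrative name, not part of the original analysis): it produces no true positives, so its sensitivity is 0%.
#Naive baseline: predict 'no attrition' (0) for every employee
naivePred = rep(0, nrow(attrition))
table(attrition$Attrition, naivePred) #accuracy = 1233/1470 ~ 84%, sensitivity = 0/237 = 0%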
Let’s go ahead and build a model which is more ‘sensitive’ to our business requirement.
Dividing the dataset
Before doing any modeling, let’s divide our data set into training and testing data set to evaluate the performance of our model.
#Dividing the data set into train and test
library(caTools)
## Warning: package 'caTools' was built under R version 3.2.4
set.seed(200)
spl = sample.split(attrition$Attrition, SplitRatio = 0.70)
train = subset(attrition, spl == TRUE)
test = subset(attrition, spl == FALSE)
Logistic Regression
Build a logistic regression model using all of the independent variables to predict the dependent variable “Attrition”, and use the training set to build the model.
log = glm(Attrition ~ . , family = "binomial", data = train)
summary(log)
##
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6850 -0.5645 -0.3447 -0.1810 3.5033
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.090e+00 7.381e-01 1.477 0.139604
## Age -3.866e-02 1.349e-02 -2.866 0.004153 **
## BusinessTravelTravel_Frequently 1.142e+00 4.241e-01 2.694 0.007059 **
## BusinessTravelTravel_Rarely 7.295e-01 3.908e-01 1.866 0.061973 .
## Satisfaction_level2 -8.338e-01 2.980e-01 -2.797 0.005151 **
## Satisfaction_level3 -9.093e-01 2.748e-01 -3.309 0.000936 ***
## Satisfaction_level4 -1.046e+00 2.741e-01 -3.815 0.000136 ***
## SexMale 2.936e-01 2.009e-01 1.462 0.143859
## JobInvolvement2 -1.131e+00 3.872e-01 -2.922 0.003479 **
## JobInvolvement3 -1.230e+00 3.611e-01 -3.406 0.000658 ***
## JobInvolvement4 -1.817e+00 5.048e-01 -3.599 0.000319 ***
## JobSatisfaction2 -3.894e-01 2.865e-01 -1.359 0.174200
## JobSatisfaction3 -5.039e-01 2.543e-01 -1.982 0.047528 *
## JobSatisfaction4 -1.213e+00 2.833e-01 -4.283 1.85e-05 ***
## MaritalStatusMarried 3.027e-01 2.819e-01 1.074 0.282906
## MaritalStatusSingle 1.248e+00 2.794e-01 4.467 7.92e-06 ***
## MonthlyIncome -1.057e-04 3.583e-05 -2.950 0.003173 **
## NumCompaniesWorked 9.840e-02 4.238e-02 2.322 0.020236 *
## OverTimeYes 1.488e+00 1.999e-01 7.445 9.71e-14 ***
## TrainingTimesLastYear -2.074e-01 7.754e-02 -2.675 0.007468 **
## YearsInCurrentRole -4.289e-02 4.534e-02 -0.946 0.344189
## YearsSinceLastPromotion 1.589e-01 4.250e-02 3.740 0.000184 ***
## YearsWithCurrManager -1.171e-01 4.512e-02 -2.597 0.009417 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 909.34 on 1028 degrees of freedom
## Residual deviance: 712.27 on 1006 degrees of freedom
## AIC: 758.27
##
## Number of Fisher Scoring iterations: 6
All the variables are significant except YearsInCurrentRole and Sex; nevertheless, we will keep them in the model.
Making predictions on the test data, with a threshold of 0.5.
predictLog = predict(log, newdata = test, type = "response")
#Confusion Matrix
table(test$Attrition, predictLog >= 0.5)
##
## FALSE TRUE
## 0 366 4
## 1 45 26
(366+26)/(nrow(test)) #Accuracy 0.89
## [1] 0.8888889
26/71 # Sensitivity 0.37
## [1] 0.3661972
Our accuracy has increased from the baseline accuracy of 84% to 89%. However, that is not the most relevant improvement. What matters is that our sensitivity has jumped from 0% to 37%.
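If catching likely leavers matters more than overall accuracy, the 0.5 cutoff can also be lowered to trade some accuracy for extra sensitivity. A quick sketch (the 0.3 cutoff is only an illustrative value, not one tuned in the original analysis):
#Re-score the same predictions with a lower, illustrative cutoff of 0.3
table(test$Attrition, predictLog >= 0.3)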
Area Under the Curve (AUC) for the model on the test data
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
ROCRlog = prediction(predictLog, test$Attrition)
as.numeric(performance(ROCRlog, "auc")@y.values)
## [1] 0.9018272
The AUC comes out to be 0.90, indicating that the model discriminates well between churners and non-churners.
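The same ROCR prediction object can also be used to draw the ROC curve behind this number; a short sketch:
#Plot the ROC curve (true positive rate vs. false positive rate) for the logistic model
ROCRperf = performance(ROCRlog, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, by = 0.1))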
CART Model
The logistic regression model gives us high accuracy, as well as significance of the variables. But there is a limitation. It is not immediately clear which variables are more important than the others, especially due to the large number of categorical variables in this problem.
Let us now build a classification tree for this model. Using the same training set, fit a CART model, and plot the tree.
#CART Model
library(rpart)
library(rpart.plot)
Tree = rpart(Attrition ~ ., method = "class", data = train)
prp(Tree)
The variable the tree splits on at the first level is 'OverTime', followed by 'MonthlyIncome', indicating that these are the most important variables.
Accuracy of the model on testing data set
PredictCART = predict(Tree, newdata = test, type = "class")
table(test$Attrition, PredictCART)
## PredictCART
## 0 1
## 0 358 12
## 1 51 20
(358+20)/nrow(test) #Accuracy ~ 86%
## [1] 0.8571429
20/71 #Sensitivity ~ 28%
## [1] 0.2816901
AUC of the model
library(ROCR)
predictTestCART = predict(Tree, newdata = test)
predictTestCART = predictTestCART[,2]
#Compute the AUC:
ROCRCART = prediction(predictTestCART, test$Attrition)
as.numeric(performance(ROCRCART, "auc")@y.values)
## [1] 0.7009326
Interpretation
We see that even though the CART model beats the baseline method, it underperforms the logistic regression model. This highlights a common phenomenon when comparing CART and logistic regression: CART often performs a little worse than logistic regression in out-of-sample accuracy. However, as is the case here, the CART model is often much simpler to describe and understand.
Random Forest Model
Finally, let's try building a Random Forest model.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
set.seed(100)
rf = randomForest(Attrition ~ ., data=train)
#make predictions
predictRF = predict(rf, newdata=test)
table(test$Attrition, predictRF)
## predictRF
## 0 1
## 0 366 4
## 1 57 14
380/nrow(test) #Accuracy ~ 86%
## [1] 0.861678
14/71 #Sensitivity ~ 19.7%
## [1] 0.1971831
The accuracy is marginally better than the CART model but lower than the logistic regression model, and the sensitivity (19.7%) is the lowest of the three models.
Understanding Important Variables in Random Forest Model
One way of understanding this is to look at the number of times, aggregated over all of the trees in the random forest model, that a certain variable is selected for a split. This can be done using the following code:
#Method 1
vu = varUsed(rf, count=TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, names(rf$forest$xlevels[vusorted$ix]))
We can see that MonthlyIncome and Age Variables are used significantly more than the other variables.
Method 2
A different metric we can look at is related to "impurity", which measures how mixed the classes are within each bucket or leaf of the tree. To compute this metric, run the following command in R:
varImpPlot(rf)
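For the numbers behind this plot, the importance() function from randomForest returns the mean decrease in Gini impurity for each variable; a minimal sketch:
#Numeric variable importance (mean decrease in Gini impurity), sorted from most to least important
imp = importance(rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]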
Conclusion – The Advantage of Analytics
In the absence of predictive modeling, we would have taken the naive approach of always predicting the majority outcome, which would have meant labeling every employee as a 'non-churner'. This would have defeated the entire purpose of controlling employee churn. Using various predictive modeling techniques, the organization is not just able to comfortably beat the baseline model but can also predict with increased accuracy which employees have a higher probability of leaving the organization.
Action Plan
Finally, the organization can look at the predictions, score the employees based on their probability of leaving the organization, and accordingly develop retention strategies.
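As a sketch of how such scoring could be set up, the logistic model's predicted probabilities can be attached to the employees and bucketed into the High / Medium / Low risk profile mentioned in the introduction (the 0.3 and 0.6 cut points below are illustrative placeholders, not values derived from the analysis):
#Score each test-set employee with the logistic model's attrition probability
test$AttritionProb = predict(log, newdata = test, type = "response")
#Bucket into illustrative Low / Medium / High risk bands (cut points are placeholders)
test$RiskBand = cut(test$AttritionProb, breaks = c(0, 0.3, 0.6, 1), labels = c("Low", "Medium", "High"), include.lowest = TRUE)
table(test$RiskBand)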
End Notes
Please note that this is an introduction to HR analytics using predictive modeling. There are various other techniques which can be used to carry out this exercise, but they were beyond the scope of this article. Also, the models constructed here could be improved further by fine-tuning various parameters. However, these steps have not been covered in order to keep the explanation simpler and non-technical.
Author: Vikash Singh
Profile: Seasoned Decision Scientist with over 11 years of experience in Data Science, Business Analytics, Business Strategy & Planning.