# HR Analytics: Understanding & Controlling Employee Attrition using Predictive Modeling

#### By: Vikash Singh

**Introduction**

Employee Attrition is a challege for most of the organizations. More often than not, it’s a chellenge identifying variables or factors which may increase the probability of attrition as many of these are qualitative in nature. Even if one identifies these triggers, its difficult to quantify them and create a High- Medium – Low Risk Profile of the candidates. This is where Predictive Modeling can be of great help, and that is what is covered in the article.

**Data**

The dataset consists of 1470 observations of 15 variables which are described below:

- Attrition: Whether the attrition happened (Yes=1) or not (No=0). This is our independent variable which we are trying to score.
- Age
- BusinessTravel: relates to travel frequency of the employee
- Satisfaction_level: relates to environment satisfaction level of the employee (1-Low, 2-Medium, 3-High, 4-Very high)
- Sex – gender of the employee
- JobInvolvement: relates to job involvement level of the employee (1-Low, 2-Medium, 3-High, 4-Very high)
- JobSatisfaction: relates to job satisfaction level of the employee (1-Low, 2-Medium, 3-High, 4-Very high)
- MaritalStatus
- MonthlyIncome
- NumCompaniesWorked: how many companies has the employee worked upon
- OverTime: whether the employee works overtime (Yes / No)
- TrainingTimesLastYear: number of trainings received by the employee last year
- YearsInCurrentRole: number of years in current role
- YearsSinceLastPromotion: number of years since last promotion
- YearsWithCurrManager: number of years with current manager

**Loading the Data**

**Data Exploration**

**1) Structure of the Data**

**str**(attrition)

## ‘data.frame’: 1470 obs. of 15 variables:

## $ Age : int 40 48 36 32 26 31 58 29 37 35 …

## $ Attrition : int 1 0 1 0 0 0 0 0 0 0 …

## $ BusinessTravel : Factor w/ 3 levels “Non-Travel”,”Travel_Frequently”,..: 3 2 3 2 3 2 3 3 2 3 …

## $ Satisfaction_level : int 2 3 4 4 1 4 3 4 4 3 …

## $ Sex : Factor w/ 2 levels “Female”,”Male”: 1 2 2 1 2 2 1 2 2 2 …

## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 …

## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 …

## $ MaritalStatus : Factor w/ 3 levels “Divorced”,”Married”,..: 3 2 3 2 2 3 2 1 3 2 …

## $ MonthlyIncome : int 5394 4617 1881 2618 3121 2761 2403 2424 8573 4713 …

## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 …

## $ OverTime : Factor w/ 2 levels “No”,”Yes”: 2 1 2 2 1 1 2 1 1 1 …

## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 …

## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 …

## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 …

## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 …

*#Converting categorical variables as factors which are coded as numbers*

attrition$Attrition = **as.factor**(attrition$Attrition)

attrition$Satisfaction_level = **as.factor**(attrition$Satisfaction_level)

attrition$JobInvolvement = **as.factor**(attrition$JobInvolvement)

attrition$JobSatisfaction = **as.factor**(attrition$JobSatisfaction)

**2) Missing Values**

**table**(**is.na**(attrition))

##

## FALSE

## 22050

There are no missing values in the data set.

**3) Visualisation of Independent Variables**

For numerical variables, we will use histograms; whereas for Categorical Values, we will use Bar Charts or Frequency Counts.

**Distribution of Categorical Variables**

**par**(mfrow=**c**(4,2))

**par**(mar = **rep**(2, 4))

**barplot**(**table**(attrition$Attrition), main =“Attrition Distribution”)

**barplot**(**table**(attrition$Satisfaction_level), main =“Satisfaction Level”)

**barplot**(**table**(attrition$BusinessTravel), main =“BusinessTravel”)

**barplot**(**table**(attrition$Sex), main =“Sex”)

**barplot**(**table**(attrition$JobInvolvement), main =“JobInvolvement”)

**barplot**(**table**(attrition$MaritalStatus), main =“MaritalStatus”)

**barplot**(**table**(attrition$OverTime), main =“OverTime”)

**barplot**(**table**(attrition$JobSatisfaction), main =“JobSatisfaction”)

**Distribution of Continuous Variables**

**par**(mfrow=**c**(4,2))

**par**(mar = **rep**(2, 4))

**hist**(attrition$Age)

**hist**(attrition$MonthlyIncome)

**hist**(attrition$NumCompaniesWorked)

**hist**(attrition$TrainingTimesLastYear)

**hist**(attrition$YearsInCurrentRole)

**hist**(attrition$YearsSinceLastPromotion)

**hist**(attrition$YearsWithCurrManager)

**Building Predictive Models**

**Baseline Accuracy**

**table**(attrition$Attrition)

##

## 0 1

## 1233 237

237 out of total 1470 observations left the job. So the baseline accuracy is 1233/1470 = 84%, without building any model. But this naive way would classify all employee churned as non-churn, which defeats the purpose of creating a response plan to reduce employee attrition.

Let’s go ahead and build a model which is more ‘sensitive’ to our business requirement.

**Dividing the dataset**

Before doing any modeling, let’s divide our data set into training and testing data set to evaluate the performance of our model.

*#Dividing the data set into train and test*

**library**(caTools)

## Warning: package ‘caTools’ was built under R version 3.2.4

**set.seed**(200)

spl = **sample.split**(attrition$Attrition, SplitRatio = 0.70)

train = **subset**(attrition, spl == TRUE)

test = **subset**(attrition, spl == FALSE)

**Logistic Regression**

Build a logistic regression model using all of the independent variables to predict the dependent variable “Attrition”, and use the training set to build the model.

log = **glm**(Attrition ~ . , family=“binomial”, data = train)

**summary**(log)

##

## Call:

## glm(formula = Attrition ~ ., family = “binomial”, data = train)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -1.6850 -0.5645 -0.3447 -0.1810 3.5033

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) 1.090e+00 7.381e-01 1.477 0.139604

## Age -3.866e-02 1.349e-02 -2.866 0.004153 **

## BusinessTravelTravel_Frequently 1.142e+00 4.241e-01 2.694 0.007059 **

## BusinessTravelTravel_Rarely 7.295e-01 3.908e-01 1.866 0.061973 .

## Satisfaction_level2 -8.338e-01 2.980e-01 -2.797 0.005151 **

## Satisfaction_level3 -9.093e-01 2.748e-01 -3.309 0.000936 ***

## Satisfaction_level4 -1.046e+00 2.741e-01 -3.815 0.000136 ***

## SexMale 2.936e-01 2.009e-01 1.462 0.143859

## JobInvolvement2 -1.131e+00 3.872e-01 -2.922 0.003479 **

## JobInvolvement3 -1.230e+00 3.611e-01 -3.406 0.000658 ***

## JobInvolvement4 -1.817e+00 5.048e-01 -3.599 0.000319 ***

## JobSatisfaction2 -3.894e-01 2.865e-01 -1.359 0.174200

## JobSatisfaction3 -5.039e-01 2.543e-01 -1.982 0.047528 *

## JobSatisfaction4 -1.213e+00 2.833e-01 -4.283 1.85e-05 ***

## MaritalStatusMarried 3.027e-01 2.819e-01 1.074 0.282906

## MaritalStatusSingle 1.248e+00 2.794e-01 4.467 7.92e-06 ***

## MonthlyIncome -1.057e-04 3.583e-05 -2.950 0.003173 **

## NumCompaniesWorked 9.840e-02 4.238e-02 2.322 0.020236 *

## OverTimeYes 1.488e+00 1.999e-01 7.445 9.71e-14 ***

## TrainingTimesLastYear -2.074e-01 7.754e-02 -2.675 0.007468 **

## YearsInCurrentRole -4.289e-02 4.534e-02 -0.946 0.344189

## YearsSinceLastPromotion 1.589e-01 4.250e-02 3.740 0.000184 ***

## YearsWithCurrManager – 1.171e-01 4.512e-02 -2.597 0.009417 **

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 909.34 on 1028 degrees of freedom

## Residual deviance: 712.27 on 1006 degrees of freedom

## AIC: 758.27

##

## Number of Fisher Scoring iterations: 6

All the variables are significant, except YearsInCurrentRole and Sex. Anyways, we will keep it in our model.

Making Predictions on test data, with a threshold of 0.5.

predictLog = **predict**(log, newdata = test, type = “response”)

*#Confusion Matrix*

**table**(test$Attrition, predictLog >= 0.5)

##

## FALSE TRUE

## 0 366 4

## 1 45 26

(366+26)/(**nrow**(test)) *#Accuracy 0.89*

## [1] 0.8888889

26/71 *# Sensitivity 0.37*

## [1] 0.3661972

Our accuracy has increased form Baseline Accuracy of 83% to 89%. However, that’s not much relevant. What’s important is that our sensitivity has gone leaps and bounds from 0 to 37%.

**Area Under the Curve (AUC) for the model on the test data**

**library**(ROCR)

## Loading required package: gplots

##

## Attaching package: ‘gplots’

## The following object is masked from ‘package:stats’:

##

## lowess

ROCRlog = **prediction**(predictLog, test$Attrition)

**as.numeric**(**performance**(ROCRlog, “auc”)@y.values)

## [1] 0.9018272

The AUC comes out to be 0.90 – indicating high accuracy.

**CART Model**

The logistic regression model gives us high accuracy, as well as significance of the variables. But there is a limitation. It is not immediately clear which variables are more important than the others, especially due to the large number of categorical variables in this problem.

Let us now build a classification tree for this model. Using the same training set, fit a CART model, and plot the tree.

*#CART Model*

**library**(rpart)

**library**(rpart.plot)

Tree = **rpart**(Attrition ~ ., method=“class”, data = train)

**prp**(Tree)

The Variable which the Tree Splits uupon in the first level is ‘OverTime’, followed by ‘MonthlyIncome’, indicating these are the most important variables.

Accuracy of the model on testing data set

PredictCART = **predict**(Tree, newdata = test, type = “class”)

**table**(test$Attrition, PredictCART)

## PredictCART

## 0 1

## 0 358 12

## 1 51 20

(358+20)/**nrow**(test) *#Accuracy ~ 86%*

## [1] 0.8571429

20/71 * #Sensitivity ~ 28%*

## [1] 0.2816901

**AUC of the model**

**library**(ROCR)

predictTestCART = **predict**(Tree, newdata = test)

predictTestCART = predictTestCART[,2]

*#Compute the AUC:*

ROCRCART = **prediction**(predictTestCART, test$Attrition)

**as.numeric**(**performance**(ROCRCART, “auc”)@y.values)

## [1] 0.7009326

**Interpretation**

We see that even though CART model beats the Baseline method, it underperforms the Logistic Regression model. This highlights a very regular phenomenon when comparing CART and logistic regression. CART often performs a little worse than logistic regression in out-of-sample accuracy. However, as is the case here, the CART model is often much simpler to describe and understand.

**Random Forest Model**

Finally, let’s try build a Random Forest Model.

**library**(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

**set.seed**(100)

rf = **randomForest**(Attrition ~ ., data=train)

*#make predictions*

predictRF = **predict**(rf, newdata=test)

**table**(test$Attrition, predictRF)

## predictRF

## 0 1

## 0 366 4

## 1 57 14

380/**nrow**(test) *#Accuracy ~ 86%*

## [1] 0.861678

14/71 *#Sensitivity ~ 19.7%*

## [1] 0.1971831

The accuracy is better than the CART Model but lower than the Logistic Regression Model.

**Understanding Important Variables in Random Forest Model**

One way of understanding this is to look at he number of times, aggregated over all of the trees in the random forest model,that a certain variable is selected for a split. This can be done using the following code:

*#Method 1*

vu = **varUsed**(rf, count=TRUE)

vusorted = **sort**(vu, decreasing = FALSE, index.return = TRUE)

**dotchart**(vusorted$x, **names**(rf$forest$xlevels[vusorted$ix]))

We can see that MonthlyIncome and Age Variables are used significantly more than the other variables.

**Method 2**

A different metric we can look at is related to “impurity”, which measures how homogenous each bucket or leaf of the tree is. To compute this metric, run the following command in R

**varImpPlot**(rf)

**Conclusion – The Advantage of Analytics**

In the absence of predictive modeling, we would have taken the naïve approach of predicting the majority of outcomes as the predictions which would have meant we would have labeled all employees as ‘Non Churner’. This would have defeated the entire purpose of controlling employee churn. Using various predictive modeling techniques, the organization is not just able to conveniently beat the baseline model but also predict with increased accuracy which employees have higher probability of leaving the organization.

**Action Plan**

Finally the organisation can look at the predictions and score the employees basis their probability of leaving the organization and accordingly,develop retention strategies.

**End Notes**

Please note that this is an introduction to HR Analytics using Predictive modeling. There are various other techniques which can be used to carry out this exercise,which was beyond the scope of this article. Also, even the models which have been constructed could have been improved by fine tuning various parameters. However, these have been not considered to keep the explanation simpler and non-technical.

**Author**: Vikash Singh

**Profile**: Seasoned Decision Scientist with over 11 years of experience in Data Science, Business Analytics, Business Strategy & Planning.

# vikash-analytics1 Posts

## 5 Comments

## Leading HR Execs Discusses How Big Data Can Help You Hire Smarter - Fusion Analytics World

October 18, 2016 at 6:20 am[…] major challenge initially was overcoming the lack of analytical skills within HR. This is consistent with the findings of a joint Harvard Business Review and Visier study, which […]

## Eloisa

April 15, 2017 at 3:12 amPretty! This has been a really wonderful post. Thank you for providing this info.

## Andrew Stokes

April 23, 2017 at 6:29 amI might not concur with whatever you authored right here.

I have a different perspective and to my thoughts it’s a lot more intensifying!

I think that HR Analytics: Understanding & Controlling Employee Attrition using

Predictive Modeling is the topic which is very interesting,

but we should look at it from different angles!

## Rajesh Shaw

April 25, 2017 at 2:37 pmI love the number of troubles in our modern training method are displayed listed here!

I’m students myself and I know from very own practical experience about some problems that are described listed.

## Atul Bhandari

February 2, 2018 at 2:53 pmWonderful goods from you, man. I’ve understand your stuff previous

to and you’re just too great. I really like what

you have acquired here, really like what you’re saying and the way

in which you say it. You make it enjoyable and you still care

for to keep it smart. I can not wait to read far more from you.

This is really a wonderful site.