SVM (Support Vector Machine): Job Attrition
Introduction
SVM (Support Vector Machine) is a supervised machine learning algorithm that is mainly used for classification. The aim of this exercise is to predict job attrition from individual characteristics such as distance from home, education, environment satisfaction, gender, and hourly rate, and finally to predict whether two specific employees would leave their job.
Preparing data
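# Load the packages used below (the attrition data ships with modeldata;
# older versions of rsample also included it)
library(dplyr)
library(rsample)
library(caret)
library(ggplot2)
library(modeldata)
# Load attrition data
data("attrition")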
set.seed(123) # for reproducibility
df <- dplyr::mutate_if(attrition, is.ordered, factor, ordered = FALSE)
head(df)
## Age Attrition BusinessTravel DailyRate Department
## 1 41 Yes Travel_Rarely 1102 Sales
## 2 49 No Travel_Frequently 279 Research_Development
## 3 37 Yes Travel_Rarely 1373 Research_Development
## 4 33 No Travel_Frequently 1392 Research_Development
## 5 27 No Travel_Rarely 591 Research_Development
## 6 32 No Travel_Frequently 1005 Research_Development
## DistanceFromHome Education EducationField EnvironmentSatisfaction Gender
## 1 1 College Life_Sciences Medium Female
## 2 8 Below_College Life_Sciences High Male
## 3 2 College Other Very_High Male
## 4 3 Master Life_Sciences Very_High Female
## 5 2 Below_College Medical Low Male
## 6 2 College Life_Sciences Very_High Male
## HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction
## 1 94 High 2 Sales_Executive Very_High
## 2 61 Medium 2 Research_Scientist Medium
## 3 92 Medium 1 Laboratory_Technician High
## 4 56 High 1 Research_Scientist High
## 5 40 High 1 Laboratory_Technician Medium
## 6 79 High 1 Laboratory_Technician Very_High
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## 1 Single 5993 19479 8 Yes
## 2 Married 5130 24907 1 No
## 3 Single 2090 2396 6 Yes
## 4 Married 2909 23159 1 Yes
## 5 Married 3468 16632 9 No
## 6 Single 3068 11864 0 No
## PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
## 1 11 Excellent Low 0
## 2 23 Outstanding Very_High 1
## 3 15 Excellent Medium 0
## 4 11 Excellent High 0
## 5 12 Excellent Very_High 1
## 6 13 Excellent High 0
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 0 Bad 6
## 2 10 3 Better 10
## 3 7 3 Better 0
## 4 8 3 Better 8
## 5 6 3 Better 2
## 6 8 2 Good 7
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 4 0 5
## 2 7 1 7
## 3 0 0 0
## 4 7 3 0
## 5 2 2 2
## 6 7 3 6
# Create training (80%) and test (20%) sets
set.seed(123) # for reproducibility
attrition_split <- initial_split(df, prop = 0.8, strata = "Attrition")
#If we want to explicitly control the sampling so that our training and test
#sets have similar y distributions, we can use stratified sampling
attrition_train <- training(attrition_split)
attrition_test <- testing(attrition_split)
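As a rough illustration of what stratified sampling does, the base-R sketch below draws 80% of the rows within each class of a toy outcome. The toy vector `y` and its 84/16 class balance are assumptions chosen to resemble the attrition data; rsample's initial_split() may handle the details differently.

```r
# Toy outcome with a class imbalance similar to the attrition data (assumption)
set.seed(123)
y <- factor(c(rep("No", 84), rep("Yes", 16)))

# Sample 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both subsets keep roughly the same share of "Yes"
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

Sampling within each class is what keeps the rare "Yes" class from being under-represented in the smaller test set by chance.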
Radial Basis Function - SVM
#caret's train() function with method = "svmRadial" is used to tune C (cost)
#through cross-validation; \sigma (related to the \gamma of the radial basis function)
#is estimated from the data and held constant
set.seed(1854) # for reproducibility
model_svm_rbf <- train(
Attrition ~ .,
data = attrition_train,
method = "svmRadial",
preProcess = c("center", "scale"), #x's standardized (i.e.,centered around zero with a sd of one)
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10 #evaluate 10 candidate values of C (sigma is held constant)
)
# Print results
print(model_svm_rbf$results)
## sigma C Accuracy Kappa AccuracySD KappaSD
## 1 0.009662318 0.25 0.8385702 0.0000000 0.0006647742 0.00000000
## 2 0.009662318 0.50 0.8385702 0.0000000 0.0006647742 0.00000000
## 3 0.009662318 1.00 0.8513110 0.1239377 0.0073175959 0.06778683
## 4 0.009662318 2.00 0.8657685 0.3112862 0.0163132180 0.10086539
## 5 0.009662318 4.00 0.8657685 0.3626285 0.0210030787 0.09429814
## 6 0.009662318 8.00 0.8589816 0.3852671 0.0232781466 0.09110905
## 7 0.009662318 16.00 0.8419817 0.3438177 0.0182919567 0.06346784
## 8 0.009662318 32.00 0.8326380 0.3182507 0.0177556841 0.06644716
## 9 0.009662318 64.00 0.8326380 0.3182507 0.0177556841 0.06644716
## 10 0.009662318 128.00 0.8326380 0.3182507 0.0177556841 0.06644716
model_svm_rbf
## Support Vector Machines with Radial Basis Function Kernel
##
## 1177 samples
## 30 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1059, 1060, 1059, 1059, 1059, 1059, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8385702 0.0000000
## 0.50 0.8385702 0.0000000
## 1.00 0.8513110 0.1239377
## 2.00 0.8657685 0.3112862
## 4.00 0.8657685 0.3626285
## 8.00 0.8589816 0.3852671
## 16.00 0.8419817 0.3438177
## 32.00 0.8326380 0.3182507
## 64.00 0.8326380 0.3182507
## 128.00 0.8326380 0.3182507
##
## Tuning parameter 'sigma' was held constant at a value of 0.009662318
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.009662318 and C = 4.
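The ten cost values in the table above are consecutive powers of two. A minimal sketch that reproduces them, assuming (from the printed values, not from caret's documentation) that the default grid is 2^(i - 3) for i = 1, ..., tuneLength:

```r
# Assumed form of the default cost grid; it matches the C column printed above
tune_length <- 10
C_grid <- 2^(seq_len(tune_length) - 3)
print(C_grid)
```

This makes clear that tuneLength walks a fixed, geometrically spaced grid rather than drawing values at random.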
#tune the hyperparameters with tunegrid
set.seed(1854) # for reproducibility
model_svm_rbf_2 <- train(
Attrition ~ .,
data = attrition_train,
method = "svmRadial",
preProcess = c("center", "scale"), #x's standardized (i.e.,centered around zero with a sd of one)
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(C=seq(4, 6, 1), sigma=seq(0.003,0.006,0.0005))
)
# Print results
#print(model_svm_rbf_2$results)
model_svm_rbf_2
## Support Vector Machines with Radial Basis Function Kernel
##
## 1177 samples
## 30 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1059, 1060, 1059, 1059, 1059, 1059, ...
## Resampling results across tuning parameters:
##
## C sigma Accuracy Kappa
## 4 0.0030 0.8674562 0.2817171
## 4 0.0035 0.8708533 0.3233907
## 4 0.0040 0.8725554 0.3387736
## 4 0.0045 0.8742721 0.3601304
## 4 0.0050 0.8751268 0.3749610
## 4 0.0055 0.8742648 0.3817300
## 4 0.0060 0.8725699 0.3766865
## 5 0.0030 0.8717080 0.3294572
## 5 0.0035 0.8768217 0.3706600
## 5 0.0040 0.8759742 0.3802197
## 5 0.0045 0.8751123 0.3836210
## 5 0.0050 0.8734246 0.3823293
## 5 0.0055 0.8734246 0.3823293
## 5 0.0060 0.8725771 0.3799746
## 6 0.0030 0.8751123 0.3658621
## 6 0.0035 0.8751195 0.3778871
## 6 0.0040 0.8751123 0.3862916
## 6 0.0045 0.8725699 0.3799100
## 6 0.0050 0.8717224 0.3775552
## 6 0.0055 0.8708822 0.3753383
## 6 0.0060 0.8717297 0.3820362
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.0035 and C = 5.
#Plotting the results of the grid search: within this narrow grid (C = 4 to 6),
#cross-validated accuracy varies by only about one percentage point and
#peaks at C = 5, sigma = 0.0035:
ggplot(model_svm_rbf_2) + theme_light()

#cross-validated confusion matrix on the training dataset
confusionMatrix(model_svm_rbf_2)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction No Yes
## No 83.2 11.6
## Yes 0.7 4.5
##
## Accuracy (average) : 0.8768
## testing
#We test the model on the testing dataset
test_svm_rbf_2 = predict(model_svm_rbf_2, attrition_test)
confusionMatrix(data = test_svm_rbf_2, attrition_test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 245 34
## Yes 1 13
##
## Accuracy : 0.8805
## 95% CI : (0.8378, 0.9154)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.03008
##
## Kappa : 0.3806
##
## Mcnemar's Test P-Value : 6.338e-08
##
## Sensitivity : 0.9959
## Specificity : 0.2766
## Pos Pred Value : 0.8781
## Neg Pred Value : 0.9286
## Prevalence : 0.8396
## Detection Rate : 0.8362
## Detection Prevalence : 0.9522
## Balanced Accuracy : 0.6363
##
## 'Positive' Class : No
##
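The summary statistics in the output above can be recomputed by hand from the four cell counts, which helps when reading them with 'No' as the positive class. The sketch below uses only the counts printed in the confusion matrix; the variable names are illustrative.

```r
# Counts copied from the test-set confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(245, 1, 34, 13), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["No", "No"] / sum(cm[, "No"])     # share of actual "No" predicted as "No"
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # share of actual "Yes" predicted as "Yes"
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

Because the positive class is 'No', the specificity is the quantity that tells us how many leavers the model actually catches.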
After loading the dataset “attrition”, I convert the ordered factors to unordered factors so that the modelling functions treat them as plain categorical predictors. Then, I set a seed for reproducibility and split the data into a training and a testing dataset, stratifying on Attrition so that both have a similar distribution of y.
To classify the employees into “Yes” and “No” based on the given features, I use the radial basis function (RBF) kernel. This kernel measures the similarity of two observations as a decreasing function of the distance between them. Doing so, it implicitly maps all points into an infinite-dimensional feature space in which the classes can become linearly separable.
Before running the model with the method “svmRadial” of caret’s train() function, I again set a seed. I could optimize the model using different metrics such as accuracy and ROC; here, I use accuracy to select the best model. When running the model, I standardize the predictors by centring and scaling them, and I use 10-fold cross-validation as the resampling method. With tuneLength = 10, train() evaluates ten values of C (0.25 to 128) while holding σ (related to γ) constant at a value estimated from the data, and selects the best of these candidates (C = 4). Subsequently, I tune C and σ jointly with a tuneGrid informed by that result. Optimizing accuracy, the final model has sigma = 0.0035 and C = 5 and reaches a cross-validated accuracy of 87.68%.
Now, I predict the classes on the testing dataset and compare the predicted classification with the labels of the dataset. On the testing dataset, this model reaches an accuracy of 88.05% (sensitivity: 99.59%, specificity: 27.66%, balanced accuracy: 63.63%). Because “No” is the positive class, the low specificity means that only 13 of the 47 employees who actually left are identified.
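For intuition about the radial basis function kernel, here is a minimal base-R sketch. It assumes the parameterisation k(x, y) = exp(-sigma * ||x - y||^2), which is how I understand the `sigma` reported by caret to be used; the helper `rbf_kernel` and the two toy points are illustrative, not part of the analysis.

```r
# Similarity of two observations under an RBF kernel (assumed parameterisation)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

a <- c(0, 0)
b <- c(3, 4)  # squared Euclidean distance to a is 25

print(rbf_kernel(a, a, sigma = 0.0035))  # identical points: similarity 1
print(rbf_kernel(a, b, sigma = 0.0035))  # exp(-0.0035 * 25)
print(rbf_kernel(a, b, sigma = 0.5))     # larger sigma: similarity decays faster
```

A small sigma makes distant points still look similar (a smoother decision boundary), while a large sigma makes the kernel very local, which is one reason the two hyperparameters are tuned together.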
##2 Linear kernel function
# Linear-kernel SVM with the caret package: cost fixed at C = 5, predictors centred and scaled
set.seed(1854) # for reproducibility
model_svm_linear_2 <- train(
Attrition ~ .,
data = attrition_train,
method = "svmLinear",
preProcess = c("center", "scale"), #x's standardized (i.e.,centered around zero with a sd of one)
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(C=5)
)
#print results
print(model_svm_linear_2$results)
## C Accuracy Kappa AccuracySD KappaSD
## 1 5 0.870042 0.4252287 0.02902589 0.1192711
## testing
#We test the model on the testing dataset
test_svm_linear_2 = predict(model_svm_linear_2, attrition_test)
confusionMatrix(data = test_svm_linear_2, attrition_test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 241 23
## Yes 5 24
##
## Accuracy : 0.9044
## 95% CI : (0.8648, 0.9356)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.0009129
##
## Kappa : 0.5802
##
## Mcnemar's Test P-Value : 0.0013149
##
## Sensitivity : 0.9797
## Specificity : 0.5106
## Pos Pred Value : 0.9129
## Neg Pred Value : 0.8276
## Prevalence : 0.8396
## Detection Rate : 0.8225
## Detection Prevalence : 0.9010
## Balanced Accuracy : 0.7452
##
## 'Positive' Class : No
##
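With an imbalanced outcome, Kappa is a useful complement to accuracy because it corrects for the agreement expected by chance. The sketch below recomputes Cohen's kappa from the cell counts printed above; the variable names are illustrative.

```r
# Counts copied from the linear-SVM test confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(241, 5, 23, 24), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

n <- sum(cm)
p_observed <- sum(diag(cm)) / n                      # plain accuracy
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

print(round(c(accuracy = p_observed, chance = p_expected, kappa = cohen_kappa), 4))
```

The chance agreement is already high here because most employees stay, which is why an accuracy above 90% corresponds to a more modest kappa.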
Comparison of SVM RBF and SVM with Linear kernel function (C = 5)
As a second model, I use an SVM with a linear kernel. Here, I use the same cost that was selected for the first model (C = 5). A linear SVM is a parametric model and is less expensive to train and to predict with than an RBF kernel SVM. Besides being more computationally expensive, the latter is also easier to overfit because it is more flexible and has an additional hyperparameter (σ) to tune.
This model gives a cross-validated accuracy of 87.00%, which is slightly lower than that of the SVM RBF model (87.68%). It performs better out of sample, though, with a test accuracy of 90.44% (compared to 88.05%).
##3 Logistic Regression
model_logistic <- glm(Attrition~BusinessTravel + DistanceFromHome + EnvironmentSatisfaction +
Gender+ JobInvolvement + JobRole + MaritalStatus+
NumCompaniesWorked + OverTime + RelationshipSatisfaction+ TotalWorkingYears+
TrainingTimesLastYear+ WorkLifeBalance+ YearsInCurrentRole + YearsSinceLastPromotion,
family="binomial", attrition_train)
summary(model_logistic)
##
## Call:
## glm(formula = Attrition ~ BusinessTravel + DistanceFromHome +
## EnvironmentSatisfaction + Gender + JobInvolvement + JobRole +
## MaritalStatus + NumCompaniesWorked + OverTime + RelationshipSatisfaction +
## TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
## YearsInCurrentRole + YearsSinceLastPromotion, family = "binomial",
## data = attrition_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.4944 -0.2749 -0.1033 3.9483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.90029 0.83791 -1.074 0.282621
## BusinessTravelTravel_Frequently 1.60242 0.42480 3.772 0.000162 ***
## BusinessTravelTravel_Rarely 0.95694 0.39118 2.446 0.014433 *
## DistanceFromHome 0.04324 0.01164 3.714 0.000204 ***
## EnvironmentSatisfactionMedium -1.11996 0.30199 -3.709 0.000208 ***
## EnvironmentSatisfactionHigh -1.01384 0.26613 -3.810 0.000139 ***
## EnvironmentSatisfactionVery_High -1.34486 0.27659 -4.862 1.16e-06 ***
## GenderMale 0.40817 0.20008 2.040 0.041352 *
## JobInvolvementMedium -0.75783 0.38054 -1.991 0.046428 *
## JobInvolvementHigh -1.24222 0.36043 -3.447 0.000568 ***
## JobInvolvementVery_High -1.68965 0.49486 -3.414 0.000639 ***
## JobRoleHuman_Resources 2.04238 0.59622 3.426 0.000614 ***
## JobRoleLaboratory_Technician 1.54114 0.48297 3.191 0.001418 **
## JobRoleManager 0.48627 0.73624 0.660 0.508944
## JobRoleManufacturing_Director 0.34080 0.60355 0.565 0.572306
## JobRoleResearch_Director -1.20518 1.16715 -1.033 0.301799
## JobRoleResearch_Scientist 0.60884 0.48803 1.248 0.212201
## JobRoleSales_Executive 1.08881 0.47868 2.275 0.022928 *
## JobRoleSales_Representative 2.16183 0.55651 3.885 0.000102 ***
## MaritalStatusMarried 0.22081 0.27805 0.794 0.427125
## MaritalStatusSingle 1.22718 0.28184 4.354 1.34e-05 ***
## NumCompaniesWorked 0.15355 0.04087 3.757 0.000172 ***
## OverTimeYes 2.05669 0.21338 9.639 < 2e-16 ***
## RelationshipSatisfactionMedium -0.80170 0.30498 -2.629 0.008571 **
## RelationshipSatisfactionHigh -0.83069 0.27632 -3.006 0.002645 **
## RelationshipSatisfactionVery_High -0.92268 0.28118 -3.281 0.001033 **
## TotalWorkingYears -0.09092 0.02384 -3.814 0.000137 ***
## TrainingTimesLastYear -0.19553 0.08021 -2.438 0.014786 *
## WorkLifeBalanceGood -1.15480 0.38988 -2.962 0.003057 **
## WorkLifeBalanceBetter -1.44181 0.36179 -3.985 6.74e-05 ***
## WorkLifeBalanceBest -0.84796 0.44227 -1.917 0.055204 .
## YearsInCurrentRole -0.11146 0.04329 -2.575 0.010035 *
## YearsSinceLastPromotion 0.16997 0.04299 3.954 7.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1040.54 on 1176 degrees of freedom
## Residual deviance: 717.82 on 1144 degrees of freedom
## AIC: 783.82
##
## Number of Fisher Scoring iterations: 7
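The coefficients in the summary above are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to read. The sketch below copies three estimates from the table; the selection is only illustrative.

```r
# Estimates copied from the coefficient table above (log-odds scale)
log_odds <- c(OverTimeYes       = 2.05669,
              DistanceFromHome  = 0.04324,
              TotalWorkingYears = -0.09092)

# Odds ratios: multiplicative change in the odds of attrition per unit increase
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

An odds ratio above 1 raises the odds of attrition (working overtime multiplies them by roughly 7.8, holding the other predictors fixed), while a value below 1 lowers them.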
#to get accuracy of the training data
probabilities_attrition_train <- model_logistic %>% predict(attrition_train, type = "response")
predicted.classes_train <- ifelse(probabilities_attrition_train > 0.5, "Yes", "No")
mean(predicted.classes_train == attrition_train$Attrition)
## [1] 0.8810535
probabilities_attrition <- model_logistic %>% predict(attrition_test, type = "response")
predicted.classes <- as.factor(ifelse(probabilities_attrition > 0.5, "Yes", "No"))
#Accuracy: The model accuracy is measured as the proportion of observations that have been correctly classified.
#Inversely, the classification error is defined as the proportion of observations that have been misclassified.
#Proportion of correctly classified observations:
mean(predicted.classes == attrition_test$Attrition)
## [1] 0.887372
confusionMatrix(data = predicted.classes, attrition_test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 236 23
## Yes 10 24
##
## Accuracy : 0.8874
## 95% CI : (0.8455, 0.9212)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.01302
##
## Kappa : 0.5292
##
## Mcnemar's Test P-Value : 0.03671
##
## Sensitivity : 0.9593
## Specificity : 0.5106
## Pos Pred Value : 0.9112
## Neg Pred Value : 0.7059
## Prevalence : 0.8396
## Detection Rate : 0.8055
## Detection Prevalence : 0.8840
## Balanced Accuracy : 0.7350
##
## 'Positive' Class : No
##
Now, I create a model using a logistic regression and compare it with the SVM classifiers. I select variables that are significant in the logistic regression, such as business travel, distance from home, and environment satisfaction. The logistic regression only gives me a probability of attrition; therefore, I need to set a cut-off value. I set the cut-off at 50% (every probability > 50% = “Yes”). The training accuracy of this model is 88.11%, which is higher than that of both SVM models; note, however, that this figure is computed in-sample on the training data, whereas the SVM accuracies are cross-validated, so the two are not strictly comparable. The out-of-sample accuracy amounts to 88.74%.
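The cut-off step can be sketched in isolation. The probabilities below are made up for illustration and do not come from the fitted model; the point is how the 0.5 threshold maps probabilities to classes, and why it can help to state the factor levels explicitly.

```r
# Hypothetical predicted probabilities of attrition (not model output)
p <- c(0.08, 0.51, 0.50, 0.93)

# Strictly greater than 0.5 becomes "Yes"; a probability of exactly 0.5 stays "No"
pred <- factor(ifelse(p > 0.5, "Yes", "No"), levels = c("No", "Yes"))
print(pred)

# With explicit levels, both classes are kept even if only one is predicted
all_no <- factor(ifelse(c(0.1, 0.2) > 0.5, "Yes", "No"), levels = c("No", "Yes"))
print(levels(all_no))
```

Fixing the levels is a defensive choice: a plain as.factor() would drop the unused class when every prediction falls on one side of the cut-off, and a comparison against the reference labels could then fail.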
##KNN
set.seed(1234) #I will use cross validation. To be able to replicate the results I set the seed to a fixed number
# Below I use 'train' function from caret library.
# 'preProcess': I use this option to center and scale the data
# 'method' is knn
# default 'metric' is accuracy
model_knn <- train(Attrition~., data=attrition_train,
method = "knn",
trControl = trainControl("cv", number = 10), #use cross validation with 10 folds
tuneLength = 10, #number of parameter values train function will try
preProcess = c("center", "scale")) #center and scale the data in k-nn this is pretty important
model_knn
## k-Nearest Neighbors
##
## 1177 samples
## 30 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1059, 1060, 1059, 1059, 1059, 1060, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8385702 0.12216802
## 7 0.8411415 0.11075104
## 9 0.8428147 0.08514906
## 11 0.8436694 0.08242729
## 13 0.8445386 0.07600373
## 15 0.8445241 0.07107637
## 17 0.8428220 0.05459990
## 19 0.8436694 0.05666231
## 21 0.8436622 0.05001399
## 23 0.8411198 0.02556860
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
plot(model_knn) #we can plot the results

# very low accuracy compared to others - I don't continue to tune with grid.
## testing
#We test the model on the testing dataset
test_knn = predict(model_knn, attrition_test)
confusionMatrix(data = test_knn, attrition_test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 246 46
## Yes 0 1
##
## Accuracy : 0.843
## 95% CI : (0.7962, 0.8827)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.4754
##
## Kappa : 0.0352
##
## Mcnemar's Test P-Value : 3.247e-11
##
## Sensitivity : 1.00000
## Specificity : 0.02128
## Pos Pred Value : 0.84247
## Neg Pred Value : 1.00000
## Prevalence : 0.83959
## Detection Rate : 0.83959
## Detection Prevalence : 0.99659
## Balanced Accuracy : 0.51064
##
## 'Positive' Class : No
##
KNN Model
As a fourth model, I use the k-Nearest Neighbours (k-NN) model. It predicts the class of a data point from its k nearest neighbours. This means that attrition is predicted from the k most similar observations in the training dataset. Amongst the 10 values of k that were tried (5 to 23), the model with k = 13 performs best, with a cross-validated accuracy of 84.45%.
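To make the k-NN idea concrete, here is a minimal base-R sketch of a majority vote among the k nearest training points. The helper `knn_predict` and the toy data are hypothetical; caret's "knn" method may differ in details such as tie handling.

```r
# Classify one query point by majority vote among its k nearest neighbours
knn_predict <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data: three "No" points near the origin, two "Yes" points far away
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6))
train_y <- factor(c("No", "No", "No", "Yes", "Yes"))

print(knn_predict(train_x, train_y, query = c(0.5, 0.5), k = 3))
print(knn_predict(train_x, train_y, query = c(5, 5.5), k = 3))
```

Because the vote depends on raw distances, predictors on larger scales would dominate, which is why centring and scaling matter so much for this method.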
Comparing the models
Comparing the four models based on accuracy, all models other than KNN perform similarly well; KNN is the weakest model. None of the four models appears to be overfitted: the out-of-sample accuracy is similar to, or even higher than, the accuracy on the training data.
Model                 Accuracy (training data)   Accuracy (testing data)
SVM RBF               0.8768                     0.8805
SVM Linear            0.8700                     0.9044
Logistic regression   0.8811                     0.8874
KNN                   0.8445                     0.843
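The comparison can also be held in a small data frame, which makes the gap between training and testing accuracy explicit. The figures are the ones reported in this document; the column names are illustrative. Note that the training figure for the logistic regression is an in-sample accuracy, while the other three are cross-validated.

```r
# Accuracies as reported above
results <- data.frame(
  model          = c("SVM RBF", "SVM Linear", "Logistic regression", "KNN"),
  train_accuracy = c(0.8768, 0.8700, 0.8811, 0.8445),
  test_accuracy  = c(0.8805, 0.9044, 0.8874, 0.8430)
)

# A clearly negative gap would hint at overfitting
results$gap <- results$test_accuracy - results$train_accuracy

# Sort by test accuracy, best first
print(results[order(-results$test_accuracy), ])
```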
Below you find listed some differences between SVM and Logistic Regression:
SVM (RBF)
- Optimizes the margin that separates the two classes
- For unstructured and semi-structured data like text and images
- Less vulnerable to overfitting
- Based on geometrical properties
Logistic Regression
- Can have different decision boundaries with different weights to predict
- Works with already defined independent variables
- Based on statistical approach
- More vulnerable to overfitting
In general, with a high number of features and a small training set, it is better to use a logistic regression or an SVM with a linear kernel. Logistic regression and SVM with a linear kernel usually perform similarly, but depending on the features, one can be more efficient than the other. In this case, I would recommend using the logistic regression, as it is a less complex model, does not overfit, is not computationally heavy, and performs as well as the other models.
Applying the best model
#load dataset
library(readr)
two_employees <- read_csv("data/two_employees.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Employee = col_character(),
## BusinessTravel = col_character(),
## Department = col_character(),
## Education = col_character(),
## EducationField = col_character(),
## EnvironmentSatisfaction = col_character(),
## Gender = col_character(),
## JobInvolvement = col_character(),
## JobRole = col_character(),
## JobSatisfaction = col_character(),
## MaritalStatus = col_character(),
## OverTime = col_character(),
## PerformanceRating = col_character(),
## RelationshipSatisfaction = col_character(),
## WorkLifeBalance = col_character()
## )
## See spec(...) for full column specifications.
two_employees <- two_employees %>%
mutate_if(is.ordered, factor, ordered = FALSE)
head(two_employees)
## # A tibble: 2 x 31
## Employee Age BusinessTravel DailyRate Department DistanceFromHome Education
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 A 48 Travel_Rarely 1202 Sales 8 College
## 2 B 38 Travel_Rarely 1218 Sales 9 Master
## # ... with 24 more variables: EducationField <chr>,
## # EnvironmentSatisfaction <chr>, Gender <chr>, HourlyRate <dbl>,
## # JobInvolvement <chr>, JobLevel <dbl>, JobRole <chr>, JobSatisfaction <chr>,
## # MaritalStatus <chr>, MonthlyIncome <dbl>, MonthlyRate <dbl>,
## # NumCompaniesWorked <dbl>, OverTime <chr>, PercentSalaryHike <dbl>,
## # PerformanceRating <chr>, RelationshipSatisfaction <chr>,
## # StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## # TrainingTimesLastYear <dbl>, WorkLifeBalance <chr>, YearsAtCompany <dbl>,
## # YearsInCurrentRole <dbl>, YearsSinceLastPromotion <dbl>,
## # YearsWithCurrManager <dbl>
#predict attrition
probabilities_attrition_two_employees <- model_logistic %>% predict(two_employees, type = "response")
predicted.classes_two_employees <- as.factor(ifelse(probabilities_attrition_two_employees > 0.5, "Yes", "No"))
predicted.classes_two_employees
## 1 2
## Yes No
## Levels: No Yes
In this exercise, I use the logistic regression to predict whether the two employees, given their characteristics, are likely to leave their job.
According to the logistic model, employee A is likely to leave the job, while employee B is not. Given the characteristics of employees A and B, without the model I would rather have guessed that employee B is the one likely to leave (given the low job satisfaction). However, job satisfaction is not among the predictors of the fitted model, and other characteristics evidently have a higher impact on the predicted likelihood of attrition. To keep employee A, a better work-life balance, a percentage salary hike, higher relationship satisfaction and no overtime would probably be needed; of these, only the salary hike is not a predictor in the model, so its effect cannot be read from this prediction.