1 Statement of authorship

I have executed and prepared this assignment and document by myself without the help of any other person. Signature:

2 Background

knitr::include_graphics(path = "./airplane.jpg") 

Passenger loyalty is fundamental to any airline aiming to maintain a stable market share and revenue stream (Chang and Hung, 2013), particularly in a turbulent market. The competitive landscape of the global airline industry has been in constant change in recent years, with the rapid growth of low-cost carriers and high-speed rail, rising fuel costs, fluctuating demand, and tighter security, safety and quality requirements. This is without even considering a global pandemic like COVID-19 and its effects on airlines. To survive and grow, airline managers need to identify the factors of their services that satisfy and retain customers (Chen, 2008).

3 Objective

The objective of this report is to analyze a data set and find which characteristics drive past customers to “fly again” with an airline, to develop a solid predictive model, and to provide insight for future marketing campaigns.

4 Methods

4.1 Return to airline data

The data consists of 1768 observations and a total of 24 columns. The features are a mix of opinion-based questions and demographic information for past customers. Each customer was also asked whether he/she would fly again with this airline. The opinion-based questions use a scale from 1, meaning strongly disagree, to 9, strongly agree.

To begin the analysis, we read the data from the CSV file.

library(dplyr)
library(tidyverse)
library(psych)
library(DT)
library(sjmisc)
library(sjPlot)
library(captioner)
library(knitr)
library(kableExtra)
data <- read.csv("Airline_Key_Drivers.csv")
headTail(data) %>% datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T), caption = "Airline data")

Next, we look at the names and types of the variables in the data.

str(data)
'data.frame':   1768 obs. of  24 variables:
 $ RID             : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Easy_Reservation: int  1 1 1 1 1 1 3 3 1 NA ...
 $ Preferred_Seats : int  1 1 1 1 1 1 1 1 1 2 ...
 $ Flight_Options  : int  1 1 NA 1 NA 1 2 3 1 3 ...
 $ Ticket_Prices   : int  NA 1 1 1 1 1 2 1 2 2 ...
 $ Seat_Comfort    : int  NA 5 2 1 1 4 1 4 2 4 ...
 $ Seat_Roominess  : int  4 4 1 3 1 6 1 4 1 2 ...
 $ Overhead_Storage: int  4 3 1 4 1 1 3 3 1 1 ...
 $ Clean_Aircraft  : int  3 3 1 2 1 2 4 4 4 3 ...
 $ Courtesy        : int  1 5 4 1 7 2 4 5 4 2 ...
 $ Friendliness    : int  1 3 5 1 3 2 4 4 1 2 ...
 $ Helpfulness     : int  1 6 NA 1 4 2 4 7 7 NA ...
 $ Service         : int  1 4 1 1 5 5 4 9 5 3 ...
 $ Satisfaction    : int  2 2 1 1 1 3 4 3 5 3 ...
 $ Recommend       : int  NA 1 1 2 1 1 1 NA 1 2 ...
 $ Language.n      : int  1 1 NA 2 1 4 1 2 1 1 ...
 $ Smoker.n        : int  2 2 2 2 2 2 2 2 1 2 ...
 $ Employment.n    : int  2 NA 3 2 3 2 3 3 1 NA ...
 $ Education.n     : int  6 5 6 6 2 2 1 4 1 NA ...
 $ Marital.n       : int  1 3 2 3 3 3 2 1 3 NA ...
 $ Sex.n           : int  2 2 2 1 1 2 1 1 2 2 ...
 $ Age.n           : int  1 1 6 6 5 6 6 2 7 1 ...
 $ Income.n        : int  4 4 12 3 3 12 1 13 4 3 ...
 $ FlyAgain        : int  0 0 0 0 0 0 0 0 0 0 ...

4.2 Missing values analysis

The number of missing values in each predictor is calculated in the next code chunks.

library(inspectdf)
data %>% inspect_na %>%
  kable("html", align = 'clc', caption = 'Missing values count and percentage', digits=2) %>%
    kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
Missing values count and percentage
col_name cnt pcnt
Flight_Options 159 8.99
Language.n 159 8.99
Courtesy 155 8.77
Smoker.n 153 8.65
Helpfulness 152 8.60
Service 151 8.54
Sex.n 151 8.54
Education.n 147 8.31
Age.n 146 8.26
Seat_Roominess 142 8.03
Employment.n 141 7.98
Marital.n 141 7.98
Friendliness 137 7.75
Easy_Reservation 132 7.47
Satisfaction 132 7.47
Clean_Aircraft 131 7.41
Preferred_Seats 129 7.30
Income.n 129 7.30
Seat_Comfort 128 7.24
Recommend 126 7.13
Ticket_Prices 121 6.84
Overhead_Storage 120 6.79
RID 0 0.00
FlyAgain 0 0.00

The plot below shows the missing percentages in a visually appealing format. Note that all predictors have less than 10% missing data, and the dependent variable (FlyAgain) has no missing values for any participant.

data %>% inspect_na %>% show_plot(label_size = 8)

4.3 Imputing the missing values

For this, we will use the random forest ('rf') imputation method from the ‘mice’ package.

library(mice) 
set.seed(456)
tempData <- mice(data, m=5, maxit=50, meth='rf', seed=500, print=FALSE)

We pick the first of the five imputed data sets and check that all missing values have been imputed.

data_imp1 <- mice::complete(tempData)
data_imp1 %>% inspect_na %>%
  kable("html", align = 'clc', caption = 'Missing values count and percentage', digits=2) %>%
    kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
Missing values count and percentage
col_name cnt pcnt
RID 0 0
Easy_Reservation 0 0
Preferred_Seats 0 0
Flight_Options 0 0
Ticket_Prices 0 0
Seat_Comfort 0 0
Seat_Roominess 0 0
Overhead_Storage 0 0
Clean_Aircraft 0 0
Courtesy 0 0
Friendliness 0 0
Helpfulness 0 0
Service 0 0
Satisfaction 0 0
Recommend 0 0
Language.n 0 0
Smoker.n 0 0
Employment.n 0 0
Education.n 0 0
Marital.n 0 0
Sex.n 0 0
Age.n 0 0
Income.n 0 0
FlyAgain 0 0

Statistical tests on the imputed data show that the means and variances are not significantly different from those of the original data.

library(matrixTests)

d1 <- col_t_welch(data[2:23], data_imp1[2:23] )[c(1:2, 4:5, 12)]
d2 <- col_f_var( data[2:23], data_imp1[2:23] )[c(4:5, 10)]

d1 %>%
  kable("html", align = 'clc', caption = 'First imputed data set', digits=2, col.names=c("fs n", "fs micerf n", "fs mean", "micerf mean", "p-value")) %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
First imputed data set
fs n fs micerf n fs mean micerf mean p-value
Easy_Reservation 1636 1768 8.14 8.16 0.69
Preferred_Seats 1639 1768 7.08 7.12 0.56
Flight_Options 1609 1768 6.91 6.94 0.67
Ticket_Prices 1647 1768 6.68 6.70 0.76
Seat_Comfort 1640 1768 7.81 7.85 0.48
Seat_Roominess 1626 1768 7.48 7.50 0.76
Overhead_Storage 1648 1768 7.10 7.13 0.64
Clean_Aircraft 1637 1768 7.74 7.79 0.43
Courtesy 1613 1768 8.23 8.27 0.42
Friendliness 1631 1768 8.15 8.19 0.42
Helpfulness 1616 1768 8.06 8.11 0.34
Service 1617 1768 8.03 8.08 0.36
Satisfaction 1636 1768 7.79 7.81 0.69
Recommend 1642 1768 7.05 7.07 0.76
Language.n 1609 1768 1.76 1.73 0.52
Smoker.n 1615 1768 1.81 1.82 0.49
Employment.n 1627 1768 1.98 1.97 0.70
Education.n 1621 1768 3.48 3.47 0.92
Marital.n 1627 1768 2.31 2.30 0.87
Sex.n 1617 1768 1.51 1.52 0.75
Age.n 1622 1768 4.38 4.40 0.76
Income.n 1639 1768 5.35 5.33 0.82
d2 %>%
  kable("html", align = 'clc', caption = 'Comparing variances', digits=2, col.names=c( "fs var", "micerf var", "p-value")) %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
Comparing variances
fs var micerf var p-value
Easy_Reservation 2.13 2.13 0.98
Preferred_Seats 3.47 3.42 0.73
Flight_Options 3.81 3.82 0.98
Ticket_Prices 3.83 3.80 0.88
Seat_Comfort 2.72 2.67 0.67
Seat_Roominess 3.15 3.12 0.84
Overhead_Storage 3.55 3.49 0.74
Clean_Aircraft 2.84 2.80 0.73
Courtesy 2.12 2.03 0.37
Friendliness 2.13 2.07 0.56
Helpfulness 2.21 2.13 0.48
Service 2.45 2.41 0.69
Satisfaction 2.78 2.80 0.89
Recommend 3.61 3.64 0.86
Language.n 1.45 1.40 0.50
Smoker.n 0.16 0.15 0.43
Employment.n 0.67 0.67 0.88
Education.n 4.97 4.93 0.86
Marital.n 2.34 2.32 0.83
Sex.n 0.25 0.25 0.99
Age.n 4.52 4.46 0.79
Income.n 9.91 9.76 0.74

Let’s merge the five imputed data sets using the ‘sjmisc’ package. After running the statistical tests and checking every p-value, we pick the merged data set as the final imputed data set.
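
The merge step itself is not echoed here; below is a minimal sketch of how it could be done with sjmisc::merge_imputations(), assuming the mice object tempData from the previous chunk. Requesting a summary returns a list whose $data element holds the merged values, which is what mice_mrg$data refers to later.

library(sjmisc)
# Merge the 5 imputations: for each originally missing cell, the imputed values
# are averaged (numeric) or the most frequent imputed value is chosen.
mice_mrg <- merge_imputations(
  data,              # the original data frame that was passed to mice()
  tempData,          # the mice() object holding the 5 imputed data sets
  summary = "dens"   # also build a density plot of imputed vs. merged values
)
str(mice_mrg$data)   # merged values for the variables that had missing data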

Merged from all 5 imputed data sets
fs n fs micerf n fs mean micerf mean p-value
Easy_Reservation 1636 1768 8.14 8.17 0.56
Preferred_Seats 1639 1768 7.08 7.12 0.50
Flight_Options 1609 1768 6.91 6.94 0.61
Ticket_Prices 1647 1768 6.68 6.70 0.72
Seat_Comfort 1640 1768 7.81 7.84 0.61
Seat_Roominess 1626 1768 7.48 7.51 0.62
Overhead_Storage 1648 1768 7.10 7.14 0.56
Clean_Aircraft 1637 1768 7.74 7.79 0.41
Courtesy 1613 1768 8.23 8.28 0.35
Friendliness 1631 1768 8.15 8.18 0.52
Helpfulness 1616 1768 8.06 8.11 0.35
Service 1617 1768 8.03 8.07 0.39
Satisfaction 1636 1768 7.79 7.83 0.46
Recommend 1642 1768 7.05 7.08 0.64
Language.n 1609 1768 1.76 1.72 0.43
Smoker.n 1615 1768 1.81 1.82 0.24
Employment.n 1627 1768 1.98 1.98 0.94
Education.n 1621 1768 3.48 3.46 0.79
Marital.n 1627 1768 2.31 2.28 0.58
Sex.n 1617 1768 1.51 1.52 0.52
Age.n 1622 1768 4.38 4.39 0.89
Income.n 1639 1768 5.35 5.32 0.76
Comparing variances
fs var micerf var p-value
Easy_Reservation 2.13 2.07 0.52
Preferred_Seats 3.47 3.32 0.34
Flight_Options 3.81 3.65 0.38
Ticket_Prices 3.83 3.69 0.43
Seat_Comfort 2.72 2.65 0.55
Seat_Roominess 3.15 3.03 0.43
Overhead_Storage 3.55 3.38 0.30
Clean_Aircraft 2.84 2.74 0.44
Courtesy 2.12 1.97 0.16
Friendliness 2.13 2.08 0.63
Helpfulness 2.21 2.11 0.35
Service 2.45 2.37 0.48
Satisfaction 2.78 2.71 0.56
Recommend 3.61 3.54 0.67
Language.n 1.45 1.35 0.18
Smoker.n 0.16 0.15 0.18
Employment.n 0.67 0.63 0.25
Education.n 4.97 4.64 0.16
Marital.n 2.34 2.21 0.22
Sex.n 0.25 0.25 0.97
Age.n 4.52 4.22 0.17
Income.n 9.91 9.36 0.24

The data frame df is the final imputed data set and will be used from now on.

df <- as.data.frame(cbind(mice_mrg$data, data$FlyAgain))
names(df) <- names(data)[2:24]
rm(data_imp, mice_mrg, tempData, data)#just for ease of use

It is better to save the imputed data and read it back in later, to save time on subsequent runs. This is done below.

#write.csv(df, "./imputed_data.csv", row.names = F)
df <- read.csv("./imputed_data.csv")

4.4 A glimpse at the predictor values

Running a value count on all predictors shows that the data is skewed towards higher opinions and perceptions of the airline, meaning the majority of customers had a good opinion of the airline, which is very good for us. The downside, however, is that it may be harder to analyze the customers who will not return. We will see more of this in the next analyses.

library(wrapr)

tabfun <- function(x){
  table(x)
}
1:23 %.>%  (function(x) {lapply(df[,(x)], tabfun)}) (.)
$Easy_Reservation
x
   1    2    3    4    5    6    7    8    9 
  20    4    9   26   37   84  191  319 1078 

$Preferred_Seats
x
  1   2   3   4   5   6   7   8   9 
 35  16  25  72 145 245 367 366 497 

$Flight_Options
x
  1   2   3   4   5   6   7   8   9 
 34  26  45  82 168 267 349 334 463 

$Ticket_Prices
x
  1   2   3   4   5   6   7   8   9 
 42  22  50 104 203 298 354 338 357 

$Seat_Comfort
x
  1   2   3   4   5   6   7   8   9 
 24   7  13  44  61 139 264 321 895 

$Seat_Roominess
x
  1   2   3   4   5   6   7   8   9 
 27  15  19  48 105 190 288 394 682 

$Overhead_Storage
x
  1   2   3   4   5   6   7   8   9 
 37  14  24  71 151 250 334 369 518 

$Clean_Aircraft
x
  1   2   3   4   5   6   7   8   9 
 22  12  18  36  73 152 257 335 863 

$Courtesy
x
   1    2    3    4    5    6    7    8    9 
  16   11   12   13   38   63  162  272 1181 

$Friendliness
x
   1    2    3    4    5    6    7    8    9 
  13   15   12   17   39   85  188  291 1108 

$Helpfulness
x
   1    2    3    4    5    6    7    8    9 
  14    9   10   24   46  109  202  301 1053 

$Service
x
   1    2    3    4    5    6    7    8    9 
  16   12   18   23   57  103  185  297 1057 

$Satisfaction
x
  1   2   3   4   5   6   7   8   9 
 23  14  14  30  81 129 244 352 881 

$Recommend
x
  1   2   3   4   5   6   7   8   9 
 38  19  34  73 154 237 355 355 503 

$Language.n
x
   1    2    3    4    5 
1099  394   15  185   75 

$Smoker.n
x
   1    2 
 314 1454 

$Employment.n
x
  1   2   3 
581 648 539 

$Education.n
x
  1   2   3   4   5   6   7   8   9  10 
323 461 201 351  52 257  22  14  79   8 

$Marital.n
x
  1   2   3   4   5   6 
784 264 466  46 103 105 

$Sex.n
x
  1   2 
843 925 

$Age.n
x
  1   2   3   4   5   6   7   8   9 
265  95 223 283 269 352 207  72   2 

$Income.n
x
  1   2   3   4   5   6   7   8   9  10  11  12  13 
 75 241 294 237 216 196 126 102  65  54  50  69  43 

$FlyAgain
x
   0    1 
 706 1062 

5 Analysis

5.1 Logistic Regression

Now we fit a binomial logistic regression model in base R. At first we use all the variables except RID, which was omitted from the data set. The results and p-values below show that, among the demographic variables, only Education and Age are important, meaning they have a significant effect in the model. Overall, the logistic regression indicates that 12 of the 22 predictor variables have a significant effect on a passenger’s intention to fly with the airline again.

glm.all <- glm( FlyAgain ~ . , data = df, family = 'binomial')
summary(glm.all)

Call:
glm(formula = FlyAgain ~ ., family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4968  -0.7934   0.4275   0.7441   2.8841  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -14.43271    1.05088 -13.734  < 2e-16 ***
Easy_Reservation   0.12834    0.06069   2.115 0.034457 *  
Preferred_Seats    0.13601    0.04303   3.161 0.001571 ** 
Flight_Options     0.19887    0.03986   4.989 6.06e-07 ***
Ticket_Prices      0.14091    0.03941   3.575 0.000350 ***
Seat_Comfort       0.14328    0.05412   2.647 0.008110 ** 
Seat_Roominess     0.04712    0.04569   1.031 0.302349    
Overhead_Storage   0.17586    0.04525   3.887 0.000102 ***
Clean_Aircraft     0.11012    0.04972   2.215 0.026789 *  
Courtesy           0.08667    0.06443   1.345 0.178577    
Friendliness       0.14192    0.06229   2.278 0.022706 *  
Helpfulness        0.08049    0.06456   1.247 0.212505    
Service           -0.03066    0.06031  -0.508 0.611262    
Satisfaction       0.10440    0.05033   2.074 0.038060 *  
Recommend          0.35787    0.04339   8.247  < 2e-16 ***
Language.n        -0.01485    0.05113  -0.290 0.771499    
Smoker.n           0.11579    0.15586   0.743 0.457556    
Employment.n       0.06997    0.07539   0.928 0.353361    
Education.n        0.07730    0.02829   2.733 0.006284 ** 
Marital.n          0.02299    0.03973   0.579 0.562756    
Sex.n              0.12093    0.12004   1.007 0.313719    
Age.n              0.05409    0.02952   1.833 0.066872 .  
Income.n           0.02019    0.01989   1.015 0.310145    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2378.8  on 1767  degrees of freedom
Residual deviance: 1711.7  on 1745  degrees of freedom
AIC: 1757.7

Number of Fisher Scoring iterations: 5

To get a sense of the coefficients in the table above, the odds ratio for Flight_Options is \(e^{0.19887}\) = 1.22, meaning that the odds of returning to fly with this airline are multiplied by 1.22 for every one-point increase in a customer’s perception that the airline provides many valuable flight options.
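
As a quick check, the full set of odds ratios can be obtained directly from the fitted model with one line of base R:

# Odds ratios for every coefficient in glm.all; Flight_Options should come out near 1.22
round(exp(coef(glm.all)), 2)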

Of course, this is not the only factor contributing to a customer’s return; each predictor has an effect, and our job is to find the most significant drivers of flying again with this airline.

The ANOVA table below indicates the deviance improvements when each of the predictor candidates is added to the model.

anova(glm.all, test = "Chisq") 
Analysis of Deviance Table

Model: binomial, link: logit

Response: FlyAgain

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                              1767     2378.8              
Easy_Reservation  1  191.711      1766     2187.1 < 2.2e-16 ***
Preferred_Seats   1   98.035      1765     2089.1 < 2.2e-16 ***
Flight_Options    1   56.584      1764     2032.5 5.384e-14 ***
Ticket_Prices     1   37.494      1763     1995.0 9.169e-10 ***
Seat_Comfort      1  104.751      1762     1890.2 < 2.2e-16 ***
Seat_Roominess    1   18.840      1761     1871.4 1.422e-05 ***
Overhead_Storage  1   39.390      1760     1832.0 3.470e-10 ***
Clean_Aircraft    1   12.394      1759     1819.6 0.0004306 ***
Courtesy          1    5.671      1758     1813.9 0.0172498 *  
Friendliness      1    6.332      1757     1807.6 0.0118599 *  
Helpfulness       1    1.288      1756     1806.3 0.2564326    
Service           1    0.264      1755     1806.0 0.6073177    
Satisfaction      1    8.383      1754     1797.7 0.0037873 ** 
Recommend         1   71.822      1753     1725.8 < 2.2e-16 ***
Language.n        1    0.063      1752     1725.8 0.8020028    
Smoker.n          1    0.295      1751     1725.5 0.5873378    
Employment.n      1    0.968      1750     1724.5 0.3250803    
Education.n       1    7.209      1749     1717.3 0.0072552 ** 
Marital.n         1    0.286      1748     1717.0 0.5930444    
Sex.n             1    0.893      1747     1716.1 0.3445965    
Age.n             1    3.432      1746     1712.7 0.0639337 .  
Income.n          1    1.035      1745     1711.7 0.3089706    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the next step, another logistic regression model is built on the 12 more promising predictors. In this model, called “glm.2”, it can be seen that Age is no longer a significant predictor.

It is also evident that Flight_Options, Overhead_Storage and Recommend are the major drivers of customer return in this model.

glm.2 <- glm( FlyAgain ~ . -Marital.n -Sex.n -Income.n -Employment.n -Smoker.n -Language.n -Seat_Roominess - Courtesy -Helpfulness -Service , data = df, family = 'binomial')
summary(glm.2)

Call:
glm(formula = FlyAgain ~ . - Marital.n - Sex.n - Income.n - Employment.n - 
    Smoker.n - Language.n - Seat_Roominess - Courtesy - Helpfulness - 
    Service, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4609  -0.7934   0.4312   0.7586   2.8752  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -13.04135    0.86291 -15.113  < 2e-16 ***
Easy_Reservation   0.13036    0.06033   2.161 0.030721 *  
Preferred_Seats    0.14137    0.04273   3.308 0.000938 ***
Flight_Options     0.19931    0.03958   5.035 4.78e-07 ***
Ticket_Prices      0.14230    0.03922   3.629 0.000285 ***
Seat_Comfort       0.15781    0.05248   3.007 0.002639 ** 
Overhead_Storage   0.18540    0.04371   4.241 2.22e-05 ***
Clean_Aircraft     0.11937    0.04903   2.435 0.014906 *  
Friendliness       0.18031    0.05605   3.217 0.001295 ** 
Satisfaction       0.12140    0.04862   2.497 0.012525 *  
Recommend          0.36062    0.04300   8.387  < 2e-16 ***
Education.n        0.07814    0.02808   2.783 0.005390 ** 
Age.n              0.04794    0.02917   1.643 0.100308    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2378.8  on 1767  degrees of freedom
Residual deviance: 1720.3  on 1755  degrees of freedom
AIC: 1746.3

Number of Fisher Scoring iterations: 5

In the next step, the model is reduced to only the variables whose coefficients are significantly different from zero at the 5% level of risk. Therefore, Age is removed from the predictors; the only remaining non-attitudinal predictor is Education, and the other demographic data are not used.

In the next two code chunks, the summary and the ANOVA table of the reduced model are presented.

glm.2.2 <- glm( FlyAgain ~ . -Marital.n -Sex.n -Income.n -Employment.n
              -Smoker.n -Language.n -Seat_Roominess - Courtesy -Helpfulness
              -Service -Age.n , data = df, family = 'binomial')
summary(glm.2.2)

Call:
glm(formula = FlyAgain ~ . - Marital.n - Sex.n - Income.n - Employment.n - 
    Smoker.n - Language.n - Seat_Roominess - Courtesy - Helpfulness - 
    Service - Age.n, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5176  -0.8020   0.4308   0.7577   2.8108  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -12.74533    0.84015 -15.170  < 2e-16 ***
Easy_Reservation   0.13021    0.06037   2.157  0.03102 *  
Preferred_Seats    0.14115    0.04275   3.302  0.00096 ***
Flight_Options     0.19840    0.03952   5.020 5.16e-07 ***
Ticket_Prices      0.14224    0.03915   3.633  0.00028 ***
Seat_Comfort       0.15947    0.05245   3.040  0.00236 ** 
Overhead_Storage   0.18450    0.04366   4.226 2.38e-05 ***
Clean_Aircraft     0.11849    0.04902   2.417  0.01566 *  
Friendliness       0.17221    0.05574   3.090  0.00200 ** 
Satisfaction       0.12226    0.04860   2.516  0.01188 *  
Recommend          0.35884    0.04296   8.352  < 2e-16 ***
Education.n        0.07710    0.02803   2.750  0.00595 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2378.8  on 1767  degrees of freedom
Residual deviance: 1723.0  on 1756  degrees of freedom
AIC: 1747

Number of Fisher Scoring iterations: 5
anova(glm.2.2, test = "Chisq") 
Analysis of Deviance Table

Model: binomial, link: logit

Response: FlyAgain

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                              1767     2378.8              
Easy_Reservation  1  191.711      1766     2187.1 < 2.2e-16 ***
Preferred_Seats   1   98.035      1765     2089.1 < 2.2e-16 ***
Flight_Options    1   56.584      1764     2032.5 5.384e-14 ***
Ticket_Prices     1   37.494      1763     1995.0 9.169e-10 ***
Seat_Comfort      1  104.751      1762     1890.2 < 2.2e-16 ***
Overhead_Storage  1   51.802      1761     1838.4 6.139e-13 ***
Clean_Aircraft    1   14.734      1760     1823.7 0.0001238 ***
Friendliness      1    9.546      1759     1814.1 0.0020038 ** 
Satisfaction      1   10.506      1758     1803.6 0.0011900 ** 
Recommend         1   72.926      1757     1730.7 < 2.2e-16 ***
Education.n       1    7.690      1756     1723.0 0.0055531 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the metrics for all three models in the table below shows that the final reduced model performs best according to its BIC.

library(texreg)
screenreg( list(glm.all, glm.2, glm.2.2), custom.model.names=c("Return based on all", "Return based on 12 predictors", "Return based on 11 predictors"), digits=3 )

===================================================================================================
                  Return based on all  Return based on 12 predictors  Return based on 11 predictors
---------------------------------------------------------------------------------------------------
(Intercept)        -14.433 ***          -13.041 ***                    -12.745 ***                 
                    (1.051)              (0.863)                        (0.840)                    
Easy_Reservation     0.128 *              0.130 *                        0.130 *                   
                    (0.061)              (0.060)                        (0.060)                    
Preferred_Seats      0.136 **             0.141 ***                      0.141 ***                 
                    (0.043)              (0.043)                        (0.043)                    
Flight_Options       0.199 ***            0.199 ***                      0.198 ***                 
                    (0.040)              (0.040)                        (0.040)                    
Ticket_Prices        0.141 ***            0.142 ***                      0.142 ***                 
                    (0.039)              (0.039)                        (0.039)                    
Seat_Comfort         0.143 **             0.158 **                       0.159 **                  
                    (0.054)              (0.052)                        (0.052)                    
Seat_Roominess       0.047                                                                         
                    (0.046)                                                                        
Overhead_Storage     0.176 ***            0.185 ***                      0.185 ***                 
                    (0.045)              (0.044)                        (0.044)                    
Clean_Aircraft       0.110 *              0.119 *                        0.118 *                   
                    (0.050)              (0.049)                        (0.049)                    
Courtesy             0.087                                                                         
                    (0.064)                                                                        
Friendliness         0.142 *              0.180 **                       0.172 **                  
                    (0.062)              (0.056)                        (0.056)                    
Helpfulness          0.080                                                                         
                    (0.065)                                                                        
Service             -0.031                                                                         
                    (0.060)                                                                        
Satisfaction         0.104 *              0.121 *                        0.122 *                   
                    (0.050)              (0.049)                        (0.049)                    
Recommend            0.358 ***            0.361 ***                      0.359 ***                 
                    (0.043)              (0.043)                        (0.043)                    
Language.n          -0.015                                                                         
                    (0.051)                                                                        
Smoker.n             0.116                                                                         
                    (0.156)                                                                        
Employment.n         0.070                                                                         
                    (0.075)                                                                        
Education.n          0.077 **             0.078 **                       0.077 **                  
                    (0.028)              (0.028)                        (0.028)                    
Marital.n            0.023                                                                         
                    (0.040)                                                                        
Sex.n                0.121                                                                         
                    (0.120)                                                                        
Age.n                0.054                0.048                                                    
                    (0.030)              (0.029)                                                   
Income.n             0.020                                                                         
                    (0.020)                                                                        
---------------------------------------------------------------------------------------------------
AIC               1757.652             1746.309                       1747.013                     
BIC               1883.637             1817.518                       1812.744                     
Log Likelihood    -855.826             -860.154                       -861.507                     
Deviance          1711.652             1720.309                       1723.013                     
Num. obs.         1768                 1768                           1768                         
===================================================================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
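
The information criteria reported by screenreg() can be cross-checked with base R's own helpers:

# Base-R cross-check of the AIC and BIC values shown in the comparison table above
AIC(glm.all, glm.2, glm.2.2)
BIC(glm.all, glm.2, glm.2.2)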

Below the 11 selected variables are listed. These are the predictors used in the final model.

var <- c("Easy_Reservation", "Preferred_Seats"  ,"Flight_Options", "Ticket_Prices", "Seat_Comfort",
         "Overhead_Storage", "Clean_Aircraft", "Friendliness", "Satisfaction", "Recommend", "Education.n")

5.2 Multicollinearity, or high correlation among the predictors

High correlation among the predictors can harm the logistic regression model, so a multicollinearity analysis is needed to see which predictors are associated with each other. Removing duplicated information may result in a better and simpler model.

Figure 1: Correlation matrix of predictor perceptions

df %>% corPlot(  numbers=TRUE, stars=TRUE, upper=FALSE, diag=FALSE, main= "Correlation matrix of predictor perceptions",
                         cex = .8, xlas=2) 

The findings deduced from Figure 1 are as follows:

  • Service and Helpfulness are highly correlated, at 0.69. Moreover, Service is correlated with Friendliness and Courtesy. Therefore, Service was rightfully purged from the final model.

  • Helpfulness and Friendliness are also correlated, and Helpfulness was correctly purged because its coefficient was not significant.

  • Seat_Roominess and Seat_Comfort are correlated (Seat_Roominess is purged).

  • Overhead_Storage, Clean_Aircraft and Seat_Comfort are correlated (all are in the latest model). We will not touch them for now.

5.3 Correlation among the 11 selected variables

library(magrittr)
library(kableExtra)
df[var] %>% cor( )  %>% round( 2) %>% kable( caption = "Flight Correlation Table") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "left", font_size = 12 )%>%
  footnote(general = "These are Pearson correlation coefficients with 2 significant digits") 
Flight Correlation Table
Easy_Reservation Preferred_Seats Flight_Options Ticket_Prices Seat_Comfort Overhead_Storage Clean_Aircraft Friendliness Satisfaction Recommend Education.n
Easy_Reservation 1.00 0.55 0.55 0.52 0.44 0.41 0.43 0.47 0.49 0.53 -0.03
Preferred_Seats 0.55 1.00 0.53 0.52 0.40 0.39 0.37 0.40 0.42 0.51 -0.01
Flight_Options 0.55 0.53 1.00 0.51 0.37 0.34 0.34 0.36 0.41 0.47 -0.02
Ticket_Prices 0.52 0.52 0.51 1.00 0.40 0.38 0.37 0.39 0.43 0.48 -0.07
Seat_Comfort 0.44 0.40 0.37 0.40 1.00 0.62 0.60 0.48 0.53 0.53 -0.02
Overhead_Storage 0.41 0.39 0.34 0.38 0.62 1.00 0.57 0.44 0.49 0.51 -0.04
Clean_Aircraft 0.43 0.37 0.34 0.37 0.60 0.57 1.00 0.47 0.50 0.50 -0.02
Friendliness 0.47 0.40 0.36 0.39 0.48 0.44 0.47 1.00 0.56 0.38 -0.04
Satisfaction 0.49 0.42 0.41 0.43 0.53 0.49 0.50 0.56 1.00 0.48 -0.06
Recommend 0.53 0.51 0.47 0.48 0.53 0.51 0.50 0.38 0.48 1.00 -0.03
Education.n -0.03 -0.01 -0.02 -0.07 -0.02 -0.04 -0.02 -0.04 -0.06 -0.03 1.00
Note:
These are Pearson correlation coefficients with 2 significant digits

After reducing the model to 11 predictors, the only remaining multicollinearity concern is the group of Seat_Comfort, Clean_Aircraft and Overhead_Storage variables.

5.4 Variance Inflation Factors – a measure of likely harm from multicollinearity

VIF is computed for the model with all variables and for the model using the most significant predictors. These VIFs are well within the acceptability limit of <4. However, it can be seen that the predictors Service and Helpfulness, which were removed in the latest model, had the highest VIFs of all variables.

library(car)
(vf <- vif(glm.all) )
Easy_Reservation  Preferred_Seats   Flight_Options    Ticket_Prices 
        1.155084         1.179599         1.177781         1.165260 
    Seat_Comfort   Seat_Roominess Overhead_Storage   Clean_Aircraft 
        1.348916         1.267461         1.346092         1.218201 
        Courtesy     Friendliness      Helpfulness          Service 
        1.159587         1.324058         1.495542         1.547692 
    Satisfaction        Recommend       Language.n         Smoker.n 
        1.197295         1.136932         1.010601         1.020559 
    Employment.n      Education.n        Marital.n            Sex.n 
        1.012851         1.022416         1.015127         1.014310 
           Age.n         Income.n 
        1.027196         1.012337 
(vf <- vif(glm.2.2) )
Easy_Reservation  Preferred_Seats   Flight_Options    Ticket_Prices 
        1.150700         1.172121         1.165858         1.155431 
    Seat_Comfort Overhead_Storage   Clean_Aircraft     Friendliness 
        1.282218         1.271782         1.194666         1.079793 
    Satisfaction        Recommend      Education.n 
        1.125685         1.119711         1.016929 

6 Analysis in h2o

6.1 Logistic Regression using H2o

The previous analyses were done on the whole data set, but a good predictive model is judged by how it reacts to new, unseen data. In this chapter, we use h2o to divide the data into training and test (validation) subsets and then train logistic regression models to find the best model based on validation results, and also to find the key drivers of customer return to the airline. The insights gained from the previous chapter are used in the new h2o models as well.

First we set up the h2o environment.

library(h2o)
h2o.init()

Then the imputed data frame from before is converted to an h2o-readable format and split into training and test samples. The train-test ratio for this analysis is 70/30.

Note: If your response column is binomial, then you must convert that column to a categorical (.asfactor() in Python and as.factor() in R) and set family = binomial. (From h2o online manual for GLMs)

df$FlyAgain <- as.factor(df$FlyAgain)
df_h2o <- as.h2o(df)
df.split<- h2o.splitFrame(df_h2o, ratios=c(0.7), seed = 45)

Finally, the first logistic regression model is built, initially using all 22 predictors. Notice that the model’s coefficients are calculated on the training subset only; therefore, the results and metrics differ from those seen with the base-R approach.

# logistic regression
h2o_glm.all <- h2o.glm(
  family= "binomial",  
  training_frame = df.split[[1]],        ## the H2O frame for training
  validation_frame = df.split[[2]],      ## the H2O frame for validation (not required)
  x=1:22,                        ## the predictor columns, by column index
  y=23,
  model_id = "df_GLM_all",
  compute_p_values=TRUE, lambda=0
)

The table below shows the coefficients of the model.

h2o_glm.all@model$coefficients_table %>% datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T))

Below is a plot of the relative importance of the variables in the model, based on the magnitudes of the standardized coefficients: larger coefficients have a higher impact on the log-odds of a customer returning to the airline.

Figure 3: Standardized coefficients of the model with all predictors

h2o.std_coef_plot(h2o_glm.all, num_of_features = 12)

pvalue <- h2o_glm.all@model$coefficients_table[order(h2o_glm.all@model$coefficients_table$p_value, decreasing = F), c("names", "p_value")]

To better interpret Figure 3, the table below shows the key drivers with a p-value below the 5% limit, based on the model trained on the training data with all predictors. It is evident that Recommend is the most significant predictor of customer return, meaning that customers who would recommend this airline to others are the most likely to fly with it again.

It is important to note that the logistic regression and its coefficients depend heavily on the data used; therefore, when a different seed is used to split the train-test subsets, the coefficients and Figure 3 change too.

pvalue <- h2o_glm.all@model$coefficients_table[order(h2o_glm.all@model$coefficients_table$p_value, decreasing = F), c("names", "p_value")]
pvalue %>% filter(p_value < 0.05) %>% filter(names != 'Intercept') %>% datatable(rownames = F, filter="top", options = list(pageLength = 12, scrollX=T), caption = "Top Predictors in 95% CI", width = 500)

6.2 Using each predictor

In the code chunk below, a logistic regression model is trained using only one predictor at a time. Interestingly, all opinion-based questions have significant p-values (essentially zero), while all personal and demographic variables in the questionnaire are non-significant. This makes sense, as the non-perceptual predictors also performed very poorly in the model trained on the whole data, and it again encourages us to purge them from the next models.

Table 2: Coefficient table for the single-predictor (Smoker.n) model

h2o_glm.each <- h2o.glm(
  family= "binomial",  
  training_frame = df.split[[1]],        ## the H2O frame for training
  validation_frame = df.split[[2]],      ## the H2O frame for validation (not required)
  x=16,                        ## the predictor columns, by column index
  y=23,
  model_id = "df_GLM_smk",
  compute_p_values=TRUE, lambda=0
)
h2o_glm.each@model$coefficients_table %>% datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T))

In Table 2, the logistic regression is trained on the smoking characteristic of the customers. As mentioned, its p-value is not significant.
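
The fits for the remaining columns are not echoed; below is a sketch of how the same single-predictor model could be repeated for every candidate column (indices 1:22, with FlyAgain in column 23 as above) and the p-values collected.

# Sketch: fit one single-predictor GLM per column and collect its p-value
single_pvals <- sapply(1:22, function(i) {
  m <- h2o.glm(family = "binomial",
               training_frame = df.split[[1]],
               validation_frame = df.split[[2]],
               x = i, y = 23,
               compute_p_values = TRUE, lambda = 0)
  ct <- m@model$coefficients_table
  ct[ct$names != "Intercept", "p_value"]   # the single non-intercept p-value
})
data.frame(predictor = names(df)[1:22], p_value = round(single_pvals, 4))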

6.3 Model using the selected variables from base-R

For this reduced model, the same 11 variables selected in base R are used.

var #containing the variables of interest
idx <- which(names(df) %in%  var)

h2o_glm.2.2 <- h2o.glm(
  family= "binomial",  
  training_frame = df.split[[1]],        ## the H2O frame for training
  validation_frame = df.split[[2]],      ## the H2O frame for validation (not required)
  x=idx,                        ## the predictor columns, by column index
  y=23,
  model_id = "df_GLM_2",
  compute_p_values=TRUE, lambda=0
)

An interesting finding comes from the coefficients below: they differ from what we had in base R. What is the reason for that? These coefficients have been computed using the training data only, whereas earlier the logistic regression was fitted on all the data. These results are therefore more realistic, although they are subject to change when different seeds are used to split the data.

#options(max.print = 100000)
h2o_glm.2.2@model$coefficients_table %>% datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T))

Below is a quick plot of the relative importance of the variables in explaining Return, based on the standardized coefficients. The plot does change when different seeds are used, especially in the middle, where most coefficients are not significantly different from one another. Therefore, this plot cannot be used with certainty to identify the key drivers of customer return; further analysis is needed.

h2o.std_coef_plot(h2o_glm.2.2, num_of_features = 14)

pvalue <- h2o_glm.2.2@model$coefficients_table[order(h2o_glm.2.2@model$coefficients_table$p_value, decreasing = F), c("names", "p_value")]
#pvalue %>% datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T))

The table below shows the significant predictors of FlyAgain in the reduced model. As you can see, there are differences between this model and the one using all predictors.

pvalue %>% filter(p_value < 0.05) %>% filter(names != 'Intercept') %>% datatable(rownames = F, filter="top", options = list(pageLength = 12, scrollX=T), caption = "Top Predictors in 95% CI", width = 500)

6.4 Measuring Performance

As you know, the quality of a model is based on how it performs on the testing, or holdout, data set. We would like a model that performs very well and is relatively simple, easy to understand and easy to use for marketing strategy development. In this section we look at the performance of the two main models created before and try to find areas for improvement, in order to select one of them or to create a new, better, reduced model.

In the next code chunk, the performance of the first model (based on all predictors) is evaluated on the validation data.

perf <- h2o.performance(h2o_glm.all, df.split[[2]])
perf
H2OBinomialMetrics: glm

MSE:  0.1723331
RMSE:  0.4151302
LogLoss:  0.5120329
Mean Per-Class Error:  0.3182479
AUC:  0.8041192
AUCPR:  0.8497892
Gini:  0.6082385
R^2:  0.2800489
Residual Deviance:  526.3698
AIC:  572.3698

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         0   1    Error      Rate
0       86 118 0.578431  =118/204
1       18 292 0.058065   =18/310
Totals 104 410 0.264591  =136/514

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold      value idx
1                       max f1  0.375948   0.811111 314
2                       max f2  0.192605   0.909358 357
3                 max f0point5  0.641802   0.790922 204
4                 max accuracy  0.585020   0.747082 231
5                max precision  0.969653   1.000000   0
6                   max recall  0.065825   1.000000 381
7              max specificity  0.969653   1.000000   0
8             max absolute_mcc  0.585020   0.468380 231
9   max min_per_class_accuracy  0.648602   0.725490 200
10 max mean_per_class_accuracy  0.641802   0.733713 204
11                     max tns  0.969653 204.000000   0
12                     max fns  0.969653 309.000000   0
13                     max fps  0.000030 204.000000 399
14                     max tps  0.065825 310.000000 381
15                     max tnr  0.969653   1.000000   0
16                     max fnr  0.969653   0.996774   0
17                     max fpr  0.000030   1.000000 399
18                     max tpr  0.065825   1.000000 381

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

The overall hit ratio is 73.541% and is the proportion of the time that the model produces a correct prediction of Return.

The overall hit ratio, expressed as one minus the error rate above, can be misleading if the hit ratio for one category is much different from another. In this analysis, the hit ratio for predicting Return = No is 42.157% and the hit ratio for predicting Return = Yes is 94.194%.

The model performs much better when predicting customers who will return to the airline than those who won’t.
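
These per-class hit ratios can be read directly off the validation confusion matrix, for example:

# Hit ratios = 1 - error rates from the confusion matrix (Return = No, Return = Yes, overall)
cm <- h2o.confusionMatrix(h2o_glm.all, df.split[[2]])
round(100 * (1 - cm$Error), 3)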

h2o.confusionMatrix(h2o_glm.all, df.split[[2]]) %>% kable( caption = "Confusion Matrix for model h2o_glm.all") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, position = "center", font_size = 20)%>%
  footnote(general = "These are error rates of validation sample")
Confusion Matrix for model h2o_glm.all
0 1 Error Rate
0 86 118 0.5784314 =118/204
1 18 292 0.0580645 =18/310
Totals 104 410 0.2645914 =136/514
Note:
These are error rates of validation sample

Now let’s see how the reduced model with 11 predictors performed.

perf.2.2 <- h2o.performance(h2o_glm.2.2, df.split[[2]])
perf.2.2
H2OBinomialMetrics: glm

MSE:  0.1715682
RMSE:  0.4142079
LogLoss:  0.5124333
Mean Per-Class Error:  0.3282416
AUC:  0.8043248
AUCPR:  0.8413802
Gini:  0.6086496
R^2:  0.2832442
Residual Deviance:  526.7815
AIC:  550.7815

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
        0   1    Error      Rate
0      76 128 0.627451  =128/204
1       9 301 0.029032    =9/310
Totals 85 429 0.266537  =137/514

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold      value idx
1                       max f1  0.297726   0.814614 332
2                       max f2  0.175638   0.906416 359
3                 max f0point5  0.605268   0.789987 230
4                 max accuracy  0.526762   0.752918 262
5                max precision  0.970616   1.000000   0
6                   max recall  0.079372   1.000000 380
7              max specificity  0.970616   1.000000   0
8             max absolute_mcc  0.526762   0.473400 262
9   max min_per_class_accuracy  0.665945   0.725490 206
10 max mean_per_class_accuracy  0.605268   0.735073 230
11                     max tns  0.970616 204.000000   0
12                     max fns  0.970616 309.000000   0
13                     max fps  0.000051 204.000000 399
14                     max tps  0.079372 310.000000 380
15                     max tnr  0.970616   1.000000   0
16                     max fnr  0.970616   0.996774   0
17                     max fpr  0.000051   1.000000 399
18                     max tpr  0.079372   1.000000 380

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

The hit ratio of 73.346% is almost the same, but there is a decent decrease in AIC: the second model achieves an AIC of 550.78, improving on 572.37. It is not a huge improvement, but it shows that we are on the right path. The second model is simpler and has about the same accuracy as the previous one.
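
The AIC comparison can also be pulled straight from the two performance objects:

# Validation AIC of the full model vs. the reduced 11-predictor model
c(all_predictors = perf@metrics$AIC, reduced_11 = perf.2.2@metrics$AIC)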

6.5 Room for improvement?

The second model was simpler, but it did not perform any better. In fact, it performed even worse when predicting the customers who said no to returning to the airline! We saw that Overhead_Storage was one of the key drivers in all models, and in the multicollinearity analysis we found that it is highly correlated with Seat_Comfort and Clean_Aircraft. Therefore, in an alternative model we purge these two variables and look at the results in the chunks below.

But first, let’s make a function to more easily compare and print each model’s results.

results <- data.frame(Prediction_model=character(),
                  hit_ratio=numeric(),
                  MSE=numeric(),
                  RMSE=numeric(),
                  R2=numeric(),
                  AIC=numeric(),
                  AUC=numeric(),
                  mean_per_class_error=numeric(),
                  stringsAsFactors=FALSE) 

newModel <- function(results, model, name){
    if (name %in% results$Prediction_model) {return(results)}
    i <- dim(results)[1]
    i <- i + 1 
    results[i, 1] <- name
    results[i, 2] <- round( (1 - h2o.performance(model, newdata = df.split[[2]] )@metrics$cm$table$Error[3] ), digits = 3 )
    results[i, 3] <- round( model@model$validation_metrics@metrics$MSE, digits = 3 )   
    results[i, 4] <- round( model@model$validation_metrics@metrics$RMSE, digits = 3 )
    results[i, 5] <- round( model@model$validation_metrics@metrics$r2, digits = 3 )
    results[i, 6] <- round( model@model$validation_metrics@metrics$AIC, digits = 3 )
    results[i, 7] <- round( h2o.performance(model, newdata = df.split[[2]] )@metrics$AUC, digits = 3 )
    results[i, 8] <- round( model@model$validation_metrics@metrics$mean_per_class_error, digits = 3 ) 
    return(results)
}

The following performance metrics are reported for each of the first two models.

results <- newModel(results, h2o_glm.all, "GLM_Logistic_regression_allVar")
results <- newModel(results, h2o_glm.2.2, "GLM_Logistic_regression_reduced_Var")
results %>% datatable(rownames = F, filter="top", options = list(pageLength = 12, scrollX=T), caption = "Logistic Regression Model Performance metrics on validation data")

Now the alternative model is trained using 11 - 2 = 9 predictors, removing the supposedly duplicated information in the Seat_Comfort and Clean_Aircraft columns.

# var
# names(df)[!(names(df) %in% var)]
# idx
# names(df)

idx.4 <- idx[!(idx %in% c(5, 8))]  # drop Seat_Comfort (column 5) and Clean_Aircraft (column 8)
idx.4
#"These are the predictors in model glm.2.4"
print(names(df)[idx.4])
h2o_glm.2.4 <- h2o.glm(x=idx.4,
                       y=23,
                       family= "binomial",
                        training_frame = df.split[[1]],        ## the H2O frame for training
                        validation_frame = df.split[[2]],      ## the H2O frame for validation (not required)
                        model_id = "df_GLM_4",
                        compute_p_values=TRUE, lambda=0
                       )
h2o.performance(h2o_glm.2.4, df.split[[2]])
H2OBinomialMetrics: glm

MSE:  0.1732875
RMSE:  0.4162782
LogLoss:  0.5164697
Mean Per-Class Error:  0.3165718
AUC:  0.799581
AUCPR:  0.8405807
Gini:  0.5991619
R^2:  0.2760614
Residual Deviance:  530.9308
AIC:  550.9308

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         0   1    Error      Rate
0       88 116 0.568627  =116/204
1       20 290 0.064516   =20/310
Totals 108 406 0.264591  =136/514

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold      value idx
1                       max f1  0.369065   0.810056 311
2                       max f2  0.220121   0.907210 354
3                 max f0point5  0.727145   0.791599 171
4                 max accuracy  0.544427   0.747082 252
5                max precision  0.968666   1.000000   0
6                   max recall  0.070074   1.000000 383
7              max specificity  0.968666   1.000000   0
8             max absolute_mcc  0.544427   0.462634 252
9   max min_per_class_accuracy  0.653631   0.725490 210
10 max mean_per_class_accuracy  0.606884   0.730171 230
11                     max tns  0.968666 204.000000   0
12                     max fns  0.968666 309.000000   0
13                     max fps  0.000089 204.000000 399
14                     max tps  0.070074 310.000000 383
15                     max tnr  0.968666   1.000000   0
16                     max fnr  0.968666   0.996774   0
17                     max fpr  0.000089   1.000000 399
18                     max tpr  0.070074   1.000000 383

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Table 1 below includes the performance metrics of all three models. The third model is, with a good chance, the most accurate and is the simplest of the three.

Table 1:

results <- newModel(results, h2o_glm.2.4, "GLM_Logistic_regression_9_Var")
results%>% datatable(rownames = F, filter="top", options = list(pageLength = 12, scrollX=T), caption = "Logistic Regression Model Performance metrics on validation data")

7 Key Findings

The goal of this analysis was to produce a model that effectively predicts the chances of customers returning to fly with the airline again. Through multiple analyses, we arrived at a logistic regression model that is relatively simple and performs well on the hold-out data.

This model uses Easy_Reservation, Preferred_Seats, Flight_Options, Ticket_Prices, Overhead_Storage, Friendliness, Satisfaction, Recommend and Education.n to find the FlyAgain probability. The logistic regression computes a coefficient for each predictor and uses them to find the log-odds of the probability of a customer flying again. The formula is:

\(\text{log-odds}=\operatorname{logit}(P_{\text{return}})=\log\left(\frac{P}{1-P}\right)= \beta_0 + \beta_1 x_1 + \dots + \beta_9 x_9\)

\(\beta_0\) to \(\beta_9\) are the coefficients the model gives us, and \(x_1\) to \(x_9\) are the predictor values. Inverting the logit gives the probability of a customer returning to fly with the airline again.

These coefficients for the final selected model are as below:

h2o_glm.2.4@model$coefficients
       Intercept Easy_Reservation  Preferred_Seats   Flight_Options 
    -12.56532647       0.18769463       0.11378399       0.17423105 
   Ticket_Prices Overhead_Storage     Friendliness     Satisfaction 
      0.15603471       0.27877267       0.23085040       0.18390499 
       Recommend      Education.n 
      0.37465406       0.09959908 
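
As an illustration, the coefficients above can be turned into a predicted probability by hand. The customer below is hypothetical (not taken from the data); plogis() inverts the logit.

# Hypothetical customer: top marks (9) on every attitude item and Education.n = 4
cf <- h2o_glm.2.4@model$coefficients
new_cust <- c(Easy_Reservation = 9, Preferred_Seats = 9, Flight_Options = 9,
              Ticket_Prices = 9, Overhead_Storage = 9, Friendliness = 9,
              Satisfaction = 9, Recommend = 9, Education.n = 4)
log_odds <- cf["Intercept"] + sum(cf[names(new_cust)] * new_cust)
plogis(log_odds)   # predicted probability of FlyAgain = 1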

In the prediction table below, the first column is the predicted FlyAgain class, and p0 and p1 are the probabilities of a customer not returning (p0) or returning (p1) to fly with the airline.

y <- h2o.predict(h2o_glm.2.4, df.split[[2]])
y1 <- as.data.frame(y)
y1["id"] <- rownames(y1) #add id col for merge function! otherwise will not work!
test <- as.data.frame(df.split[[2]])
test["id"] <- rownames(test)
pred <- merge( y1,test[,c(idx.4,23:24)], by = "id", sort = T)
#pred <- pred %>% arrange(id)
pred <-pred[, 2:15] 
pred[c("p0", "p1", "StdErr")] <- pred[c("p0", "p1", "StdErr")] %>% round(3)
head(pred)  %>%  datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T), caption = "Predicted probabilities for the first customers in the validation set")

The confusion matrix for the final model is shown below. As seen before, the model performs a lot better when predicting people who said yes to returning than those who said no.

h2o.confusionMatrix(h2o_glm.2.4, df.split[[2]]) %>% kable( caption = "Confusion Matrix for model 9_Var model") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, position = "center", font_size = 20)%>%
  footnote(general = "These are error rates for the validation sample")
Confusion Matrix for model 9_Var model
0 1 Error Rate
0 88 116 0.5686275 =116/204
1 20 290 0.0645161 =20/310
Totals 108 406 0.2645914 =136/514
Note:
These are error rates for the validation sample

The ROC curve for the final model is shown below.

plot(h2o.performance(h2o_glm.2.4, df.split[[2]]))

Looking at the variable importance in the final model, Recommend and Overhead_Storage are the top drivers of customer return by a large margin. After them, the next five predictors (Flight_Options, Friendliness, Ticket_Prices, Satisfaction and Easy_Reservation) have nearly the same importance. The marketing team can use these five or six predictors as the key drivers of customer return to the airline.

Figure 2: Variable importance of the final 9-predictor model

rf_variable_importances <-  data.frame( h2o.varimp(h2o_glm.2.4)$variable, h2o.varimp(h2o_glm.2.4)$percentage)
colnames(rf_variable_importances) <- c("variable_name", "importances")
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
library(plotly)
plot_ly(rf_variable_importances, 
        #        x = rf_variable_importances$percentage, 
        y=reorder(rf_variable_importances$variable_name,
                  rf_variable_importances$importances),
        x = rf_variable_importances$importances,
        color = rf_variable_importances$variable_name,
        type = 'bar', orientation = 'h') %>%
  layout( title = "Variable Importance",
          xaxis = list(title = "Percentage Importance"),
          margin = list(l = 120)) 

8 Conclusions

The goal of the previous chapters was to find a good predictive model. This was achieved: the model using 9 predictors does a decent job of finding customers who will return to the airline. However, the model does not perform well enough in predicting non-returning customers. This might be due to limitations of the data or of the logistic regression model. It would not be a bad idea to try other models, such as random forests or gradient boosting, in future work.

The other objective was to find the key drivers of customer return. Through the many different models created on the data set using different groups of predictors, and as seen in Figure 2, the following variables are believed to be of most value for future marketing campaigns.

  1. Recommend: How likely a customer is to recommend the airline to others is a strong indicator of how likely they are to return and fly again.
  2. Overhead_Storage: This was the second most important driver in almost all models. Adequate overhead storage is an important feature of a good flight and a good airline.
  3. Flight_Options: The quality and quantity of flight options were among the top drivers of predicted customer return. This was evident in almost all models.
  4. Ticket_Prices: A higher-quality airline tends to have higher ticket prices as well; however, different market segments could react differently to ticket prices as a driver of return. As we are not analyzing market segments in this report, it is safe to say that an optimal price could be used to drive customers to return without damaging the quality of the other factors.
  5. Easy_Reservation: Although not among the top five predictors of the final model, it was repeatedly among the most significant coefficients, and understandably so: how could customers buy tickets again if the reservation process is not easy?

The predictors above can be named as the key drivers of customer return and are of great value to the marketing team. Of course, there is always room for improvement, and other machine learning techniques could offer further valuable insights.