1 Statement of authorship

I have executed and prepared this assignment and document by myself without the help of any other person. Signature:

2 Background

knitr::include_graphics(path = "./airplane.jpg") 

Passenger loyalty is fundamental to any airline aiming to maintain a stable market share and revenue stream (Chang and Hung, 2015), particularly in a turbulent market. The competitive landscape of the global airline industry has been in constant change in recent years, with the rapid growth of low-cost carriers and high-speed rail, rising fuel costs, fluctuating demand, and tighter security, safety and quality requirements. And that is before even considering a global pandemic like COVID-19 and its effects on airlines. To survive and grow, airline managers need to identify the factors of their services that satisfy and retain customers (Chen, 2008).

3 Objective

The key objective of this report is to identify segments among the airline's customers based on the key service drivers.

4 Methods

1- k-means clustering on the attitudinal variables (Level-1 segmentation)

2- Building a predictive model for the Level-1 segments

3- Using the demographic variables to predict the Level-1 segments (Level-2 segmentation)

4- Interpreting the demographic data within each segment (Level-2 segmentation)

5- Implementing a marketing strategy based on the Level-1 and Level-2 results

5 Loading the data

df <- read.csv("Airline_Key_Drivers_mimv.csv")
headTail(df) %>% datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T), caption = "Airline data")
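
The chunk above relies on several packages that are presumably loaded in an earlier setup chunk that is not echoed here. A minimal sketch of that setup, with the package choices inferred from the functions used in this report (an assumption, not the original chunk):

pacman::p_load(psych,    # headTail()
               DT,       # datatable()
               dplyr,    # the %>% pipe
               ggplot2,  # segment profile plots
               sjmisc,   # frq()
               sjPlot,   # tab_df()
               h2o)      # clustering and predictive models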

6 Level 1 Segmentation

Converting the data to an h2o object.

#Convert the target variables to factors.
df[,c("FlyAgain", "FlyAgain2")] <- lapply(df[,c("FlyAgain", "FlyAgain2")], as.factor)
df.h2o <- as.h2o(df)
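
Note that as.h2o() requires a running H2O cluster. A minimal sketch of the initialization assumed to have happened in the setup:

h2o.init(nthreads = -1)  # start (or connect to) a local H2O cluster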

Doing k-means clustering on the 14 attitudinal (basis) variables.

nSeg <- 2
clus.2 <- h2o.kmeans(df.h2o[2:15], k = nSeg, estimate_k = F, 
                   init=c("PlusPlus"), standardize= FALSE,
                   score_each_iteration= FALSE, seed = 7238,
                   keep_cross_validation_predictions=TRUE)

h2o.cluster_sizes(clus.2)
[1] 1718   50
frq(df$FlyAgain2, out = 'v', title = "Frequency of Customer Return to Airline")
Frequency of Customer Return to Airline
val     frq  raw.prc  valid.prc  cum.prc
No      706    39.93      39.93    39.93
Yes    1062    60.07      60.07   100.00
<NA>      0     0.00         NA       NA
total N=1768 · valid N=1768 · x̄=1.60 · σ=0.49

Saving the metrics of each clustering solution so that the best one can be selected at the end. These metrics are the sums of squares (SS) of distances between points within each cluster (total within-cluster SS) and between the clusters (between-cluster SS).

clustMetrics <- data.frame(Numbr_Segments=numeric(),
                  TotWithinSS=numeric(),
                  BetweenSS=numeric(),
                  TotSS=numeric(),
                  stringsAsFactors=FALSE) 

clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.2)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.2)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.2)
tab_df(clustMetrics)
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
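
getTotWithinSS(), getBetweenSS() and getTotSS() are not functions exported by the h2o package, so they are presumably small helpers defined in a chunk that is not echoed here. A minimal sketch of what they could look like, wrapping h2o's own sum-of-squares accessors (an assumption about their definition):

getTotWithinSS <- function(model) h2o.tot_withinss(model)  # total within-cluster SS
getBetweenSS   <- function(model) h2o.betweenss(model)     # between-cluster SS
getTotSS       <- function(model) h2o.totss(model)         # total SS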

The following presents the cluster means (centroids), with centroid numbering.

clus.2@model$centers 
Cluster Means: 
  centroid easy_reservation preferred_seats flight_options ticket_prices
1        1         8.331781        7.281723       7.096042      6.856810
2        2         2.640000        1.460000       1.580000      1.320000
  seat_comfort seat_roominess overhead_storage clean_aircraft courtesy
1     8.004657       7.674622         7.295111       7.943539 8.440047
2     2.420000       2.000000         1.740000       2.120000 2.620000
  friendliness helpfulness  service satisfaction recommend
1     8.345169    8.251455 8.232247     7.992433  7.231665
2     2.780000    2.980000 2.820000     2.080000  1.580000

Plotting the attribute differences between the clusters.

library(reshape2) # it's necessary to reshape the data into 'long' format
df2c_long<- melt(clus.2@model$centers ) # need to reshape to 'long' form
#df2c_long
ggplot(data=df2c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)  

The two segments are completely layered (one lies consistently above the other across all variables), which tells us little beyond overall attitude level, so we move on to three segments.

6.1 3-segment solution
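
The chunk that fits the 3-segment model (and updates nSeg, on which the clustMetrics indexing relies) is not echoed. A minimal sketch, assuming it mirrors the 2-segment call above; analogous chunks are assumed for clus.4 through clus.9 in the sections that follow:

nSeg <- 3
clus.3 <- h2o.kmeans(df.h2o[2:15], k = nSeg, estimate_k = F,
                     init = c("PlusPlus"), standardize = FALSE,
                     score_each_iteration = FALSE, seed = 7238,
                     keep_cross_validation_predictions = TRUE)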

h2o.cluster_sizes(clus.3)
[1]  612   50 1106
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.3)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.3)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.3)

tab_df(clustMetrics, out= 'v')
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
3 38797.92 30528.75 69326.67
df3c_long<- melt(clus.3@model$centers ) # need to reshape to 'long' form

ggplot(data=df3c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)

Still layered: the higher-attitude segment has simply been cut into two. We add another segment to see whether a more distinct structure emerges.

6.2 4-segment solution

h2o.cluster_sizes(clus.4)
[1] 433  50 340 945
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.4)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.4)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.4)

tab_df(clustMetrics, out= 'v')
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
3 38797.92 30528.75 69326.67
4 35964.73 33361.94 69326.67
df4c_long<- melt(clus.4@model$centers ) # need to reshape to 'long' form

ggplot(data=df4c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)

There is a crossover in the seat_comfort and recommend variables between segments 1 and 3; however, we still increase the number of segments to extract more information.

6.3 5-segment solution

h2o.cluster_sizes(clus.5)
[1] 359  22 924  29 434
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.5)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.5)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.5)

tab_df(clustMetrics, out= 'v')
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
3 38797.92 30528.75 69326.67
4 35964.73 33361.94 69326.67
5 35712.45 33614.22 69326.67
df5c_long<- melt(clus.5@model$centers ) # need to reshape to 'long' form

ggplot(data=df5c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)

6.4 6-segment solution

h2o.cluster_sizes(clus.6)
[1] 572  29 613 204  21 329
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.6)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.6)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.6)

tab_df(clustMetrics, out= 'v')
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
3 38797.92 30528.75 69326.67
4 35964.73 33361.94 69326.67
5 35712.45 33614.22 69326.67
6 33648.88 35677.79 69326.67
df6c_long<- melt(clus.6@model$centers ) # need to reshape to 'long' form

ggplot(data=df6c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14) , legend.position = "bottom") +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
  scale_color_discrete(name = "clusters",
                       labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.6), ")" ))

6.5 7-segment solution

h2o.cluster_sizes(clus.7)
[1] 461  31 172 224 207 654  19
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.7)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.7)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.7)

tab_df(clustMetrics, out= 'v')
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
3 38797.92 30528.75 69326.67
4 35964.73 33361.94 69326.67
5 35712.45 33614.22 69326.67
6 33648.88 35677.79 69326.67
7 32044.81 37281.86 69326.67
df7c_long<- melt(clus.7@model$centers ) # need to reshape to 'long' form

ggplot(data=df7c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14), legend.position = "bottom") +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
  scale_color_discrete(name = "clusters",
                       labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.7), ")" ))

6.6 8-segment solution - the final solution?

h2o.cluster_sizes(clus.8)
[1] 222  29 403 125  21 570 206 192
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.8)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.8)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.8)

tab_df(clustMetrics, out= 'v')
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
3 38797.92 30528.75 69326.67
4 35964.73 33361.94 69326.67
5 35712.45 33614.22 69326.67
6 33648.88 35677.79 69326.67
7 32044.81 37281.86 69326.67
8 30940.73 38385.94 69326.67
df8c_long<- melt(clus.8@model$centers ) # need to reshape to 'long' form

g8 <- ggplot(data=df8c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14) , legend.position = "bottom") +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
  scale_color_discrete(name = "clusters",
                       labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.8), ")" )) 

g8

This seems like a decent segmentation, but before finalizing it let's look at a 9-cluster model too.

6.7 9-segment solution - the final solution?

h2o.cluster_sizes(clus.9)
[1] 243  29 111 327 151 514  21 122 250
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.9)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.9)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.9)

tab_df(clustMetrics, out= 'v')
Numbr_Segments TotWithinSS BetweenSS TotSS
2 47737.14 21589.53 69326.67
3 38797.92 30528.75 69326.67
4 35964.73 33361.94 69326.67
5 35712.45 33614.22 69326.67
6 33648.88 35677.79 69326.67
7 32044.81 37281.86 69326.67
8 30940.73 38385.94 69326.67
9 30457.74 38868.93 69326.67
df9c_long<- melt(clus.9@model$centers ) # need to reshape to 'long' form

ggplot(data=df9c_long, aes(x=variable, y=value, group=centroid)) +
  geom_line(aes( color= centroid ), size=1.2)+
  geom_point(aes( color= centroid ), size=3) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14) , legend.position = "bottom") +
  labs( title= "Centroids (means) of Segments over Basis Variables", 
        x= "Basis Variables", y= "Means (1-9)") + 
  ylim(1, 9) +
  annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
  scale_color_discrete(name = "clusters",
                       labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.9), ")" ))

The 9-segment clustering seems too complicated to be interpretable. Therefore, we take the 8-segment solution into the next steps of this analysis.
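
A quick way to compare all of the solutions collected in clustMetrics is an elbow plot of the total within-cluster sum of squares against the number of segments. A minimal sketch, using the column names of the clustMetrics frame defined earlier:

ggplot(clustMetrics, aes(x = Numbr_Segments, y = TotWithinSS)) +
  geom_line() +
  geom_point(size = 3) +
  labs(title = "Elbow plot across the k-means solutions",
       x = "Number of segments", y = "Total within-cluster SS")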

6.8 Segment (cluster) assignments

These segment assignments are obtained by predicting each respondent's segment from the 8-segment clustering results. Below are the predicted segment assignments from the 8-segment solution.

clusters.hex <- h2o.predict(clus.8, df.h2o) 

clusters.hex <- as.factor(clusters.hex) # cluster 'names' must be factors for modeling
clusters.hex$predict # segment assignments for the first 6 respondents of 1768

And, this table shows the segment sizes printed earlier in a slightly different manner.

h2o.table(clusters.hex$predict) # table of assignments over all respondents 
  predict Count
1       0   222
2       1    29
3       2   403
4       3   125
5       4    21
6       5   570

[8 rows x 2 columns] 

We relabel the segment levels to be more readable. After that, a segment column is added to a new data frame along with all the other variables in the airline data set. This new data frame will be used for predictive modeling in the next sections.
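
The relabelling of the segment levels and the construction of the combined data frame (referred to as df8C.class later in this report) are not echoed above. A minimal sketch of what those steps are assumed to do; the exact column handling in the original analysis may differ:

# Rename the predicted cluster levels from 0-7 to Seg1-Seg8 (assumed)
clusters.hex$predict <- h2o.setLevels(clusters.hex$predict, paste0("Seg", 1:8))

# Attach the segment assignment to the full airline data (assumed construction)
df8C.class <- h2o.cbind(df.h2o[, 1], clusters.hex$predict, df.h2o[, -1])
colnames(df8C.class)[2] <- "ClustSeg"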

Below are the new segment names and their counts. The default H2OFrame print method shows only the first rows, which is why segments 7 and 8 do not appear even though the table has 8 rows.

h2o.table(clusters.hex$predict)[1:8,] # table of assignments over all respondents 
  predict Count
1    Seg1   222
2    Seg2    29
3    Seg3   403
4    Seg4   125
5    Seg5    21
6    Seg6   570

[8 rows x 2 columns] 
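
To display all eight segment counts at once, the H2O table can be converted to a regular R data frame before printing (a simple sketch):

as.data.frame(h2o.table(clusters.hex$predict))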

7 Building models that predict segment membership accurately

As mentioned above, I’ll continue investigating the 8-segment solution and build a predictive model. There are several model forms that could be used. These include random forests, logistic regression, gradient boosting and deep learning.

I’ll first build a prediction model using the random forest method.

I want to find the best model for predicting segment membership for each segmentation solution that I consider to be a serious candidate as the best segmentation.

This is still part of the level 1 analysis, i.e., using the attitudinal data prior to bringing in demographics and other covariates.

7.1 Splitting the sample for cross validation

Some form of cross-validation is absolutely necessary. I'll use the relatively straightforward procedure of splitting the overall sample into a training set (70%) and a testing (validation) set (30%). The 70/30 split is not a required ratio, but it is commonly used.

Below is the training sample formed using the "h2o.splitFrame()" function, on which the predictive model will be built. That function takes the result of the 8-segment k-means analysis that produced the df8C.class data frame and randomly splits it into a training sample and a testing sample. The "ClustSeg" column in the table below holds the segments into which each respondent was assigned by the k-means segmentation algorithm. Random forest, logistic regression and gradient boosting models will now be developed for predicting those segment assignments. The best of those models can be used to score the customer database, provided the predictor variables are available in that database.

fs.split<- h2o.splitFrame(df8C.class, ratios=c(0.7) )
fs.split[[1]] 
  RID ClustSeg Easy_Reservation Preferred_Seats Flight_Options Ticket_Prices
1   1     Seg2                1               1              1             2
2   2     Seg5                1               1              1             1
3   4     Seg2                1               1              1             1
4   5     Seg5                1               1              3             1
5   6     Seg5                1               1              1             1
6   7     Seg5                3               1              2             2
  Seat_Comfort Seat_Roominess Overhead_Storage Clean_Aircraft Courtesy
1            5              4                4              3        1
2            5              4                3              3        5
3            1              3                4              2        1
4            1              1                1              1        7
5            4              6                1              2        2
6            1              1                3              4        4
  Friendliness Helpfulness Service Satisfaction Recommend Language Smoker
1            1           1       1            2         1  English     No
2            3           6       4            2         1  English     No
3            1           1       1            1         2  Spanish     No
4            3           4       5            1         1  English     No
5            2           2       5            3         1   French     No
6            4           4       4            4         1  English     No
          Employment                                   Education
1 Employed part-time                           Bachelor's degree
2               <NA> University certificate below bachelor level
3 Employed part-time                           Bachelor's degree
4       Not employed          High school diploma or certificate
5 Employed part-time          High school diploma or certificate
6       Not employed           No certificate, diploma or degree
                                                Marital    Sex           Age
1                                               Married Female 24 or younger
2                                     Living common law Female 24 or younger
3                                     Living common law   Male      65 to 74
4                                     Living common law   Male      55 to 64
5                                     Living common law Female      65 to 74
6 Single, never married, not living with a life partner   Male      65 to 74
                 Income FlyAgain FlyAgain2 Easy_Reservation.1 Preferred_Seats.1
1      $40000 to $49999        0        No  Strongly Disagree Strongly Disagree
2      $40000 to $49999        0        No  Strongly Disagree Strongly Disagree
3      $30000 to $39999        0        No  Strongly Disagree Strongly Disagree
4      $30000 to $39999        0        No  Strongly Disagree Strongly Disagree
5 $175,000 to $199,999,        0        No  Strongly Disagree Strongly Disagree
6     Less than $20,000        0        No                  3 Strongly Disagree
   Flight_Options.1   Ticket_Prices.1    Seat_Comfort.1  Seat_Roominess.1
1 Strongly Disagree                 2                 5                 4
2 Strongly Disagree Strongly Disagree                 5                 4
3 Strongly Disagree Strongly Disagree Strongly Disagree                 3
4                 3 Strongly Disagree Strongly Disagree Strongly Disagree
5 Strongly Disagree Strongly Disagree                 4                 6
6                 2                 2 Strongly Disagree Strongly Disagree
  Overhead_Storage.1  Clean_Aircraft.1        Courtesy.1    Friendliness.1
1                  4                 3 Strongly Disagree Strongly Disagree
2                  3                 3                 5                 3
3                  4                 2 Strongly Disagree Strongly Disagree
4  Strongly Disagree Strongly Disagree                 7                 3
5  Strongly Disagree                 2                 2                 2
6                  3                 4                 4                 4
      Helpfulness.1         Service.1
1 Strongly Disagree Strongly Disagree
2                 6                 4
3 Strongly Disagree Strongly Disagree
4                 4                 5
5                 2                 5
6                 4                 4

[1234 rows x 38 columns] 

The testing or holdout or validation sample also has a column indicating the assigned segment for each individual.

fs.split[[2]]
  RID ClustSeg Easy_Reservation Preferred_Seats Flight_Options Ticket_Prices
1   3     Seg2                1               1              3             1
2   9     Seg5                1               1              1             2
3  11     Seg2                3               1              1             1
4  14     Seg5                5               1              1             3
5  16     Seg5                5               2              3             1
6  17     Seg2                3               1              1             1
  Seat_Comfort Seat_Roominess Overhead_Storage Clean_Aircraft Courtesy
1            2              1                1              1        4
2            2              1                1              4        4
3            1              1                1              1        3
4            1              2                1              2        3
5            2              1                1              4        6
6            1              1                1              1        5
  Friendliness Helpfulness Service Satisfaction Recommend Language Smoker
1            5           5       1            1         1     <NA>     No
2            1           7       5            5         1  English    Yes
3            2           4       4            3         1  English    Yes
4            6           3       1            4         2     <NA>    Yes
5            4           7       5            4         3  English     No
6            1           2       1            2         1   French     No
          Employment                          Education
1       Not employed                  Bachelor's degree
2     Fully employed  No certificate, diploma or degree
3     Fully employed High school diploma or certificate
4     Fully employed                 trades certificate
5 Employed part-time High school diploma or certificate
6       Not employed                    College diploma
                                                Marital    Sex           Age
1 Single, never married, not living with a life partner Female      65 to 74
2                                     Living common law Female      75 to 84
3             Separated, not living with a life partner Female 24 or younger
4                                               Married   <NA> 24 or younger
5                                     Living common law   <NA>      65 to 74
6                                     Living common law   Male      45 to 54
                 Income FlyAgain FlyAgain2 Easy_Reservation.1 Preferred_Seats.1
1 $175,000 to $199,999,        0        No  Strongly Disagree Strongly Disagree
2      $40000 to $49999        0        No  Strongly Disagree Strongly Disagree
3      $30000 to $39999        0        No                  3 Strongly Disagree
4      $50000 to $59999        0        No                  5 Strongly Disagree
5  $150,000 to $174,999        0        No                  5                 2
6      $30000 to $39999        0        No                  3 Strongly Disagree
   Flight_Options.1   Ticket_Prices.1    Seat_Comfort.1  Seat_Roominess.1
1                 3 Strongly Disagree                 2 Strongly Disagree
2 Strongly Disagree                 2                 2 Strongly Disagree
3 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
4 Strongly Disagree                 3 Strongly Disagree                 2
5                 3 Strongly Disagree                 2 Strongly Disagree
6 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
  Overhead_Storage.1  Clean_Aircraft.1 Courtesy.1    Friendliness.1
1  Strongly Disagree Strongly Disagree          4                 5
2  Strongly Disagree                 4          4 Strongly Disagree
3  Strongly Disagree Strongly Disagree          3                 2
4  Strongly Disagree                 2          3                 6
5  Strongly Disagree                 4          6                 4
6  Strongly Disagree Strongly Disagree          5 Strongly Disagree
  Helpfulness.1         Service.1
1             5 Strongly Disagree
2             7                 5
3             4                 4
4             3 Strongly Disagree
5             7                 5
6             2 Strongly Disagree

[534 rows x 38 columns] 

7.2 The first predictive model using randomForest

Random forest is a flexible and highly valued methodology. As the name implies, it fits many decision trees on the data and combines their votes. Instead of building just one tree, the model below builds up to 200 trees, with early stopping deciding when enough trees have been fitted.

df8C.class_rf <- h2o.randomForest(         ## h2o.randomForest function
  training_frame   = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],        ## the H2O frame for validation (not required)
  x=3:16,                       ## the training sample predictor columns, by column index
  y=2,                          ## the target index (what we are predicting)
  model_id = "Random_Forest",    ## name the model in H2O
  ##   not required, but helps use Flow
  ntrees = 200,                  ## use a maximum of 200 trees to create the
  ##  random forest model. The default is 50.
  ##  I have increased it because I will let 
  ##  the early stopping criteria decide when
  ##  the random forest is sufficiently accurate
  stopping_rounds = 2,           ## Stop fitting new trees when the 2-tree
  ##  average is within 0.001 (default) of 
  ##  the prior two 2-tree averages.
  ##  Can be thought of as a convergence setting
  score_each_iteration = T,      ## Predict against training and validation for
  ##  each tree. Default will skip several.
  seed = 1000000) ## Set the random seed so that this result can be reproduced. 

The simple “summary()” function can produce a great deal of information about the model. This might be a bit too much to have in one chunk.

summary(df8C.class_rf)  
Model Details:
==============

H2OMultinomialModel: drf
Model Key:  Random_Forest 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              22                      176              181731         2
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        20   11.17046          3        179    78.04546

H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_18")`
MSE: (Extract with `h2o.mse`) 0.1982057
RMSE: (Extract with `h2o.rmse`) 0.445203
Logloss: (Extract with `h2o.logloss`) 1.287398
Mean Per-Class Error: 0.2517982
R^2: (Extract with `h2o.r2`) 0.9593309
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error          Rate
Seg1     95    2   13    4    0   24    9   10 0.3949 =    62 / 157
Seg2      0   17    0    0    1    0    0    0 0.0556 =      1 / 18
Seg3      3    0  238    1    0   28   12    9 0.1821 =    53 / 291
Seg4      3    0    2   55    0    0   10   15 0.3529 =     30 / 85
Seg5      0    5    0    0    8    0    0    0 0.3846 =      5 / 13
Seg6      6    0   14    0    0  370    0    3 0.0585 =    23 / 393
Seg7      6    0   29    2    0    0  105    5 0.2857 =    42 / 147
Seg8      3    0   17    3    0   15    1   91 0.3000 =    39 / 130
Totals  116   24  313   65    9  437  137  133 0.2066 = 255 / 1,234

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.793355
2 2  0.926256
3 3  0.954619
4 4  0.962723
5 5  0.964344
6 6  0.965964
7 7  0.965964
8 8  1.000000


H2OMultinomialMetrics: drf
** Reported on validation data. **

Validation Set Metrics: 
=====================

Extract validation frame with `h2o.getFrame("RTMP_sid_b243_20")`
MSE: (Extract with `h2o.mse`) 0.1876194
RMSE: (Extract with `h2o.rmse`) 0.4331505
Logloss: (Extract with `h2o.logloss`) 0.5385996
Mean Per-Class Error: 0.238413
R^2: (Extract with `h2o.r2`) 0.961275
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     46    0    2    0    0    8    7    2 0.2923 =  19 / 65
Seg2      0    8    0    0    3    0    0    0 0.2727 =   3 / 11
Seg3      2    0   96    0    0   11    3    0 0.1429 = 16 / 112
Seg4      3    0    0   29    0    0    5    3 0.2750 =  11 / 40
Seg5      0    1    0    1    6    0    0    0 0.2500 =    2 / 8
Seg6      0    0    0    0    0  177    0    0 0.0000 =  0 / 177
Seg7      5    0   10    1    0    0   43    0 0.2712 =  16 / 59
Seg8      6    0   11    1    0    6    1   37 0.4032 =  25 / 62
Totals   62    9  119   32    9  202   59   42 0.1723 = 92 / 534

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.827715
2 2  0.940075
3 3  0.981273
4 4  0.988764
5 5  0.990637
6 6  0.990637
7 7  0.990637
8 8  1.000000




Scoring History: 
            timestamp   duration number_of_trees training_rmse training_logloss
1 2020-11-26 10:01:36  0.003 sec               0            NA               NA
2 2020-11-26 10:01:36  0.013 sec               1       0.65195         13.52251
3 2020-11-26 10:01:36  0.024 sec               2       0.62843         12.21591
4 2020-11-26 10:01:36  0.034 sec               3       0.59609         10.43573
5 2020-11-26 10:01:36  0.046 sec               4       0.57458          9.18947
  training_classification_error validation_rmse validation_logloss
1                            NA              NA                 NA
2                       0.44105         0.64458           12.63004
3                       0.41083         0.53233            6.55984
4                       0.38397         0.49046            3.94032
5                       0.36837         0.47017            2.81891
  validation_classification_error
1                              NA
2                         0.46629
3                         0.36330
4                         0.30899
5                         0.26404

---
             timestamp   duration number_of_trees training_rmse
18 2020-11-26 10:01:36  0.256 sec              17       0.45407
19 2020-11-26 10:01:36  0.282 sec              18       0.45176
20 2020-11-26 10:01:36  0.304 sec              19       0.44884
21 2020-11-26 10:01:36  0.329 sec              20       0.44702
22 2020-11-26 10:01:36  0.352 sec              21       0.44637
23 2020-11-26 10:01:36  0.375 sec              22       0.44520
   training_logloss training_classification_error validation_rmse
18          1.66526                       0.23177         0.43337
19          1.55729                       0.22366         0.43175
20          1.42207                       0.21556         0.43284
21          1.34107                       0.21232         0.43247
22          1.34019                       0.21556         0.43298
23          1.28740                       0.20665         0.43315
   validation_logloss validation_classification_error
18            0.60841                         0.18165
19            0.59516                         0.17041
20            0.53911                         0.17416
21            0.53846                         0.16479
22            0.53924                         0.17228
23            0.53860                         0.17228

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
           variable relative_importance scaled_importance percentage
1    Flight_Options         1954.231812          1.000000   0.134817
2     Ticket_Prices         1739.467163          0.890103   0.120001
3   Preferred_Seats         1398.059082          0.715401   0.096449
4         Recommend         1341.355591          0.686385   0.092537
5  Overhead_Storage         1133.051514          0.579794   0.078166
6    Seat_Roominess         1074.776489          0.549974   0.074146
7      Seat_Comfort          963.105713          0.492831   0.066442
8           Service          942.373169          0.482222   0.065012
9       Helpfulness          785.609009          0.402004   0.054197
10     Satisfaction          717.310120          0.367055   0.049485
11   Clean_Aircraft          689.196289          0.352669   0.047546
12     Friendliness          638.604553          0.326780   0.044056
13 Easy_Reservation          586.051025          0.299888   0.040430
14         Courtesy          532.199463          0.272332   0.036715

7.2.1 Confusion matrix for the training sample.

While this might be interesting, it is best not to place too much credence in the hit ratio for the training sample, since that data was used to build the model. (Note that scoring the fitted forest directly on the training frame, as below, yields zero error; the out-of-bag confusion matrix in the model summary above gives a more honest picture of training error.)

h2o.confusionMatrix(df8C.class_rf,  fs.split[[1]]   )  
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1    157    0    0    0    0    0    0    0 0.0000 =   0 / 157
Seg2      0   18    0    0    0    0    0    0 0.0000 =    0 / 18
Seg3      0    0  291    0    0    0    0    0 0.0000 =   0 / 291
Seg4      0    0    0   85    0    0    0    0 0.0000 =    0 / 85
Seg5      0    0    0    0   13    0    0    0 0.0000 =    0 / 13
Seg6      0    0    0    0    0  393    0    0 0.0000 =   0 / 393
Seg7      0    0    0    0    0    0  147    0 0.0000 =   0 / 147
Seg8      0    0    0    0    0    0    0  130 0.0000 =   0 / 130
Totals  157   18  291   85   13  393  147  130 0.0000 = 0 / 1,234

7.2.2 Confusion matrix for testing sample, best to use.

This confusion matrix shows the ability of the model to predict segment membership for data that was not used in the model construction. This is much better information for partially answering the question, “How good is your model?”

h2o.confusionMatrix(df8C.class_rf,  fs.split[[2]]   )  
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     46    0    2    0    0    8    7    2 0.2923 =  19 / 65
Seg2      0    8    0    0    3    0    0    0 0.2727 =   3 / 11
Seg3      2    0   96    0    0   11    3    0 0.1429 = 16 / 112
Seg4      3    0    0   29    0    0    5    3 0.2750 =  11 / 40
Seg5      0    1    0    1    6    0    0    0 0.2500 =    2 / 8
Seg6      0    0    0    0    0  177    0    0 0.0000 =  0 / 177
Seg7      5    0   10    1    0    0   43    0 0.2712 =  16 / 59
Seg8      6    0   11    1    0    6    1   37 0.4032 =  25 / 62
Totals   62    9  119   32    9  202   59   42 0.1723 = 92 / 534

7.2.3 Creating an object to hold important diagnostic information

A receptacle for diagnostic statistics is created below so that this information can be compared efficiently at the end of the predictive modeling.

modH <- data.frame(Prediction_model=character(),
                  hit_ratio=numeric(),
                  MSE=numeric(),
                  RMSE=numeric(),
                  logloss=numeric(),
                  mean_per_class_error=numeric(),
                  stringsAsFactors=FALSE) 
modH[1, 1] <- "Random_forest"
modH[1, 2] <- df8C.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
modH[1, 3] <- df8C.class_rf@model$validation_metrics@metrics$MSE   #  
modH[1, 4] <- df8C.class_rf@model$validation_metrics@metrics$RMSE       #  
modH[1, 5] <- df8C.class_rf@model$validation_metrics@metrics$logloss
modH[1, 6] <- df8C.class_rf@model$validation_metrics@metrics$mean_per_class_error

7.2.4 Plotting variable importance

h2o allows for extracting variable importance from the predictive model results. That information is graphed below using the ‘plotly’ package. These should be compared across the three models used.

rf_variable_importances <- as.data.frame(df8C.class_rf@model$variable_importances)

library(plotly)
plot_ly(rf_variable_importances,
        y=reorder(rf_variable_importances$variable,
                  rf_variable_importances$percentage),
        x = rf_variable_importances$percentage,
        color = rf_variable_importances$variable,
        type = 'bar', orientation = 'h') %>%
  layout( title = "Variable Importance for the random forest model",
          xaxis = list(title = "Percentage Importance"),
          ylim=c(0,1),
          margin = list(l = 120)) 

It's important to find the best model for predicting segment membership. This means comparing the random forest above with alternatives such as logistic regression, gradient boosting, and deep learning, and picking the model with the best predictive ability on the holdout sample. The analyses that follow develop those alternative models.

7.3 Building a predictive model using logistic regression

The GLM option in h2o, h2o.glm() with family = "multinomial", performs multinomial logistic regression. The following is a simple setup for running it on the 8-segment solution.

df8C.class_glm <- h2o.glm(
  family= "multinomial",  
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=3:16,                        ## the predictor columns, by column index
  y=2,
  lambda=0
) 

Once again, the “summary()” function produces a lot of information. To find out how to reference parts of the output, run “str(df8C.class_glm)” and follow down that tree until you find what you want to extract.

summary(df8C.class_glm) 
Model Details:
==============

H2OMultinomialModel: glm
Model Key:  GLM_model_R_1605846379958_105 
GLM Model: summary
       family        link regularization number_of_predictors_total
1 multinomial multinomial           None                        120
  number_of_active_predictors number_of_iterations   training_frame
1                         112                   50 RTMP_sid_b243_18

H2OMultinomialMetrics: glm
** Reported on training data. **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_18")`
MSE: (Extract with `h2o.mse`) 0.008719055
RMSE: (Extract with `h2o.rmse`) 0.09337588
Logloss: (Extract with `h2o.logloss`) 0.04084536
Mean Per-Class Error: 0.003244478
Null Deviance: (Extract with `h2o.nulldeviance`) 4323.575
Residual Deviance: (Extract with `h2o.residual_deviance`) 100.8064
R^2: (Extract with `h2o.r2`) 0.998211
AIC: (Extract with `h2o.aic`) NaN
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1    156    0    0    0    0    1    0    0 0.0064 =   1 / 157
Seg2      0   18    0    0    0    0    0    0 0.0000 =    0 / 18
Seg3      0    0  290    0    0    1    0    0 0.0034 =   1 / 291
Seg4      0    0    0   85    0    0    0    0 0.0000 =    0 / 85
Seg5      0    0    0    0   13    0    0    0 0.0000 =    0 / 13
Seg6      1    0    0    0    0  392    0    0 0.0025 =   1 / 393
Seg7      0    0    2    0    0    0  145    0 0.0136 =   2 / 147
Seg8      0    0    0    0    0    0    0  130 0.0000 =   0 / 130
Totals  157   18  292   85   13  394  145  130 0.0041 = 5 / 1,234

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.995948
2 2  1.000000
3 3  1.000000
4 4  1.000000
5 5  1.000000
6 6  1.000000
7 7  1.000000
8 8  1.000000


H2OMultinomialMetrics: glm
** Reported on validation data. **

Validation Set Metrics: 
=====================

Extract validation frame with `h2o.getFrame("RTMP_sid_b243_20")`
MSE: (Extract with `h2o.mse`) 0.0379163
RMSE: (Extract with `h2o.rmse`) 0.1947211
Logloss: (Extract with `h2o.logloss`) 0.5469952
Mean Per-Class Error: 0.1195723
Null Deviance: (Extract with `h2o.nulldeviance`) 1906.692
Residual Deviance: (Extract with `h2o.residual_deviance`) 1214.378
R^2: (Extract with `h2o.r2`) 0.992174
AIC: (Extract with `h2o.aic`) NaN
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     63    0    0    1    0    1    0    0 0.0308 =   2 / 65
Seg2      0    7    0    1    3    0    0    0 0.3636 =   4 / 11
Seg3      0    0  111    0    0    1    0    0 0.0089 =  1 / 112
Seg4      0    0    0   37    1    0    2    0 0.0750 =   3 / 40
Seg5      0    2    0    1    5    0    0    0 0.3750 =    3 / 8
Seg6      0    0    1    0    0  176    0    0 0.0056 =  1 / 177
Seg7      0    0    0    0    1    0   58    0 0.0169 =   1 / 59
Seg8      2    0    2    0    0    0    1   57 0.0806 =   5 / 62
Totals   65    9  114   40   10  178   61   57 0.0375 = 20 / 534

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.962547
2 2  0.990637
3 3  0.998127
4 4  0.998127
5 5  0.998127
6 6  0.998127
7 7  1.000000
8 8  1.000000




Scoring History: 
            timestamp   duration iterations negative_log_likelihood objective
1 2020-11-26 10:01:38  0.000 sec          0              2161.78770   1.75185
2 2020-11-26 10:01:38  0.026 sec          1               983.44076   0.79695
3 2020-11-26 10:01:38  0.045 sec          2               528.28160   0.42811
4 2020-11-26 10:01:38  0.060 sec          3               362.66936   0.29390
5 2020-11-26 10:01:38  0.072 sec          4               304.05046   0.24639
  training_rmse training_logloss training_r2 training_classification_error
1            NA               NA          NA                            NA
2            NA               NA          NA                            NA
3            NA               NA          NA                            NA
4            NA               NA          NA                            NA
5            NA               NA          NA                            NA
  validation_rmse validation_logloss validation_r2
1              NA                 NA            NA
2              NA                 NA            NA
3              NA                 NA            NA
4              NA                 NA            NA
5              NA                 NA            NA
  validation_classification_error
1                              NA
2                              NA
3                              NA
4                              NA
5                              NA

---
             timestamp   duration iterations negative_log_likelihood objective
46 2020-11-26 10:01:39  0.650 sec         45                56.03445   0.04541
47 2020-11-26 10:01:39  0.660 sec         46                54.83831   0.04444
48 2020-11-26 10:01:39  0.669 sec         47                53.69041   0.04351
49 2020-11-26 10:01:39  0.680 sec         48                52.58921   0.04262
50 2020-11-26 10:01:39  0.692 sec         49                51.49931   0.04173
51 2020-11-26 10:01:39  0.702 sec         50                50.40318   0.04085
   training_rmse training_logloss training_r2 training_classification_error
46            NA               NA          NA                            NA
47            NA               NA          NA                            NA
48            NA               NA          NA                            NA
49            NA               NA          NA                            NA
50            NA               NA          NA                            NA
51       0.09338          0.04085     0.99821                       0.00405
   validation_rmse validation_logloss validation_r2
46              NA                 NA            NA
47              NA                 NA            NA
48              NA                 NA            NA
49              NA                 NA            NA
50              NA                 NA            NA
51         0.19472            0.54700       0.99217
   validation_classification_error
46                              NA
47                              NA
48                              NA
49                              NA
50                              NA
51                         0.03745

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

           variable relative_importance scaled_importance percentage
1   Preferred_Seats           123.81388         1.0000000 0.13452129
2     Ticket_Prices           115.56724         0.9333949 0.12556149
3         Recommend            85.74906         0.6925642 0.09316463
4  Overhead_Storage            79.80632         0.6445669 0.08670797
5      Satisfaction            77.83319         0.6286306 0.08456420
6  Easy_Reservation            77.37700         0.6249461 0.08406855
7    Flight_Options            60.88457         0.4917427 0.06614987
8           Service            59.96018         0.4842768 0.06514554
9    Clean_Aircraft            56.14288         0.4534458 0.06099812
10   Seat_Roominess            43.91615         0.3546949 0.04771402
11     Seat_Comfort            42.17984         0.3406714 0.04582755
12      Helpfulness            34.94279         0.2822203 0.03796464
13     Friendliness            34.36959         0.2775907 0.03734187
14         Courtesy            27.86086         0.2250221 0.03027027

7.3.1 Confusion table for the training sample

The logistic regression model provides extremely good predictions on the training sample, but relying on training-sample performance to support the modeling would be misleading, since that data was used to fit the model.

df8C.class_glm@model$training_metrics@metrics$cm$table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1    156    0    0    0    0    1    0    0 0.0064 =   1 / 157
Seg2      0   18    0    0    0    0    0    0 0.0000 =    0 / 18
Seg3      0    0  290    0    0    1    0    0 0.0034 =   1 / 291
Seg4      0    0    0   85    0    0    0    0 0.0000 =    0 / 85
Seg5      0    0    0    0   13    0    0    0 0.0000 =    0 / 13
Seg6      1    0    0    0    0  392    0    0 0.0025 =   1 / 393
Seg7      0    0    2    0    0    0  145    0 0.0136 =   2 / 147
Seg8      0    0    0    0    0    0    0  130 0.0000 =   0 / 130
Totals  157   18  292   85   13  394  145  130 0.0041 = 5 / 1,234

7.3.2 Confusion table for the testing or holdout or validation sample

The following shows the confusion matrix for the validation (holdout) sample. Here, the predictive ability is still unusually good, with about 96.3% of holdout respondents predicted accurately (hit ratio = 0.9625).

df8C.class_glm@model$validation_metrics@metrics$cm$table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     63    0    0    1    0    1    0    0 0.0308 =   2 / 65
Seg2      0    7    0    1    3    0    0    0 0.3636 =   4 / 11
Seg3      0    0  111    0    0    1    0    0 0.0089 =  1 / 112
Seg4      0    0    0   37    1    0    2    0 0.0750 =   3 / 40
Seg5      0    2    0    1    5    0    0    0 0.3750 =    3 / 8
Seg6      0    0    1    0    0  176    0    0 0.0056 =  1 / 177
Seg7      0    0    0    0    1    0   58    0 0.0169 =   1 / 59
Seg8      2    0    2    0    0    0    1   57 0.0806 =   5 / 62
Totals   65    9  114   40   10  178   61   57 0.0375 = 20 / 534
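
The accuracy quoted above can also be pulled programmatically from the validation metrics (a sketch):

# First-choice hit ratio = overall accuracy on the validation (holdout) sample
h2o.hit_ratio_table(df8C.class_glm, valid = TRUE)[1, "hit_ratio"]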

The code below shows another command that produces the same confusion matrix.

h2o.confusionMatrix(df8C.class_glm, valid = TRUE) # df8C.class
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     63    0    0    1    0    1    0    0 0.0308 =   2 / 65
Seg2      0    7    0    1    3    0    0    0 0.3636 =   4 / 11
Seg3      0    0  111    0    0    1    0    0 0.0089 =  1 / 112
Seg4      0    0    0   37    1    0    2    0 0.0750 =   3 / 40
Seg5      0    2    0    1    5    0    0    0 0.3750 =    3 / 8
Seg6      0    0    1    0    0  176    0    0 0.0056 =  1 / 177
Seg7      0    0    0    0    1    0   58    0 0.0169 =   1 / 59
Seg8      2    0    2    0    0    0    1   57 0.0806 =   5 / 62
Totals   65    9  114   40   10  178   61   57 0.0375 = 20 / 534

7.3.3 Variable importances

h2o provides an easy way to obtain the variable importances.

h2o.varimp(df8C.class_glm)
           variable relative_importance scaled_importance percentage
1   Preferred_Seats           123.81388         1.0000000 0.13452129
2     Ticket_Prices           115.56724         0.9333949 0.12556149
3         Recommend            85.74906         0.6925642 0.09316463
4  Overhead_Storage            79.80632         0.6445669 0.08670797
5      Satisfaction            77.83319         0.6286306 0.08456420
6  Easy_Reservation            77.37700         0.6249461 0.08406855
7    Flight_Options            60.88457         0.4917427 0.06614987
8           Service            59.96018         0.4842768 0.06514554
9    Clean_Aircraft            56.14288         0.4534458 0.06099812
10   Seat_Roominess            43.91615         0.3546949 0.04771402
11     Seat_Comfort            42.17984         0.3406714 0.04582755
12      Helpfulness            34.94279         0.2822203 0.03796464
13     Friendliness            34.36959         0.2775907 0.03734187
14         Courtesy            27.86086         0.2250221 0.03027027

7.3.4 Plotting variable importance

Notice that the variable importances for this logistic regression model are not the same as those produced by the random forest model.

glm_variable_importances <- as.data.frame(h2o.varimp(df8C.class_glm))

#install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(glm_variable_importances,
        y = reorder(glm_variable_importances$variable,
                    glm_variable_importances$percentage),
        x = glm_variable_importances$percentage,
        color = glm_variable_importances$variable,
        type = 'bar', orientation = 'h') %>%
  layout( title = "Variable Importance for the logistic regression model on 8 segments",
          xaxis = list(title = "Percentage Importance"),
          margin = list(l = 120))
modH[2, 1] <- "GLM_log_regr"
modH[2, 2] <- df8C.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
modH[2, 3] <- df8C.class_glm@model$validation_metrics@metrics$MSE   #  
modH[2, 4] <- df8C.class_glm@model$validation_metrics@metrics$RMSE       #  
modH[2, 5] <- df8C.class_glm@model$validation_metrics@metrics$logloss
modH[2, 6] <- df8C.class_glm@model$validation_metrics@metrics$mean_per_class_error

7.4 Gradient boosting machine

The gradient boosting machine (GBM) is an ensemble method that builds a sequence of decision trees, each new tree correcting the errors of the previous ones, and it can be very effective. It is quite difficult to foretell which of these models is likely to be the best predictor of segment membership, so it is important to try several models and compare the results.

#GBM Gradient Boosting Machine
df8C.class_gbm<- h2o.gbm(
  distribution="AUTO",
  training_frame   = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=3:16,                        ## the predictor columns, by column index
  y=2,
  model_id = "fs.gbm",
  stopping_rounds = 3,
  histogram_type = "UniformAdaptive" ,
  stopping_tolerance = 1e-2,
  seed = 1234
)

7.4.1 Coefficients

Being a tree ensemble, a GBM does not produce coefficients for the variables.

7.4.2 Model Performance

The “h2o.performance()” function provides a tidy list of many important analytical metrics. The “str(perf)” command will provide locations for the following metrics. Notice that the performance is requested on the holdout sample.

perf <- h2o.performance(df8C.class_gbm, fs.split[[2]])
perf
H2OMultinomialMetrics: gbm

Test Set Metrics: 
=====================

MSE: (Extract with `h2o.mse`) 0.1327061
RMSE: (Extract with `h2o.rmse`) 0.3642884
Logloss: (Extract with `h2o.logloss`) 0.4226855
Mean Per-Class Error: 0.2296039
R^2: (Extract with `h2o.r2`) 0.9726092
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     51    0    2    0    0    7    5    0 0.2154 =  14 / 65
Seg2      0    8    0    0    3    0    0    0 0.2727 =   3 / 11
Seg3      0    0   95    0    0    7    7    3 0.1518 = 17 / 112
Seg4      2    0    1   26    1    0    6    4 0.3500 =  14 / 40
Seg5      0    1    0    2    5    0    0    0 0.3750 =    3 / 8
Seg6      0    0    5    0    0  172    0    0 0.0282 =  5 / 177
Seg7      2    0    8    0    0    0   49    0 0.1695 =  10 / 59
Seg8      3    0    8    0    0    5    1   45 0.2742 =  17 / 62
Totals   58    9  119   28    9  191   68   52 0.1554 = 83 / 534

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.844569
2 2  0.953183
3 3  0.986891
4 4  0.996255
5 5  1.000000
6 6  1.000000
7 7  1.000000
8 8  1.000000

7.4.3 Confusion table for the training sample

# str(df8C.class_glm)
df8C.class_gbm@model$training_metrics@metrics$cm$table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1    157    0    0    0    0    0    0    0 0.0000 =   0 / 157
Seg2      0   18    0    0    0    0    0    0 0.0000 =    0 / 18
Seg3      0    0  291    0    0    0    0    0 0.0000 =   0 / 291
Seg4      0    0    0   85    0    0    0    0 0.0000 =    0 / 85
Seg5      0    0    0    0   13    0    0    0 0.0000 =    0 / 13
Seg6      0    0    0    0    0  393    0    0 0.0000 =   0 / 393
Seg7      0    0    0    0    0    0  147    0 0.0000 =   0 / 147
Seg8      0    0    0    0    0    0    0  130 0.0000 =   0 / 130
Totals  157   18  291   85   13  393  147  130 0.0000 = 0 / 1,234
# h2o.confusionMatrix( df8C.class_glm,train = TRUE)

7.4.4 Confusion table for the testing sample

The following shows the confusion matrix for the testing (holdout) sample.

The confusion matrix can be pulled from the model output file or from the performance output shown above.

h2o.confusionMatrix(df8C.class_gbm, fs.split[[2]]) # CONFUSION TABLE FOR HOLDOUT
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     51    0    2    0    0    7    5    0 0.2154 =  14 / 65
Seg2      0    8    0    0    3    0    0    0 0.2727 =   3 / 11
Seg3      0    0   95    0    0    7    7    3 0.1518 = 17 / 112
Seg4      2    0    1   26    1    0    6    4 0.3500 =  14 / 40
Seg5      0    1    0    2    5    0    0    0 0.3750 =    3 / 8
Seg6      0    0    5    0    0  172    0    0 0.0282 =  5 / 177
Seg7      2    0    8    0    0    0   49    0 0.1695 =  10 / 59
Seg8      3    0    8    0    0    5    1   45 0.2742 =  17 / 62
Totals   58    9  119   28    9  191   68   52 0.1554 = 83 / 534
#h2o.confusionMatrix(perf) # DIFFERENT WAY

7.4.5 Variable importances

h2o.varimp(df8C.class_gbm)
Variable Importances: 
           variable relative_importance scaled_importance percentage
1    Flight_Options          524.496399          1.000000   0.156150
2     Ticket_Prices          440.558929          0.839966   0.131160
3  Overhead_Storage          360.092560          0.686549   0.107204
4         Recommend          351.703003          0.670554   0.104707
5           Service          346.527100          0.660685   0.103166
6   Preferred_Seats          339.407776          0.647112   0.101046
7    Seat_Roominess          241.974472          0.461346   0.072039
8       Helpfulness          167.264206          0.318904   0.049797
9      Seat_Comfort          166.859787          0.318133   0.049676
10     Satisfaction          117.915565          0.224817   0.035105
11     Friendliness          100.707726          0.192008   0.029982
12   Clean_Aircraft           83.652298          0.159491   0.024904
13 Easy_Reservation           60.830585          0.115979   0.018110
14         Courtesy           56.941521          0.108564   0.016952

7.4.6 Plotting variable importance

The variable importances are similar to those from the random forest and GLM models, but with some differences.

# glm_variable_importances <- as.data.frame(df8C.class_glm@model$variable_importances)
gbm_variable_importances <- as.data.frame(h2o.varimp(df8C.class_gbm))
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(gbm_variable_importances,
        #        x = rf_variable_importances$percentage,
        y=reorder(gbm_variable_importances$variable,
                  gbm_variable_importances$percentage),
        x = gbm_variable_importances$percentage,
        color = gbm_variable_importances$variable,
        type = 'bar', orientation = 'h') %>%
  layout( title = "Variable Importance for GBM model of 8-segment solution",
          xaxis = list(title = "Percentage Importance"),
          margin = list(l = 120))

7.4.7 Predicting probabilities of segment membership

The table below shows, for the first several respondents in the holdout sample, the probability of belonging to each of the 8 segments; each person is then assigned to the segment with the highest probability. This can be very valuable information and can be used in several ways.

gbm_8_pred <- h2o.predict(df8C.class_gbm, newdata=fs.split[[2]])
head(gbm_8_pred, 10) %>% tab_df( show.rownames = T, title = "Predicted segments for testing sample")
Predicted segments for testing sample
predict Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8
Seg2 0.00 0.88 0.00 0.00 0.11 0.00 0.00 0.00
Seg5 0.00 0.04 0.00 0.15 0.78 0.00 0.01 0.00
Seg5 0.00 0.08 0.00 0.00 0.91 0.00 0.00 0.00
Seg2 0.00 0.61 0.00 0.11 0.26 0.00 0.00 0.00
Seg4 0.00 0.00 0.00 0.67 0.32 0.00 0.01 0.00
Seg2 0.00 0.91 0.00 0.00 0.07 0.00 0.00 0.00
Seg5 0.00 0.26 0.00 0.01 0.72 0.00 0.00 0.00
Seg2 0.00 0.99 0.00 0.00 0.01 0.00 0.00 0.00
Seg2 0.00 0.92 0.00 0.00 0.05 0.00 0.01 0.00
Seg2 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
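
If these predicted memberships are to be used later, they could, for example, be bound back onto the holdout frame (a minimal sketch; scored_holdout is an illustrative name, not used elsewhere in this report):

scored_holdout <- h2o.cbind(fs.split[[2]], gbm_8_pred$predict)  # append the predicted segment to the holdout rows
head(scored_holdout)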

8 Statistics of predictive models for 8-segment solution, Level 1

modH[3, 1] <- "Gradient Boosting"
modH[3, 2] <- df8C.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
modH[3, 3] <- df8C.class_gbm@model$validation_metrics@metrics$MSE   #  
modH[3, 4] <- df8C.class_gbm@model$validation_metrics@metrics$RMSE       #  
modH[3, 5] <- df8C.class_gbm@model$validation_metrics@metrics$ logloss
modH[3, 6] <- df8C.class_gbm@model$validation_metrics@metrics$ mean_per_class_error 

The table below provides the diagnostic statistics produced by the predictive models.

modH %>% tab_df( show.rownames = TRUE, sort.column = -2, title = "Statistics of predictive models for the 8-segment solution")
Statistics of predictive models for the 8-segment solution
Row Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
2 GLM_log_regr 0.96 0.04 0.19 0.55 0.12
3 Gradient Boosting 0.84 0.13 0.36 0.42 0.23
1 Random_forest 0.83 0.19 0.43 0.54 0.24


The GLM (logistic regression) model appears to be the best, with a hit ratio of 0.9625468 compared with 0.84 for gradient boosting and 0.83 for the random forest. Most of the other statistics (MSE, RMSE and mean per-class error) also favour the GLM; only the log loss is slightly lower for the gradient boosting model.

modH_hit <- modH[, 1:2]
modH_hit_long <- melt(modH_hit)
# Using Prediction_model as id variables
ggplot(data=modH_hit_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
   geom_bar(  stat="identity" , fill = "lightblue", width = 0.3)+
   geom_point(aes( color= Prediction_model ), size=6) +
   theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
   labs( title= "Hit ratios for predictive models", y="Proportion correct",
         x= "Prediction Model")

9 Basic interpretation of segments based on level 1 attributes

Before moving on to Level 2 segmentation with the demographic variables, in this section we interpret the clusters using the means of their attitudinal variables.

g8 + theme(legend.direction = "vertical",
           legend.text = element_text(size=15),
           legend.key.size = unit(.7, "cm"))

  • Segment 2: Has the lowest scores on all of the attitudinal variables; however, it accounts for only 1.64 percent of all customers.

  • Segment 5: One of the two lowest-scoring segments, although its customers rated helpfulness, service and easy reservation better than Segment 2 did.

  • Segment 6: Our best customers! They had the highest averages in all categories, with no particular peak or trough.

  • Segment 1: Thinks highly of the service variables but was not satisfied with the flight-option variables; overhead storage was a particular turn-off.

  • Segment 3: Near perfect in almost all categories (like Segment 6), but flight options, ticket prices and preferred seats were a concern. This did not affect their willingness to recommend the airline to others; the segment is in the top three on the recommend variable.

  • Segment 4: Moderately satisfied; overhead storage and ticket prices were concerns for this segment.

  • Segment 7: One of the more interesting segments. They were near the top on seat comfort, courtesy, friendliness and the other quality-of-service variables, but not on flight options and ticket prices, and they were not very eager to recommend the airline to others, scoring around average on the recommend variable.

  • Segment 8: Shows the same pattern of variable means as our best segment (Segment 6) but roughly one point lower across the board, with no particular peaks or troughs.

10 Level 2 of the segmentation process

The objective of Level 1 segmentation was 1) to use a partitioning method, or several, to find several segmentation solutions; and 2) to find a model that is best at predicting membership in the segments based on the Level 1 variables, normally attitudinal variables based on the product.

Level 2 of the segmentation process focuses on the very practical problem of attempting to assign people to segments based on variables that may be more accessible than the attitudinal variables, i.e., demographic and geo-demographic variables. Most large customer databases contain individual demographic information but rarely attitudes such as product perceptions and preferences.

The process of applying a predictive model to a customer data base is often called scoring or tagging. The development of a very simple predictive model that can be used quickly by sales representatives to estimate segment membership of a prospect is referred to as a typing tool.

10.1 Tuning the demographic variables

The demographic variables in the airline data need to be examined before they are put into the model. A simple describe call shows that they are not stored as factors, so the next code chunk converts them to factors. The demographics also contain missing values, and any missing value causes the subsequent predictive models to fail. By omitting the rows that contain a missing value we lose about 50 percent of the customers. But as the Level 2 model performance is in any case not comparable to that of Level 1, we proceed for now.

h2o.describe(df8C.class[,17:24])
       Label   Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
1   Language string     159     0      0      0 NaN NaN   NA    NA          NA
2     Smoker string     153     0      0      0 NaN NaN   NA    NA          NA
3 Employment string     141     0      0      0 NaN NaN   NA    NA          NA
4  Education string     147     0      0      0 NaN NaN   NA    NA          NA
5    Marital string     141     0      0      0 NaN NaN   NA    NA          NA
6        Sex string     151     0      0      0 NaN NaN   NA    NA          NA
7        Age string     146     0      0      0 NaN NaN   NA    NA          NA
8     Income string     129     0      0      0 NaN NaN   NA    NA          NA
df8C.class[,17:24] <- h2o.asfactor(df8C.class[,17:24])

df8C.class.na <- df8C.class %>% h2o.na_omit()
nrow(df8C.class.na)
[1] 895
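
An alternative that is not pursued in this report would be to impute the missing demographic values instead of dropping the rows, for example with the modal category (a sketch only; the loop and column indices are illustrative):

# Impute the most frequent category into each demographic column (modifies the frame in place)
for (col in colnames(df8C.class)[17:24]) {
  h2o.impute(df8C.class, column = col, method = "mode")
}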

The reduced (roughly halved) data set is then split into training and validation subsets.

#__splitting df into training (70%) and testing (validation) datasets________
fs.split<- h2o.splitFrame(df8C.class.na, ratios=c(0.7))

11 A random forest model

This random forest model attempts to predict membership in the 8-segment solution developed in Level 1 using only the demographic variables. These variables played no part in constructing the segments, but they may be the only variables available in a customer database and are easier to obtain than the attitudinal ratings. This begins the process of answering the question, “How well do the covariates predict segment membership?”

fs3Cov.class_rf <- h2o.randomForest(         
  training_frame   = fs.split[[1]],        
  validation_frame = fs.split[[2]],     
  x=17:24,                        ## the predictor columns, by column index
    y=2,                          ## the target index (what we are predicting)
  model_id = "RF_cov",    ## name the model in H2O
  ##   not required, but helps use Flow
  ntrees = 200,                  ## use a maximum of 200 trees to create the
  ##  random forest model. The default is 50.
  ##  I have increased it because I will let 
  ##  the early stopping criteria decide when
  ##  the random forest is sufficiently accurate
  stopping_rounds = 2,           ## Stop fitting new trees when the 2-tree
  ##  average is within 0.001 (default) of 
  ##  the prior two 2-tree averages.
  ##  Can be thought of as a convergence setting
  score_each_iteration = T,      ## Predict against training and validation for
  ##  each tree. Default will skip several.
  seed = 1000000)                ## Set the random seed so that this can be reproduced.

11.1 Check-out the hit-ratio

The summary() function produces a great deal of output, most of it useful but too much to digest in one chunk. Applying str() to the model object lists the smaller pieces that can be extracted individually.

Both the training and validation metrics indicate that the demographic variables are not very effective at predicting segment membership.

summary(fs3Cov.class_rf)  
Model Details:
==============

H2OMultinomialModel: drf
Model Key:  RF_cov 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              30                      240              226294         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        18   11.89583          9        161    71.02500

H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.6623662
RMSE: (Extract with `h2o.rmse`) 0.8138588
Logloss: (Extract with `h2o.logloss`) 5.199065
Mean Per-Class Error: 0.8732956
R^2: (Extract with `h2o.r2`) 0.8610053
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1     10    1   18    2    0   38    5    4 0.8718 =   68 / 78
Seg2      2    0    3    1    0    6    0    0 1.0000 =   12 / 12
Seg3     14    1   35    9    1   75    9   10 0.7727 = 119 / 154
Seg4      6    2   10    3    0   27    1    3 0.9423 =   49 / 52
Seg5      0    0    1    0    0    8    0    0 1.0000 =     9 / 9
Seg6     15    1   67    4    1   91   14   12 0.5561 = 114 / 205
Seg7      6    0   14    3    0   40    7    1 0.9014 =   64 / 71
Seg8      3    0   13    4    0   40    5    4 0.9420 =   65 / 69
Totals   56    5  161   26    2  325   41   34 0.7692 = 500 / 650

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.230769
2 2  0.429231
3 3  0.615385
4 4  0.740000
5 5  0.827692
6 6  0.872308
7 7  0.889231
8 8  1.000000


H2OMultinomialMetrics: drf
** Reported on validation data. **

Validation Set Metrics: 
=====================

Extract validation frame with `h2o.getFrame("RTMP_sid_b243_25")`
MSE: (Extract with `h2o.mse`) 0.6634014
RMSE: (Extract with `h2o.rmse`) 0.8144946
Logloss: (Extract with `h2o.logloss`) 3.606853
Mean Per-Class Error: 0.8715399
R^2: (Extract with `h2o.r2`) 0.8626447
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      0    0   14    0    0   17    0    0 1.0000 =   31 / 31
Seg2      0    0    4    0    0    2    0    0 1.0000 =     6 / 6
Seg3      2    0   21    1    0   26    3    3 0.6250 =   35 / 56
Seg4      1    0    6    0    0    8    0    1 1.0000 =   16 / 16
Seg5      0    0    0    0    0    1    2    0 1.0000 =     3 / 3
Seg6      7    0   15    3    2   45    6    0 0.4231 =   33 / 78
Seg7      3    0    6    0    0   22    1    1 0.9697 =   32 / 33
Seg8      0    0    6    0    0   14    1    1 0.9545 =   21 / 22
Totals   13    0   72    4    2  135   13    6 0.7224 = 177 / 245

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.277551
2 2  0.436735
3 3  0.616326
4 4  0.726531
5 5  0.820408
6 6  0.897959
7 7  0.934694
8 8  1.000000




Scoring History: 
            timestamp   duration number_of_trees training_rmse training_logloss
1 2020-11-26 10:01:44  0.002 sec               0            NA               NA
2 2020-11-26 10:01:45  0.012 sec               1       0.87477         23.42415
3 2020-11-26 10:01:45  0.019 sec               2       0.86543         22.31440
4 2020-11-26 10:01:45  0.027 sec               3       0.86228         21.29809
5 2020-11-26 10:01:45  0.035 sec               4       0.85881         19.85723
  training_classification_error validation_rmse validation_logloss
1                            NA              NA                 NA
2                       0.76371         0.89835           25.44157
3                       0.77696         0.87445           20.27283
4                       0.77439         0.84749           16.53726
5                       0.78766         0.84085           13.41488
  validation_classification_error
1                              NA
2                         0.80000
3                         0.81224
4                         0.76735
5                         0.79592

---
             timestamp   duration number_of_trees training_rmse
26 2020-11-26 10:01:45  0.312 sec              25       0.81133
27 2020-11-26 10:01:45  0.330 sec              26       0.81171
28 2020-11-26 10:01:45  0.349 sec              27       0.81249
29 2020-11-26 10:01:45  0.370 sec              28       0.81351
30 2020-11-26 10:01:45  0.391 sec              29       0.81330
31 2020-11-26 10:01:45  0.412 sec              30       0.81386
   training_logloss training_classification_error validation_rmse
26          6.08735                       0.74308         0.81237
27          5.75777                       0.74462         0.81253
28          5.43538                       0.75692         0.81259
29          5.38847                       0.75846         0.81289
30          5.29201                       0.75846         0.81375
31          5.19907                       0.76923         0.81449
   validation_logloss validation_classification_error
26            3.82308                         0.73061
27            3.82442                         0.73469
28            3.58487                         0.73061
29            3.59066                         0.73061
30            3.59965                         0.71837
31            3.60685                         0.72245

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
    variable relative_importance scaled_importance percentage
1     Income         1835.585205          1.000000   0.246807
2        Age         1492.825073          0.813269   0.200721
3  Education         1338.323120          0.729099   0.179947
4    Marital          932.513550          0.508020   0.125383
5   Language          614.997009          0.335041   0.082691
6 Employment          611.182190          0.332963   0.082178
7        Sex          344.215271          0.187523   0.046282
8     Smoker          267.674744          0.145825   0.035991

11.2 Keys to look for are validation performance and variable importance.

More direct ways to access the training and validation metrics are shown in the code-chunks below. Performance metrics depend on the type of model being built. With a multinomial classification, we will primarily look at the confusion matrix, and overall accuracy via the hit-ratio. Once again, error rate and the hit-ratio are very poor.

11.3 Confusion matrix for the training sample

The predictive ability of the random forest model on the training sample is quite poor. In any case, training-sample performance should not be relied upon when judging the model.

fs3Cov.class_rf@model$training_metrics     ## 
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.6623662
RMSE: (Extract with `h2o.rmse`) 0.8138588
Logloss: (Extract with `h2o.logloss`) 5.199065
Mean Per-Class Error: 0.8732956
R^2: (Extract with `h2o.r2`) 0.8610053
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1     10    1   18    2    0   38    5    4 0.8718 =   68 / 78
Seg2      2    0    3    1    0    6    0    0 1.0000 =   12 / 12
Seg3     14    1   35    9    1   75    9   10 0.7727 = 119 / 154
Seg4      6    2   10    3    0   27    1    3 0.9423 =   49 / 52
Seg5      0    0    1    0    0    8    0    0 1.0000 =     9 / 9
Seg6     15    1   67    4    1   91   14   12 0.5561 = 114 / 205
Seg7      6    0   14    3    0   40    7    1 0.9014 =   64 / 71
Seg8      3    0   13    4    0   40    5    4 0.9420 =   65 / 69
Totals   56    5  161   26    2  325   41   34 0.7692 = 500 / 650

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.230769
2 2  0.429231
3 3  0.615385
4 4  0.740000
5 5  0.827692
6 6  0.872308
7 7  0.889231
8 8  1.000000

11.4 Confusion matrix for the holdout sample

The error rate and the corresponding hit-ratio for the testing sample are very poor.

h2o.confusionMatrix(fs3Cov.class_rf,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      0    0   14    0    0   17    0    0 1.0000 =   31 / 31
Seg2      0    0    4    0    0    2    0    0 1.0000 =     6 / 6
Seg3      2    0   21    1    0   26    3    3 0.6250 =   35 / 56
Seg4      1    0    6    0    0    8    0    1 1.0000 =   16 / 16
Seg5      0    0    0    0    0    1    2    0 1.0000 =     3 / 3
Seg6      7    0   15    3    2   45    6    0 0.4231 =   33 / 78
Seg7      3    0    6    0    0   22    1    1 0.9697 =   32 / 33
Seg8      0    0    6    0    0   14    1    1 0.9545 =   21 / 22
Totals   13    0   72    4    2  135   13    6 0.7224 = 177 / 245

Or, the same result can be obtained as follows.

fs3Cov.class_rf@model$validation_metrics     ## 
H2OMultinomialMetrics: drf
** Reported on validation data. **

Validation Set Metrics: 
=====================

Extract validation frame with `h2o.getFrame("RTMP_sid_b243_25")`
MSE: (Extract with `h2o.mse`) 0.6634014
RMSE: (Extract with `h2o.rmse`) 0.8144946
Logloss: (Extract with `h2o.logloss`) 3.606853
Mean Per-Class Error: 0.8715399
R^2: (Extract with `h2o.r2`) 0.8626447
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      0    0   14    0    0   17    0    0 1.0000 =   31 / 31
Seg2      0    0    4    0    0    2    0    0 1.0000 =     6 / 6
Seg3      2    0   21    1    0   26    3    3 0.6250 =   35 / 56
Seg4      1    0    6    0    0    8    0    1 1.0000 =   16 / 16
Seg5      0    0    0    0    0    1    2    0 1.0000 =     3 / 3
Seg6      7    0   15    3    2   45    6    0 0.4231 =   33 / 78
Seg7      3    0    6    0    0   22    1    1 0.9697 =   32 / 33
Seg8      0    0    6    0    0   14    1    1 0.9545 =   21 / 22
Totals   13    0   72    4    2  135   13    6 0.7224 = 177 / 245

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.277551
2 2  0.436735
3 3  0.616326
4 4  0.726531
5 5  0.820408
6 6  0.897959
7 7  0.934694
8 8  1.000000

11.5 Hit ratio tables for the training and testing samples

h2o.hit_ratio_table(fs3Cov.class_rf, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.230769
2 2  0.429231
3 3  0.615385
4 4  0.740000
5 5  0.827692
6 6  0.872308
7 7  0.889231
8 8  1.000000

$valid
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.277551
2 2  0.436735
3 3  0.616326
4 4  0.726531
5 5  0.820408
6 6  0.897959
7 7  0.934694
8 8  1.000000

12 Importance of the predicting demographic variables

While income, age, education and marital status appear to be the most important variables, none of them predicts segment membership very well.

fs3Cov.class_rf@model$variable_importances
Variable Importances: 
    variable relative_importance scaled_importance percentage
1     Income         1835.585205          1.000000   0.246807
2        Age         1492.825073          0.813269   0.200721
3  Education         1338.323120          0.729099   0.179947
4    Marital          932.513550          0.508020   0.125383
5   Language          614.997009          0.335041   0.082691
6 Employment          611.182190          0.332963   0.082178
7        Sex          344.215271          0.187523   0.046282
8     Smoker          267.674744          0.145825   0.035991

12.1 The diagnostic statistics

The table below shows the hit-ratio, MSE (mean square error), RMSE (root mean square error), logloss and mean per class error for the random forest model. This table will be expanded to include the diagnostic statistics from each successive predictive model. After all of the analyses, we’ll select the model with the best stats, i.e., highest hit ratio and lowest error rates.

# THE FOLLOWING covH data frame IS TO HOLD STATISTICS FROM EACH MODEL
covH <- data.frame(Prediction_model=character(),
                   hit_ratio=numeric(),
                   MSE=numeric(),
                   RMSE=numeric(),
                   logloss=numeric(),
                   mean_per_class_error=numeric(),
                   stringsAsFactors=FALSE)

covH[1, 1] <- "Random_forest"
covH[1, 2] <- fs3Cov.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
covH[1, 3] <- fs3Cov.class_rf@model$validation_metrics@metrics$MSE   #  
covH[1, 4] <- fs3Cov.class_rf@model$validation_metrics@metrics$RMSE       #  
covH[1, 5] <- fs3Cov.class_rf@model$validation_metrics@metrics$ logloss
covH[1, 6] <- fs3Cov.class_rf@model$validation_metrics@metrics$ mean_per_class_error
covH
  Prediction_model hit_ratio       MSE      RMSE  logloss mean_per_class_error
1    Random_forest  0.277551 0.6634014 0.8144946 3.606853            0.8715399

13 Logistic regression

Logistic regression using the GLM function will be the second predictive model that we’ll investigate.

fs3Cov.class_glm<- h2o.glm(
  family= "multinomial",  
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=17:24,                        ## the predictor columns, by column index
  y=2,
  lambda=0
)

Logistic regression does provide coefficients but they can be difficult to interpret.

fs3Cov.class_glm@model$coefficients_table
Coefficients: glm multinomial coefficients
                         names coefs_class_0 coefs_class_1 coefs_class_2
1                    Intercept     -2.789630    -29.177628     -1.748118
2  Income.$150,000 to $174,999    -12.301131     14.376549     -1.050206
3 Income.$175,000 to $199,999,     -0.419814     -0.424694      0.173737
4      Income.$200,000 or more     -0.372210      1.875082     -0.494475
5      Income.$20000 to $29999     -0.328354     12.463572     -0.328955
  coefs_class_3 coefs_class_4 coefs_class_5 coefs_class_6 coefs_class_7
1     -1.317261    -31.382642     -1.503323     -2.020717     -1.514884
2     -1.340962     -0.056789      0.976500     -0.094116      0.818264
3     -1.370077     14.454119      0.443560     -0.868560      0.111476
4    -12.078823      0.852225      0.689659     -0.019912      0.252480
5     -0.870218     12.568106      0.205512     -0.378688      0.294759
  std_coefs_class_0 std_coefs_class_1 std_coefs_class_2 std_coefs_class_3
1         -2.789630        -29.177628         -1.748118         -1.317261
2        -12.301131         14.376549         -1.050206         -1.340962
3         -0.419814         -0.424694          0.173737         -1.370077
4         -0.372210          1.875082         -0.494475        -12.078823
5         -0.328354         12.463572         -0.328955         -0.870218
  std_coefs_class_4 std_coefs_class_5 std_coefs_class_6 std_coefs_class_7
1        -31.382642         -1.503323         -2.020717         -1.514884
2         -0.056789          0.976500         -0.094116          0.818264
3         14.454119          0.443560         -0.868560          0.111476
4          0.852225          0.689659         -0.019912          0.252480
5         12.568106          0.205512         -0.378688          0.294759

---
                       names coefs_class_0 coefs_class_1 coefs_class_2
38          Language.Italian      0.872204    -12.074147      0.432465
39          Language.Spanish      0.180752      0.538956      0.390999
40 Employment.Fully employed     -0.039444      1.336692      0.047779
41   Employment.Not employed     -0.219908      0.705802      0.053990
42                Smoker.Yes      0.205536    -12.207787      0.220553
43                  Sex.Male      0.531955      0.541160     -0.092518
   coefs_class_3 coefs_class_4 coefs_class_5 coefs_class_6 coefs_class_7
38     -0.419141    -12.239819      0.442000     -0.526422      0.075546
39     -0.850755     -0.492135      0.084140     -0.524931      0.439032
40     -0.064321     -0.829027     -0.156842     -0.053498      0.282103
41      0.180056      1.062008     -0.232056     -0.154852      0.342767
42     -0.601104     -0.221593     -0.163952      0.441261     -0.294634
43     -0.287154     -0.946531      0.053459      0.164545     -0.112481
   std_coefs_class_0 std_coefs_class_1 std_coefs_class_2 std_coefs_class_3
38          0.872204        -12.074147          0.432465         -0.419141
39          0.180752          0.538956          0.390999         -0.850755
40         -0.039444          1.336692          0.047779         -0.064321
41         -0.219908          0.705802          0.053990          0.180056
42          0.205536        -12.207787          0.220553         -0.601104
43          0.531955          0.541160         -0.092518         -0.287154
   std_coefs_class_4 std_coefs_class_5 std_coefs_class_6 std_coefs_class_7
38        -12.239819          0.442000         -0.526422          0.075546
39         -0.492135          0.084140         -0.524931          0.439032
40         -0.829027         -0.156842         -0.053498          0.282103
41          1.062008         -0.232056         -0.154852          0.342767
42         -0.221593         -0.163952          0.441261         -0.294634
43         -0.946531          0.053459          0.164545         -0.112481
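
One way to make these multinomial coefficients a little easier to read is to exponentiate them into relative odds (a sketch, not run above):

glm_coefs <- as.data.frame(fs3Cov.class_glm@model$coefficients_table)
odds <- round(exp(glm_coefs[, grep("^coefs_class", names(glm_coefs))]), 2)  # e^beta = multiplicative change in the odds
head(cbind(names = glm_coefs$names, odds))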

13.1 Confusion matrix for the training sample

The error rate on the training sample is quite poor for the GLM model as well.

fs3Cov.class_glm@model$training_metrics  
H2OMultinomialMetrics: glm
** Reported on training data. **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.5700341
RMSE: (Extract with `h2o.rmse`) 0.7550061
Logloss: (Extract with `h2o.logloss`) 1.551365
Mean Per-Class Error: 0.7689441
Null Deviance: (Extract with `h2o.nulldeviance`) 2306.878
Residual Deviance: (Extract with `h2o.residual_deviance`) 2016.775
R^2: (Extract with `h2o.r2`) 0.8803808
AIC: (Extract with `h2o.aic`) NaN
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      7    0   25    1    0   39    4    2 0.9103 =   71 / 78
Seg2      1    3    1    0    0    7    0    0 0.7500 =    9 / 12
Seg3      6    1   69    1    1   69    4    3 0.5519 =  85 / 154
Seg4      2    1    9    6    0   32    1    1 0.8846 =   46 / 52
Seg5      0    0    1    0    0    8    0    0 1.0000 =     9 / 9
Seg6      8    0   45    3    0  138    7    4 0.3268 =  67 / 205
Seg7      4    1   14    4    0   40    8    0 0.8873 =   63 / 71
Seg8      3    0   19    3    0   33    0   11 0.8406 =   58 / 69
Totals   31    6  183   18    1  366   24   21 0.6277 = 408 / 650

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.372308
2 2  0.613846
3 3  0.778462
4 4  0.880000
5 5  0.947692
6 6  0.984615
7 7  0.998462
8 8  1.000000

13.2 Confusion matrix for the holdout sample

The holdout sample error rate is worse than for the training sample, which is what should be expected. Unfortunately, the prediction error rate is very high.

h2o.confusionMatrix(fs3Cov.class_glm,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      2    2   10    0    0   16    1    0 0.9355 =   29 / 31
Seg2      0    1    2    0    0    3    0    0 0.8333 =     5 / 6
Seg3      2    0   14    0    1   32    3    4 0.7500 =   42 / 56
Seg4      1    1    6    1    0    6    1    0 0.9375 =   15 / 16
Seg5      0    0    1    0    0    2    0    0 1.0000 =     3 / 3
Seg6      3    0   23    1    0   46    4    1 0.4103 =   32 / 78
Seg7      2    2   11    0    0   16    1    1 0.9697 =   32 / 33
Seg8      1    1    5    0    0   13    2    0 1.0000 =   22 / 22
Totals   11    7   72    2    1  134   12    6 0.7347 = 180 / 245

13.3 Hit ratio table for the training and testing samples

Of course the hit-ratios, which are just 1 minus the error rates, are quite poor.

h2o.hit_ratio_table(fs3Cov.class_glm, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.372308
2 2  0.613846
3 3  0.778462
4 4  0.880000
5 5  0.947692
6 6  0.984615
7 7  0.998462
8 8  1.000000

$valid
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.265306
2 2  0.453061
3 3  0.624490
4 4  0.783673
5 5  0.881633
6 6  0.955102
7 7  0.979592
8 8  1.000000

13.4 Diagnostic statistics for the GLM model

The table below shows the diagnostic statistics for the random forest and logistic regression models. Neither set of results is at all impressive.

covH[2, 1] <- "GLM"
covH[2, 2] <- fs3Cov.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
covH[2, 3] <- fs3Cov.class_glm@model$validation_metrics@metrics$MSE   #  
covH[2, 4] <- fs3Cov.class_glm@model$validation_metrics@metrics$RMSE       #  
covH[2, 5] <- fs3Cov.class_glm@model$validation_metrics@metrics$ logloss
covH[2, 6] <- fs3Cov.class_glm@model$validation_metrics@metrics$ mean_per_class_error
covH
  Prediction_model hit_ratio       MSE      RMSE  logloss mean_per_class_error
1    Random_forest 0.2775510 0.6634014 0.8144946 3.606853            0.8715399
2              GLM 0.2653061 0.6644477 0.8151366 2.407140            0.8545338

14 Deep Learning

The basic deep learning model has a very simple structure that can be expanded.

fs3Cov.class_dl<- h2o.deeplearning(
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=17:24,                        ## the predictor columns, by column index
  y=2
)
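
The structure could be expanded with explicit hidden layers, more epochs and early stopping, along the following lines (the values are purely illustrative and have not been tuned):

fs3Cov.class_dl2 <- h2o.deeplearning(
  training_frame   = fs.split[[1]],
  validation_frame = fs.split[[2]],
  x = 17:24,
  y = 2,
  hidden = c(64, 64),            ## two hidden layers of 64 units
  epochs = 50,                   ## more passes over the training data
  stopping_rounds = 3,           ## stop when the validation metric stalls
  stopping_metric = "logloss",
  seed = 1000000)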

14.1 Confusion matrix for the training sample

The training-sample error rate is a little better than for the earlier models; however, on its own this does not mean much.

fs3Cov.class_dl@model$training_metrics  
H2OMultinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.4472733
RMSE: (Extract with `h2o.rmse`) 0.6687849
Logloss: (Extract with `h2o.logloss`) 1.254786
Mean Per-Class Error: 0.5981942
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1     31    0    4    4    0   28    7    4 0.6026 =   47 / 78
Seg2      1    0    2    1    0    5    0    3 1.0000 =   12 / 12
Seg3      7    0   58    6    0   54   23    6 0.6234 =  96 / 154
Seg4      1    0    2   24    0   17    4    4 0.5385 =   28 / 52
Seg5      0    0    0    0    0    6    3    0 1.0000 =     9 / 9
Seg6      5    0    6    6    0  172   12    4 0.1610 =  33 / 205
Seg7      0    0    1    2    0   19   48    1 0.3239 =   23 / 71
Seg8      1    0    5    4    0   24    3   32 0.5362 =   37 / 69
Totals   46    0   78   47    0  325  100   54 0.4385 = 285 / 650

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.561538
2 2  0.749231
3 3  0.869231
4 4  0.938462
5 5  0.963077
6 6  0.976923
7 7  1.000000
8 8  1.000000

14.2 Confusion matrix for the holdout sample

Unfortunately, the validation sample error rate is worse than that produced by the random forest and GLM models.

h2o.confusionMatrix(fs3Cov.class_dl,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      1    0    4    1    0   18    6    1 0.9677 =   30 / 31
Seg2      1    0    0    0    0    3    2    0 1.0000 =     6 / 6
Seg3      1    0    7    3    0   33    8    4 0.8750 =   49 / 56
Seg4      1    0    5    0    0    7    2    1 1.0000 =   16 / 16
Seg5      0    0    0    1    0    2    0    0 1.0000 =     3 / 3
Seg6      3    0    8    6    0   44   12    5 0.4359 =   34 / 78
Seg7      4    0    1    0    0   24    2    2 0.9394 =   31 / 33
Seg8      0    0    3    0    0   15    2    2 0.9091 =   20 / 22
Totals   11    0   28   11    0  146   34   15 0.7714 = 189 / 245

14.3 Hit ratio table for the training and testing samples

Again, the hit ratios are very poor.

h2o.hit_ratio_table(fs3Cov.class_dl, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.561538
2 2  0.749231
3 3  0.869231
4 4  0.938462
5 5  0.963077
6 6  0.976923
7 7  1.000000
8 8  1.000000

$valid
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.228571
2 2  0.424490
3 3  0.542857
4 4  0.718367
5 5  0.840816
6 6  0.959184
7 7  0.983673
8 8  1.000000

14.4 Diagnostic statistics

The diagnostic statistics table shows that none of the models performs well: the random forest still has the highest hit ratio, with the GLM close behind, in what is a very dismal race.

covH[3, 1] <- "Deep_Learning"
covH[3, 2] <- fs3Cov.class_dl@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
covH[3, 3] <- fs3Cov.class_dl@model$validation_metrics@metrics$MSE   #  
covH[3, 4] <- fs3Cov.class_dl@model$validation_metrics@metrics$RMSE       #  
covH[3, 5] <- fs3Cov.class_dl@model$validation_metrics@metrics$ logloss
covH[3, 6] <- fs3Cov.class_dl@model$validation_metrics@metrics$ mean_per_class_error
covH
  Prediction_model hit_ratio       MSE      RMSE  logloss mean_per_class_error
1    Random_forest 0.2775510 0.6634014 0.8144946 3.606853            0.8715399
2              GLM 0.2653061 0.6644477 0.8151366 2.407140            0.8545338
3    Deep_Learning 0.2285714 0.6793221 0.8242100 2.218547            0.8908905

15 Gradient Boosting Model

The basic gradient boosting model, an ensemble of boosted decision trees, is specified below.

fs3Cov.class_gbm<- h2o.gbm(
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=17:24,                        ## the predictor columns, by column index
  y=2
)
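
If the default settings overfit the training sample (as the metrics in the next subsection suggest), a shallower, slower-learning specification with early stopping could be tried, for instance (illustrative values only):

fs3Cov.class_gbm2 <- h2o.gbm(
  training_frame   = fs.split[[1]],
  validation_frame = fs.split[[2]],
  x = 17:24,
  y = 2,
  ntrees = 200,                  ## allow more, smaller steps ...
  learn_rate = 0.05,             ## ... but shrink each one
  max_depth = 3,                 ## shallower trees
  stopping_rounds = 3,           ## stop on the validation metric
  stopping_metric = "logloss",
  seed = 1000000)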

15.1 Confusion matrix for the training sample

The incredibly low error rate shown below for the training sample is quite suspicious compared to the earlier models.

fs3Cov.class_gbm@model$training_metrics  
H2OMultinomialMetrics: gbm
** Reported on training data. **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.102122
RMSE: (Extract with `h2o.rmse`) 0.3195654
Logloss: (Extract with `h2o.logloss`) 0.3592171
Mean Per-Class Error: 0.02954064
R^2: (Extract with `h2o.r2`) 0.9785701
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error       Rate
Seg1     73    1    0    1    0    2    1    0 0.0641 =   5 / 78
Seg2      0   12    0    0    0    0    0    0 0.0000 =   0 / 12
Seg3      1    0  146    0    0    6    0    1 0.0519 =  8 / 154
Seg4      0    0    2   49    0    1    0    0 0.0577 =   3 / 52
Seg5      0    0    0    0    9    0    0    0 0.0000 =    0 / 9
Seg6      0    0    2    1    0  201    0    1 0.0195 =  4 / 205
Seg7      0    0    0    0    0    1   70    0 0.0141 =   1 / 71
Seg8      1    0    1    0    0    0    0   67 0.0290 =   2 / 69
Totals   75   13  151   51    9  211   71   69 0.0354 = 23 / 650

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.964615
2 2  0.996923
3 3  1.000000
4 4  1.000000
5 5  1.000000
6 6  1.000000
7 7  1.000000
8 8  1.000000

15.2 Confusion matrix for the holdout sample

However, the holdout sample error rate is about as bad as seen earlier.

h2o.confusionMatrix(fs3Cov.class_gbm,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      0    0   12    0    0   18    1    0 1.0000 =   31 / 31
Seg2      0    0    3    1    0    2    0    0 1.0000 =     6 / 6
Seg3      3    0   21    0    0   24    2    6 0.6250 =   35 / 56
Seg4      1    0    7    0    0    7    1    0 1.0000 =   16 / 16
Seg5      0    0    1    0    0    0    2    0 1.0000 =     3 / 3
Seg6      9    1   19    4    1   31   10    3 0.6026 =   47 / 78
Seg7      4    1    7    1    0   18    1    1 0.9697 =   32 / 33
Seg8      0    0    8    1    0    9    2    2 0.9091 =   20 / 22
Totals   17    2   78    7    1  109   19   12 0.7755 = 190 / 245

15.3 Hit ratio table for the training and testing samples

h2o.hit_ratio_table(fs3Cov.class_gbm, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.964615
2 2  0.996923
3 3  1.000000
4 4  1.000000
5 5  1.000000
6 6  1.000000
7 7  1.000000
8 8  1.000000

$valid
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.224490
2 2  0.444898
3 3  0.587755
4 4  0.746939
5 5  0.857143
6 6  0.959184
7 7  0.987755
8 8  1.000000

15.4 Diagnostic statistics

The random forest and GLM models maintain their narrow edge over the deep learning and gradient boosting models.

covH[4, 1] <- "GBM_Boosting"
covH[4, 2] <- fs3Cov.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
covH[4, 3] <- fs3Cov.class_gbm@model$validation_metrics@metrics$MSE   #  
covH[4, 4] <- fs3Cov.class_gbm@model$validation_metrics@metrics$RMSE       #  
covH[4, 5] <- fs3Cov.class_gbm@model$validation_metrics@metrics$ logloss
covH[4, 6] <- fs3Cov.class_gbm@model$validation_metrics@metrics$ mean_per_class_error
covH
  Prediction_model hit_ratio       MSE      RMSE  logloss mean_per_class_error
1    Random_forest 0.2775510 0.6634014 0.8144946 3.606853            0.8715399
2              GLM 0.2653061 0.6644477 0.8151366 2.407140            0.8545338
3    Deep_Learning 0.2285714 0.6793221 0.8242100 2.218547            0.8908905
4     GBM_Boosting 0.2244898 0.6720699 0.8197987 2.361972            0.8882940

16 Naive Bayes Model

fs3Cov.class_nB <- h2o.naiveBayes(
  training_frame = fs.split[[1]],        ## the H2O frame for training
  validation_frame = fs.split[[2]],      ## the H2O frame for validation (not required)
  x=17:24,                        ## the predictor columns, by column index
  y=2,
  laplace = 3)

16.1 Confusion matrix for the training sample

The naive Bayes training-sample error rate is similar to those of the random forest, GLM and deep learning models; only the GBM, which essentially memorised the training data, was lower.

fs3Cov.class_nB@model$training_metrics  
H2OMultinomialMetrics: naivebayes
** Reported on training data. **

Training Set Metrics: 
=====================

Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.5938202
RMSE: (Extract with `h2o.rmse`) 0.7705973
Logloss: (Extract with `h2o.logloss`) 1.643498
Mean Per-Class Error: 0.8257825
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      4    0   27    0    0   44    2    1 0.9487 =   74 / 78
Seg2      0    0    1    0    0   10    0    1 1.0000 =   12 / 12
Seg3      4    0   66    1    0   79    3    1 0.5714 =  88 / 154
Seg4      2    0    8    1    0   36    3    2 0.9808 =   51 / 52
Seg5      0    0    0    0    0    9    0    0 1.0000 =     9 / 9
Seg6      3    0   39    0    0  157    3    3 0.2341 =  48 / 205
Seg7      1    0   18    0    0   48    4    0 0.9437 =   67 / 71
Seg8      0    0   18    1    0   45    0    5 0.9275 =   64 / 69
Totals   14    0  177    3    0  428   15   13 0.6354 = 413 / 650

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.364615
2 2  0.596923
3 3  0.763077
4 4  0.853846
5 5  0.933846
6 6  0.973846
7 7  0.998462
8 8  1.000000

16.2 Confusion matrix for the holdout sample

The holdout error rate is also about the same as those of the other models.

h2o.confusionMatrix(fs3Cov.class_nB,  fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
       Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8  Error        Rate
Seg1      1    0   11    0    0   19    0    0 0.9677 =   30 / 31
Seg2      0    0    3    0    0    3    0    0 1.0000 =     6 / 6
Seg3      1    0   14    0    0   39    2    0 0.7500 =   42 / 56
Seg4      0    0    5    0    0   11    0    0 1.0000 =   16 / 16
Seg5      0    0    1    0    0    2    0    0 1.0000 =     3 / 3
Seg6      1    0   23    0    0   51    3    0 0.3462 =   27 / 78
Seg7      1    0    9    0    0   23    0    0 1.0000 =   33 / 33
Seg8      0    0    4    0    0   17    1    0 1.0000 =   22 / 22
Totals    4    0   70    0    0  165    6    0 0.7306 = 179 / 245

16.3 Hit ratio table for the training and testing samples

h2o.hit_ratio_table(fs3Cov.class_nB, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.364615
2 2  0.596923
3 3  0.763077
4 4  0.853846
5 5  0.933846
6 6  0.973846
7 7  0.998462
8 8  1.000000

$valid
Top-8 Hit Ratios: 
  k hit_ratio
1 1  0.269388
2 2  0.518367
3 3  0.657143
4 4  0.779592
5 5  0.877551
6 6  0.959184
7 7  0.983673
8 8  1.000000

17 Comparing the five predictive models’ diagnostic statistics

covH[5, 1] <- "Naive_Bayes"
covH[5, 2] <- fs3Cov.class_nB@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #  
covH[5, 3] <- fs3Cov.class_nB@model$validation_metrics@metrics$MSE   #  
covH[5, 4] <- fs3Cov.class_nB@model$validation_metrics@metrics$RMSE       #  
covH[5, 5] <- fs3Cov.class_nB@model$validation_metrics@metrics$ logloss
covH[5, 6] <- fs3Cov.class_nB@model$validation_metrics@metrics$ mean_per_class_error

The random forest has the marginally highest hit ratio, but the naive Bayes model appears superior on several of the error statistics, with the lowest MSE, RMSE and log loss. In any case, none of the hit ratios is anything to be proud of.

pacman::p_load(sjPlot)
tab_df(covH, sort.column = -2, show.rownames = FALSE, digits = 3,
       title = "Statistics of five Level 2 models of the 8-segment solution")
Statistics of five Level 2 models of the 8-segment solution
Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
Random_forest 0.278 0.663 0.814 3.607 0.872
Naive_Bayes 0.269 0.653 0.808 1.891 0.883
GLM 0.265 0.664 0.815 2.407 0.855
Deep_Learning 0.229 0.679 0.824 2.219 0.891
GBM_Boosting 0.224 0.672 0.820 2.362 0.888

18 Graphically comparing the hit ratios for the 5 models

As seen in the following graph, the random forest and naive Bayes models achieve the highest hit ratios in predicting the segments, although the overall results are nowhere near the numbers we obtained in the Level 1 segmentation!

pacman::p_load(reshape2)
# covH[,1:2] 
hits_long<- melt(covH[,1:2] ) # need to reshape to 'long' form
# head(hits_long )
pacman::p_load(ggplot2)
ggplot(data=hits_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
  geom_bar(  stat="identity" , fill = "pink", width = 0.3) +
  geom_point(aes( color= Prediction_model ), size=3) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1, size=10)) +
  labs( title= "Hit ratios for predictive models using covariates" , 
        y= "Proportion correct", x= "Prediction Model") +
  ylim(0, 1)

19 Obtaining the segment assignments using the naive Bayes model

Each respondent's segment is predicted with the fs3Cov.class_nB model.

cov.assign.hex = h2o.predict(fs3Cov.class_nB, df8C.class[,17:24]) 
cov.assign.hex<- as.factor(cov.assign.hex) # cluster 'names' must be factors for modeling
h2o.print(cov.assign.hex$predict, n = 10L)
h2o.table(cov.assign.hex$predict) # table of assignments over all respondents

20 Combining the L1 and L2 segments

The segments developed from the 8-segment k-means solution and those predicted based on the demographic variables are combined into a single dataframe below and then saved.

fs6.L1.L2.segs <- h2o.cbind(df8C.class[,c(1,2)],  cov.assign.hex$predict, df8C.class[,3:38])
colnames(fs6.L1.L2.segs)[ c(2,3)] <- c("L1_Segments", "L2_Segments")
fs6.L1.L2.segs[1:6, 1:5]
  RID L1_Segments L2_Segments Easy_Reservation Preferred_Seats
1   1        Seg2        Seg6                1               1
2   2        Seg5        Seg6                1               1
3   3        Seg2        Seg6                1               1
4   4        Seg2        Seg6                1               1
5   5        Seg5        Seg6                1               1
6   6        Seg5        Seg6                1               1

[6 rows x 5 columns] 

This table shows where the Level 1 segment assignments and the Level 2 (demographic) predictions agree and where they diverge.

print( h2o.table(fs6.L1.L2.segs$L1_Segments,  fs6.L1.L2.segs$L2_Segments), n=64L)
   L1_Segments L2_Segments Counts
1         Seg1        Seg1      6
2         Seg1        Seg3     70
3         Seg1        Seg6    141
4         Seg1        Seg7      2
5         Seg1        Seg8      3
6         Seg2        Seg3      9
7         Seg2        Seg6     19
8         Seg2        Seg8      1
9         Seg3        Seg1      6
10        Seg3        Seg3    133
11        Seg3        Seg4      2
12        Seg3        Seg6    251
13        Seg3        Seg7      8
14        Seg3        Seg8      3
15        Seg4        Seg1      3
16        Seg4        Seg3     23
17        Seg4        Seg4      1
18        Seg4        Seg6     92
19        Seg4        Seg7      3
20        Seg4        Seg8      3
21        Seg5        Seg3      3
22        Seg5        Seg6     17
23        Seg5        Seg7      1
24        Seg6        Seg1      8
25        Seg6        Seg3    136
26        Seg6        Seg4      1
27        Seg6        Seg6    413
28        Seg6        Seg7      9
29        Seg6        Seg8      3
30        Seg7        Seg1      3
31        Seg7        Seg3     50
32        Seg7        Seg6    147
33        Seg7        Seg7      5
34        Seg7        Seg8      1
35        Seg8        Seg1      2
36        Seg8        Seg3     49
37        Seg8        Seg4      3
38        Seg8        Seg6    131
39        Seg8        Seg7      2
40        Seg8        Seg8      5

[40 rows x 3 columns] 
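
A single agreement figure can be pulled out of this cross-tabulation, for example (a sketch):

seg_tab <- as.data.frame(h2o.table(fs6.L1.L2.segs$L1_Segments, fs6.L1.L2.segs$L2_Segments))
agree <- seg_tab$Counts[as.character(seg_tab$L1_Segments) == as.character(seg_tab$L2_Segments)]
sum(agree) / sum(seg_tab$Counts)   # proportion of respondents given the same segment by both levels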

21 Visualizing & Summarizing customers in segments

21.1 Table of attribute attitudes and segments

The table below lists the mean ratings for each attitudinal variable for those in each of the 8 Level 1 segments.

Use the information below and the earlier information to determine if this 8-segment solution has good interpretability. If not, go back to one of the other segmentation solutions.

library("dplyr")
library("kableExtra")
library(psych)
library(data.table)

df_final_L1_L2 <- as.data.frame(fs6.L1.L2.segs)

df_final_L1_L2 %>% 
  mutate(Group = as.factor(L1_Segments)) %>%
  group_by(Group) %>%
  summarize(Avg_v075 = round(mean(Easy_Reservation),2), 
            Avg_v076 = round(mean(Preferred_Seats),2), 
            Avg_v077 = round(mean(Flight_Options),2),
            Avg_v078 = round(mean(Ticket_Prices),2),
            Avg_v079 = round(mean(Seat_Comfort),2),
            Avg_v080 = round(mean(Seat_Roominess),2),
            Avg_v081 = round(mean(Overhead_Storage),2),
            Avg_v082 = round(mean(Clean_Aircraft),2),
            Avg_v083 = round(mean(Courtesy),2),
            Avg_v084 = round(mean(Friendliness),2),
            Avg_v085 = round(mean(Helpfulness),2),
            Avg_v086 = round(mean(Service),2),
            Avg_v087 = round(mean(Satisfaction),2),
            Avg_v088 = round(mean(Recommend), 2),
            Count_of_Members = n()
  ) %>%
  arrange(Group) %>% 
  transpose() -> cd

colnames(cd) <-  cd[1,] 
cd <- cd[-1,]
cd$order <- 1:nrow(cd)
rownames(cd)[1:14] <- colnames(df_final_L1_L2[,4:17])
rownames(cd)[15] <- c("Segment_Size")
# cd$variable <- rownames(cd)
cd$variable <- c(names(df_final_L1_L2)[4:17],
                 "Segment Size")
cd <- cd[, c(10, 1:9)]
cd[,2:9] <- lapply(cd[,2:9], function(x) as.numeric(as.character(x)))

cd[1:14, 1:9] %>% 
  arrange(variable ) %>% 
  mutate_if(is.numeric, function(x) {
    cell_spec(x, bold = T, 
              color = spec_color(x, end = 0.9),
              font_size = spec_font_size(x))
  }) %>%
  kable(escape = F, align = "c") %>%
  kable_styling(c("striped", "condensed"), full_width = T, position = "left")
variable Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8
Clean_Aircraft 6.92 1.52 8.52 6.61 2.95 8.41 7.75 7.59
Courtesy 8.56 2.03 8.6 7.41 3.43 8.68 8.63 7.72
Easy_Reservation 8.47 1.86 8.26 7.41 3.71 8.79 7.37 8.6
Flight_Options 7.53 1.38 6.09 5.69 1.86 8.28 5.53 7.8
Friendliness 8.57 2.28 8.56 7.09 3.48 8.71 8.57 7.14
Helpfulness 8.5 2.14 8.47 6.47 4.14 8.75 8.38 7.05
Overhead_Storage 5.7 1.45 8.01 5.26 2.14 8.17 6.86 6.83
Preferred_Seats 7.45 1.34 7.09 6.18 1.62 8.25 5.31 7.44
Recommend 6.42 1.38 7.52 5.33 1.86 8.28 5.53 7.51
Satisfaction 7.84 1.59 8.24 6.14 2.76 8.6 7.82 7.24
Seat_Comfort 6.71 1.86 8.48 6.11 3.19 8.64 7.91 7.94
Seat_Roominess 5.96 1.48 8.35 6.45 2.71 8.34 7.35 7.42
Service 8.43 1.9 8.6 6.22 4.1 8.74 8.54 6.7
Ticket_Prices 7.14 1.17 6.39 5.05 1.52 8.02 5.11 7.09

21.2 Correspondence analysis of attributes and segments

There is not much that can be inferred from this plot beyond what we already know. For example, Segments 2 and 5, which had the lowest scores on the attitudinal variables, sit close together, as do Segments 6 and 8, which had the best scores. The plot also shows the closeness of the service-related variables (e.g. Helpfulness, Friendliness, Courtesy), but it does not add much to the profile of the segments.

library(FactoMineR)
cd.m <- cd[1:14, 2:9]
c <- CA(cd.m, graph=FALSE)
plot(c, title="Correspondence Analysis of Attributes and Segments", col.main="blue" )
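
The proportion of the table's inertia captured by each CA dimension can be checked as follows (a sketch; the first two dimensions are the ones plotted above):

round(c$eig, 2)   # eigenvalue, percentage of variance and cumulative percentage per dimension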

22 Interpreting the segments using demographic data

The several tables that follow provide the row and column percents for each level of each demographic for the 8 Level 1 segments. While prediction using the demographic variables has been shown to be very poor, there may be some insights that could help to better describe the segments.

The Chi-square statistics beneath each table provide some indication of whether the table shows any significant relationship between the variables and the segments.
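
The same tests can also be reproduced directly, outside of the sjt.xtab tables, for instance for the Language variable (a sketch):

chisq.test(table(df_final_L1_L2$Language, df_final_L1_L2$L1_Segments))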

22.1 Segments by Language

The majority of customers are English-speaking, with Spanish the second most common language. Within each language the two largest segments are Segments 3 and 6, which are also the largest segments overall. Not much insight here!

library(sjPlot)
sjt.xtab(df_final_L1_L2$Language, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Language (each cell: count / row % / column %)

            Seg1    Seg2    Seg3    Seg4    Seg5    Seg6    Seg7    Seg8   Total
English
  count      117      16     227      76      14     315     122     110     997
  row %     11.7     1.6    22.8     7.6     1.4    31.6    12.2    11.0     100
  col %     57.4    57.1    60.4    66.1    73.7    61.6    67.0    63.2    62.0
French
  count       27       6      41      11       2      69      17      12     185
  row %     14.6     3.2    22.2     5.9     1.1    37.3     9.2     6.5     100
  col %     13.2    21.4    10.9     9.6    10.5    13.5     9.3     6.9    11.5
German
  count        1       0       1       2       1       0       2       2       9
  row %     11.1     0.0    11.1    22.2    11.1     0.0    22.2    22.2     100
  col %      0.5     0.0     0.3     1.7     5.3     0.0     1.1     1.1     0.6
Italian
  count       14       0      20       4       0      25       4       8      75
  row %     18.7     0.0    26.7     5.3     0.0    33.3     5.3    10.7     100
  col %      6.9     0.0     5.3     3.5     0.0     4.9     2.2     4.6     4.7
Spanish
  count       45       6      87      22       2     102      37      42     343
  row %     13.1     1.7    25.4     6.4     0.6    29.7    10.8    12.2     100
  col %     22.1    21.4    23.1    19.1    10.5    20.0    20.3    24.1    21.3
Total
  count      204      28     376     115      19     511     182     174    1609
  row %     12.7     1.7    23.4     7.1     1.2    31.8    11.3    10.8     100
  col %      100     100     100     100     100     100     100     100     100
χ2=38.233 · df=28 · Cramer’s V=0.077 · Fisher’s p=0.119

22.2 Segments by Smoking habits

About 80 percent of customers are non-smokers.

sjt.xtab(df_final_L1_L2$Smoker, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Each cell: count (row % / column %).

Smoker  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7  Seg8  Total
No  165 (12.7/83.8)  24 (1.8/92.3)  293 (22.5/79.4)  94 (7.2/83.2)  15 (1.2/71.4)  422 (32.4/81.3)  142 (10.9/75.9)  148 (11.4/80.9)  1303 (100/80.7)
Yes  32 (10.3/16.2)  2 (0.6/7.7)  76 (24.4/20.6)  19 (6.1/16.8)  6 (1.9/28.6)  97 (31.1/18.7)  45 (14.4/24.1)  35 (11.2/19.1)  312 (100/19.3)
Total  197 (12.2/100)  26 (1.6/100)  369 (22.8/100)  113 (7/100)  21 (1.3/100)  519 (32.1/100)  187 (11.6/100)  183 (11.3/100)  1615 (100/100)
χ2=8.282 · df=7 · Cramer’s V=0.072 · Fisher’s p=0.304

22.3 Segments by Employment

This is one of the demographic variables worth inspecting, although the Chi-square test here is not significant. Most segments are distributed nearly evenly across the three employment categories, other than, again, Segments 2 and 5, which have small sample sizes.

sjt.xtab(df_final_L1_L2$Employment, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Each cell: count (row % / column %).

Employment  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7  Seg8  Total
Employed part-time  69 (12.8/33.7)  7 (1.3/25)  118 (21.9/31.8)  37 (6.9/32.5)  10 (1.9/52.6)  178 (33.1/33.8)  69 (12.8/35.6)  50 (9.3/29.6)  538 (100/33.1)
Fully employed  73 (13/35.6)  12 (2.1/42.9)  123 (21.9/33.2)  37 (6.6/32.5)  3 (0.5/15.8)  188 (33.5/35.7)  65 (11.6/33.5)  61 (10.9/36.1)  562 (100/34.5)
Not employed  63 (12/30.7)  9 (1.7/32.1)  130 (24.7/35)  40 (7.6/35.1)  6 (1.1/31.6)  161 (30.6/30.6)  60 (11.4/30.9)  58 (11/34.3)  527 (100/32.4)
Total  205 (12.6/100)  28 (1.7/100)  371 (22.8/100)  114 (7/100)  19 (1.2/100)  527 (32.4/100)  194 (11.9/100)  169 (10.4/100)  1627 (100/100)
χ2=9.428 · df=14 · Cramer’s V=0.054 · p=0.803

22.4 Segments by Education

sjt.xtab(df_final_L1_L2$Education, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Each cell: count (row % / column %).

Education  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7  Seg8  Total
Bachelor's degree  35 (13.9/17.3)  7 (2.8/28)  65 (25.8/17.5)  14 (5.6/11.8)  1 (0.4/5.3)  71 (28.2/13.4)  31 (12.3/16.7)  28 (11.1/16.4)  252 (100/15.5)
College diploma  38 (12.2/18.8)  3 (1/12)  67 (21.5/18.1)  32 (10.3/26.9)  3 (1/15.8)  103 (33.1/19.5)  35 (11.3/18.8)  30 (9.6/17.5)  311 (100/19.2)
Earned doctorate  0 (0/0)  0 (0/0)  2 (25/0.5)  2 (25/1.7)  0 (0/0)  2 (25/0.4)  1 (12.5/0.5)  1 (12.5/0.6)  8 (100/0.5)
High school diploma or certificate  50 (11.6/24.8)  5 (1.2/20)  105 (24.4/28.3)  24 (5.6/20.2)  6 (1.4/31.6)  141 (32.8/26.7)  56 (13/30.1)  43 (10/25.1)  430 (100/26.5)
Master's degree  10 (12.7/5)  0 (0/0)  19 (24.1/5.1)  8 (10.1/6.7)  1 (1.3/5.3)  22 (27.8/4.2)  9 (11.4/4.8)  10 (12.7/5.8)  79 (100/4.9)
Medical degree  1 (7.1/0.5)  1 (7.1/4)  4 (28.6/1.1)  1 (7.1/0.8)  0 (0/0)  4 (28.6/0.8)  2 (14.3/1.1)  1 (7.1/0.6)  14 (100/0.9)
No certificate, diploma or degree  46 (14.3/22.8)  4 (1.2/16)  65 (20.2/17.5)  26 (8.1/21.8)  3 (0.9/15.8)  117 (36.3/22.2)  32 (9.9/17.2)  29 (9/17)  322 (100/19.9)
Trades certificate  13 (9.5/6.4)  3 (2.2/12)  31 (22.6/8.4)  8 (5.8/6.7)  1 (0.7/5.3)  48 (35/9.1)  11 (8/5.9)  22 (16.1/12.9)  137 (100/8.5)
University certificate above bachelor's degree  3 (13.6/1.5)  2 (9.1/8)  3 (13.6/0.8)  1 (4.5/0.8)  2 (9.1/10.5)  7 (31.8/1.3)  3 (13.6/1.6)  1 (4.5/0.6)  22 (100/1.4)
University certificate below bachelor level  6 (13/3)  0 (0/0)  10 (21.7/2.7)  3 (6.5/2.5)  2 (4.3/10.5)  13 (28.3/2.5)  6 (13/3.2)  6 (13/3.5)  46 (100/2.8)
Total  202 (12.5/100)  25 (1.5/100)  371 (22.9/100)  119 (7.3/100)  19 (1.2/100)  528 (32.6/100)  186 (11.5/100)  171 (10.5/100)  1621 (100/100)
χ2=68.773 · df=63 · Cramer’s V=0.078 · Fisher’s p=0.482

22.5 Segments by Marital status

sjt.xtab(df_final_L1_L2$Marital, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Each cell: count (row % / column %).

Marital  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7  Seg8  Total
Divorced, not living with a life partner  12 (11.7/5.9)  1 (1/3.6)  24 (23.3/6.4)  7 (6.8/5.8)  0 (0/0)  33 (32/6.3)  10 (9.7/5.3)  16 (15.5/9.1)  103 (100/6.3)
Living common law  51 (11.6/25.2)  9 (2.1/32.1)  81 (18.5/21.7)  42 (9.6/35)  8 (1.8/44.4)  154 (35.2/29.4)  54 (12.3/28.9)  39 (8.9/22.3)  438 (100/26.9)
Married  100 (13.4/49.5)  13 (1.7/46.4)  192 (25.8/51.3)  44 (5.9/36.7)  9 (1.2/50)  218 (29.3/41.7)  88 (11.8/47.1)  81 (10.9/46.3)  745 (100/45.8)
Separated, not living with a life partner  12 (11.4/5.9)  1 (1/3.6)  24 (22.9/6.4)  8 (7.6/6.7)  0 (0/0)  37 (35.2/7.1)  13 (12.4/7)  10 (9.5/5.7)  105 (100/6.5)
Single, never married, not living with a life partner  23 (12/11.4)  4 (2.1/14.3)  43 (22.4/11.5)  16 (8.3/13.3)  1 (0.5/5.6)  64 (33.3/12.2)  17 (8.9/9.1)  24 (12.5/13.7)  192 (100/11.8)
Widowed  4 (9.1/2)  0 (0/0)  10 (22.7/2.7)  3 (6.8/2.5)  0 (0/0)  17 (38.6/3.3)  5 (11.4/2.7)  5 (11.4/2.9)  44 (100/2.7)
Total  202 (12.4/100)  28 (1.7/100)  374 (23/100)  120 (7.4/100)  18 (1.1/100)  523 (32.1/100)  187 (11.5/100)  175 (10.8/100)  1627 (100/100)
χ2=30.855 · df=35 · Cramer’s V=0.062 · Fisher’s p=0.774

22.6 Segments by Gender

The split between men and women is close to 1:1 in almost every segment, but in Segment 5 it is roughly 1:2 (6 men to 12 women), although the sample size for that segment is very small (18 respondents in this table).

sjt.xtab(df_final_L1_L2$Sex, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Each cell: count (row % / column %).

Sex  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7  Seg8  Total
Female  106 (12.8/51)  15 (1.8/55.6)  193 (23.3/52.2)  57 (6.9/49.6)  12 (1.4/66.7)  257 (31/49.8)  95 (11.5/49.2)  93 (11.2/54.7)  828 (100/51.2)
Male  102 (12.9/49)  12 (1.5/44.4)  177 (22.4/47.8)  58 (7.4/50.4)  6 (0.8/33.3)  259 (32.8/50.2)  98 (12.4/50.8)  77 (9.8/45.3)  789 (100/48.8)
Total  208 (12.9/100)  27 (1.7/100)  370 (22.9/100)  115 (7.1/100)  18 (1.1/100)  516 (31.9/100)  193 (11.9/100)  170 (10.5/100)  1617 (100/100)
χ2=3.733 · df=7 · Cramer’s V=0.048 · p=0.810
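
The within-segment gender split is easier to scan as a column-proportion table; a minimal sketch in base R:

# Share of women and men within each Level-1 segment (each column sums to 1).
round(prop.table(table(df_final_L1_L2$Sex, df_final_L1_L2$L1_Segments), margin = 2), 2)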

22.7 Segments by Income

Income looks roughly normally distributed within almost every segment, with a mean of around $60k.

# Relabel the 13 income bands; note that levels<- assigns the new labels in the factor's existing level order.
levels(df_final_L1_L2$Income) <- c("Less than $20,000",
                                   "$20000 to $29999",
                                   "$30000 to $39999","$40000 to $49999",
                                   "$50000 to $59999",
                                   "$60000 to $69999",
                                   "$70000 to $79999",
                                   "$80000 to $89999",
                                   "$90000 to $99999",
                                   "$100000 to $149999",
                                   "$150,000 to $174,999",
                                   "$175,000 to $199,999",
                                   "$200,000 or more")

sjt.xtab(df_final_L1_L2$Income, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Each cell: count (row % / column %).

Income  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7  Seg8  Total
Less than $20,000  6 (11.3/3)  0 (0/0)  17 (32.1/4.6)  7 (13.2/6)  1 (1.9/4.8)  15 (28.3/2.9)  4 (7.5/2.1)  3 (5.7/1.7)  53 (100/3.2)
$20000 to $29999  3 (6/1.5)  1 (2/3.7)  7 (14/1.9)  3 (6/2.6)  1 (2/4.8)  21 (42/4)  7 (14/3.6)  7 (14/3.9)  50 (100/3.1)
$30000 to $39999  7 (10.1/3.5)  1 (1.4/3.7)  21 (30.4/5.6)  5 (7.2/4.3)  1 (1.4/4.8)  22 (31.9/4.2)  7 (10.1/3.6)  5 (7.2/2.8)  69 (100/4.2)
$40000 to $49999  5 (11.6/2.5)  0 (0/0)  12 (27.9/3.2)  0 (0/0)  2 (4.7/9.5)  15 (34.9/2.9)  6 (14/3.1)  3 (7/1.7)  43 (100/2.6)
$50000 to $59999  35 (14.6/17.3)  4 (1.7/14.8)  56 (23.4/15)  16 (6.7/13.7)  3 (1.3/14.3)  72 (30.1/13.7)  22 (9.2/11.3)  31 (13/17.4)  239 (100/14.6)
$60000 to $69999  27 (9.9/13.4)  5 (1.8/18.5)  56 (20.4/15)  16 (5.8/13.7)  6 (2.2/28.6)  93 (33.9/17.7)  38 (13.9/19.5)  33 (12/18.5)  274 (100/16.7)
$70000 to $79999  16 (7.9/7.9)  4 (2/14.8)  49 (24.1/13.1)  23 (11.3/19.7)  3 (1.5/14.3)  57 (28.1/10.8)  25 (12.3/12.8)  26 (12.8/14.6)  203 (100/12.4)
$80000 to $89999  21 (11.5/10.4)  1 (0.5/3.7)  35 (19.2/9.4)  11 (6/9.4)  1 (0.5/4.8)  66 (36.3/12.5)  26 (14.3/13.3)  21 (11.5/11.8)  182 (100/11.1)
$90000 to $99999  25 (14.2/12.4)  5 (2.8/18.5)  44 (25/11.8)  12 (6.8/10.3)  0 (0/0)  49 (27.8/9.3)  21 (11.9/10.8)  20 (11.4/11.2)  176 (100/10.7)
$100000 to $149999  17 (15/8.4)  2 (1.8/7.4)  21 (18.6/5.6)  10 (8.8/8.5)  1 (0.9/4.8)  40 (35.4/7.6)  17 (15/8.7)  5 (4.4/2.8)  113 (100/6.9)
$150,000 to $174,999  17 (17.3/8.4)  2 (2/7.4)  22 (22.4/5.9)  6 (6.1/5.1)  1 (1/4.8)  29 (29.6/5.5)  11 (11.2/5.6)  10 (10.2/5.6)  98 (100/6)
$175,000 to $199,999  10 (15.6/5)  1 (1.6/3.7)  12 (18.8/3.2)  5 (7.8/4.3)  0 (0/0)  24 (37.5/4.6)  4 (6.2/2.1)  8 (12.5/4.5)  64 (100/3.9)
$200,000 or more  13 (17.3/6.4)  1 (1.3/3.7)  21 (28/5.6)  3 (4/2.6)  1 (1.3/4.8)  23 (30.7/4.4)  7 (9.3/3.6)  6 (8/3.4)  75 (100/4.6)
Total  202 (12.3/100)  27 (1.6/100)  373 (22.8/100)  117 (7.1/100)  21 (1.3/100)  526 (32.1/100)  195 (11.9/100)  178 (10.9/100)  1639 (100/100)
χ2=76.727 · df=84 · Cramer’s V=0.082 · Fisher’s p=0.606

22.8 Segments by Age

sjt.xtab(df_final_L1_L2$Age, df_final_L1_L2$L1_Segments,
         show.row.prc = TRUE, show.col.prc = TRUE)
Each cell: count (row % / column %).

Age  Seg1  Seg2  Seg3  Seg4  Seg5  Seg6  Seg7  Seg8  Total
24 or younger  32 (12.1/15.6)  8 (3/28.6)  62 (23.4/16.7)  19 (7.2/16.2)  5 (1.9/23.8)  90 (34/17.3)  24 (9.1/13)  25 (9.4/14.2)  265 (100/16.3)
25 to 34  9 (9.7/4.4)  1 (1.1/3.6)  22 (23.7/5.9)  6 (6.5/5.1)  3 (3.2/14.3)  30 (32.3/5.8)  9 (9.7/4.9)  13 (14/7.4)  93 (100/5.7)
35 to 44  29 (14.4/14.1)  5 (2.5/17.9)  49 (24.4/13.2)  11 (5.5/9.4)  1 (0.5/4.8)  56 (27.9/10.8)  27 (13.4/14.6)  23 (11.4/13.1)  201 (100/12.4)
45 to 54  34 (14.5/16.6)  3 (1.3/10.7)  52 (22.1/14)  18 (7.7/15.4)  5 (2.1/23.8)  75 (31.9/14.5)  25 (10.6/13.5)  23 (9.8/13.1)  235 (100/14.5)
55 to 64  27 (12.4/13.2)  3 (1.4/10.7)  45 (20.6/12.1)  14 (6.4/12)  2 (0.9/9.5)  72 (33/13.9)  31 (14.2/16.8)  24 (11/13.6)  218 (100/13.4)
65 to 74  36 (10.9/17.6)  6 (1.8/21.4)  75 (22.7/20.2)  28 (8.5/23.9)  3 (0.9/14.3)  98 (29.7/18.9)  43 (13/23.2)  41 (12.4/23.3)  330 (100/20.3)
75 to 84  32 (15.5/15.6)  2 (1/7.1)  48 (23.3/12.9)  14 (6.8/12)  2 (1/9.5)  70 (34/13.5)  19 (9.2/10.3)  19 (9.2/10.8)  206 (100/12.7)
85 to 94  6 (8.3/2.9)  0 (0/0)  17 (23.6/4.6)  7 (9.7/6)  0 (0/0)  27 (37.5/5.2)  7 (9.7/3.8)  8 (11.1/4.5)  72 (100/4.4)
95 or older  0 (0/0)  0 (0/0)  1 (50/0.3)  0 (0/0)  0 (0/0)  1 (50/0.2)  0 (0/0)  0 (0/0)  2 (100/0.1)
Total  205 (12.6/100)  28 (1.7/100)  371 (22.9/100)  117 (7.2/100)  21 (1.3/100)  519 (32/100)  185 (11.4/100)  176 (10.9/100)  1622 (100/100)
χ2=34.961 · df=56 · Cramer’s V=0.055 · Fisher’s p=0.991

22.8.1 Age histograms

Here we plot the distribution of customer age within each segment.

ggplot(df_final_L1_L2, aes(x = Age, fill = L1_Segments)) +
  geom_bar() +                       # counts per age band
  facet_grid(L1_Segments ~ .) +      # one panel per Level-1 segment
  labs(x = "Age group", y = "Count", fill = "Segment")
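
Because the segments differ greatly in size (Segment 6 has over 500 customers while Segment 5 has about 20), within-segment proportions are easier to compare than raw counts. A minimal sketch with dplyr, assuming the same columns as above:

library(dplyr)

# Share of each age band within each Level-1 segment.
age_mix <- df_final_L1_L2 %>%
  count(L1_Segments, Age) %>%
  group_by(L1_Segments) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

ggplot(age_mix, aes(x = Age, y = prop)) +
  geom_col() +
  facet_grid(L1_Segments ~ .) +
  labs(x = "Age group", y = "Share of segment")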

23 Personification of Segments

  • Segment 6: have more than $60k income and are either married or living with a common-law partner. No particular pattern in education.

  • Segment 5: the majority were employed part-time, a larger share than in any other segment, and there were more women than men.

24 Marketing Strategy based on Segments

The demographic variables were not informative for characterizing the segments, so we will use the Level-1 attitudinal variables to define a marketing strategy.

It is important to identify what drove the Satisfaction and Recommend ratings in each segment. For example, in Segment 7 the service-quality ratings were near the top, but the flight-option variables received only middling scores, and that pulled down customers' overall perception of the airline; this segment is more sensitive to flight options.
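
One way to quantify these sensitivities is to correlate each attribute with Satisfaction (or Recommend) within each segment; attributes with the strongest within-segment correlations are the likeliest levers for that segment. A minimal sketch with dplyr, assuming df_final_L1_L2 also carries the attitudinal ratings under the names used in the segment-means table above:

library(dplyr)

drivers <- c("Ticket_Prices", "Flight_Options", "Overhead_Storage",
             "Seat_Comfort", "Service")

# Within-segment correlation of each candidate driver with Satisfaction.
df_final_L1_L2 %>%
  group_by(L1_Segments) %>%
  summarise(across(all_of(drivers),
                   ~ cor(.x, Satisfaction, use = "pairwise.complete.obs")))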

There are similar sensitivities in other segments: Segments 1 and 4 rated Overhead_Storage low, and Segment 7 was also more concerned about Ticket_Prices and Flight_Options.

A simple marketing strategy would start by identifying our best customer segments, which are Segments 3, 6 and 8. These customers seem comfortable with the airline and the most eager to return, and they also make up three of the most populated segments, which is good news.

Beyond that, we can try to address the so-called 'sensitivities' of the other segments, as described above, and highlight those improvements in advertising targeted at those groups.