knitr::include_graphics(path = "./airplane.jpg")
Passenger loyalty is fundamental to any airline aiming to maintain a stable market share and revenue stream (Chang and Hung, 2015), particularly in a turbulent market. The competitive landscape of the global airline industry has been in constant change in recent years, with the rapid growth of low-cost carriers and high-speed rail, rising fuel costs, fluctuating demand, and tighter security, safety and quality requirements. That is before even considering a global pandemic such as COVID-19 and its effects on airlines. To survive and grow, airline managers need to identify the aspects of their services that satisfy and retain customers (Chen, 2008).
The key objective of this report is to segment the airline's customers based on the key driver (attitudinal) variables. The analysis proceeds in five steps:
1- k-means clustering on the attitudinal variables (Level-1 segmentation)
2- A predictive model for the Level-1 segments
3- Using the demographic variables to predict the Level-1 segments (Level-2 segmentation)
4- Interpreting the demographic data in each segment (Level-2 segmentation)
5- Implementing a marketing strategy based on Level-1 and Level-2
df <- read.csv("Airline_Key_Drivers_mimv.csv")
headTail(df) %>% datatable(rownames = F, filter="top", options = list(pageLength = 10, scrollX=T), caption = "Airline data")
Converting the data to an h2o object.
# Convert the target variables to factors
df[,c("FlyAgain", "FlyAgain2")] <- lapply(df[,c("FlyAgain", "FlyAgain2")], as.factor)
df.h2o <- as.h2o(df)
Running k-means clustering on the 14 attitudinal variables (columns 2 to 15).
nSeg <- 2
clus.2 <- h2o.kmeans(df.h2o[2:15], k = nSeg, estimate_k = F,
init=c("PlusPlus"), standardize= FALSE,
score_each_iteration= FALSE, seed = 7238,
keep_cross_validation_predictions=TRUE)
h2o.cluster_sizes(clus.2)
[1] 1718 50
frq(df$FlyAgain2, out = 'v', title = "Frequency of Customer Return to Airline")
val | frq | raw.prc | valid.prc | cum.prc |
---|---|---|---|---|
No | 706 | 39.93 | 39.93 | 39.93 |
Yes | 1062 | 60.07 | 60.07 | 100.00 |
NA | 0 | 0.00 | NA | NA |
total N=1768 · valid N=1768 · x̄=1.60 · σ=0.49
Saving the metrics of each clustering solution so that the best one can be selected at the end. These metrics are sums of squares (SS): the total within-cluster SS (distances between points and their own cluster centre), the between-cluster SS, and the total SS, which equals their sum.
clustMetrics <- data.frame(Numbr_Segments=numeric(),
TotWithinSS=numeric(),
BetweenSS=numeric(),
TotSS=numeric(),
stringsAsFactors=FALSE)
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.2)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.2)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.2)
tab_df(clustMetrics)
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
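As a sanity check on these metrics, the within-cluster and between-cluster sums of squares should add up to the total (47737.14 + 21589.53 = 69326.67). A one-line check, reusing the same extractor functions used above:
# The total SS should equal the within-cluster SS plus the between-cluster SS
all.equal(getTotWithinSS(clus.2) + getBetweenSS(clus.2), getTotSS(clus.2))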
The following presents just the cluster means (centroids), with the centroid numbering.
clus.2@model$centers
Cluster Means:
centroid easy_reservation preferred_seats flight_options ticket_prices
1 1 8.331781 7.281723 7.096042 6.856810
2 2 2.640000 1.460000 1.580000 1.320000
seat_comfort seat_roominess overhead_storage clean_aircraft courtesy
1 8.004657 7.674622 7.295111 7.943539 8.440047
2 2.420000 2.000000 1.740000 2.120000 2.620000
friendliness helpfulness service satisfaction recommend
1 8.345169 8.251455 8.232247 7.992433 7.231665
2 2.780000 2.980000 2.820000 2.080000 1.580000
Plotting the differences in attribute means between the clusters.
library(reshape2) # it's necessary to reshape the data into 'long' format
df2c_long<- melt(clus.2@model$centers ) # need to reshape to 'long' form
#df2c_long
ggplot(data=df2c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)
Complete layering between segments (one segment scoring uniformly below the other on every variable) is not very informative, so we move on to 3 segments.
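The chunks that fit the larger solutions (clus.3 through clus.9) are not echoed in this report; each is assumed to mirror the 2-segment call above with nSeg updated. A minimal sketch of the assumed 3-segment fit:
nSeg <- 3
# Assumed to match the clus.2 call above, with k increased to 3
clus.3 <- h2o.kmeans(df.h2o[2:15], k = nSeg, estimate_k = F,
init=c("PlusPlus"), standardize= FALSE,
score_each_iteration= FALSE, seed = 7238,
keep_cross_validation_predictions=TRUE)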
h2o.cluster_sizes(clus.3)
[1] 612 50 1106
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.3)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.3)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.3)
tab_df(clustMetrics, out= 'v')
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
3 | 38797.92 | 30528.75 | 69326.67 |
df3c_long<- melt(clus.3@model$centers ) # need to reshape to 'long' form
ggplot(data=df3c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)
Still layered. The higher-attitude segment has simply been cut into two segments. We add another segment to see whether more structure emerges.
h2o.cluster_sizes(clus.4)
[1] 433 50 340 945
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.4)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.4)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.4)
tab_df(clustMetrics, out= 'v')
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
3 | 38797.92 | 30528.75 | 69326.67 |
4 | 35964.73 | 33361.94 | 69326.67 |
df4c_long<- melt(clus.4@model$centers ) # need to reshape to 'long' form
ggplot(data=df4c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)
There is a crossover in the seat_comfort and recommend variables between segments 1 and 3; however, we still increase the number of segments to extract more information.
h2o.cluster_sizes(clus.5)
[1] 359 22 924 29 434
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.5)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.5)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.5)
tab_df(clustMetrics, out= 'v')
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
3 | 38797.92 | 30528.75 | 69326.67 |
4 | 35964.73 | 33361.94 | 69326.67 |
5 | 35712.45 | 33614.22 | 69326.67 |
df5c_long<- melt(clus.5@model$centers ) # need to reshape to 'long' form
ggplot(data=df5c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14)) +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4)
h2o.cluster_sizes(clus.6)
[1] 572 29 613 204 21 329
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.6)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.6)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.6)
tab_df(clustMetrics, out= 'v')
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
3 | 38797.92 | 30528.75 | 69326.67 |
4 | 35964.73 | 33361.94 | 69326.67 |
5 | 35712.45 | 33614.22 | 69326.67 |
6 | 33648.88 | 35677.79 | 69326.67 |
df6c_long <- melt(clus.6@model$centers) # reshape the 6-cluster centroids to 'long' form
ggplot(data=df6c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14) , legend.position = "bottom") +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
scale_color_discrete(name = "clusters",
labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.6), ")" ))
h2o.cluster_sizes(clus.7)
[1] 461 31 172 224 207 654 19
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.7)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.7)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.7)
tab_df(clustMetrics, out= 'v')
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
3 | 38797.92 | 30528.75 | 69326.67 |
4 | 35964.73 | 33361.94 | 69326.67 |
5 | 35712.45 | 33614.22 | 69326.67 |
6 | 33648.88 | 35677.79 | 69326.67 |
7 | 32044.81 | 37281.86 | 69326.67 |
df7c_long <- melt(clus.7@model$centers) # reshape the 7-cluster centroids to 'long' form
ggplot(data=df7c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14), legend.position = "bottom") +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
scale_color_discrete(name = "clusters",
labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.7), ")" ))
h2o.cluster_sizes(clus.8)
[1] 222 29 403 125 21 570 206 192
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.8)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.8)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.8)
tab_df(clustMetrics, out= 'v')
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
3 | 38797.92 | 30528.75 | 69326.67 |
4 | 35964.73 | 33361.94 | 69326.67 |
5 | 35712.45 | 33614.22 | 69326.67 |
6 | 33648.88 | 35677.79 | 69326.67 |
7 | 32044.81 | 37281.86 | 69326.67 |
8 | 30940.73 | 38385.94 | 69326.67 |
df8c_long <- melt(clus.8@model$centers) # reshape the 8-cluster centroids to 'long' form
g8 <- ggplot(data=df8c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14) , legend.position = "bottom") +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
scale_color_discrete(name = "clusters",
labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.8), ")" ))
g8
This seems like a decent segmentation, but before finalizing it let's look at a 9-cluster model too.
h2o.cluster_sizes(clus.9)
[1] 243 29 111 327 151 514 21 122 250
clustMetrics[nSeg - 1, 1] <- nSeg
clustMetrics[nSeg - 1, 2] <- getTotWithinSS(clus.9)
clustMetrics[nSeg - 1, 3] <- getBetweenSS(clus.9)
clustMetrics[nSeg - 1, 4] <- getTotSS(clus.9)
tab_df(clustMetrics, out= 'v')
Numbr_Segments | TotWithinSS | BetweenSS | TotSS |
---|---|---|---|
2 | 47737.14 | 21589.53 | 69326.67 |
3 | 38797.92 | 30528.75 | 69326.67 |
4 | 35964.73 | 33361.94 | 69326.67 |
5 | 35712.45 | 33614.22 | 69326.67 |
6 | 33648.88 | 35677.79 | 69326.67 |
7 | 32044.81 | 37281.86 | 69326.67 |
8 | 30940.73 | 38385.94 | 69326.67 |
9 | 30457.74 | 38868.93 | 69326.67 |
df9c_long <- melt(clus.9@model$centers) # reshape the 9-cluster centroids to 'long' form
ggplot(data=df9c_long, aes(x=variable, y=value, group=centroid)) +
geom_line(aes( color= centroid ), size=1.2)+
geom_point(aes( color= centroid ), size=3) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size=14) , legend.position = "bottom") +
labs( title= "Centroids (means) of Segments over Basis Variables",
x= "Basis Variables", y= "Means (1-9)") +
ylim(1, 9) +
annotate("text", x = 3, y = 1, label = paste("Base =", nrow(df)), size=4) +
scale_color_discrete(name = "clusters",
labels = paste("cluster", c(1:nSeg), " (", "size=", h2o.cluster_sizes(clus.9), ")" ))
The 9-segment clustering seems too complicated to be interpretable. Therefore, we carry the 8-segment solution forward to the next steps of this analysis.
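Before moving on, the accumulated clustMetrics table can also be examined graphically. The small sketch below (an added aid, using the ggplot2 functions already loaded) draws a simple elbow plot of TotWithinSS against the number of segments, which helps show where additional segments yield diminishing returns:
# Elbow plot of within-cluster SS against the number of segments
ggplot(clustMetrics, aes(x = Numbr_Segments, y = TotWithinSS)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(title = "Within-cluster sum of squares by number of segments",
x = "Number of segments", y = "Total within-cluster SS")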
These segment assignments are obtained by predicting each respondent's segment from the 8-segment clustering results. Below are the predicted segment assignments for the 8-segment solution.
clusters.hex <- h2o.predict(clus.8, df.h2o)
clusters.hex <- as.factor(clusters.hex) # cluster 'names' must be factors for modeling
clusters.hex$predict # segment assignments for the first few respondents (of 1,768 in total)
This table shows the segment sizes printed earlier, in a slightly different format.
h2o.table(clusters.hex$predict) # table of assignments over all respondents
predict Count
1 0 222
2 1 29
3 2 403
4 3 125
5 4 21
6 5 570
[8 rows x 2 columns]
We change the levels of the segments to more readable names ("Seg1" through "Seg8"). After that, the segment column is added to a new data frame along with all the other variables in the airline data set. This new data frame will be used for predictive modeling in the next sections.
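The relabelling and merging code is not echoed above. A rough sketch of what it is assumed to do, using h2o.setLevels() to rename the predicted clusters and h2o.cbind() to build the df8C.class frame used below (this assumes the first column of df.h2o is the respondent ID; the hidden chunk may also rename columns, as the capitalized variable names in the later output suggest):
# Assumed step: relabel the predicted clusters 0-7 as Seg1-Seg8
clusters.hex$ClustSeg <- h2o.setLevels(clusters.hex$predict, paste0("Seg", 1:8))
# Assumed step: combine the respondent ID, the segment label and all other variables
df8C.class <- h2o.cbind(df.h2o[, 1], clusters.hex$ClustSeg, df.h2o[, 2:ncol(df.h2o)])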
Below are the new segment names and their counts. Only segments 1-6 are printed, most likely because the default print method for an H2OFrame shows just the first few rows; wrapping the call in as.data.frame() would display all 8 rows.
h2o.table(clusters.hex$predict)[1:8,] # table of assignments over all respondents
predict Count
1 Seg1 222
2 Seg2 29
3 Seg3 403
4 Seg4 125
5 Seg5 21
6 Seg6 570
[8 rows x 2 columns]
As mentioned above, I’ll continue investigating the 8-segment solution and build a predictive model. There are several model forms that could be used. These include random forests, logistic regression, gradient boosting and deep learning.
I’ll first build a prediction model using the random forest method.
I want to find the best model for predicting segment membership for each segmentation solution that I consider to be a serious candidate as the best segmentation.
This is still part of the level 1 analysis, i.e., using the attitudinal data prior to bringing in demographics and other covariates.
Some form of cross-validation is absolutely necessary. I'll use the relatively straightforward procedure of splitting the overall sample into training (70%) and testing (validation) data sets. The 70% is not a required number, but it is often used.
Below, the training sample is formed using the "h2o.splitFrame()" function; the predictive models will be built on it. That function takes the result of the 8-segment k-means analysis, the df8C.class data frame, and randomly splits it into a training sample and a testing sample. The "ClustSeg" column in the table below holds the segment into which each respondent was assigned by the k-means algorithm. The random forest, logistic regression and gradient boosting models will now be developed to predict those segment assignments. The best of these models can be used to score the customer database, provided the predictor variables are present in that database.
fs.split<- h2o.splitFrame(df8C.class, ratios=c(0.7) )
fs.split[[1]]
RID ClustSeg Easy_Reservation Preferred_Seats Flight_Options Ticket_Prices
1 1 Seg2 1 1 1 2
2 2 Seg5 1 1 1 1
3 4 Seg2 1 1 1 1
4 5 Seg5 1 1 3 1
5 6 Seg5 1 1 1 1
6 7 Seg5 3 1 2 2
Seat_Comfort Seat_Roominess Overhead_Storage Clean_Aircraft Courtesy
1 5 4 4 3 1
2 5 4 3 3 5
3 1 3 4 2 1
4 1 1 1 1 7
5 4 6 1 2 2
6 1 1 3 4 4
Friendliness Helpfulness Service Satisfaction Recommend Language Smoker
1 1 1 1 2 1 English No
2 3 6 4 2 1 English No
3 1 1 1 1 2 Spanish No
4 3 4 5 1 1 English No
5 2 2 5 3 1 French No
6 4 4 4 4 1 English No
Employment Education
1 Employed part-time Bachelor's degree
2 <NA> University certificate below bachelor level
3 Employed part-time Bachelor's degree
4 Not employed High school diploma or certificate
5 Employed part-time High school diploma or certificate
6 Not employed No certificate, diploma or degree
Marital Sex Age
1 Married Female 24 or younger
2 Living common law Female 24 or younger
3 Living common law Male 65 to 74
4 Living common law Male 55 to 64
5 Living common law Female 65 to 74
6 Single, never married, not living with a life partner Male 65 to 74
Income FlyAgain FlyAgain2 Easy_Reservation.1 Preferred_Seats.1
1 $40000 to $49999 0 No Strongly Disagree Strongly Disagree
2 $40000 to $49999 0 No Strongly Disagree Strongly Disagree
3 $30000 to $39999 0 No Strongly Disagree Strongly Disagree
4 $30000 to $39999 0 No Strongly Disagree Strongly Disagree
5 $175,000 to $199,999, 0 No Strongly Disagree Strongly Disagree
6 Less than $20,000 0 No 3 Strongly Disagree
Flight_Options.1 Ticket_Prices.1 Seat_Comfort.1 Seat_Roominess.1
1 Strongly Disagree 2 5 4
2 Strongly Disagree Strongly Disagree 5 4
3 Strongly Disagree Strongly Disagree Strongly Disagree 3
4 3 Strongly Disagree Strongly Disagree Strongly Disagree
5 Strongly Disagree Strongly Disagree 4 6
6 2 2 Strongly Disagree Strongly Disagree
Overhead_Storage.1 Clean_Aircraft.1 Courtesy.1 Friendliness.1
1 4 3 Strongly Disagree Strongly Disagree
2 3 3 5 3
3 4 2 Strongly Disagree Strongly Disagree
4 Strongly Disagree Strongly Disagree 7 3
5 Strongly Disagree 2 2 2
6 3 4 4 4
Helpfulness.1 Service.1
1 Strongly Disagree Strongly Disagree
2 6 4
3 Strongly Disagree Strongly Disagree
4 4 5
5 2 5
6 4 4
[1234 rows x 38 columns]
The testing (holdout/validation) sample also has a column indicating the assigned segment for each individual.
fs.split[[2]]
RID ClustSeg Easy_Reservation Preferred_Seats Flight_Options Ticket_Prices
1 3 Seg2 1 1 3 1
2 9 Seg5 1 1 1 2
3 11 Seg2 3 1 1 1
4 14 Seg5 5 1 1 3
5 16 Seg5 5 2 3 1
6 17 Seg2 3 1 1 1
Seat_Comfort Seat_Roominess Overhead_Storage Clean_Aircraft Courtesy
1 2 1 1 1 4
2 2 1 1 4 4
3 1 1 1 1 3
4 1 2 1 2 3
5 2 1 1 4 6
6 1 1 1 1 5
Friendliness Helpfulness Service Satisfaction Recommend Language Smoker
1 5 5 1 1 1 <NA> No
2 1 7 5 5 1 English Yes
3 2 4 4 3 1 English Yes
4 6 3 1 4 2 <NA> Yes
5 4 7 5 4 3 English No
6 1 2 1 2 1 French No
Employment Education
1 Not employed Bachelor's degree
2 Fully employed No certificate, diploma or degree
3 Fully employed High school diploma or certificate
4 Fully employed trades certificate
5 Employed part-time High school diploma or certificate
6 Not employed College diploma
Marital Sex Age
1 Single, never married, not living with a life partner Female 65 to 74
2 Living common law Female 75 to 84
3 Separated, not living with a life partner Female 24 or younger
4 Married <NA> 24 or younger
5 Living common law <NA> 65 to 74
6 Living common law Male 45 to 54
Income FlyAgain FlyAgain2 Easy_Reservation.1 Preferred_Seats.1
1 $175,000 to $199,999, 0 No Strongly Disagree Strongly Disagree
2 $40000 to $49999 0 No Strongly Disagree Strongly Disagree
3 $30000 to $39999 0 No 3 Strongly Disagree
4 $50000 to $59999 0 No 5 Strongly Disagree
5 $150,000 to $174,999 0 No 5 2
6 $30000 to $39999 0 No 3 Strongly Disagree
Flight_Options.1 Ticket_Prices.1 Seat_Comfort.1 Seat_Roominess.1
1 3 Strongly Disagree 2 Strongly Disagree
2 Strongly Disagree 2 2 Strongly Disagree
3 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
4 Strongly Disagree 3 Strongly Disagree 2
5 3 Strongly Disagree 2 Strongly Disagree
6 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
Overhead_Storage.1 Clean_Aircraft.1 Courtesy.1 Friendliness.1
1 Strongly Disagree Strongly Disagree 4 5
2 Strongly Disagree 4 4 Strongly Disagree
3 Strongly Disagree Strongly Disagree 3 2
4 Strongly Disagree 2 3 6
5 Strongly Disagree 4 6 4
6 Strongly Disagree Strongly Disagree 5 Strongly Disagree
Helpfulness.1 Service.1
1 5 Strongly Disagree
2 7 5
3 4 4
4 3 Strongly Disagree
5 7 5
6 2 Strongly Disagree
[534 rows x 38 columns]
Random forest is a flexible and widely used methodology. As the name implies, it is an ensemble of decision trees built by repeatedly splitting the data. Instead of building just one tree, the model below builds up to 200 trees, with early stopping deciding when the forest is sufficiently accurate.
df8C.class_rf <- h2o.randomForest( ## h2o.randomForest function
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=3:16, ## the training sample predictor columns, by column index
y=2, ## the target index (what we are predicting)
model_id = "Random_Forest", ## name the model in H2O
## not required, but helps use Flow
ntrees = 200, ## use a maximum of 200 trees to create the
## random forest model. The default is 50.
## I have increased it because I will let
## the early stopping criteria decide when
## the random forest is sufficiently accurate
stopping_rounds = 2, ## Stop fitting new trees when the 2-tree
## average is within 0.001 (default) of
## the prior two 2-tree averages.
## Can be thought of as a convergence setting
score_each_iteration = T, ## Predict against training and validation for
## each tree. Default will skip several.
seed = 1000000) ## Set the random seed so that this result can be reproduced.
The simple “summary()” function can produce a great deal of information about the model. This might be a bit too much to have in one chunk.
summary(df8C.class_rf)
Model Details:
==============
H2OMultinomialModel: drf
Model Key: Random_Forest
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 22 176 181731 2
max_depth mean_depth min_leaves max_leaves mean_leaves
1 20 11.17046 3 179 78.04546
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_18")`
MSE: (Extract with `h2o.mse`) 0.1982057
RMSE: (Extract with `h2o.rmse`) 0.445203
Logloss: (Extract with `h2o.logloss`) 1.287398
Mean Per-Class Error: 0.2517982
R^2: (Extract with `h2o.r2`) 0.9593309
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 95 2 13 4 0 24 9 10 0.3949 = 62 / 157
Seg2 0 17 0 0 1 0 0 0 0.0556 = 1 / 18
Seg3 3 0 238 1 0 28 12 9 0.1821 = 53 / 291
Seg4 3 0 2 55 0 0 10 15 0.3529 = 30 / 85
Seg5 0 5 0 0 8 0 0 0 0.3846 = 5 / 13
Seg6 6 0 14 0 0 370 0 3 0.0585 = 23 / 393
Seg7 6 0 29 2 0 0 105 5 0.2857 = 42 / 147
Seg8 3 0 17 3 0 15 1 91 0.3000 = 39 / 130
Totals 116 24 313 65 9 437 137 133 0.2066 = 255 / 1,234
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.793355
2 2 0.926256
3 3 0.954619
4 4 0.962723
5 5 0.964344
6 6 0.965964
7 7 0.965964
8 8 1.000000
H2OMultinomialMetrics: drf
** Reported on validation data. **
Validation Set Metrics:
=====================
Extract validation frame with `h2o.getFrame("RTMP_sid_b243_20")`
MSE: (Extract with `h2o.mse`) 0.1876194
RMSE: (Extract with `h2o.rmse`) 0.4331505
Logloss: (Extract with `h2o.logloss`) 0.5385996
Mean Per-Class Error: 0.238413
R^2: (Extract with `h2o.r2`) 0.961275
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 46 0 2 0 0 8 7 2 0.2923 = 19 / 65
Seg2 0 8 0 0 3 0 0 0 0.2727 = 3 / 11
Seg3 2 0 96 0 0 11 3 0 0.1429 = 16 / 112
Seg4 3 0 0 29 0 0 5 3 0.2750 = 11 / 40
Seg5 0 1 0 1 6 0 0 0 0.2500 = 2 / 8
Seg6 0 0 0 0 0 177 0 0 0.0000 = 0 / 177
Seg7 5 0 10 1 0 0 43 0 0.2712 = 16 / 59
Seg8 6 0 11 1 0 6 1 37 0.4032 = 25 / 62
Totals 62 9 119 32 9 202 59 42 0.1723 = 92 / 534
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.827715
2 2 0.940075
3 3 0.981273
4 4 0.988764
5 5 0.990637
6 6 0.990637
7 7 0.990637
8 8 1.000000
Scoring History:
timestamp duration number_of_trees training_rmse training_logloss
1 2020-11-26 10:01:36 0.003 sec 0 NA NA
2 2020-11-26 10:01:36 0.013 sec 1 0.65195 13.52251
3 2020-11-26 10:01:36 0.024 sec 2 0.62843 12.21591
4 2020-11-26 10:01:36 0.034 sec 3 0.59609 10.43573
5 2020-11-26 10:01:36 0.046 sec 4 0.57458 9.18947
training_classification_error validation_rmse validation_logloss
1 NA NA NA
2 0.44105 0.64458 12.63004
3 0.41083 0.53233 6.55984
4 0.38397 0.49046 3.94032
5 0.36837 0.47017 2.81891
validation_classification_error
1 NA
2 0.46629
3 0.36330
4 0.30899
5 0.26404
---
timestamp duration number_of_trees training_rmse
18 2020-11-26 10:01:36 0.256 sec 17 0.45407
19 2020-11-26 10:01:36 0.282 sec 18 0.45176
20 2020-11-26 10:01:36 0.304 sec 19 0.44884
21 2020-11-26 10:01:36 0.329 sec 20 0.44702
22 2020-11-26 10:01:36 0.352 sec 21 0.44637
23 2020-11-26 10:01:36 0.375 sec 22 0.44520
training_logloss training_classification_error validation_rmse
18 1.66526 0.23177 0.43337
19 1.55729 0.22366 0.43175
20 1.42207 0.21556 0.43284
21 1.34107 0.21232 0.43247
22 1.34019 0.21556 0.43298
23 1.28740 0.20665 0.43315
validation_logloss validation_classification_error
18 0.60841 0.18165
19 0.59516 0.17041
20 0.53911 0.17416
21 0.53846 0.16479
22 0.53924 0.17228
23 0.53860 0.17228
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Variable Importances:
variable relative_importance scaled_importance percentage
1 Flight_Options 1954.231812 1.000000 0.134817
2 Ticket_Prices 1739.467163 0.890103 0.120001
3 Preferred_Seats 1398.059082 0.715401 0.096449
4 Recommend 1341.355591 0.686385 0.092537
5 Overhead_Storage 1133.051514 0.579794 0.078166
6 Seat_Roominess 1074.776489 0.549974 0.074146
7 Seat_Comfort 963.105713 0.492831 0.066442
8 Service 942.373169 0.482222 0.065012
9 Helpfulness 785.609009 0.402004 0.054197
10 Satisfaction 717.310120 0.367055 0.049485
11 Clean_Aircraft 689.196289 0.352669 0.047546
12 Friendliness 638.604553 0.326780 0.044056
13 Easy_Reservation 586.051025 0.299888 0.040430
14 Courtesy 532.199463 0.272332 0.036715
While this might be interesting, it is best not to place too much credence on the hit-ratio for the training sample since that data was used to build the model.
h2o.confusionMatrix(df8C.class_rf, fs.split[[1]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 157 0 0 0 0 0 0 0 0.0000 = 0 / 157
Seg2 0 18 0 0 0 0 0 0 0.0000 = 0 / 18
Seg3 0 0 291 0 0 0 0 0 0.0000 = 0 / 291
Seg4 0 0 0 85 0 0 0 0 0.0000 = 0 / 85
Seg5 0 0 0 0 13 0 0 0 0.0000 = 0 / 13
Seg6 0 0 0 0 0 393 0 0 0.0000 = 0 / 393
Seg7 0 0 0 0 0 0 147 0 0.0000 = 0 / 147
Seg8 0 0 0 0 0 0 0 130 0.0000 = 0 / 130
Totals 157 18 291 85 13 393 147 130 0.0000 = 0 / 1,234
This confusion matrix shows the ability of the model to predict segment membership for data that was not used in the model construction. This is much better information for partially answering the question, “How good is your model?”
h2o.confusionMatrix(df8C.class_rf, fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 46 0 2 0 0 8 7 2 0.2923 = 19 / 65
Seg2 0 8 0 0 3 0 0 0 0.2727 = 3 / 11
Seg3 2 0 96 0 0 11 3 0 0.1429 = 16 / 112
Seg4 3 0 0 29 0 0 5 3 0.2750 = 11 / 40
Seg5 0 1 0 1 6 0 0 0 0.2500 = 2 / 8
Seg6 0 0 0 0 0 177 0 0 0.0000 = 0 / 177
Seg7 5 0 10 1 0 0 43 0 0.2712 = 16 / 59
Seg8 6 0 11 1 0 6 1 37 0.4032 = 25 / 62
Totals 62 9 119 32 9 202 59 42 0.1723 = 92 / 534
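The overall holdout accuracy implied by this matrix (1 - 0.1723, or about 0.83) can also be pulled directly from the hit-ratio table rather than read off the confusion matrix; a one-line sketch:
# Top-1 hit ratio on the validation split
h2o.hit_ratio_table(df8C.class_rf, valid = TRUE)[1, "hit_ratio"]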
A receptacle for diagnostic statistics is created below so that this information can be compared efficiently at the end of the predictive modeling.
modH <- data.frame(Prediction_model=character(),
hit_ratio=numeric(),
MSE=numeric(),
RMSE=numeric(),
logloss=numeric(),
mean_per_class_error=numeric(),
stringsAsFactors=FALSE)
modH[1, 1] <- "Random_forest"
modH[1, 2] <- df8C.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
modH[1, 3] <- df8C.class_rf@model$validation_metrics@metrics$MSE
modH[1, 4] <- df8C.class_rf@model$validation_metrics@metrics$RMSE
modH[1, 5] <- df8C.class_rf@model$validation_metrics@metrics$logloss
modH[1, 6] <- df8C.class_rf@model$validation_metrics@metrics$mean_per_class_error
h2o allows for extracting variable importance from the predictive model results. That information is graphed below using the ‘plotly’ package. These should be compared across the three models used.
rf_variable_importances <- as.data.frame(df8C.class_rf@model$variable_importances)
library(plotly)
plot_ly(rf_variable_importances,
y=reorder(rf_variable_importances$variable,
rf_variable_importances$percentage),
x = rf_variable_importances$percentage,
color = rf_variable_importances$variable,
type = 'bar', orientation = 'h') %>%
layout( title = "Variable Importance for the random forest model",
xaxis = list(title = "Percentage Importance"),
ylim=c(0,1),
margin = list(l = 120))
It's important to find the best model for use in the segmentation. This means comparing the random forest above with other methods such as logistic regression, gradient boosting and deep learning, examining the predictive ability of each type of model, and picking the best. The analyses that follow fit alternative models to the random forest.
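Deep learning is mentioned as a candidate but is not run in this report. A minimal sketch of how it could be added to the comparison with h2o.deeplearning(), using the same training and validation frames (the hidden-layer sizes and epochs below are illustrative, not tuned):
df8C.class_dl <- h2o.deeplearning(
training_frame = fs.split[[1]], ## same training frame as the other models
validation_frame = fs.split[[2]], ## same holdout frame
x=3:16, ## the attitudinal predictors
y=2, ## the ClustSeg target
hidden = c(50, 50), ## two small hidden layers (illustrative)
epochs = 20,
seed = 1234)
h2o.confusionMatrix(df8C.class_dl, valid = TRUE)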
The glm option in h2o, h2o.glm, fits a (multinomial) logistic regression when family = "multinomial". The following is a simple setup for running it on the 8-segment solution.
df8C.class_glm <- h2o.glm(
family= "multinomial",
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=3:16, ## the predictor columns, by column index
y=2,
lambda=0
)
Once again, the “summary()” function produces a lot of information. To find out how to reference parts of the output, run “str(df8C.class_glm)” and follow down that tree until you find what you want to extract.
summary(df8C.class_glm)
Model Details:
==============
H2OMultinomialModel: glm
Model Key: GLM_model_R_1605846379958_105
GLM Model: summary
family link regularization number_of_predictors_total
1 multinomial multinomial None 120
number_of_active_predictors number_of_iterations training_frame
1 112 50 RTMP_sid_b243_18
H2OMultinomialMetrics: glm
** Reported on training data. **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_18")`
MSE: (Extract with `h2o.mse`) 0.008719055
RMSE: (Extract with `h2o.rmse`) 0.09337588
Logloss: (Extract with `h2o.logloss`) 0.04084536
Mean Per-Class Error: 0.003244478
Null Deviance: (Extract with `h2o.nulldeviance`) 4323.575
Residual Deviance: (Extract with `h2o.residual_deviance`) 100.8064
R^2: (Extract with `h2o.r2`) 0.998211
AIC: (Extract with `h2o.aic`) NaN
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 156 0 0 0 0 1 0 0 0.0064 = 1 / 157
Seg2 0 18 0 0 0 0 0 0 0.0000 = 0 / 18
Seg3 0 0 290 0 0 1 0 0 0.0034 = 1 / 291
Seg4 0 0 0 85 0 0 0 0 0.0000 = 0 / 85
Seg5 0 0 0 0 13 0 0 0 0.0000 = 0 / 13
Seg6 1 0 0 0 0 392 0 0 0.0025 = 1 / 393
Seg7 0 0 2 0 0 0 145 0 0.0136 = 2 / 147
Seg8 0 0 0 0 0 0 0 130 0.0000 = 0 / 130
Totals 157 18 292 85 13 394 145 130 0.0041 = 5 / 1,234
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.995948
2 2 1.000000
3 3 1.000000
4 4 1.000000
5 5 1.000000
6 6 1.000000
7 7 1.000000
8 8 1.000000
H2OMultinomialMetrics: glm
** Reported on validation data. **
Validation Set Metrics:
=====================
Extract validation frame with `h2o.getFrame("RTMP_sid_b243_20")`
MSE: (Extract with `h2o.mse`) 0.0379163
RMSE: (Extract with `h2o.rmse`) 0.1947211
Logloss: (Extract with `h2o.logloss`) 0.5469952
Mean Per-Class Error: 0.1195723
Null Deviance: (Extract with `h2o.nulldeviance`) 1906.692
Residual Deviance: (Extract with `h2o.residual_deviance`) 1214.378
R^2: (Extract with `h2o.r2`) 0.992174
AIC: (Extract with `h2o.aic`) NaN
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 63 0 0 1 0 1 0 0 0.0308 = 2 / 65
Seg2 0 7 0 1 3 0 0 0 0.3636 = 4 / 11
Seg3 0 0 111 0 0 1 0 0 0.0089 = 1 / 112
Seg4 0 0 0 37 1 0 2 0 0.0750 = 3 / 40
Seg5 0 2 0 1 5 0 0 0 0.3750 = 3 / 8
Seg6 0 0 1 0 0 176 0 0 0.0056 = 1 / 177
Seg7 0 0 0 0 1 0 58 0 0.0169 = 1 / 59
Seg8 2 0 2 0 0 0 1 57 0.0806 = 5 / 62
Totals 65 9 114 40 10 178 61 57 0.0375 = 20 / 534
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.962547
2 2 0.990637
3 3 0.998127
4 4 0.998127
5 5 0.998127
6 6 0.998127
7 7 1.000000
8 8 1.000000
Scoring History:
timestamp duration iterations negative_log_likelihood objective
1 2020-11-26 10:01:38 0.000 sec 0 2161.78770 1.75185
2 2020-11-26 10:01:38 0.026 sec 1 983.44076 0.79695
3 2020-11-26 10:01:38 0.045 sec 2 528.28160 0.42811
4 2020-11-26 10:01:38 0.060 sec 3 362.66936 0.29390
5 2020-11-26 10:01:38 0.072 sec 4 304.05046 0.24639
training_rmse training_logloss training_r2 training_classification_error
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
validation_rmse validation_logloss validation_r2
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
validation_classification_error
1 NA
2 NA
3 NA
4 NA
5 NA
---
timestamp duration iterations negative_log_likelihood objective
46 2020-11-26 10:01:39 0.650 sec 45 56.03445 0.04541
47 2020-11-26 10:01:39 0.660 sec 46 54.83831 0.04444
48 2020-11-26 10:01:39 0.669 sec 47 53.69041 0.04351
49 2020-11-26 10:01:39 0.680 sec 48 52.58921 0.04262
50 2020-11-26 10:01:39 0.692 sec 49 51.49931 0.04173
51 2020-11-26 10:01:39 0.702 sec 50 50.40318 0.04085
training_rmse training_logloss training_r2 training_classification_error
46 NA NA NA NA
47 NA NA NA NA
48 NA NA NA NA
49 NA NA NA NA
50 NA NA NA NA
51 0.09338 0.04085 0.99821 0.00405
validation_rmse validation_logloss validation_r2
46 NA NA NA
47 NA NA NA
48 NA NA NA
49 NA NA NA
50 NA NA NA
51 0.19472 0.54700 0.99217
validation_classification_error
46 NA
47 NA
48 NA
49 NA
50 NA
51 0.03745
Variable Importances: (Extract with `h2o.varimp`)
=================================================
variable relative_importance scaled_importance percentage
1 Preferred_Seats 123.81388 1.0000000 0.13452129
2 Ticket_Prices 115.56724 0.9333949 0.12556149
3 Recommend 85.74906 0.6925642 0.09316463
4 Overhead_Storage 79.80632 0.6445669 0.08670797
5 Satisfaction 77.83319 0.6286306 0.08456420
6 Easy_Reservation 77.37700 0.6249461 0.08406855
7 Flight_Options 60.88457 0.4917427 0.06614987
8 Service 59.96018 0.4842768 0.06514554
9 Clean_Aircraft 56.14288 0.4534458 0.06099812
10 Seat_Roominess 43.91615 0.3546949 0.04771402
11 Seat_Comfort 42.17984 0.3406714 0.04582755
12 Helpfulness 34.94279 0.2822203 0.03796464
13 Friendliness 34.36959 0.2775907 0.03734187
14 Courtesy 27.86086 0.2250221 0.03027027
The logistic regression model provides extremely good predictions on the training data, but relying on training-set accuracy to support the model would be misleading.
df8C.class_glm@model$training_metrics@metrics$cm$table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 156 0 0 0 0 1 0 0 0.0064 = 1 / 157
Seg2 0 18 0 0 0 0 0 0 0.0000 = 0 / 18
Seg3 0 0 290 0 0 1 0 0 0.0034 = 1 / 291
Seg4 0 0 0 85 0 0 0 0 0.0000 = 0 / 85
Seg5 0 0 0 0 13 0 0 0 0.0000 = 0 / 13
Seg6 1 0 0 0 0 392 0 0 0.0025 = 1 / 393
Seg7 0 0 2 0 0 0 145 0 0.0136 = 2 / 147
Seg8 0 0 0 0 0 0 0 130 0.0000 = 0 / 130
Totals 157 18 292 85 13 394 145 130 0.0041 = 5 / 1,234
The following shows the validation confusion matrix from the generated model. Here the predictive ability is still unusually good, with about 96.3% of respondents predicted accurately (20 errors out of 534).
df8C.class_glm@model$validation_metrics@metrics$cm$table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 63 0 0 1 0 1 0 0 0.0308 = 2 / 65
Seg2 0 7 0 1 3 0 0 0 0.3636 = 4 / 11
Seg3 0 0 111 0 0 1 0 0 0.0089 = 1 / 112
Seg4 0 0 0 37 1 0 2 0 0.0750 = 3 / 40
Seg5 0 2 0 1 5 0 0 0 0.3750 = 3 / 8
Seg6 0 0 1 0 0 176 0 0 0.0056 = 1 / 177
Seg7 0 0 0 0 1 0 58 0 0.0169 = 1 / 59
Seg8 2 0 2 0 0 0 1 57 0.0806 = 5 / 62
Totals 65 9 114 40 10 178 61 57 0.0375 = 20 / 534
The code below shows another command that produces the same confusion matrix.
h2o.confusionMatrix(df8C.class_glm, valid = TRUE) # df8C.class
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 63 0 0 1 0 1 0 0 0.0308 = 2 / 65
Seg2 0 7 0 1 3 0 0 0 0.3636 = 4 / 11
Seg3 0 0 111 0 0 1 0 0 0.0089 = 1 / 112
Seg4 0 0 0 37 1 0 2 0 0.0750 = 3 / 40
Seg5 0 2 0 1 5 0 0 0 0.3750 = 3 / 8
Seg6 0 0 1 0 0 176 0 0 0.0056 = 1 / 177
Seg7 0 0 0 0 1 0 58 0 0.0169 = 1 / 59
Seg8 2 0 2 0 0 0 1 57 0.0806 = 5 / 62
Totals 65 9 114 40 10 178 61 57 0.0375 = 20 / 534
h2o provides an easy way to obtain the variable importances.
h2o.varimp(df8C.class_glm)
variable relative_importance scaled_importance percentage
1 Preferred_Seats 123.81388 1.0000000 0.13452129
2 Ticket_Prices 115.56724 0.9333949 0.12556149
3 Recommend 85.74906 0.6925642 0.09316463
4 Overhead_Storage 79.80632 0.6445669 0.08670797
5 Satisfaction 77.83319 0.6286306 0.08456420
6 Easy_Reservation 77.37700 0.6249461 0.08406855
7 Flight_Options 60.88457 0.4917427 0.06614987
8 Service 59.96018 0.4842768 0.06514554
9 Clean_Aircraft 56.14288 0.4534458 0.06099812
10 Seat_Roominess 43.91615 0.3546949 0.04771402
11 Seat_Comfort 42.17984 0.3406714 0.04582755
12 Helpfulness 34.94279 0.2822203 0.03796464
13 Friendliness 34.36959 0.2775907 0.03734187
14 Courtesy 27.86086 0.2250221 0.03027027
Notice that the variable importances for this logistic regression model are not the same as those produced by the random forest model.
# glm_variable_importances <- as.data.frame(df8C.class_glm@model$variable_importances)
glm_variable_importances <- as.data.frame(h2o.varimp(df8C.class_glm))
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(glm_variable_importances,
# x = rf_variable_importances$percentage,
y=reorder(glm_variable_importances$variable,
glm_variable_importances$percentage),
x = glm_variable_importances$percentage,
color = glm_variable_importances$variable,
type = 'bar', orientation = 'h') %>%
layout( title = "Variable Importance for the logistic regression model on 6 segments",
xaxis = list(title = "Percentage Importance"),
ylim=c(0,1),
margin = list(l = 120))
modH[2, 1] <- "GLM_log_regr"
modH[2, 2] <- df8C.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
modH[2, 3] <- df8C.class_glm@model$validation_metrics@metrics$MSE
modH[2, 4] <- df8C.class_glm@model$validation_metrics@metrics$RMSE
modH[2, 5] <- df8C.class_glm@model$validation_metrics@metrics$logloss
modH[2, 6] <- df8C.class_glm@model$validation_metrics@metrics$mean_per_class_error
The gradient boosting machine (GBM) is an ensemble method that builds decision trees sequentially, each new tree correcting the errors of the previous ones, and it can be very effective. It is quite difficult to foretell which of these models is likely to be the best predictor of segment membership, so it is important to try several models and compare the results.
#GBM Gradient Boosting Machine
df8C.class_gbm<- h2o.gbm(
distribution="AUTO",
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=3:16, ## the predictor columns, by column index
y=2,
model_id = "fs.gbm",
stopping_rounds = 3,
histogram_type = "UniformAdaptive" ,
stopping_tolerance = 1e-2,
seed = 1234
)
GBM models do not produce coefficients of the variables.
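Because there are no coefficients, a GBM is usually examined through its scoring history (how the error falls as trees are added) and its variable importances. A short sketch using helpers that ship with h2o; the custom plotly chart further below serves a similar purpose:
# Scoring history: training/validation error as trees are added
head(h2o.scoreHistory(df8C.class_gbm))
# Built-in variable-importance bar chart
h2o.varimp_plot(df8C.class_gbm, num_of_features = 14)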
The “h2o.performance()” function provides a tidy list of many important analytical metrics. The “str(perf)” command will provide locations for the following metrics. Notice that the performance is requested on the holdout sample.
perf <- h2o.performance(df8C.class_gbm, fs.split[[2]])
perf
H2OMultinomialMetrics: gbm
Test Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.1327061
RMSE: (Extract with `h2o.rmse`) 0.3642884
Logloss: (Extract with `h2o.logloss`) 0.4226855
Mean Per-Class Error: 0.2296039
R^2: (Extract with `h2o.r2`) 0.9726092
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 51 0 2 0 0 7 5 0 0.2154 = 14 / 65
Seg2 0 8 0 0 3 0 0 0 0.2727 = 3 / 11
Seg3 0 0 95 0 0 7 7 3 0.1518 = 17 / 112
Seg4 2 0 1 26 1 0 6 4 0.3500 = 14 / 40
Seg5 0 1 0 2 5 0 0 0 0.3750 = 3 / 8
Seg6 0 0 5 0 0 172 0 0 0.0282 = 5 / 177
Seg7 2 0 8 0 0 0 49 0 0.1695 = 10 / 59
Seg8 3 0 8 0 0 5 1 45 0.2742 = 17 / 62
Totals 58 9 119 28 9 191 68 52 0.1554 = 83 / 534
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.844569
2 2 0.953183
3 3 0.986891
4 4 0.996255
5 5 1.000000
6 6 1.000000
7 7 1.000000
8 8 1.000000
# str(df8C.class_gbm)
df8C.class_gbm@model$training_metrics@metrics$cm$table
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 157 0 0 0 0 0 0 0 0.0000 = 0 / 157
Seg2 0 18 0 0 0 0 0 0 0.0000 = 0 / 18
Seg3 0 0 291 0 0 0 0 0 0.0000 = 0 / 291
Seg4 0 0 0 85 0 0 0 0 0.0000 = 0 / 85
Seg5 0 0 0 0 13 0 0 0 0.0000 = 0 / 13
Seg6 0 0 0 0 0 393 0 0 0.0000 = 0 / 393
Seg7 0 0 0 0 0 0 147 0 0.0000 = 0 / 147
Seg8 0 0 0 0 0 0 0 130 0.0000 = 0 / 130
Totals 157 18 291 85 13 393 147 130 0.0000 = 0 / 1,234
# h2o.confusionMatrix(df8C.class_gbm, train = TRUE)
The confusion matrix for the GBM model can be pulled either from the model object itself or from the performance output shown above.
h2o.confusionMatrix(df8C.class_gbm, fs.split[[2]]) # CONFUSION TABLE FOR HOLDOUT
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 51 0 2 0 0 7 5 0 0.2154 = 14 / 65
Seg2 0 8 0 0 3 0 0 0 0.2727 = 3 / 11
Seg3 0 0 95 0 0 7 7 3 0.1518 = 17 / 112
Seg4 2 0 1 26 1 0 6 4 0.3500 = 14 / 40
Seg5 0 1 0 2 5 0 0 0 0.3750 = 3 / 8
Seg6 0 0 5 0 0 172 0 0 0.0282 = 5 / 177
Seg7 2 0 8 0 0 0 49 0 0.1695 = 10 / 59
Seg8 3 0 8 0 0 5 1 45 0.2742 = 17 / 62
Totals 58 9 119 28 9 191 68 52 0.1554 = 83 / 534
#h2o.confusionMatrix(perf) # DIFFERENT WAY
h2o.varimp(df8C.class_gbm)
Variable Importances:
variable relative_importance scaled_importance percentage
1 Flight_Options 524.496399 1.000000 0.156150
2 Ticket_Prices 440.558929 0.839966 0.131160
3 Overhead_Storage 360.092560 0.686549 0.107204
4 Recommend 351.703003 0.670554 0.104707
5 Service 346.527100 0.660685 0.103166
6 Preferred_Seats 339.407776 0.647112 0.101046
7 Seat_Roominess 241.974472 0.461346 0.072039
8 Helpfulness 167.264206 0.318904 0.049797
9 Seat_Comfort 166.859787 0.318133 0.049676
10 Satisfaction 117.915565 0.224817 0.035105
11 Friendliness 100.707726 0.192008 0.029982
12 Clean_Aircraft 83.652298 0.159491 0.024904
13 Easy_Reservation 60.830585 0.115979 0.018110
14 Courtesy 56.941521 0.108564 0.016952
The variable importances are broadly similar to those from the random forest and GLM models, but with some differences in ordering.
# glm_variable_importances <- as.data.frame(df8C.class_glm@model$variable_importances)
gbm_variable_importances <- as.data.frame(h2o.varimp(df8C.class_gbm))
# rf_variable_importances
#install.packages("plotly", dependencies=TRUE)
pacman::p_load(plotly)
plot_ly(gbm_variable_importances,
# x = rf_variable_importances$percentage,
y=reorder(gbm_variable_importances$variable,
gbm_variable_importances$percentage),
x = gbm_variable_importances$percentage,
color = gbm_variable_importances$variable,
type = 'bar', orientation = 'h') %>%
layout( title = "Variable Importance for GBM model of 6-segment solution",
xaxis = list(title = "Percentage Importance"),
ylim=c(0,1),
margin = list(l = 120))
The table below shows, for several respondents, the predicted probability of belonging to each of the 8 segments, and then assigns each person to the segment with the highest probability. This can be very valuable information and can be used in different ways.
gbm_8_pred <- h2o.predict(df8C.class_gbm, newdata=fs.split[[2]])
head(gbm_8_pred, 10) %>% tab_df( show.rownames = T, title = "Predicted segments for testing sample")
predict | Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 |
---|---|---|---|---|---|---|---|---|
Seg2 | 0.00 | 0.88 | 0.00 | 0.00 | 0.11 | 0.00 | 0.00 | 0.00 |
Seg5 | 0.00 | 0.04 | 0.00 | 0.15 | 0.78 | 0.00 | 0.01 | 0.00 |
Seg5 | 0.00 | 0.08 | 0.00 | 0.00 | 0.91 | 0.00 | 0.00 | 0.00 |
Seg2 | 0.00 | 0.61 | 0.00 | 0.11 | 0.26 | 0.00 | 0.00 | 0.00 |
Seg4 | 0.00 | 0.00 | 0.00 | 0.67 | 0.32 | 0.00 | 0.01 | 0.00 |
Seg2 | 0.00 | 0.91 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 |
Seg5 | 0.00 | 0.26 | 0.00 | 0.01 | 0.72 | 0.00 | 0.00 | 0.00 |
Seg2 | 0.00 | 0.99 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 |
Seg2 | 0.00 | 0.92 | 0.00 | 0.00 | 0.05 | 0.00 | 0.01 | 0.00 |
Seg2 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
modH[3, 1] <- "Gradient Boosting"
modH[3, 2] <- df8C.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1]
modH[3, 3] <- df8C.class_gbm@model$validation_metrics@metrics$MSE
modH[3, 4] <- df8C.class_gbm@model$validation_metrics@metrics$RMSE
modH[3, 5] <- df8C.class_gbm@model$validation_metrics@metrics$logloss
modH[3, 6] <- df8C.class_gbm@model$validation_metrics@metrics$mean_per_class_error
The table below provides the diagnostic statistics produced by the predictive models.
modH %>% tab_df( show.rownames = TRUE, sort.column = -2, title = "Statistics of predictive models for the 8-segment solution")
Row | Prediction_model | hit_ratio | MSE | RMSE | logloss | mean_per_class_error |
---|---|---|---|---|---|---|
2 | GLM_log_regr | 0.96 | 0.04 | 0.19 | 0.55 | 0.12 |
3 | Gradient Boosting | 0.84 | 0.13 | 0.36 | 0.42 | 0.23 |
1 | Random_forest | 0.83 | 0.19 | 0.43 | 0.54 | 0.24 |
The GLM (logistic regression) model appears to be the best, with a hit ratio of about 0.96 compared with 0.84 for gradient boosting and 0.83 for the random forest. Most of the other statistics (MSE, RMSE and mean per-class error) also favor the GLM, although the GBM has a slightly lower logloss.
modH_hit <- modH[, 1:2]
modH_hit_long <- melt(modH_hit)
# Using Prediction_model as id variables
ggplot(data=modH_hit_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
geom_bar( stat="identity" , fill = "lightblue", width = 0.3)+
geom_point(aes( color= Prediction_model ), size=6) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14)) +
labs( title= "Hit ratios for predictive models", y="Proportion correct",
x= "Prediction Model")
Before moving on to the Level-2 segmentation with the demographic variables, this section interprets the clusters using the means of their attitudinal variables.
g8 + theme(legend.direction = "vertical",
legend.text = element_text(size=15),
legend.key.size = unit(.7, "cm"))
Segment 2: Has the lowest scores on all attitude variables; however, it makes up only about 1.64 percent of all customers (29 of 1,768).
Segment 5: One of the two lowest-scoring segments, although its customers had a somewhat better opinion than segment 2 on the helpfulness, service and easy_reservation variables.
Segment 6: Our best customers! They had the highest averages in all categories, with no particular peak or trough.
Segment 1: Think highly of the service variables but were not satisfied with the flight-option variables; overhead storage was a particular turn-off for them.
Segment 3: Near perfect in almost all categories (like segment 6), but flight options, ticket prices and preferred seats were a concern for this segment. This did not affect their willingness to recommend the airline to others; the segment is in the top three on the recommend variable.
Segment 4: Moderately satisfied; overhead storage and ticket prices were concerns for this segment.
Segment 7: One of the more interesting segments. They were near the top on seat_comfort, courtesy, friendliness and the other quality-of-service variables, but not on the flight options and ticket prices variables. They were also not very eager to recommend the airline to others, scoring around average on the recommend variable.
Segment 8: They show the same pattern across the variable means as our highest segment (segment 6), but about one point lower, with no particular peaks.
The objective of Level 1 segmentation was 1) to use a partitioning method, or several, to find several segmentation solutions; and 2) to find a model that is best at predicting membership in the segments based on the Level 1 variables, normally attitudinal variables based on the product.
Level 2 of the segmentation process focuses on the very practical problem of attempting to assign people to segments based on variables that may be more accessible than the attitudinal variables, i.e., demographic and geo-demographic variables. Most large customer databases contain individual demographic information but rarely attitudes, product perceptions or preferences.
The process of applying a predictive model to a customer database is often called scoring or tagging. A very simple predictive model that sales representatives can use quickly to estimate the segment membership of a prospect is referred to as a typing tool.
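As a concrete illustration of scoring, a fitted Level-2 model can be applied to a prospect record that contains only demographic fields. The sketch below is hypothetical (the field values are invented for illustration) and assumes the Level-2 random forest, fs3Cov.class_rf, that is built later in this section:
# Hypothetical prospect with demographic fields only (values invented for illustration)
prospect <- as.h2o(data.frame(
Language = "English", Smoker = "No",
Employment = "Fully employed", Education = "Bachelor's degree",
Marital = "Married", Sex = "Female",
Age = "45 to 54", Income = "$50000 to $59999"))
# Returns the most likely segment plus per-segment probabilities
h2o.predict(fs3Cov.class_rf, newdata = prospect)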
The demographic variables in the airline data need to be examined before they are put into the model. A quick look with h2o.describe() shows that they are stored as strings rather than factors, so the next code chunk converts them to factors. The demographic variables also contain missing values, which cause the subsequent predictive models to fail. Omitting the rows that contain a missing value loses about 50 percent of the customers. However, since the Level-2 model performance is not expected to be comparable to the Level-1 performance anyway, we proceed with the complete cases for now.
h2o.describe(df8C.class[,17:24])
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
1 Language string 159 0 0 0 NaN NaN NA NA NA
2 Smoker string 153 0 0 0 NaN NaN NA NA NA
3 Employment string 141 0 0 0 NaN NaN NA NA NA
4 Education string 147 0 0 0 NaN NaN NA NA NA
5 Marital string 141 0 0 0 NaN NaN NA NA NA
6 Sex string 151 0 0 0 NaN NaN NA NA NA
7 Age string 146 0 0 0 NaN NaN NA NA NA
8 Income string 129 0 0 0 NaN NaN NA NA NA
df8C.class[,17:24] <- h2o.asfactor(df8C.class[,17:24])
df8C.class.na <- df8C.class %>% h2o.na_omit()
nrow(df8C.class.na)
[1] 895
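An alternative to dropping roughly half the rows would be to impute the missing demographic values, for example filling each factor column with its most common level. A sketch of that option using h2o.impute() (not used in the analysis below; it assumes the columns have already been converted to factors as above):
# Optional alternative to h2o.na_omit(): fill each demographic factor with its mode
for (col in colnames(df8C.class)[17:24]) {
h2o.impute(df8C.class, column = col, method = "mode")
}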
After that, the reduced (by about 50%) data set is split into training and validation subsets.
#__splitting df into training (70%) and testing (validation) datasets________
fs.split<- h2o.splitFrame(df8C.class.na, ratios=c(0.7))
This random forest model attempts to predict membership in the 8-segment solution developed in Level 1 using only the demographic variables. The demographic variables were not used in constructing the segments, but they may be the only variables available in a customer database, and they are easier to obtain than the attitudinal variables. This begins the process of answering the question, "How well do the covariates predict segment membership?"
fs3Cov.class_rf <- h2o.randomForest(
training_frame = fs.split[[1]],
validation_frame = fs.split[[2]],
x=17:24, ## the predictor columns, by column index
y=2, ## the target index (what we are predicting)
model_id = "RF_cov", ## name the model in H2O
## not required, but helps use Flow
ntrees = 200, ## use a maximum of 200 trees to create the
## random forest model. The default is 50.
## I have increased it because I will let
## the early stopping criteria decide when
## the random forest is sufficiently accurate
stopping_rounds = 2, ## Stop fitting new trees when the 2-tree
## average is within 0.001 (default) of
## the prior two 2-tree averages.
## Can be thought of as a convergence setting
score_each_iteration = T, ## Predict against training and validation for
## each tree. Default will skip several.
seed = 1000000) ## Set the random seed so that this can be reproduced.
The summary() function brings forth a great deal of output. Most of it is useful, but it might be too much to digest in one chunk; using the str() function on the model object will list the smaller pieces that can be extracted individually. Both the training and validation metrics indicate that the demographic variables are not very effective at predicting segment membership.
summary(fs3Cov.class_rf)
Model Details:
==============
H2OMultinomialModel: drf
Model Key: RF_cov
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 30 240 226294 5
max_depth mean_depth min_leaves max_leaves mean_leaves
1 18 11.89583 9 161 71.02500
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.6623662
RMSE: (Extract with `h2o.rmse`) 0.8138588
Logloss: (Extract with `h2o.logloss`) 5.199065
Mean Per-Class Error: 0.8732956
R^2: (Extract with `h2o.r2`) 0.8610053
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 10 1 18 2 0 38 5 4 0.8718 = 68 / 78
Seg2 2 0 3 1 0 6 0 0 1.0000 = 12 / 12
Seg3 14 1 35 9 1 75 9 10 0.7727 = 119 / 154
Seg4 6 2 10 3 0 27 1 3 0.9423 = 49 / 52
Seg5 0 0 1 0 0 8 0 0 1.0000 = 9 / 9
Seg6 15 1 67 4 1 91 14 12 0.5561 = 114 / 205
Seg7 6 0 14 3 0 40 7 1 0.9014 = 64 / 71
Seg8 3 0 13 4 0 40 5 4 0.9420 = 65 / 69
Totals 56 5 161 26 2 325 41 34 0.7692 = 500 / 650
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.230769
2 2 0.429231
3 3 0.615385
4 4 0.740000
5 5 0.827692
6 6 0.872308
7 7 0.889231
8 8 1.000000
H2OMultinomialMetrics: drf
** Reported on validation data. **
Validation Set Metrics:
=====================
Extract validation frame with `h2o.getFrame("RTMP_sid_b243_25")`
MSE: (Extract with `h2o.mse`) 0.6634014
RMSE: (Extract with `h2o.rmse`) 0.8144946
Logloss: (Extract with `h2o.logloss`) 3.606853
Mean Per-Class Error: 0.8715399
R^2: (Extract with `h2o.r2`) 0.8626447
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 0 0 14 0 0 17 0 0 1.0000 = 31 / 31
Seg2 0 0 4 0 0 2 0 0 1.0000 = 6 / 6
Seg3 2 0 21 1 0 26 3 3 0.6250 = 35 / 56
Seg4 1 0 6 0 0 8 0 1 1.0000 = 16 / 16
Seg5 0 0 0 0 0 1 2 0 1.0000 = 3 / 3
Seg6 7 0 15 3 2 45 6 0 0.4231 = 33 / 78
Seg7 3 0 6 0 0 22 1 1 0.9697 = 32 / 33
Seg8 0 0 6 0 0 14 1 1 0.9545 = 21 / 22
Totals 13 0 72 4 2 135 13 6 0.7224 = 177 / 245
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.277551
2 2 0.436735
3 3 0.616326
4 4 0.726531
5 5 0.820408
6 6 0.897959
7 7 0.934694
8 8 1.000000
Scoring History:
timestamp duration number_of_trees training_rmse training_logloss
1 2020-11-26 10:01:44 0.002 sec 0 NA NA
2 2020-11-26 10:01:45 0.012 sec 1 0.87477 23.42415
3 2020-11-26 10:01:45 0.019 sec 2 0.86543 22.31440
4 2020-11-26 10:01:45 0.027 sec 3 0.86228 21.29809
5 2020-11-26 10:01:45 0.035 sec 4 0.85881 19.85723
training_classification_error validation_rmse validation_logloss
1 NA NA NA
2 0.76371 0.89835 25.44157
3 0.77696 0.87445 20.27283
4 0.77439 0.84749 16.53726
5 0.78766 0.84085 13.41488
validation_classification_error
1 NA
2 0.80000
3 0.81224
4 0.76735
5 0.79592
---
timestamp duration number_of_trees training_rmse
26 2020-11-26 10:01:45 0.312 sec 25 0.81133
27 2020-11-26 10:01:45 0.330 sec 26 0.81171
28 2020-11-26 10:01:45 0.349 sec 27 0.81249
29 2020-11-26 10:01:45 0.370 sec 28 0.81351
30 2020-11-26 10:01:45 0.391 sec 29 0.81330
31 2020-11-26 10:01:45 0.412 sec 30 0.81386
training_logloss training_classification_error validation_rmse
26 6.08735 0.74308 0.81237
27 5.75777 0.74462 0.81253
28 5.43538 0.75692 0.81259
29 5.38847 0.75846 0.81289
30 5.29201 0.75846 0.81375
31 5.19907 0.76923 0.81449
validation_logloss validation_classification_error
26 3.82308 0.73061
27 3.82442 0.73469
28 3.58487 0.73061
29 3.59066 0.73061
30 3.59965 0.71837
31 3.60685 0.72245
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Variable Importances:
variable relative_importance scaled_importance percentage
1 Income 1835.585205 1.000000 0.246807
2 Age 1492.825073 0.813269 0.200721
3 Education 1338.323120 0.729099 0.179947
4 Marital 932.513550 0.508020 0.125383
5 Language 614.997009 0.335041 0.082691
6 Employment 611.182190 0.332963 0.082178
7 Sex 344.215271 0.187523 0.046282
8 Smoker 267.674744 0.145825 0.035991
More direct ways to access the training and validation metrics are shown in the code chunks below. The performance metrics depend on the type of model being built; for a multinomial classification we look primarily at the confusion matrix and at overall accuracy via the hit ratio. Once again, both the error rate and the hit ratio are very poor.
The predictive ability of the random forest model on the training sample is quite poor. Even so, training-sample performance should not be relied upon; the validation sample is the better guide to predictive ability.
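As a side note, h2o.performance() can compute the same metrics object for any frame on demand. The following is a small sketch, not part of the original run, and a direct alternative to digging through the @model slots used in the chunks below.
# Sketch only: compute holdout metrics on demand rather than via @model slots.
perf_valid <- h2o.performance(fs3Cov.class_rf, newdata = fs.split[[2]])
h2o.mean_per_class_error(perf_valid)   # same mean per-class error as reported above
h2o.confusionMatrix(perf_valid)        # same holdout confusion matrix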
fs3Cov.class_rf@model$training_metrics ##
H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.6623662
RMSE: (Extract with `h2o.rmse`) 0.8138588
Logloss: (Extract with `h2o.logloss`) 5.199065
Mean Per-Class Error: 0.8732956
R^2: (Extract with `h2o.r2`) 0.8610053
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 10 1 18 2 0 38 5 4 0.8718 = 68 / 78
Seg2 2 0 3 1 0 6 0 0 1.0000 = 12 / 12
Seg3 14 1 35 9 1 75 9 10 0.7727 = 119 / 154
Seg4 6 2 10 3 0 27 1 3 0.9423 = 49 / 52
Seg5 0 0 1 0 0 8 0 0 1.0000 = 9 / 9
Seg6 15 1 67 4 1 91 14 12 0.5561 = 114 / 205
Seg7 6 0 14 3 0 40 7 1 0.9014 = 64 / 71
Seg8 3 0 13 4 0 40 5 4 0.9420 = 65 / 69
Totals 56 5 161 26 2 325 41 34 0.7692 = 500 / 650
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.230769
2 2 0.429231
3 3 0.615385
4 4 0.740000
5 5 0.827692
6 6 0.872308
7 7 0.889231
8 8 1.000000
The error rate and the corresponding hit ratio for the validation (holdout) sample are also very poor.
h2o.confusionMatrix(fs3Cov.class_rf, fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 0 0 14 0 0 17 0 0 1.0000 = 31 / 31
Seg2 0 0 4 0 0 2 0 0 1.0000 = 6 / 6
Seg3 2 0 21 1 0 26 3 3 0.6250 = 35 / 56
Seg4 1 0 6 0 0 8 0 1 1.0000 = 16 / 16
Seg5 0 0 0 0 0 1 2 0 1.0000 = 3 / 3
Seg6 7 0 15 3 2 45 6 0 0.4231 = 33 / 78
Seg7 3 0 6 0 0 22 1 1 0.9697 = 32 / 33
Seg8 0 0 6 0 0 14 1 1 0.9545 = 21 / 22
Totals 13 0 72 4 2 135 13 6 0.7224 = 177 / 245
Or, the same result can be obtained as follows.
fs3Cov.class_rf@model$validation_metrics ##
H2OMultinomialMetrics: drf
** Reported on validation data. **
Validation Set Metrics:
=====================
Extract validation frame with `h2o.getFrame("RTMP_sid_b243_25")`
MSE: (Extract with `h2o.mse`) 0.6634014
RMSE: (Extract with `h2o.rmse`) 0.8144946
Logloss: (Extract with `h2o.logloss`) 3.606853
Mean Per-Class Error: 0.8715399
R^2: (Extract with `h2o.r2`) 0.8626447
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 0 0 14 0 0 17 0 0 1.0000 = 31 / 31
Seg2 0 0 4 0 0 2 0 0 1.0000 = 6 / 6
Seg3 2 0 21 1 0 26 3 3 0.6250 = 35 / 56
Seg4 1 0 6 0 0 8 0 1 1.0000 = 16 / 16
Seg5 0 0 0 0 0 1 2 0 1.0000 = 3 / 3
Seg6 7 0 15 3 2 45 6 0 0.4231 = 33 / 78
Seg7 3 0 6 0 0 22 1 1 0.9697 = 32 / 33
Seg8 0 0 6 0 0 14 1 1 0.9545 = 21 / 22
Totals 13 0 72 4 2 135 13 6 0.7224 = 177 / 245
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,valid = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.277551
2 2 0.436735
3 3 0.616326
4 4 0.726531
5 5 0.820408
6 6 0.897959
7 7 0.934694
8 8 1.000000
h2o.hit_ratio_table(fs3Cov.class_rf, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios:
k hit_ratio
1 1 0.230769
2 2 0.429231
3 3 0.615385
4 4 0.740000
5 5 0.827692
6 6 0.872308
7 7 0.889231
8 8 1.000000
$valid
Top-8 Hit Ratios:
k hit_ratio
1 1 0.277551
2 2 0.436735
3 3 0.616326
4 4 0.726531
5 5 0.820408
6 6 0.897959
7 7 0.934694
8 8 1.000000
While income, age, education and marital status appear to be the most important variables, they still do a poor job of predicting segment membership.
fs3Cov.class_rf@model$variable_importances
Variable Importances:
variable relative_importance scaled_importance percentage
1 Income 1835.585205 1.000000 0.246807
2 Age 1492.825073 0.813269 0.200721
3 Education 1338.323120 0.729099 0.179947
4 Marital 932.513550 0.508020 0.125383
5 Language 614.997009 0.335041 0.082691
6 Employment 611.182190 0.332963 0.082178
7 Sex 344.215271 0.187523 0.046282
8 Smoker 267.674744 0.145825 0.035991
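For a quick visual check of these importances, h2o's built-in plotting helper can be used; this is a minimal sketch and was not run in the original report.
# Sketch: bar chart of scaled variable importances for the random forest.
h2o.varimp_plot(fs3Cov.class_rf, num_of_features = 8)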
The table below shows the hit-ratio, MSE (mean square error), RMSE (root mean square error), logloss and mean per class error for the random forest model. This table will be expanded to include the diagnostic statistics from each successive predictive model. After all of the analyses, we’ll select the model with the best stats, i.e., highest hit ratio and lowest error rates.
# THE FOLLOWING covH data frame IS TO HOLD STATISTICS FROM EACH MODEL
covH <- data.frame(Prediction_model=character(),
hit_ratio=numeric(),
MSE=numeric(),
RMSE=numeric(),
logloss=numeric(),
mean_per_class_error=numeric(),
stringsAsFactors=FALSE)
covH[1, 1] <- "Random_forest"
covH[1, 2] <- fs3Cov.class_rf@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[1, 3] <- fs3Cov.class_rf@model$validation_metrics@metrics$MSE #
covH[1, 4] <- fs3Cov.class_rf@model$validation_metrics@metrics$RMSE #
covH[1, 5] <- fs3Cov.class_rf@model$validation_metrics@metrics$ logloss
covH[1, 6] <- fs3Cov.class_rf@model$validation_metrics@metrics$ mean_per_class_error
covH
Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.277551 0.6634014 0.8144946 3.606853 0.8715399
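Since the same five validation metrics are copied for every model, a small helper function (hypothetical, shown here only as a sketch) would keep this table-building less repetitive.
# Sketch: collect_metrics() is a hypothetical helper that pulls the same five
# validation metrics from any H2O classification model into a one-row data frame.
collect_metrics <- function(model, label) {
  m <- model@model$validation_metrics@metrics
  data.frame(Prediction_model     = label,
             hit_ratio            = m$hit_ratio_table$hit_ratio[1],
             MSE                  = m$MSE,
             RMSE                 = m$RMSE,
             logloss              = m$logloss,
             mean_per_class_error = m$mean_per_class_error,
             stringsAsFactors     = FALSE)
}
# Each later model would then add one row, e.g. covH <- rbind(covH, collect_metrics(model, "label"))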
Multinomial logistic regression, fit with the h2o.glm function, is the second predictive model we will investigate.
fs3Cov.class_glm<- h2o.glm(
family= "multinomial",
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=17:24, ## the predictor columns, by column index
y=2,
lambda=0
)
Logistic regression does provide coefficients, but multinomial log-odds can be difficult to interpret; one way to make them more readable is sketched after the table below.
fs3Cov.class_glm@model$coefficients_table
Coefficients: glm multinomial coefficients
names coefs_class_0 coefs_class_1 coefs_class_2
1 Intercept -2.789630 -29.177628 -1.748118
2 Income.$150,000 to $174,999 -12.301131 14.376549 -1.050206
3 Income.$175,000 to $199,999, -0.419814 -0.424694 0.173737
4 Income.$200,000 or more -0.372210 1.875082 -0.494475
5 Income.$20000 to $29999 -0.328354 12.463572 -0.328955
coefs_class_3 coefs_class_4 coefs_class_5 coefs_class_6 coefs_class_7
1 -1.317261 -31.382642 -1.503323 -2.020717 -1.514884
2 -1.340962 -0.056789 0.976500 -0.094116 0.818264
3 -1.370077 14.454119 0.443560 -0.868560 0.111476
4 -12.078823 0.852225 0.689659 -0.019912 0.252480
5 -0.870218 12.568106 0.205512 -0.378688 0.294759
std_coefs_class_0 std_coefs_class_1 std_coefs_class_2 std_coefs_class_3
1 -2.789630 -29.177628 -1.748118 -1.317261
2 -12.301131 14.376549 -1.050206 -1.340962
3 -0.419814 -0.424694 0.173737 -1.370077
4 -0.372210 1.875082 -0.494475 -12.078823
5 -0.328354 12.463572 -0.328955 -0.870218
std_coefs_class_4 std_coefs_class_5 std_coefs_class_6 std_coefs_class_7
1 -31.382642 -1.503323 -2.020717 -1.514884
2 -0.056789 0.976500 -0.094116 0.818264
3 14.454119 0.443560 -0.868560 0.111476
4 0.852225 0.689659 -0.019912 0.252480
5 12.568106 0.205512 -0.378688 0.294759
---
names coefs_class_0 coefs_class_1 coefs_class_2
38 Language.Italian 0.872204 -12.074147 0.432465
39 Language.Spanish 0.180752 0.538956 0.390999
40 Employment.Fully employed -0.039444 1.336692 0.047779
41 Employment.Not employed -0.219908 0.705802 0.053990
42 Smoker.Yes 0.205536 -12.207787 0.220553
43 Sex.Male 0.531955 0.541160 -0.092518
coefs_class_3 coefs_class_4 coefs_class_5 coefs_class_6 coefs_class_7
38 -0.419141 -12.239819 0.442000 -0.526422 0.075546
39 -0.850755 -0.492135 0.084140 -0.524931 0.439032
40 -0.064321 -0.829027 -0.156842 -0.053498 0.282103
41 0.180056 1.062008 -0.232056 -0.154852 0.342767
42 -0.601104 -0.221593 -0.163952 0.441261 -0.294634
43 -0.287154 -0.946531 0.053459 0.164545 -0.112481
std_coefs_class_0 std_coefs_class_1 std_coefs_class_2 std_coefs_class_3
38 0.872204 -12.074147 0.432465 -0.419141
39 0.180752 0.538956 0.390999 -0.850755
40 -0.039444 1.336692 0.047779 -0.064321
41 -0.219908 0.705802 0.053990 0.180056
42 0.205536 -12.207787 0.220553 -0.601104
43 0.531955 0.541160 -0.092518 -0.287154
std_coefs_class_4 std_coefs_class_5 std_coefs_class_6 std_coefs_class_7
38 -12.239819 0.442000 -0.526422 0.075546
39 -0.492135 0.084140 -0.524931 0.439032
40 -0.829027 -0.156842 -0.053498 0.282103
41 1.062008 -0.232056 -0.154852 0.342767
42 -0.221593 -0.163952 0.441261 -0.294634
43 -0.946531 0.053459 0.164545 -0.112481
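One way to make the raw log-odds above more readable (a sketch, not part of the original analysis) is to exponentiate the non-standardized coefficients into per-class relative risk ratios.
# Sketch: exponentiate the multinomial log-odds coefficients to get relative
# risk ratios, which are usually easier to read than raw coefficients.
coefs     <- as.data.frame(fs3Cov.class_glm@model$coefficients_table)
coef_cols <- grep("^coefs_class", names(coefs))   # non-standardized columns only
rr        <- coefs[, c(1, coef_cols)]
rr[, -1]  <- exp(rr[, -1])
head(rr)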
The error rate on the training sample is also quite poor for the GLM model.
fs3Cov.class_glm@model$training_metrics
H2OMultinomialMetrics: glm
** Reported on training data. **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.5700341
RMSE: (Extract with `h2o.rmse`) 0.7550061
Logloss: (Extract with `h2o.logloss`) 1.551365
Mean Per-Class Error: 0.7689441
Null Deviance: (Extract with `h2o.nulldeviance`) 2306.878
Residual Deviance: (Extract with `h2o.residual_deviance`) 2016.775
R^2: (Extract with `h2o.r2`) 0.8803808
AIC: (Extract with `h2o.aic`) NaN
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 7 0 25 1 0 39 4 2 0.9103 = 71 / 78
Seg2 1 3 1 0 0 7 0 0 0.7500 = 9 / 12
Seg3 6 1 69 1 1 69 4 3 0.5519 = 85 / 154
Seg4 2 1 9 6 0 32 1 1 0.8846 = 46 / 52
Seg5 0 0 1 0 0 8 0 0 1.0000 = 9 / 9
Seg6 8 0 45 3 0 138 7 4 0.3268 = 67 / 205
Seg7 4 1 14 4 0 40 8 0 0.8873 = 63 / 71
Seg8 3 0 19 3 0 33 0 11 0.8406 = 58 / 69
Totals 31 6 183 18 1 366 24 21 0.6277 = 408 / 650
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.372308
2 2 0.613846
3 3 0.778462
4 4 0.880000
5 5 0.947692
6 6 0.984615
7 7 0.998462
8 8 1.000000
The holdout-sample error rate is worse than the training-sample rate, as should be expected. Unfortunately, it is also very high in absolute terms.
h2o.confusionMatrix(fs3Cov.class_glm, fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 2 2 10 0 0 16 1 0 0.9355 = 29 / 31
Seg2 0 1 2 0 0 3 0 0 0.8333 = 5 / 6
Seg3 2 0 14 0 1 32 3 4 0.7500 = 42 / 56
Seg4 1 1 6 1 0 6 1 0 0.9375 = 15 / 16
Seg5 0 0 1 0 0 2 0 0 1.0000 = 3 / 3
Seg6 3 0 23 1 0 46 4 1 0.4103 = 32 / 78
Seg7 2 2 11 0 0 16 1 1 0.9697 = 32 / 33
Seg8 1 1 5 0 0 13 2 0 1.0000 = 22 / 22
Totals 11 7 72 2 1 134 12 6 0.7347 = 180 / 245
Of course the top-1 hit ratios, which are just 1 minus the overall error rates, are quite poor.
h2o.hit_ratio_table(fs3Cov.class_glm, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios:
k hit_ratio
1 1 0.372308
2 2 0.613846
3 3 0.778462
4 4 0.880000
5 5 0.947692
6 6 0.984615
7 7 0.998462
8 8 1.000000
$valid
Top-8 Hit Ratios:
k hit_ratio
1 1 0.265306
2 2 0.453061
3 3 0.624490
4 4 0.783673
5 5 0.881633
6 6 0.955102
7 7 0.979592
8 8 1.000000
The table below shows the diagnostic statistics for the random forest and logistic regression models. Neither result is at all impressive.
covH[2, 1] <- "GLM"
covH[2, 2] <- fs3Cov.class_glm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[2, 3] <- fs3Cov.class_glm@model$validation_metrics@metrics$MSE #
covH[2, 4] <- fs3Cov.class_glm@model$validation_metrics@metrics$RMSE #
covH[2, 5] <- fs3Cov.class_glm@model$validation_metrics@metrics$ logloss
covH[2, 6] <- fs3Cov.class_glm@model$validation_metrics@metrics$ mean_per_class_error
covH
Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.2775510 0.6634014 0.8144946 3.606853 0.8715399
2 GLM 0.2653061 0.6644477 0.8151366 2.407140 0.8545338
The basic deep learning model has a very simple structure that can be expanded.
fs3Cov.class_dl<- h2o.deeplearning(
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=17:24, ## the predictor columns, by column index
y=2
)
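For reference, the simple structure above could be expanded along the following lines; this is a sketch with illustrative, untuned hyper-parameters and is not fitted or evaluated in this report.
# Sketch only: an expanded deep learning specification (values are illustrative).
fs3Cov.class_dl2 <- h2o.deeplearning(
  training_frame   = fs.split[[1]],
  validation_frame = fs.split[[2]],
  x      = 17:24,
  y      = 2,
  hidden = c(64, 64),   # two hidden layers instead of the default structure
  epochs = 50,          # more passes over the training data
  l2     = 1e-4,        # mild weight decay to limit overfitting
  seed   = 7238
)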
The training-sample error rate for this baseline deep learning fit is a little better than for the earlier models. As before, though, training performance by itself tells us little about predictive ability.
fs3Cov.class_dl@model$training_metrics
H2OMultinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.4472733
RMSE: (Extract with `h2o.rmse`) 0.6687849
Logloss: (Extract with `h2o.logloss`) 1.254786
Mean Per-Class Error: 0.5981942
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 31 0 4 4 0 28 7 4 0.6026 = 47 / 78
Seg2 1 0 2 1 0 5 0 3 1.0000 = 12 / 12
Seg3 7 0 58 6 0 54 23 6 0.6234 = 96 / 154
Seg4 1 0 2 24 0 17 4 4 0.5385 = 28 / 52
Seg5 0 0 0 0 0 6 3 0 1.0000 = 9 / 9
Seg6 5 0 6 6 0 172 12 4 0.1610 = 33 / 205
Seg7 0 0 1 2 0 19 48 1 0.3239 = 23 / 71
Seg8 1 0 5 4 0 24 3 32 0.5362 = 37 / 69
Totals 46 0 78 47 0 325 100 54 0.4385 = 285 / 650
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.561538
2 2 0.749231
3 3 0.869231
4 4 0.938462
5 5 0.963077
6 6 0.976923
7 7 1.000000
8 8 1.000000
Unfortunately, the validation sample error rate is worse than that produced by the random forest and GLM models.
h2o.confusionMatrix(fs3Cov.class_dl, fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 1 0 4 1 0 18 6 1 0.9677 = 30 / 31
Seg2 1 0 0 0 0 3 2 0 1.0000 = 6 / 6
Seg3 1 0 7 3 0 33 8 4 0.8750 = 49 / 56
Seg4 1 0 5 0 0 7 2 1 1.0000 = 16 / 16
Seg5 0 0 0 1 0 2 0 0 1.0000 = 3 / 3
Seg6 3 0 8 6 0 44 12 5 0.4359 = 34 / 78
Seg7 4 0 1 0 0 24 2 2 0.9394 = 31 / 33
Seg8 0 0 3 0 0 15 2 2 0.9091 = 20 / 22
Totals 11 0 28 11 0 146 34 15 0.7714 = 189 / 245
Again, the hit ratios are very poor.
h2o.hit_ratio_table(fs3Cov.class_dl, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios:
k hit_ratio
1 1 0.561538
2 2 0.749231
3 3 0.869231
4 4 0.938462
5 5 0.963077
6 6 0.976923
7 7 1.000000
8 8 1.000000
$valid
Top-8 Hit Ratios:
k hit_ratio
1 1 0.228571
2 2 0.424490
3 3 0.542857
4 4 0.718367
5 5 0.840816
6 6 0.959184
7 7 0.983673
8 8 1.000000
The diagnostic statistics table suggests the GLM model is arguably the best performer in a very dismal race: it has the lowest mean per-class error, although the random forest still has the highest hit ratio.
covH[3, 1] <- "Deep_Learning"
covH[3, 2] <- fs3Cov.class_dl@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[3, 3] <- fs3Cov.class_dl@model$validation_metrics@metrics$MSE #
covH[3, 4] <- fs3Cov.class_dl@model$validation_metrics@metrics$RMSE #
covH[3, 5] <- fs3Cov.class_dl@model$validation_metrics@metrics$ logloss
covH[3, 6] <- fs3Cov.class_dl@model$validation_metrics@metrics$ mean_per_class_error
covH
Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.2775510 0.6634014 0.8144946 3.606853 0.8715399
2 GLM 0.2653061 0.6644477 0.8151366 2.407140 0.8545338
3 Deep_Learning 0.2285714 0.6793221 0.8242100 2.218547 0.8908905
The basic gradient boosting model (an ensemble of shallow trees, not a neural network) is specified below.
fs3Cov.class_gbm<- h2o.gbm(
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=17:24, ## the predictor columns, by column index
y=2
)
The incredibly low training-sample error rate shown below is suspicious compared with the earlier models; it suggests the GBM has badly overfit the training data.
fs3Cov.class_gbm@model$training_metrics
H2OMultinomialMetrics: gbm
** Reported on training data. **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.102122
RMSE: (Extract with `h2o.rmse`) 0.3195654
Logloss: (Extract with `h2o.logloss`) 0.3592171
Mean Per-Class Error: 0.02954064
R^2: (Extract with `h2o.r2`) 0.9785701
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 73 1 0 1 0 2 1 0 0.0641 = 5 / 78
Seg2 0 12 0 0 0 0 0 0 0.0000 = 0 / 12
Seg3 1 0 146 0 0 6 0 1 0.0519 = 8 / 154
Seg4 0 0 2 49 0 1 0 0 0.0577 = 3 / 52
Seg5 0 0 0 0 9 0 0 0 0.0000 = 0 / 9
Seg6 0 0 2 1 0 201 0 1 0.0195 = 4 / 205
Seg7 0 0 0 0 0 1 70 0 0.0141 = 1 / 71
Seg8 1 0 1 0 0 0 0 67 0.0290 = 2 / 69
Totals 75 13 151 51 9 211 71 69 0.0354 = 23 / 650
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.964615
2 2 0.996923
3 3 1.000000
4 4 1.000000
5 5 1.000000
6 6 1.000000
7 7 1.000000
8 8 1.000000
However, the holdout sample error rate is about as bad as seen earlier.
h2o.confusionMatrix(fs3Cov.class_gbm, fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 0 0 12 0 0 18 1 0 1.0000 = 31 / 31
Seg2 0 0 3 1 0 2 0 0 1.0000 = 6 / 6
Seg3 3 0 21 0 0 24 2 6 0.6250 = 35 / 56
Seg4 1 0 7 0 0 7 1 0 1.0000 = 16 / 16
Seg5 0 0 1 0 0 0 2 0 1.0000 = 3 / 3
Seg6 9 1 19 4 1 31 10 3 0.6026 = 47 / 78
Seg7 4 1 7 1 0 18 1 1 0.9697 = 32 / 33
Seg8 0 0 8 1 0 9 2 2 0.9091 = 20 / 22
Totals 17 2 78 7 1 109 19 12 0.7755 = 190 / 245
h2o.hit_ratio_table(fs3Cov.class_gbm, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios:
k hit_ratio
1 1 0.964615
2 2 0.996923
3 3 1.000000
4 4 1.000000
5 5 1.000000
6 6 1.000000
7 7 1.000000
8 8 1.000000
$valid
Top-8 Hit Ratios:
k hit_ratio
1 1 0.224490
2 2 0.444898
3 3 0.587755
4 4 0.746939
5 5 0.857143
6 6 0.959184
7 7 0.987755
8 8 1.000000
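The near-perfect training fit combined with the poor holdout fit is the classic signature of overfitting. If we wanted to rein it in, one option (a sketch with illustrative, untuned settings, not run here) would be to refit the GBM with early stopping scored against the validation frame.
# Sketch only: early stopping on validation logloss should halt tree growth
# before the GBM memorizes the training set (settings are illustrative).
fs3Cov.class_gbm2 <- h2o.gbm(
  training_frame   = fs.split[[1]],
  validation_frame = fs.split[[2]],
  x = 17:24,
  y = 2,
  ntrees              = 200,
  learn_rate          = 0.05,
  stopping_rounds     = 3,
  stopping_metric     = "logloss",
  stopping_tolerance  = 1e-3,
  score_tree_interval = 5,
  seed                = 7238
)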
On mean per-class error the GLM model maintains its narrow edge over the other three models, although the random forest still has the highest hit ratio.
covH[4, 1] <- "GBM_Boosting"
covH[4, 2] <- fs3Cov.class_gbm@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[4, 3] <- fs3Cov.class_gbm@model$validation_metrics@metrics$MSE #
covH[4, 4] <- fs3Cov.class_gbm@model$validation_metrics@metrics$RMSE #
covH[4, 5] <- fs3Cov.class_gbm@model$validation_metrics@metrics$ logloss
covH[4, 6] <- fs3Cov.class_gbm@model$validation_metrics@metrics$ mean_per_class_error
covH
Prediction_model hit_ratio MSE RMSE logloss mean_per_class_error
1 Random_forest 0.2775510 0.6634014 0.8144946 3.606853 0.8715399
2 GLM 0.2653061 0.6644477 0.8151366 2.407140 0.8545338
3 Deep_Learning 0.2285714 0.6793221 0.8242100 2.218547 0.8908905
4 GBM_Boosting 0.2244898 0.6720699 0.8197987 2.361972 0.8882940
fs3Cov.class_nB <- h2o.naiveBayes(
training_frame = fs.split[[1]], ## the H2O frame for training
validation_frame = fs.split[[2]], ## the H2O frame for validation (not required)
x=17:24, ## the predictor columns, by column index
y=2,
laplace = 3)
The naive Bayes model error rate for the training sample is quite similar to the results of the other four models.
fs3Cov.class_nB@model$training_metrics
H2OMultinomialMetrics: naivebayes
** Reported on training data. **
Training Set Metrics:
=====================
Extract training frame with `h2o.getFrame("RTMP_sid_b243_23")`
MSE: (Extract with `h2o.mse`) 0.5938202
RMSE: (Extract with `h2o.rmse`) 0.7705973
Logloss: (Extract with `h2o.logloss`) 1.643498
Mean Per-Class Error: 0.8257825
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 4 0 27 0 0 44 2 1 0.9487 = 74 / 78
Seg2 0 0 1 0 0 10 0 1 1.0000 = 12 / 12
Seg3 4 0 66 1 0 79 3 1 0.5714 = 88 / 154
Seg4 2 0 8 1 0 36 3 2 0.9808 = 51 / 52
Seg5 0 0 0 0 0 9 0 0 1.0000 = 9 / 9
Seg6 3 0 39 0 0 157 3 3 0.2341 = 48 / 205
Seg7 1 0 18 0 0 48 4 0 0.9437 = 67 / 71
Seg8 0 0 18 1 0 45 0 5 0.9275 = 64 / 69
Totals 14 0 177 3 0 428 15 13 0.6354 = 413 / 650
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-8 Hit Ratios:
k hit_ratio
1 1 0.364615
2 2 0.596923
3 3 0.763077
4 4 0.853846
5 5 0.933846
6 6 0.973846
7 7 0.998462
8 8 1.000000
The holdout error rate is also about the same as that of the other models.
h2o.confusionMatrix(fs3Cov.class_nB, fs.split[[2]] )
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
Seg1 Seg2 Seg3 Seg4 Seg5 Seg6 Seg7 Seg8 Error Rate
Seg1 1 0 11 0 0 19 0 0 0.9677 = 30 / 31
Seg2 0 0 3 0 0 3 0 0 1.0000 = 6 / 6
Seg3 1 0 14 0 0 39 2 0 0.7500 = 42 / 56
Seg4 0 0 5 0 0 11 0 0 1.0000 = 16 / 16
Seg5 0 0 1 0 0 2 0 0 1.0000 = 3 / 3
Seg6 1 0 23 0 0 51 3 0 0.3462 = 27 / 78
Seg7 1 0 9 0 0 23 0 0 1.0000 = 33 / 33
Seg8 0 0 4 0 0 17 1 0 1.0000 = 22 / 22
Totals 4 0 70 0 0 165 6 0 0.7306 = 179 / 245
h2o.hit_ratio_table(fs3Cov.class_nB, train=TRUE, valid=TRUE )
$train
Top-8 Hit Ratios:
k hit_ratio
1 1 0.364615
2 2 0.596923
3 3 0.763077
4 4 0.853846
5 5 0.933846
6 6 0.973846
7 7 0.998462
8 8 1.000000
$valid
Top-8 Hit Ratios:
k hit_ratio
1 1 0.269388
2 2 0.518367
3 3 0.657143
4 4 0.779592
5 5 0.877551
6 6 0.959184
7 7 0.983673
8 8 1.000000
covH[5, 1] <- "Naive_Bayes"
covH[5, 2] <- fs3Cov.class_nB@model$validation_metrics@metrics$hit_ratio_table$hit_ratio[1] #
covH[5, 3] <- fs3Cov.class_nB@model$validation_metrics@metrics$MSE #
covH[5, 4] <- fs3Cov.class_nB@model$validation_metrics@metrics$RMSE #
covH[5, 5] <- fs3Cov.class_nB@model$validation_metrics@metrics$ logloss
covH[5, 6] <- fs3Cov.class_nB@model$validation_metrics@metrics$ mean_per_class_error
The naive Bayes model edges out the other models on several of the error statistics (MSE, RMSE and logloss), although the random forest retains the highest hit ratio. Either way, none of these hit ratios is anything to be proud of.
pacman::p_load(sjPlot)
tab_df(covH, sort.column = -2, show.rownames = FALSE, digits = 3,
title = "Statistics of five Level 2 models of the 8-segment solution")
Prediction_model | hit_ratio | MSE | RMSE | logloss | mean_per_class_error |
---|---|---|---|---|---|
Random_forest | 0.278 | 0.663 | 0.814 | 3.607 | 0.872 |
Naive_Bayes | 0.269 | 0.653 | 0.808 | 1.891 | 0.883 |
GLM | 0.265 | 0.664 | 0.815 | 2.407 | 0.855 |
Deep_Learning | 0.229 | 0.679 | 0.824 | 2.219 | 0.891 |
GBM_Boosting | 0.224 | 0.672 | 0.820 | 2.362 | 0.888 |
As the following graph shows, the random forest has the highest hit ratio, with naive Bayes close behind on that measure (and ahead on the other error statistics). In any case, the results are not remotely comparable to the numbers we obtained in the Level-1 segmentation.
pacman::p_load(reshape2)
# covH[,1:2]
hits_long<- melt(covH[,1:2] ) # need to reshape to 'long' form
# head(hits_long )
pacman::p_load(ggplot2)
ggplot(data=hits_long, aes(x= reorder(Prediction_model, -value), y=value ) ) +
geom_bar( stat="identity" , fill = "pink", width = 0.3) +
geom_point(aes( color= Prediction_model ), size=3) +
theme(axis.text.x = element_text(angle = 25, hjust = 1, size=10)) +
labs( title= "Hit ratios for predictive models using covariates" ,
y= "Proportion correct", x= "Prediction Model") +
ylim(0, 1)
Predict each respondent's segment using the fs3Cov.class_nB (naive Bayes) model.
cov.assign.hex = h2o.predict(fs3Cov.class_nB, df8C.class[,17:24])
cov.assign.hex<- as.factor(cov.assign.hex) # cluster 'names' must be factors for modeling
h2o.print(cov.assign.hex$predict, n = 10L)
h2o.table(cov.assign.hex$predict) # table of assignments over all respondents
The segments developed from the 8-segment k-means solution and those predicted based on the demographic variables are combined into a single dataframe below and then saved.
fs6.L1.L2.segs <- h2o.cbind(df8C.class[,c(1,2)], cov.assign.hex$predict, df8C.class[,3:38])
colnames(fs6.L1.L2.segs)[ c(2,3)] <- c("L1_Segments", "L2_Segments")
fs6.L1.L2.segs[1:6, 1:5]
RID L1_Segments L2_Segments Easy_Reservation Preferred_Seats
1 1 Seg2 Seg6 1 1
2 2 Seg5 Seg6 1 1
3 3 Seg2 Seg6 1 1
4 4 Seg2 Seg6 1 1
5 5 Seg5 Seg6 1 1
6 6 Seg5 Seg6 1 1
[6 rows x 5 columns]
This table shows where the two sets of assignments coincide and where they diverge: the Level-1 (k-means) segments cross-tabulated against the Level-2 (demographic-predicted) segments.
print( h2o.table(fs6.L1.L2.segs$L1_Segments, fs6.L1.L2.segs$L2_Segments), n=64L)
L1_Segments L2_Segments Counts
1 Seg1 Seg1 6
2 Seg1 Seg3 70
3 Seg1 Seg6 141
4 Seg1 Seg7 2
5 Seg1 Seg8 3
6 Seg2 Seg3 9
7 Seg2 Seg6 19
8 Seg2 Seg8 1
9 Seg3 Seg1 6
10 Seg3 Seg3 133
11 Seg3 Seg4 2
12 Seg3 Seg6 251
13 Seg3 Seg7 8
14 Seg3 Seg8 3
15 Seg4 Seg1 3
16 Seg4 Seg3 23
17 Seg4 Seg4 1
18 Seg4 Seg6 92
19 Seg4 Seg7 3
20 Seg4 Seg8 3
21 Seg5 Seg3 3
22 Seg5 Seg6 17
23 Seg5 Seg7 1
24 Seg6 Seg1 8
25 Seg6 Seg3 136
26 Seg6 Seg4 1
27 Seg6 Seg6 413
28 Seg6 Seg7 9
29 Seg6 Seg8 3
30 Seg7 Seg1 3
31 Seg7 Seg3 50
32 Seg7 Seg6 147
33 Seg7 Seg7 5
34 Seg7 Seg8 1
35 Seg8 Seg1 2
36 Seg8 Seg3 49
37 Seg8 Seg4 3
38 Seg8 Seg6 131
39 Seg8 Seg7 2
40 Seg8 Seg8 5
[40 rows x 3 columns]
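To summarize the table above in a single number (a small sketch), the overall agreement rate between the Level-1 and Level-2 assignments is the sum of the diagonal counts divided by the total.
# Sketch: proportion of respondents assigned to the same segment by both levels.
xt <- as.data.frame(h2o.table(fs6.L1.L2.segs$L1_Segments, fs6.L1.L2.segs$L2_Segments))
agreement <- sum(xt$Counts[as.character(xt$L1_Segments) == as.character(xt$L2_Segments)]) /
  sum(xt$Counts)
round(agreement, 3)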
The table below lists the mean ratings for each attitudinal variable for those in each of the 8 Level 1 segments.
Use the information below and the earlier information to determine if this 8-segment solution has good interpretability. If not, go back to one of the other segmentation solutions.
library("dplyr")
library("kableExtra")
library(psych)
library(data.table)
df_final_L1_L2 <- as.data.frame(fs6.L1.L2.segs)
df_final_L1_L2 %>%
mutate(Group = as.factor(L1_Segments)) %>%
group_by(Group) %>%
summarize(Avg_v075 = round(mean(Easy_Reservation),2),
Avg_v076 = round(mean(Preferred_Seats),2),
Avg_v077 = round(mean(Flight_Options),2),
Avg_v078 = round(mean(Ticket_Prices),2),
Avg_v079 = round(mean(Seat_Comfort),2),
Avg_v080 = round(mean(Seat_Roominess),2),
Avg_v081 = round(mean(Overhead_Storage),2),
Avg_v082 = round(mean(Clean_Aircraft),2),
Avg_v083 = round(mean(Courtesy),2),
Avg_v084 = round(mean(Friendliness),2),
Avg_v085 = round(mean(Helpfulness),2),
Avg_v086 = round(mean(Service),2),
Avg_v087 = round(mean(Satisfaction),2),
Avg_v088 = round(mean(Recommend), 2),
Count_of_Members = n()
) %>%
arrange(Group) %>%
transpose() -> cd
colnames(cd) <- cd[1,]
cd <- cd[-1,]
cd$order <- 1:nrow(cd)
rownames(cd)[1:14] <- colnames(df_final_L1_L2[,4:17])
rownames(cd)[15] <- c("Segment_Size")
# cd$variable <- rownames(cd)
cd$variable <- c(names(df_final_L1_L2)[4:17],
"Segment Size")
cd <- cd[, c(10, 1:9)]
cd[,2:9] <- lapply(cd[,2:9], function(x) as.numeric(as.character(x)))
cd[1:14, 1:9] %>%
arrange(variable ) %>%
mutate_if(is.numeric, function(x) {
cell_spec(x, bold = T,
color = spec_color(x, end = 0.9),
font_size = spec_font_size(x))
}) %>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = T, position = "left")
variable | Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 |
---|---|---|---|---|---|---|---|---|
Clean_Aircraft | 6.92 | 1.52 | 8.52 | 6.61 | 2.95 | 8.41 | 7.75 | 7.59 |
Courtesy | 8.56 | 2.03 | 8.6 | 7.41 | 3.43 | 8.68 | 8.63 | 7.72 |
Easy_Reservation | 8.47 | 1.86 | 8.26 | 7.41 | 3.71 | 8.79 | 7.37 | 8.6 |
Flight_Options | 7.53 | 1.38 | 6.09 | 5.69 | 1.86 | 8.28 | 5.53 | 7.8 |
Friendliness | 8.57 | 2.28 | 8.56 | 7.09 | 3.48 | 8.71 | 8.57 | 7.14 |
Helpfulness | 8.5 | 2.14 | 8.47 | 6.47 | 4.14 | 8.75 | 8.38 | 7.05 |
Overhead_Storage | 5.7 | 1.45 | 8.01 | 5.26 | 2.14 | 8.17 | 6.86 | 6.83 |
Preferred_Seats | 7.45 | 1.34 | 7.09 | 6.18 | 1.62 | 8.25 | 5.31 | 7.44 |
Recommend | 6.42 | 1.38 | 7.52 | 5.33 | 1.86 | 8.28 | 5.53 | 7.51 |
Satisfaction | 7.84 | 1.59 | 8.24 | 6.14 | 2.76 | 8.6 | 7.82 | 7.24 |
Seat_Comfort | 6.71 | 1.86 | 8.48 | 6.11 | 3.19 | 8.64 | 7.91 | 7.94 |
Seat_Roominess | 5.96 | 1.48 | 8.35 | 6.45 | 2.71 | 8.34 | 7.35 | 7.42 |
Service | 8.43 | 1.9 | 8.6 | 6.22 | 4.1 | 8.74 | 8.54 | 6.7 |
Ticket_Prices | 7.14 | 1.17 | 6.39 | 5.05 | 1.52 | 8.02 | 5.11 | 7.09 |
There is not much that can be inferred from this plot. Some already-known information is reflected in it: the correspondence analysis places segments 2 and 5, which had the lowest scores on the attitudinal variables, close together, as it does segments 6 and 8, which had among the best scores. Beyond that, it shows the closeness of the service-related variables (e.g. Helpfulness, Friendliness, Courtesy), but it does not give us much additional profiling information about the segments.
library(FactoMineR)
cd.m <- cd[1:14, 2:9]
c <- CA(cd.m, graph=FALSE)
plot(c, title="Correspondence Analysis of Attributes and Segments", col.main="blue" )
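As a quick check on how faithful this two-dimensional map is (a sketch; not part of the original output), the share of inertia captured by the first two CA dimensions can be read from FactoMineR's eig table.
# Sketch: eigenvalues and % of inertia for the first two CA dimensions.
round(c$eig[1:2, ], 2)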
The several tables that follow provide the row and column percents for each level of each demographic for the 8 Level 1 segments. While prediction using the demographic variables has been shown to be very poor, there may be some insights that could help to better describe the segments.
The Chi-square statistics beneath each table provide some indication of whether the table shows any significant relationship between the variables and the segments.
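For reference, the same kind of test can be reproduced directly in base R for any of the demographics; a minimal sketch for Language is shown here (chisq.test may warn about small expected counts in the tiny segments).
# Sketch: base-R Chi-square test of Language against the Level-1 segments.
chisq.test(table(df_final_L1_L2$Language, df_final_L1_L2$L1_Segments))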
The majority of customers are English-speaking, and the second most common language is Spanish. Within each language, the two most populated segments are segments 3 and 6, which are also the most populated segments overall. Not much insight here.
library(sjPlot)
sjt.xtab(df_final_L1_L2$Language, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Language | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
English |
117 11.7 % 57.4 % |
16 1.6 % 57.1 % |
227 22.8 % 60.4 % |
76 7.6 % 66.1 % |
14 1.4 % 73.7 % |
315 31.6 % 61.6 % |
122 12.2 % 67 % |
110 11 % 63.2 % |
997 100 % 62 % |
French |
27 14.6 % 13.2 % |
6 3.2 % 21.4 % |
41 22.2 % 10.9 % |
11 5.9 % 9.6 % |
2 1.1 % 10.5 % |
69 37.3 % 13.5 % |
17 9.2 % 9.3 % |
12 6.5 % 6.9 % |
185 100 % 11.5 % |
German |
1 11.1 % 0.5 % |
0 0 % 0 % |
1 11.1 % 0.3 % |
2 22.2 % 1.7 % |
1 11.1 % 5.3 % |
0 0 % 0 % |
2 22.2 % 1.1 % |
2 22.2 % 1.1 % |
9 100 % 0.6 % |
Italian |
14 18.7 % 6.9 % |
0 0 % 0 % |
20 26.7 % 5.3 % |
4 5.3 % 3.5 % |
0 0 % 0 % |
25 33.3 % 4.9 % |
4 5.3 % 2.2 % |
8 10.7 % 4.6 % |
75 100 % 4.7 % |
Spanish |
45 13.1 % 22.1 % |
6 1.7 % 21.4 % |
87 25.4 % 23.1 % |
22 6.4 % 19.1 % |
2 0.6 % 10.5 % |
102 29.7 % 20 % |
37 10.8 % 20.3 % |
42 12.2 % 24.1 % |
343 100 % 21.3 % |
Total |
204 12.7 % 100 % |
28 1.7 % 100 % |
376 23.4 % 100 % |
115 7.1 % 100 % |
19 1.2 % 100 % |
511 31.8 % 100 % |
182 11.3 % 100 % |
174 10.8 % 100 % |
1609 100 % 100 % |
χ2=38.233 · df=28 · Cramer’s V=0.077 · Fisher’s p=0.119 |
About 80 percent of customers are non-smokers, and the proportion is similar across segments.
sjt.xtab(df_final_L1_L2$Smoker, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Smoker | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
No |
165 12.7 % 83.8 % |
24 1.8 % 92.3 % |
293 22.5 % 79.4 % |
94 7.2 % 83.2 % |
15 1.2 % 71.4 % |
422 32.4 % 81.3 % |
142 10.9 % 75.9 % |
148 11.4 % 80.9 % |
1303 100 % 80.7 % |
Yes |
32 10.3 % 16.2 % |
2 0.6 % 7.7 % |
76 24.4 % 20.6 % |
19 6.1 % 16.8 % |
6 1.9 % 28.6 % |
97 31.1 % 18.7 % |
45 14.4 % 24.1 % |
35 11.2 % 19.1 % |
312 100 % 19.3 % |
Total |
197 12.2 % 100 % |
26 1.6 % 100 % |
369 22.8 % 100 % |
113 7 % 100 % |
21 1.3 % 100 % |
519 32.1 % 100 % |
187 11.6 % 100 % |
183 11.3 % 100 % |
1615 100 % 100 % |
χ2=8.282 · df=7 · Cramer’s V=0.072 · Fisher’s p=0.304 |
The Chi-square test below shows no significant relationship between employment type and the segments. Most segments are split nearly evenly across the three employment categories, with the exception, again, of segments 2 and 5, which have very small sample sizes.
sjt.xtab(df_final_L1_L2$Employment, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Employment | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
Employed part-time |
69 12.8 % 33.7 % |
7 1.3 % 25 % |
118 21.9 % 31.8 % |
37 6.9 % 32.5 % |
10 1.9 % 52.6 % |
178 33.1 % 33.8 % |
69 12.8 % 35.6 % |
50 9.3 % 29.6 % |
538 100 % 33.1 % |
Fully employed |
73 13 % 35.6 % |
12 2.1 % 42.9 % |
123 21.9 % 33.2 % |
37 6.6 % 32.5 % |
3 0.5 % 15.8 % |
188 33.5 % 35.7 % |
65 11.6 % 33.5 % |
61 10.9 % 36.1 % |
562 100 % 34.5 % |
Not employed |
63 12 % 30.7 % |
9 1.7 % 32.1 % |
130 24.7 % 35 % |
40 7.6 % 35.1 % |
6 1.1 % 31.6 % |
161 30.6 % 30.6 % |
60 11.4 % 30.9 % |
58 11 % 34.3 % |
527 100 % 32.4 % |
Total |
205 12.6 % 100 % |
28 1.7 % 100 % |
371 22.8 % 100 % |
114 7 % 100 % |
19 1.2 % 100 % |
527 32.4 % 100 % |
194 11.9 % 100 % |
169 10.4 % 100 % |
1627 100 % 100 % |
χ2=9.428 · df=14 · Cramer’s V=0.054 · p=0.803 |
sjt.xtab(df_final_L1_L2$Education, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Education | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
Bachelor’s degree |
35 13.9 % 17.3 % |
7 2.8 % 28 % |
65 25.8 % 17.5 % |
14 5.6 % 11.8 % |
1 0.4 % 5.3 % |
71 28.2 % 13.4 % |
31 12.3 % 16.7 % |
28 11.1 % 16.4 % |
252 100 % 15.5 % |
College diploma |
38 12.2 % 18.8 % |
3 1 % 12 % |
67 21.5 % 18.1 % |
32 10.3 % 26.9 % |
3 1 % 15.8 % |
103 33.1 % 19.5 % |
35 11.3 % 18.8 % |
30 9.6 % 17.5 % |
311 100 % 19.2 % |
Earned doctorate |
0 0 % 0 % |
0 0 % 0 % |
2 25 % 0.5 % |
2 25 % 1.7 % |
0 0 % 0 % |
2 25 % 0.4 % |
1 12.5 % 0.5 % |
1 12.5 % 0.6 % |
8 100 % 0.5 % |
High school diploma or certificate |
50 11.6 % 24.8 % |
5 1.2 % 20 % |
105 24.4 % 28.3 % |
24 5.6 % 20.2 % |
6 1.4 % 31.6 % |
141 32.8 % 26.7 % |
56 13 % 30.1 % |
43 10 % 25.1 % |
430 100 % 26.5 % |
Master’s degree |
10 12.7 % 5 % |
0 0 % 0 % |
19 24.1 % 5.1 % |
8 10.1 % 6.7 % |
1 1.3 % 5.3 % |
22 27.8 % 4.2 % |
9 11.4 % 4.8 % |
10 12.7 % 5.8 % |
79 100 % 4.9 % |
Medical degree |
1 7.1 % 0.5 % |
1 7.1 % 4 % |
4 28.6 % 1.1 % |
1 7.1 % 0.8 % |
0 0 % 0 % |
4 28.6 % 0.8 % |
2 14.3 % 1.1 % |
1 7.1 % 0.6 % |
14 100 % 0.9 % |
No certificate, diploma or degree |
46 14.3 % 22.8 % |
4 1.2 % 16 % |
65 20.2 % 17.5 % |
26 8.1 % 21.8 % |
3 0.9 % 15.8 % |
117 36.3 % 22.2 % |
32 9.9 % 17.2 % |
29 9 % 17 % |
322 100 % 19.9 % |
trades certificate |
13 9.5 % 6.4 % |
3 2.2 % 12 % |
31 22.6 % 8.4 % |
8 5.8 % 6.7 % |
1 0.7 % 5.3 % |
48 35 % 9.1 % |
11 8 % 5.9 % |
22 16.1 % 12.9 % |
137 100 % 8.5 % |
University certificate above bachelor’s degree |
3 13.6 % 1.5 % |
2 9.1 % 8 % |
3 13.6 % 0.8 % |
1 4.5 % 0.8 % |
2 9.1 % 10.5 % |
7 31.8 % 1.3 % |
3 13.6 % 1.6 % |
1 4.5 % 0.6 % |
22 100 % 1.4 % |
University certificate below bachelor level |
6 13 % 3 % |
0 0 % 0 % |
10 21.7 % 2.7 % |
3 6.5 % 2.5 % |
2 4.3 % 10.5 % |
13 28.3 % 2.5 % |
6 13 % 3.2 % |
6 13 % 3.5 % |
46 100 % 2.8 % |
Total |
202 12.5 % 100 % |
25 1.5 % 100 % |
371 22.9 % 100 % |
119 7.3 % 100 % |
19 1.2 % 100 % |
528 32.6 % 100 % |
186 11.5 % 100 % |
171 10.5 % 100 % |
1621 100 % 100 % |
χ2=68.773 · df=63 · Cramer’s V=0.078 · Fisher’s p=0.482 |
sjt.xtab(df_final_L1_L2$Marital, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Marital | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
Divorced, not living with a life partner |
12 11.7 % 5.9 % |
1 1 % 3.6 % |
24 23.3 % 6.4 % |
7 6.8 % 5.8 % |
0 0 % 0 % |
33 32 % 6.3 % |
10 9.7 % 5.3 % |
16 15.5 % 9.1 % |
103 100 % 6.3 % |
Living common law |
51 11.6 % 25.2 % |
9 2.1 % 32.1 % |
81 18.5 % 21.7 % |
42 9.6 % 35 % |
8 1.8 % 44.4 % |
154 35.2 % 29.4 % |
54 12.3 % 28.9 % |
39 8.9 % 22.3 % |
438 100 % 26.9 % |
Married |
100 13.4 % 49.5 % |
13 1.7 % 46.4 % |
192 25.8 % 51.3 % |
44 5.9 % 36.7 % |
9 1.2 % 50 % |
218 29.3 % 41.7 % |
88 11.8 % 47.1 % |
81 10.9 % 46.3 % |
745 100 % 45.8 % |
Separated, not living with a life partner |
12 11.4 % 5.9 % |
1 1 % 3.6 % |
24 22.9 % 6.4 % |
8 7.6 % 6.7 % |
0 0 % 0 % |
37 35.2 % 7.1 % |
13 12.4 % 7 % |
10 9.5 % 5.7 % |
105 100 % 6.5 % |
Single, never married, not living with a life partner |
23 12 % 11.4 % |
4 2.1 % 14.3 % |
43 22.4 % 11.5 % |
16 8.3 % 13.3 % |
1 0.5 % 5.6 % |
64 33.3 % 12.2 % |
17 8.9 % 9.1 % |
24 12.5 % 13.7 % |
192 100 % 11.8 % |
Widowed |
4 9.1 % 2 % |
0 0 % 0 % |
10 22.7 % 2.7 % |
3 6.8 % 2.5 % |
0 0 % 0 % |
17 38.6 % 3.3 % |
5 11.4 % 2.7 % |
5 11.4 % 2.9 % |
44 100 % 2.7 % |
Total |
202 12.4 % 100 % |
28 1.7 % 100 % |
374 23 % 100 % |
120 7.4 % 100 % |
18 1.1 % 100 % |
523 32.1 % 100 % |
187 11.5 % 100 % |
175 10.8 % 100 % |
1627 100 % 100 % |
χ2=30.855 · df=35 · Cramer’s V=0.062 · Fisher’s p=0.774 |
The male-to-female split is close to 1:1 in almost every segment, but in segment 5 it is roughly 1:2 (6 men to 12 women), although the sample size for this segment is very small (only 18 respondents in this table).
sjt.xtab(df_final_L1_L2$Sex, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Sex | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
Female |
106 12.8 % 51 % |
15 1.8 % 55.6 % |
193 23.3 % 52.2 % |
57 6.9 % 49.6 % |
12 1.4 % 66.7 % |
257 31 % 49.8 % |
95 11.5 % 49.2 % |
93 11.2 % 54.7 % |
828 100 % 51.2 % |
Male |
102 12.9 % 49 % |
12 1.5 % 44.4 % |
177 22.4 % 47.8 % |
58 7.4 % 50.4 % |
6 0.8 % 33.3 % |
259 32.8 % 50.2 % |
98 12.4 % 50.8 % |
77 9.8 % 45.3 % |
789 100 % 48.8 % |
Total |
208 12.9 % 100 % |
27 1.7 % 100 % |
370 22.9 % 100 % |
115 7.1 % 100 % |
18 1.1 % 100 % |
516 31.9 % 100 % |
193 11.9 % 100 % |
170 10.5 % 100 % |
1617 100 % 100 % |
χ2=3.733 · df=7 · Cramer’s V=0.048 · p=0.810 |
Income is roughly bell-shaped across the brackets in almost every segment, centred around the $60,000 to $69,999 range.
levels(df_final_L1_L2$Income) <- c("Less than $20,000",
"$20000 to $29999",
"$30000 to $39999","$40000 to $49999",
"$50000 to $59999",
"$60000 to $69999",
"$70000 to $79999",
"$80000 to $89999",
"$90000 to $99999",
"$100000 to $149999",
"$150,000 to $174,999",
"$175,000 to $199,999",
"$200,000 or more")
sjt.xtab(df_final_L1_L2$Income, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Income | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
Less than $20,000 |
6 11.3 % 3 % |
0 0 % 0 % |
17 32.1 % 4.6 % |
7 13.2 % 6 % |
1 1.9 % 4.8 % |
15 28.3 % 2.9 % |
4 7.5 % 2.1 % |
3 5.7 % 1.7 % |
53 100 % 3.2 % |
$20000 to $29999 |
3 6 % 1.5 % |
1 2 % 3.7 % |
7 14 % 1.9 % |
3 6 % 2.6 % |
1 2 % 4.8 % |
21 42 % 4 % |
7 14 % 3.6 % |
7 14 % 3.9 % |
50 100 % 3.1 % |
$30000 to $39999 |
7 10.1 % 3.5 % |
1 1.4 % 3.7 % |
21 30.4 % 5.6 % |
5 7.2 % 4.3 % |
1 1.4 % 4.8 % |
22 31.9 % 4.2 % |
7 10.1 % 3.6 % |
5 7.2 % 2.8 % |
69 100 % 4.2 % |
$40000 to $49999 |
5 11.6 % 2.5 % |
0 0 % 0 % |
12 27.9 % 3.2 % |
0 0 % 0 % |
2 4.7 % 9.5 % |
15 34.9 % 2.9 % |
6 14 % 3.1 % |
3 7 % 1.7 % |
43 100 % 2.6 % |
$50000 to $59999 |
35 14.6 % 17.3 % |
4 1.7 % 14.8 % |
56 23.4 % 15 % |
16 6.7 % 13.7 % |
3 1.3 % 14.3 % |
72 30.1 % 13.7 % |
22 9.2 % 11.3 % |
31 13 % 17.4 % |
239 100 % 14.6 % |
$60000 to $69999 |
27 9.9 % 13.4 % |
5 1.8 % 18.5 % |
56 20.4 % 15 % |
16 5.8 % 13.7 % |
6 2.2 % 28.6 % |
93 33.9 % 17.7 % |
38 13.9 % 19.5 % |
33 12 % 18.5 % |
274 100 % 16.7 % |
$70000 to $79999 |
16 7.9 % 7.9 % |
4 2 % 14.8 % |
49 24.1 % 13.1 % |
23 11.3 % 19.7 % |
3 1.5 % 14.3 % |
57 28.1 % 10.8 % |
25 12.3 % 12.8 % |
26 12.8 % 14.6 % |
203 100 % 12.4 % |
$80000 to $89999 |
21 11.5 % 10.4 % |
1 0.5 % 3.7 % |
35 19.2 % 9.4 % |
11 6 % 9.4 % |
1 0.5 % 4.8 % |
66 36.3 % 12.5 % |
26 14.3 % 13.3 % |
21 11.5 % 11.8 % |
182 100 % 11.1 % |
$90000 to $99999 |
25 14.2 % 12.4 % |
5 2.8 % 18.5 % |
44 25 % 11.8 % |
12 6.8 % 10.3 % |
0 0 % 0 % |
49 27.8 % 9.3 % |
21 11.9 % 10.8 % |
20 11.4 % 11.2 % |
176 100 % 10.7 % |
$100000 to $149999 |
17 15 % 8.4 % |
2 1.8 % 7.4 % |
21 18.6 % 5.6 % |
10 8.8 % 8.5 % |
1 0.9 % 4.8 % |
40 35.4 % 7.6 % |
17 15 % 8.7 % |
5 4.4 % 2.8 % |
113 100 % 6.9 % |
$150,000 to $174,999 |
17 17.3 % 8.4 % |
2 2 % 7.4 % |
22 22.4 % 5.9 % |
6 6.1 % 5.1 % |
1 1 % 4.8 % |
29 29.6 % 5.5 % |
11 11.2 % 5.6 % |
10 10.2 % 5.6 % |
98 100 % 6 % |
$175,000 to $199,999 |
10 15.6 % 5 % |
1 1.6 % 3.7 % |
12 18.8 % 3.2 % |
5 7.8 % 4.3 % |
0 0 % 0 % |
24 37.5 % 4.6 % |
4 6.2 % 2.1 % |
8 12.5 % 4.5 % |
64 100 % 3.9 % |
$200,000 or more |
13 17.3 % 6.4 % |
1 1.3 % 3.7 % |
21 28 % 5.6 % |
3 4 % 2.6 % |
1 1.3 % 4.8 % |
23 30.7 % 4.4 % |
7 9.3 % 3.6 % |
6 8 % 3.4 % |
75 100 % 4.6 % |
Total |
202 12.3 % 100 % |
27 1.6 % 100 % |
373 22.8 % 100 % |
117 7.1 % 100 % |
21 1.3 % 100 % |
526 32.1 % 100 % |
195 11.9 % 100 % |
178 10.9 % 100 % |
1639 100 % 100 % |
χ2=76.727 · df=84 · Cramer’s V=0.082 · Fisher’s p=0.606 |
sjt.xtab(df_final_L1_L2$Age, df_final_L1_L2$L1_Segments,
show.row.prc = TRUE, show.col.prc = TRUE)
Age | L1_Segments | Total | |||||||
---|---|---|---|---|---|---|---|---|---|
Seg1 | Seg2 | Seg3 | Seg4 | Seg5 | Seg6 | Seg7 | Seg8 | ||
24 or younger |
32 12.1 % 15.6 % |
8 3 % 28.6 % |
62 23.4 % 16.7 % |
19 7.2 % 16.2 % |
5 1.9 % 23.8 % |
90 34 % 17.3 % |
24 9.1 % 13 % |
25 9.4 % 14.2 % |
265 100 % 16.3 % |
25 to 34 |
9 9.7 % 4.4 % |
1 1.1 % 3.6 % |
22 23.7 % 5.9 % |
6 6.5 % 5.1 % |
3 3.2 % 14.3 % |
30 32.3 % 5.8 % |
9 9.7 % 4.9 % |
13 14 % 7.4 % |
93 100 % 5.7 % |
35 to 44 |
29 14.4 % 14.1 % |
5 2.5 % 17.9 % |
49 24.4 % 13.2 % |
11 5.5 % 9.4 % |
1 0.5 % 4.8 % |
56 27.9 % 10.8 % |
27 13.4 % 14.6 % |
23 11.4 % 13.1 % |
201 100 % 12.4 % |
45 to 54 |
34 14.5 % 16.6 % |
3 1.3 % 10.7 % |
52 22.1 % 14 % |
18 7.7 % 15.4 % |
5 2.1 % 23.8 % |
75 31.9 % 14.5 % |
25 10.6 % 13.5 % |
23 9.8 % 13.1 % |
235 100 % 14.5 % |
55 to 64 |
27 12.4 % 13.2 % |
3 1.4 % 10.7 % |
45 20.6 % 12.1 % |
14 6.4 % 12 % |
2 0.9 % 9.5 % |
72 33 % 13.9 % |
31 14.2 % 16.8 % |
24 11 % 13.6 % |
218 100 % 13.4 % |
65 to 74 |
36 10.9 % 17.6 % |
6 1.8 % 21.4 % |
75 22.7 % 20.2 % |
28 8.5 % 23.9 % |
3 0.9 % 14.3 % |
98 29.7 % 18.9 % |
43 13 % 23.2 % |
41 12.4 % 23.3 % |
330 100 % 20.3 % |
75 to 84 |
32 15.5 % 15.6 % |
2 1 % 7.1 % |
48 23.3 % 12.9 % |
14 6.8 % 12 % |
2 1 % 9.5 % |
70 34 % 13.5 % |
19 9.2 % 10.3 % |
19 9.2 % 10.8 % |
206 100 % 12.7 % |
85 to 94 |
6 8.3 % 2.9 % |
0 0 % 0 % |
17 23.6 % 4.6 % |
7 9.7 % 6 % |
0 0 % 0 % |
27 37.5 % 5.2 % |
7 9.7 % 3.8 % |
8 11.1 % 4.5 % |
72 100 % 4.4 % |
95 or older |
0 0 % 0 % |
0 0 % 0 % |
1 50 % 0.3 % |
0 0 % 0 % |
0 0 % 0 % |
1 50 % 0.2 % |
0 0 % 0 % |
0 0 % 0 % |
2 100 % 0.1 % |
Total |
205 12.6 % 100 % |
28 1.7 % 100 % |
371 22.9 % 100 % |
117 7.2 % 100 % |
21 1.3 % 100 % |
519 32 % 100 % |
185 11.4 % 100 % |
176 10.9 % 100 % |
1622 100 % 100 % |
χ2=34.961 · df=56 · Cramer’s V=0.055 · Fisher’s p=0.991 |
Here we plot the distribution of customers' age categories within each segment.
ggplot(df_final_L1_L2, aes(x = Age, fill = L1_Segments)) +
  geom_bar() +
  # geom_area(aes(y = ..density..), adjust = 1, alpha = 0.5) +   # alternative density view
  facet_grid(L1_Segments ~ .)
Segment 6: the majority have incomes above $60k and are either married or living with a common-law partner; there is no particular pattern in education.
Segment 5: the majority are employed part-time, a higher share than in any other segment, and women outnumber men.
The demographic variables were not informative for profiling the segments, so we will use the Level-1 attitudinal variables to define a marketing strategy.
It is important to identify what drove the "Satisfaction" and "Recommend" ratings in each segment. In segment 7, for example, the service-quality ratings were near the top, but perceptions of the flight-options attributes were only middling, and that appears to be what pulled down customers' overall perception of the airline; in other words, this segment is especially sensitive to flight options.
There are similar sensitivities in other segments: segments 1 and 4 gave low ratings to Overhead_Storage, and segment 7 was also more concerned about ticket prices and flight options.
A simple marketing strategy would be to start with our best customer segments, namely segments 3, 6 and 8. These are the customers who seemed most comfortable with the airline and most eager to come back, and they are also among the most populated segments, which is good news for us.
Beyond that, we can try to address the "sensitivities" of the other segments described above and speak to them directly in the advertising aimed at those groups.
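To make these sensitivities easier to act on, the following sketch (reusing the cd table of segment means built earlier in this report) lists each segment's three lowest-rated attributes.
# Sketch: for each Level-1 segment, the three attitudinal attributes with the
# lowest mean rating -- a quick summary of each segment's 'sensitivities'.
seg_means <- cd[1:14, 2:9]   # rows = attributes, columns = Seg1..Seg8
sapply(seg_means, function(col) rownames(seg_means)[order(col)][1:3])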