Assignment 1

2.1 Reading the Ice Cream data set

The first step is to load the data and required packages.

library(dplyr)
library(psych)
library(DT)
library(tidyverse)
ic<- read.csv("./IceCream_raw.csv")
#ic<- read.csv("http://bus-sawtooth.mcmaster.ca/M733_ONLINE_F2020/IceCream_raw.csv")

and take a glance at what it offers.

headTail(ic) %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

2.2 Reducing the raw data to essential variables

In this analysis we are only focusing on the Dreyer’s brand. Based on the questionnaires we only need to analyze certain columns.

d_names <- c(grep("^D.$", names(ic), value = TRUE), "D10")
q11 <- grep("^Q11_", names(ic), value = TRUE)
q11_1 <- grep("^Q11_1_", q11, value = TRUE) ##Only for the Dreyer's brand
s7 <- grep("^S7_", names(ic), value = TRUE)
s8 <- grep("^S8_", names(ic), value = TRUE)
col_names <- c("ID",s7, s8, "Q1_1","Q2_1","Q3_1", q11_1, d_names)

The data and its selected variables are shown below. S7 and S8 are chosen from the screener questions to identify the participants who have heard of our brand or have purchased it in the past 6 months. The behavioral questions needed for this analysis (Q1, Q2, Q3 and Q11) are also selected for the Dreyer’s brand. Each question name contains a number that identifies the corresponding ice cream brand. The number assigned to the Dreyer’s brand is 1, so we choose Q1_1, Q2_1, etc.

ic_sub <- ic[col_names]
headTail(ic_sub,5,5) %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 12, scrollX=T))
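
As a small illustration of the name-based selection above, the sketch below runs the same kind of `grep()` patterns on a made-up data frame. The column names in `toy` (including `DX`) are hypothetical stand-ins, not the real survey columns. It also shows that `.` in `^D.$` matches any single character, so `^D[0-9]$` is a slightly stricter alternative if only digits are expected.

```r
# Toy data frame with hypothetical column names that mimic the survey layout
toy <- data.frame(ID = 1:3, S7_1 = 1, S7_2 = 0, S8_1 = 1,
                  Q11_1_1 = 4, Q11_1_2 = 5, Q11_2_1 = 3,
                  D1 = 1, D2 = 2, D10 = 5, DX = 9)

# "^D.$" matches "D" followed by exactly one character of any kind
loose <- grep("^D.$", names(toy), value = TRUE)

# "^D[0-9]$" only accepts a single digit after "D"
strict <- grep("^D[0-9]$", names(toy), value = TRUE)

# Perception items for brand 1 share the prefix "Q11_1_"
q11_brand1 <- grep("^Q11_1_", names(toy), value = TRUE)

print(loose)
print(strict)
print(q11_brand1)
```

Neither pattern picks up `D10`, which is why it is appended by hand in the selection code.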

2.3 Filtering the participants

We start by inspecting the Dreyer’s brand. Many NA values are present in the S8 questions. Looking at the S7 question (having heard of this brand) shows that these are the people who have not heard of the brand, and they should therefore be omitted from this analysis.

Heard of Dreyer’s Brand
S7_1	Freq
0	247
1	354
NA	0

Purchased Dreyer’s brand in last 6 months
S8_1	Freq
0	234
1	120
NA	247

The following table shows the participants (rows) whose S8_1 value is missing. It also shows that these are the people who have not heard of our brand (S7_1 == 0).

ic[is.na(ic["S8_1"]),c("ID", "S7_1", "S8_1","Q1_1","Q2_1","Q3_1" )] %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))
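
The same relationship can be checked without scrolling through the table. The sketch below uses a small made-up data frame (the values are hypothetical) and cross-tabulates awareness against whether S8_1 is missing.

```r
# Hypothetical responses: S8_1 is only asked when S7_1 == 1
toy <- data.frame(S7_1 = c(0, 0, 1, 1, 1, 0),
                  S8_1 = c(NA, NA, 0, 1, 1, NA))

# Cross-tabulate awareness against "S8_1 is missing"
xt <- table(heard = toy$S7_1, s8_missing = is.na(toy$S8_1))
print(xt)

# TRUE when every missing S8_1 belongs to a respondent with S7_1 == 0
skip_pattern_ok <- all(toy$S7_1[is.na(toy$S8_1)] == 0)
print(skip_pattern_ok)
```

If the skip pattern holds, the off-diagonal cells of the cross-tab are zero.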

2.3.1 Filtering rows

The variable S7_1 has no missing values, so it can reliably identify the people who have not heard of, and therefore not purchased, the Dreyer’s brand. We exclude these people from the analysis and save the remaining subset to ic_sub_1.

ic_sub_1 <- ic_sub %>% filter(!(S7_1 == 0))

Checking the S7_1 and S8_1 in the new filtered data

sjmisc::frq(ic_sub_1$S7_1, out = 'v', title = "Heard of Dreyer's Brand (S7_1)")

Heard of Dreyer’s Brand (S7_1)
val	label	frq	raw.prc	valid.prc	cum.prc
1		354	100	100	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=1.00 · σ=0.00

sjmisc::frq(ic_sub_1$S8_1, out = 'v', title = "Purchased Dreyer's brand in last 6 months (S8_1)")

Purchased Dreyer’s brand in last 6 months (S8_1)
val	label	frq	raw.prc	valid.prc	cum.prc
0		234	66.1	66.1	66.1
1		120	33.9	33.9	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=0.34 · σ=0.47

You can see that there are no missing values anymore in these two variables.

2.4 Missing values analyses

We filtered our subset of the data for the Dreyer’s brand analysis. Now let’s look at missing values in this subset.

options(max.print = 100000)
library(inspectdf)
ic_sub_1 %>% inspect_na()  %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

The following figures show the pattern of the missing data.

library(VIM)
ic_sub_1[, c(2:47)] %>% aggr(col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))


 Variables sorted by number of missings: 
 Variable      Count
     Q1_1 0.66101695
     Q2_1 0.66101695
     Q3_1 0.66101695
     S8_3 0.45762712
  Q11_1_1 0.28248588
  Q11_1_2 0.28248588
  Q11_1_3 0.28248588
  Q11_1_4 0.28248588
  Q11_1_7 0.28248588
  Q11_1_8 0.28248588
  Q11_1_9 0.28248588
 Q11_1_10 0.28248588
 Q11_1_11 0.28248588
 Q11_1_13 0.28248588
 Q11_1_15 0.28248588
  Q11_1_5 0.27966102
  Q11_1_6 0.27966102
 Q11_1_12 0.27966102
 Q11_1_14 0.27966102
 Q11_1_16 0.27966102
 Q11_1_17 0.27966102
     S8_2 0.25423729
     S8_7 0.17796610
     S8_4 0.03389831
     S8_5 0.01412429
     S8_6 0.01412429
     S7_1 0.00000000
     S7_2 0.00000000
     S7_3 0.00000000
     S7_4 0.00000000
     S7_5 0.00000000
     S7_6 0.00000000
     S7_7 0.00000000
    S7_99 0.00000000
     S8_1 0.00000000
    S8_99 0.00000000
       D1 0.00000000
       D2 0.00000000
       D3 0.00000000
       D4 0.00000000
       D5 0.00000000
       D6 0.00000000
       D7 0.00000000
       D8 0.00000000
       D9 0.00000000
      D10 0.00000000

First the S7 and S8 for other brands are removed. Now we look at the remaining variables and their missing counts.

ic_sub_2 <- ic_sub_1[c("ID", "S7_1", "S8_1", "Q1_1", "Q2_1", "Q3_1", q11_1, d_names)]
library(inspectdf)
ic_sub_2 %>% inspect_na %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))
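
For a quick check without extra packages, the share of missing values per column can also be computed in base R. This is a minimal sketch on a made-up data frame; `toy` and its columns are hypothetical.

```r
# Hypothetical data frame with different amounts of missingness per column
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(1, 2, 3, 4),
                  c = c(NA, NA, NA, 1))

# is.na() gives a logical matrix; column means are the missing shares
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(na_share, 2))
```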

The demographic variables have no missing values, and neither do S7_1 and S8_1. About 28% of the values are missing in the Q11 questions and 66% in Q1-Q3. The latter makes sense: those respondents did not purchase the product in the last 6 months and were therefore never asked Q1-Q3. These questions are missing in 234 rows, the same number of people who report not having purchased the brand in S8_1. The code below confirms this count.

ic_sub_2[c("ID", "S7_1", "S8_1", "Q1_1", "Q2_1", "Q3_1")] %>% filter(S8_1 ==0) %>% nrow()

[1] 234

Therefore, the number of missing values in Q1-Q3 equals the number of people who have not purchased the brand in the last 6 months. These values are missing by design and should not be imputed.
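
Matching counts are suggestive, but a row-by-row comparison is a stronger check that the missing answers belong to exactly the non-purchasers. The sketch below shows the idea on made-up values; `toy` is hypothetical.

```r
# Hypothetical rows: Q1_1 is only answered by purchasers (S8_1 == 1)
toy <- data.frame(S8_1 = c(0, 1, 0, 1, 1),
                  Q1_1 = c(NA, 6, NA, 7, 5))

# Same number of missing answers as non-purchasers
same_count <- sum(is.na(toy$Q1_1)) == sum(toy$S8_1 == 0)

# Stronger: the missing answers sit on exactly the non-purchaser rows
same_rows <- identical(is.na(toy$Q1_1), toy$S8_1 == 0)

print(c(same_count = same_count, same_rows = same_rows))
```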

2.4.1 Imputing the valid variables using mice

The only valid variables that require imputation are the seventeen Q11s. So only this subset of our “ic_sub_2” is passed to ‘mice’ function. The following code uses ‘mice’ to generate m=5 new imputed data sets using the Random Forest Imputation method.

library(mice) 
set.seed(456)
ic_to_impute <- ic_sub_2[q11_1]
tempData <- mice(ic_to_impute, m=5, maxit=50, meth='rf', seed=500, print=FALSE) # This uses 'mice' to generate m=5 new imputed datasets using the 'rf' method and '500' as a seed to get the process initiated.

Taking a look at the imputed subset, it can be seen that there is no missing value anymore.

compData <- mice::complete(tempData, 1)
headTail(compData, 20)  %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

# No missing point!
compData %>%  inspect_na() %>%  datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

2.4.2 Using the ‘sjmisc’ package to merge the 5 datasets into 1

The “merge_imputations()” function from the ‘sjmisc’ package merges the 5 imputed data sets into one with the original number of rows, i.e., not in long format and depicts some comparison plots between the original variables and their imputed version.

library(sjmisc)  
mice_mrg <- merge_imputations(
ic_to_impute,
tempData,
summary = c("hist" ),
filter = NULL
)
mice_mrg

Printing the data set merged from all 5 imputations

head( mice_mrg$data, 20 ) %>% datatable(rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

2.5 Comparing the imputed and original data sets statistically

The means of the imputed Q11s and the original subset are compared with a t-test, and their variances are compared with a separate test.

1st mice imp. using rf: Comparing means
	fs n	fs micerf n	fs mean	micerf mean	p-value
Q11_1_1	254	354	4.15	4.16	0.94
Q11_1_2	254	354	4.17	4.17	0.97
Q11_1_3	254	354	4.07	4.05	0.83
Q11_1_4	254	354	3.93	3.95	0.85
Q11_1_5	255	354	4.58	4.62	0.75
Q11_1_6	255	354	4.04	4.04	0.97
Q11_1_7	254	354	3.76	3.79	0.78
Q11_1_8	254	354	4.64	4.69	0.65
Q11_1_9	254	354	4.11	4.10	0.91
Q11_1_10	254	354	4.45	4.48	0.80
Q11_1_11	254	354	4.83	4.90	0.52
Q11_1_12	255	354	4.78	4.88	0.37
Q11_1_13	254	354	4.26	4.25	0.89
Q11_1_14	255	354	3.59	3.61	0.85
Q11_1_15	254	354	3.47	3.52	0.66
Q11_1_16	255	354	4.77	4.82	0.65
Q11_1_17	255	354	4.70	4.81	0.33

Comparing variances
	fs var	micerf var	p-value
Q11_1_1	2.34	1.75	0.01
Q11_1_2	2.02	1.50	0.01
Q11_1_3	2.11	1.54	0.01
Q11_1_4	2.18	1.62	0.01
Q11_1_5	2.18	1.66	0.02
Q11_1_6	2.08	1.52	0.01
Q11_1_7	2.65	2.02	0.02
Q11_1_8	2.28	1.71	0.01
Q11_1_9	1.95	1.45	0.01
Q11_1_10	2.07	1.58	0.02
Q11_1_11	1.93	1.49	0.03
Q11_1_12	2.04	1.56	0.02
Q11_1_13	1.85	1.39	0.01
Q11_1_14	2.31	1.78	0.02
Q11_1_15	1.94	1.47	0.02
Q11_1_16	2.14	1.61	0.01
Q11_1_17	2.35	1.79	0.02
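
Comparisons like the ones tabulated above could be produced along the following lines. This is only a sketch: it assumes a Welch two-sample t-test for the means and an F-test (`var.test`) for the variances, which may differ from the tests actually used here, and the data are simulated stand-ins rather than the survey answers.

```r
set.seed(1)  # arbitrary seed for the toy example

# Stand-in for the observed answers on a 7-point scale
original <- sample(1:7, 250, replace = TRUE)

# Stand-in for the completed variable: observed answers plus filled-in values
imputed <- c(original, sample(1:7, 100, replace = TRUE))

mean_test <- t.test(original, imputed)    # Welch two-sample t-test on the means
var_test  <- var.test(original, imputed)  # F-test on the ratio of variances

print(round(c(mean_p = mean_test$p.value, var_p = var_test$p.value), 2))
```

Note that the completed variable contains the observed answers, so the two samples are not independent. The p-values are best read as a rough diagnostic rather than a strict test.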

Comparing the p-values with our 0.05 significance level (95% confidence), the imputation does well on the means but rather poorly on the variances. In the variance table, all seventeen questions have a p-value under the 0.05 limit. Evidently the merge-of-five imputation reduces the variance too much, and the difference from the original subset is statistically significant.
This was surprising to me, so I looked at each of the 5 imputed data sets to see how they perform in terms of variance.

3rd mice imp. using rf: Comparing means
	fs n	fs micerf n	fs mean	micerf mean	p-value
Q11_1_1	254	354	4.15	4.14	0.95
Q11_1_2	254	354	4.17	4.16	0.93
Q11_1_3	254	354	4.07	4.06	0.89
Q11_1_4	254	354	3.93	3.91	0.87
Q11_1_5	255	354	4.58	4.62	0.75
Q11_1_6	255	354	4.04	4.04	0.97
Q11_1_7	254	354	3.76	3.76	0.99
Q11_1_8	254	354	4.64	4.71	0.59
Q11_1_9	254	354	4.11	4.12	0.93
Q11_1_10	254	354	4.45	4.47	0.88
Q11_1_11	254	354	4.83	4.88	0.69
Q11_1_12	255	354	4.78	4.88	0.41
Q11_1_13	254	354	4.26	4.23	0.79
Q11_1_14	255	354	3.59	3.55	0.74
Q11_1_15	254	354	3.47	3.47	0.98
Q11_1_16	255	354	4.77	4.82	0.69
Q11_1_17	255	354	4.70	4.84	0.26

Comparing variances
	fs var	micerf var	p-value
Q11_1_1	2.34	1.88	0.06
Q11_1_2	2.02	1.64	0.07
Q11_1_3	2.11	1.68	0.05
Q11_1_4	2.18	1.78	0.08
Q11_1_5	2.18	1.89	0.22
Q11_1_6	2.08	1.61	0.03
Q11_1_7	2.65	2.55	0.72
Q11_1_8	2.28	1.92	0.14
Q11_1_9	1.95	1.56	0.05
Q11_1_10	2.07	1.74	0.13
Q11_1_11	1.93	1.72	0.33
Q11_1_12	2.04	1.74	0.17
Q11_1_13	1.85	1.52	0.09
Q11_1_14	2.31	2.08	0.35
Q11_1_15	1.94	1.75	0.36
Q11_1_16	2.14	1.85	0.22
Q11_1_17	2.35	2.02	0.19

As seen in the above tables, the single imputed data sets perform much better in terms of not changing the variance significantly. Moreover, the 4th imputed data set performs best among the five and better than the merged one.
The histograms below for one of the imputed variables (Q11_1_7, the seventh element of q11_1) show that the merged imputed data set tends to set the missing values equal to the mode and thus decreases the variance significantly, whereas the single imputed data set distributes the missing values across all 7 levels based on the original mean and variance.

pacman::p_load(epiDisplay)
i <- q11_1[7]
tab1(ic_to_impute[i], main = "Q11_1_7 distribution in the original data set" )

ic_to_impute[i] : 
        Frequency   %(NA+)   %(NA-)
1              33      9.3     13.0
2              28      7.9     11.0
3              34      9.6     13.4
4              77     21.8     30.3
5              50     14.1     19.7
6              19      5.4      7.5
7              13      3.7      5.1
<NA>          100     28.2      0.0
  Total       354    100.0    100.0

tab1(mice_mrg$data[i], main = "Q11_1_7 distribution in the merged-of-5 imputed data set" )

mice_mrg$data[i] : 
        Frequency Percent Cum. percent
1              33     9.3          9.3
2              30     8.5         17.8
3              55    15.5         33.3
4             141    39.8         73.2
5              63    17.8         91.0
6              19     5.4         96.3
7              13     3.7        100.0
  Total       354   100.0        100.0

tab1(compData[i], main = "Q11_1_7 distribution in the 4th imputed data set")

compData[i] : 
        Frequency Percent Cum. percent
1              43    12.1         12.1
2              40    11.3         23.4
3              51    14.4         37.9
4             103    29.1         66.9
5              77    21.8         88.7
6              23     6.5         95.2
7              17     4.8        100.0
  Total       354   100.0        100.0
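
The shrinkage can be reproduced in a small simulation. The sketch below assumes that merging combines the five imputed values for each missing cell by averaging and rounding. That is an assumption about how the merge works, and the random draws are only a crude stand-in for what mice produces. All names and probabilities are hypothetical.

```r
set.seed(42)  # arbitrary seed for the simulation

# Stand-in for 254 observed answers on a 7-point scale
observed <- sample(1:7, 254, replace = TRUE,
                   prob = c(.13, .11, .13, .30, .20, .08, .05))
n_missing <- 100
m <- 5

# Five hypothetical imputations: each fills the gaps with random draws
# from the observed answers (result is a 100 x 5 matrix)
draws <- replicate(m, sample(observed, n_missing, replace = TRUE))

single <- c(observed, draws[, 1])              # one imputed data set
merged <- c(observed, round(rowMeans(draws)))  # cell-wise average of all five

print(round(c(var_observed = var(observed),
              var_single   = var(single),
              var_merged   = var(merged)), 2))
```

Averaging five draws pulls each filled-in value toward the centre of the scale, so the merged column has a smaller variance than any single imputation, while the mean stays close to the observed one.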

2.5.1 Combining the imputed data

The fourth of the 5 imputed data sets performs best. Therefore, it is chosen to be combined with the original data set.

ic_sub_2[q11_1] <- compData
ic_sub_2 %>%  datatable( rownames = TRUE, filter="top", caption = "The imputed and complete data set", options = list(pageLength = 10, scrollX=T))

2.6 Renaming the variables

The variables in the final subset of the ice cream data are renamed to make further analysis easier.

demo_names <- c("Gender", "Age", "Marital", "Children_less_18", "Household_size", "Education", "Employment", "Ethnicity",
                "Household_income", "Residence_state")
names(ic_sub_2) <- c(c("ID", "Heard_of_brand", "Purchased_last6mo", "Satisfaction", "Buying_likelihood", "Recommend_likelihood",
                     "is_relaxing", "is_wholesome", "is_fun",
                     "is_exciting","is_premium_quality", "is_memorable", "is_treat", "is_good_for_regular",
                     "is_interesting", "taste_better_other_brands", "has_many_flavors", "is_enjoyable", "has_best_value/price",
                     "is_organic", "is_low_cal", "is_great_for_family", "is_great_for_guests") , demo_names)

ic_sub_2 %>%  datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))
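
Positional renaming works as long as the column order never changes. A lookup from old to new names is a more defensive variant, sketched below on a made-up data frame. The `toy` columns and the three mappings shown are illustrative only.

```r
# Hypothetical subset with raw survey names
toy <- data.frame(ID = 1:2, S7_1 = c(1, 1), Q1_1 = c(6, 7), D1 = c(1, 2))

# old name -> new name; the order of the entries does not matter
new_names <- c(D1 = "Gender", S7_1 = "Heard_of_brand", Q1_1 = "Satisfaction")

# Rename only the columns that appear in the lookup, leave the rest alone
hit <- names(toy) %in% names(new_names)
names(toy)[hit] <- unname(new_names[names(toy)[hit]])

print(names(toy))
```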

2.7 Recoding the demographic variables

Now it’s time to recode the values of different variables to a more meaningful format. Let’s first look at the demographic variables and see their current values.

library(wrapr)
tt<- function(x) { table( x,  useNA = "ifany" ) }
24:33 %.>% ( function(x) { sapply( ic_sub_2[, (x)], tt ) } )  (.)

$Gender
x
  1   2 
127 227 

$Age
x
 2  3  4  5  6  7 
18 75 49 60 58 94 

$Marital
x
  1   2   3   4   5   6   7 
 71  36 185  43   2  14   3 

$Children_less_18
x
  1   2 
 61 293 

$Household_size
x
  1   2   3   4   5   6   7  11 
 73 177  60  31  10   1   1   1 

$Education
x
  1   2   3   4   5   6 
 20  10 114 110  28  72 

$Employment
x
  1   2   3   4   5   6 
159  35  16  23 101  20 

$Ethnicity
x
  1   2   3   4   5   6 
285  19  13  22   9   6 

$Household_income
x
  1   2   3   4   5   6   7   8   9 
126 106  43  16  16   7   5   1  34 

$Residence_state
x
 1  2  3  4  5  6  7 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
 3  7  9  1 63  9  6  9  3  5  5 16  6  3  3  3  1  2  2  4 16  8  2  6  1  3 
29 30 31 32 33 34 36 37 38 39 41 42 43 44 45 46 47 48 50 
10  1  8  2  8  2 10  3 12 13  4  1  6 38  3  1  5 22  9

Now using different tools and the provided questionnaires, the following demographic variables are recoded and depicted in tables below.

ic_sub_2$Gender <- ifelse( ic_sub_2$Gender == 1, "Male", "Female")
frq(ic_sub_2$Gender, out="v", show.na=TRUE, title= "Gender of the participants")

Gender of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
Female		227	64.12	64.12	64.12
Male		127	35.88	35.88	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=1.36 · σ=0.48

ic_sub_2$Age <- recode_factor(ic_sub_2$Age,`1` = "Under 18",
                                          `2` = "18-24",
                                          `3` = "25-34",
                                          `4` = "35-44",
                                          `5` = "45-54",
                                          `6` = "55-64",
                                          `7` = "65 or older")
frq(ic_sub_2$Age, out="v", show.na=TRUE, title= "Age of the participants")

Age of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
18-24		18	5.08	5.08	5.08
25-34		75	21.19	21.19	26.27
35-44		49	13.84	13.84	40.11
45-54		60	16.95	16.95	57.06
55-64		58	16.38	16.38	73.45
65 or older		94	26.55	26.55	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.98 · σ=1.64
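
A base R alternative is `factor()` with explicit `levels` and `labels`. In this sketch, on made-up codes, all seven categories stay defined as levels, so a category with no respondents shows up with a count of zero.

```r
# Hypothetical age codes; code 1 ("Under 18") does not occur
age_codes <- c(2, 3, 3, 7, 5, 2)

age_labels <- c("Under 18", "18-24", "25-34", "35-44",
                "45-54", "55-64", "65 or older")

# levels = the codes in the data, labels = the text shown for each code
age <- factor(age_codes, levels = 1:7, labels = age_labels)

print(table(age))
```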

ic_sub_2$Marital <- recode_factor(ic_sub_2$Marital,`1` = "Single, not living with domestic partner",
                                          `2` = "Single, living with domestic partner",
                                          `3` = "Married",
                                          `4` = "Divorced",
                                          `5` = "Separated",
                                          `6` = "Widowed",
                                          `7` = "Prefer not to say")
frq(ic_sub_2$Marital, out="v",sort.frq = c("desc"), show.na=TRUE, title= "Marital status of the participants")

Marital status of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
Married		185	52.26	52.26	52.26
Single, not living with domestic partner		71	20.06	20.06	72.32
Divorced		43	12.15	12.15	84.46
Single, living with domestic partner		36	10.17	10.17	94.63
Widowed		14	3.95	3.95	98.59
Prefer not to say		3	0.85	0.85	99.44
Separated		2	0.56	0.56	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=2.78 · σ=1.22

ic_sub_2$Children_less_18 <- ifelse(ic_sub_2$Children_less_18 == 1, "Yes", "No")
frq(ic_sub_2$Children_less_18, out="v", show.na=TRUE, title= "Do participants have children under the age of 18 living with them")

Do participants have children under the age of 18 living with them
val	label	frq	raw.prc	valid.prc	cum.prc
No		293	82.77	82.77	82.77
Yes		61	17.23	17.23	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=1.17 · σ=0.38

No need to recode the household size as it is a numerical variable

frq(ic_sub_2$Household_size, out="v", show.na=TRUE, title= "Number of people living in participants household")

Number of people living in participants household
val	label	frq	raw.prc	valid.prc	cum.prc
1		73	20.62	20.62	20.62
2		177	50	50	70.62
3		60	16.95	16.95	87.57
4		31	8.76	8.76	96.33
5		10	2.82	2.82	99.15
6		1	0.28	0.28	99.44
7		1	0.28	0.28	99.72
11		1	0.28	0.28	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=2.27 · σ=1.12

ic_sub_2$Education <- recode_factor(ic_sub_2$Education,`1` = "High school or less",
                                          `2` = "Trade/Technical school",
                                          `3` = "Some college or Associate's Degree",
                                          `4` = "Graduated College/Bachelor's Degree",
                                          `5` = "Attended Graduate School",
                                          `6` = "Advanced Degree (Master's, PhD.D.)")
frq(ic_sub_2$Education, out="v", show.na=TRUE, title= "Education of the participants")

Education of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
High school or less		20	5.65	5.65	5.65
Trade/Technical school		10	2.82	2.82	8.47
Some college or Associate’s Degree		114	32.2	32.2	40.68
Graduated College/Bachelor’s Degree		110	31.07	31.07	71.75
Attended Graduate School		28	7.91	7.91	79.66
Advanced Degree (Master’s, PhD.D.)		72	20.34	20.34	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.94 · σ=1.36

# Ethnicity labels must be taken from the questionnaire; until they are filled in, the numeric codes are kept as factor levels.
ic_sub_2$Ethnicity <- factor(ic_sub_2$Ethnicity)
frq(ic_sub_2$Ethnicity, out="v", show.na=TRUE, title= "Ethnicity of the participants")

Ethnicity of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
1		285	80.51	80.51	80.51
2		19	5.37	5.37	85.88
3		13	3.67	3.67	89.55
4		22	6.21	6.21	95.76
5		9	2.54	2.54	98.31
6		6	1.69	1.69	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=1.50 · σ=1.16

ic_sub_2$Employment <- recode_factor(ic_sub_2$Employment,`1` = "Employed full-time (30+ hrs)",
                                          `2` = "Employed part time",
                                          `3` = "Not Currently Employed",
                                          `4` = "Student",
                                          `5` = "Retired",
                                          `6` = "Homemaker")
frq(ic_sub_2$Employment, out="v", show.na=TRUE, title= "Employment status of the participants")

Employment status of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
Employed full-time (30+ hrs)		159	44.92	44.92	44.92
Employed part time		35	9.89	9.89	54.8
Not Currently Employed		16	4.52	4.52	59.32
Student		23	6.5	6.5	65.82
Retired		101	28.53	28.53	94.35
Homemaker		20	5.65	5.65	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=2.81 · σ=1.89

ic_sub_2$Household_income <- recode_factor(ic_sub_2$Household_income,`1` = "under $50,000",
                                          `2` = "$50,000 just under $75,000",
                                          `3` = "$75,000 just under $100,000",
                                          `4` = "$100,000 just under $125,000",
                                          `5` = "$125,000 just under $150,000",
                                          `6` = "$150,000 just under $175,000",
                                          `7` = "$175,000 just under $200,000",
                                          `8` = "$200,000 or more",
                                          `9` = "Prefer not to say")
frq(ic_sub_2$Household_income, out="v", show.na=TRUE, title= "Household income of the participants")

Household income of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
under $50,000		126	35.59	35.59	35.59
$50,000 just under $75,000		106	29.94	29.94	65.54
$75,000 just under $100,000		43	12.15	12.15	77.68
$100,000 just under $125,000		16	4.52	4.52	82.2
$125,000 just under $150,000		16	4.52	4.52	86.72
$150,000 just under $175,000		7	1.98	1.98	88.7
$175,000 just under $200,000		5	1.41	1.41	90.11
$200,000 or more		1	0.28	0.28	90.4
Prefer not to say		34	9.6	9.6	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=2.83 · σ=2.42

The states are coded as numbers, and the actual names are not evident from the data. There are 50 states in total.

frq(ic_sub_2$Residence_state, out="v", show.na=TRUE, title= "Residence state of the participants")

Residence state of the participants
val	label	frq	raw.prc	valid.prc	cum.prc
1		3	0.85	0.85	0.85
2		7	1.98	1.98	2.82
3		9	2.54	2.54	5.37
4		1	0.28	0.28	5.65
5		63	17.8	17.8	23.45
6		9	2.54	2.54	25.99
7		6	1.69	1.69	27.68
10		9	2.54	2.54	30.23
11		3	0.85	0.85	31.07
12		5	1.41	1.41	32.49
13		5	1.41	1.41	33.9
14		16	4.52	4.52	38.42
15		6	1.69	1.69	40.11
16		3	0.85	0.85	40.96
17		3	0.85	0.85	41.81
18		3	0.85	0.85	42.66
19		1	0.28	0.28	42.94
20		2	0.56	0.56	43.5
21		2	0.56	0.56	44.07
22		4	1.13	1.13	45.2
23		16	4.52	4.52	49.72
24		8	2.26	2.26	51.98
25		2	0.56	0.56	52.54
26		6	1.69	1.69	54.24
27		1	0.28	0.28	54.52
28		3	0.85	0.85	55.37
29		10	2.82	2.82	58.19
30		1	0.28	0.28	58.47
31		8	2.26	2.26	60.73
32		2	0.56	0.56	61.3
33		8	2.26	2.26	63.56
34		2	0.56	0.56	64.12
36		10	2.82	2.82	66.95
37		3	0.85	0.85	67.8
38		12	3.39	3.39	71.19
39		13	3.67	3.67	74.86
41		4	1.13	1.13	75.99
42		1	0.28	0.28	76.27
43		6	1.69	1.69	77.97
44		38	10.73	10.73	88.7
45		3	0.85	0.85	89.55
46		1	0.28	0.28	89.83
47		5	1.41	1.41	91.24
48		22	6.21	6.21	97.46
50		9	2.54	2.54	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=24.56 · σ=16.31

2.8 Recoding the Perception and Attitude variables

Let’s look at the current values of these variables, which are Q1-Q3 and Q11_1_1 to Q11_1_17.

frq(ic_sub_2$Satisfaction, out="v", show.na=TRUE, title= "Different values in the Behavioral questions")

Different values in the Behavioral questions
val	label	frq	raw.prc	valid.prc	cum.prc
1		1	0.28	0.83	0.83
2		1	0.28	0.83	1.67
4		7	1.98	5.83	7.5
5		22	6.21	18.33	25.83
6		43	12.15	35.83	61.67
7		46	12.99	38.33	100
NA	NA	234	66.1	NA	NA
total N=354 · valid N=120 · x̄=6.01 · σ=1.07

The values span from 1 to 7, so we use a 7-point Likert scale to recode them, with “1” being extremely negative and “7” extremely positive.

As these variables will be used in further analyses, their numerical format is preserved and the recoded categorical columns are appended to the end of the data set.

ic_sub_2$Satisfaction_cat <- dplyr::recode_factor(ic_sub_2$Satisfaction,
                                                  `1` = "Extremely Dissatisfied",
                                                  `2` = "Dissatisfied",
                                                  `3` = "Somewhat Dissatisfied",
                                                  `4` = "Neither Dissatisfied nor Satisfied",
                                                  `5` = "Somewhat Satisfied",
                                                  `6` = "Satisfied",
                                                  `7` = "Extremely Satisfied"
                       )
y <- c("Extremely Satisfied","Satisfied", "Somewhat Satisfied", "Neither Dissatisfied nor Satisfied",
  "Somewhat Dissatisfied", "Dissatisfied", "Extremely Dissatisfied")

ic_sub_2$Satisfaction_cat <- factor(ic_sub_2$Satisfaction_cat, levels = y )
frq(ic_sub_2$Satisfaction_cat, out="v", show.na=TRUE, title= "How satisfied with Ice Cream Brand")

How satisfied with Ice Cream Brand
val	label	frq	raw.prc	valid.prc	cum.prc
Extremely Satisfied		46	12.99	38.33	38.33
Satisfied		43	12.15	35.83	74.17
Somewhat Satisfied		22	6.21	18.33	92.5
Neither Dissatisfied nor Satisfied		7	1.98	5.83	98.33
Somewhat Dissatisfied		0	0	0	98.33
Dissatisfied		1	0.28	0.83	99.17
Extremely Dissatisfied		1	0.28	0.83	100
NA	NA	234	66.1	NA	NA
total N=354 · valid N=120 · x̄=1.99 · σ=1.07
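
One thing to watch when reordering with `factor(x, levels = y)`: any label in `x` that is not spelled exactly as in `y` becomes NA without a warning. A guard like the following, shown on made-up labels, can catch such mismatches before they cost rows.

```r
# Hypothetical factor; one label has a double space
f <- factor(c("Likely", "Not  Likely", "Likely", "Neutral"))

# Intended ordering, spelled with a single space
target <- c("Likely", "Neutral", "Not Likely")

# Labels present in the data but absent from the target ordering
unmatched <- setdiff(levels(f), target)
print(unmatched)

# Re-leveling silently turns the unmatched label into NA
releveled <- factor(f, levels = target)
print(sum(is.na(releveled)))
```

Running `setdiff()` before re-leveling and stopping when it returns anything keeps the number of valid answers unchanged.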

Now we look at Q2 or Buying_likelihood variable in the data set.

ic_sub_2$Buying_likelihood_cat <- dplyr::recode_factor(ic_sub_2$Buying_likelihood,
                                                  `1` = "Not at all Likely to purchase again",
                                                  `2` = "Not Likely to purchase again",
                                                  `3` = "Somewhat Not Likely to purchase again",
                                                  `4` = "Neither Likely nor Unlikely to purchase again",
                                                  `5` = "Somewhat Likely to purchase again",
                                                  `6` = "Likely to purchase again",
                                                  `7` = "Extremely Likely to purchase again"
                       )

y <- c("Extremely Likely to purchase again", "Likely to purchase again",  "Somewhat Likely to purchase again",
  "Neither Likely nor Unlikely to purchase again",  "Somewhat Not Likely to purchase again",
  "Not Likely to purchase again",  "Not at all Likely to purchase again")

ic_sub_2$Buying_likelihood_cat <- factor(ic_sub_2$Buying_likelihood_cat, levels = y )
frq(ic_sub_2$Buying_likelihood_cat, out="v", show.na=TRUE, title= "How Likely to purchase brand again")

How Likely to purchase brand again
val	label	frq	raw.prc	valid.prc	cum.prc
Extremely Likely to purchase again		74	20.9	61.67	61.67
Likely to purchase again		29	8.19	24.17	85.83
Somewhat Likely to purchase again		11	3.11	9.17	95
Neither Likely nor Unlikely to purchase again		6	1.69	5	100
Somewhat Not Likely to purchase again		0	0	0	100
Not Likely to purchase again		0	0	0	100
Not at all Likely to purchase again		0	0	0	100
NA	NA	234	66.1	NA	NA
total N=354 · valid N=120 · x̄=1.57 · σ=0.86

Now we look at Q3 or Recommendation likelihood variable in the data set.

ic_sub_2$Recommend_likehood_cat <- dplyr::recode_factor(ic_sub_2$Recommend_likelihood,
                                                  `1` = "Not at all Likely to recommend",
                                                  `2` = "Not Likely to recommend",
                                                  `3` = "Somewhat Not Likely to recommend",
                                                  `4` = "Neither Likely nor Unlikely to recommend",
                                                  `5` = "Somewhat Likely to recommend",
                                                  `6` = "Likely to recommend",
                                                  `7` = "Extremely Likely to recommend"
                       )

y <- c("Extremely Likely to recommend", "Likely to recommend",  "Somewhat Likely to recommend",
  "Neither Likely nor Unlikely to recommend",  "Somewhat Not Likely to recommend",
  "Not Likely to recommend",  "Not at all Likely to recommend")

ic_sub_2$Recommend_likehood_cat <- factor(ic_sub_2$Recommend_likehood_cat, levels = y )
frq(ic_sub_2$Recommend_likehood_cat, out="v", show.na=TRUE, title= "How Likely to recommend brand again")

How Likely to recommend brand again
val	label	frq	raw.prc	valid.prc	cum.prc
Extremely Likely to recommend		57	16.1	47.5	47.5
Likely to recommend		40	11.3	33.33	80.83
Somewhat Likely to recommend		13	3.67	10.83	91.67
Neither Likely nor Unlikely to recommend		6	1.69	5	96.67
Somewhat Not Likely to recommend		3	0.85	2.5	99.17
Not Likely to recommend		1	0.28	0.83	100
Not at all Likely to recommend		0	0	0	100
NA	NA	234	66.1	NA	NA
total N=354 · valid N=120 · x̄=1.84 · σ=1.06

2.8.1 Recoding the brand perception and opinion variables

All 17 variables in this area (Q11_1_1 to Q11_1_17) can be coded in the same format. Again, to keep their numerical versions for further analyses, they are recoded into new variables that are added to the data set. First we create the new categorical variables and then recode them. Tables of the 17 recoded perception questions for the Dreyer’s brand appear next.

library(forcats)

ftt <- function(x) { ordered( x, levels = c("1", "2", "3", "4", "5", "6", "7")) }
ic_sub_2[, 37:53] <- 7:23 %.>% ( function(x) { lapply( ic_sub_2[, (x)], ftt ) } )  (.)

rr <- function(x) { dplyr::recode_factor( x,
                                          `1` = "Does not describe at all",
                                          `2` = "Does not describe",
                                          `3` = "Not Somewhat does describe",
                                          `4` = "Neither does nor does not describe",
                                          `5` = "Somewhat does describe well",
                                          `6` = "Does describe well",
                                          `7` = "Describes extremely well"
) }
ic_sub_2[, 37:53] <- 37:53 %.>% ( function(x) { lapply( ic_sub_2[, (x)], rr ) } )  (.)
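
The same two-step recode can be sketched in base R by addressing columns by name instead of position, which avoids off-by-one index ranges. The data frame, the `items` vector and the `_cat` suffix below are hypothetical; the labels are the ones used above.

```r
# Hypothetical numeric perception items on a 7-point scale
toy <- data.frame(is_fun = c(1, 4, 7), is_treat = c(2, 4, 6))

likert_labels <- c("Does not describe at all", "Does not describe",
                   "Not Somewhat does describe",
                   "Neither does nor does not describe",
                   "Somewhat does describe well", "Does describe well",
                   "Describes extremely well")

# Convert one numeric column into an ordered, labelled factor
to_likert <- function(x) {
  factor(x, levels = 1:7, labels = likert_labels, ordered = TRUE)
}

# Append the recoded versions as new columns; the numeric originals stay untouched
items <- c("is_fun", "is_treat")
toy[paste0(items, "_cat")] <- lapply(toy[items], to_likert)

str(toy)
```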


frq( fct_rev(ic_sub_2[,37]), out="v", show.na=TRUE, title= "is relaxing")

is relaxing
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		15	4.24	4.24	4.24
Does describe well		39	11.02	11.02	15.25
Somewhat does describe well		67	18.93	18.93	34.18
Neither does nor does not describe		157	44.35	44.35	78.53
Not Somewhat does describe		31	8.76	8.76	87.29
Does not describe		25	7.06	7.06	94.35
Does not describe at all		20	5.65	5.65	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.86 · σ=1.37

frq( fct_rev(ic_sub_2[,38]), out="v", show.na=TRUE, title= "is wholesome" )

is wholesome
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		13	3.67	3.67	3.67
Does describe well		39	11.02	11.02	14.69
Somewhat does describe well		66	18.64	18.64	33.33
Neither does nor does not describe		161	45.48	45.48	78.81
Not Somewhat does describe		37	10.45	10.45	89.27
Does not describe		26	7.34	7.34	96.61
Does not describe at all		12	3.39	3.39	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.84 · σ=1.28

frq( fct_rev(ic_sub_2[,39]), out="v", show.na=TRUE, title= "is fun"           )

is fun
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		14	3.95	3.95	3.95
Does describe well		28	7.91	7.91	11.86
Somewhat does describe well		57	16.1	16.1	27.97
Neither does nor does not describe		181	51.13	51.13	79.1
Not Somewhat does describe		37	10.45	10.45	89.55
Does not describe		14	3.95	3.95	93.5
Does not describe at all		23	6.5	6.5	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.94 · σ=1.30

frq( fct_rev(ic_sub_2[,40]), out="v", show.na=TRUE, title= "is exciting"       )

is exciting
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		13	3.67	3.67	3.67
Does describe well		27	7.63	7.63	11.3
Somewhat does describe well		44	12.43	12.43	23.73
Neither does nor does not describe		176	49.72	49.72	73.45
Not Somewhat does describe		38	10.73	10.73	84.18
Does not describe		36	10.17	10.17	94.35
Does not describe at all		20	5.65	5.65	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=4.09 · σ=1.33

frq( fct_rev(ic_sub_2[,41]), out="v", show.na=TRUE, title= "has premium quality")

has premium quality
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		31	8.76	8.76	8.76
Does describe well		71	20.06	20.06	28.81
Somewhat does describe well		70	19.77	19.77	48.59
Neither does nor does not describe		130	36.72	36.72	85.31
Not Somewhat does describe		29	8.19	8.19	93.5
Does not describe		13	3.67	3.67	97.18
Does not describe at all		10	2.82	2.82	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.38 · σ=1.37

frq( fct_rev(ic_sub_2[,42]), out="v", show.na=TRUE, title= "is memorable"            )

is memorable
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		17	4.8	4.8	4.8
Does describe well		20	5.65	5.65	10.45
Somewhat does describe well		61	17.23	17.23	27.68
Neither does nor does not describe		171	48.31	48.31	75.99
Not Somewhat does describe		48	13.56	13.56	89.55
Does not describe		21	5.93	5.93	95.48
Does not describe at all		16	4.52	4.52	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.96 · σ=1.27

frq( fct_rev(ic_sub_2[,43]), out="v", show.na=TRUE, title= "Bought as special treat but not regularly")

Bought as special treat but not regularly
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		17	4.8	4.8	4.8
Does describe well		23	6.5	6.5	11.3
Somewhat does describe well		77	21.75	21.75	33.05
Neither does nor does not describe		103	29.1	29.1	62.15
Not Somewhat does describe		51	14.41	14.41	76.55
Does not describe		40	11.3	11.3	87.85
Does not describe at all		43	12.15	12.15	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=4.24 · σ=1.60

frq( fct_rev(ic_sub_2[,44]), out="v", show.na=TRUE, title= "is good for regular consumption")

is good for regular consumption
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		37	10.45	10.45	10.45
Does describe well		60	16.95	16.95	27.4
Somewhat does describe well		100	28.25	28.25	55.65
Neither does nor does not describe		113	31.92	31.92	87.57
Not Somewhat does describe		20	5.65	5.65	93.22
Does not describe		11	3.11	3.11	96.33
Does not describe at all		13	3.67	3.67	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.29 · σ=1.39

frq( fct_rev(ic_sub_2[,45]), out="v", show.na=TRUE, title= "is interesting")

is interesting
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		13	3.67	3.67	3.67
Does describe well		32	9.04	9.04	12.71
Somewhat does describe well		62	17.51	17.51	30.23
Neither does nor does not describe		174	49.15	49.15	79.38
Not Somewhat does describe		41	11.58	11.58	90.96
Does not describe		16	4.52	4.52	95.48
Does not describe at all		16	4.52	4.52	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.88 · σ=1.25

frq( fct_rev(ic_sub_2[,46]), out="v", show.na=TRUE, title= "Tastes better than other brands" )

Tastes better than other brands
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		24	6.78	6.78	6.78
Does describe well		51	14.41	14.41	21.19
Somewhat does describe well		85	24.01	24.01	45.2
Neither does nor does not describe		135	38.14	38.14	83.33
Not Somewhat does describe		35	9.89	9.89	93.22
Does not describe		13	3.67	3.67	96.89
Does not describe at all		11	3.11	3.11	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.53 · σ=1.32

frq( fct_rev(ic_sub_2[,47]), out="v", show.na=TRUE, title= "Has many flavors")

Has many flavors
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		43	12.15	12.15	12.15
Does describe well		70	19.77	19.77	31.92
Somewhat does describe well		98	27.68	27.68	59.6
Neither does nor does not describe		109	30.79	30.79	90.4
Not Somewhat does describe		17	4.8	4.8	95.2
Does not describe		11	3.11	3.11	98.31
Does not describe at all		6	1.69	1.69	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.12 · σ=1.31

frq( fct_rev(ic_sub_2[,48]), out="v", show.na=TRUE, title= "Is enjoyable"   )

Is enjoyable
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		36	10.17	10.17	10.17
Does describe well		82	23.16	23.16	33.33
Somewhat does describe well		102	28.81	28.81	62.15
Neither does nor does not describe		97	27.4	27.4	89.55
Not Somewhat does describe		21	5.93	5.93	95.48
Does not describe		6	1.69	1.69	97.18
Does not describe at all		10	2.82	2.82	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.12 · σ=1.32

frq( fct_rev(ic_sub_2[,49]), out="v", show.na=TRUE, title= "Has best value for price"   )

Has best value for price
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		18	5.08	5.08	5.08
Does describe well		31	8.76	8.76	13.84
Somewhat does describe well		69	19.49	19.49	33.33
Neither does nor does not describe		170	48.02	48.02	81.36
Not Somewhat does describe		41	11.58	11.58	92.94
Does not describe		13	3.67	3.67	96.61
Does not describe at all		12	3.39	3.39	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.77 · σ=1.23

frq( fct_rev(ic_sub_2[,50]), out="v", show.na=TRUE, title= "is natural/organic")

is natural/organic
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		8	2.26	2.26	2.26
Does describe well		18	5.08	5.08	7.34
Somewhat does describe well		45	12.71	12.71	20.06
Neither does nor does not describe		151	42.66	42.66	62.71
Not Somewhat does describe		42	11.86	11.86	74.58
Does not describe		47	13.28	13.28	87.85
Does not describe at all		43	12.15	12.15	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=4.45 · σ=1.44

frq( fct_rev(ic_sub_2[,51]), out="v", show.na=TRUE, title= "is low calorie")

is low calorie
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		6	1.69	1.69	1.69
Does describe well		9	2.54	2.54	4.24
Somewhat does describe well		42	11.86	11.86	16.1
Neither does nor does not describe		153	43.22	43.22	59.32
Not Somewhat does describe		61	17.23	17.23	76.55
Does not describe		45	12.71	12.71	89.27
Does not describe at all		38	10.73	10.73	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=4.53 · σ=1.32

frq( fct_rev(ic_sub_2[,52]), out="v", show.na=TRUE, title= "Great for whole family")

Great for whole family
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		44	12.43	12.43	12.43
Does describe well		69	19.49	19.49	31.92
Somewhat does describe well		85	24.01	24.01	55.93
Neither does nor does not describe		117	33.05	33.05	88.98
Not Somewhat does describe		22	6.21	6.21	95.2
Does not describe		8	2.26	2.26	97.46
Does not describe at all		9	2.54	2.54	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.18 · σ=1.36

frq( fct_rev(ic_sub_2[,53]), out="v", show.na=TRUE, title= "Great for guests")

Great for guests
val	label	frq	raw.prc	valid.prc	cum.prc
Describes extremely well		42	11.86	11.86	11.86
Does describe well		75	21.19	21.19	33.05
Somewhat does describe well		100	28.25	28.25	61.3
Neither does nor does not describe		94	26.55	26.55	87.85
Not Somewhat does describe		20	5.65	5.65	93.5
Does not describe		9	2.54	2.54	96.05
Does not describe at all		14	3.95	3.95	100
NA	NA	0	0	NA	NA
total N=354 · valid N=354 · x̄=3.16 · σ=1.42

Next, the new categorical variables are renamed, with the suffix _cat added to mark them as categorical.

q <- c("is_relaxing", "is_wholesome", "is_fun",
       "is_exciting", "is_premium_quality", "is_memorable", "is_treat", "is_good_for_regular",
       "is_interesting", "taste_better_other_brands", "has_many_flavors", "is_enjoyable", "has_best_value/price",
       "is_organic", "is_low_cal", "is_great_for_family", "is_great_for_guests")

colnames(ic_sub_2)[37:53] <- paste0(q, "_cat")
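
Appending a suffix to a vector of names can be done either with gsub() anchored at the end of the string ("$") or with paste0(). A small sketch on a made-up subset of the names (q_demo is hypothetical) shows that both give the same result:

```r
# Made-up subset of the attribute names
q_demo <- c("is_relaxing", "is_fun")

via_gsub  <- gsub("$", "_cat", q_demo)  # "$" matches the end of each string
via_paste <- paste0(q_demo, "_cat")     # plain concatenation

via_paste
identical(via_gsub, via_paste)
```

One thing to keep in mind: a name that contains a slash, such as has_best_value/price_cat, is not a syntactic R name and has to be wrapped in backticks when accessed with $.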

Finally, the Heard_of_brand and Purchased_last6mo variables are recoded from 1/0 to yes/no, and the data are checked.

ic_sub_2$Heard_of_brand <- ifelse(ic_sub_2$Heard_of_brand == 1, 'yes', 'no')
ic_sub_2$Purchased_last6mo <- ifelse(ic_sub_2$Purchased_last6mo == 1, 'yes', 'no')
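
A quick way to check such a recode is to tabulate the result with missing values included, because ifelse() propagates NA instead of mapping it to 'no'. A minimal sketch on made-up values, where heard is a hypothetical stand-in for ic_sub_2$Heard_of_brand:

```r
# Made-up 1/0 indicator with one missing value
heard <- c(1, 0, 1, NA)

# Same recode pattern as above: 1 becomes "yes", anything else "no"
heard_lab <- ifelse(heard == 1, "yes", "no")

heard_lab                           # the NA stays NA
table(heard_lab, useNA = "ifany")   # counts per label, NA shown if present
```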

2.9 Saving the final data set

The recoded, imputed, and filtered data set is saved to the working directory with the code below.

write.csv(ic_sub_2, "./final_data.csv", row.names = FALSE)
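
One caveat worth checking: a CSV file stores factors as plain text, so the level order set during recoding is not kept when the file is read back. The sketch below, on a made-up data frame written to temporary files, demonstrates this and shows saveRDS() as one possible alternative when the factor structure matters. Whether that is needed here depends on how the file is used later; toy, csv_path and rds_path are hypothetical names.

```r
# Toy data with a factor whose level order is not alphabetical
toy <- data.frame(
  id     = 1:3,
  rating = factor(c("low", "high", "mid"), levels = c("low", "mid", "high"))
)

# Round trip through CSV: the factor comes back as plain character
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
back_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
class(back_csv$rating)

# Round trip through RDS: the factor and its level order are preserved
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
back_rds <- readRDS(rds_path)
levels(back_rds$rating)
```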