1 Statement of authorship

I have executed and prepared this assignment and document by myself without the help of any other person. Signature:

2 Ice Cream Survey analysis

This work is a preliminary analysis of a survey data set on ice cream brands. Its purpose is to clean and prepare the data so that analysts can use it easily. The data set accompanies the “Seven Summits of Marketing Research” text by Greg Allenby and Jeff Brazell and is used with their permission.

knitr::include_graphics("./ice_cream.png")

knitr::include_graphics("./survey.png")

2.1 Reading the Ice Cream data set

The first step is to load the data and required packages.

library(dplyr)
library(psych)
library(DT)
library(tidyverse)
ic<- read.csv("./IceCream_raw.csv")
#ic<- read.csv("http://bus-sawtooth.mcmaster.ca/M733_ONLINE_F2020/IceCream_raw.csv")

Then take a quick look at what it contains.

headTail(ic) %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

2.2 Reducing the raw data to essential variables

In this analysis we are only focusing on the Dreyer’s brand. Based on the questionnaires we only need to analyze certain columns.

d_names <- c(grep("^D.$", names(ic), value = TRUE), "D10")
q11 <- grep("^Q11_", names(ic), value = TRUE)
q11_1 <- grep("^Q11_1_", q11, value = TRUE) ##Only for the Dreyer's brand
s7 <- grep("^S7_", names(ic), value = TRUE)
s8 <- grep("^S8_", names(ic), value = TRUE)
col_names <- c("ID",s7, s8, "Q1_1","Q2_1","Q3_1", q11_1, d_names)
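The patterns used above can be checked on a small vector of names. The following is a minimal sketch: the names in `toy_names` are hypothetical and only illustrate why `"^D.$"` needs `"D10"` to be appended by hand, and how a digit-based pattern would avoid that.

```r
# Hypothetical column names, chosen only to illustrate the patterns.
toy_names <- c("ID", "D1", "D9", "D10", "S7_1", "S8_99",
               "Q11_1_1", "Q11_1_17", "Q11_2_1")

# "^D.$" means: a D followed by exactly one character, so "D10" is not matched
d_single <- grep("^D.$", toy_names, value = TRUE)
d_all    <- c(d_single, "D10")

# An alternative pattern that accepts any number of digits after the D
d_digits <- grep("^D[0-9]+$", toy_names, value = TRUE)

# Q11 items for brand 1 only; "Q11_2_1" belongs to another brand
q11_brand1 <- grep("^Q11_1_", toy_names, value = TRUE)

print(d_all)
print(d_digits)
print(q11_brand1)
```

Both approaches select the same demographic columns here; the digit-based pattern simply removes the need to remember the two-digit case.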

The selected variables are shown below. S7 and S8 are chosen from the screener questions to identify the participants who have heard of our brand or purchased it in the past 6 months. The behavioral questions needed for this analysis (Q1, Q2, Q3 and Q11) are selected for the Dreyer’s brand only. Each question name carries a number that identifies the ice cream brand; Dreyer’s is brand 1, so we choose Q1_1, Q2_1, etc.

ic_sub <- ic[col_names]
headTail(ic_sub,5,5) %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 12, scrollX=T))

2.3 Filtering the participants

Inspecting the Dreyer’s brand, it is evident that the S8 question has many NA values. Comparing with the S7 question (having heard of the brand) shows that these are the people who have not heard of the brand and who should therefore be omitted from this analysis.
Heard of Dreyer’s Brand
S7_1 Freq
0 247
1 354
NA 0
Purchased Dreyer’s brand in last 6 months
S8_1 Freq
0 234
1 120
NA 247

The following table shows the participants (rows) whose S8_1 value is missing. It also shows that these are the people who have not heard of our brand (S7_1 == 0).

ic[is.na(ic["S8_1"]),c("ID", "S7_1", "S8_1","Q1_1","Q2_1","Q3_1" )] %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))
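The link between the two screener questions can also be checked with a cross-tabulation rather than by reading the rows. This is a minimal base-R sketch on a hypothetical toy data frame (`toy` is not the survey data); it assumes the same coding, with 0 meaning "has not heard of the brand".

```r
# Toy data (hypothetical): S8_1 is skipped (NA) whenever S7_1 is 0
toy <- data.frame(
  S7_1 = c(0, 0, 1, 1, 1, 0),
  S8_1 = c(NA, NA, 0, 1, 0, NA)
)

# Cross-tabulate "heard of brand" against "purchase answer is missing"
xt <- table(heard = toy$S7_1, purchase_missing = is.na(toy$S8_1))
print(xt)

# TRUE only if S8_1 is missing exactly for the rows with S7_1 == 0
skip_ok <- all(is.na(toy$S8_1) == (toy$S7_1 == 0))
print(skip_ok)

# Keep only the participants who have heard of the brand
heard <- toy[!is.na(toy$S7_1) & toy$S7_1 != 0, ]
print(heard)
```

If the off-diagonal cells of the table are zero, the missing values are explained entirely by the skip logic of the questionnaire.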

2.3.1 Filtering rows

The variable S7_1 has no missing values, so it reliably identifies the people who have not heard of, and therefore not purchased, the Dreyer’s brand. These people are excluded from the analysis, and the remaining subset is saved to ic_sub_1.

ic_sub_1 <- ic_sub %>% filter(!(S7_1 == 0))

Checking the S7_1 and S8_1 in the new filtered data

sjmisc::frq(ic_sub_1$S7_1, out = 'v', title = "Heard of Dreyer's Brand (S7_1)")
Heard of Dreyer’s Brand (S7_1)
val label frq raw.prc valid.prc cum.prc
1 354 100 100 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=1.00 · σ=0.00
sjmisc::frq(ic_sub_1$S8_1, out = 'v', title = "Purchased Dreyer's brand in last 6 months (S8_1)")
Purchased Dreyer’s brand in last 6 months (S8_1)
val label frq raw.prc valid.prc cum.prc
0 234 66.1 66.1 66.1
1 120 33.9 33.9 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=0.34 · σ=0.47


There are no longer any missing values in these two variables.

2.4 Missing values analyses

We filtered our subset of the data for the Dreyer’s brand analysis. Now let’s look at missing values in this subset.

options(max.print = 100000)
library(inspectdf)
ic_sub_1 %>% inspect_na()  %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

The following figures show the pattern of the missing data.

library(VIM)
ic_sub_1[, c(2:47)] %>% aggr(col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))


 Variables sorted by number of missings: 
 Variable      Count
     Q1_1 0.66101695
     Q2_1 0.66101695
     Q3_1 0.66101695
     S8_3 0.45762712
  Q11_1_1 0.28248588
  Q11_1_2 0.28248588
  Q11_1_3 0.28248588
  Q11_1_4 0.28248588
  Q11_1_7 0.28248588
  Q11_1_8 0.28248588
  Q11_1_9 0.28248588
 Q11_1_10 0.28248588
 Q11_1_11 0.28248588
 Q11_1_13 0.28248588
 Q11_1_15 0.28248588
  Q11_1_5 0.27966102
  Q11_1_6 0.27966102
 Q11_1_12 0.27966102
 Q11_1_14 0.27966102
 Q11_1_16 0.27966102
 Q11_1_17 0.27966102
     S8_2 0.25423729
     S8_7 0.17796610
     S8_4 0.03389831
     S8_5 0.01412429
     S8_6 0.01412429
     S7_1 0.00000000
     S7_2 0.00000000
     S7_3 0.00000000
     S7_4 0.00000000
     S7_5 0.00000000
     S7_6 0.00000000
     S7_7 0.00000000
    S7_99 0.00000000
     S8_1 0.00000000
    S8_99 0.00000000
       D1 0.00000000
       D2 0.00000000
       D3 0.00000000
       D4 0.00000000
       D5 0.00000000
       D6 0.00000000
       D7 0.00000000
       D8 0.00000000
       D9 0.00000000
      D10 0.00000000
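Although the column above is headed "Count", the values are proportions of the 354 remaining rows. The short sketch below reproduces three of them from row counts that appear elsewhere in this report (234 non-purchasers, 100 and 99 missing Q11 answers) and shows a base-R way to get the same kind of summary; the data frame `toy` is hypothetical.

```r
n_rows <- 354

# Missing-row counts taken from this report
missing_rows <- c(Q1_1 = 234, Q11_1_1 = 100, Q11_1_5 = 99)
prop_missing <- missing_rows / n_rows
print(round(prop_missing, 8))

# The same kind of summary for any data frame, using base R only
toy <- data.frame(a = c(1, NA, 3, NA), b = c(1, 2, 3, 4))
toy_missing <- colMeans(is.na(toy))
print(toy_missing)
```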

First, the S7 and S8 columns for the other brands are removed. We then look at the remaining variables and their missing counts.

ic_sub_2 <- ic_sub_1[c("ID", "S7_1", "S8_1", "Q1_1", "Q2_1", "Q3_1", q11_1, d_names)]
library(inspectdf)
ic_sub_2 %>% inspect_na %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

The demographic variables have no missing values, and neither do S7_1 and S8_1. About 28% of the values are missing in the Q11 questions and 66% in Q1-Q3. The latter makes sense: these are the people who have not purchased the product in the last 6 months and were therefore not asked Q1-Q3. These questions are missing in 234 rows, which is the number of people who report not purchasing the brand in S8_1. The code below shows this.

ic_sub_2[c("ID", "S7_1", "S8_1", "Q1_1", "Q2_1", "Q3_1")] %>% filter(S8_1 ==0) %>% nrow()
[1] 234

The number of missing values in Q1-Q3 is therefore the same as the number of people who have not purchased the brand in the last 6 months. These values are missing by design and should not be imputed.
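Equal counts alone do not prove that the missing rows and the non-purchasers are the same rows. A row-level check is a little stronger. The sketch below uses a hypothetical toy data frame with the same column names; it is an illustration of the check, not output from the survey data.

```r
# Toy data (hypothetical): Q1-Q3 are skipped when S8_1 == 0
toy <- data.frame(
  S8_1 = c(0, 1, 0, 1, 0),
  Q1_1 = c(NA, 6, NA, 7, NA),
  Q2_1 = c(NA, 7, NA, 5, NA),
  Q3_1 = c(NA, 7, NA, 6, NA)
)

q_cols <- c("Q1_1", "Q2_1", "Q3_1")
not_purchased <- toy$S8_1 == 0

# For each question: is it missing exactly in the rows of non-purchasers?
structural <- sapply(toy[q_cols], function(x) all(is.na(x) == not_purchased))
print(structural)
```

A `FALSE` for any question would indicate missing answers among purchasers (or answers from non-purchasers) that need a closer look.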

2.4.1 Imputing the valid variables using mice

The only valid variables that require imputation are the seventeen Q11s. So only this subset of our “ic_sub_2” is passed to ‘mice’ function. The following code uses ‘mice’ to generate m=5 new imputed data sets using the Random Forest Imputation method.

library(mice) 
set.seed(456)
ic_to_impute <- ic_sub_2[q11_1]
tempData <- mice(ic_to_impute, m = 5, maxit = 50, method = "rf", seed = 500, printFlag = FALSE) # m = 5 imputed data sets using random forest imputation; seed = 500 makes the run reproducible

A look at the imputed subset shows that no missing values remain.

compData <- mice::complete(tempData, 1)
headTail(compData, 20)  %>% datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))
# No missing point!
compData %>%  inspect_na() %>%  datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

2.4.2 Using the ‘sjmisc’ package to merge the 5 datasets into 1

The “merge_imputations()” function from the ‘sjmisc’ package merges the 5 imputed data sets into one with the original number of rows, i.e., not in long format and depicts some comparison plots between the original variables and their imputed version.

library(sjmisc)  
mice_mrg <- merge_imputations(
ic_to_impute,
tempData,
summary = c("hist" ),
filter = NULL
)
mice_mrg

Printing the merged version of all 5 imputed data sets

head( mice_mrg$data, 20 ) %>% datatable(rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T)) 

2.5 Comparing the imputed and original data sets statistically

T-tests compare the means, and a separate test compares the variances, of the imputed Q11 items and the original subset.
1st mice imp. using rf: Comparing means
fs n fs micerf n fs mean micerf mean p-value
Q11_1_1 254 354 4.15 4.16 0.94
Q11_1_2 254 354 4.17 4.17 0.97
Q11_1_3 254 354 4.07 4.05 0.83
Q11_1_4 254 354 3.93 3.95 0.85
Q11_1_5 255 354 4.58 4.62 0.75
Q11_1_6 255 354 4.04 4.04 0.97
Q11_1_7 254 354 3.76 3.79 0.78
Q11_1_8 254 354 4.64 4.69 0.65
Q11_1_9 254 354 4.11 4.10 0.91
Q11_1_10 254 354 4.45 4.48 0.80
Q11_1_11 254 354 4.83 4.90 0.52
Q11_1_12 255 354 4.78 4.88 0.37
Q11_1_13 254 354 4.26 4.25 0.89
Q11_1_14 255 354 3.59 3.61 0.85
Q11_1_15 254 354 3.47 3.52 0.66
Q11_1_16 255 354 4.77 4.82 0.65
Q11_1_17 255 354 4.70 4.81 0.33
Comparing variances
fs var micerf var p-value
Q11_1_1 2.34 1.75 0.01
Q11_1_2 2.02 1.50 0.01
Q11_1_3 2.11 1.54 0.01
Q11_1_4 2.18 1.62 0.01
Q11_1_5 2.18 1.66 0.02
Q11_1_6 2.08 1.52 0.01
Q11_1_7 2.65 2.02 0.02
Q11_1_8 2.28 1.71 0.01
Q11_1_9 1.95 1.45 0.01
Q11_1_10 2.07 1.58 0.02
Q11_1_11 1.93 1.49 0.03
Q11_1_12 2.04 1.56 0.02
Q11_1_13 1.85 1.39 0.01
Q11_1_14 2.31 1.78 0.02
Q11_1_15 1.94 1.47 0.02
Q11_1_16 2.14 1.61 0.01
Q11_1_17 2.35 1.79 0.02
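The code that produced these tables is not shown. One plausible way to get such p-values in base R is `t.test()` for the means and `var.test()` (an F-test) for the variances; this is an assumption about the method, not a statement of what was run. The sketch uses simulated answers, with 100 imputed values placed at the scale midpoint to mimic an imputation that concentrates on one level.

```r
set.seed(1)

# Simulated stand-ins (not survey data): 254 observed answers on a 1-7 scale
original <- sample(1:7, 254, replace = TRUE)

# "Imputed" version: the observed answers plus 100 filled-in values at the midpoint
imputed <- c(original, rep(4, 100))

mean_test <- t.test(original, imputed)    # Welch two-sample t-test on the means
var_test  <- var.test(original, imputed)  # F-test on the ratio of variances

print(c(var_original = var(original), var_imputed = var(imputed)))
print(c(mean_p = mean_test$p.value, var_p = var_test$p.value))
```

Filling in values near the centre leaves the mean almost unchanged but lowers the variance, which is the pattern the two tables above display.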

Comparing the p-values with the 0.05 significance level (95% confidence) shows that the imputation preserves the means well but the variances rather poorly. In the variance table, all seventeen questions have a p-value below 0.05. Evidently the merge-of-five imputation shrinks the variance too much, and its variance differs significantly from that of the original subset.
This was surprising to me, so I looked at each of the 5 imputed data sets to see how they perform in terms of variance.

3rd mice imp. using rf: Comparing means
fs n fs micerf n fs mean micerf mean p-value
Q11_1_1 254 354 4.15 4.14 0.95
Q11_1_2 254 354 4.17 4.16 0.93
Q11_1_3 254 354 4.07 4.06 0.89
Q11_1_4 254 354 3.93 3.91 0.87
Q11_1_5 255 354 4.58 4.62 0.75
Q11_1_6 255 354 4.04 4.04 0.97
Q11_1_7 254 354 3.76 3.76 0.99
Q11_1_8 254 354 4.64 4.71 0.59
Q11_1_9 254 354 4.11 4.12 0.93
Q11_1_10 254 354 4.45 4.47 0.88
Q11_1_11 254 354 4.83 4.88 0.69
Q11_1_12 255 354 4.78 4.88 0.41
Q11_1_13 254 354 4.26 4.23 0.79
Q11_1_14 255 354 3.59 3.55 0.74
Q11_1_15 254 354 3.47 3.47 0.98
Q11_1_16 255 354 4.77 4.82 0.69
Q11_1_17 255 354 4.70 4.84 0.26
Comparing variances
fs var micerf var p-value
Q11_1_1 2.34 1.88 0.06
Q11_1_2 2.02 1.64 0.07
Q11_1_3 2.11 1.68 0.05
Q11_1_4 2.18 1.78 0.08
Q11_1_5 2.18 1.89 0.22
Q11_1_6 2.08 1.61 0.03
Q11_1_7 2.65 2.55 0.72
Q11_1_8 2.28 1.92 0.14
Q11_1_9 1.95 1.56 0.05
Q11_1_10 2.07 1.74 0.13
Q11_1_11 1.93 1.72 0.33
Q11_1_12 2.04 1.74 0.17
Q11_1_13 1.85 1.52 0.09
Q11_1_14 2.31 2.08 0.35
Q11_1_15 1.94 1.75 0.36
Q11_1_16 2.14 1.85 0.22
Q11_1_17 2.35 2.02 0.19

As seen in the above tables, the single imputed data sets perform much better in that they do not change the variance significantly. Moreover, the 4th imputed data set performs better than the other four and the merged one.
The histograms below, for one of the imputed variables (Q11_1_7, the item selected by q11_1[7]), show that the merged imputed data set tends to set the missing values equal to the mode and thus decreases the variance significantly, whereas the single imputed data set distributes the missing values over all 7 levels in line with the original mean and variance.
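The shrinkage has a simple explanation if merging combines the imputations of each missing cell by averaging them (rounded back to the scale), which is my understanding of `merge_imputations()` and should be treated as an assumption here. A small simulation with made-up draws shows the effect: averaging five independent draws pulls values toward the centre.

```r
set.seed(42)
n_missing <- 100   # number of missing answers to fill in
m <- 5             # number of imputations

# Each column is one simulated imputation of the same 100 missing answers
draws <- matrix(sample(1:7, n_missing * m, replace = TRUE), ncol = m)

single <- draws[, 1]               # use one imputation as it is
merged <- round(rowMeans(draws))   # average the five imputations per cell

print(c(var_single = var(single), var_merged = var(merged)))
print(table(factor(merged, levels = 1:7)))
```

The merged values cluster on the middle levels, so their variance is much smaller than that of any single imputation.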

pacman::p_load(epiDisplay)
i <- q11_1[7]  # the seventh brand-perception item, Q11_1_7
tab1(ic_to_impute[i], main = "Q11_1_7 distribution in the original data set" )

ic_to_impute[i] : 
        Frequency   %(NA+)   %(NA-)
1              33      9.3     13.0
2              28      7.9     11.0
3              34      9.6     13.4
4              77     21.8     30.3
5              50     14.1     19.7
6              19      5.4      7.5
7              13      3.7      5.1
<NA>          100     28.2      0.0
  Total       354    100.0    100.0
tab1(mice_mrg$data[i], main = "Q11_1_7 distribution in the merged-of-5 imputed data set" )

mice_mrg$data[i] : 
        Frequency Percent Cum. percent
1              33     9.3          9.3
2              30     8.5         17.8
3              55    15.5         33.3
4             141    39.8         73.2
5              63    17.8         91.0
6              19     5.4         96.3
7              13     3.7        100.0
  Total       354   100.0        100.0
tab1(compData[i], main = "Q11_1_7 distribution in the 4th imputed data set")

compData[i] : 
        Frequency Percent Cum. percent
1              43    12.1         12.1
2              40    11.3         23.4
3              51    14.4         37.9
4             103    29.1         66.9
5              77    21.8         88.7
6              23     6.5         95.2
7              17     4.8        100.0
  Total       354   100.0        100.0
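The frequency tables above can be turned back into means and variances, which ties them to the comparison tables. The sketch below uses the printed frequencies of the original and merged data; the helper `moments()` is a name introduced here for illustration.

```r
# Frequencies for levels 1..7, copied from the tables above
freq_original <- c(33, 28, 34, 77, 50, 19, 13)    # 254 observed answers
freq_merged   <- c(33, 30, 55, 141, 63, 19, 13)   # 354 answers after merging

# Mean and sample variance from a frequency table (illustrative helper)
moments <- function(values, freq) {
  n <- sum(freq)
  m <- sum(values * freq) / n
  v <- sum(freq * (values - m)^2) / (n - 1)
  c(n = n, mean = m, var = v)
}

res <- rbind(original = moments(1:7, freq_original),
             merged   = moments(1:7, freq_merged))
print(round(res, 2))
```

This gives a mean of 3.76 and a variance of 2.65 for the original answers, and 3.79 and 2.02 after merging, which are the values in the Q11_1_7 rows of the first pair of comparison tables.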

2.5.1 Combining the imputed data

The fourth of the 5 imputed data sets performs best. Therefore, it is chosen to be combined with the original data set.

ic_sub_2[q11_1] <- compData
ic_sub_2 %>%  datatable( rownames = TRUE, filter="top", caption = "The imputed and complete data set", options = list(pageLength = 10, scrollX=T))

2.6 Renaming the variables

The variables in the final subset of the ice cream data are renamed to make further analysis easier.

demo_names <- c("Gender", "Age", "Marital", "Children_less_18", "Household_size", "Education", "Employment", "Ethnicity",
                "Household_income", "Residence_state")
names(ic_sub_2) <- c(c("ID", "Heard_of_brand", "Purchased_last6mo", "Satisfaction", "Buying_likelihood", "Recommend_likelihood",
                     "is_relaxing", "is_wholesome", "is_fun",
                     "is_exciting","is_premium_quality", "is_memorable", "is_treat", "is_good_for_regular",
                     "is_interesting", "taste_better_other_brands", "has_many_flavors", "is_enjoyable", "has_best_value/price",
                     "is_organic", "is_low_cal", "is_great_for_family", "is_great_for_guests") , demo_names)

ic_sub_2 %>%  datatable( rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T))

2.7 Recoding the demographic variables

Now it’s time to recode the values of different variables to a more meaningful format. Let’s first look at the demographic variables and see their current values.

library(wrapr)
tt<- function(x) { table( x,  useNA = "ifany" ) }
24:33 %.>% ( function(x) { sapply( ic_sub_2[, (x)], tt ) } )  (.) 
$Gender
x
  1   2 
127 227 

$Age
x
 2  3  4  5  6  7 
18 75 49 60 58 94 

$Marital
x
  1   2   3   4   5   6   7 
 71  36 185  43   2  14   3 

$Children_less_18
x
  1   2 
 61 293 

$Household_size
x
  1   2   3   4   5   6   7  11 
 73 177  60  31  10   1   1   1 

$Education
x
  1   2   3   4   5   6 
 20  10 114 110  28  72 

$Employment
x
  1   2   3   4   5   6 
159  35  16  23 101  20 

$Ethnicity
x
  1   2   3   4   5   6 
285  19  13  22   9   6 

$Household_income
x
  1   2   3   4   5   6   7   8   9 
126 106  43  16  16   7   5   1  34 

$Residence_state
x
 1  2  3  4  5  6  7 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
 3  7  9  1 63  9  6  9  3  5  5 16  6  3  3  3  1  2  2  4 16  8  2  6  1  3 
29 30 31 32 33 34 36 37 38 39 41 42 43 44 45 46 47 48 50 
10  1  8  2  8  2 10  3 12 13  4  1  6 38  3  1  5 22  9 

Now using different tools and the provided questionnaires, the following demographic variables are recoded and depicted in tables below.
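As a point of comparison for the recoding used in this section, the same mapping from codes to labels can be written in base R with `factor(levels = , labels = )`. The sketch below uses the age labels from this report on a made-up vector of codes (`age_codes` is hypothetical).

```r
# Made-up age codes; the labels are the ones used in this report
age_codes  <- c(2, 3, 7, 5, 3)
age_labels <- c("Under 18", "18-24", "25-34", "35-44",
                "45-54", "55-64", "65 or older")

# Map code k to the k-th label; codes outside 1..7 would become NA
age <- factor(age_codes, levels = 1:7, labels = age_labels)

print(table(age, useNA = "ifany"))

# Drop levels that never occur, e.g. "Under 18" in this toy vector
age_used <- droplevels(age)
print(levels(age_used))
```

One difference worth knowing: this version keeps empty levels such as "Under 18" until `droplevels()` is called, so zero-count categories remain visible in tables.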

ic_sub_2$Gender <- ifelse( ic_sub_2$Gender == 1, "Male", "Female")
frq(ic_sub_2$Gender, out="v", show.na=TRUE, title= "Gender of the participants")
Gender of the participants
val label frq raw.prc valid.prc cum.prc
Female 227 64.12 64.12 64.12
Male 127 35.88 35.88 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=1.36 · σ=0.48
ic_sub_2$Age <- recode_factor(ic_sub_2$Age,`1` = "Under 18",
                                          `2` = "18-24",
                                          `3` = "25-34",
                                          `4` = "35-44",
                                          `5` = "45-54",
                                          `6` = "55-64",
                                          `7` = "65 or older")
frq(ic_sub_2$Age, out="v", show.na=TRUE, title= "Age of the participants")
Age of the participants
val label frq raw.prc valid.prc cum.prc
18-24 18 5.08 5.08 5.08
25-34 75 21.19 21.19 26.27
35-44 49 13.84 13.84 40.11
45-54 60 16.95 16.95 57.06
55-64 58 16.38 16.38 73.45
65 or older 94 26.55 26.55 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.98 · σ=1.64
ic_sub_2$Marital <- recode_factor(ic_sub_2$Marital,`1` = "Single, not living with domestic partner",
                                          `2` = "Single, living with domestic partner",
                                          `3` = "Married",
                                          `4` = "Divorced",
                                          `5` = "Separated",
                                          `6` = "Widowed",
                                          `7` = "Prefer not to say")
frq(ic_sub_2$Marital, out="v",sort.frq = c("desc"), show.na=TRUE, title= "Marital status of the participants")
Marital status of the participants
val label frq raw.prc valid.prc cum.prc
Married 185 52.26 52.26 52.26
Single, not living with domestic partner 71 20.06 20.06 72.32
Divorced 43 12.15 12.15 84.46
Single, living with domestic partner 36 10.17 10.17 94.63
Widowed 14 3.95 3.95 98.59
Prefer not to say 3 0.85 0.85 99.44
Separated 2 0.56 0.56 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=2.78 · σ=1.22
ic_sub_2$Children_less_18 <- ifelse(ic_sub_2$Children_less_18 == 1, "Yes", "No")
frq(ic_sub_2$Children_less_18, out="v", show.na=TRUE, title= "Do participants have children under the age of 18 living with them")
Do participants have children under the age of 18 living with them
val label frq raw.prc valid.prc cum.prc
No 293 82.77 82.77 82.77
Yes 61 17.23 17.23 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=1.17 · σ=0.38

There is no need to recode the household size, as it is a numerical variable.

frq(ic_sub_2$Household_size, out="v", show.na=TRUE, title= "Number of people living in the participant's household")
Number of people living in the participant's household
val label frq raw.prc valid.prc cum.prc
1 73 20.62 20.62 20.62
2 177 50 50 70.62
3 60 16.95 16.95 87.57
4 31 8.76 8.76 96.33
5 10 2.82 2.82 99.15
6 1 0.28 0.28 99.44
7 1 0.28 0.28 99.72
11 1 0.28 0.28 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=2.27 · σ=1.12
ic_sub_2$Education <- recode_factor(ic_sub_2$Education,`1` = "High school or less",
                                          `2` = "Trade/Technical school",
                                          `3` = "Some college or Associate's Degree",
                                          `4` = "Graduated College/Bachelor's Degree",
                                          `5` = "Attended Graduate School",
                                          `6` = "Advanced Degree (Master's, PhD.D.)")
frq(ic_sub_2$Education, out="v", show.na=TRUE, title= "Education of the participants")
Education of the participants
val label frq raw.prc valid.prc cum.prc
High school or less 20 5.65 5.65 5.65
Trade/Technical school 10 2.82 2.82 8.47
Some college or Associate’s Degree 114 32.2 32.2 40.68
Graduated College/Bachelor’s Degree 110 31.07 31.07 71.75
Attended Graduate School 28 7.91 7.91 79.66
Advanced Degree (Master’s, PhD.D.) 72 20.34 20.34 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.94 · σ=1.36
# The Ethnicity codes must be labelled from the questionnaire, not with the Education labels.
# Until that mapping is applied, the numeric codes are kept as an unlabelled factor.
ic_sub_2$Ethnicity <- factor(ic_sub_2$Ethnicity)
frq(ic_sub_2$Ethnicity, out="v", show.na=TRUE, title= "Ethnicity of the participants")
Ethnicity of the participants
val label frq raw.prc valid.prc cum.prc
1 285 80.51 80.51 80.51
2 19 5.37 5.37 85.88
3 13 3.67 3.67 89.55
4 22 6.21 6.21 95.76
5 9 2.54 2.54 98.31
6 6 1.69 1.69 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=1.50 · σ=1.16
ic_sub_2$Employment <- recode_factor(ic_sub_2$Employment,`1` = "Employed full-time (30+ hrs)",
                                          `2` = "Employed part time",
                                          `3` = "Not Currently Employed",
                                          `4` = "Student",
                                          `5` = "Retired",
                                          `6` = "Homemaker")
frq(ic_sub_2$Employment, out="v", show.na=TRUE, title= "Employment status of the participants")
Employment status of the participants
val label frq raw.prc valid.prc cum.prc
Employed full-time (30+ hrs) 159 44.92 44.92 44.92
Employed part time 35 9.89 9.89 54.8
Not Currently Employed 16 4.52 4.52 59.32
Student 23 6.5 6.5 65.82
Retired 101 28.53 28.53 94.35
Homemaker 20 5.65 5.65 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=2.81 · σ=1.89
ic_sub_2$Household_income <- recode_factor(ic_sub_2$Household_income,`1` = "under $50,000",
                                          `2` = "$50,000 just under $75,000",
                                          `3` = "$75,000 just under $100,000",
                                          `4` = "$100,000 just under $125,000",
                                          `5` = "$125,000 just under $150,000",
                                          `6` = "$150,000 just under $175,000",
                                          `7` = "$175,000 just under $200,000",
                                          `8` = "$200,000 or more",
                                          `9` = "Prefer not to say")
frq(ic_sub_2$Household_income, out="v", show.na=TRUE, title= "Household income of the participants")
Household income of the participants
val label frq raw.prc valid.prc cum.prc
under $50,000 126 35.59 35.59 35.59
$50,000 just under $75,000 106 29.94 29.94 65.54
$75,000 just under $100,000 43 12.15 12.15 77.68
$100,000 just under $125,000 16 4.52 4.52 82.2
$125,000 just under $150,000 16 4.52 4.52 86.72
$150,000 just under $175,000 7 1.98 1.98 88.7
$175,000 just under $200,000 5 1.41 1.41 90.11
$200,000 or more 1 0.28 0.28 90.4
Prefer not to say 34 9.6 9.6 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=2.83 · σ=2.42

Each state is coded as a number, and the data do not reveal the actual state names. There are 50 states in total.

frq(ic_sub_2$Residence_state, out="v", show.na=TRUE, title= "Residence state of the participants")
Residence state of the participants
val label frq raw.prc valid.prc cum.prc
1 3 0.85 0.85 0.85
2 7 1.98 1.98 2.82
3 9 2.54 2.54 5.37
4 1 0.28 0.28 5.65
5 63 17.8 17.8 23.45
6 9 2.54 2.54 25.99
7 6 1.69 1.69 27.68
10 9 2.54 2.54 30.23
11 3 0.85 0.85 31.07
12 5 1.41 1.41 32.49
13 5 1.41 1.41 33.9
14 16 4.52 4.52 38.42
15 6 1.69 1.69 40.11
16 3 0.85 0.85 40.96
17 3 0.85 0.85 41.81
18 3 0.85 0.85 42.66
19 1 0.28 0.28 42.94
20 2 0.56 0.56 43.5
21 2 0.56 0.56 44.07
22 4 1.13 1.13 45.2
23 16 4.52 4.52 49.72
24 8 2.26 2.26 51.98
25 2 0.56 0.56 52.54
26 6 1.69 1.69 54.24
27 1 0.28 0.28 54.52
28 3 0.85 0.85 55.37
29 10 2.82 2.82 58.19
30 1 0.28 0.28 58.47
31 8 2.26 2.26 60.73
32 2 0.56 0.56 61.3
33 8 2.26 2.26 63.56
34 2 0.56 0.56 64.12
36 10 2.82 2.82 66.95
37 3 0.85 0.85 67.8
38 12 3.39 3.39 71.19
39 13 3.67 3.67 74.86
41 4 1.13 1.13 75.99
42 1 0.28 0.28 76.27
43 6 1.69 1.69 77.97
44 38 10.73 10.73 88.7
45 3 0.85 0.85 89.55
46 1 0.28 0.28 89.83
47 5 1.41 1.41 91.24
48 22 6.21 6.21 97.46
50 9 2.54 2.54 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=24.56 · σ=16.31

2.8 Recoding the Perception and Attitude variables

Let’s look at the current values of these variables, which are Q1-Q3 and Q11_1_1 to Q11_1_17.

frq(ic_sub_2$Satisfaction, out="v", show.na=TRUE, title= "Different values in the Behavioral questions") 
Different values in the Behavioral questions
val label frq raw.prc valid.prc cum.prc
1 1 0.28 0.83 0.83
2 1 0.28 0.83 1.67
4 7 1.98 5.83 7.5
5 22 6.21 18.33 25.83
6 43 12.15 35.83 61.67
7 46 12.99 38.33 100
NA NA 234 66.1 NA NA
total N=354 · valid N=120 · x̄=6.01 · σ=1.07

The values span from 1 to 7, so we recode them on a 7-point Likert scale, with “1” the most negative and “7” the most positive.

As these variables will be used in further analyses, the original numerical columns are kept and the recoded categorical columns are appended to the end of the data set.

ic_sub_2$Satisfaction_cat <- dplyr::recode_factor(ic_sub_2$Satisfaction,
                                                  `1` = "Extremely Dissatisfied",
                                                  `2` = "Dissatisfied",
                                                  `3` = "Somewhat Dissatisfied",
                                                  `4` = "Neither Dissatisfied nor Satisfied",
                                                  `5` = "Somewhat Satisfied",
                                                  `6` = "Satisfied",
                                                  `7` = "Extremely Satisfied"
                       )
y <- c("Extremely Satisfied","Satisfied", "Somewhat Satisfied", "Neither Dissatisfied nor Satisfied",
  "Somewhat Dissatisfied", "Dissatisfied", "Extremely Dissatisfied")

ic_sub_2$Satisfaction_cat <- factor(ic_sub_2$Satisfaction_cat, levels = y )
frq(ic_sub_2$Satisfaction_cat, out="v", show.na=TRUE, title= "How satisfied with Ice Cream Brand")
How satisfied with Ice Cream Brand
val label frq raw.prc valid.prc cum.prc
Extremely Satisfied 46 12.99 38.33 38.33
Satisfied 43 12.15 35.83 74.17
Somewhat Satisfied 22 6.21 18.33 92.5
Neither Dissatisfied nor Satisfied 7 1.98 5.83 98.33
Somewhat Dissatisfied 0 0 0 98.33
Dissatisfied 1 0.28 0.83 99.17
Extremely Dissatisfied 1 0.28 0.83 100
NA NA 234 66.1 NA NA
total N=354 · valid N=120 · x̄=1.99 · σ=1.07

Now we look at Q2 or Buying_likelihood variable in the data set.

ic_sub_2$Buying_likelihood_cat <- dplyr::recode_factor(ic_sub_2$Buying_likelihood,
                                                  `1` = "Not at all Likely to purchase again",
                                                  `2` = "Not Likely to purchase again",
                                                  `3` = "Somewhat Not Likely to purchase again",
                                                  `4` = "Neither Likely nor Unlikely to purchase again",
                                                  `5` = "Somewhat Likely to purchase again",
                                                  `6` = "Likely to purchase again",
                                                  `7` = "Extremely Likely to purchase again"
                       )

y <- c("Extremely Likely to purchase again", "Likely to purchase again",  "Somewhat Likely to purchase again",
  "Neither Likely nor Unlikely to purchase again",  "Somewhat Not Likely to purchase again",
  "Not Likely to purchase again",  "Not at all Likely to purchase again")

ic_sub_2$Buying_likelihood_cat <- factor(ic_sub_2$Buying_likelihood_cat, levels = y )
frq(ic_sub_2$Buying_likelihood_cat, out="v", show.na=TRUE, title= "How Likely to purchase brand again")
How Likely to purchase brand again
val label frq raw.prc valid.prc cum.prc
Extremely Likely to purchase again 74 20.9 61.67 61.67
Likely to purchase again 29 8.19 24.17 85.83
Somewhat Likely to purchase again 11 3.11 9.17 95
Neither Likely nor Unlikely to purchase again 6 1.69 5 100
Somewhat Not Likely to purchase again 0 0 0 100
Not Likely to purchase again 0 0 0 100
Not at all Likely to purchase again 0 0 0 100
NA NA 234 66.1 NA NA
total N=354 · valid N=120 · x̄=1.57 · σ=0.86

Now we look at Q3 or Recommendation likelihood variable in the data set.

ic_sub_2$Recommend_likehood_cat <- dplyr::recode_factor(ic_sub_2$Recommend_likelihood,
                                                  `1` = "Not at all Likely to recommend",
                                                  `2` = "Not Likely to recommend",
                                                  `3` = "Somewhat Not Likely to recommend",
                                                  `4` = "Neither Likely nor Unlikely to recommend",
                                                  `5` = "Somewhat Likely to recommend",
                                                  `6` = "Likely to recommend",
                                                  `7` = "Extremely Likely to recommend"
                       )

# The labels above must match the levels below exactly; a mismatched label becomes NA when re-levelled
y <- c("Extremely Likely to recommend", "Likely to recommend",  "Somewhat Likely to recommend",
  "Neither Likely nor Unlikely to recommend",  "Somewhat Not Likely to recommend",
  "Not Likely to recommend",  "Not at all Likely to recommend")

ic_sub_2$Recommend_likehood_cat <- factor(ic_sub_2$Recommend_likehood_cat, levels = y )
frq(ic_sub_2$Recommend_likehood_cat, out="v", show.na=TRUE, title= "How likely to recommend the brand")
How likely to recommend the brand
val label frq raw.prc valid.prc cum.prc
Extremely Likely to recommend 57 16.1 47.5 47.5
Likely to recommend 40 11.3 33.33 80.83
Somewhat Likely to recommend 13 3.67 10.83 91.67
Neither Likely nor Unlikely to recommend 6 1.69 5 96.67
Somewhat Not Likely to recommend 3 0.85 2.5 99.17
Not Likely to recommend 1 0.28 0.83 100
Not at all Likely to recommend 0 0 0 100
NA NA 234 66.1 NA NA
total N=354 · valid N=120 · x̄=1.84 · σ=1.06
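Re-levelling with `factor(x, levels = y)` turns every value that does not appear in `y` into NA without a warning, so the labels assigned in the recode step and the entries of `y` have to match character for character. The sketch below shows the effect on a made-up vector and a simple check that catches it; all names and labels here are hypothetical.

```r
# Made-up labels: one of them differs from the intended level in word order and spacing
answers <- c("Likely", "Not Somewhat  Likely", "Somewhat Likely", "Likely")
wanted  <- c("Likely", "Somewhat Likely", "Somewhat Not Likely")

releveled <- factor(answers, levels = wanted)
print(releveled)

# Labels that would be lost by the re-levelling
lost <- setdiff(unique(answers), wanted)
print(lost)

# Number of answers silently converted to NA
n_lost <- sum(is.na(releveled))
print(n_lost)
```

Running the `setdiff()` check before re-levelling, or comparing the valid N before and after, makes this kind of mismatch visible.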

2.8.1 Recoding the brand perception and opinion variables

All 17 variables in this area (Q11_1_1 to Q11_1_17) can be coded in the same way. Again, to keep their numerical form for further analyses, they are recoded into new variables that are added to the data set. First we create the new categorical variables and then recode them. Tables of the 17 “recoded” perception questions for the Dreyer’s brand appear next.

library(forcats)

ftt <- function(x) { ordered( x, levels = c("1", "2", "3", "4", "5", "6", "7")) }
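# Columns 7:23 hold the 17 perception items (column 24 is Gender), copied into the 17 new columns 37:53
ic_sub_2[, 37:53] <- 7:23 %.>% ( function(x) { lapply( ic_sub_2[, (x)], ftt ) } )  (.)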

rr <- function(x) { dplyr::recode_factor( x,
                                          `1` = "Does not describe at all",
                                          `2` = "Does not describe",
                                          `3` = "Not Somewhat does describe",
                                          `4` = "Neither does nor does not describe",
                                          `5` = "Somewhat does describe well",
                                          `6` = "Does describe well",
                                          `7` = "Describes extremely well"
) }
ic_sub_2[, 37:53] <- 37:53 %.>% ( function(x) { lapply( ic_sub_2[, (x)], rr ) } )  (.)
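Positional ranges such as `37:53` are easy to get out of step with the source columns. A name-based version of the same idea is sketched below on a hypothetical toy data frame; the `_cat` suffix and the column names are choices made for this illustration.

```r
# Toy data (hypothetical) with two perception items
toy <- data.frame(ID = 1:3,
                  is_fun = c(1, 4, 7),
                  is_treat = c(2, 4, 6),
                  Gender = c(1, 2, 2))

item_cols <- c("is_fun", "is_treat")     # numeric source columns
cat_cols  <- paste0(item_cols, "_cat")   # names for the new categorical copies

# One new ordered factor per item; the numeric originals stay untouched
toy[cat_cols] <- lapply(toy[item_cols],
                        function(x) factor(x, levels = 1:7, ordered = TRUE))

str(toy)
```

Because the source and target columns are derived from the same vector of names, they always have the same length, and inserting or removing a column elsewhere in the data frame does not shift the selection.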


frq( fct_rev(ic_sub_2[,37]), out="v", show.na=TRUE, title= "is relaxing")
is relaxing
val label frq raw.prc valid.prc cum.prc
Describes extremely well 15 4.24 4.24 4.24
Does describe well 39 11.02 11.02 15.25
Somewhat does describe well 67 18.93 18.93 34.18
Neither does nor does not describe 157 44.35 44.35 78.53
Not Somewhat does describe 31 8.76 8.76 87.29
Does not describe 25 7.06 7.06 94.35
Does not describe at all 20 5.65 5.65 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.86 · σ=1.37
frq( fct_rev(ic_sub_2[,38]), out="v", show.na=TRUE, title= "is wholesome" )
is wholesome
val label frq raw.prc valid.prc cum.prc
Describes extremely well 13 3.67 3.67 3.67
Does describe well 39 11.02 11.02 14.69
Somewhat does describe well 66 18.64 18.64 33.33
Neither does nor does not describe 161 45.48 45.48 78.81
Not Somewhat does describe 37 10.45 10.45 89.27
Does not describe 26 7.34 7.34 96.61
Does not describe at all 12 3.39 3.39 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.84 · σ=1.28
frq( fct_rev(ic_sub_2[,39]), out="v", show.na=TRUE, title= "is fun"           )
is fun
val label frq raw.prc valid.prc cum.prc
Describes extremely well 14 3.95 3.95 3.95
Does describe well 28 7.91 7.91 11.86
Somewhat does describe well 57 16.1 16.1 27.97
Neither does nor does not describe 181 51.13 51.13 79.1
Not Somewhat does describe 37 10.45 10.45 89.55
Does not describe 14 3.95 3.95 93.5
Does not describe at all 23 6.5 6.5 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.94 · σ=1.30
frq( fct_rev(ic_sub_2[,40]), out="v", show.na=TRUE, title= "is exciting"       )
is exciting
val label frq raw.prc valid.prc cum.prc
Describes extremely well 13 3.67 3.67 3.67
Does describe well 27 7.63 7.63 11.3
Somewhat does describe well 44 12.43 12.43 23.73
Neither does nor does not describe 176 49.72 49.72 73.45
Not Somewhat does describe 38 10.73 10.73 84.18
Does not describe 36 10.17 10.17 94.35
Does not describe at all 20 5.65 5.65 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=4.09 · σ=1.33
frq( fct_rev(ic_sub_2[,41]), out="v", show.na=TRUE, title= "has premium quality")
has premium quality
val label frq raw.prc valid.prc cum.prc
Describes extremely well 31 8.76 8.76 8.76
Does describe well 71 20.06 20.06 28.81
Somewhat does describe well 70 19.77 19.77 48.59
Neither does nor does not describe 130 36.72 36.72 85.31
Not Somewhat does describe 29 8.19 8.19 93.5
Does not describe 13 3.67 3.67 97.18
Does not describe at all 10 2.82 2.82 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.38 · σ=1.37
frq( fct_rev(ic_sub_2[,42]), out="v", show.na=TRUE, title= "is memorable"            )
is memorable
val label frq raw.prc valid.prc cum.prc
Describes extremely well 17 4.8 4.8 4.8
Does describe well 20 5.65 5.65 10.45
Somewhat does describe well 61 17.23 17.23 27.68
Neither does nor does not describe 171 48.31 48.31 75.99
Not Somewhat does describe 48 13.56 13.56 89.55
Does not describe 21 5.93 5.93 95.48
Does not describe at all 16 4.52 4.52 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.96 · σ=1.27
frq( fct_rev(ic_sub_2[,43]), out="v", show.na=TRUE, title= "Bought as special treat but not regularly")
Bought as special treat but not regularly
val label frq raw.prc valid.prc cum.prc
Describes extremely well 17 4.8 4.8 4.8
Does describe well 23 6.5 6.5 11.3
Somewhat does describe well 77 21.75 21.75 33.05
Neither does nor does not describe 103 29.1 29.1 62.15
Not Somewhat does describe 51 14.41 14.41 76.55
Does not describe 40 11.3 11.3 87.85
Does not describe at all 43 12.15 12.15 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=4.24 · σ=1.60
frq( fct_rev(ic_sub_2[,44]), out="v", show.na=TRUE, title= "is good for regular consumption")
is good for regular consumption
val label frq raw.prc valid.prc cum.prc
Describes extremely well 37 10.45 10.45 10.45
Does describe well 60 16.95 16.95 27.4
Somewhat does describe well 100 28.25 28.25 55.65
Neither does nor does not describe 113 31.92 31.92 87.57
Not Somewhat does describe 20 5.65 5.65 93.22
Does not describe 11 3.11 3.11 96.33
Does not describe at all 13 3.67 3.67 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.29 · σ=1.39
frq( fct_rev(ic_sub_2[,45]), out="v", show.na=TRUE, title= "is interesting")
is interesting
val label frq raw.prc valid.prc cum.prc
Describes extremely well 13 3.67 3.67 3.67
Does describe well 32 9.04 9.04 12.71
Somewhat does describe well 62 17.51 17.51 30.23
Neither does nor does not describe 174 49.15 49.15 79.38
Not Somewhat does describe 41 11.58 11.58 90.96
Does not describe 16 4.52 4.52 95.48
Does not describe at all 16 4.52 4.52 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.88 · σ=1.25
frq( fct_rev(ic_sub_2[,46]), out="v", show.na=TRUE, title= "Tastes better than other brands" )
Tastes better than other brands
val label frq raw.prc valid.prc cum.prc
Describes extremely well 24 6.78 6.78 6.78
Does describe well 51 14.41 14.41 21.19
Somewhat does describe well 85 24.01 24.01 45.2
Neither does nor does not describe 135 38.14 38.14 83.33
Not Somewhat does describe 35 9.89 9.89 93.22
Does not describe 13 3.67 3.67 96.89
Does not describe at all 11 3.11 3.11 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.53 · σ=1.32
frq( fct_rev(ic_sub_2[,47]), out="v", show.na=TRUE, title= "Has many flavors")
Has many flavors
val label frq raw.prc valid.prc cum.prc
Describes extremely well 43 12.15 12.15 12.15
Does describe well 70 19.77 19.77 31.92
Somewhat does describe well 98 27.68 27.68 59.6
Neither does nor does not describe 109 30.79 30.79 90.4
Not Somewhat does describe 17 4.8 4.8 95.2
Does not describe 11 3.11 3.11 98.31
Does not describe at all 6 1.69 1.69 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.12 · σ=1.31
frq( fct_rev(ic_sub_2[,48]), out="v", show.na=TRUE, title= "Is enjoyable"   )
Is enjoyable
val label frq raw.prc valid.prc cum.prc
Describes extremely well 36 10.17 10.17 10.17
Does describe well 82 23.16 23.16 33.33
Somewhat does describe well 102 28.81 28.81 62.15
Neither does nor does not describe 97 27.4 27.4 89.55
Not Somewhat does describe 21 5.93 5.93 95.48
Does not describe 6 1.69 1.69 97.18
Does not describe at all 10 2.82 2.82 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.12 · σ=1.32
frq( fct_rev(ic_sub_2[,49]), out="v", show.na=TRUE, title= "Has best value for price"   )
Has best value for price
val label frq raw.prc valid.prc cum.prc
Describes extremely well 18 5.08 5.08 5.08
Does describe well 31 8.76 8.76 13.84
Somewhat does describe well 69 19.49 19.49 33.33
Neither does nor does not describe 170 48.02 48.02 81.36
Not Somewhat does describe 41 11.58 11.58 92.94
Does not describe 13 3.67 3.67 96.61
Does not describe at all 12 3.39 3.39 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.77 · σ=1.23
frq( fct_rev(ic_sub_2[,50]), out="v", show.na=TRUE, title= "is natural/organic")
is natural/organic
val label frq raw.prc valid.prc cum.prc
Describes extremely well 8 2.26 2.26 2.26
Does describe well 18 5.08 5.08 7.34
Somewhat does describe well 45 12.71 12.71 20.06
Neither does nor does not describe 151 42.66 42.66 62.71
Not Somewhat does describe 42 11.86 11.86 74.58
Does not describe 47 13.28 13.28 87.85
Does not describe at all 43 12.15 12.15 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=4.45 · σ=1.44
frq( fct_rev(ic_sub_2[,51]), out="v", show.na=TRUE, title= "is low calorie")
is low calorie
val label frq raw.prc valid.prc cum.prc
Describes extremely well 6 1.69 1.69 1.69
Does describe well 9 2.54 2.54 4.24
Somewhat does describe well 42 11.86 11.86 16.1
Neither does nor does not describe 153 43.22 43.22 59.32
Not Somewhat does describe 61 17.23 17.23 76.55
Does not describe 45 12.71 12.71 89.27
Does not describe at all 38 10.73 10.73 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=4.53 · σ=1.32
frq( fct_rev(ic_sub_2[,52]), out="v", show.na=TRUE, title= "Great for whole family")
Great for whole family
val label frq raw.prc valid.prc cum.prc
Describes extremely well 44 12.43 12.43 12.43
Does describe well 69 19.49 19.49 31.92
Somewhat does describe well 85 24.01 24.01 55.93
Neither does nor does not describe 117 33.05 33.05 88.98
Not Somewhat does describe 22 6.21 6.21 95.2
Does not describe 8 2.26 2.26 97.46
Does not describe at all 9 2.54 2.54 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.18 · σ=1.36
frq( fct_rev(ic_sub_2[,53]), out="v", show.na=TRUE, title= "Great for guests")
Great for guests
val label frq raw.prc valid.prc cum.prc
Describes extremely well 42 11.86 11.86 11.86
Does describe well 75 21.19 21.19 33.05
Somewhat does describe well 100 28.25 28.25 61.3
Neither does nor does not describe 94 26.55 26.55 87.85
Not Somewhat does describe 20 5.65 5.65 93.5
Does not describe 9 2.54 2.54 96.05
Does not describe at all 14 3.95 3.95 100
NA NA 0 0 NA NA
total N=354 · valid N=354 · x̄=3.16 · σ=1.42
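
The summary columns in these tables can be reproduced by hand, which also shows the scale on which x̄ is reported. The sketch below uses only the frequencies printed in the first table (“is relaxing”); the object names counts and codes are illustrative and are not part of the analysis.

```r
# Frequencies copied from the "is relaxing" table, in the displayed (reversed) order
counts <- c(
  "Describes extremely well"           = 15,
  "Does describe well"                 = 39,
  "Somewhat does describe well"        = 67,
  "Neither does nor does not describe" = 157,
  "Not Somewhat does describe"         = 31,
  "Does not describe"                  = 25,
  "Does not describe at all"           = 20
)

n       <- sum(counts)                          # 354 respondents
raw_prc <- round(100 * counts / n, 2)           # the raw.prc column
cum_prc <- round(100 * cumsum(counts) / n, 2)   # the cum.prc column

# Expand to one position per respondent: 1 = "Describes extremely well"
codes <- rep(seq_along(counts), times = counts)
round(mean(codes), 2)      # 3.86, the reported mean
round(sd(codes), 2)        # 1.37, the reported standard deviation

# The same mean on the original 1-7 coding (7 = "Describes extremely well")
round(8 - mean(codes), 2)  # 4.14
```

A lower x̄ in these tables therefore indicates a more favourable perception, the opposite of the original numeric coding.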

Next, the new categorical variables are given descriptive names, each with a _cat suffix to mark its categorical format.

q <- c("is_relaxing", "is_wholesome", "is_fun",
       "is_exciting", "is_premium_quality", "is_memorable", "is_treat", "is_good_for_regular",
       "is_interesting", "taste_better_other_brands", "has_many_flavors", "is_enjoyable", "has_best_value/price",
       "is_organic", "is_low_cal", "is_great_for_family", "is_great_for_guests")

colnames(ic_sub_2)[37:53] <- paste0(q, "_cat")
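
One of these names, has_best_value/price, contains a slash, so it is not a syntactic R name. The sketch below shows what that means in practice on a hypothetical two-column data frame (q_demo, new_names and df are illustrative names only).

```r
# Two of the names above, one of them containing "/"
q_demo    <- c("is_relaxing", "has_best_value/price")
new_names <- paste0(q_demo, "_cat")

# make.names() returns the syntactic version of each name
new_names == make.names(new_names)   # TRUE FALSE
make.names(new_names)                # "is_relaxing_cat" "has_best_value.price_cat"

# Hypothetical data frame carrying these names
df <- data.frame(a = 1:2, b = 3:4)
colnames(df) <- new_names

# A non-syntactic name needs backticks with $
df$`has_best_value/price_cat`
```

The column works like any other as long as it is quoted with backticks, but functions that repair names may silently change it.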

Finally, the Heard_of_brand and Purchased_last6mo variables are recoded so that a value of 1 becomes 'yes' and any other value becomes 'no'.

ic_sub_2$Heard_of_brand <- ifelse(ic_sub_2$Heard_of_brand ==1, 'yes', 'no')
ic_sub_2$Purchased_last6mo <- ifelse(ic_sub_2$Purchased_last6mo ==1, 'yes', 'no')
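
The behaviour of this ifelse() pattern on unexpected input is worth knowing, even though the data set has already been imputed and filtered at this point. The sketch below uses a hypothetical vector heard that contains a missing value and a stray code; the factor() alternative is one possible stricter option, not the approach taken in this report.

```r
# Hypothetical indicator values, including a missing answer and a stray code
heard <- c(1, 0, NA, 2)

# The pattern used above: only 1 maps to "yes", NA stays NA, anything else is "no"
recoded <- ifelse(heard == 1, "yes", "no")
print(recoded)   # "yes" "no" NA "no"

# Alternative sketch: explicit levels turn unexpected codes into NA instead of "no"
strict <- factor(heard, levels = c(0, 1), labels = c("no", "yes"))
print(table(strict, useNA = "ifany"))
```

With clean 0/1 input both versions give the same answers; they differ only in how they treat values outside that coding.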

2.9 Saving the final data set

The recoded, imputed and filtered data set is saved to the working directory with the code below.

write.csv(ic_sub_2, "./final_data.csv", row.names = FALSE)
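
A CSV file stores only the text of each value, so the factor structure of the _cat variables is not kept in the file. The sketch below illustrates this with a hypothetical two-level column written to a temporary file (toy, path and back are illustrative names); it shows one way an analyst could restore the level order after reading the data back.

```r
# Hypothetical data with one labelled factor whose level order matters
toy <- data.frame(
  id = 1:3,
  rating_cat = factor(
    c("Does describe well", "Does not describe", "Does describe well"),
    levels = c("Does not describe", "Does describe well")
  )
)

# Write to a temporary file and read it back as plain text columns
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

class(back$rating_cat)   # "character": the levels and their order are gone

# Restore the factor with the intended level order
back$rating_cat <- factor(back$rating_cat, levels = levels(toy$rating_cat))
levels(back$rating_cat)
```

If the level order is needed downstream, saveRDS() and readRDS() are a base R alternative that keeps factors intact.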

3 Conclusion

The final data set consists of 354 observations and 53 variables. It includes the behavioural questions (Q1-Q3 and Q11) for the Dreyer’s brand and the participants’ demographic data, in both numerical and categorical format. All participants have heard of the Dreyer’s brand; they include both those who purchased it in the past 6 months and those who did not. The variables have been renamed to reflect their meaning and to ease further analyses. Further details about the cleaned data can be found in the previous sections.

ic_sub_2 %>%  datatable( rownames = TRUE, filter="top", options = list(pageLength = 5, scrollX=T))