Executive Summary:
In this project we built a customer analytics model in Base SAS on a dataset of customer purchase records from two competing book sellers, Amazon and Barnes and Noble (BN), along with the customers' demographics. The objective of our analysis was to build a customized BI model that fits the existing data, in order to make business predictions and to identify the consumer characteristics that drive purchase behavior for Barnes and Noble over Amazon products.
The dataset was first cleaned of missing values by replacing them with appropriate values. Since we were working toward understanding customer purchasing behavior at Barnes and Noble (BN), we considered only the purchases made from BN and treated all Amazon purchases as 0. We then created a unique record for each customer and added a variable named Quantity (Total_count) representing the total purchases made by that customer. This variable, Total_count, was used as the target variable for all our models. For each customer, we used the earliest purchase date as the Date variable.
We implemented different models on the processed dataset and analyzed the results in detail to determine their managerial implications. We compared the models and deliberated on the probable reasons for their differences. We also applied some model enhancement techniques and Logistic Regression to determine the odds ratio of a customer buying books from BN.
We have concluded the report with a section on the managerial insights gained from the project and how such information can be used to shape business strategies. At the end of the report we have also included a section on our learnings and our overall experience of working on the project.
Data Pre-processing
i) Missing Value Handling
We observed that, of all the variables, education, region and age have missing values, represented by 99, a blank and 99 respectively.
The number of missing values for education, age and region is 30,238, 3 and 9,507 records respectively.
As all of these variables are categorical, we decided to replace the missing values with the most frequently occurring value. We wrote the following code to determine the most frequently occurring value for each of these 3 variables.
proc freq data = mis6334.abi_project2_data_books;
  tables education region age;
run;
We observed that the most common values for education, age and region are 4, 6 and 3 respectively, and we thus replaced the missing values with these values.
data abi.imputedata;
  set mis6334.abi_project2_data_books;
  if education = 99 then education = 4;
  if age = 99 then age = 6;
  if region = '' then region = 3;
run;
ii) Handling Amazon Data
As the objective of the project is to understand the factors that affect customer purchasing behavior at Barnes and Noble (BN), we set the 'qty' of all Amazon.com records to 0. Thus, for customers who bought books only from Amazon.com, the total count of books purchased is 0, making them equivalent to customers with 0 BN books in the count model. For customers who bought books from both BN and Amazon, we considered only the BN count and ignored the Amazon count.
data abi.bnCheck;
  set abi.imputedata;
  if domain = 'amazon.com' then qty = 0;
run;
Part I: Modeling Count Data
1. We used PROC SQL to create a dataset counting the number of books each customer bought from BN in 2007. All customers who bought books only from Amazon have a count of 0.
The date field in the data is the date of each individual purchase. Since we created a dataset containing a unique record for each customer showing the total number of books bought, we selected the first date for each customer, i.e. the date on which the customer bought a book for the first time.
However, for customers who bought books from both BN and Amazon, we selected the date of a BN purchase, i.e. the date on which the customer bought a book from BN for the first time. This was done because, for customers who bought books from both Amazon and BN, we did not consider the Amazon purchases.
The following code creates a dataset with unique customer records containing their demographic information and the total count of books purchased from BN. However, customers who bought books from both Amazon and BN have 2 records in this dataset, each record showing the first date of purchase from one of the 2 domains.
proc sql;
  create table abi.tempdata1 as
  select userid, education, region, hhsz, age, income, child, race, country, domain,
         min(date) as DATE, sum(qty) as total_count
  from abi.bncheck
  group by userid, education, region, hhsz, age, income, child, race, country, domain;
quit;
We now split the dataset into 2, one for each of the 2 domains, Amazon and BN.
The following code creates a dataset for BN:
proc sql;
  create table abi.tempdatabn as
  select userid, education, region, hhsz, age, income, child, race, country, domain, DATE, TOTAL_COUNT
  from abi.tempdata1
  where domain = 'barnesandnoble.com';
quit;
The following code creates a dataset for Amazon:

proc sql;
  create table abi.tempdataamazon as
  select userid, education, region, hhsz, age, income, child, race, country, domain, DATE, TOTAL_COUNT
  from abi.tempdata1
  where domain = 'amazon.com';
quit;
The following code merges the 2 datasets, keeping all records from the BN dataset and only those records from the Amazon dataset whose customers are not present in the BN dataset. We thus have a dataset in which the 'Date' for customers who bought from both Amazon and BN is the date of their first purchase from BN.
proc sql;
  create table abi.tempdata2 as
  select userid, education, region, hhsz, age, income, child, race, country, DATE, TOTAL_COUNT
  from abi.tempdatabn
  union
  select userid, education, region, hhsz, age, income, child, race, country, DATE, TOTAL_COUNT
  from abi.tempdataamazon
  where userid not in (select userid from abi.tempdatabn);
quit;
We
also observed that the date field in the dataset is in the form YYYYMMDD. We
used the following code to convert date to the SAS date format. The attribute
Date_val contains the SAS dates.
data abi.Permdata;
  set abi.tempdata2;
  Date_val = input(put(DATE, 8.), yymmdd8.);
  put DATE @10 date_val date9. @20 date_val 6.;
run;
The following code prints the first 10 observations of the created dataset:

proc print data = abi.Permdata (obs=10);
  title "BN Data";
run;
Note: Since the dataset contains data only for 2007, there was no need to add a year filter.
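As a quick check of this, the SAS-date variable created above can be tabulated by year (a minimal sketch):

proc freq data = abi.Permdata;
  /* group the SAS dates by calendar year; every record should fall in 2007 */
  tables date_val;
  format date_val year4.;
run;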
2. NBD Model:

proc freq data = abi.Permdata;
  /* Creates a table with total_count as the number of books bought from BN and
     count as the number of customers corresponding to each number of books bought */
  table total_count / out = q2table;
run;

proc nlmixed data = q2table;
  /* the 2 parameters of the NBD model (Poisson-Gamma): r for shape and alpha for scale */
  parms r=1 alpha=1;
  /* calculation of the log-likelihood, weighted by the customer count at each value */
  ll = count*log((gamma(r+total_count)/(gamma(r)*fact(total_count)))*
       ((alpha/(alpha+1))**r)*((1/(alpha+1))**total_count));
  model total_count ~ general(ll);
run;
3. The probability of the count being equal to 0, i.e. the probability of a customer buying 0 books from BN:
P(X=0) = (alpha/(alpha+1))^r = (0.1299/(0.1299+1))^0.09723 = 0.8103
Mean E(X) = r/alpha = 0.09723/0.1299 = 0.7485
Reach = 1 - P(X=0) = 1 - 0.8103 = 0.1897 = 18.97%
Average Frequency = E(X) / (1 - P(X=0)) = 0.7485/0.1897 = 3.946
GRPs = 100 * E(X) = 74.85
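These quantities can be reproduced directly from the PROC NLMIXED estimates with a short DATA step (a minimal sketch; the dataset name nbd_summary is ours):

data nbd_summary;
  r = 0.09723;  alpha = 0.1299;          /* NBD estimates from PROC NLMIXED above */
  p0       = (alpha / (alpha + 1))**r;   /* P(X = 0): share buying no BN books    */
  mean_x   = r / alpha;                  /* E[X]: expected books per customer     */
  reach    = 1 - p0;                     /* share buying at least one BN book     */
  avg_freq = mean_x / reach;             /* average books per buying customer     */
  grps     = 100 * mean_x;               /* gross rating points                   */
  put p0= mean_x= reach= avg_freq= grps=;
run;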
4. Poisson Regression Model:

proc nlmixed data = abi.Permdata;
  /* m stands for lambda; b1, b2, b3, b4, b5, b6, b7 and b8 are the beta values for
     education, region, hhsz, age, income, child, race and country respectively */
  parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0 b8=0;
  m = m0*exp(b1*education + b2*region + b3*hhsz + b4*age + b5*income + b6*child + b7*race + b8*country);
  /* calculation of the Poisson log-likelihood */
  ll = total_count*log(m) - m - log(fact(total_count));
  model total_count ~ general(ll);
run;
Thus the estimated values are as follows:
a) Beta value for education (b_education) = -0.09095
b) Beta value for region (b_region) = -0.1019
c) Beta value for household size (b_hhsz) = -0.01737
d) Beta value for age (b_age) = 0.0246
e) Beta value for income (b_income) = 0.01781
f) Beta value for child (b_child) = 0.07184
g) Beta value for race (b_race) = -0.2181
h) Beta value for country (b_country) = -0.1217
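The percentage effects quoted in the takeaways below follow from the coefficients as 100*(1 - exp(beta)) for a negative coefficient and 100*(exp(beta) - 1) for a positive one; a minimal sketch using two of the estimates above (the dataset name is ours):

data poisson_effects;
  b_education = -0.09095;                               /* coefficient reported above */
  b_age       =  0.0246;                                /* coefficient reported above */
  pct_change_education = 100 * (1 - exp(b_education));  /* ~8.69% decrease per level  */
  pct_change_age       = 100 * (exp(b_age) - 1);        /* ~2.49% increase per level  */
  put pct_change_education= pct_change_age=;
run;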
Managerial takeaways:
I. If the education of a consumer is increased by 1 level (from 0 to 5), the total no. of his/her purchase of books in barnesandnoble.com site decreases by a factor of exp(0.09095) = 1.0952. Hence, an increase in the customer's education by 1 level causes a [(x-x/1.0952)/x*100] = 8.69% decrease in the customer's total purchase of books in barnesandnoble.com site.
II.
If the region of a
consumer is changed from 1 to 4 by 1 level, the total no. of his/her purchase
of books in barnesandnoble.com site decreases by a factor of exp(0.1019)=1.1072.
A unit change in the customer’s region (from 1 to 4) causes a [(x-x/1.1072)/x*100]
= 9.68% decrease in the customer’s total purchase of books in
barnesandnoble.com site.
III.
If the household size
of a consumer is increased by 1, the total no. of his/her purchase of books in
barnesandnoble.com site decreases by a factor of exp(0.01737)=1.0175. Hence, a
unit increase in the customers household size causes a [(x-x/1.0175)/x*100] =
1.71% decrease in the customer’s total purchase of books in barnesandnoble.com
site.
IV.
If the age of a
consumer is increased by 1 level (from 1 to 11), the total no. of his/her
purchase of books in barnesandnoble.com site increases by a factor of
exp(0.0246)=1.0249. An increase in the customer’s age by 1 level causes a 2.49%
increase in the customer’s total purchase of books in barnesandnoble.com site.
V.
If the income of a
consumer is increased by 1 level (from 1 to 7), the total no. of his/her
purchase of books in barnesandnoble.com site increases by a factor of
exp(0.01781)=1.0179. An increase in the customer’s income by 1 level causes a
1.79% increase in the customer’s total purchase of books in barnesandnoble.com
site.
VI.
If a consumer has
children, the total no. of his/her purchase of books in barnesandnoble.com site
is exp(0.07184)=1.0744 times the total no. of purchase of books of a consumer
who has no children. Thus, a customer having children buys 7.44% books in
barnesandnoble.com site more than a customer having no children.
VII. If the race of a consumer is changed from 1 to 5 by 1 level, the total no. of his/her purchase of books in barnesandnoble.com site decreases by a factor of exp(0.2181) = 1.2437. Hence a unit change in the customer's race (from 1 to 5) causes a [(x-x/1.2437)/x*100] = 19.59% decrease in the customer's total purchase of books in barnesandnoble.com site.
VIII.
If the country of a
consumer is not US, the total no. of his/her purchase of books in
barnesandnoble.com site is exp(0.1217)=1.1294 times less than the total no. of
purchase of books of a consumer whose country is U.S. Thus, a customer whose
country is not U.S buys [(x-x/1.1294)/x*100] = 11.45% less books in
barnesandnoble.com site than a customer whose country is U.S.
IX. Also, from the p-values, we observed that Household size and Child are not significant in determining the total number of books that a customer buys from barnesandnoble.com site.
We recommend not using Date in the regression for the following reasons:
a) The variable Date has too many distinct values to be treated as a categorical variable.
b) If the Date variable is treated as an interval variable, it is still not suitable for analysis because the dataset does not cover a long enough period of time for Date to be an efficient predictor of the dependent variable.
c) When the date is converted into a SAS date and used in the regression, SAS gives the warning "Problem needs more than 200 iterations or 2000 function calls."
5. NBD Regression Model:
Formula used for LL:
LL = log[ (Gamma(r + Y) / (Gamma(r) * Y!)) * (alpha / (alpha + lambda))^r * (lambda / (alpha + lambda))^Y ], with lambda = exp(X*Beta)
Where
r = shape
alpha = scale
Y = total_count
Beta = beta values for all the input parameters (Xi) - education, region, household size, age, income, child, race and country, in our project
6. NBD Regression Model code:

proc nlmixed data = abi.Permdata;
  /* r stands for shape, alpha for scale; b_education, b_region, b_hhsz, b_age, b_income,
     b_child, b_race and b_country are the beta values for education, region, hhsz,
     age, income, child, race and country respectively */
  parms r=1, alpha=1, b_education=0, b_region=0, b_hhsz=0, b_age=0, b_income=0, b_child=0, b_race=0, b_country=0;
  expon = exp(b_education*education + b_region*region + b_hhsz*hhsz + b_age*age + b_income*income + b_child*child + b_race*race + b_country*country);
  ll = log((gamma(r+total_count)/(gamma(r)*fact(total_count)))*((alpha/(alpha+expon))**r)*((expon/(alpha+expon))**total_count));
  model total_count ~ general(ll);
run;
Thus the estimated values are as follows:
a) Beta value for education (b_education) = -0.08480
b) Beta value for region (b_region) = -0.09972
c) Beta value for household size (b_hhsz) = -0.00739
d) Beta value for age (b_age) = 0.02847
e) Beta value for income (b_income) = 0.01800
f) Beta value for child (b_child) = 0.05340
g) Beta value for race (b_race) = -0.2293
h) Beta value for country (b_country) = -0.08087

Lambda = exp(b_education*education + b_region*region + b_hhsz*hhsz + b_age*age + b_income*income + b_child*child + b_race*race + b_country*country)
Managerial takeaways:
I. If the education of a consumer is increased by 1 level (from 0 to 5), the total no. of his/her purchase of books in barnesandnoble.com site decreases by a factor of exp(0.0848) = 1.0885. Hence, an increase in the customer's education by 1 level causes a [(x-x/1.0885)/x*100] = 8.13% decrease in the customer's total purchase of books in barnesandnoble.com site.
II.
If
the region of a consumer is changed from 1 to 4 by 1 level, the total no. of
his/her purchase of books in barnesandnoble.com site decreases by
a factor of exp(0.09972)=1.1048. A unit change
in the customer’s region (from 1 to 4) causes a [(x-x/1.1048)/x*100] = 9.49%
decrease in the customer’s total purchase of books in barnesandnoble.com site.
III.
If
the household size of a consumer is increased by 1, the total no. of his/her
purchase of books in barnesandnoble.com site decreases by
a factor of exp(0.00739)=1.0074. Hence, a unit increase in the customers
household size causes a [(x-x/1.0074)/x*100] = 0.73% decrease in the customer’s
total purchase of books in barnesandnoble.com site.
IV.
If
the age of a consumer is increased by 1 level (from 1 to 11), the total no. of
his/her purchase of books in barnesandnoble.com site increases by a factor of
exp(0.02847)=1.0289. An increase in the customer’s age by 1 level causes a
2.88% increase in the customer’s total purchase of books in barnesandnoble.com
site.
V.
If
the income of a consumer is increased by 1 level (from 1 to 7), the total no.
of his/her purchase of books in barnesandnoble.com site increases by a factor
of exp(0.018)=1.0181. An increase in the customer’s income by 1 level causes a
1.81% increase in the customer’s total purchase of books in barnesandnoble.com
site.
VI.
If
a consumer has children, the total no. of his/her purchase of books in barnesandnoble.com
site is exp(0.0534)=1.0548 times the total no. of purchase of books of a
consumer who has no children. Thus, a customer having children buys 5.48% books
in barnesandnoble.com site more than a customer having no children.
VII. If the race of a consumer is changed from 1 to 5 by 1 level, the total no. of his/her purchase of books in barnesandnoble.com site decreases by a factor of exp(0.2293) = 1.2577. Hence a unit change in the customer's race (from 1 to 5) causes a [(x-x/1.2577)/x*100] = 20.49% decrease in the customer's total purchase of books in barnesandnoble.com site.
VIII.
If
the country of a consumer is not US, the total no. of his/her purchase of books
in barnesandnoble.com site is exp(0.08087)=1.0842 times less
than the total no. of purchase of books of a consumer whose country is U.S.
Thus, a customer whose country is not U.S buys [(x-x/1.0842)/x*100] = 7.76% less
books in barnesandnoble.com site than a customer whose country is U.S.
IX.
Also
from the p-value, we observed that Household size, child, country and income are
not significant in determining the total number of books that a customer buys from
barnesandnoble.com site.
X. We also plotted the gamma mixing distribution of lambda with the estimated values r = 0.09837 and alpha = 0.07914.

data random1;
  do i = 1 to 1000;
    /* lambda ~ Gamma(shape = r, rate = alpha), i.e. scale = 1/alpha */
    x = rand('GAMMA', 0.09837) / 0.07914;
    output;
  end;
run;

proc sgplot data = random1;
  histogram x;
  title "Histogram of simulated lambda";
run;
We thus observed that the lambda value is high for only a small number of customers, i.e. few customers buy books from BN frequently.
7. Differences in the managerial takeaways between Poisson Regression and NBD Regression:

| Change in input variable | Poisson Regression | NBD Regression |
| Increase in customer's education by 1 level | 8.69% decrease | 8.13% decrease |
| Region of consumer changed from 1 to 4 by 1 level | 9.68% decrease | 9.49% decrease |
| Household size of a consumer increased by 1 | 1.71% decrease | 0.73% decrease |
| Increase in the customer's age by 1 level | 2.49% increase | 2.88% increase |
| Increase in the customer's income by 1 level | 1.79% increase | 1.81% increase |
| Customer having children | 7.44% more | 5.48% more |
| A unit change in the customer's race (from 1 to 5) | 19.59% decrease | 20.49% decrease |
| A customer whose country is not U.S. | 11.45% less | 7.76% less |
We observed that the effect of each input variable is similar in both models. However, the effect of household size is more pronounced in the Poisson Regression than in the NBD Regression: a unit increase in household size causes a 1.71% decrease in lambda under the Poisson Regression and only a 0.73% decrease (less than half) under the NBD Regression. Also, the country of the customer has more of an effect on lambda in the Poisson Model: a non-US customer buys 7.76% fewer books than a US customer as per the NBD model, whereas as per the Poisson Regression he/she buys 11.45% fewer than a US customer.
In the NBD Regression Model, Household size, child, country and income are not significant in determining the total number of books that a customer buys from the barnesandnoble.com site, whereas in the Poisson Regression Model only Household size and child are not significant.
8. Comparison of whether the NBD Regression fits the data better than the Poisson Regression:
ll for Poisson Regression Model = -18,806
ll for NBD Regression Model = -8,368.5
LR = -2(LL_Poisson - LL_NBD) = -2(-18,806 - (-8,368.5)) = -2(-10,437.5) = 20,875
χ2(.05, k) for df = 1 (because the NBD Regression has one extra parameter: alpha and r in place of the single m0 of the Poisson Regression) → 3.841
Hence we observed that LR > χ2, signifying that the more complicated model, the NBD Regression in this case, fits the data better.
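The same comparison can be scripted so it is easy to repeat for the later model comparisons; a minimal sketch (the macro name %lrtest is ours) using CINV for the chi-square critical value:

%macro lrtest(ll_simple=, ll_complex=, df=);
  data _null_;
    lr   = -2 * ((&ll_simple) - (&ll_complex));  /* likelihood-ratio statistic        */
    crit = cinv(0.95, &df);                      /* chi-square critical value at 0.05 */
    put lr= crit=;
    if lr > crit then put 'The more complex model fits significantly better.';
    else put 'The simpler model is adequate.';
  run;
%mend lrtest;

/* Example: Poisson Regression (simpler) vs NBD Regression (more complex), df = 1 */
%lrtest(ll_simple=-18806, ll_complex=-8368.5, df=1);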
Conclusion:
The NBD Regression Model fits the data better than the Poisson Regression, i.e. the performance of the more complicated model (10 parameters) is better than that of the simpler model (9 parameters).
Probable Reason:
It is often assumed that a more complex model, i.e. one with more parameters or predictor attributes, performs better than a simpler model because the added flexibility results in better predictive performance.
The Poisson
distribution assumes
that the mean and variance are the same. However, sometimes data show extra
variation that is greater than the mean. This situation is called over-dispersion and negative binomial regression is more
flexible in that regard than Poisson regression. The negative
binomial distribution has one
parameter more than the Poisson regression that adjusts the variance
independently from the mean.
Thus, only the nature of the data and the questions of interest can determine which of these regressions is best for the situation; one cannot conclude that either model is universally better.
Our data fits the NBD Regression better because it may be a case of over-dispersion, where the variance is greater than the mean.
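A quick way to check for over-dispersion is to compare the sample mean and variance of the count variable directly (a minimal sketch):

proc means data = abi.Permdata mean var maxdec=3;
  /* a variance well above the mean indicates over-dispersion relative to the Poisson */
  var total_count;
run;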
Part II: Improving the Model
9. We
performed 3 tests to determine the variables that are not useful and can be
dropped from analysis
A) Correlation
We ran a correlation of each of the variables with Total_count:

data abi.corrdata;
  set abi.Permdata12;
run;

proc corr data = abi.corrdata;
  var education region hhsz age income child race country date_val;
  with total_count;
  title 'Correlations with Total Count';
run;
Conclusion:
We observed that the attributes hhsz, child and income are the least correlated with total_count.
B) p-value from NBD Regression and Poisson
Regression
We
observed the p-value of the Poisson Regression and NBD Regression to determine
the variables that are least significant.
Conclusion: We observed that the attributes hhsz, child and income are the least significant.
C) Variable Clustering in Enterprise Miner
We executed 'Variable Clustering' on the dataset in SAS EM.
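A roughly comparable check can also be run in Base SAS with PROC VARCLUS (a sketch; this is not the identical algorithm used by the EM Variable Clustering node):

proc varclus data = abi.Permdata maxeigen = 0.7 short;
  /* cluster the inputs together with the target to see which inputs group away from total_count */
  var education region hhsz age income child race country total_count;
run;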
Conclusion:
We
observed that the variables Education, Income, HHSZ, Child and Age are far from
Total_Count.
Conclusion from the 3 tests:
All 3 tests indicated that the variables HHSZ, Child and Income are not significant in determining Total_Count, and hence we decided to drop them from the analysis.
Poisson Regression code with reduced number of variables:

proc nlmixed data = abi.Permdata;
  /* m stands for lambda */
  parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0;
  m = m0*exp(b1*education + b2*region + b3*age + b4*race + b5*country);
  ll = total_count*log(m) - m - log(fact(total_count));
  model total_count ~ general(ll);
run;
Comparison of the Poisson Regression on all the variables with the Poisson Regression on the reduced number of variables:
ll of the Poisson Regression on all the variables = -18,806
ll of the Poisson Regression on the reduced number of variables = -18,813
LR = -2(-18,813 - (-18,806)) = -2(-7) = 14
χ2(.05, k) for df = 3 (because we dropped 3 variables from the analysis) → 7.815
Hence LR > χ2, signifying that the more complicated model, i.e. the Poisson Model with all the variables, is indeed better than the model with the reduced number of variables.
NBD Regression code with reduced number of variables:

proc nlmixed data = abi.Permdata;
  parms r=1 a=1 b1=0 b2=0 b3=0 b4=0 b5=0;
  m = exp(b1*education + b2*region + b3*age + b4*race + b5*country);
  num1 = gamma(r + total_count);
  denom1 = gamma(r) * fact(total_count);
  term1 = num1 / denom1;
  denom2 = a + m;
  term2 = (a / denom2)**r;
  term3 = (m / denom2)**total_count;
  ll = log(term1*term2*term3);
  model total_count ~ general(ll);
run;
Comparison of the NBD Regression on all the variables with the NBD Regression on the reduced number of variables:
ll of the NBD Regression on all the variables = -8,368.5
ll of the NBD Regression on the reduced number of variables = -8,369
LR = -2(-8,369 - (-8,368.5)) = -2(-0.5) = 1
χ2(.05, k) for df = 3 (because we dropped 3 variables from the analysis) → 7.815
Hence LR < χ2, signifying that the more complicated model, i.e. the NBD Model with all the variables, is not significantly better than the model with the reduced number of variables. Hence the simpler model can be used.
Conclusion of the Variable Reduction Analysis:
The Poisson Regression fits the original dataset, i.e. the one with 8 input variables, better than the dataset with only 5 variables. For the NBD Regression model, however, there is no significant difference in fit between the 2 datasets, and hence the simpler dataset, i.e. the one with 5 variables, can be used. We can thus conclude that there is no overfitting for the Poisson Model, whereas there was a case of overfitting for the NBD model, and thus one can use the simpler dataset for it.
10. Construction of 3 new input variables for the analysis:
a) Loyalty of Consumer – The variable LoyalCust is set to 1 if the customer has purchased books only from Barnes and Noble. It is set to 0 if the customer has bought from both Barnes and Noble and Amazon, and also to 0 if the customer has bought from Amazon alone.
Code to add the variable LoyalCust:

proc sql;
  create table abi.permdata_q10 as
  select userid, education, region, hhsz, age, income, child, race, country,
         date, date_val, total_count, '1' as LoyalCust
  from abi.permdata;
quit;

In the above code, we simply create another dataset with one additional column, 'LoyalCust', whose value is set to 1. (The date_val column is carried along because it is needed later for the Season and Weekend variables.)
proc sql;
  update abi.permdata_q10
  set LoyalCust = '0'
  where userid in (select userid from abi.tempdataamazon);
quit;
In the above code, we set the variable 'LoyalCust' to 0 for those customers who have bought books from Amazon.com. For question 1, we had split the original dataset to create abi.tempdataamazon, which contains the customers who have bought books from Amazon (both those who bought only from Amazon and those who bought from both BN and Amazon).
Thus, by setting 'LoyalCust' to 0 for the customers present in abi.tempdataamazon, we keep 'LoyalCust' as 1 only for those customers who have bought exclusively from BN.
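The same flag could also be built in a single PROC SQL step instead of a create-then-update pair; a minimal sketch (the output table name abi.permdata_loyal is ours):

proc sql;
  create table abi.permdata_loyal as
  select p.*,
         /* '1' = bought from BN only, '0' = bought from Amazon at least once */
         case when a.userid is null then '1' else '0' end as LoyalCust
  from abi.permdata as p
       left join (select distinct userid from abi.tempdataamazon) as a
       on p.userid = a.userid;
quit;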
b) Season – If the month of purchase is December, January or February, the season is 'Winter' and the variable Season is set to 1; if it is March, April or May, the season is 'Spring' and Season is set to 2; if it is June, July or August, the season is 'Summer' and Season is set to 3; and if it is September, October or November, the season is 'Fall' and Season is set to 4.
c) Weekend – If the day of purchase is a Saturday or Sunday, it is considered a weekend and the variable Weekend is set to 1; otherwise it is set to 0.
Code to add Season and Weekend:

data abi.Permdata_q10_date;
  set abi.Permdata_q10;
  Weekend = 0;
  if weekday(date_val) = 1 or weekday(date_val) = 7 then Weekend = 1;
  if month(date_val) in (12, 1, 2) then Season = 1;       /* Winter */
  else if month(date_val) in (3, 4, 5) then Season = 2;   /* Spring */
  else if month(date_val) in (6, 7, 8) then Season = 3;   /* Summer */
  else Season = 4;                                        /* Fall   */
run;
In the above code, we set Weekend = 1 for all records whose purchase date falls on weekday 1 or 7, i.e. Sunday or Saturday. We also set Season to Winter ('1') for December, January and February; to Spring ('2') for March, April and May; to Summer ('3') for June, July and August; and to Fall ('4') for September, October and November.
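A quick sanity check of the two new variables (a minimal sketch):

proc freq data = abi.Permdata_q10_date;
  /* distribution of purchases across seasons and weekends vs weekdays */
  tables Season Weekend Season*Weekend;
run;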
We now executed the Poisson Regression model on the dataset with the 3 additional variables:

proc nlmixed data = abi.Permdata_q10_date;
  /* m stands for lambda */
  parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0 b8=0 b9=0 b10=0 b11=0;
  m = m0*exp(b1*education + b2*region + b3*hhsz + b4*age + b5*income + b6*child
             + b7*race + b8*country + b9*LoyalCust + b10*Weekend + b11*Season);
  ll = total_count*log(m) - m - log(fact(total_count));
  model total_count ~ general(ll);
run;
Comparison between the Poisson Regression Model on the original dataset and the Poisson Regression on the dataset with the 3 additional attributes:
ll for the original dataset = -18,806
ll for the new dataset = -12,737
LR = -2(-18,806 - (-12,737)) = -2(-6,069) = 12,138
χ2(.05, k) for df = 3 (because we added 3 variables to the analysis) → 7.815
Hence LR > χ2, signifying that the more complicated model, i.e. the Poisson Model with the 3 additional variables, is better than the original model.
We now executed the NBD Regression model on the dataset with the 3 additional variables:

proc nlmixed data = abi.Permdata_q10_date;
  /* r is the shape and alpha the scale of the gamma mixing distribution */
  parms r=1 alpha=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0 b8=0 b9=0 b10=0 b11=0;
  expon = exp(b1*education + b2*region + b3*hhsz + b4*age + b5*income + b6*child
              + b7*race + b8*country + b9*LoyalCust + b10*Weekend + b11*Season);
  ll = log((gamma(r+total_count)/(gamma(r)*fact(total_count)))*((alpha/(alpha+expon))**r)*((expon/(alpha+expon))**total_count));
  model total_count ~ general(ll);
run;
Comparison between the NBD Regression Model on the original dataset and the NBD Regression on the dataset with the 3 additional attributes:
ll for the original dataset = -8,368.5
ll for the new dataset = -7,569.5
LR = -2(-8,368.5 - (-7,569.5)) = -2(-799) = 1,598
χ2(.05, k) for df = 3 (because we added 3 variables to the analysis) → 7.815
Hence LR > χ2, signifying that the more complicated model, i.e. the NBD Model with the 3 additional variables, is better than the original model.
Conclusion on constructing additional variables for the dataset:
We observed that constructing the 3 variables Loyalty, Season and Weekend improved the performance of both the Poisson and the NBD model. We also examined the p-values of these 3 variables in both the Poisson and NBD Regressions and concluded that the variables, especially Loyalty and Season, are significant in determining the total purchases of customers. Hence constructing significant variables increased the performance of the models.
11. Interaction Effects:
A) Qualitative consideration
We brainstormed on the effect of each input variable on the target variable, as well as on their relationships with each other. We decided on the following 3 interaction effects:
a) Hhsz*child
In the age of nuclear families, household
size is strongly correlated to having children. If the customer has children
his/her household size would generally be more than a customer who has no
children.
b) region*race*country
Country, region and race are demographic
information of a customer and are strongly correlated with each other. The race
and region where a customer resides is often related to the country of
residence of the consumer.
c) age*income*education
The older a consumer is, the more likely he/she is to be better educated and to earn more; likewise, the more educated a consumer is, the more likely he/she is to earn more.
B) Variable Clustering On Enterprise Miner
On
applying Variable Clustering on SAS EM, we get the following clusters:-
i)
Cluster 1- HHSZ and Child
ii)
Cluster 2- Education, Income and Age
iii)
Cluster 3 – Region , Country and Race
We thus concluded that HHSZ and Child; Education, Income and Age; and Region, Country and Race are strongly related within each cluster, and hence their interaction effects can be considered for the analysis.
data abi.Permdataq11;
  set abi.Permdata;
  parameter1 = hhsz*child;
  parameter2 = region*race*country;
  parameter3 = age*income*education;
run;
i) Interaction Effect 1 - hhsz*child
Poisson Regression:

proc contents data = abi.Permdataq11 position;
run;

proc nlmixed data = abi.Permdataq11;
  /* m stands for lambda */
  parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0;
  m = m0*exp(b1*education + b2*region + b3*age + b4*income + b5*race + b6*country + b7*parameter1);
  ll = total_count*log(m) - m - log(fact(total_count));
  model total_count ~ general(ll);
run;
Comparison of the Poisson Regression model on the dataset with the interaction effect with the Poisson Regression model on the original dataset:
ll of the Poisson Regression model on the dataset with the interaction effect = -18,808.5
ll of the Poisson Regression model on the original dataset = -18,806
LR = -2(-18,808.5 - (-18,806)) = 5
χ2(.05, k) for df = 1 (because we dropped hhsz and child and added hhsz*child to the analysis) → 3.841
Thus LR > χ2, signifying that the more complicated model, i.e. the Poisson Model on the original dataset, is better than the model on the new dataset.
NBD Regression:

proc nlmixed data = abi.Permdataq11;
  parms r=1 a=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0 b7=0;
  m = exp(b1*education + b2*region + b3*age + b4*income + b5*race + b6*country + b7*parameter1);
  num1 = gamma(r + total_count);
  denom1 = gamma(r) * fact(total_count);
  term1 = num1 / denom1;
  denom2 = a + m;
  term2 = (a / denom2)**r;
  term3 = (m / denom2)**total_count;
  ll = log(term1*term2*term3);
  model total_count ~ general(ll);
run;
Comparison of the NBD Regression model on the dataset with the interaction effect with the NBD Regression model on the original dataset:
ll of the NBD Regression model on the dataset with the interaction effect = -8,368.5
ll of the NBD Regression model on the original dataset = -8,368.5
LR = -2(-8,368.5 - (-8,368.5)) = 0
χ2(.05, k) for df = 1 (because we dropped hhsz and child and added hhsz*child to the analysis) → 3.841
Thus LR < χ2, signifying that the more complicated model, i.e. the NBD Model on the original dataset, is not significantly different from the model on the new dataset. Hence the new, simpler model can be used.
ii) Interaction Effect 2 - region*race*country
Poisson Regression:

proc contents data = abi.Permdataq11 position;
run;

proc nlmixed data = abi.Permdataq11;
  /* m stands for lambda */
  parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0;
  m = m0*exp(b1*education + b2*hhsz + b3*age + b4*income + b5*child + b6*parameter2);
  ll = total_count*log(m) - m - log(fact(total_count));
  model total_count ~ general(ll);
run;
Comparison of the Poisson Regression model on the dataset with the interaction effect with the Poisson Regression model on the original dataset:
ll of the Poisson Regression model on the dataset with the interaction effect = -18,859
ll of the Poisson Regression model on the original dataset = -18,806
LR = -2(-18,859 - (-18,806)) = 106
χ2(.05, k) for df = 2 (because we dropped region, race and country and added region*race*country to the analysis) → 5.991
Thus LR > χ2, signifying that the more complicated model, i.e. the Poisson Model on the original dataset, is better than the model on the new dataset.
NBD Regression:

proc nlmixed data = abi.Permdataq11;
  parms r=1 a=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0;
  m = exp(b1*education + b2*hhsz + b3*age + b4*income + b5*child + b6*parameter2);
  num1 = gamma(r + total_count);
  denom1 = gamma(r) * fact(total_count);
  term1 = num1 / denom1;
  denom2 = a + m;
  term2 = (a / denom2)**r;
  term3 = (m / denom2)**total_count;
  ll = log(term1*term2*term3);
  model total_count ~ general(ll);
run;
Comparison of the NBD Regression model on the dataset with the interaction effect with the NBD Regression model on the original dataset:
ll of the NBD Regression model on the dataset with the interaction effect = -8,374.5
ll of the NBD Regression model on the original dataset = -8,368.5
LR = -2(-8,374.5 - (-8,368.5)) = 12
χ2(.05, k) for df = 2 (because we dropped region, race and country and added region*race*country to the analysis) → 5.991
Thus LR > χ2, signifying that the more complicated model, i.e. the NBD Model on the original dataset, is better than the model on the new dataset.
iii) Interaction Effect 3 - age*income*education
Poisson Regression:

proc contents data = abi.Permdataq11 position;
run;

proc nlmixed data = abi.Permdataq11;
  /* m stands for lambda */
  parms m0=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0;
  m = m0*exp(b1*region + b2*hhsz + b3*child + b4*race + b5*country + b6*parameter3);
  ll = total_count*log(m) - m - log(fact(total_count));
  model total_count ~ general(ll);
run;
Comparison of the Poisson Regression model on the dataset with the interaction effect with the Poisson Regression model on the original dataset:
ll of the Poisson Regression model on the dataset with the interaction effect = -18,850
ll of the Poisson Regression model on the original dataset = -18,806
LR = -2(-18,850 - (-18,806)) = 88
χ2(.05, k) for df = 2 (because we dropped age, income and education and added age*income*education to the analysis) → 5.991
Thus LR > χ2, signifying that the more complicated model, i.e. the Poisson Model on the original dataset, is better than the model on the new dataset.
NBD Regression:

proc nlmixed data = abi.Permdataq11;
  parms r=1 a=1 b1=0 b2=0 b3=0 b4=0 b5=0 b6=0;
  m = exp(b1*region + b2*hhsz + b3*child + b4*race + b5*country + b6*parameter3);
  num1 = gamma(r + total_count);
  denom1 = gamma(r) * fact(total_count);
  term1 = num1 / denom1;
  denom2 = a + m;
  term2 = (a / denom2)**r;
  term3 = (m / denom2)**total_count;
  ll = log(term1*term2*term3);
  model total_count ~ general(ll);
run;
Comparison of the NBD Regression model on the dataset with the interaction effect with the NBD Regression model on the original dataset:
ll of the NBD Regression model on the dataset with the interaction effect = -8,373.5
ll of the NBD Regression model on the original dataset = -8,368.5
LR = -2(-8,373.5 - (-8,368.5)) = 10
χ2(.05, k) for df = 2 (because we dropped age, income and education and added age*income*education to the analysis) → 5.991
Thus LR > χ2, signifying that the more complicated model, i.e. the NBD Model on the original dataset, is better than the model on the new dataset.
Conclusion on adding the Interaction
Effects:
We
thus observed that adding the interaction effects and removing the individual
variables did not result in performance enhancement of the models.
Conclusion drawn from using the Model Enhancement Techniques:
We observed a trend that a model's performance on a dataset with more attributes is better than its performance on a dataset with fewer attributes. As the dataset does not have a large number of attributes, and none of the attributes is redundant (each provides new information about the customers), removing any of them is detrimental to the predictive power of the models. Techniques like attribute reduction and adding interaction effects did not produce the expected performance enhancement, probably because their benefits were overshadowed by the loss of predictive power caused by removing attributes.
Part III: Why certain customers prefer Amazon over
BN
12. Logistic Regression:
We executed the logistic regression and the results are reported below.

data Permdata_q12;
  set abi.Permdata;
  /* we create another column BN which is set to 1 for all customers who have bought from BN and 0 otherwise */
  if total_count > 0 then BN = 1;
  else BN = 0;
run;

proc logistic data = Permdata_q12;
  class Region child race;
  model BN = education region hhsz age income child race country / expb;
run;
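If one wanted to turn the fitted model into predicted purchase probabilities for new customers, a SCORE statement could be added to PROC LOGISTIC; a minimal sketch (the dataset names new_customers and scored_customers are ours):

proc logistic data = Permdata_q12;
  class Region child race;
  model BN = education region hhsz age income child race country / expb;
  /* score a hypothetical hold-out file of prospects with the fitted model */
  score data = work.new_customers out = work.scored_customers;
run;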
From the results, we observed that region is the only independent variable that is significant in the regression.
Managerial Takeaways:
a. Living in region 1 versus region 4 changes the odds of purchasing from Barnes and Noble (versus not purchasing from Barnes and Noble) by a factor of 0.715, i.e. it decreases them.
b. Living in region 2 versus region 4 changes the odds of purchasing from Barnes and Noble (versus not purchasing from Barnes and Noble) by a factor of 0.905, i.e. it slightly decreases them.
c. Living in region 3 versus region 4 changes the odds of purchasing from Barnes and Noble (versus not purchasing from Barnes and Noble) by a factor of 1.018, i.e. it slightly increases them.
Part IV: Summary
13. Managerial Insights
We used the NBD model results to calculate the 'Reach', i.e. the proportion of customers who bought books from BN at least once in 2007, and the 'Average Frequency', i.e. the average number of books a Barnes and Noble buyer bought in 2007, which were 18.97% and 3.946 respectively. We thus concluded that both the reach and the frequency are quite low for BN, and that it needs to strengthen its marketing efforts in order to boost sales.
On
studying and analyzing the data using various models, we formulated the
following insights which can be used to make marketing strategies:
i) We observed that the effect of 'Race' is the most pronounced in determining lambda: a change in the customer's race (from 1 to 5) causes around a 20% decrease in his/her purchases from BN. Hence BN should concentrate its customer engagement strategies on customers of Race 1 and 2. Churn of customers from Race 1 and 2 can prove detrimental to sales, as a customer from Race 2 would buy about 20% fewer books from BN than one from Race 1, all other factors held constant, Race 3 about 20% fewer than Race 2, and so on. Hence it would be profitable to retain and attract customers of Race 1 and 2, as such customers tend to buy more books from BN. BN can target the community centers of Race 1 and 2 for its promotional activities and can also run discount offers for the festivals of these 2 races.
ii) We also observed that an increase in a customer's education by 1 level causes roughly an 8% decrease in their total purchases from BN. This might be because a more educated population has more access to the internet and thus may prefer online reading. We recommend that BN capitalize more on its slightly less educated customer base by stocking more books of their choice.
iii) We observed from the Logistic Regression analysis that the odds of purchasing from BN (versus not purchasing from BN) are highest in Region 3 of the 4 regions. In order to maintain this cash cow, BN should keep its stores in Region 3 well stocked. Customers in Region 1 have the lowest odds of purchasing from BN (versus not purchasing from BN), and thus BN should focus its customer attraction and retention strategies on Region 1.
iv) When we added the 3 additional attributes to the analysis, we observed that the Loyalty-to-BN and Season attributes were significant in determining the total purchases from BN. A customer who buys only from BN and not from Amazon tends to purchase more books than a customer who buys from both BN and Amazon, so BN can introduce loyalty points programs in order to engage this loyal customer base. We also observed that total purchases are highest in Season = 1, i.e. Winter, which is expected, it being the holiday season. BN can run a 'Winter Special Sale' to further boost sales, and can also offer discounts in Fall to promote sales, as Fall is the least favorable season for book purchases.
v) An increase in the age of a customer by 1 level causes around a 2.5% increase in their total purchases from BN, which is expected, as we often observe that older people tend to prefer books to online reading. BN can retain this customer base by marketing the kinds of books preferred by these age groups.
vi) We also observed from the variable clusters that the variables Region, Race and Country are the most important in determining total purchases from BN. Combining this analysis with the results of the regression models, we can create a portrait of the ideal BN customer, the one who would buy the maximum number of books from BN: a person from the US, in Region 1 and of Race 1. BN can implement targeted marketing toward this kind of customer, who is more likely to spend on BN products.
Learnings:
This project gave us an opportunity to work with a real-life dataset. We built customized BA models by coding in Base SAS and determined the business implications of the results of each of our analyses, thus extracting business intelligence from raw data.
We gained immense learning from working on the project, which can be categorized as follows:
a)
Exploration of the raw data, performing various
preprocessing techniques and building the customized models gave us substantial
experience in coding in Base SAS.
b) The regression models strengthened our grasp of statistics by increasing our understanding of each statistical parameter and its effect on the business implications.
c) We utilized our learning from Project 1 in the variable selection part. We deliberated on the suitability of various techniques for attribute reduction and brainstormed on the significance of each variable in determining the target variable and whether it provides any additional information. This exercise made us appreciate the need to thoroughly understand the data before applying any kind of analysis to it.
d) We tried many techniques to improve the performance of the models and saw, in line with the 'No Free Lunch' theorem, that no single technique universally enhances model performance. How well a model performs and which techniques enhance its performance depend on the nature of the data. Thus, in our project some enhancement techniques gave the expected results, whereas others did not.
Lastly, this project gave us an opportunity to consider the 'business' aspect of Business Analytics. Instead of mechanically building models, we tried to understand the data first, deliberated on the questions we were trying to answer through our analysis, built the models, and determined the effect on the business of each new piece of information found in the analysis.