Executive Summary
For this project, we implemented the entire SEMMA process in SAS Enterprise Miner. The dataset used in this project is from Expedia, a travel company that facilitates booking flights, hotels, rental transportation, and other services for travelers. The objective of our analysis was to determine whether a user will complete the booking process in the remainder of their session on the website, based on a variety of information about user demographics, browsing habits, time spent on the site, browsing sessions, previous bookings, and so on.
While creating the data source, we defined a cost-based confusion matrix to be used for evaluating models. We explored the data extensively and applied various pre-processing techniques to it. We then trained several models on the data. We also tried to enhance model performance by reducing the number of attributes, sampling the data, and using Bagging and Ensemble techniques. Finally, we compared all the models and scored a new data set using the champion models.
We have followed the entire SEMMA process in the
project:-
·
Sample - we defined the data source, assigned roles and measurement levels, and partitioned the data into training data and validation data. We also sampled the data set for the 2nd process flow to balance the number of 1 and 0 values of the Target Variable. We used the File Import node to convert the xls file to a SAS data set for the X12 pre-processing.
·
Explore – we generated descriptive statistics of the data using the StatExplore node and used the Variable Selection node to reduce the number of input variables.
·
Modify – we used the Impute node to handle missing values, the Drop node to drop attributes that may not be significant for further analysis, the Transform Variables node to transform variables and improve the fit of the models to the data, and the Principal Components node to perform PCA.
·
Model – we used the analytical tools to train statistical and machine learning models to reliably predict the desired outcome. For this, we used the logistic regression, decision tree, ensemble, and neural network nodes.
·
Assess – we evaluated the usefulness and reliability of the findings from the data mining process by comparing models with the Model Comparison node and performed score code management using the Score node.
Under Utility, we used the Start Groups and End Groups nodes to run Bagging, and the SAS Code node to run SAS code that displays the predicted target values of the score data.
In this project, we created 2 process flows: one for the entire data set and the other for the sampled data set. In the report, we present a detailed snapshot of each of the 2 process flows, along with explanations of and reasons for the various steps taken in each flow.
We have also included a section on our learnings and the overall experience of working on the project at the end of the report.
Setting Up The
Project
a)
We created a new project, a new library where the data sets
to be used are stored and a new diagram.
b)
For creating the data source, we selected the data set Expediatrain from the library and studied the data distribution of each variable to set their levels to appropriate values.
c)
We selected the Advanced option in the Data Source Wizard – Metadata Advisor Options and changed the roles of X32 and X38 from Rejected to Input and the role of depend to Target. We also changed the Level of X3, X35, X5, and X7 from Nominal to Interval.
d)
We then selected the Yes option button to indicate that we want to build models based on the values of decisions. On the Decision Weights tab, we created the cost-based confusion matrix, with the cost of misclassifying 1 as 0 being 5 and the cost of misclassifying 0 as 1 being 1. We selected the Minimize button, indicating that we want to minimize the loss in the analysis (see the sketch after this list for how these weights translate into an average loss).
e)
In the Data Source Wizard – Create Sample window, we selected No for creating a sample data set, as we wished to use the entire data set. We then set the Role of the data source to Raw and clicked Finish.
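For reference, the sketch below (plain Python, outside SAS Enterprise Miner) shows how the decision weights above translate into an average-loss figure for a set of predictions; the 5 and 1 costs are from the report, while the toy labels are made up for illustration.

    import numpy as np

    # Rows = actual class, columns = predicted class, in the order [0, 1].
    # Misclassifying an actual 1 as 0 costs 5; an actual 0 as 1 costs 1.
    cost_matrix = np.array([[0.0, 1.0],
                            [5.0, 0.0]])

    def average_loss(y_true, y_pred):
        """Average misclassification cost, the quantity minimized when the
        Minimize option is selected on the Decision Weights tab."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        return cost_matrix[y_true, y_pred].mean()

    # Toy usage: two costly false negatives and one cheap false positive.
    print(average_loss([1, 1, 0, 0, 1], [0, 0, 1, 0, 1]))  # (5 + 5 + 1) / 5 = 2.2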
Part I.
Basic Data Preprocessing
1. Use
dataset expediatrain.sas7bdat. Provide a brief summary of the variables in the
data set (you can choose to use StatExplore for that purpose).
We added a StatExplore node to the data source – em_train_trainlatest.sas7bdat – to study the statistical summary of the input data. The StatExplore node is used to analyze variable statistics.
By default, the StatExplore node creates chi-square statistics and correlation statistics. A chi-square (χ²) statistic is used to investigate whether the distributions of categorical variables differ from one another, and it is therefore computed for categorical variables only.
To obtain chi-square statistics for the interval variables in addition to the class variables, the interval variables need to be binned. Hence, before running this node, we set the Interval Variables option in the Chi-Square Statistics properties group in SAS Enterprise Miner to ‘Yes’, so that the interval variables are distributed into five (by default) bins and chi-square statistics are computed for the binned variables when the node is run.
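As a rough illustration of this binning step, the Python sketch below bins a hypothetical interval input (here called x9) into five bins and computes a chi-square statistic against a binary target named depend; it is a stand-alone analogue, not the StatExplore computation itself.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    # Hypothetical stand-ins: a skewed interval input and a binary target.
    df = pd.DataFrame({
        "x9": rng.gamma(2.0, 3.0, size=1000),
        "depend": rng.integers(0, 2, size=1000),
    })

    # Cut the interval variable into 5 bins, then compute a chi-square
    # statistic of the binned variable against the target.
    binned = pd.cut(df["x9"], bins=5)
    table = pd.crosstab(binned, df["depend"])
    table = table.loc[table.sum(axis=1) > 0]        # drop any empty bins
    chi2, p_value, dof, _ = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")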
We
executed the StatExplore node.
The Result window displays the following:-
a)
The Variable Worth plot orders the variables
by their worth in predicting the target variable based on the Gini split worth
statistic.
b)
The SAS output window provides distribution and summary statistics
for the class and interval inputs, including summaries that are relative to the
target. The Interval Variables Summary Statistics section and the Class Variable Summary
Statistics sections of the output has the Non-Missing column and the Missing column which list the number of
observations that have valid values and the number of observations that have
missing values for each interval variable respectively.
c)
The Chi-square plot orders the top 20
variables by their chi-square statistics.
We observed that all the variables have missing
values. There are 8 class variables including the target variable. All the
remaining variables are interval. We also observed that quite a few variables
have a skewed distribution.
2.
Explore the statistical properties of the variables in
the input data set. The results that are generated in this step will give you
an idea of which variables are most useful in predicting the target response.
Unless you see anything interesting, no need to report the details of this
step.
The Variable Worth plot gives us a good idea of the relative worth of the input variables in predicting the target variable.
We observe that X33 has the highest worth, followed by x11, x30, x9, and so on.
3.
Check the Class Variable Summary Statistics and the
Interval Variable Summary Statistics sections of the output.
a.
Are there any missing values for any of the variables?
Use imputation to fill in all missing data (describe how you did imputation in
the report).
By observing the Class Variable Summary Statistics and the Interval Variable Summary Statistics sections of the output, we saw that all Class as well as Interval variables have missing values.
As observed above, all variables have a substantial proportion of missing values. For Decision Trees, missing values do not pose any problem, as the surrogate splitting rule enables the node to use the values of other input variables to perform splits for observations with missing values.
However, models such as Regression and Neural Network ignore records that contain missing values altogether, which would substantially reduce the size of the training data set in our case and, in turn, reduce the predictive power of these models. Hence there is a need to impute missing values before using the Regression and Neural Network models.
As Decision
Tree nodes can handle missing values themselves, we decided to impute the missing
values in the data set using the Impute node only before using Logistic
Regression and Neural Network models. Imputing the missing values before
fitting these models is essential also because we would be comparing these
models with Decision Trees. Model comparisons are more appropriate between
models that are fit with the same set of observations.
4.
Partition dataset expediatrain.sas7bdat. Use 55% of
the data for training and 45% for validation.
The Data
Partition node was added by setting the training
and validation property in Data Set Allocation to 0.55 and 0.45
respectively.
The below figure shows the Partition summary of the
dataset.
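For comparison, a minimal Python sketch of an equivalent 55/45 split is shown below; the data frame and column names are stand-ins (the real table would come from expediatrain.sas7bdat), and the split is stratified on the target, which the Data Partition node also does by default for class targets.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Stand-in for the Expedia training table; in practice it could be read
    # with pd.read_sas("expediatrain.sas7bdat") (path assumed).
    rng = np.random.default_rng(0)
    expedia = pd.DataFrame({"x9": rng.normal(size=1010),
                            "depend": rng.choice([0, 1], size=1010, p=[0.87, 0.13])})

    # 55% training / 45% validation, stratified on the target.
    train, valid = train_test_split(expedia, train_size=0.55, test_size=0.45,
                                    stratify=expedia["depend"], random_state=42)
    print(len(train), len(valid))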
Part II. Building
Decision Trees
Decision trees are a simple but powerful form of multiple variable analysis. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments, breaking the data set into smaller and smaller subsets until a termination criterion is reached. These segments form an inverted decision tree that originates with a root node at the top of the tree. The final result is a tree with decision nodes and leaf nodes. Each leaf node represents a class assignment, and each decision node represents a test on a particular attribute that further splits the tree into branches.
Optimal
Decision Tree
1.
Enable SAS Enterprise Miner to automatically train a
full decision tree and to automatically prune the tree to an optimal size. When
training the tree, you select split rules at each step to maximize the split
decision logworth. Split decision logworth is a statistic that measures the
effectiveness of a particular split decision at differentiating values of the
target variable. For more information about logworth, see SAS Enterprise Miner
Help. Report the results.
SAS automatically trains a full decision tree and prunes the tree to an optimal size. When training the tree, SAS selects split rules at each step so as to maximize the split decision logworth.
After verifying the input data, we followed the steps below to model the input data using nonparametric decision trees. After the Data Partition node, we added the Decision Tree node, which enables us to perform multi-way splitting of the data based on nominal, ordinal, and continuous variables. The SAS implementation of decision trees is a hybrid of the best of the CHAID, CART, and C4.5 algorithms. When the Tree node is run in Automatic mode, SAS automatically ranks the input variables by the strength of their contribution to the tree. This ranking can be used to select variables for use in subsequent modeling.
In the Decision Tree node, in the Properties Panel, under Train
properties, we made the below settings:
a) Maximum Depth splitting rule property was set to 6. This specification enables SAS
Enterprise Miner to train a tree that includes up to six generations of the
root node.
b) Leaf Size node
property was set to 5. This specification constrains the minimum number of
training observations in any leaf to five.
c)
Number of Surrogate Rules node property was set to 4. This specification enables SAS
Enterprise Miner to use up to four surrogate rules in each non-leaf node if the main splitting rule
relies on an input whose value is missing.
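A rough scikit-learn analogue of these Decision Tree settings is sketched below on synthetic data; Maximum Depth maps to max_depth and Leaf Size to min_samples_leaf, while surrogate rules have no direct equivalent there, so missing values are assumed to be handled upstream.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))                            # stand-ins for the inputs
    y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)     # stand-in for depend

    # Maximum Depth = 6 -> max_depth, Leaf Size = 5 -> min_samples_leaf;
    # surrogate splitting rules have no direct scikit-learn equivalent.
    tree = DecisionTreeClassifier(max_depth=6, min_samples_leaf=5, random_state=0)
    tree.fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())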
After making these changes we
executed the decision tree node. The results of the tree are displayed below:
The optimal tree created is shown
below:
The above Decision Tree shows some
unusual behavior of the variable X12, which provides a pure split when its
value is 1. On further studying the data set, we observed that whenever the
value of X12 is 1, the value of the target variable is always 0.
The variable X12 indicates whether a user has booked
at this site up to this point in the current session. If a user has already
booked at this site up to this point in the current session i.e. when X12 is 1,
he/she is definitely not going to book in the remainder of the session, i.e.
the Target variable would be 0.
Hence we decided to do some pre-processing for X12.
X12
Pre-processing
Due to this anomalous behavior of X12, we decided to drop it from the analysis. However, before dropping it, we needed to change the value of the target variable from 0 to 1 wherever X12 is 1. This is because X12 = 1 indicates that the user has already booked in this session, which can be equated to the Target variable being 1, i.e., the user books during the session. Hence all factors that influence the Target Variable to be equal to 1 would be influencing X12 to be 1 as well.
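A minimal Python sketch of this relabel-and-drop step is shown below; the column names x12 and depend are taken from the report, and the tiny data frame is purely illustrative.

    import pandas as pd

    # Hypothetical columns named after the report: x12 (already booked in
    # this session) and depend (will book in the remainder of the session).
    df = pd.DataFrame({"x12": [1, 0, 1, 0], "depend": [0, 0, 1, 1]})

    # Recode the target to 1 wherever x12 == 1, then drop x12 from the inputs.
    df.loc[df["x12"] == 1, "depend"] = 1
    df = df.drop(columns=["x12"])
    print(df["depend"].value_counts().to_dict())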
To do the above mentioned pre-processing we followed
the following steps:-
a) We exported the SAS data source to an Excel file using the Save Data node.
b) In the Excel sheet, we changed the target variable value from 0 to 1 wherever X12 is set to 1. The proportion of 0s and 1s in the dataset changed from 877:133 to 740:270.
c) After altering the Target variable, we converted the Excel file back to a SAS dataset using the File Import and Save Data nodes and used it for our analysis.
d) We set up
the project using this new data set with the above mentioned settings:-
i.
We
set the role of x12 from Input to Reject, x32 and x38 from Rejected to Input
and the role of depend to Target. We also changed the levels of x3, x5, x7 and
x35 from Nominal to Interval.
ii.
On the Decision Weights tab, we created the cost-based confusion matrix, with the cost of misclassifying 1 as 0 being 5 and the cost of misclassifying 0 as 1 being 1. We selected the Minimize button, indicating that we want to minimize the loss in the analysis.
e) The Data Partition node was added by setting the training and validation property in Data
Set Allocation to 0.55 and 0.45 respectively.
A Decision Tree node was
added, with the following settings in the Properties Panel, under Train
properties:
a)
Maximum Depth splitting rule property was set
to 6. This specification enables SAS Enterprise Miner to train a tree that
includes up to six generations of the root node.
b)
Leaf Size node property was set to 5. This specification constrains the
minimum number of training observations in any leaf to five.
c) Number of Surrogate Rules node
property was set to 4. This specification enables SAS Enterprise Miner to use
up to four surrogate rules
in each non-leaf node if the main splitting rule relies on an input whose value
is missing.
After making these changes we
executed the decision tree node. The results of the tree are displayed below:
The optimal tree that is formed splits first on the attribute bookgc, which has the highest logworth. The second split is on the attribute booksh, and the last split happens at node 5 on the attribute booksc. Split decision logworth is a statistic that measures the effectiveness of a particular split decision at differentiating values of the target variable. The leaf nodes thus formed contain purer and purer subsets as we go down the levels of the tree. The object of analysis, bookfut, is reflected in the root node as a simple, one-dimensional display in the decision tree.
Each leaf node contains the following information:
·
node number
·
number of training observations in the node
·
percentage of training observations in the node with bookfut = 1 (user will book), adjusted for prior probabilities
·
percentage of training observations in the node with bookfut = 0 (user will not book), adjusted for prior probabilities
The subtree assessment plot for the Misclassification Rate showed that the error for both the training and validation data sets displays a downward trend initially. After the initial reduction, it stabilizes after the 3rd node in both data sets. After the 4th leaf node, the misclassification rate plummets in the training set, whereas its rate of decrease is much more gradual for the validation set.
The subtree assessment plot for the average
square error rate shows that the error displays a steady downward trend in
the training dataset. However, in the Validation set, after an initial
reduction the average square error increases, indicating over-fitting of data
with increase in the number of leaves.
Interactive
Decision Tree
2.
Then, interactively train a decision tree. At each
step, you select from a list of candidate rules to define the split rule that
you deem to be the best. Report the results.
In this step, we created the Decision Tree interactively, at each step selecting from a list of candidate rules the split rule that seemed to be the best, based on -log(p).
We thus added another Decision Tree node to our workflow for interactive decision tree building. Using interactive training, we can override any automatic step by defining a splitting rule or by pruning a node or subtree.
We added split points and edited the
splitting condition as needed. The tree was split as the different conditions
were entered.
We selected the Interactive
ellipsis from the Decision tree properties panel and in the interactive
decision tree window - split the root node into further sub nodes. After the
first split, the tree now had two additional nodes.
After selecting the lower left node (bookgc) in a new Split Node, we selected minutshc, which ranked second in logworth.
After selecting the lower right node (minutesh) in a new Split Node, we selected hitshc and manually split the node using this rule. We observed from the screenshot below that we were able to create a pure subset (node id 8) based on this attribute.
After further continuing this method
to split a few nodes, we trained the tree using Train Node option and the tree formed is shown below:
We observed that the Misclassification Rate is steadily
decreasing for both Training and Validation data till leaf 3. For the
subsequent leaves the Misclassification Rate decreases at a much higher rate
for Training Data compared to the Validation Data, indicating a better performance
of the interactive tree model with respect to the training data compared to the
validation data.
We observed that the average square
error for the Training dataset has a downward trend. However, after the 10th
leaf, the average square error for the Validation dataset increases, indicating
over-fitting of data after the 10th leaf.
Due to this overfitting issue, the predictions might not generalize to new data outside the analysis dataset we have.
Part III. Building
Neural Networks and a Regression Model
Impute
Using the StatExplore node, we observed that many of the input variables have missing values.
Regression and Neural Networks ignore records containing missing values. Therefore, it is recommended to impute the missing values before performing Regression or Neural Network modeling.
The missing cutoff value for the Impute node is set to 26%, which rejects x32 and x38. For class variables, Tree Surrogate is used as the Default Input Method, so that Enterprise Miner predicts each missing value by building a decision tree with that (missing) variable as the target and the other (non-missing) input variables as predictors. For interval variables, Median is selected as the Default Input Method, so that missing values are replaced by the median of the non-missing values.
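The sketch below illustrates the two imputation strategies on synthetic data: median replacement for an interval input and a decision-tree-based fill for a class input as a rough analogue of the Tree Surrogate method; the column names are hypothetical.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "x9": rng.gamma(2.0, 3.0, size=200),          # interval input
        "x33": rng.normal(size=200),                  # interval input
        "x2": rng.choice(["a", "b", "c"], size=200),  # class input
    })
    df.loc[rng.random(200) < 0.15, "x9"] = np.nan     # inject missing values
    df.loc[rng.random(200) < 0.15, "x2"] = np.nan

    # Interval variables: replace missing values with the median.
    df["x9"] = df["x9"].fillna(df["x9"].median())

    # Class variables: a rough analogue of the Tree Surrogate method -- fit a
    # decision tree that predicts the class variable from the other inputs
    # and use it to fill the gaps.
    known = df["x2"].notna()
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    clf.fit(df.loc[known, ["x9", "x33"]], df.loc[known, "x2"])
    df.loc[~known, "x2"] = clf.predict(df.loc[~known, ["x9", "x33"]])
    print(df.isna().sum())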
Variable
Transformations
1.
Transform input variables to make the usual
assumptions of regression more appropriate for the input data. Explain the
transformations you did.
We added a StatExplore node (with the previously
mentioned settings) to explore the imputed data.
From the Interval Variable Summary Statistics in the
Results - Output, we calculated the Coefficient of Variation and considered all
variables with coefficient of Variation of more than 0.85 as having high
variation.
We found that x7, x9, x13, x14, x16, x18, x19, x22, x25, x26, x28, x29, x31, x34 and x36 have high variance. As high skewness tends to inflate the standard deviation, one way to reduce the variance is to reduce the skewness of the above-mentioned variables.
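The coefficient-of-variation screen can be expressed as in the short sketch below; the two synthetic columns simply stand in for the interval inputs, and the 0.85 threshold is the one used in the report.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # Synthetic stand-ins; the real means and standard deviations come from
    # the Interval Variable Summary Statistics of StatExplore.
    df = pd.DataFrame({"x9": rng.gamma(1.0, 5.0, 1000),
                       "x33": rng.normal(10.0, 2.0, 1000)})

    cv = df.std() / df.mean()                        # coefficient of variation
    high_variation = cv[cv > 0.85].index.tolist()    # 0.85 threshold from the report
    print(cv.round(3).to_dict(), high_variation)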
We thus added a Transform Variables node:-
We clicked on the Formulas ellipses
in the Train properties of the node to observe the data distribution of the
input variables.
As expected, we observed that the above-mentioned variables have a highly skewed distribution. We also observed that the distributions of a few other variables are heavily skewed – x10, x11, x17, x35, x39 and x40.
As Regression and Neural Network models do not work well with skewed data, we selected Log10 as the transformation Method in the Variables – Trans window for all these variables, in order to improve the fit of the models to the data. Log transformations are used to control skewness.
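The sketch below shows the effect of such a log10 transformation on a synthetic right-skewed input; the +1 offset guards against zeros in this illustration and is an assumption of the sketch, not a setting taken from the Transform Variables node.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x9": rng.gamma(1.0, 5.0, 1000)})   # right-skewed stand-in

    # Log10 transformation to reduce skewness; the +1 offset guards against
    # zero values and is an assumption of this sketch.
    df["log_x9"] = np.log10(df["x9"] + 1)
    print(round(df["x9"].skew(), 2), round(df["log_x9"].skew(), 2))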
Logistic
Regression
2.
Model the input data using logistic regression. Report the results.
a)
Logistic
Regression without Transformation:-
We added
a Regression node to the Impute Node:-
The Regression node automatically performed logistic regression, as the target variable in this data set is binary. We selected Stepwise as the Selection Model property. Stepwise selection begins with no candidate effects in the model, adds effects that meet the entry significance level, and can also remove effects already in the model that no longer meet the stay significance level; selection stops when the Stay Significance Level or the Stop Criterion is met.
We then executed the node.
We observed the output window:-
In the Results – Output, we
observed the Odds ratio estimate associated with the input variables. From the
Lift graph we also observed the effectiveness of the Logistic Regression model
with respect to the Training and Validation data set and found the
effectiveness to be comparable.
We observed the Mean Expected Loss
in more detail:-
For this data set, we had defined
the cost matrix. Thus the expected loss is a function of both the probability
of a user booking or not booking in the given site and the estimated cost
associated with each corresponding outcome. A value is computed for each
decision by multiplying estimated loss values from the decision matrix with the
classification probabilities. The decision with the lowest value is selected,
and the value of that selected decision for each observation is used to compute
the loss measures.
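The sketch below walks through this expected-loss decision rule on a few hypothetical posterior probabilities, using the 5:1 loss matrix defined for this project.

    import numpy as np

    # Posterior probabilities for classes [0, 1] for three observations
    # (hypothetical values, for illustration only).
    posteriors = np.array([[0.90, 0.10],
                           [0.70, 0.30],
                           [0.40, 0.60]])

    # Loss matrix: rows = actual class, columns = decision, order [0, 1].
    loss = np.array([[0.0, 1.0],
                     [5.0, 0.0]])

    # Expected loss of each decision = posteriors @ loss; the decision with
    # the lowest expected loss is chosen, as described above.
    expected_loss = posteriors @ loss
    decisions = expected_loss.argmin(axis=1)
    print(expected_loss)
    print(decisions)   # with a 5:1 cost ratio even P(1) = 0.30 yields decision 1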
b)
Logistic
Regression with Transformation:-
We
added a Regression node to the Transform Variables Node:-
The
Regression node automatically performed logistic regression as the target
variable in this data set is a binary variable. As in the above case, we
selected Stepwise as the Selection Model property.
The node is executed.
We observed the output window:-
In the Results – Output, we observed the odds ratio estimates associated with the input variables. From this Lift graph too, we can deduce that the effectiveness of the Logistic Regression model is comparable for the Training and Validation data sets.
We observed the Cumulative Expected Loss in more
detail:-
As explained above, the expected loss is a function of
both the probability of a user booking or not booking in the given site and the
estimated cost associated with each corresponding outcome. A value is computed
for each decision by multiplying estimated loss values from the decision matrix
with the classification probabilities. The decision with the lowest value is
selected, and the value of that selected decision for each observation is used
to compute the loss measures.
Comparison
between Logistic Regression with Transformation and without Transformation:-
From the Cumulative Expected Loss graphs, we observed that the performance of the model increased after transforming the data. The cumulative loss for both the Training and Validation data is lower for the Logistic Regression model with Transformation than for the one without Transformation.
Neural
Network
3. Model the
input data using neural networks, which are more flexible than logistic
regression (and more complicated). Report the results.
Neural networks are a class of parametric models that
can accommodate a wider variety of nonlinear relationships between a set of
predictors and a target variable than can logistic regression.
Building a neural network model involves two main
phases.
• Definition of the network configuration
• Iteratively training the model
a)
Neural
Network without Transformation:-
We added the Neural Network node to the Impute Node:-
We set the Hidden Units property to Yes to create hidden unit variables in our scoring data. We also set the Standardization property to Yes to create standardization variables in our scoring data.
In the Neural Network node, in the Properties Panel,
under the Train properties, we selected the ellipses that represent the value
of Network:-
·
Direct Connection is set to Yes. This allows the network to have connections directly between the inputs and the outputs, in addition to connections through the hidden units.
·
The Number of Hidden Units is set to 26.
Thus, it trains a multilayer perceptron neural network with 26 units on the
hidden layer.
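As a rough analogue of this configuration, the sketch below fits a scikit-learn multilayer perceptron with 26 hidden units on standardized synthetic inputs; MLPClassifier has no equivalent of the Direct Connection (skip-layer) option, so that setting is omitted.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))                       # stand-ins for the inputs
    y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

    # Standardized inputs feeding one hidden layer of 26 units, mirroring
    # the Number of Hidden Units setting; Direct Connection has no
    # counterpart in MLPClassifier and is omitted.
    net = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(26,), max_iter=2000,
                                      random_state=0))
    net.fit(X, y)
    print(net.predict_proba(X[:3]).round(3))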
After executing the Neural Network node, the Score Rankings Overlay in the Results window shows that the Cumulative Expected Loss decreases after a depth of 35% in both datasets.
For the Neural Networks model too, the expected loss
is a function of both the probability of a user booking or not booking in the
given site and the estimated cost associated with each corresponding outcome as
defined in the cost matrix.
The classification table shows the frequency of the
misclassification errors for both Training and Validation data.
b)
Neural
Network with Transformation:-
We added the Neural Network node to the Transform
Variables Node:-
We set the Hidden Units property to Yes to create hidden unit variables in our scoring data. We also set the Standardization property to Yes to create standardization variables in our scoring data.
In the Neural Network node, in the Properties Panel,
under the Train properties, we selected the ellipses that represent the value
of Network:-
• Direct Connection is set to Yes. This allows the network to have connections directly between the inputs and the outputs, in addition to connections through the hidden units.
• The Number of Hidden Units
is set to 26. Thus it trains a multilayer perceptron neural network with 26
units on the hidden layer.
After executing the Neural Network node, the Score Rankings Overlay in the Results window shows that the Cumulative Expected Loss decreases after a depth of 45% in both datasets.
As explained above, the expected loss is a function of
both the probability of a user booking or not booking in the given site and the
estimated cost associated with each corresponding outcome as defined in the
cost matrix.
The classification table shows the frequency of the
misclassification errors for both Training and Validation data.
Comparison between
Neural Network with Transformation and without Transformation:-
From the Classification Table, we observed that the performance of the model increased after transforming the data. The number of errors misclassifying 1 as 0 decreased from 47 to 41 in the Validation data when we transformed the data. Consequently, the number of records correctly predicted as 1 increased from 75 to 81.
Part IV. Model
Comparison and Champion Model Evaluation
Model Comparison
1.
Compare the above four models you tried, and select a
champion model. When evaluating the model performance, try to use confusion
matrix as the main evaluation criterion. And let’s use a cost 5 for
misclassifying 1 as 0, and a cost of 1 for misclassifying 0 as 1.
Since we now have many candidate models used for
predicting whether the individual will book or not in the remainder of his
current session, these models can be compared to determine a champion model
that will be used to score new data. The Model Comparison node is used to
compare the models that we have built so far.
We have compared models separately for sampled and
unsampled data sets. After adding the Model Comparison Node, the updated
workflow looks like the below screenshot:
Non-sampled workflow:-
Sampled workflow:-
We had created a cost-based confusion matrix while creating the data source:
Since this cost matrix is used by the Model Comparison node for comparing the models, the Selection Criterion is automatically set to Average Loss for depend for the Validation data set.
After the Node was executed, the fit statistics window
shows that the champion model selected by the Model Comparison node using non-sampled
data is the ‘Neural Network after transforming skewed variables’.
The champion model selected by the model comparison
node after sampling data is ‘Neural Network after transforming skewed variables
and applying subjective variable selection’.
Scoring New
Data
2.
Score the new
evaluation dataset -- expediaevaluation.sas7bdat -- using the champion model.
A new ExpediaEvaluation data source is created. While creating the data source, the role of the evaluation data set is set to Score. We connected the Model Comparison node and the score data source to the Score node in the workflow and executed the Score node to score the new test data using the selected champion model. We then used the SAS Code node to print the predicted depend for the ExpediaEvaluation dataset. The relevant part of the workflow is given below:
We used the following SAS Code to print the predicted
depend for ExpediaEvaluation dataset:
The ExpediaEvaluation dataset has 2111 records. For both the sampled and unsampled data, the champion model predicted 284 records as 1 and the rest as 0.
A sample screenshot of the display is given below:-
Part V.
Improve Your Model Performance
Imputation
1.
Should we do imputation
on all variables with missing data? And use imputed data for all classifiers?
Data mining databases often contain observations that
have missing values for one or more variables. Missing values can result from
data collection errors, incomplete customer responses, actual system and
measurement failures, or from a revision of the data collection scope over
time, such as tracking new variables that were not included in the previous
data collection schema. If an observation contains a missing value, then by
default that observation is not used for modeling by nodes such as Neural Network,
or Regression. However, rejecting all incomplete observations may ignore useful
or important information which is still contained in the non-missing variables.
Rejecting all incomplete observations may also bias the sample, since observations that have missing values may have other things in common as well.
The Impute node is used to replace missing values in data sets that are used for data mining. The median is less sensitive to extreme values than the mean or midrange. Therefore, we used the Impute node to replace missing interval values with the median, which is more suitable for our dataset given its number of skewed variables. We used the Tree Surrogate rule to replace missing class values.
While imputing
missing values ensures that useful information is not lost, replacing missing
values can greatly affect a variable's sample distribution. Imputing missing
values with mean, median or other specified values may often lead to creation
of a data distribution that does not accurately represent the original data
distribution of the given variable, especially if the proportion of the missing
values in a variable is substantial. Thus, it is often appropriate to reject
variables having missing values over a threshold.
We decided on a threshold of 26%, or 263 missing values (out of 1010 total observations), for a variable to be rejected. On that basis, we set the Missing Cutoff of the Impute node to 26.0. This setting rejects the variables that have more than 263 missing values. We observed that x38 among the class variables and x32 among the interval variables are rejected.
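The cutoff rule amounts to the small check sketched below on synthetic columns; the missing-value rates are invented for illustration, while the 26% threshold is the one used in the report.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(1010, 3)), columns=["x9", "x32", "x38"])
    df.loc[rng.random(1010) < 0.40, "x32"] = np.nan   # illustrative missing rates
    df.loc[rng.random(1010) < 0.30, "x38"] = np.nan
    df.loc[rng.random(1010) < 0.10, "x9"] = np.nan

    # Reject any variable whose missing proportion exceeds the 26% cutoff,
    # mirroring the Impute node's Missing Cutoff property.
    missing_share = df.isna().mean()
    rejected = missing_share[missing_share > 0.26].index.tolist()
    print(missing_share.round(3).to_dict())
    print("rejected:", rejected)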
Decision Trees, in contrast, can handle missing values themselves through various mechanisms. SAS EM gives us the option of specifying the maximum number of surrogate rules that the Decision Tree node seeks in each non-leaf node. The first surrogate rule is used when the main splitting rule relies on an input whose value is missing. We have specified the Number of Surrogate Rules node property
to be 4. This specification enables SAS Enterprise Miner to use up to four
surrogate rules in each non-leaf node if the main splitting rule relies on an
input whose value is missing. The Interactive decision tree assigns missing values
to the branch that maximizes purity or logworth. Hence we do not need to impute
or replace missing values for Decision Trees. However as mentioned above,
models like Regression and Neural Networks need imputation of missing values.
Hence we have added the Impute node only for the Regression and Neural Network
model and not for the Decision Tree in our process flow.
Variable
Transformation
2.
Skewed data? Data with
high variance?
A dataset for modelling is perfectly balanced when the percentage of occurrence of each class is 100/n, where n is the number of classes. If one or more classes differ significantly from the others, the dataset is called skewed or unbalanced. Classification methods are generally not designed to cope with skewed data, so various actions have to be taken when dealing with imbalanced data sets. Skewed data therefore needs to be transformed in order to improve the fit of the model.
We decided to check for skewness and variance only after imputing the missing values, as imputing missing values with other values such as the median may change the data distribution of the variables. We thus first imputed the missing values, checked for skewness and variance, transformed the variables as needed, and then applied the Regression and Neural Network models. As Decision Trees are insensitive to skewness, no transformation needs to be done on the data before applying that model.
As it is difficult to conclude whether a given variance is large or not, we considered the Coefficient of Variation to measure variability. We calculated the Coefficient of Variation by dividing the Standard Deviation by the Mean, the values of which we got from the Interval Variable Summary Statistics section of the StatExplore output. We considered a threshold of 0.85: variables with a Coefficient of Variation of 0.85 and above have high variance, and variables with a Coefficient of Variation below 0.85 have low variance.
We found that x7, x9, x13, x14, x16, x18, x19, x22, x25, x26, x28, x29, x31, x34 and x36 have high variance. As high skewness tends to inflate the standard deviation, one way to reduce the variance is to reduce the skewness of the above-mentioned variables by applying the Log10 transformation. We clicked on the Formulas ellipses in the Train properties of the node to observe the data distribution of the input variables.
As expected, we observed that the above mentioned
variables have a highly skewed distribution and thus we selected Log10 as the
transformation Method in the Variables – Trans window for these variables. Log
transformations are used to control skewness.
We also observed that the distributions of a few other variables are heavily skewed – x10, x11, x17, x35, x39 and x40.
We thus applied Log10 transformations on the above
mentioned variables as well.
We observed performance enhancement in both the Logistic Regression and Neural Network models after applying the Log10 transformation to the data. The average loss for the validation data for neural networks and logistic regression is $0.41978 and $0.483516 respectively, whereas after transforming the data it goes down to $0.371429 and $0.454945 respectively.
Hence, the performance of both models is enhanced.
Variable
Reduction
3.
Do we need all 40
attributes for prediction? If no, how about removing some variables?
We considered that all the 40 attributes are not
needed for prediction and some need to be dropped due to the following
reasons:-
• Curse of Dimensionality - Convergence of
any estimator to the true value is very slow in a high dimensional space
• Redundant data – A number of variables
in the data set provide very similar information or are deduced from each other
thus giving no new information.
We have reduced the attributes using the following
techniques:-
i)
Qualitative
& Correlation analysis on the input variables:-
a) Variables X5 and X6 represent ‘household size’ and
‘whether the user has children or not’ respectively. In the age of nuclear
families, we felt that household size and having children are highly correlated
and we can thus reject x6, as x6 to some extent can be inferred from x5. We also observed that the correlation between
X5 and X6 is 0.7291 which is quite high, thus indicating that one of the
variables can be dropped from the analysis.
b) Variables
X18 and X19 represent ‘Total no. of sessions visited of all sites so far’ and
‘Total minutes of all sites’ and are thus highly correlated with each other
with the correlation value of 0.8749. As we observed that the correlation
between X19 and the target variable is higher than the correlation between X18
and the target variable, we decided to retain X19 and drop X18 from the
analysis.
c) Variables X26 and X29 representing ‘Percentage of
total hits are to this site’ and ‘No. of sessions start with this site/total
sessions of this site’ are highly correlated with each other with the
correlation value of 0.9713. The correlation between x29 and the target
variable is higher than the correlation between X26 and the target variable. We
thus decided to retain X29 and drop X26 from the analysis.
d) Similarly, variables X39 and X40 represent ‘Hits to this site/hits to all sites in this session’ and ‘Minutes to this site/total minutes in this session’ and are thus highly correlated with each other, with a correlation value of 0.927879. The correlation between X40 and the target variable is higher than the correlation between X39 and the target variable. Hence we decided to retain X40 and drop X39.
The below table shows the correlation of each variable
with every other variable in the data set. We have highlighted the correlation
values that we have considered as high.
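The pair-screening logic can be sketched as below; the synthetic columns mimic one highly correlated pair (stand-ins for x18 and x19), and the 0.85 cutoff is an illustrative choice rather than a value taken from the report.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 1000
    x18 = rng.gamma(2.0, 3.0, n)
    df = pd.DataFrame({
        "x18": x18,
        "x19": 5 * x18 + rng.normal(0, 2, n),   # strongly related to x18
        "x30": rng.normal(size=n),
        "depend": rng.integers(0, 2, n),
    })

    # Flag input pairs whose absolute correlation exceeds a threshold; for
    # each such pair the report keeps the member that is more correlated
    # with the target and drops the other.
    corr = df.drop(columns="depend").corr().abs()
    pairs = [(a, b, round(corr.loc[a, b], 4))
             for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:]
             if corr.loc[a, b] > 0.85]
    print(pairs)
    print(df.corr()["depend"].drop("depend").round(3).to_dict())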
We thus added a Drop node after the Impute Node in
Sampled workflow:-
We dropped variables X5, X18, X26 and X39.
After dropping these variables, the average loss for
Neural Networks (in the sampled workflow) decreases from $0.463115 to $0.377049
and it became the champion model.
ii) Principal Component Analysis (PCA) is a feature reduction technique in which the transformed features are linear combinations of the original features.
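A minimal sketch of PCA on standardized (interval) inputs is given below; the synthetic matrix stands in for the interval variables, and retaining components that explain 95% of the variance is an illustrative choice, not the Principal Components node's setting.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))            # stand-in for the interval inputs

    # PCA on standardized inputs, i.e. principal components of the
    # correlation matrix; class variables would need encoding first.
    Z = StandardScaler().fit_transform(X)
    pca = PCA(n_components=0.95)              # keep 95% of the variance (illustrative)
    scores = pca.fit_transform(Z)
    print(pca.explained_variance_ratio_.round(3), scores.shape)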
Applying PCA in the non-sampled workflow:-
Applying PCA in the sampled workflow, both before and
after dropping variables based on qualitative analysis:-
However, applying PCA did not bring about the expected
performance enhancements of our models. In the non-sampled workflow, PCA
improved the performance of only the Logistic Regression model, whereas it
deteriorated the performance of all models in the sampled workflow.
As principal components are uncorrelated linear combinations of the original input variables and depend on the covariance matrix or the correlation matrix of those variables, the PCA technique is suitable for reducing the number of interval variables. As our data is a mix of interval and class variables, we think that PCA is not well suited to this data, which is why the performance enhancement was not as expected.
iii)
Using the
Variable Selection node
Variable Selection node reduces the number of inputs
by setting the status of the input variables that are not related to the target
as rejected. Although rejected variables are passed to subsequent nodes in the
process flow, these variables are not used as model inputs by a successor
modeling node.
Applying Variable Selection node in the non-sampled
workflow:-
Applying Variable Selection node in the sampled
workflow:-
In both the sampled and non-sampled work flow, 21
variables are rejected by the Variable Selection node. In non-sampled workflow,
the performance of Logistic regression model increases and performance of
neural network model decreases, after applying the Variable Selection node. In
the sampled workflow, the performance of the Neural Network increases and the
performance of Logistic Regression decreases, after applying the Variable
Selection node. Hence, we did not observe any clear trend with respect to model
performance on applying the Variable Selection node.
Comparison
between the 3 Variable Reduction Techniques:-
We observed that out of the 3 variable reduction
techniques used, variables dropped on the basis of ‘Qualitative &
Correlation analysis on the input variables’ resulted in best performance of
the models.
Sampling
4.
Is the dataset too
large in terms of number of records? If you think so, how about sampling for a
smaller size? Will this actually help with the prediction performance?
The size of the dataset is 1010 records, which is not considered very large. However, we considered sampling the data to mitigate the effect of an unbalanced data set. A training dataset can contain a disproportionately high number of one value relative to the other, and our dataset is an example of such an unbalanced dataset, with the proportion of 1s being 13% and the proportion of 0s being 87%. When used to create a classifier, such an unbalanced dataset often produces biased models, whose predictions have overestimated accuracy. There are several ways to fix this data imbalance, and the most common method is sampling. Here we under-sampled the most frequent value, which in our dataset is the 0s, and over-sampled the less frequent 1s. With this balanced dataset, the models should be able to predict with better accuracy. Sampling also significantly decreases model training time, and if the sample sufficiently represents the source dataset, then prediction models created on the sample can be applied to the complete data set as well.
a) We added a Sample node to the dataset. Here we used the stratified random sampling method, where the population consists of N elements and is divided into H groups called strata. Each element of the population is assigned to exactly one stratum, and we obtain a probability sample from each stratum. Stratified sampling provides greater accuracy and requires a smaller sample, thus saving cost. It also reduces selection bias and ensures that each subgroup within the population receives proper representation within the sample. We set the Stratified Criterion to ‘Equal’ in order to balance the 0s and 1s in the dataset, as sketched below.
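One simple reading of the Equal setting is sketched below: draw the same number of records from each target level, which under-samples the majority class; the data frame is synthetic and the exact allocation used by the Sample node may differ.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "depend": rng.choice([0, 1], size=1010, p=[0.87, 0.13]),
        "x9": rng.normal(size=1010),
    })

    # 'Equal' stratification: draw the same number of records from each
    # target level, which under-samples the majority class (the 0s).
    per_class = int(df["depend"].value_counts().min())
    balanced = df.groupby("depend").sample(n=per_class, random_state=0)
    print(balanced["depend"].value_counts().to_dict())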
b) The Data Partition node was added by setting the
training and validation property in Data Set Allocation to 0.55 and 0.45
respectively.
c) We added a Decision Tree node to the Data
Partition node with the settings similar to the ones used in the Decision Tree
used on un-sampled dataset.
When we executed the decision tree with the above
properties we found the following results:-
The optimal tree that is formed splits first on the attribute bookgc, which has the highest logworth. The second split is on the attribute booksh, the third split happens at node 6 on the attribute minutehl, and the final split happens on the attribute edu.
The subtree assessment plot for the Misclassification Rate showed that the error rate for both the Training and Validation data decreases until the 2nd node. It then stabilizes for both datasets until the 4th node, after which it decreases further for the training dataset and increases for the validation dataset. This indicates overfitting after the 5th node.
d) We then added an Interactive Decision Tree to the
Data Partition node
The
results are shown below:-
On the
first manual split the logworth value was highest for the bookgc variable.
The
lower left node was selected for the second manual split, as the logworth value
for minutsch was found to be second highest.
After further continuing this method to split a few nodes, we
trained the tree using Train Node option and the tree formed is shown below:
We observed that the Misclassification Rate decreases at a high
rate till the 3rd node for both training and validation datasets, after which
the error rate decreases further for training dataset and increases slightly
for validation dataset. This indicates overfitting of data after the 3rd node.
We observed that the average square error has a downward trend. After the 3rd leaf, the average square error for the training dataset drops drastically, while the average square error for the validation set increases, indicating over-fitting of the data after the 3rd leaf.
e) We added the Impute node with the following settings:-
For class variables, we selected Tree Surrogate as the Default
Input Method, enabling SAS Enterprise Miner to build a decision tree with
that variable as the target and the other input variables as predictors.
For interval variables, we selected Median as the Default Input Method, so that the values of missing interval
variables are replaced by the median of the non-missing values.
We set the missing cutoff as 26% which means any
variable having missing values more than 26% will be dropped.
f) Variable transformations are used to stabilize variance, remove nonlinearity, improve additivity, and counter non-normality. For our dataset, we used the Transform Variables node to reduce skewness via the log10 transformation.
We found that x7, x9, x13, x14, x16, x18, x19, x22, x25, x26, x28, x29, x31, x34, x36, x10, x11, x17, x35, x39 and x40 are skewed. We applied log10 transformations to all these variables from the Variables tab, as in the diagram below.
The variable transformation node and its output screen
are shown below:
We have added Neural Network and Regression nodes with
and without transformations.
g)
Neural networks without transformation.
We added a Neural Network node to the Impute node and set the Hidden Units property to Yes to create hidden unit variables in our scoring data. We also set the Standardization property to Yes to create standardization variables in our scoring data.
In the Neural Network node, in the Properties Panel,
under the Train properties, we selected the ellipses that represent the value
of Network:-
• Direct Connection is set to Yes. This allows the network to have connections directly between the inputs and the outputs, in addition to connections through the hidden units.
• The Number of Hidden Units is set to 10.
Thus it trains a multilayer perceptron neural network with 10 units on the
hidden layer.
We observed the cumulative expected loss for neural
networks to be a function of both the probability of a user booking or not
booking in the given site and the estimated cost associated with each
corresponding outcome as defined in the cost matrix.
The classification table shows the frequency of the
misclassification errors for both Training and Validation data.
h)
Neural networks with transformation
We added a Neural Network node to the Transform Variables node and set the Hidden Units property to Yes to create hidden unit variables in our scoring data. We also set the Standardization property to Yes to create standardization variables in our scoring data.
In the Neural Network node, in the Properties Panel,
under the Train properties, we selected the ellipses that represent the value
of Network:-
• Direct Connection is set to Yes. This allows the network to have connections directly between the inputs and the outputs, in addition to connections through the hidden units.
• The Number of Hidden Units is set to 10.
Thus it trains a multilayer perceptron neural network with 10 units on the
hidden layer.
The output of the node is as follows:-
We observe that the cumulative expected loss for
neural network with transformation is comparable to that of the neural network
without transformation.
i)
Logistic Regression without transformation
We added a Regression node after the Impute node. The node automatically performed logistic regression, as the target variable in this data set is binary. As before, we selected Stepwise as the Selection Model property: stepwise selection begins with no candidate effects, adds effects that meet the entry significance level, and can remove effects already in the model that no longer meet the stay significance level.
The output of the sampled regression node is shown below.
The cumulative expected loss for the regression on sampled data is lower than that for the regression on non-sampled data, as the graph below shows.
j)
Logistic
Regression with Transformation
We added a Regression node after the Transform
Variables node. The node automatically performed logistic regression as the
target variable in this data set is a binary variable. As in the above case, we
selected Stepwise as the Selection Model property.
The output result is as follows:-
The cumulative expected loss for the sampled logistic regression with transformation is lower than for the sampled logistic regression without transformation. Hence the model works better after transformation of the data.
Comparison
of Model performance with Sampled and Non-sampled data
We observed that the performance of the Logistic
Regression increases on sampling the data. The Logistic Regression on sampled
data, both with and without transformation performs better than their
counterparts on non-sampled data. However the performance of Neural Network
deteriorates with sampled data.
Ensemble & Bagging
5.
Can ensemble help? How
will you do it?
We have tried to do model enhancement by:
a) Ensemble by ‘Voting’:-
The Ensemble node creates new models by combining the
posterior probabilities (for class targets) or the predicted values (for
interval targets) from multiple predecessor models. It thus creates a combined
model to improve the stability of disparate nonlinear models, such as those
created from the Neural Network and Decision Tree nodes.
It is important to note that the ensemble model can only be more accurate than the individual models if the individual models disagree with one another. In our sampled workflow, we used the 3 weakest-performing models, namely the Decision Tree, the Neural Network after transformation and PCA, and the Logistic Regression after transformation and PCA, for the ensemble. The component models from these three complementary modeling methods are integrated by the Ensemble node to form the final model solution.
Thus, each of the 3 models creates a separate model using the same training data, and the Ensemble node combines the posterior probabilities of the class target through ‘voting’ to create the ensemble model.
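The sketch below shows the same idea with scikit-learn's soft voting, which averages the component models' posterior probabilities; the three estimators and the synthetic data are stand-ins for the models named above, not the SAS EM configuration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=10, random_state=0)

    # Soft voting averages the posterior probabilities of the component
    # models, analogous to how the Ensemble node combines class posteriors.
    ensemble = VotingClassifier(
        estimators=[("tree", DecisionTreeClassifier(max_depth=6, random_state=0)),
                    ("logreg", LogisticRegression(max_iter=1000)),
                    ("net", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                          random_state=0))],
        voting="soft")
    ensemble.fit(X, y)
    print(ensemble.predict_proba(X[:3]).round(3))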
We observed that the performance of the Ensemble model is at least as good as that of each of the 3 individual models used:-
Model                            Average Loss for Validation Data
Ensemble                         0.442623
Decision Tree - Sampled          0.467213
Neural Network - Sam Trans PCA   0.545082
Regression - Sam Trans PCA       0.442623
b) Bagging
The Bootstrap Aggregation (Bagging) mode of the Start Groups node in the sampled workflow created unweighted samples of the active training data set for bagging. The Bagging method uses random sampling with replacement to create the n sample replicates. We set the number of samples to 15.
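The sketch below reproduces the idea with scikit-learn's BaggingClassifier, whose default base learner is a decision tree, fitting 15 bootstrap replicates on synthetic data; it is an analogue of the Start Groups/End Groups setup, not the SAS EM run itself.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier

    X, y = make_classification(n_samples=600, n_features=10, random_state=0)

    # Bagging: 15 bootstrap replicates (sampling with replacement) of the
    # training data, each used to fit the default base learner (a decision
    # tree); their predictions are then aggregated.
    bagger = BaggingClassifier(n_estimators=15, bootstrap=True, random_state=0)
    bagger.fit(X, y)
    print(round(bagger.score(X, y), 3))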
The Bagging method ran the Decision Tree model 15 times over different training datasets. However, the performance of the Decision Tree did not improve through Bagging. A probable reason is that Bagging relies on the replicate models being relatively independent; as our data contains a number of attributes that are highly correlated and dependent on each other, Bagging did not bring any performance enhancement.
Part VI. Summary
Final
Diagrams
1. Include the complete final diagram you get. Make sure
it is legible, and if needed use several pages in print.
Workflow 1: Without Sampling
Workflow 2: With Sampling
Learnings
2.
Summarize what you
learned from this project. Be concise.
This project not only provided us with additional exposure to the SEMMA process in SAS Enterprise Miner, but also gave us an opportunity to work on a real-world data set and perform predictive analysis on it from scratch. The focus of this project was enhancing the performance of the models through various techniques such as missing value handling, transformations, sampling, variable reduction, and ensembles.
We gained immense learnings from working on the
project, which can be categorized under the following-
• Data
Pre-processing - We explored the data in detail, studied the data distribution
and the role of each variable and made the required modifications before using
it for the predictive analysis. This step was a good learning experience for us
as it gave us an opportunity to analyze the effect of each feature of the data
on the predictive analysis that we wished to do. We understood the effect of characteristics such as standard deviation, coefficient of variation, skewness, and kurtosis on the fit of the various models to the data. We gained an understanding of how sensitive each model is to conditions such as skewness and missing values. We brainstormed various transformation techniques, such as Maximum Normal and Log10, to improve the fit of the models to the data.
• Variable Selection – We got hands-on experience using the PCA technique and the Variable Selection node on the data. We deliberated on the suitability of these techniques to the data. We also subjectively dropped attributes by doing
in-depth analysis of each variable and brainstorming their significance in
determining the target variable and whether they provide any additional
information. This exercise made us appreciate the need to thoroughly understand
the data before applying any kind of analysis on it.
• Better understanding of the Models – This project increased our grasp of the various models we used by giving us an opportunity to understand the significance of each setting and to see how each changes the performance of a model. We did trial and error to determine the optimal settings for each model with respect to model performance. We also learnt to evaluate model performance by studying graphs such as the misclassification rate and average square error plots.
• Model Enhancement Techniques – The project gave us exposure to using Ensemble and Bagging techniques for predictive analysis. The exercise improved our understanding of how these techniques work and of the scenarios in which they are suitable.
Finally, we compared the performance of the models under various combinations such as transformation, sampling, PCA, and model enhancement techniques, which gave us a good understanding of methods to improve model performance and of how the effectiveness of model enhancement techniques is often data-dependent.
References
1) Getting Started with SAS® Enterprise Miner™ 13.1
2) SAS Enterprise Miner Help
3) Bas van den Berg and Tom Breur, "Merits of interactive decision tree building — Part 2: How to do it," Journal of Targeting, Measurement and Analysis for Marketing (2007)
4) Kattamuri S. Sarma, "Variable Selection and Transformation of Variables in SAS® Enterprise Miner™ 5.2," Ecostat Research Corp., White Plains, NY