SPSS
Answer Tree is a tool for statistical analysis. It’s handy for analyses
that involve classification of large amounts of data into homogeneous
groups, and helps you in better decision-making. Answer Tree is quite adept
at sifting through large numbers of records and establishing meaningful
patterns among them. This tool should be useful for market researchers, data
analysts, management consultants, and the like.
To understand the specific
capabilities of the product, let’s take the case of a consumer finance
company disbursing loans to individuals. The foremost question facing the
company is whether a given applicant will reliably repay the loan over the
repayment period, or is more likely to default.
Like most statistical tools,
Answer Tree can’t provide you with absolutely accurate answers.
Instead, based on criteria that you specify as being important for assessing
the repayment capability of applicants, it sifts through existing records of
past applicants and classifies them into homogeneous groups. This helps the
company understand the profile of likely defaulters, and thereby take
decisions that minimize the number of defaults.
Installing the software was a
snap, the most tedious part being keying in the 38-digit license code.
Once installed, it ran smoothly and quickly on a P/200 MMX test machine with
32 MB RAM, a 2.1 GB HDD, and VGA at 800x600 resolution.
The built-in tutorials and
help files are adequate for users well-versed in research
methodology, statistics, and decision tree analysis. For the lay user,
they are insufficient, since the basic concepts are not covered.
The printed documentation, however, is a well laid-out and comprehensive
book of over 200 pages spread across 14 chapters. It is virtually a
textbook on decision tree analysis, and explains basic as well as advanced
concepts. So, with its help, lay users can also get started with decision
trees.
To begin using Answer Tree,
you need to have records of past loan applicants in one of the following
file formats:
- SPSS file (*.SAV)
- SYSTAT file (*.SYD, *.SYS)
- Common database formats (*.DBF, etc)
- ODBC (MS-Access files, etc)
The process of building the
answer tree involves two steps. In the first step, the Minimal Tree is
drawn, which classifies data into homogeneous groups. In the second step,
the Minimal Tree may be grown, so as to arrive at an even better answer to
the question at hand, in this case, the likely loan payment defaulters.
The Minimal Tree may be drawn
using one of the following methods.
- CHAID: This uses Chi-square or F statistics to select predictors for
  each homogeneous group
- Exhaustive CHAID: This is a modification of the CHAID method that’s
  more exhaustive and rigorous in selecting predictors. As a result, it
  also takes longer to run
- C&RT: This method identifies homogeneous subsets of data, with each
  split generating two nodes
- QUEST: This method is similar to the C&RT method, with one difference:
  the target (dependent) variable has to be nominal (that is, its values
  are category labels with no inherent order, so you can’t do any
  mathematical calculations on them, for example, type of service). Two
  nodes are generated at each split, as in the case of the C&RT method.
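To make the idea of selecting predictors with a Chi-square statistic concrete, here is a minimal Python sketch. It is not Answer Tree's implementation: the records, the field names (`job`, `owns_car`, `status`) and the helper functions are hypothetical, and it leaves out the category merging and significance-level adjustments that a full CHAID run performs. It simply computes the Pearson Chi-square statistic between each candidate predictor and the target, and picks the predictor with the largest value.

```python
from collections import Counter

def chi_square(pairs):
    """Pearson chi-square statistic for a list of (predictor, target) pairs."""
    n = len(pairs)
    observed = Counter(pairs)
    predictor_totals = Counter(p for p, _ in pairs)
    target_totals = Counter(t for _, t in pairs)
    stat = 0.0
    for p, p_count in predictor_totals.items():
        for t, t_count in target_totals.items():
            expected = p_count * t_count / n  # count expected under independence
            stat += (observed[(p, t)] - expected) ** 2 / expected
    return stat

def best_predictor(records, predictors, target):
    """Return the predictor with the largest chi-square, plus all scores."""
    scores = {
        name: chi_square([(r[name], r[target]) for r in records])
        for name in predictors
    }
    return max(scores, key=scores.get), scores

# Hypothetical past applicants (illustrative data only).
records = [
    {"job": "govt", "owns_car": "yes", "status": "paid"},
    {"job": "govt", "owns_car": "no", "status": "paid"},
    {"job": "govt", "owns_car": "yes", "status": "paid"},
    {"job": "govt", "owns_car": "no", "status": "paid"},
    {"job": "private", "owns_car": "yes", "status": "default"},
    {"job": "private", "owns_car": "no", "status": "default"},
    {"job": "private", "owns_car": "yes", "status": "default"},
    {"job": "private", "owns_car": "no", "status": "paid"},
]

best, scores = best_predictor(records, ["job", "owns_car"], "status")
print(best, scores)
```

Comparing raw Chi-square values is only fair here because both predictors have the same number of categories; with unequal category counts, p-values would be the more appropriate yardstick.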
The Minimal Tree
To draw
the Minimal Tree, the first step is to specify the variables that act as
predictors for the segregation into homogeneous groups. In our loan
repayment example, variables such as monthly salary, educational
qualifications, type of service (government, private service,
self-employed, etc), number of dependents, and other possessions of the
customer (car, house, etc) may all be predictors of loan repayment behavior.
You can also specify the
following tree characteristics:
- Maximum tree depth
- Minimum number of cases in parent and child nodes
- Minimum change in impurity (this is the degree of difference between
  individual cases within a homogeneous group)
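The sketch below shows how limits like these could act as stopping rules for a single two-way split. It rests on assumptions: it uses the Gini index, one commonly used impurity measure for C&RT-style trees (the review does not say which measure Answer Tree applies), and the function names, default thresholds and sample data are all hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for more mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(parent) - weighted

def split_allowed(parent, left, right, depth,
                  max_depth=4, min_parent=4, min_child=2, min_change=0.01):
    """Apply the three limits: depth, node sizes, and change in impurity."""
    if depth >= max_depth:
        return False
    if len(parent) < min_parent or min(len(left), len(right)) < min_child:
        return False
    return impurity_decrease(parent, left, right) >= min_change

# Hypothetical node: 5 applicants who paid, 3 who defaulted.
parent = ["paid"] * 5 + ["default"] * 3
left = ["paid"] * 4                      # e.g. government employees
right = ["paid"] + ["default"] * 3       # e.g. everyone else

decrease = impurity_decrease(parent, left, right)
print(decrease, split_allowed(parent, left, right, depth=1))
```

Raising the minimum child size or the minimum change in impurity makes the tree more conservative: fewer splits survive, and the groups that remain are larger.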
In addition to these, there’s
an option for validating or cross-validating the tree. Validation is
achieved by partitioning the data set into training and testing samples, in
proportions that you specify.
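As a rough illustration of such a partition, the following Python sketch shuffles the records and cuts them at a chosen proportion. The function name `partition`, the 70/30 proportion and the fixed seed are assumptions made for the example, not settings taken from Answer Tree.

```python
import random

def partition(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical applicant IDs, split 70/30.
training, testing = partition(range(10), train_fraction=0.7)
print(len(training), len(testing))
```

The tree is grown on the training sample only; the testing sample is then used to check how well the same tree classifies records it has not seen.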
Growing the Minimal Tree
Since Answer
Tree is an exploratory tool, it’s almost always necessary to revisit
the initial assumptions once the Minimal Tree is drawn. Extensive facilities
for growing and pruning individual branches are available, so that you can
best classify the given data into homogeneous groups. Keep in mind
that knowledge of the situation at hand is key to arriving at the best
grouping. Once the final tree is drawn, its interpretation is quite
straightforward.
Risk charts
The
misclassification matrix counts up the predicted and actual category values
and displays them in a table. A correct classification, often called a
"hit", means that the predicted and actual values agree; hits are counted
in the diagonal cells of the table. An incorrect classification, called a
"miss", means that there’s disagreement between predicted and
actual values. Misclassifications are counted in the off-diagonal elements
of the matrix. In this example, 11 applicants with no credit or no debt
(NCR/NODEB) were misclassified as having current, up-to-date credit accounts
(PD BK). This table is helpful in determining exactly where the model
performs well or poorly.
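The counting behind such a table is simple enough to sketch in a few lines of Python. The category labels and the six sample cases below are hypothetical and are not the figures from the review's example.

```python
from collections import Counter

def misclassification_matrix(actual, predicted):
    """Count each (actual, predicted) pair; equal pairs lie on the diagonal."""
    return Counter(zip(actual, predicted))

def hits_and_misses(matrix):
    """Hits are the diagonal counts, misses are everything off the diagonal."""
    hits = sum(n for (a, p), n in matrix.items() if a == p)
    misses = sum(n for (a, p), n in matrix.items() if a != p)
    return hits, misses

# Hypothetical actual and predicted categories for six applicants.
actual = ["paid", "paid", "paid", "default", "default", "paid"]
predicted = ["paid", "default", "paid", "default", "paid", "paid"]

matrix = misclassification_matrix(actual, predicted)
hits, misses = hits_and_misses(matrix)

# Print the table: rows are actual categories, columns are predicted ones.
categories = ["paid", "default"]
for a in categories:
    print(a, [matrix[(a, p)] for p in categories])
print("hits:", hits, "misses:", misses)
```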
The risk estimate and
its standard error indicate how well the classifier (the tree used to
predict the category of each case) is performing. In this
case, the risk estimate for the four-level C&RT tree is 0.2880, and the
standard error for the risk estimate is 0.0143. In other words, the tree
misclassifies about 28.8 percent of the cases. If necessary, you can look
at ways to further improve the model.
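These two figures can be related with a short calculation. The sketch below assumes the risk estimate is the plain proportion of misclassified cases and that its standard error follows the usual binomial formula, sqrt(risk x (1 - risk) / N). The sample size is not stated in this review; N = 1000, with 288 misclassified cases, is an assumption chosen because it reproduces the reported numbers.

```python
import math

def risk_estimate(misclassified, total):
    """Proportion of cases the tree classifies incorrectly."""
    return misclassified / total

def risk_standard_error(risk, total):
    """Binomial standard error of a misclassification proportion."""
    return math.sqrt(risk * (1.0 - risk) / total)

# Assumed figures: 288 misses out of 1000 cases (not stated in the review).
risk = risk_estimate(288, 1000)
se = risk_standard_error(risk, 1000)
print(f"risk = {risk:.4f}, standard error = {se:.4f}")
# prints: risk = 0.2880, standard error = 0.0143
```

Under the same assumptions, a rough 95 percent interval for the true misclassification rate would be the risk estimate plus or minus about two standard errors, that is, roughly 26 to 32 percent.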
In conclusion
At a steep
license fee of Rs 110,000 per user, Answer Tree is clearly beyond the reach
of individual researchers or even the smaller research firms. Its pricing
renders it a viable buy only for large organizations.