To understand the specific capabilities of the product, let’s take the case of a consumer finance company disbursing loans to individuals. The foremost question facing the company would be whether a given applicant will repay the loan in full over the repayment period, or is more likely to default.
Like most statistical tools, Answer Tree can’t provide you with absolutely accurate answers. Instead, based on criteria that you specify as being important for assessing the repayment capabilities of applicants, it sifts through existing databases and classifies the records into homogeneous groups. This can help the company understand the profile of likely defaulters, and thereby take decisions to minimize the number of defaulters.
Installing the software was a snap, the most tedious part being keying in the 38-digit license code. Once installed, it ran smoothly and quickly on a P/200 MMX test machine with 32 MB RAM, 2.1 GB HDD, and VGA at 800x600 resolution.
The built-in tutorials and help files are adequate for users well-versed in research methodology, statistics, and decision tree analysis. For the lay user, however, they are insufficient, since they don’t cover the basic concepts. The printed documentation fills this gap: a well laid-out and comprehensive book of over 200 pages spread across 14 chapters, it’s virtually a textbook on decision tree analysis, and explains basic as well as advanced concepts. So, with its help, lay users can also get started with decision trees.
To begin using Answer Tree, you need to have records of past loan applicants in one of the following file formats:
SPSS file (*.SAV)
SYSTAT file (*.SYD, *.SYS)
Common database formats (*.DBF, etc)
ODBC (MS-Access files, etc)
The process of building the answer tree involves two steps. In the first step, the Minimal Tree is drawn, which classifies data into homogeneous groups. In the second step, the Minimal Tree may be grown, so as to arrive at an even better answer to the question at hand, in this case, the likely loan payment defaulters.
The Minimal Tree may be drawn using one of the following methods.
CHAID: This uses Chi-square or F statistics to select predictors for each homogeneous group
Exhaustive CHAID: This is a modification of the CHAID method that’s more exhaustive and rigorous in selecting predictors. As a result, it also takes longer to run
C&RT: This method identifies homogeneous subsets of data, with each split generating two nodes
QUEST: This method is similar to the C&RT method, with one difference: the target (dependent) variable has to be nominal (that is, its values are unordered category labels on which you can’t do any mathematical calculations, for example, "defaulted" versus "repaid"). Two nodes are generated at each split, as in the case of the C&RT method.
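To give a feel for how a chi-square statistic can rank predictors, here is a minimal Python sketch. It is not Answer Tree's implementation: the function name and the applicant counts are made up for illustration, and a real CHAID run also merges categories and works with adjusted p-values rather than the raw statistic.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and outcome were unrelated
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are predictor categories, columns are (repaid, defaulted)
by_employment = [[90, 10], [60, 40]]   # salaried vs. self-employed
by_car = [[75, 25], [75, 25]]          # owns a car vs. does not

print(chi_square(by_employment))  # 24.0
print(chi_square(by_car))         # 0.0
```

In this made-up data, type of employment separates repayers from defaulters far better than car ownership, so it would be the stronger candidate for a split.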
The Minimal Tree
To draw the Minimal Tree, the first step is to specify the variables that are predictors for the segregation into homogeneous groups. In our loan repayment example, variables such as monthly salary, educational qualifications, type of service (government, private service, self-employed, etc), number of dependents, and other possessions of the customer (car, house, etc) may all be predictors of loan repayment behavior.
You can also specify the following tree characteristics:
Maximum tree depth
Minimum number of cases in parent and child nodes
Minimum change in impurity (this is the degree of difference between individual cases within a homogeneous group)
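The idea of a minimum change in impurity can be sketched in a few lines of Python. The sketch assumes the Gini index as the impurity measure, which is a common choice for C&RT-style trees; the review does not say which measure was used, and the counts and the 0.01 threshold below are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node, given the number of cases in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Drop in impurity when a parent node is split into two child nodes."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node: class counts are (repaid, defaulted)
parent = [150, 50]
left, right = [90, 10], [60, 40]

decrease = impurity_decrease(parent, left, right)  # about 0.045
min_change = 0.01                                  # hypothetical threshold
print(decrease >= min_change)                      # True, so the split would be kept
```

A split whose decrease falls below the threshold would be rejected, which stops the tree from growing branches that barely improve the grouping.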
In addition to these, there’s an option for validating/cross-validating the tree. Validation is achieved by partitioning the data set into a Training sample and a Testing sample, in specified proportions.
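A partition of this kind can be illustrated with a short Python sketch. The function name, the 70/30 proportion, and the fixed seed are assumptions for the example; the review does not describe how Answer Tree performs the split internally.

```python
import random

def partition(records, train_fraction, seed=0):
    """Shuffle the records and split them into training and testing samples."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 100 hypothetical applicant record IDs, split 70/30
train, test = partition(range(100), 0.7)
print(len(train), len(test))  # 70 30
```

The tree is then built on the training sample only, and its error rate on the testing sample shows how well it generalizes to applicants it has not seen.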
Growing the Minimal Tree
Since Answer Tree is an exploratory tool, it’s almost always necessary to revisit the initial assumptions once the Minimal Tree is drawn. Extensive facilities for growing/pruning individual branches are available, so that you can best classify the given data into homogeneous groups. It should be kept in mind that knowledge of the situation at hand is key to arriving at the best grouping. Once the final tree is drawn, its interpretation is quite straightforward.
The misclassification matrix counts up the predicted and actual category values and displays them in a table. A correct classification is added to the counts in the diagonal cells of the table. The diagonal elements of the table represent agreement between the predicted and actual value, often called a "hit". An incorrect classification, called a "miss", means that there’s disagreement between predicted and actual values. Misclassifications are counted in the off-diagonal elements of the matrix. In this example, 11 applicants with no credit or no debt (NCR/NODEB) were misclassified as having current, up-to-date credit accounts (PD BK). This table is helpful in determining exactly where the model performs well or poorly.
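The way such a matrix is tallied can be shown with a small Python sketch. The "good"/"bad" labels and the five cases are invented for illustration and are not the categories or counts from the test data.

```python
from collections import Counter

def misclassification_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of categories occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels for five applicants
actual    = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad",  "bad", "good", "good"]

matrix = misclassification_matrix(actual, predicted)

# Diagonal cells (actual == predicted) are hits, everything else is a miss
hits = sum(n for (a, p), n in matrix.items() if a == p)
misses = sum(matrix.values()) - hits
print(hits, misses)  # 3 2
```

Each key of the counter corresponds to one cell of the table, so a large off-diagonal count points to the pair of categories the tree confuses most often.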
The risk estimate and the standard error of the risk estimate indicate how well the classifier (the variable you use for classification at a given node) is performing. In this case, the risk estimate for the four-level C&RT tree is 0.2880, and the standard error for the risk estimate is 0.0143. In other words, the tree misclassifies cases 28.8 percent of the time. If necessary, you can look at ways to further improve the model.
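For a categorical target, a risk estimate of this kind is simply the proportion of misclassified cases. The Python sketch below assumes the usual binomial formula for its standard error and a sample of 1,000 cases with 288 misses. Neither the formula nor the sample size is stated in the review; the numbers were chosen because they reproduce the reported figures.

```python
import math

def risk_estimate(misclassified, total):
    """Misclassification rate and its binomial standard error."""
    risk = misclassified / total
    std_error = math.sqrt(risk * (1 - risk) / total)
    return risk, std_error

# Assumed counts: 288 misclassified out of 1,000 cases
risk, se = risk_estimate(288, 1000)
print(round(risk, 4), round(se, 4))  # 0.288 0.0143
```

The small standard error relative to the risk suggests the 28.8 percent figure is a fairly stable estimate for a sample of this size, rather than an accident of which cases were included.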
At a steep license fee of Rs 110,000 per user, Answer Tree is clearly beyond the reach of individual researchers or even the smaller research firms. Its pricing renders it a viable buy only for large organizations.