Cross-validation is a method for estimating what the error rate of a sub-tree (of the maximal tree) would be if you had test data. Regardless of what value you set for V-fold cross validation, CART grows the same maximal tree. The monograph provides evidence that using a V of 10-20 gives better results than using a smaller number, but each number could result in a slightly different error estimate. The optimal tree — which is derived from the maximal tree by pruning — could differ from one V to another because each cross-validation run will come up with slightly different estimates of the error rates of sub-trees and thus might differ in which tree was actually best.
Normally, a test sample is used to prune the maximal tree down to an "optimal" tree. This is especially recommended for large data sets, from which a test sample can be withdrawn. However, there are times when the size of the data set makes withdrawing a test sample difficult. In the absence of a test sample and without using cross validation, no pruning is done — this is called EXPLORATORY — and the maximal tree is the result. Note that the maximal tree in an exploratory run is identical to the maximal tree when using a test sample, provided that the learn sample is the same for each run.
When you are unable or unwilling to set aside a test sample but still want estimates of the error rates of each tree in the sequence, cross validation may be used. In a nutshell, cross validation establishes how much to prune the maximal tree by building a series of "ancillary cross-validation trees" from which error rates of the maximal tree and its subtrees can be estimated. Cross validation does not affect the growth of the maximal tree at all because it is conducted after the maximal tree is grown. The V ancillary cross-validation trees may be similar to the maximal tree, but not necessarily. Here is how it works:
- The maximal tree is grown and saved. Note that we do not have any "independent" estimate of the error rates for each node in the maximal tree, because we do not have a test sample. A pruning sequence is defined based on node complexities of the maximal tree, although the error rate for each tree in the sequence is not yet known. In other words, we know which nodes to prune off the tree and in what order, and we have a series of subtrees defined by the pruning sequence, but we do not know how far to prune.
- V ancillary cross-validation trees are then grown, each on a different portion of the learn sample. For instance, if 10 cross-validation trees are grown, the learn sample is divided into 10 parts; each tree uses a different 90% of the learn sample for tree growth and the remaining 10% as a pseudo test sample with which to estimate error rates for the nodes in the cross-validation tree.
- Error rates from each of the V cross-validation trees are combined and mapped to the nodes in the original maximal tree. The V cross-validation trees are then discarded.
Now that estimates of the error/cost for each node in the maximal tree are known, we are in a position to prune the maximal tree and declare an optimal tree.
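The partition-and-pool mechanics of the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CART's implementation: the function names (`make_folds`, `majority_class`, `cross_validated_error`) are hypothetical, and a majority-class predictor stands in for growing an ancillary cross-validation tree.

```python
import random
from collections import Counter

def make_folds(n_cases, v, seed=0):
    """Shuffle the case indices and deal them into v roughly equal folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::v] for i in range(v)]

def majority_class(labels):
    """Placeholder 'model': predict the most common class in the grow set."""
    return Counter(labels).most_common(1)[0][0]

def cross_validated_error(labels, v, seed=0):
    """Pool the pseudo-test misclassifications over all v folds."""
    errors = 0
    for held_out in make_folds(len(labels), v, seed):
        held = set(held_out)
        # Every case outside the held-out fold is used for "growing".
        grow = [labels[i] for i in range(len(labels)) if i not in held]
        prediction = majority_class(grow)  # stand-in for a cross-validation tree
        # The held-out fold plays the role of the pseudo test sample.
        errors += sum(1 for i in held_out if labels[i] != prediction)
    return errors / len(labels)

labels = ["a"] * 70 + ["b"] * 30          # toy learn sample of 100 cases
folds = make_folds(len(labels), 10)
print([len(f) for f in folds])            # ten folds of 10 cases each
print(cross_validated_error(labels, 10))  # 0.3
```

Every case is held out exactly once, so the pooled error is based on the whole learn sample even though no single stand-in model ever saw the cases it was scored on.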
Q: We typically use the default of 10-fold cross validation in CART. However, when we change to, say, 20-fold cross validation, CART indicates a different optimal tree. Why?
A: In both cases the maximal tree is the same. 20-fold cross validation will partition the learning sample into 20 subsets and will generate 20 ancillary cross-validation trees. These trees, each with its own error rates, will be combined to yield estimated error rates for the maximal tree. Since we are combining 20 trees rather than 10, and each of them is grown on a different portion of the data, it is almost certain that the 20-fold combined error rates estimated for the maximal tree will differ from those estimated by combining 10-fold cross-validation trees. Although the pruning sequence is the same in both runs, a different tree may be chosen as optimal between the two runs due to the differing error rate estimates. In other words, the maximal tree and pruning sequence are the same, but the 10- and 20-fold cross-validation procedures may result in a different amount of pruning.
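A small experiment makes the effect easy to reproduce. The sketch below is an assumption-laden toy, not CART: the data are synthetic, the helper names (`make_folds`, `fit_threshold`, `cv_error`) are hypothetical, and a one-variable threshold rule stands in for a tree. It runs the same data through 10-fold and 20-fold cross validation so the two error estimates can be compared.

```python
import random

def make_folds(n_cases, v, seed):
    """Shuffle the case indices and deal them into v roughly equal folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::v] for i in range(v)]

def fit_threshold(xs, ys):
    """Toy learner: choose the cut on x with the fewest grow-set errors
    when predicting class 1 for x > cut."""
    best_cut, best_errors = None, None
    for cut in sorted(set(xs)):
        errors = sum((x > cut) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

def cv_error(xs, ys, v, seed=0):
    """V-fold cross-validated error rate of the toy learner."""
    errors = 0
    for held_out in make_folds(len(xs), v, seed):
        held = set(held_out)
        grow = [i for i in range(len(xs)) if i not in held]
        cut = fit_threshold([xs[i] for i in grow], [ys[i] for i in grow])
        errors += sum((xs[i] > cut) != bool(ys[i]) for i in held_out)
    return errors / len(xs)

# Synthetic learn sample: class 1 becomes more likely as x grows.
rng = random.Random(42)
xs = [rng.random() for _ in range(200)]
ys = [int(x + rng.gauss(0, 0.25) > 0.5) for x in xs]

for v in (10, 20):
    print(v, cv_error(xs, ys, v))
```

The learn sample is identical in both runs; only the partition and the size of each grow set (90% versus 95%) change, which is enough for the two estimates to differ.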
Q: In the tree sequence and on the "select tree" dialog we see "cross-validated relative cost" (with confidence intervals) and "resubstitution relative cost," for each tree in the tree sequence, e.g.:
| Tree | Terminal Nodes | Cross-Validation Relative Cost | Resubstitution Relative Cost | Complexity Parameter |
| --- | --- | --- | --- | --- |
| 1 | 15 | 0.7457930 +/- 0.0142744 | 0.6738151 | 0.0019035 |
| 2 | 10 | 0.7506419 +/- 0.0135887 | 0.6981514 | 0.0024436 |
| 3 | 9 | 0.7533725 +/- 0.0136544 | 0.7033467 | 0.0026077 |
| 4 | 7 | 0.7476655 +/- 0.0137743 | 0.7145392 | 0.0028081 |
| 5** | 6 | 0.7439012 +/- 0.0135847 | 0.7221265 | 0.0038037 |
| 6 | 3 | 0.7605784 +/- 0.0142045 | 0.7499018 | 0.0046392 |
| 7 | 1 | 1.0000000 +/- 0.0000896 | 1.0000000 | 0.0625345 |
A: Cross-validated relative cost is the error rate of the tree, relative to the root node, estimated using the cross-validation method. If you had used a test sample instead of cross validation, you would have been presented with “test sample relative cost.” The resubstitution relative cost is the error rate that would be estimated had you used a copy of the learn sample as your test sample. Note that this rate always decreases as the tree gets larger. This is a consequence of estimating errors with the same data that was used to build the tree in the first place. The +/- number is a measure of the uncertainty around the actual (cross-validation or test sample) error rate of the tree in question when confronted with new data. The cross-validation error rate is derived from a single cross-validation procedure, just as a test sample error rate is derived from a single test sample. Either way, if you ran another cross-validation procedure or used a different test sample, you would likely see a (slightly) different error rate, and the +/- gives an idea of how large that difference might be.
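The figures quoted in the question can be checked directly. The short Python sketch below copies them into a list (the tuple layout and variable names are just an assumption for illustration) and confirms two points from the answer: the tree with the lowest cross-validated relative cost is tree 5, and the resubstitution relative cost never rises as terminal nodes are added.

```python
# (tree, terminal nodes, cross-validated relative cost, +/- value,
#  resubstitution relative cost), copied from the tree sequence above
tree_sequence = [
    (1, 15, 0.7457930, 0.0142744, 0.6738151),
    (2, 10, 0.7506419, 0.0135887, 0.6981514),
    (3, 9, 0.7533725, 0.0136544, 0.7033467),
    (4, 7, 0.7476655, 0.0137743, 0.7145392),
    (5, 6, 0.7439012, 0.0135847, 0.7221265),
    (6, 3, 0.7605784, 0.0142045, 0.7499018),
    (7, 1, 1.0000000, 0.0000896, 1.0000000),
]

# The tree with the smallest cross-validated relative cost.
best = min(tree_sequence, key=lambda row: row[2])
print(best[0], best[1])  # 5 6

# The band implied by the +/- value for that tree.
low, high = best[2] - best[3], best[2] + best[3]
print(f"{low:.7f} to {high:.7f}")

# Resubstitution cost, ordered from the smallest tree to the largest.
by_size = sorted(tree_sequence, key=lambda row: row[1])
resub = [row[4] for row in by_size]
print(all(a >= b for a, b in zip(resub, resub[1:])))  # True
```

The cross-validated cost, by contrast, is not monotone in tree size: it falls to a minimum at 6 terminal nodes and is higher for both smaller and larger trees in this sequence.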