Data Mining: Accuracy and Ensemble Methods


Classifier Accuracy Measures

• Accuracy of a classifier M, acc(M): the percentage of test-set tuples that are correctly classified by the model M
• Error rate (misclassification rate) of M = 1 − acc(M)
• Given m classes, CM(i,j), an entry in the confusion matrix, indicates the number of tuples of class i that are labeled by the classifier as class j
• Alternative accuracy measures (e.g., for cancer diagnosis), computed in the sketch below:
  - sensitivity = t-pos/pos    /* true positive recognition rate */
  - specificity = t-neg/neg    /* true negative recognition rate */
  - precision = t-pos/(t-pos + f-pos)
  - accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
• This model can also be used for cost-benefit analysis
  - Costs are associated with misclassification
  - False positives vs. false negatives
• If there is a possibility of tuples belonging to more than one class, a probability distribution over the classes may be returned
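A minimal sketch of these measures in Python. The counts below are hypothetical and only illustrate the definitions; in practice they come from the test-set confusion matrix.

```python
# Hypothetical confusion-matrix counts for a binary problem.
t_pos, f_neg = 90, 10    # positive tuples: correctly / incorrectly classified
t_neg, f_pos = 940, 60   # negative tuples: correctly / incorrectly classified

pos = t_pos + f_neg      # total positive tuples
neg = t_neg + f_pos      # total negative tuples

sensitivity = t_pos / pos                      # true positive recognition rate
specificity = t_neg / neg                      # true negative recognition rate
precision   = t_pos / (t_pos + f_pos)
accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} accuracy={accuracy:.3f}")
```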



Predictor Error Measures

• Measuring predictor accuracy: measure how far off the predicted value is from the actual known value
• Loss function: measures the error between the actual value yi and the predicted value yi'
  - Absolute error: |yi − yi'|
  - Squared error: (yi − yi')²
• Test error (generalization error): the average loss over the test set (sums run over the d test tuples; ȳ is the mean of the actual values), as in the sketch below
  - Mean absolute error: Σ|yi − yi'| / d
  - Mean squared error: Σ(yi − yi')² / d
  - Relative absolute error: Σ|yi − yi'| / Σ|yi − ȳ|
  - Relative squared error: Σ(yi − yi')² / Σ(yi − ȳ)²
• The mean squared error exaggerates the presence of outliers; the root mean squared error and, similarly, the root relative squared error are popularly used
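A minimal sketch of the predictor error measures on hypothetical data (the values and array names are placeholders, not from the slides):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])   # actual known values
y_pred = np.array([2.5, 5.5, 2.0, 8.0, 4.5])   # predicted values

d = len(y_true)
y_mean = y_true.mean()

mae  = np.abs(y_true - y_pred).sum() / d        # mean absolute error
mse  = ((y_true - y_pred) ** 2).sum() / d       # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error

# Relative errors: compare against a predictor that always outputs the mean.
rae = np.abs(y_true - y_pred).sum() / np.abs(y_true - y_mean).sum()
rse = ((y_true - y_pred) ** 2).sum() / ((y_true - y_mean) ** 2).sum()

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} RAE={rae:.3f} RSE={rse:.3f}")
```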

Evaluating the Accuracy

• Holdout method
  - The given data are randomly partitioned into two independent sets
  - Training set (e.g., 2/3) for model construction
  - Test set (e.g., 1/3) for accuracy estimation
  - Random sampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular), sketched in the code below
  - Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
  - At the i-th iteration, use Di as the test set and the remaining subsets as the training set
  - Leave-one-out: k folds where k = number of tuples, for small data sets
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
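A minimal sketch of stratified 10-fold cross-validation using scikit-learn; the data set and base classifier are illustrative choices, not prescribed by the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # folds preserve the class distribution

# At the i-th iteration, fold i is the test set and the other folds form the training set.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"per-fold accuracy: {scores.round(3)}")
print(f"estimated accuracy: {scores.mean():.3f}")
```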

Evaluating the Accuracy (cont.)

• Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
• .632 bootstrap (a sketch follows below)
  - Given a data set of d tuples, the data set is sampled d times with replacement, resulting in a training set of d samples; the tuples that are never drawn form the test set
  - About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set
  - Repeating the sampling procedure k times, the overall accuracy of the model is the average over the k repetitions:
    Acc(M) = (1/k) Σi [ 0.632 × Acc(Mi)test_set + 0.368 × Acc(Mi)train_set ]
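A minimal sketch of the .632 bootstrap accuracy estimate described above; the data set, base classifier, and number of repetitions k are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
d = len(X)
k = 10                                 # number of bootstrap repetitions
rng = np.random.default_rng(0)
acc_estimates = []

for i in range(k):
    # Sample the data set d times with replacement; tuples never drawn form the test set.
    boot_idx = rng.integers(0, d, size=d)
    test_mask = np.ones(d, dtype=bool)
    test_mask[boot_idx] = False        # roughly 36.8% of the tuples end up here

    model = DecisionTreeClassifier(random_state=i).fit(X[boot_idx], y[boot_idx])
    acc_test = model.score(X[test_mask], y[test_mask])
    acc_train = model.score(X[boot_idx], y[boot_idx])

    # .632 combination of test-set and training-set accuracy for this repetition
    acc_estimates.append(0.632 * acc_test + 0.368 * acc_train)

print(f".632 bootstrap accuracy estimate: {np.mean(acc_estimates):.3f}")
```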

Ensemble Methods: Increasing the Accuracy

• Ensemble methods
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
• Popular ensemble methods
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
  - Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation

• Analogy: diagnosis based on multiple doctors' majority vote
• Training
  - Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
  - A classifier model Mi is learned for each training set Di
• Classification: to classify an unknown sample X (a sketch follows below)
  - Each classifier Mi returns its class prediction
  - The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can also be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple
• Accuracy
  - Often significantly better than a single classifier derived from D
  - For noisy data: not considerably worse, and more robust
  - Proved to improve accuracy in prediction
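A minimal sketch of bagging for classification, following the training and classification steps above; the base learner and data set are placeholder choices, not from the slides.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
d = len(X_train)
k = 25                       # number of bootstrap rounds / base classifiers
models = []

# Training: at each iteration i, draw a bootstrap sample Di of d tuples
# (with replacement) from D and learn a classifier Mi on it.
for i in range(k):
    idx = rng.integers(0, d, size=d)
    Mi = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    models.append(Mi)

# Classification: each Mi votes; M* assigns the class with the most votes to X.
votes = np.stack([Mi.predict(X_test) for Mi in models])            # shape (k, n_test)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

accuracy = (majority == y_test).mean()
print(f"bagged accuracy on the held-out set: {accuracy:.3f}")
```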





Boosting

• Analogy: consult several doctors and combine their weighted diagnoses, where each weight is assigned based on the previous diagnosis accuracy
• How it works
  - Weights are assigned to each training tuple
  - A series of k classifiers is iteratively learned
  - After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  - The final M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy
• The boosting algorithm can be extended to the prediction of continuous values
• Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data (see the library-based comparison below)
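To see the two schemes side by side, a brief comparison using scikit-learn's ready-made implementations; the data set, number of estimators, and scoring choices are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # bootstrap samples, aggregated predictions
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # reweighted tuples, weighted vote

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean 10-fold accuracy = {scores.mean():.3f}")
```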

Adaboost  Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)  Initially, all the weights of tuples are set the same (1/d)  Generate k classifiers in k rounds. At round i,  Tuples from D are sampled (with replacement) to form a training set Di of the same size  Each tuple’s chance of being selected is based on its weight A classification model Mi is derived from Di  Its error rate is calculated using Di as a test set  If a tuple is misclassified, its weight is increased, else it is decreased Error rate: err(Xj) is the misclassification error of  tuple Xj. Classifier Mi error rate is the sum of the weights of the misclassified tuples: 



The weight of classifier Mi’s vote is
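A compact sketch of the AdaBoost loop described above for a binary problem. The error-rate, vote-weight, and weight-update steps follow the slide; the base learner, data set, and the safeguard for degenerate rounds are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # binary labels 0/1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

d = len(X_tr)
k = 20                                   # number of boosting rounds
w = np.full(d, 1.0 / d)                  # initially all tuple weights are 1/d
rng = np.random.default_rng(0)
models, alphas = [], []

for i in range(k):
    # Sample Di (with replacement); each tuple's chance is proportional to its weight.
    idx = rng.choice(d, size=d, replace=True, p=w)
    Mi = DecisionTreeClassifier(max_depth=1, random_state=i).fit(X_tr[idx], y_tr[idx])

    miss = (Mi.predict(X_tr) != y_tr)            # err(Xj): 1 if misclassified, else 0
    error = np.sum(w * miss)                     # error(Mi) = sum_j wj * err(Xj)
    if error == 0 or error >= 0.5:               # degenerate round: skip it (common safeguard)
        continue

    alpha = np.log((1 - error) / error)          # weight of classifier Mi's vote
    models.append(Mi)
    alphas.append(alpha)

    # Misclassified tuples gain relative weight, correctly classified ones lose it.
    w = w * np.where(miss, 1.0, error / (1 - error))
    w = w / w.sum()                              # renormalize

# Final M*: weighted vote of the individual classifiers.
scores = np.zeros((len(X_te), 2))
for Mi, alpha in zip(models, alphas):
    pred = Mi.predict(X_te)
    scores[np.arange(len(X_te)), pred] += alpha
y_hat = scores.argmax(axis=1)
print(f"AdaBoost accuracy: {(y_hat == y_te).mean():.3f}")
```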

Model Selection

• Determining the better of two models
  - During cross-validation, mean error rates give only estimates; considerable variation between folds may be present
• Estimating confidence levels: statistical t-test
  - Pairwise comparison vs. non-paired comparison; for the pairwise case, use the same partitioning for each fold
  - Error rates: err(M1)i and err(M2)i denote the errors of M1 and M2 on fold i
  - Mean error rate: err(M1) = (1/k) Σi err(M1)i, and similarly for M2
  - t-statistic (paired, with k − 1 degrees of freedom, where k = number of folds):
    t = (err(M1) − err(M2)) / sqrt(var(M1 − M2) / k)
  - Significance level: 5% or 0.05; for a two-sided test, look up the t-distribution table at sig/2 = 0.025 to obtain the critical value z
  - If t > z or t < −z, reject the null hypothesis and conclude that there is a significant difference between the models (a sketch follows at the end of this section)

Model Selection: ROC Curves

• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Show the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the tuple that is most likely to belong to the positive class appears at the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
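Returning to the paired t-test above, a minimal sketch that compares the per-fold error rates of two classifiers; the two models, the data set, and the 10-fold setup are arbitrary placeholders.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 10
cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)  # same partitioning for both models

# Per-fold error rates err(M1)_i and err(M2)_i
err_m1 = 1 - cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
err_m2 = 1 - cross_val_score(GaussianNB(), X, y, cv=cv)

diff = err_m1 - err_m2
t_stat = diff.mean() / np.sqrt(diff.var() / k)      # paired t-statistic over the k folds
z = stats.t.ppf(1 - 0.025, df=k - 1)                # critical value at sig/2 = 0.025, k - 1 d.o.f.

print(f"t = {t_stat:.3f}, critical value z = {z:.3f}")
if abs(t_stat) > z:
    print("Reject the null hypothesis: the models differ significantly.")
else:
    print("No statistically significant difference between the models.")
```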