San Jose State University Data Mining Using R for Data Analysis and Graphics Questions
Read Chapters 2 and 5 to answer these questions.
Chapter 2 Problems: 2, 5, 6, 10
2) Describe the difference in roles assumed by the validation partition and the test partition.
5) Using the concept of overfitting, explain why when a model is fit to training data, zero error with those data is not necessarily good.
6) In fitting a model to classify prospects as purchasers or nonpurchasers, a certain company drew the training data from internal data that include demographic and purchase information. Future data to be classified will be lists purchased from other sources, with demographic (but not purchase) data included. It was found that “refund issued” was a useful predictor in the training data. Why is this not an appropriate variable to include in the model?
10) Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which model are you more likely to consider for final deployment?
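The partitioning that these Chapter 2 problems refer to can be sketched in base R. This is only an illustration: the data frame `records`, the 50/30/20 split, and the seed are assumptions, not part of the assignment.

```r
# Hypothetical data: 1000 placeholder records (not from the textbook)
set.seed(1)                                   # arbitrary seed for reproducibility
n <- 1000
records <- data.frame(id = seq_len(n), x = rnorm(n))

# Shuffle the row indices once, then cut them into three disjoint partitions
shuffled  <- sample(n)
train_idx <- shuffled[1:500]                  # used to fit candidate models
valid_idx <- shuffled[501:800]                # used to compare and tune models
test_idx  <- shuffled[801:1000]               # held back to assess the chosen model

train <- records[train_idx, ]
valid <- records[valid_idx, ]
test  <- records[test_idx, ]
```

Because every row index appears in exactly one partition, no record used for fitting is reused when models are compared or when the final model is assessed.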
Chapter 5 Problems: 1, 2, 5, 7
1) A data mining routine has been applied to a transaction dataset and has classified 88 records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so). Construct the confusion matrix and calculate the overall error rate.
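One possible way to set up the arithmetic for problem 1 in base R is sketched below. The counts come from the problem statement; the variable names and the layout (rows for predicted class, columns for actual class) are choices made here for illustration.

```r
# Counts taken from the problem statement
pred_fraud_total   <- 88    # records classified as fraudulent
pred_fraud_correct <- 30    # of those, truly fraudulent
pred_non_total     <- 952   # records classified as non-fraudulent
pred_non_correct   <- 920   # of those, truly non-fraudulent

# Rows = predicted class, columns = actual class
confusion <- matrix(
  c(pred_fraud_correct,                pred_fraud_total - pred_fraud_correct,
    pred_non_total - pred_non_correct, pred_non_correct),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("fraud", "nonfraud"),
                  actual    = c("fraud", "nonfraud"))
)

# Overall error rate = off-diagonal count / total count
error_rate <- (sum(confusion) - sum(diag(confusion))) / sum(confusion)

print(confusion)
print(error_rate)   # 90 / 1040
```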
2) Suppose that this routine has an adjustable cutoff (threshold) mechanism by which you can alter the proportion of records classified as fraudulent. Describe how moving the cutoff up or down would affect
a. the classification error rate for records that are truly fraudulent
b. the classification error rate for records that are truly nonfraudulent
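The effect of the cutoff on the two class-specific error rates can be explored with a small sketch. The eight records below are made up for illustration, and `class_errors` is a hypothetical helper, not a function from any package.

```r
# Hypothetical validation records: 1 = fraudulent, 0 = nonfraudulent
actual     <- c(1, 1, 1, 0, 1, 0, 0, 0)
propensity <- c(0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1)

# Hypothetical helper: classify as fraud when propensity >= cutoff,
# then report the error rate within each true class
class_errors <- function(actual, propensity, cutoff) {
  predicted <- as.integer(propensity >= cutoff)
  c(fraud_error    = mean(predicted[actual == 1] == 0),  # frauds missed
    nonfraud_error = mean(predicted[actual == 0] == 1))  # nonfrauds flagged
}

# One column of error rates per cutoff
results <- sapply(c(0.25, 0.5, 0.75),
                  function(k) class_errors(actual, propensity, k))
colnames(results) <- c("0.25", "0.5", "0.75")
print(results)
```

On these made-up records, raising the cutoff flags fewer records as fraudulent, so the error rate among true frauds rises while the error rate among true nonfrauds falls.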
5) A large number of insurance records are to be examined to develop a model for predicting fraudulent claims. Of the claims in the historical database, 1% were judged to be fraudulent. A sample is taken to develop a model, and oversampling is used to provide a balanced sample in light of the very low response rate. When applied to this sample (n = 800), the model ends up correctly classifying 310 frauds, and 270 nonfrauds. It missed 90 frauds, and classified 130 records incorrectly as frauds when they were not.
a. Produce the confusion matrix for the sample as it stands.
b. Find the adjusted misclassification rate (adjusting for the oversampling).
c. What percentage of new records would you expect to be classified as fraudulent?
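A sketch of one way to carry out the oversampling adjustment in base R follows. It assumes the 1% fraud rate in the historical database also holds for new records, and it reweights each actual class from its share of the sample to that assumed population share. The variable names are illustrative.

```r
# Sample confusion matrix from the problem statement
# Rows = predicted class, columns = actual class
sample_counts <- matrix(
  c(310, 130,
     90, 270),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("fraud", "nonfraud"),
                  actual    = c("fraud", "nonfraud"))
)

# Assumed population class shares (1% fraud, per the historical database)
prior <- c(fraud = 0.01, nonfraud = 0.99)

# Within each actual class, the share of records given each prediction
rates <- sweep(sample_counts, 2, colSums(sample_counts), "/")

# Reweight each actual class by its population share
adjusted <- sweep(rates, 2, prior, "*")

adjusted_error <- adjusted["fraud", "nonfraud"] + adjusted["nonfraud", "fraud"]
pct_flagged    <- sum(adjusted["fraud", ])   # share expected to be classified as fraud

print(adjusted_error)   # 0.324
print(pct_flagged)      # 0.3295
```

The adjusted error is dominated by the nonfraud column because nonfrauds make up 99% of the assumed population, even though they are only half of the balanced sample.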
7) Table 5.7 shows a small set of predictive model validation results for a classification model, with both actual values and propensities.
a. Calculate error rates, sensitivity, and specificity using cutoffs of 0.25, 0.5, and 0.75.
b. Create a decile-wise lift chart in R.
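Table 5.7 is not reproduced here, so the sketch below uses 20 made-up records as a stand-in; replace `actual` and `propensity` with the values from the table. The helpers `cutoff_metrics` and `decile_lift` are hypothetical names written in base R for illustration, not functions from a package.

```r
# Hypothetical stand-in for Table 5.7 (1 = positive class, 0 = negative class)
actual     <- c(1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
propensity <- (19:0) / 20            # 0.95, 0.90, ..., 0.00

# Error rate, sensitivity, and specificity at a given cutoff
cutoff_metrics <- function(actual, propensity, cutoff) {
  predicted <- as.integer(propensity >= cutoff)
  c(error       = mean(predicted != actual),
    sensitivity = sum(predicted == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(predicted == 0 & actual == 0) / sum(actual == 0))
}

# Decile-wise lift: mean response in each decile (sorted by propensity,
# highest first) divided by the overall mean response
decile_lift <- function(actual, propensity, groups = 10) {
  actual_sorted <- actual[order(propensity, decreasing = TRUE)]
  n      <- length(actual_sorted)
  decile <- ceiling(seq_len(n) * groups / n)
  as.numeric(tapply(actual_sorted, decile, mean)) / mean(actual)
}

metrics <- sapply(c(0.25, 0.5, 0.75),
                  function(k) cutoff_metrics(actual, propensity, k))
colnames(metrics) <- c("0.25", "0.5", "0.75")
print(metrics)

lift <- decile_lift(actual, propensity)
barplot(lift, names.arg = seq_along(lift),
        xlab = "Decile", ylab = "Decile mean / global mean",
        main = "Decile-wise lift chart")
```

With fewer records than groups, some deciles would be empty, so this sketch assumes at least ten records.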