CP 420 BriannaJuict Computer Programming Exercise
Q1: Read these two data sets into R. Use the variable names of the summary
statistics from the file “crime-data-info.txt” as the 6th to 147th column names of
the data set “CrimeData”. The first five column names of the data set “CrimeData”
are, respectively, “State”, “County”, “CommCode”, “CommName” and “fold”.
Find the feature variables in “CrimeData” that have missing values and delete the
columns with missing values. Also delete the columns corresponding to “State”,
“County”, “CommCode” and “fold”. The resulting data set is a complete data set
without missing values (name it “CompleteCrimeData”), which has 101
columns. Among them, the first column is the community name and the last column
is the target variable: the total number of violent crimes per 100K population.
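The column handling in Q1 can be sketched on a toy data frame. This is only an illustration in base R: the eight columns, their values, and the three feature names below are made up for the example, and the commented `read.csv` call assumes a comma-separated file that codes missing values as "?", which may not match the files you were given.

```r
# In practice the raw data would be read from disk, for example:
#   CrimeData <- read.csv("<data file>", header = FALSE, na.strings = "?")
# (file name and "?" missing-value code are assumptions).
# Toy stand-in: 5 identifier columns followed by 3 feature columns.
CrimeData <- data.frame(
  V1 = c(1, 2, 3),
  V2 = c(NA, 5, 7),
  V3 = c(NA, 100, 200),
  V4 = c("TownA", "TownB", "TownC"),
  V5 = c(1, 1, 2),
  V6 = c(0.1, 0.2, 0.3),
  V7 = c(NA, 0.5, 0.6),
  V8 = c(0.9, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# Illustrative feature names; the real ones come from "crime-data-info.txt".
feature_names <- c("population", "PolicPerPop", "ViolentCrimesPerPop")
names(CrimeData) <- c("State", "County", "CommCode", "CommName", "fold",
                      feature_names)

# Flag columns containing at least one missing value.
has_na <- colSums(is.na(CrimeData)) > 0

# Flag the identifier columns that must be removed regardless.
is_id <- names(CrimeData) %in% c("State", "County", "CommCode", "fold")

# Keep everything that is neither incomplete nor an unwanted identifier.
CompleteCrimeData <- CrimeData[, !(has_na | is_id), drop = FALSE]

print(names(CrimeData)[has_na])   # columns that had missing values
print(dim(CompleteCrimeData))
```

On the real data, `dim(CompleteCrimeData)` is a quick check against the 101 columns stated in the question.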
Q2: Use the first 1500 rows (communities) as the training data set and the last
494 rows as the test data set. Fit a linear regression using all the 99 feature
variables with the target variable as the response, and estimate the coefficients in
the linear regression model using Least Angle Regression (LAR) for a sequence of
tuning parameters. Plot the solution paths of all the LAR coefficient estimators.
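A minimal sketch of the LAR fit in Q2, assuming the `lars` package is installed. The data here are simulated (200 rows, 8 features) purely so the example runs on its own; with the crime data, `x_train` would be the 99 feature columns of the first 1500 rows as a numeric matrix and `y_train` the target column.

```r
library(lars)  # assumes the 'lars' package is installed

set.seed(1)
n <- 200
p <- 8

# Simulated features and response (stand-ins for the crime data).
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(x[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

# First 150 rows play the role of the training set.
train   <- 1:150
x_train <- x[train, ]
y_train <- y[train]

# Least Angle Regression over the whole sequence of steps.
fit_lar <- lars(x_train, y_train, type = "lar")

# Solution paths of all coefficients.
plot(fit_lar)

# Coefficient matrix: one row per step, starting from all zeros.
b_lar <- coef(fit_lar)
print(dim(b_lar))
```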
Q3: Based on the LAR estimator in Q2, if one would like to obtain a LASSO
estimate of coefficients in the above linear regression, could you specify the
smallest tuning parameter that would make the LAR estimator and LASSO
estimator different?
Q4: Compare the LAR estimator with the LASSO estimator via plots of the
entire solution paths, and identify the portion of the solution paths where these
two estimators are the same. Which feature variable has different solution paths
at the tuning parameter you found in Q3? In terms of computational complexity,
how many more steps does the LASSO estimator take than the LAR
estimator?
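One way to approach Q3 and Q4 is to fit both paths with `lars` and compare the coefficient matrices row by row. The sketch below uses the `diabetes` example data shipped with the `lars` package as a stand-in for the crime data, so the numbers it prints are not answers to the questions. The reading of `$lambda` as the tuning parameter at each step is an assumption about how you choose to parameterize the path.

```r
library(lars)   # assumes the 'lars' package is installed
data(diabetes)  # example data bundled with 'lars'; a stand-in only

x <- diabetes$x
y <- diabetes$y

fit_lar   <- lars(x, y, type = "lar")
fit_lasso <- lars(x, y, type = "lasso")

# Coefficient matrices: one row per step, row 1 is the all-zero start.
b_lar   <- coef(fit_lar)
b_lasso <- coef(fit_lasso)

# First row at which the two paths disagree (NA if they never do).
common  <- seq_len(min(nrow(b_lar), nrow(b_lasso)))
differs <- sapply(common,
                  function(k) max(abs(b_lar[k, ] - b_lasso[k, ])) > 1e-8)
first_diff <- which(differs)[1]

# Value stored by lars (largest absolute correlation) at that row;
# treated here as the tuning parameter where the paths part.
lambda_split <- fit_lasso$lambda[first_diff]

# LASSO records variables dropped from the active set as negative moves.
moves   <- unlist(fit_lasso$actions)
dropped <- moves[moves < 0]

# Extra steps taken by LASSO relative to LAR.
extra_steps <- nrow(b_lasso) - nrow(b_lar)

# Side-by-side solution paths for visual comparison.
op <- par(mfrow = c(1, 2))
plot(fit_lar)
plot(fit_lasso)
par(op)

print(first_diff)
print(lambda_split)
print(dropped)
print(extra_steps)
```

The two paths are identical on all rows before `first_diff`; the names attached to `dropped` indicate which variables left the active set in the LASSO fit.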
Q5: The LASSO estimator depends on the tuning parameter. Different tuning
parameters produce different estimators with different numbers of nonzero
coefficients. Use cross-validation to choose the tuning parameter for the
LASSO estimator and identify the tuning parameter that minimizes the
mean squared error of the predictions.
Q6: Based on the tuning parameter chosen in Q5, predict the target variable using
the test data set given in Q2. Find the sum of squared errors (SSE) of the
prediction using the LASSO estimator.
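Q5 and Q6 can be sketched with `cv.lars` and `predict` from the `lars` package. The example runs on simulated data, and it indexes the path by the fraction |beta|_1 / max |beta|_1 rather than by lambda; that parameterization is a choice made for this sketch, not something the exercise prescribes.

```r
library(lars)  # assumes the 'lars' package is installed

set.seed(3)
n <- 300
p <- 12

# Simulated stand-in for the crime data.
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(x[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

train <- 1:200
test  <- 201:300

# Ten-fold cross-validation over a grid of fractions in [0, 1].
cv <- cv.lars(x[train, ], y[train], K = 10, type = "lasso",
              mode = "fraction", plot.it = FALSE)

# Fraction with the smallest cross-validated mean squared error.
best_s <- cv$index[which.min(cv$cv)]

# Refit on the full training set and predict the held-out rows.
fit_lasso <- lars(x[train, ], y[train], type = "lasso")
pred <- predict(fit_lasso, newx = x[test, ], s = best_s,
                mode = "fraction", type = "fit")$fit

# Sum of squared prediction errors on the test set.
sse_lasso <- sum((y[test] - pred)^2)

# Coefficients at the chosen fraction and the features with nonzero values.
beta_hat <- coef(fit_lasso, s = best_s, mode = "fraction")
selected <- which(beta_hat != 0)

print(best_s)
print(sse_lasso)
print(selected)
```

The indices in `selected` are the features that would be carried forward to a ridge fit restricted to the LASSO-selected variables.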
Q7: Use the first 1500 rows (communities) as the training data set and the last
494 rows as the test data set. Fit a linear regression using the feature variables
with non-zero coefficients selected by LASSO in Q6 with the target variable as the
response, and estimate the coefficients in the linear regression model using the ridge
estimator for a sequence of tuning parameters. Plot the solution paths of all the
ridge coefficient estimators.
Q8: Apply ten-fold cross-validation to the training data set in Q7 and
find the tuning parameter that minimizes the prediction error. Using the tuning
parameter chosen by ten-fold cross-validation, predict the target variable in
the test data set and evaluate the SSE of the predictions. Compare the SSE
given by LASSO with the SSE given by ridge regression.
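The ridge steps in Q7 and Q8 can be sketched with the `glmnet` package, where `alpha = 0` gives the ridge penalty. The data are simulated, and the `selected` vector below is a hypothetical set of LASSO-selected features used only to make the example self-contained.

```r
library(glmnet)  # assumes the 'glmnet' package is installed

set.seed(4)
n <- 300
p <- 12

# Simulated stand-in for the crime data.
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(x[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

train <- 1:200
test  <- 201:300

# Hypothetical features with nonzero LASSO coefficients.
selected <- c("x1", "x2", "x3", "x5")
x_sel <- x[, selected]

# Ridge fit over glmnet's default sequence of tuning parameters.
fit_ridge <- glmnet(x_sel[train, ], y[train], alpha = 0)

# Solution paths of the ridge coefficients against log(lambda).
plot(fit_ridge, xvar = "lambda", label = TRUE)

# Ten-fold cross-validation on the training set.
cv_ridge   <- cv.glmnet(x_sel[train, ], y[train], alpha = 0, nfolds = 10)
lambda_min <- cv_ridge$lambda.min

# Predict the test rows at the selected lambda and compute the SSE.
pred <- predict(cv_ridge, newx = x_sel[test, ], s = "lambda.min")
sse_ridge <- sum((y[test] - pred)^2)

print(lambda_min)
print(sse_ridge)
```

Comparing `sse_ridge` with the LASSO test SSE computed on the same test rows gives the comparison asked for in Q8.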
We will apply ridge estimation and LASSO methods to a crime
data set available at
http://archive.ics.uci.edu/ml/datasets/communities+and+crime. The data set
contains socio-economic data from three sources: the 1990 US Census, law
enforcement data from the 1990 US LEMAS survey, and crime data from the 1995
FBI UCR. There are 127 attributes including 122 predictive feature variables and 5
non-predictive attributes in the data set. These 122 attributes are considered to
be related to crime. There are two types of feature variables included in the data
set:
1. Community-related survey data: such as the percent of the population
considered urban, and the median family income;
2. Law enforcement data: such as per capita number of police officers, and
percent of officers assigned to drug units.