Part 2 – Run the exercise on the imports-85 dataset and write a report, in your own words, on your findings and your interpretation of the results. The report must cover the key points below, in order.

Save the commands you run in an R script.

 

Download the imports-85.csv file to your hard drive. Click on the dataset description URL and read the dataset description.

http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names

  1. Introduction
     1. What is the dependent variable in the auto imports dataset?
     2. What are the independent variables in the auto imports dataset?
     3. What do you expect the multiple linear regression method to accomplish for the auto imports dataset?
  2. Data preprocessing
     1. Load the data into RStudio. Run the commands to remove the variables engine_type, make, num_of_cylinders, and fuel_system. Include the commands and output in the report.
     2. Discuss any additional data preprocessing commands that you run. For each command, include the command, its output, and an explanation of its purpose.
  3. Training and test data
     1. Why do we split the dataset into training and test data?
     2. Run the set.seed(1234) command and the commands to split the data into a training set containing 70% of the instances and a test set containing the remaining 30%. Include the commands in the report.
  4. Build the multiple regression model on training data
     1. Run the command to build the multiple linear regression model. Store the model in a variable called “model”. Include the command in the report, and discuss the input parameters you used.
     2. Run the summary(model) command. Include the command and the output in the report. Answer the following questions about the output.
        • How does the model represent the relationships between the dependent and independent variables in the auto imports dataset?
        • How does the method handle categorical variables?
        • What does the residuals section of the output mean?
        • What are coefficients, and what do they mean?
        • What is an intercept, and what does it mean?
        • What do the p-values tell us about the significance of each variable?
        • What is the overall accuracy of the model?
  5. Evaluate the model on a test set
     1. Run the command to evaluate the model on the test data. Include the command in the report.
     2. Run the command to build the predicted vs. actual (observed) value scatter plot. Add a diagonal line to the plot (i.e. fit a line). Include the commands and the plots in the report.
     3. What does the distance between the points and the diagonal line tell us about the accuracy of the prediction?
  6. Run the plot(model) command to build the residuals plots. Interpret at least one of the plots. Include the command, the plot, and the plot interpretation in the report.
  7. Minimal adequate model
     1. What is the minimal adequate model? Why do we build it?
     2. Run the command to build the minimal adequate model and store the model in the “Model2” variable. Include the command and the output in the report.

Model2 <- step(model, direction = "backward")

     3. Run the summary(Model2) command, and answer the following questions. Include the command, the output, and the answers in the report.
        • Which variables were eliminated and which variables remain?
        • What are the coefficients and the intercept, and what do they mean?
        • Compare the prediction accuracy of the minimal adequate model with the prediction accuracy of the original model.
  8. Suppose that we have a new car and we know the values of its independent variables. How would you use the model to predict the value of the dependent variable for the new car?
  9. Summary
     1. Is the multiple linear regression method appropriate for predicting the values of the dependent variable in this dataset? Explain why or why not.
     2. Which part of this exercise did you find the most challenging, and what approach did you take to solve the challenge?
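
For orientation only, the sequence of steps in the key points above (split, fit, evaluate, plot, reduce, predict) can be sketched on R's built-in mtcars data instead of imports-85. Everything specific to this sketch is an assumption and not part of the exercise: the mtcars data, the response mpg, the sample()-based 70/30 split, and the names cars, train, test, pred, rmse, and new_car. Only set.seed(1234), the variable names “model” and “Model2”, and the step() call come from the instructions.

```r
# Illustrative sketch on the built-in mtcars data (NOT the imports-85 data).
set.seed(1234)                          # makes the random split reproducible

cars <- mtcars
cars$am <- factor(cars$am)              # a categorical predictor; lm() dummy-codes factors

# 70/30 split by sampling row indices (one possible way to split)
n <- nrow(cars)
train_idx <- sample(n, size = round(0.7 * n))
train <- cars[train_idx, ]
test  <- cars[-train_idx, ]

# Full model: mpg explained by all remaining columns
model <- lm(mpg ~ ., data = train)
summary(model)

# Evaluate on the held-out rows
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2)) # root mean squared prediction error

# Predicted vs. actual values with the diagonal where predicted == actual
plot(pred, test$mpg, xlab = "Predicted mpg", ylab = "Actual mpg")
abline(a = 0, b = 1)

# One of the diagnostic plots: residuals vs. fitted values
plot(model, which = 1)

# Backward elimination by AIC, starting from the full model
Model2 <- step(model, direction = "backward", trace = 0)
summary(Model2)

# Prediction for a single "new" observation (here simply the first test row)
new_car <- test[1, ]
predict(Model2, newdata = new_car)
```

The same pattern carries over to imports-85 once the data is loaded and preprocessed; the formula, column names, and any handling of missing values depend on that dataset and are left to the report.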

 

Exercise Deliverables

 

Submit the following files in the Exercise 4 Assignment folder:

  • The report addressing the key points above
  • An R script with the commands you ran and brief comments on their purpose