Getting the Data

The data for this project come from http://groupware.les.inf.puc-rio.br/har.
Reading in the training dataset that was downloaded from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv:
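The code for this step did not survive in this copy; a minimal sketch of the read, assuming the file was saved locally and using pmlTraining as an illustrative variable name (not necessarily the original):

```r
# Read the training data; with base read.csv (R < 4.0 defaults),
# character columns become factors, which matters later
pmlTraining <- read.csv("pml-training.csv")
```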

Reading in the test set that was downloaded from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv:
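A matching sketch for the test file, with pmlTesting as an assumed variable name:

```r
# Read the 20 test cases the same way as the training data
pmlTesting <- read.csv("pml-testing.csv")
```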

Exploring the Data

The response variable is "classe" and the rest of the variables are all potential predictors of this response variable. To get an idea of the size of this dataset, here are some basic numbers:

  • the number of variables is 160
  • the number of observations in this dataset is 19622

In order to have some idea of what the response variable looks like, here is the summary of it:
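The summary output itself was lost in this copy; assuming "classe" was read in as a factor, it could be reproduced with:

```r
# Counts of the five exercise classes A-E
summary(pmlTraining$classe)
```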

 

After some further examination of the dataset, there are a few things I need to note:

  • Some of the values are missing, as in the column "skewness_yaw_belt", and some of the values are "NA", as in the column "max_roll_belt":

  • Some of the variables are factor variables with over 100 levels:
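Both observations can be checked directly; a sketch, assuming the data frame is named pmlTraining:

```r
# NA values in "max_roll_belt" vs. blank strings in "skewness_yaw_belt"
sum(is.na(pmlTraining$max_roll_belt))
sum(pmlTraining$skewness_yaw_belt == "", na.rm = TRUE)

# Number of levels of each factor column; several exceed 100
levs <- sapply(Filter(is.factor, pmlTraining), nlevels)
levs[levs > 100]
```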

 

Plotting Predictors

In order to best determine which model to choose to predict "classe", I chose to graph some of the predictors in a feature plot.
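The plotting call was not preserved; a sketch using caret's featurePlot, with the plotted variables chosen here purely for illustration:

```r
library(caret)

# Pairs-style feature plot of three belt-sensor predictors against "classe"
featurePlot(x = pmlTraining[, c("roll_belt", "pitch_belt", "yaw_belt")],
            y = pmlTraining$classe,
            plot = "pairs")
```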

[Figure: feature plot of several predictors from the Weight Lifting Exercises dataset]

To examine the feature plot more closely, I plotted many of the predictors separately; here is an example of a close-up:

 

 

[Figure: close-up of one panel of the feature plot]

To understand the strange groupings, I created a histogram of "roll_belt" and of "roll_forearm":
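A sketch of the two histograms with base graphics:

```r
# Side-by-side histograms of the two predictors
par(mfrow = c(1, 2))
hist(pmlTraining$roll_belt,    main = "roll_belt",    xlab = "roll_belt")
hist(pmlTraining$roll_forearm, main = "roll_forearm", xlab = "roll_forearm")
```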

[Figure: histograms of roll_belt and roll_forearm]

However, the graphs did not help me understand the data any better, other than to note the absence of a normal distribution. There are simply too many variables to dwell on them individually for too long.

Preprocessing the Data

The first 7 variables in the training data set are:

X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window.

I removed these from the data set since they were not relevant to predicting "classe". The removed variables included the timestamp ones as well, since I did not intend to do a time series analysis.
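A sketch of the removal, assuming the data frame is named pmlTraining:

```r
# Drop the row index, user name, timestamp and window columns (columns 1-7)
pmlTraining <- pmlTraining[, -(1:7)]
```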

 

Next, I removed all the columns with missing values from the dataset:
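One way to drop every column containing NA values (a sketch; the original code was not preserved):

```r
# Keep only the columns with no NA entries
pmlTraining <- pmlTraining[, colSums(is.na(pmlTraining)) == 0]
```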

 

Then, I found all the columns that are factors, ignoring the last column, which is the response variable "classe".

 

I then removed these columns from the data frame, since some of the machine learning algorithms cannot work with factor variables that have over 32 levels.
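The two steps above can be sketched together, assuming "classe" is the last column:

```r
# Locate factor columns among the predictors (everything but "classe") ...
isFactorCol <- sapply(pmlTraining[, -ncol(pmlTraining)], is.factor)

# ... and drop them, keeping the response (randomForest cannot handle
# categorical predictors with more than 32 levels)
pmlTraining <- pmlTraining[, c(!isFactorCol, TRUE)]
```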

Overall, I have reduced the number of predictive variables from 159 to 52.

Cross Validation Using Random Subsampling and Random Forest

I used a for loop to set up cross validation using random subsampling to fit three random forest models to random subsets of the training data, called "trainingSet". I then used these models to predict the "classe" variable of the testing subsets, called "testingSet". I was hoping for an out of sample error of less than 20%.
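The loop itself was lost in this copy; a sketch of the approach, with the 60/40 split proportion and the seed being my assumptions:

```r
library(randomForest)
set.seed(1234)

accuracy <- numeric(3)
for (i in 1:3) {
  # Random subsampling: a fresh 60/40 train/test split each iteration
  inTrain     <- sample(nrow(pmlTraining), size = 0.6 * nrow(pmlTraining))
  trainingSet <- pmlTraining[inTrain, ]
  testingSet  <- pmlTraining[-inTrain, ]

  fit  <- randomForest(classe ~ ., data = trainingSet)
  pred <- predict(fit, testingSet)
  accuracy[i] <- mean(pred == testingSet$classe)
}
mean(accuracy)  # average accuracy across the three splits
```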

 

The mean accuracy of these models turned out to be 0.9946302, which gives an estimated out-of-sample error of about 0.0054 (1 minus the accuracy), well under my 20% target.

Applying the Random Forest Model to the 20 Test Cases

I fit a random forest model to the entire training data set this time, and I used the model to predict the "classe" variable for the 20 test cases in the testing data set.
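A sketch of the final fit and prediction, assuming pmlTesting has been given the same column removals as the training set:

```r
library(randomForest)

# Fit on the full preprocessed training data, then predict the 20 test cases
finalFit <- randomForest(classe ~ ., data = pmlTraining)
predict(finalFit, pmlTesting)
```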

 

 

Prediction Using Random Forests in R - An Example
