Wednesday, February 11, 2015

Getting the Data

The data for this project come from http://groupware.les.inf.puc-rio.br/har.

I read in the training data set, which was downloaded from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv:
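A minimal sketch of this step in R, not the original code. The helper name read_training and the local file name are assumptions for illustration. stringsAsFactors = TRUE is set explicitly because R 4.0 and later no longer convert strings to factors by default, and the later preprocessing steps rely on factor columns being present.

```r
# Hypothetical helper: read the training CSV, keeping text columns as factors.
read_training <- function(path) {
  read.csv(path, stringsAsFactors = TRUE)
}

# Usage, assuming the file was saved to the working directory under this name:
# training <- read_training("pml-training.csv")
# dim(training)
```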

 

Preprocessing the Data

The first 7 variables in the training data set are:

X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window

I removed these from the data set since they are not relevant to predicting "classe". This includes the timestamp variables, since I did not intend to do a time series analysis.
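Dropping the leading columns could be sketched like this in base R. The function name drop_leading_columns and the two-row toy data frame are made up for illustration; only the seven column names come from the list above.

```r
# Hypothetical helper: drop the first n columns of a data frame.
drop_leading_columns <- function(df, n = 7) {
  if (n <= 0) return(df)  # nothing to drop
  df[, -seq_len(n), drop = FALSE]
}

# Toy stand-in for the training data (values are invented).
toy <- data.frame(
  X = 1:2, user_name = c("a", "b"),
  raw_timestamp_part_1 = 1:2, raw_timestamp_part_2 = 3:4,
  cvtd_timestamp = c("t1", "t2"), new_window = c("no", "no"),
  num_window = 11:12, roll_belt = c(1.41, 1.42),
  classe = factor(c("A", "B"))
)

trimmed <- drop_leading_columns(toy)
print(names(trimmed))
```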

 

Next, I removed all the columns with missing values from the dataset:
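One way to express this step in base R is shown below. The function name drop_na_columns and the toy data are assumptions, not the original code.

```r
# Hypothetical helper: keep only the columns that contain no missing values.
drop_na_columns <- function(df) {
  has_na <- colSums(is.na(df)) > 0
  df[, !has_na, drop = FALSE]
}

# Toy example with invented values: the middle column has an NA and is dropped.
toy_na <- data.frame(
  roll_belt = c(1.4, 1.5),
  max_roll_belt = c(NA, 2),
  classe = factor(c("A", "B"))
)

complete_cols <- drop_na_columns(toy_na)
print(names(complete_cols))
```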

 

Then, I found all the columns that are factors, while ignoring the last column which was the response variable "classe."

 

I then removed these columns from the data frame, since some of the machine learning algorithms cannot work with factor variables that have over 32 levels.
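The two steps above (finding the factor predictors, then dropping them) could look roughly like this in base R. The function names and the toy column messy_col are hypothetical, and the sketch assumes the response is the last column, as described.

```r
# Hypothetical helper: names of factor columns, ignoring the last (response) column.
factor_predictors <- function(df) {
  predictors <- df[, -ncol(df), drop = FALSE]
  names(predictors)[vapply(predictors, is.factor, logical(1))]
}

# Hypothetical helper: drop those factor predictors but keep the response.
drop_factor_predictors <- function(df) {
  df[, !(names(df) %in% factor_predictors(df)), drop = FALSE]
}

# Toy example with invented values.
toy_factors <- data.frame(
  roll_belt = c(1.4, 1.5, 1.6),
  messy_col = factor(c("x", "y", "z")),
  classe = factor(c("A", "B", "A"))
)

print(factor_predictors(toy_factors))
cleaned <- drop_factor_predictors(toy_factors)
print(names(cleaned))
```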

 

Overall, this reduced the number of predictor variables from 159 to 52.

Modeling
Random Forests

 

 

 

To find the most important predictors, let's look at modelFit$importance. Random forests offer four different ways of measuring variable importance. See: https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
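As a rough illustration of inspecting variable importance, here is a sketch that assumes the randomForest package is installed and uses the built-in iris data as a stand-in for the weightlifting data. It is not the model from this post. With importance = TRUE, the package reports a permutation-based mean decrease in accuracy and a mean decrease in Gini impurity for each predictor.

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(42)  # arbitrary seed, only so the run is repeatable

# iris stands in for the weightlifting data; Species plays the role of "classe".
fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 200)

# One row per predictor; the columns include MeanDecreaseAccuracy
# and MeanDecreaseGini.
imp <- importance(fit)
print(round(imp, 2))

# Rank the predictors by mean decrease in Gini impurity.
ranked <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)]
print(ranked)
```

Calling varImpPlot(fit) draws the same information as dot charts.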

[Figure: WLExercisesMeasure1]

[Figure: WLExercisesMeasure2]

[Figure: WLExercisesMeasure3]

[Figure: WLExercisesMeasure4]

 

 

 

 

 

So, let's fit a model using only the "important" predictors and guesstimate its accuracy by trying it out on the test set.
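Picking the "important" predictors from an importance matrix might be sketched as follows. The function name top_predictors, the cutoff k, the choice of the MeanDecreaseGini column, and the toy matrix are all assumptions; the post does not state which measure or how many predictors were kept.

```r
# Hypothetical helper: names of the k highest-ranked predictors for one measure.
top_predictors <- function(imp, measure = "MeanDecreaseGini", k = 10) {
  ranked <- imp[order(imp[, measure], decreasing = TRUE), , drop = FALSE]
  head(rownames(ranked), k)
}

# Toy importance matrix with invented predictor names and values.
toy_imp <- matrix(
  c(5, 20, 10),
  ncol = 1,
  dimnames = list(c("pred_a", "pred_b", "pred_c"), "MeanDecreaseGini")
)

keep <- top_predictors(toy_imp, k = 2)
print(keep)

# The reduced training set would then be something like:
# training_small <- training[, c(keep, "classe")]
```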

 

 

 

 

 

 

 

Hmmm, would a single decision tree work just as well as using random forests? Let's see!

Decision Trees
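A single classification tree can be fitted in a few lines. The sketch below uses the rpart package and the built-in iris data purely as stand-ins; the post does not say which tree package was used, so both choices are assumptions, as are the seed and the 70/30 split.

```r
library(rpart)  # assumption: rpart is available (it ships with most R installations)

set.seed(42)  # arbitrary seed, only so the split is repeatable

# Hold out 30% of the rows as a test set.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit one classification tree and predict class labels for the held-out rows.
tree_fit <- rpart(Species ~ ., data = train, method = "class")
tree_pred <- predict(tree_fit, newdata = test, type = "class")

print(table(tree_pred))
```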

 

 

[Figure: WLExercisesSizeOfTree]

[Figure: ClassificationTree4WLExercisesData]

So, how accurate is this classification tree on the testing set? Let's see:
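Accuracy on a test set is the share of predictions that match the true labels, which can be read off the diagonal of a confusion table. A base R sketch with invented labels (not results from this data set):

```r
# Invented predictions and true labels; both factors share the same levels
# so that the confusion table is square.
lvls <- c("A", "B", "C")
predicted <- factor(c("A", "B", "C", "A"), levels = lvls)
actual <- factor(c("A", "B", "A", "A"), levels = lvls)

# Rows are predicted classes, columns are actual classes.
confusion <- table(predicted = predicted, actual = actual)
print(confusion)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```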

 

 

 

I guess I will be sticking to random forests for now...

Weightlifting Exercises Dataset Revisited