Wednesday, February 11, 2015

Getting the Data

The data for this project come from:

Reading in the training dataset that was downloaded from


Preprocessing the Data

The first 7 variables in the training data set are:

X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window,

I removed these from the data set since they are not relevant to predicting "classe". That includes the timestamp variables, since I did not intend to do a time series analysis.
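As a rough sketch of this step, dropping the first seven columns by position could look like the following. The `training` data frame here is a tiny made-up stand-in, and `sensor_a` is a hypothetical predictor name, not a column from the real data set.

```r
# Toy stand-in for the training data (assumption: the real file is much wider).
training <- data.frame(
  X = 1:4,
  user_name = c("a", "b", "a", "b"),
  raw_timestamp_part_1 = c(10L, 11L, 12L, 13L),
  raw_timestamp_part_2 = c(1L, 2L, 3L, 4L),
  cvtd_timestamp = c("t1", "t2", "t3", "t4"),
  new_window = c("no", "no", "yes", "no"),
  num_window = c(1L, 1L, 2L, 2L),
  sensor_a = c(1.4, 1.5, 1.3, 1.6),         # hypothetical sensor reading
  classe = factor(c("A", "B", "A", "C"))    # response variable
)

# Drop the first seven bookkeeping columns by position.
training <- training[, -(1:7)]
names(training)
```

Dropping by position only works if the column order matches the list above; selecting by name would be the safer choice if the order is not guaranteed.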


Next, I removed all the columns with missing values from the dataset:
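One way to do this in base R is sketched below, assuming missing values are coded as `NA`. The data frame and its column names are made up for illustration.

```r
# Toy data: sensor_b has a missing value, the other columns are complete.
df <- data.frame(
  sensor_a = c(1.0, 2.0, 3.0),
  sensor_b = c(NA, 5.0, 6.0),
  classe = factor(c("A", "B", "A"))
)

# Count the NAs in each column and keep only the columns with none.
complete_cols <- colSums(is.na(df)) == 0
df <- df[, complete_cols, drop = FALSE]
names(df)
```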


Then, I found all the columns that are factors, while ignoring the last column which was the response variable "classe."


I then removed these columns from the data frame, since some of the machine learning algorithms cannot work with factor variables that have over 32 levels.
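The two steps above (flag the factor columns, but spare the response) might be sketched like this. Again, the data frame is a toy example and `sensor_text` is a hypothetical column name.

```r
# Toy data: sensor_text was read in as a factor, classe is the response.
df <- data.frame(
  sensor_a = c(1.0, 2.0, 3.0),
  sensor_text = factor(c("x", "y", "z")),
  classe = factor(c("A", "B", "A"))
)

# Flag every factor column, then un-flag the last column (the response).
is_factor <- sapply(df, is.factor)
is_factor[ncol(df)] <- FALSE

# Keep everything that is not a flagged factor column.
df <- df[, !is_factor, drop = FALSE]
names(df)
```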


Overall, I have reduced the number of predictive variables from 159 to 52.

Random Forests




To find the most important predictors, let's look at modelFit$importance. Random forests have several different ways of looking at variable importance. See:
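For readers who want to try this out, here is a minimal sketch of pulling variable importance out of a fitted forest. It assumes the randomForest package and uses the built-in iris data as a stand-in for the weightlifting data; `fit` is not the `modelFit` object from this post.

```r
library(randomForest)

set.seed(1)  # random forests are stochastic, so fix the seed

# Fit on a stand-in data set; importance = TRUE also computes the
# permutation-based (accuracy) measures, not just the Gini one.
fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)

# One row per predictor, one column per importance measure.
imp <- importance(fit)
colnames(imp)

# Rank the predictors by mean decrease in Gini impurity.
ranked <- sort(imp[, "MeanDecreaseGini"], decreasing = TRUE)
ranked
```

`varImpPlot(fit)` from the same package draws these rankings as a dot chart.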













So, let's fit a model using only the "important" predictors and guesstimate its accuracy by trying it out on the test set.
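The idea can be sketched as follows, once more with the randomForest package and iris as a stand-in. The 100/50 split and the choice of the top two predictors are arbitrary assumptions for this illustration, not the values used in this post.

```r
library(randomForest)

set.seed(1)

# Hold out part of the data to play the role of the test set.
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

# Fit on all predictors first and rank them by Gini importance.
full <- randomForest(Species ~ ., data = train, importance = TRUE)
imp <- importance(full)[, "MeanDecreaseGini"]
top <- names(sort(imp, decreasing = TRUE))[1:2]

# Refit using only the top-ranked predictors.
reduced <- randomForest(x = train[, top], y = train$Species)

# Accuracy on the held-out rows: the share of correct predictions.
pred <- predict(reduced, newdata = test[, top])
accuracy <- mean(pred == test$Species)
accuracy
```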








Hmmm, would a single decision tree work just as well as using random forests? Let's see!

Decision Trees






So, how accurate is this classification tree on the testing set? Let's see:
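A check like this could be sketched with the rpart package (an assumption on my part; other tree packages would work similarly), using iris as stand-in data and an arbitrary 100/50 split.

```r
library(rpart)

set.seed(1)

# Hold out part of the data to play the role of the testing set.
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

# Fit a single classification tree on the training rows.
tree <- rpart(Species ~ ., data = train, method = "class")

# Predict class labels for the held-out rows.
pred <- predict(tree, newdata = test, type = "class")

# Accuracy is the share of correct predictions; the table shows the confusions.
accuracy <- mean(pred == test$Species)
confusion <- table(predicted = pred, actual = test$Species)
accuracy
confusion
```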




I guess I will be sticking to random forests for now...

Weightlifting Exercises Dataset Revisited