- Downloaded the zip file from the specified link: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
- Extracted the contents into the folder "UCI HAR Dataset" in the working directory
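The two steps above could be scripted roughly as follows. This is a minimal sketch, not part of the original script: the helper name `fetch_har`, the local zip file name, and the skip-if-present logic are all assumptions made for illustration.

```r
# Hypothetical helper: download and unpack the archive only when needed.
har_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

fetch_har <- function(url = har_url,
                      zip_path = "UCI_HAR_Dataset.zip",  # assumed local file name
                      exdir = ".") {
  data_dir <- file.path(exdir, "UCI HAR Dataset")
  if (!file.exists(data_dir)) {
    if (!file.exists(zip_path)) {
      # mode = "wb" keeps the binary zip intact on Windows
      download.file(url, destfile = zip_path, mode = "wb")
    }
    unzip(zip_path, exdir = exdir)
  }
  invisible(data_dir)
}
```

Calling `fetch_har()` from the working directory leaves the data in "./UCI HAR Dataset", which is the path the read.table calls further down expect.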
The dataset provides 561 feature measurements derived from smartphone-embedded accelerometer and gyroscope signals, collected from 30 subjects performing 6 types of physical activity at a sampling rate of 50 Hz.
UCI HAR Dataset contains:
- activity_labels.txt: lists the 6 types of physical activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) against corresponding activity IDs ranging from 1 to 6
- features.txt: lists the names of the 561 types of accelerometer and gyroscope measurements (NOTE: more on features at the bottom of this document)
- test sub-folder, containing:
  - subject_test.txt: 2947 instances of IDs for 9 randomly selected members of the total 30 subjects
  - X_test.txt: 2947 instances of 561 feature measurements for the 9 subjects
  - y_test.txt: 2947 instances of one of the 6 types of physical activities
- train sub-folder, containing:
  - subject_train.txt: 7352 instances of IDs for the remaining 21 subjects
  - X_train.txt: 7352 instances of 561 feature measurements for the 21 subjects
  - y_train.txt: 7352 instances of one of the 6 types of physical activities
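Because the subject, activity, and measurement files are later bound together column-wise, their row counts have to agree. The sketch below shows one way to check that before binding; the function name `check_split` and the toy data frames are hypothetical and do not appear in the original script.

```r
# Hypothetical sanity check: the three files of a split must line up row by row.
check_split <- function(subject, y, x, n_features = 561) {
  stopifnot(
    nrow(subject) == nrow(y),   # one subject ID per activity label
    nrow(y) == nrow(x),         # one activity label per measurement row
    ncol(x) == n_features       # one column per feature
  )
  invisible(TRUE)
}

# Toy data standing in for subject_test / Y_test / X_test
toy_subject <- data.frame(subject_ID = c(2, 2, 4))
toy_y       <- data.frame(activity_ID = c(5, 1, 3))
toy_x       <- data.frame(V1 = c(0.1, 0.2, 0.3), V2 = c(0.4, 0.5, 0.6))
check_split(toy_subject, toy_y, toy_x, n_features = 2)
```

With the real files, the same call with the default `n_features = 561` would be applied once to the test files and once to the train files.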
The following steps were taken, encapsulated in an R script:
# Load plyr before dplyr so that dplyr's summarize() is not masked by plyr's
library(plyr)
library(dplyr)
library(reshape2)
activity <- read.table("./UCI HAR Dataset/activity_labels.txt", stringsAsFactors = FALSE)
names(activity) <- c("activity_ID", "activity_label")
subject_test <- read.table("./UCI HAR Dataset/test/subject_test.txt")
names(subject_test) <- "subject_ID"
Y_test <- read.table("./UCI HAR Dataset/test/y_test.txt")
names(Y_test) <- "activity_ID"
X_test <- read.table("./UCI HAR Dataset/test/X_test.txt")
1.1.4 Then name the X_test columns using the 561 feature labels read from features.txt
features <- read.table("./UCI HAR Dataset/features.txt", stringsAsFactors = FALSE)
names(features) <- c("feature_ID", "feature_label")
names(X_test) <- features$feature_label
test_Data <- cbind(subject_test, Y_test, X_test)
subject_train <- read.table("./UCI HAR Dataset/train/subject_train.txt")
names(subject_train) <- "subject_ID"
Y_train <- read.table("./UCI HAR Dataset/train/y_train.txt")
names(Y_train) <- "activity_ID"
X_train <- read.table("./UCI HAR Dataset/train/X_train.txt")
names(X_train) <- features$feature_label
train_Data <- cbind(subject_train, Y_train, X_train)
combined_Data <- rbind(test_Data, train_Data)
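The cbind and rbind calls above first widen each split and then stack the two splits. A toy illustration of that shape change, using made-up values and hypothetical names (`part_a`, `part_b`) rather than the real data:

```r
# Toy stand-ins for one split: IDs, labels, and two feature columns
subject_a <- data.frame(subject_ID = c(2, 2))
y_a       <- data.frame(activity_ID = c(5, 1))
x_a       <- data.frame(f1 = c(0.1, 0.2), f2 = c(0.3, 0.4))
part_a    <- cbind(subject_a, y_a, x_a)   # 2 rows, 4 columns

# A second split with identical column names
part_b <- data.frame(subject_ID  = c(1, 1, 3),
                     activity_ID = c(4, 6, 2),
                     f1 = c(0.5, 0.6, 0.7),
                     f2 = c(0.8, 0.9, 1.0))

# rbind matches data frame columns by name, so both parts need the same names
combined <- rbind(part_a, part_b)         # 5 rows, 4 columns
dim(combined)
```

This is why both X_test and X_train are given the same feature labels before the splits are combined.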
retainList <- grep("mean|std", features$feature_label, value = TRUE)
retainColIndex <- match(retainList, names(combined_Data))
extract_Data <- combined_Data[, c(1, 2, retainColIndex)]
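It may help to see what the pattern "mean|std" does and does not match. The sketch below runs it on a handful of feature labels written in the dataset's naming style; the vector `toy_features` and the stricter alternative pattern are illustrative assumptions, not part of the original script.

```r
toy_features <- c("tBodyAcc-mean()-X",
                  "tBodyAcc-std()-X",
                  "fBodyAcc-meanFreq()-X",
                  "angle(tBodyAccMean,gravity)",
                  "tBodyAcc-mad()-X")

# Pattern used in this document: case-sensitive, so "Mean" inside angle() is
# skipped, while meanFreq() is kept because it contains "mean"
kept <- grep("mean|std", toy_features, value = TRUE)

# A stricter alternative (NOT what the script uses) that keeps only
# mean() and std() and drops meanFreq()
strict <- grep("-(mean|std)\\(\\)", toy_features, value = TRUE)

kept
strict
```

The broader pattern is the one that matters for the rest of this document, since the later steps refer to the measurement columns it selects.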
3.1 Join the extracted data set with the activity data set. This appends an activity_label column to the end of extract_Data.
extract_Data <- join(extract_Data, activity, by = "activity_ID")
# Move activity_label from the last column to the third column
lastCol <- ncol(extract_Data)
lastbutoneCol <- lastCol - 1
extract_Data <- extract_Data[, c(1, 2, lastCol, 3:lastbutoneCol)]
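For readers without plyr installed, the same labelling and reordering can be sketched in base R with a lookup via match(). This is an alternative shown for illustration only, on toy data; the script itself uses join() as above.

```r
# Toy lookup table and observations (values are made up)
activity_toy <- data.frame(activity_ID = 1:3,
                           activity_label = c("WALKING",
                                              "WALKING_UPSTAIRS",
                                              "WALKING_DOWNSTAIRS"),
                           stringsAsFactors = FALSE)
obs <- data.frame(subject_ID  = c(1, 1, 2),
                  activity_ID = c(3, 1, 3),
                  f1 = c(0.1, 0.2, 0.3))

# match() returns, for each observation, the row of the lookup table to use;
# the row order of obs is preserved
obs$activity_label <- activity_toy$activity_label[
  match(obs$activity_ID, activity_toy$activity_ID)]

# Reorder so the label sits next to its ID, as in the step above
obs <- obs[, c("subject_ID", "activity_ID", "activity_label", "f1")]
obs
```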
Descriptive variable names were already applied in steps 1.1.4 and 1.2.4 above, where the column names were set from the labels in "features.txt".
Step 5. From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject.
5.1 Melt extract_Data so that all feature measurements (columns 4 to 82) are stacked into a single value column, keyed by columns 1 to 3, i.e. subject ID, activity ID, and activity label
melt_Data <- melt(extract_Data, id=1:3, measure.vars=4:82)
group_Data <- group_by(melt_Data, subject_ID, activity_ID, activity_label, variable)
# Call dplyr's summarize explicitly: plyr's summarize ignores the grouping
avg_Data <- dplyr::summarize(group_Data, average = mean(value, na.rm = TRUE))
write.table(avg_Data, file = "tidy_Data.txt", row.names = FALSE)
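The melt, group, and summarize steps amount to computing one mean per subject, activity, and variable. The base R sketch below reproduces that idea on toy data with aggregate(); it is an illustrative equivalent under made-up values, not the code the script runs.

```r
# Toy wide data: two feature columns, made-up values
toy <- data.frame(subject_ID     = c(1, 1, 1, 2, 2),
                  activity_label = c("WALKING", "WALKING", "SITTING",
                                     "WALKING", "WALKING"),
                  f1 = c(0.2, 0.4, 0.1, 0.5, 0.7),
                  f2 = c(1, 3, 5, 2, 4))

# Long form: one row per (observation, variable), like the melted data
long <- data.frame(subject_ID     = rep(toy$subject_ID, times = 2),
                   activity_label = rep(toy$activity_label, times = 2),
                   variable       = rep(c("f1", "f2"), each = nrow(toy)),
                   value          = c(toy$f1, toy$f2))

# One average per subject, activity, and variable
avg <- aggregate(value ~ subject_ID + activity_label + variable,
                 data = long, FUN = mean)
names(avg)[names(avg) == "value"] <- "average"
avg
```

The toy result has 6 rows: 3 subject and activity combinations times 2 variables. By the same arithmetic, if every one of the 30 subjects has observations for all 6 activities, the real output would have 30 x 6 x 79 = 14220 rows. The written file can be read back with read.table("tidy_Data.txt", header = TRUE).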
===================================================== ABOUT FEATURES
The features selected for this database come from the accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix 't' to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. Similarly, the acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using another low pass Butterworth filter with a corner frequency of 0.3 Hz.
Subsequently, the body linear acceleration and angular velocity were differentiated with respect to time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). Also, the magnitudes of these three-dimensional signals were calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag).
Finally, a Fast Fourier Transform (FFT) was applied to some of these signals, producing fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccMag, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag (note the 'f' to indicate frequency domain signals).
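As a rough illustration of the two derived quantities described above, the sketch below computes a Euclidean magnitude and a finite-difference jerk on a made-up 3-axial signal. The finite-difference approximation and the sample values are assumptions; the dataset authors' exact procedure is not specified here.

```r
fs <- 50  # sampling rate in Hz, as stated in this document

# Made-up 3-axial samples (five consecutive readings)
acc <- data.frame(x = c(0, 3, 0, 0, 1),
                  y = c(0, 4, 0, 2, 2),
                  z = c(1, 0, 0, 0, 2))

# Magnitude: Euclidean norm of each (x, y, z) sample
mag <- sqrt(acc$x^2 + acc$y^2 + acc$z^2)

# Jerk along x: change between consecutive samples divided by the
# sample spacing 1 / fs (a simple finite difference, assumed here)
jerk_x <- diff(acc$x) * fs

mag     # 1 5 0 2 3
jerk_x  # 150 -150 0 50
```

Note that diff() returns one value fewer than its input, so a jerk series built this way is one sample shorter than the signal it comes from.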
These signals were used to estimate variables of the feature vector for each pattern:
'-XYZ' is used to denote 3-axial signals in the X, Y and Z directions.
- tBodyAcc-XYZ
- tGravityAcc-XYZ
- tBodyAccJerk-XYZ
- tBodyGyro-XYZ
- tBodyGyroJerk-XYZ
- tBodyAccMag
- tGravityAccMag
- tBodyAccJerkMag
- tBodyGyroMag
- tBodyGyroJerkMag
- fBodyAcc-XYZ
- fBodyAccJerk-XYZ
- fBodyGyro-XYZ
- fBodyAccMag
- fBodyAccJerkMag
- fBodyGyroMag
- fBodyGyroJerkMag
The set of variables that was estimated from these signals is:
- mean(): Mean value
- std(): Standard deviation
- mad(): Median absolute deviation
- max(): Largest value in array
- min(): Smallest value in array
- sma(): Signal magnitude area
- energy(): Energy measure. Sum of the squares divided by the number of values.
- iqr(): Interquartile range
- entropy(): Signal entropy
- arCoeff(): Autoregression coefficients with Burg order equal to 4
- correlation(): correlation coefficient between two signals
- maxInds(): index of the frequency component with largest magnitude
- meanFreq(): Weighted average of the frequency components to obtain a mean frequency
- skewness(): skewness of the frequency domain signal
- kurtosis(): kurtosis of the frequency domain signal
- bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window.
- angle(): Angle between two vectors.
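A few of the variables listed above have short closed-form definitions. The sketch below writes them out in R for a single window of samples; the function names are hypothetical, and details such as whether the dataset's mad() is scaled are assumptions.

```r
# energy(): sum of the squares divided by the number of values
energy <- function(x) sum(x^2) / length(x)

# mad(): median absolute deviation; constant = 1 gives the unscaled form
# (R's default rescales by 1.4826, which is assumed not to apply here)
mad_raw <- function(x) mad(x, constant = 1)

# iqr(): interquartile range, via the stats function IQR()
iqr <- function(x) IQR(x)

# angle(): angle in radians between two vectors, from the dot product
angle_between <- function(a, b) {
  acos(sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))))
}

window <- c(1, 2, 3)
energy(window)                              # 14 / 3
mad_raw(c(1, 2, 3, 4, 100))                 # 1
angle_between(c(1, 0, 0), c(0, 1, 0))       # pi / 2
```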
Additional vectors were obtained by averaging the signals in a signal window sample. These are used in the angle() variable:
- gravityMean
- tBodyAccMean
- tBodyAccJerkMean
- tBodyGyroMean
- tBodyGyroJerkMean