Building a classifier to predict activity category of a human

Levi Matheri
Published in DataDrivenInvestor
4 min read · Nov 30, 2018


According to the UCI Machine Learning Repository, the Human Activity Recognition data set is derived from recordings of 30 subjects going about their day-to-day activities while carrying a waist-mounted smartphone with embedded inertial sensors. Each person performed six activities, namely: walking, walking upstairs, walking downstairs, sitting, standing and laying. The procedure for obtaining the data from the sensors, the data and its description, and the pre-processing are all documented on the data set's page in the repository.

I built a multi-class classifier using a Support Vector Machine (SVM). An SVM seeks a decision boundary that separates the classes with the largest possible margin, that is, the largest distance to the nearest training point of any class. I will now walk through my approach.
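
To make the idea of a margin concrete, here is a minimal sketch on a made-up, linearly separable 2-D toy set (not the activity data). It assumes a very large `C` as a stand-in for a hard-margin SVM and reads the margin width off the learned weight vector as 2 / ||w||.

```python
import numpy as np
from sklearn import svm

# Toy 2-D points, invented purely for illustration.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = svm.SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# For a linear kernel the boundary is w.x + b = 0 and the
# distance between the two margin lines is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points closest to the boundary (the support vectors) determine where it sits; for this toy set the width should come out close to 5/sqrt(2), the geometric gap between the two groups.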

Using Google Colab, we open a new Python 3 notebook. The data files can be uploaded to Google Drive and accessed from Google Colab. We make them available to our notebook by mounting the drive with the following lines of code.

from google.colab import drive
drive.mount('/content/gdrive')

The notebook will then ask us to follow a link where we authorize Google Drive File Stream to access our Google Drive. A token will be created, which we copy into the text input provided in the notebook. Once this is done, we should see Mounted at /content/gdrive.

We will now load our training and testing data, which come already split into separate files. First of all, we obtain our feature list from features.txt thus:

with open('/content/gdrive/My Drive/features.txt') as feature_file:
    # one feature name per line; strip the surrounding whitespace from each
    names = [line.strip() for line in feature_file]

Then load the training and test sets:

import pandas as pd

# feature matrices are whitespace-delimited and have no header row
X_train = pd.read_csv('/content/gdrive/My Drive/X_train.txt', header=None, delimiter=r"\s+", names=names)
X_test = pd.read_csv('/content/gdrive/My Drive/X_test.txt', header=None, delimiter=r"\s+", names=names)
# label files hold one activity id per line
y_train = pd.read_csv('/content/gdrive/My Drive/y_train.txt', header=None)
y_test = pd.read_csv('/content/gdrive/My Drive/y_test.txt', header=None)
y_train.columns = ['label']
y_test.columns = ['label']
# we will need this for fitting the model
y_train_fin = y_train.values.ravel()
y_test_fin = y_test.values.ravel()
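
As a quick illustration of how the whitespace delimiter and the `names` argument work together, here is a small sketch that parses in-memory text instead of the Drive files. The three name lines and the numeric values are invented samples written in the style of the real files; they are assumptions, not actual data.

```python
import io
import pandas as pd

# Hypothetical stand-ins for features.txt and X_train.txt.
feature_text = "1 tBodyAcc-mean()-X\n2 tBodyAcc-mean()-Y\n3 tBodyAcc-mean()-Z\n"
data_text = "0.25  -0.5   1.0\n0.75   0.125 -1.0\n"

# Same parsing as for the real files: one name per line, whitespace stripped.
sample_names = [line.strip() for line in io.StringIO(feature_text)]

# r"\s+" treats any run of spaces as a single column separator.
sample = pd.read_csv(io.StringIO(data_text), header=None,
                     delimiter=r"\s+", names=sample_names)

print(sample.shape)
print(list(sample.columns))
```

Because only the surrounding whitespace is stripped, each column name keeps the leading index from its line in the feature file.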

To ensure our data is normalized, we use sklearn's StandardScaler from the preprocessing module and transform our data, essentially removing the mean and scaling to unit variance. We can do this with a few lines of code as shown below.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_fin = scaler.fit_transform(X_train)
X_test_fin = scaler.transform(X_test)
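
To see what the scaler actually computes, the sketch below repeats the same two calls on a tiny made-up matrix and compares the result with the formula (x - mean) / std worked out by hand. The arrays are invented for illustration; the point is that the test rows are scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up matrices standing in for X_train and X_test.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean/std, then scales
test_scaled = demo_scaler.transform(test)        # reuses the training mean/std

# The same computation by hand, column by column.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

Calling `transform` rather than `fit_transform` on the test set keeps information about the test data out of the preprocessing step.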

We can implement the SVM model using different kernels. I tried the Linear, Gaussian and Polynomial kernels to gauge which model would achieve the highest accuracy. To get a more reliable estimate than a single train/validation split would give, we use k-fold cross-validation. We split the training data into k consecutive folds. Each fold is used once as the validation set while the k - 1 remaining folds form the training set. We then collect the scores for the various models and print the result as follows:

from sklearn.model_selection import KFold
from sklearn import svm
import numpy as np

svc_linear = svm.SVC(kernel='linear')
svc_gaussian = svm.SVC(kernel='rbf')
svc_poly = svm.SVC(kernel='poly')
# split into 5 folds
k_fold = KFold(n_splits=5)
# for each kernel: fit on 4 folds, score on the held-out fold
scores_linear = np.array([
    svc_linear.fit(X_train_fin[train], y_train_fin[train]).score(X_train_fin[test], y_train_fin[test])
    for train, test in k_fold.split(X_train_fin)])
scores_gaussian = np.array([
    svc_gaussian.fit(X_train_fin[train], y_train_fin[train]).score(X_train_fin[test], y_train_fin[test])
    for train, test in k_fold.split(X_train_fin)])
scores_poly = np.array([
    svc_poly.fit(X_train_fin[train], y_train_fin[train]).score(X_train_fin[test], y_train_fin[test])
    for train, test in k_fold.split(X_train_fin)])
print("Scores for linear SVM ", scores_linear)
print("Train accuracy LINEAR: %0.2f (+/- %0.2f)" % (scores_linear.mean(), scores_linear.std()/2))
print("Scores for gaussian SVM ", scores_gaussian)
print("Train accuracy GAUSSIAN: %0.2f (+/- %0.2f)" % (scores_gaussian.mean(), scores_gaussian.std()/2))
print("Scores for polynomial SVM ", scores_poly)
print("Train accuracy POLY: %0.2f (+/- %0.2f)" % (scores_poly.mean(), scores_poly.std()/2))

After running the above code, we get this:

Scores for linear SVM  [0.92454113 0.90210741 0.93197279 0.94761905 0.96462585]
Train accuracy LINEAR: 0.93 (+/- 0.01)
Scores for gaussian SVM [0.92726037 0.90754589 0.91632653 0.92789116 0.95782313]
Train accuracy GAUSSIAN: 0.93 (+/- 0.01)
Scores for polynomial SVM [0.91434398 0.90210741 0.91768707 0.92040816 0.93061224]
Train accuracy POLY: 0.92 (+/- 0.00)

As we can see, Linear and Gaussian SVM kernels are comparable. We can then see how each does on the test set:

from sklearn.metrics import accuracy_score

# note: each model was last fitted inside the cross-validation loop,
# i.e. on the training portion of the final fold
y_pred_linear = svc_linear.predict(X_test_fin)
y_pred_gaussian = svc_gaussian.predict(X_test_fin)
y_pred_poly = svc_poly.predict(X_test_fin)
print("Linear ", accuracy_score(y_test_fin, y_pred_linear))
print("Gaussian ", accuracy_score(y_test_fin, y_pred_gaussian))
print("Poly ", accuracy_score(y_test_fin, y_pred_poly))

After running the above code, we obtain the following results:

Linear  0.9491007804546997
Gaussian 0.9433322022395657
Poly 0.9124533423820834

As we can see, the Linear kernel did best, although still quite comparable with the Gaussian kernel.
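
As an aside, the fold-by-fold comparison of kernels can also be written more compactly with scikit-learn's `cross_val_score`. The sketch below is one possible variant, not the code used for the results in this article: it runs on synthetic data from `make_classification` (an assumption, so the numbers will not match) and wraps the scaler in a pipeline so that it is refit on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, NOT the activity recognition data set.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

k_fold_demo = KFold(n_splits=5)
cv_results = {}
for kernel in ("linear", "rbf", "poly"):
    # Scaling happens inside the pipeline, so each fold is scaled
    # using statistics from its own training portion only.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    cv_results[kernel] = cross_val_score(model, X_demo, y_demo, cv=k_fold_demo)
    print(kernel, cv_results[kernel].mean(), cv_results[kernel].std())
```

Each entry of `cv_results` holds one accuracy per fold, which is the same information the hand-written list comprehensions collect.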

Going with the Linear kernel, we refit it on the full training set and create a cross-tab to get a clearer picture of how well the SVM does on the test set:

# refit on the full training set before predicting on the test set
svc_linear.fit(X_train_fin, y_train_fin)
crosstab = pd.crosstab(y_test_fin.flatten(), svc_linear.predict(X_test_fin),
                       rownames=['True'], colnames=['Predicted'],
                       margins=True)
crosstab

We obtain the table below. It looks like our model did very well. However, it had some trouble telling 4 and 5 apart: 55 samples of class 4 were predicted as 5, and 18 samples of class 5 were predicted as 4. Why? If we look at the activities they represent, 4 is sitting while 5 is standing. Both activities are stationary, so the accelerometer and gyroscope data for them would be quite similar.

+------------+------+------+------+------+------+------+------+
| Predicted | 1 | 2 | 3 | 4 | 5 | 6 | All |
+------------+------+------+------+------+------+------+------+
| True | | | | | | | |
| 1 | 495 | 0 | 1 | 0 | 0 | 0 | 496 |
| 2 | 16 | 453 | 2 | 0 | 0 | 0 | 471 |
| 3 | 6 | 15 | 399 | 0 | 0 | 0 | 420 |
| 4 | 0 | 2 | 0 | 434 | 55 | 0 | 491 |
| 5 | 0 | 0 | 0 | 18 | 514 | 0 | 532 |
| 6 | 0 | 0 | 0 | 0 | 0 | 537 | 537 |
| All | 517 | 470 | 402 | 452 | 569 | 537 | 2947 |
+------------+------+------+------+------+------+------+------+
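
The counts in the table can be turned into overall accuracy and per-class precision and recall with a few lines of NumPy. The matrix below is copied from the table (without the All margins); the activity names are an assumption based on the order in which the six activities are listed at the start of the article.

```python
import numpy as np

# Assumed label order: ids 1..6 as listed in the introduction.
activities = ["walking", "walking upstairs", "walking downstairs",
              "sitting", "standing", "laying"]

# Rows are true labels, columns are predicted labels (from the table).
cm = np.array([
    [495,   0,   1,   0,   0,   0],
    [ 16, 453,   2,   0,   0,   0],
    [  6,  15, 399,   0,   0,   0],
    [  0,   2,   0, 434,  55,   0],
    [  0,   0,   0,  18, 514,   0],
    [  0,   0,   0,   0,   0, 537],
])

correct = cm.diagonal()
recall = correct / cm.sum(axis=1)      # share of each true class found
precision = correct / cm.sum(axis=0)   # share of each prediction that is right
accuracy = cm.trace() / cm.sum()

print("accuracy: %0.4f" % accuracy)
for name, p, r in zip(activities, precision, recall):
    print("%-20s precision %0.3f  recall %0.3f" % (name, p, r))
```

This gives an overall accuracy of 2832/2947, about 0.961. The weakest figures are the recall for sitting (434/491, about 0.88) and the precision for standing (514/569, about 0.90), which matches the confusion between classes 4 and 5 noted above. The accuracy is somewhat higher than the 0.949 printed earlier for the Linear kernel, which would be consistent with the model having been refit on the full training set before building the cross-tab.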

That’s it! One thing we could try next is applying Principal Component Analysis (PCA) to reduce the dimensionality of the data, since having so many features could lead to overfitting. Another would be to turn this into a binary classification problem where we try to predict whether an activity is stationary or mobile.
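
Both ideas could be sketched roughly as follows. This is only an illustration on synthetic data with labels 1 to 6, not a result on the activity data. The 95% variance threshold is an arbitrary choice, and treating labels 1 to 3 as mobile and 4 to 6 (including laying) as stationary is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with six classes, relabelled to 1..6.
X_syn, y_zero_based = make_classification(n_samples=600, n_features=50,
                                          n_informative=10, n_classes=6,
                                          n_clusters_per_class=1,
                                          random_state=0)
y_syn = y_zero_based + 1

# Idea 1: keep just enough principal components to explain 95% of
# the variance, then fit the linear SVM on the reduced features.
pca_svm = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                        SVC(kernel="linear"))
pca_svm.fit(X_syn, y_syn)
n_kept = pca_svm.named_steps["pca"].n_components_
print("components kept:", n_kept, "of", X_syn.shape[1])

# Idea 2: collapse the six labels into two groups (assumed grouping).
y_binary = np.where(y_syn <= 3, "mobile", "stationary")
binary_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
binary_svm.fit(X_syn, y_binary)
print("binary classes:", binary_svm.classes_)
```

On the real data, either variant would be evaluated the same way as before: cross-validation on the training set first, then a single check on the test set.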

Conclusion

I hope you found this article insightful and helpful. Please share your thoughts.
