Basics of Machine Learning

2020, Jul 22    

Preface

We have already discussed the theory behind several Machine Learning (ML) algorithms, namely Logistic Regression, Random Forests, and Deep Learning, along with a practical application in the enhancer-prediction exercise. I am not going to revisit that here. Instead, in this session, let us see how to run ML algorithms in Galaxy, from setting up a tool to interpreting its results. This tutorial is adapted from the Galaxy Training Materials tutorial listed under References.

Loosely speaking, ML helps with classification, clustering, and regression, that is, with finding meaningful patterns in data.


Figure 1. Generic ML Applications


Uploading Data

The datasets for this tutorial contain 9 features (variables) from the profiling of breast cells, including clump thickness, cell size, cell shape, etc. In addition to these features, the training dataset has one more column, the target (or class). It holds a binary value for each row: 0 indicates no breast cancer and 1 indicates breast cancer. The test dataset does not contain the target column.

The data we'll be using are listed below. As described in a previous tutorial, we can upload them into Galaxy directly from the web links.
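To make the layout concrete, here is a minimal sketch of how such a table splits into features and target outside Galaxy, using only the Python standard library. The inline sample, the feature names `f1` to `f9`, and the helper `split_features_and_target` are made up for illustration; only the `target` column name comes from the tutorial data.

```python
import csv
import io

# Tiny made-up sample in the same layout as the training file: tab-separated,
# nine feature columns plus a final "target" column. The feature names and
# values are placeholders, not copied from breast-w_train.tsv.
SAMPLE_TSV = (
    "f1\tf2\tf3\tf4\tf5\tf6\tf7\tf8\tf9\ttarget\n"
    "5\t1\t1\t1\t2\t1\t3\t1\t1\t0\n"
    "8\t10\t10\t8\t7\t10\t9\t7\t1\t1\n"
    "1\t1\t1\t1\t2\t1\t2\t1\t1\t0\n"
)


def split_features_and_target(tsv_text, target_name="target"):
    """Return (feature_names, X, y) from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Every column except the target is treated as a feature.
    feature_names = [name for name in reader.fieldnames if name != target_name]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[target_name]))
    return feature_names, X, y


feature_names, X, y = split_features_and_target(SAMPLE_TSV)
print(len(feature_names), "features;", len(X), "rows; labels:", y)
```

The test file would be read the same way, except that there is no `target` column to pull out.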

  1. https://zenodo.org/record/1401230/files/breast-w_train.tsv
  2. https://zenodo.org/record/1401230/files/breast-w_test.tsv



Rename the datasets to make them easier to work with.



To view a dataset, click the "eye" icon next to its name. The first few rows of the training data look as follows.



Setting up the Classifier

Since this is a classification problem, we will build an ML model, or classifier, that captures the patterns in the data and its inherent features. A classifier's performance depends heavily on the data it is trained on, so it is of the utmost importance to provide clean and comprehensive data.

There are many flavors of classifiers out there, but for this exercise we will use Support Vector Machines (SVMs). We will rely on the scikit-learn library in Python: search for sklearn_svm_classifier in the repository and install the tool. In particular, we will use the Linear Support Vector Classifier (Linear SVC) because of its efficiency compared with the other options.

Support Vector Machines (SVMs)

SVMs are a well-established choice for classification tasks and need little introduction. The following graphics depict the fundamental idea; many texts, in print and online, cover the details.


Figure 2. Class-separating hyperplanes in SVM



Text Graphic 1. SVM: Under the hood
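For readers who prefer symbols to pictures, the separating-hyperplane idea can be written down in its standard textbook form. The notation below (weights $w$, bias $b$, slack variables $\xi_i$, labels recoded from 0/1 to $-1/+1$) is an assumption for illustration and is not taken from the graphics above.

```latex
% Decision rule: a point x is assigned to a class by the side of the hyperplane it falls on.
\hat{y}(x) = \operatorname{sign}\left( w^{\top} x + b \right)

% Soft-margin linear SVM: widen the margin 2 / \lVert w \rVert while penalising violations.
\min_{w,\, b,\, \xi} \quad \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad y_i \in \{-1, +1\}
```

A larger $C$ penalises misclassified or margin-violating training points more heavily, while a smaller $C$ tolerates them in exchange for a wider margin.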


Once the tool is installed, we can follow the screenshots below.





Next, we run the tool with the following input parameters.

  • “Select a Classification Task”: Train a model
  • “Classifier type”: Linear Support Vector Classification
  • “Select input type”: tabular data
  • “Training samples dataset”: breast-w_train
  • “Does the dataset contain header”: Yes
  • “Choose how to select data by column”: All columns EXCLUDING some by column header name(s)
  • “Type header name(s)”: target
  • “Dataset containing class labels or target values”: breast-w_train
  • “Does the dataset contain header”: Yes
  • “Choose how to select data by column”: Select columns by column header name(s)
  • “Type header name(s)”: target

The output of the above execution is a zip archive holding the trained model, which is subsequently used to make predictions on the test data.
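For comparison, the training step can be sketched directly in scikit-learn. This is only an approximation of what the Galaxy tool does under the hood, which the tutorial does not spell out. The data here are synthetic stand-ins (nine integer features, with class 1 rows drawn from higher values), and `C=1.0` and `max_iter=10000` are illustrative settings rather than the tool's confirmed defaults.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the training table: 9 integer features per row.
# Class 0 rows use values 1..4, class 1 rows use values 5..10 (made up).
X_class0 = rng.integers(1, 5, size=(40, 9))
X_class1 = rng.integers(5, 11, size=(40, 9))
X_train = np.vstack([X_class0, X_class1]).astype(float)
y_train = np.array([0] * 40 + [1] * 40)

# Linear Support Vector Classification, the classifier type chosen above.
clf = LinearSVC(C=1.0, max_iter=10000)
clf.fit(X_train, y_train)

# One weight per feature for a binary problem.
print("weights shape:", clf.coef_.shape)
print("training accuracy:", clf.score(X_train, y_train))
```

With the real data, `X_train` would hold every column except `target` and `y_train` the `target` column, mirroring the two column selections in the parameter list.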



Making Predictions

Choose the parameters as follows.



The output is a data table with the last column as the predicted class labels.
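The shape of that output can be mimicked in scikit-learn as well. The sketch below trains on synthetic data (the helper `make_rows` and all values are made up), predicts labels for unlabeled rows, and appends the predictions as a final column. It assumes nothing about how the Galaxy tool builds its table beyond what is stated above.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)


def make_rows(n_per_class):
    """Build synthetic rows: 9 integer features, class 1 drawn from higher values."""
    low = rng.integers(1, 5, size=(n_per_class, 9))
    high = rng.integers(5, 11, size=(n_per_class, 9))
    X = np.vstack([low, high])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_train, y_train = make_rows(40)
# The test rows come without labels, as in the tutorial; the labels are discarded.
X_test, _ = make_rows(5)

clf = LinearSVC(max_iter=10000).fit(X_train, y_train)
predicted = clf.predict(X_test)

# Feature columns first, predicted class label as the last column.
table = np.column_stack([X_test, predicted])
print(table.shape)
print(table[:3])
```

Each row of `table` is a test sample followed by its predicted class, which is the layout to expect in the Galaxy output.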



Exercise

  1. Run the SVM tool with the C-Support and Nu-Support Vector Classification classifier types on the training data and examine the results.
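As a rough offline companion to the exercise, the three classifier types can be compared in scikit-learn. This does not replace running them in Galaxy. The data are synthetic with deliberately overlapping classes, and the hold-out split, `C=1.0`, and `nu=0.5` are illustrative choices, not values from the tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC, NuSVC

rng = np.random.default_rng(2)

# Synthetic, partly overlapping classes so the models are not trivially perfect.
low = rng.integers(1, 7, size=(60, 9))
high = rng.integers(4, 11, size=(60, 9))
X = np.vstack([low, high]).astype(float)
y = np.array([0] * 60 + [1] * 60)

# Hold out a quarter of the rows to score each model on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "SVC (C-Support)": SVC(C=1.0),
    "NuSVC (Nu-Support)": NuSVC(nu=0.5),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: held-out accuracy {scores[name]:.2f}")
```

The actual training file has no held-out labels in the test set, so a split of the training data like this one is one way to get a number to compare.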

References

  1. Anup Kumar, 2020. Basics of machine learning (Galaxy Training Materials). /training-material/topics/statistics/tutorials/machinelearning/tutorial.html. Online; accessed Sun Jul 26 2020.