There are several ways that machine learning tools can be used on existing data sets to potentially discover a cure for cancer. First, anybody can download the tools for free nearly anywhere in the world with a consistent internet connection. One of my favorite programs is R, which runs on both Windows and Mac machines and installs in a matter of minutes. I particularly like R because of its machine learning libraries. While I previously shared some general machine learning algorithms, in this post I am going to go a little deeper for those who have a technical background and want to expand their toolkit and experiment with some of these machine learning techniques.

The first step is understanding which cancer-related variables you have access to and the nature of those variables. Both structured and unstructured data can be combined in frameworks like Hadoop to prepare the data for analysis. If you want to leverage different machine learning techniques, it is useful to understand how trees work, because decision trees do not assume linearity, which helps when trying to glean insights from non-linear data.
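To make the linearity point concrete, here is a minimal sketch on simulated data (not cancer data). The response is high only in the middle of the predictor's range, a pattern a straight line cannot follow. The variable names and the simulated relationship are my own assumptions for illustration.

```r
library(rpart)  # ships with standard R installations

set.seed(42)  # make the simulated data reproducible

# Simulated (hypothetical) data: y is high only when x falls in the middle band
n <- 200
x <- runif(n)
y <- ifelse(x > 0.3 & x < 0.7, 1, 0) + rnorm(n, sd = 0.1)
sim <- data.frame(x = x, y = y)

# A linear model assumes y changes at a constant rate with x
linear_fit <- lm(y ~ x, data = sim)

# A regression tree makes no such assumption; it cuts x into ranges
tree_fit <- rpart(y ~ x, data = sim, method = "anova")

# Root mean squared error on the training data for each model
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_linear <- rmse(sim$y, predict(linear_fit))
rmse_tree <- rmse(sim$y, predict(tree_fit))

print(c(linear = rmse_linear, tree = rmse_tree))
```

On data shaped like this the tree should come out with a much lower error than the straight line. On data that really is linear, the comparison can go the other way.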

Classification trees help separate data into the classes of the response variable. If the target variable has more than two categories, different variants of the algorithm are used, but overall classification trees are useful when the target variable is categorical (like yes/no). On the other hand, regression trees (or prediction trees) are useful when the response variable is numeric or continuous. The target variable determines whether to use a classification tree or a regression tree. Conditional logistic regression can be useful in tackling sparse data issues.
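As a rough sketch of that choice, the example below grows one tree of each kind with `rpart`. It uses the `kyphosis` data bundled with rpart and the built-in `mtcars` data rather than cancer data, purely so it runs without extra downloads; those data sets and the chosen predictors are my assumptions for illustration.

```r
library(rpart)

# Categorical target (Kyphosis is a factor: "absent"/"present"): classification tree
class_tree <- rpart(Kyphosis ~ Age + Number + Start,
                    data = kyphosis, method = "class")

# Numeric target (miles per gallon): regression tree
reg_tree <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# A classification tree predicts a class label, a regression tree a number
class_pred <- predict(class_tree, type = "class")
reg_pred <- predict(reg_tree)

print(head(class_pred))
print(head(reg_pred))

# rpart records which kind of tree was grown
print(c(class_tree$method, reg_tree$method))
```

If you leave out `method`, rpart picks one based on the type of the response, so storing a yes/no outcome as a factor rather than as 0/1 numbers matters.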

The advantages of decision trees include fast computation, invariance under monotone transformations of the variables, an easy extension to categorical outcomes, resistance to irrelevant variables, a single tuning parameter, the ability to handle missing data, and outputs that can be easily understood by non-technical audiences. The disadvantages include limited accuracy, since the fitted function has to be built from higher-order interactions, and high variance, since each split depends on the previous splits and small changes in the data can cause big changes in the tree. Some important definitions to understand include:

- **Root** is the topmost node of the tree
- **Edge** is the link between two nodes
- **Child** is a node that has a parent node
- **Parent** is a node that has an edge to a child node
- **Leaf** is a node that does not have a child node in the tree
- **Height** is the length of the longest path from the root to a leaf
- **Depth** is the length of the path from a node to the root
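These definitions can be read straight off a fitted tree. The sketch below is illustrative only: it grows a small tree on the `kyphosis` data bundled with rpart (my choice, so the example is self-contained) and relies on rpart's node numbering, where the root is node 1 and node k has children 2k and 2k + 1.

```r
library(rpart)

# Grow a small classification tree on rpart's bundled kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Row names of fit$frame are the node numbers; leaves are marked "<leaf>"
node_id <- as.integer(rownames(fit$frame))
is_leaf <- fit$frame$var == "<leaf>"

# Depth of a node: number of edges on the path back up to the root
node_depth <- sapply(node_id, function(k) {
  d <- 0L
  while (k > 1L) {
    k <- k %/% 2L  # step up to the parent node
    d <- d + 1L
  }
  d
})

# Height of the tree: the longest root-to-leaf path
tree_height <- max(node_depth[is_leaf])

# One row per node: its parent (NA for the root), leaf flag and depth
nodes <- data.frame(node = node_id,
                    parent = ifelse(node_id == 1L, NA, node_id %/% 2L),
                    leaf = is_leaf,
                    depth = node_depth)
print(nodes)
print(tree_height)
```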

Let’s start with the existing prostate cancer data set available in R (the Prostate data in the lasso2 package). The data covers 97 men. It is a good data set for illustrating how easily different tree growth algorithms and classification techniques can be used to predict tumor spread. In this example, the measures used for prediction are PSA, cancer volume, benign prostatic hyperplasia, Gleason score, and capsular penetration. To better understand and predict tumor spread (seminal vesicle invasion, svi), the following variables were used for the tree growth algorithms: log of benign prostatic hyperplasia amount (lbph), log of prostate-specific antigen (lpsa), Gleason score (gleason), log of capsular penetration (lcp) and log of cancer volume (lcavol).

Here is a quick program that I wrote in R to better understand this data set:

```r
# Loading the libraries needed to analyze the prostate cancer data set in R
install.packages("lasso2")
library(lasso2)
data("Prostate")
install.packages("rpart")
library(rpart)
install.packages("party")
library(party)

# Setting up the classification tree
classification <- rpart(svi ~ lbph + lpsa + gleason + lcp + lcavol,
                        data = Prostate, method = "class")

# Let's look at the results
printcp(classification)

# Plotting the cross-validation results
plotcp(classification)

# Plotting the tree
plot(classification, uniform = TRUE, main = "Classification tree for prostate cancer")
text(classification, use.n = TRUE, all = TRUE, cex = .8)

# Making the regression tree
regression <- rpart(svi ~ lbph + lpsa + gleason + lcp + lcavol,
                    data = Prostate, method = "anova")

# Looking at the results
printcp(regression)
plotcp(regression)
plot(regression, uniform = TRUE, main = "Regression tree for prostate cancer")
text(regression, use.n = TRUE, all = TRUE, cex = .8)

# Now doing the conditional inference tree
conditional <- ctree(svi ~ lbph + lpsa + gleason + lcp + lcavol, data = Prostate)

# Let's look at the results
conditional

# Plotting the results
plot(conditional, main = "Conditional inference tree for prostate cancer")
```

This script resulted in the following information:

```
> printcp(classification)

Classification tree:
rpart(formula = svi ~ lbph + lpsa + gleason + lcp + lcavol,
    data = Prostate, method = "class")

Variables actually used in tree construction:
[1] lcp

Root node error: 21/97 = 0.21649

n= 97

       CP nsplit rel error  xerror    xstd
1 0.52381      0   1.00000 1.00000 0.19316
2 0.01000      1   0.47619 0.80952 0.17831

> printcp(regression)

Regression tree:
rpart(formula = svi ~ lbph + lpsa + gleason + lcp + lcavol,
    data = Prostate, method = "anova")

Variables actually used in tree construction:
[1] lcp  lpsa

Root node error: 16.454/97 = 0.16962

n= 97

       CP nsplit rel error  xerror    xstd
1 0.45551      0   1.00000 1.00780 0.14079
2 0.21489      1   0.54449 0.68052 0.15327
3 0.01000      2   0.32960 0.53091 0.11726

> conditional

Conditional inference tree with 3 terminal nodes

Response:  svi
Inputs:  lbph, lpsa, gleason, lcp, lcavol
Number of observations:  97

1) lcp <= 1.7492; criterion = 1, statistic = 43.496
  2) lpsa <= 2.972975; criterion = 1, statistic = 20.148
    3)*  weights = 66
  2) lpsa > 2.972975
    4)*  weights = 18
1) lcp > 1.7492
  5)*  weights = 13
```

Note that the response variable is seminal vesicle invasion (svi), which indicates tumor spread. The cross-validation results show that the classification tree has only one split, with a cross-validated relative error (xerror) of 0.80952 and a standard error (xstd) of 0.17831 after that split. The split was made on the log of capsular penetration, at lcp < 1.791. The regression tree has three leaf nodes because the algorithm split the data twice. The cross-validated relative error was 0.68052 (standard error 0.15327) after the first split and 0.53091 (standard error 0.11726) after the second. That tree split first on the log of capsular penetration at lcp < 1.791 and then on the log of prostate-specific antigen at lpsa < 2.973. The conditional inference tree split on the log of capsular penetration at lcp <= 1.749 and on the log of prostate-specific antigen at lpsa <= 2.973, both at the 0.001 significance level (the output reports criterion = 1, which is 1 minus the p-value).
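The relative errors in the `printcp(classification)` table are easier to interpret once they are converted back to plain error rates. rpart scales them so that the root node equals 1, so multiplying by the root node error undoes the scaling. The short sketch below only redoes that arithmetic on the figures printed above; the variable names are mine.

```r
# Figures copied from the printcp(classification) output
root_error <- 21 / 97        # misclassification rate with no splits at all
rel_error_1split <- 0.47619  # "rel error" after one split
xerror_1split <- 0.80952     # cross-validated "xerror" after one split

# Multiplying by the root node error converts relative values to absolute rates
train_error <- rel_error_1split * root_error
cv_error <- xerror_1split * root_error

# Multiplying by the root's 21 errors gives counts out of the 97 patients
train_miscount <- round(rel_error_1split * 21)
cv_miscount <- round(xerror_1split * 21)

print(round(c(root = root_error, train = train_error, cv = cv_error), 4))
print(c(train_miscount = train_miscount, cv_miscount = cv_miscount))
```

In other words, the single split on lcp takes the error from 21 of 97 patients (about 21.6%) down to 10 of 97 (about 10.3%) on the training data, and to 17 of 97 (about 17.5%) under cross-validation.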

In this particular example, the conditional inference tree algorithm produced more useful information than the classification and regression tree algorithms. That being said, while the language of machine learning can be complicated to understand, it really just comes down to using the right variables as input and testing different machine learning algorithms against the problem being solved. Testing different machine learning algorithms boils down to running a few lines of code in R, Python or your favorite programming language.
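Here is a minimal sketch of what "a few lines of code" can look like when comparing two algorithms on the same held-out data. It is not part of the prostate analysis: it uses the `kyphosis` data bundled with rpart so that it runs on its own, and the 70/30 split, the seed and the 0.5 cutoff are arbitrary choices of mine.

```r
library(rpart)

set.seed(1)  # reproducible train/test split

# Hold out roughly 30% of rpart's bundled kyphosis data for testing
n <- nrow(kyphosis)
test_idx <- sample(n, size = round(0.3 * n))
train <- kyphosis[-test_idx, ]
test <- kyphosis[test_idx, ]

# Candidate 1: classification tree
tree_fit <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")
tree_pred <- predict(tree_fit, newdata = test, type = "class")

# Candidate 2: logistic regression (base R), cut off at a probability of 0.5
logit_fit <- glm(Kyphosis ~ Age + Number + Start, data = train, family = binomial)
logit_prob <- predict(logit_fit, newdata = test, type = "response")
logit_pred <- factor(ifelse(logit_prob > 0.5, "present", "absent"),
                     levels = levels(kyphosis$Kyphosis))

# Share of held-out cases each model classifies correctly
accuracy <- c(tree = mean(tree_pred == test$Kyphosis),
              logistic = mean(logit_pred == test$Kyphosis))
print(accuracy)
```

With a data set this small, a single split says little about which algorithm is better. Repeating the comparison over several splits, or using cross-validation, gives a fairer picture.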

Clinical data on pathology, tumor evolution and cell-level detail is being generated at a rapidly increasing rate. Many of these data sets are starting to become available online for analysis. The types of algorithms used in this example could be applied to these big data sets to accelerate the discovery of a cure for cancer. But it is not going to happen without people who are willing to embrace these kinds of tools for analysis.
