Cure Cancer: AI and Machine Learning

There are several ways that machine learning tools can be used on existing data sets to potentially discover a cure for cancer. For one, anybody can download the tools for free nearly anywhere in the world with a consistent internet connection. One of my favorite programs is R, which works on both Windows and Mac machines and installs in a matter of minutes. I particularly like R because of the machine learning libraries that can be leveraged in its programming environment. While I previously shared some general machine learning algorithms, in this post I am going to go a little deeper for those who have a technical background and want to expand their toolkit by experimenting with some of these machine learning techniques.

The first step is understanding what variables you might have access to as they relate to cancer, and the nature of those variables. A variety of structured and unstructured data can be combined in frameworks like Hadoop to prepare the data for analysis. If you want to leverage different machine learning techniques, it is useful to understand how trees work: decision trees make no assumption of linearity, which is helpful when trying to glean insights from non-linear data.


Classification trees help separate data into the classes of the response variable. If the target variable has more than two categories, different variants of the algorithm are leveraged, but overall classification trees are useful when the target variable is categorical (like yes/no). Regression trees, or prediction trees, on the other hand, are useful when the response variable is numeric or continuous. In short, the target variable determines whether to use a classification or a regression tree. Conditional logistic regression can be useful for tackling sparse data.
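To make the classification-versus-regression distinction concrete, here is a minimal Python sketch (all data, thresholds, and function names are illustrative, not from any clinical dataset): a single-split "stump" chooses its split point by Gini impurity when the target is categorical, and by within-group variance when the target is numeric.

```python
def gini(labels):
    """Gini impurity for a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n          # fraction of "1" (yes) labels
    return 2 * p * (1 - p)

def variance(values):
    """Population variance of a list of numeric values."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def best_split(xs, ys, impurity):
    """Return (threshold, weighted impurity) of the best single split on xs."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

x = [0.2, 0.5, 1.1, 1.8, 2.4, 3.0]

# Categorical target (0 = no, 1 = yes): split chosen by Gini impurity
t_class, _ = best_split(x, [0, 0, 0, 1, 1, 1], gini)

# Numeric target: split chosen by within-group variance
t_reg, _ = best_split(x, [0.1, 0.3, 0.2, 1.9, 2.2, 2.5], variance)
print(t_class, t_reg)
```

On this toy data both criteria settle on the same split point, but on real data the choice of criterion (and therefore of tree type) follows the target variable, as described above.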

   The advantages of decision trees include fast computation, invariance under monotone transformations of the variables, easy extension to categorical outcomes, resistance to irrelevant variables, a single tuning parameter, the ability to handle missing data, and output that can be easily understood by non-technical audiences. The disadvantages can include accuracy, since approximating the function well may require higher-order interactions, and variance, since each split depends on previous splits, so small changes in the data can cause big changes in the decision tree. Some important definitions to understand include:

  • Root is the topmost node of the tree
  • Edge is the link between two nodes
  • Child is a node that has a parent node
  • Parent is a node that has an edge to a child node
  • Leaf is a node that does not have a child node in the tree
  • Height is the length of the longest path to a leaf
  • Depth is the length of the path from a node up to the root
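The definitions above can be checked with a short Python sketch (the tree shape and node names are made up for illustration):

```python
# A tiny hand-built tree, stored as a dict of node -> list of children
tree = {"root": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": ["e"], "e": []}

# Invert the edges to get each node's parent
parent_of = {child: parent for parent, kids in tree.items() for child in kids}

def depth(node, parent_of):
    """Length of the path from a node up to the root."""
    d = 0
    while node in parent_of:
        node = parent_of[node]
        d += 1
    return d

def height(node, tree):
    """Length of the longest downward path from a node to a leaf."""
    children = tree[node]
    if not children:
        return 0
    return 1 + max(height(c, tree) for c in children)

# Leaves are the nodes with no children
leaves = [n for n, kids in tree.items() if not kids]
print(height("root", tree), depth("e", parent_of), leaves)
```

Here the height of the tree is the height of its root, and the depth of the deepest leaf equals that height.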

Let’s start by considering the existing prostate cancer data set available in R. The data represents a population of 97 males. This is a good data set to illustrate how easily different tree growth algorithms and classification techniques can be used to predict tumor spread in males. In this specific example, the measures for prediction are PSA, the size of the prostate, benign prostatic hyperplasia, Gleason score, and capsular penetration. Therefore, to better understand and predict the tumor spread (seminal vesicle invasion = svi), the following variables were used for the tree growth algorithms: log of benign prostatic hyperplasia amount (lbph), log of prostate-specific antigen (lpsa), Gleason score (gleason), log of capsular penetration (lcp) and log of cancer volume (lcavol).

Here is a quick program that I wrote in R to better understand this data set:

R Script

# Loading the proper libraries to conduct this analysis on the prostate cancer dataset in R
install.packages("lasso2")
library(lasso2)
data("Prostate")
install.packages("rpart")
library(rpart)
install.packages("party")
library(party)
# Setting up the classification tree
classification=rpart(svi~lbph+lpsa+gleason+lcp+lcavol,data=Prostate,method="class")
# Let's look at the results
printcp(classification)
# Plotting the results
plotcp(classification)
# Plotting the tree
plot(classification,uniform=T,main="Classification tree for prostate cancer")
text(classification,use.n = T, all=T, cex=.8)
# Setting up the regression tree
regression=rpart(svi~lbph+lpsa+gleason+lcp+lcavol,data=Prostate,method="anova")
# Looking at the results
printcp(regression)
plotcp(regression)
plot(regression,uniform=T,main="Regression tree for prostate cancer")
text(regression,use.n = T, all=T,cex=.8)
# Now fitting the conditional inference tree
conditional=ctree(svi~lbph+lpsa+gleason+lcp+lcavol,data=Prostate)
# Let's look at the results
conditional
# Plotting the results
plot(conditional,main="Conditional inference tree for prostate cancer")

This script resulted in the following information:

> printcp(classification)

Classification tree:
rpart(formula = svi ~ lbph + lpsa + gleason + lcp + lcavol, data = Prostate,
    method = "class")

Variables actually used in tree construction:
[1] lcp

Root node error: 21/97 = 0.21649

n= 97

       CP nsplit rel error  xerror    xstd
1 0.52381      0   1.00000 1.00000 0.19316
2 0.01000      1   0.47619 0.80952 0.17831

> printcp(regression)

Regression tree:

rpart(formula = svi ~ lbph + lpsa + gleason + lcp + lcavol, data = Prostate, 
    method = "anova")

Variables actually used in tree construction:
[1] lcp  lpsa

Root node error: 16.454/97 = 0.16962

n= 97 

       CP nsplit rel error  xerror    xstd
1 0.45551      0   1.00000 1.00780 0.14079
2 0.21489      1   0.54449 0.68052 0.15327
3 0.01000      2   0.32960 0.53091 0.11726




> conditional


Conditional inference tree with 3 terminal nodes

Response:  svi 
Inputs:  lbph, lpsa, gleason, lcp, lcavol 
Number of observations:  97 

1) lcp <= 1.7492; criterion = 1, statistic = 43.496
  2) lpsa <= 2.972975; criterion = 1, statistic = 20.148
    3)*  weights = 66 
  2) lpsa > 2.972975
    4)*  weights = 18 
1) lcp > 1.7492
  5)*  weights = 13 

Note that the head (root) node is seminal vesicle invasion, which shows the tumor spread. The cross-validation results show there is only one split in the classification tree, with a relative error after the split of 0.80952 and a standard error of 0.17831. The log of capsular penetration was used to split the tree, at lcp < 1.791. The regression tree algorithm produced three leaf nodes because the script split the data set twice. The relative error after the first split was 0.68052 with a standard error of 0.15327, and after the second split the relative error was 0.53091 with a standard error of 0.11726. The tree was split first on the log of capsular penetration at < 1.791 and then on the log of prostate-specific antigen at < 2.973. The conditional inference tree algorithm produced a split at < 1.749 on the log of capsular penetration and at < 2.973 on the log of prostate-specific antigen, both at the 0.001 significance level.

In this particular example, the conditional inference tree growth algorithm produced more useful information than the classification and regression tree growth algorithms. That being said, while the language around machine learning can sometimes be complicated, it really just comes down to using the right variables as input and testing different machine learning algorithms against the problem being solved. Testing different machine learning algorithms boils down to running a few lines of code in R, Python or your favorite programming language.
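As a sketch of that point in Python, here is what "testing different algorithms" can look like in a few lines, using only the standard library. The data is synthetic, and the majority-class baseline and 1-nearest-neighbor rule are stand-ins for whatever algorithms you actually want to compare; each follows the same fit-and-score pattern.

```python
import random

random.seed(0)
# Synthetic two-class data: a single informative feature per example
data = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(3, 1), 1) for _ in range(50)]
random.shuffle(data)
train, test = data[:70], data[70:]

def majority_baseline(train, test):
    """Always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(guess == y for _, y in test) / len(test)

def one_nearest_neighbor(train, test):
    """Predict the label of the closest training example."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return sum(predict(x) == y for x, y in test) / len(test)

results = {"baseline": majority_baseline(train, test),
           "1-NN": one_nearest_neighbor(train, test)}
print(results)
```

Swapping in another algorithm means writing one more function with the same signature and adding it to the comparison, which is the spirit of the paragraph above.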

Clinical data around pathology related detail, tumor evolution and cell-level information is being generated at exponentially increasing levels. Many of these data sets are starting to be available online for analysis. The type of algorithms used in this example could be used on these big data sets to accelerate the discovery of a cure for cancer. But, it is not going to happen without individuals that are willing to embrace these types of tools for analysis.

#BigData #AI #Oncology #MachineLearning

Ensembles and Random Forest Analysis: How it Works

Ensemble methods combine multiple machine learning algorithms to produce better predictions.  For example, in terms of logistic regression with ensemble classification, if the first classifier is a base classifier and the second is a corrector classifier, then the first does the initial classification, and the predicted class is then fed in as a feature of the second classifier.  The second classifier can either produce a classification identical to the first or correct the prediction if more accuracy can be found.  The base classifier provides the initial prediction of the target class.  The corrector classifier attempts to correct any errors in the base classifier's prediction by focusing on its decision boundary.  For example, a choice of base classifier could be logistic regression, a parametric discriminative classifier that can be used for training.  As a corrector, k-nearest neighbors, a non-parametric classifier, could take a vote (or average) of the k nearest training points to make the prediction.
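Here is a minimal Python sketch of the base/corrector idea (all data and thresholds are made up; a fixed threshold rule stands in for a trained logistic regression, and for simplicity the k-nearest-neighbors corrector measures distance on the raw feature only, even though it carries the base prediction as an extra feature):

```python
def base_predict(x):
    # Stand-in for a trained logistic regression: a fixed decision threshold
    return 1 if x > 1.0 else 0

def knn_correct(train, x, k=3):
    # train: list of ((feature, base_prediction), label)
    # Majority vote over the k training points nearest in feature space
    nearest = sorted(train, key=lambda t: abs(t[0][0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# The true boundary sits at 1.5, so the base rule errs on 1.1 and 1.3
samples = [(0.2, 0), (0.6, 0), (1.1, 0), (1.3, 0), (1.6, 1), (2.0, 1)]
train = [((x, base_predict(x)), y) for x, y in samples]

base_acc = sum(base_predict(x) == y for x, y in samples) / len(samples)
corrected_acc = sum(knn_correct(train, x) == y for x, y in samples) / len(samples)
print(base_acc, corrected_acc)
```

On this toy data the corrector overturns the base classifier exactly in the region near the base rule's decision boundary, which is the behavior described above.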

Random Forest is a type of ensemble method that performs both regression and classification with the use of multiple decision trees.  The underlying technique is often referred to as bootstrap aggregation, or bagging.  Bootstrap aggregation involves training each decision tree on a different random sample of the training data, with the sampling done with replacement.
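Here is a minimal Python sketch of bootstrap aggregation, with single-split decision stumps standing in for full decision trees (the data and the number of stumps are arbitrary illustrative choices):

```python
import random

random.seed(1)

def train_stump(sample):
    """Choose the threshold with the fewest misclassifications on the sample."""
    best_t, best_err = None, float("inf")
    for t, _ in sample:
        err = sum((x > t) != bool(y) for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_predict(stumps, x):
    """Majority vote over the ensemble of stumps."""
    votes = sum(x > t for t in stumps)
    return 1 if votes > len(stumps) / 2 else 0

data = [(0.1, 0), (0.4, 0), (0.7, 0), (1.2, 1), (1.5, 1), (1.9, 1)]
# Each stump trains on a bootstrap sample: same size, drawn with replacement
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]
print(bagged_predict(stumps, 0.0), bagged_predict(stumps, 2.5))
```

A real Random Forest additionally samples a random subset of features at each split, which this one-feature sketch cannot show; the bootstrap-and-vote structure is the part illustrated here.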

AI versus Big Data: What’s the Difference?

Artificial intelligence is fueled by computers, big data, and algorithms. Big data is the input for business intelligence capabilities. Big data represents the large volume of data that often needs to go through a data quality process of cleansing before it can be turned into business insights. Artificial intelligence, on the other hand, occurs when computers act on that data input. Artificial intelligence changes behavior based on findings and then modifies the approach. Big data analytics is more about looking for a given piece of data to produce insight, versus having the computer act on the results that are found. Big data analytics produces insights through the identification of patterns, using techniques like sequential analysis and leveraging technologies like Hadoop that can analyze both structured and unstructured data. While artificial intelligence can also be based on structured and unstructured data, with artificial intelligence, the computer learns from that big data, keeps collecting it, and then acts upon it.

Industry examples of how big data is being leveraged in artificial intelligence range from consumer goods to the creative arts to media. For example, in consumer goods, Hello Barbie runs on machine learning: the microphone on Barbie’s necklace records what the child says and analyzes it to determine a fitting response, and the server gets the response back to Barbie in under a second. In the creative arts, music-generating algorithms draw on sources from newspapers to speeches to create themes for new lyrics and to help musicians better understand target audiences and increase record sales. In media, the BBC project Talking with Machines lets listeners engage in conversation with their smart devices, inserting their perspective to become part of the story creation.

Artificial intelligence influences big data analytics and vice versa. Artificial intelligence uses big data to run algorithms, such as machine learning algorithms, which rely on training and test datasets for analysis.  Big data analytics can be useful in preparing those training and test datasets for machine learning. Also, access to big data allows artificial intelligence to continue learning from additional data sources. Machine learning algorithms can reproduce behaviors based on the big data feeding their processors through trial-and-error style algorithms.
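The training/test preparation step mentioned above can be sketched in a few lines of Python (synthetic data; the 80/20 split ratio is a common but arbitrary choice):

```python
import random

random.seed(42)
# Synthetic (feature, label) pairs standing in for a prepared big data extract
dataset = [(x, x % 3) for x in range(100)]
random.shuffle(dataset)                      # shuffle before splitting

split = int(0.8 * len(dataset))              # 80% train / 20% test
train, test = dataset[:split], dataset[split:]
print(len(train), len(test))
```

The model only ever fits on `train`, and `test` is held back to estimate how the model behaves on data it has not seen.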

Essentially, big data is what teaches artificial intelligence, and the rise of artificial intelligence is complementary to the exponential growth of big data. Understanding the basics of how big data and artificial intelligence intersect is important, as they are both here to stay and have the potential to boost not only revenue but also the innovative and creative capabilities of businesses.

#AI #BigData

Will AI Replace Humans?

Should artificial intelligence be used as a tool to support or replace decision makers? After all, decision making relates to reasoning. Fifty-two million American jobs will be displaced or changed due to automation by 2030. While the changing nature of work causes some anxiety, the machines are just acting human, not actually human. And, while technology eliminates some jobs, it does not eliminate work.

Artificial intelligence can be used as a tool to support decision makers. But technology empowered by artificial intelligence definitely does not eliminate the need for governance and ethics as they relate to the social good. Humans have the unique ability to create a vision and a plan to achieve it. The strength of artificial intelligence lies in data processing, not in complex judgment and decision making. However, artificial intelligence is complementary to complex decision making.  Organizations should be asking themselves how computers can support humans in solving complex problems. For example, AI for Good is a United Nations initiative and platform for sharing beneficial artificial intelligence projects that tackle some of society’s biggest challenges.


While the goals of some other artificial intelligence initiatives are to generate a software program that can solve complex problems and moderate itself with thoughts and emotions similar to human intelligence, it is important to understand the limitations of this scientific pursuit. There are many philosophical challenges in executing this intent, from how freedom is defined to how values are determined and how understanding is measured. The challenge in pursuing these types of initiatives is in the programming. Artificially intelligent systems create their own rules upon existing rules and cannot deviate from them or make random decisions, which in turn makes it difficult for such a system to gain understanding similar to the human experience. Numerous studies have shown that free will influences mental processing and intelligence. In terms of artificial intelligence breakthroughs, there have been some wins, such as in 2012, when one of Google’s supercomputers scanned 10 million YouTube videos and learned to identify a cat with 75% accuracy. But a four-year-old performs that task flawlessly, and it is not exactly tackling the issues of culture, vision or values as they relate to complex decision making. In summary, we are a far cry from what you might believe from watching an episode of Westworld.


Humans have a history of adapting and thriving when new kinds of work have emerged in society, so even defining the human experience is a moving goalpost for any programmer to master. A simple walk through an art museum reflects how complicated the human perception of reality over time is to mimic, let alone predict.

One of the challenges that artificial intelligence faces is that it is developing at a rate faster than some social systems, which is why there is increasing interest in ethics and public policy as they relate to artificial intelligence. Also, some of the data input being used to drive machine learning programs is not reflective of the communities that the programs ultimately seek to serve. Much of this issue, however, is simply a reflection of an age-old data quality problem in programming: poor data sources result in weak data outputs, which in turn can lead to poor decision making. Bad data has been estimated to cost the United States $3 trillion per year.


One potential takeaway from this rapidly evolving digital economy is that a purpose-driven life is uniquely human. And the purpose each of us finds in terms of living a meaningful existence comes from a complex understanding of where we’ve been and where we are going, along with some seemingly random but transformative events along the way. Regardless, those who embrace artificial intelligence to tackle problems with purpose are likely to create more impact than those who reject these innovative technologies.

#Purpose #HumanExperience #AIforGood

The Future of Artificial Intelligence

One of the future challenges of big data analytics for artificial intelligence is the role it plays relative to human judgment.  For example, it was found that human parole boards do much worse than machines powered by artificial intelligence in calculating which dangerous prisoners should be released back into society.  Similarly, skilled pathologists were not able to beat machine learning algorithms in the diagnosis of breast cancer. Banks are currently delivering advice to wealthy clients using artificial intelligence from a Singapore-based technology company.  In a William Grove study covering 136 different research studies, expert judgment was found to be better than machine learning equivalents in only eight studies.

Businesses can access more data than ever before, but research has found that organizations still struggle to see the bigger picture in terms of organizational priorities. Proper framing and focus relative to the problem being solved have been found to be critical early in the process as large data sets continue to grow. Organizations must take a step back to think about what is needed before digging deeper and getting lost in the weeds.

A vital concern as big data continues to grow is that there is still a gap in the literature as it relates to the role of leadership in big data governance effectiveness.  Also, with pressure from the public likely to increase as the nature of work continues to change, those serving in governing roles are under more scrutiny than ever before.  Despite numerous peer-reviewed research findings that have indicated how essential high-level support is to effective big data governance, many governing bodies still do not have the necessary knowledge to govern effectively.  However, it does not take a technical mind to understand that the output of machine learning programs is only as good as the inputs.  Inclusion is critical from the beginning, especially in representing marginalized voices.  While the future of artificial intelligence certainly will lead to business efficiencies and improved customer service experiences, there is still a role for a human touch reflective of our shared vision and values.

#AI #BigData #MachineLearning

Writing Your Own Machine Learning Programs

If you’re trying your first machine learning algorithm, there are some formulas that will be useful to you (although they might be overwhelming to learn at first). All machine learning algorithms are governed by a set of conditions, and your job is to make sure your algorithm fits the assumptions to ensure good performance. There are different algorithms for different conditions. For example, don’t even try to use linear regression on a categorical dependent variable, or you will be disappointed with low values of R² and F statistics. Instead, use algorithms like logistic regression, decision trees, SVM, and Random Forest. Here is a good read to get a better sense of these algorithms: Essentials of Machine Learning Algorithms.

For beginners, I also highly recommend working through a tutorial site that talks through some of the programs that can be written in R.

#AI #MachineLearning #FutureofWork

Developing a Data Warehouse to Support Business Intelligence Analytics

The first step in developing a data warehouse to support a business intelligence program is to develop a plan.  That plan should include some forward thinking about the questions various users, from the Board of Directors to frontline staff, may ask. This helps ensure that the design can deliver the functionality needed to meet the business objectives of the organization, and it properly manages expectations from the beginning.

The following provides some general guidance as it relates to building a data warehouse and the foundational aspects of business intelligence programs:

  1. Determine business objectives
  2. Collect and analyze information
  3. Identify core business processes
  4. Construct a conceptual data model
  5. Locate data sources and plan data transformations
  6. Set tracking duration
  7. Implement the plan

Having a thoughtful plan can help ensure that the project is appropriately resourced, that anticipated costs are controlled, and that the value the business intelligence program is expected to add is actually realized.

#DataWarehouse #BusinessIntelligence

Parametric and Non-Parametric Testing

Generally speaking, parametric tests work well with normally distributed data where the mean represents the center of the distribution, while non-parametric tests work well with skewed and non-normal distributions, where the spread of each group is different, where the center is better represented by the median, or where the sample size is very small.  The decision on which test to use comes down to whether the mean or the median more accurately represents the center of the dataset’s distribution.  Assuming an adequate sample size, if the mean does represent the center of the dataset’s distribution, a parametric test may be a good choice.  However, if the median more accurately represents the center of the distribution, then a non-parametric test should be considered.

Some common parametric and non-parametric tests that can be useful in statistical analysis include:

  • If comparing means between two groups, the parametric procedure could be the two-samples t-Test and the nonparametric procedure could include a Wilcoxon Rank-Sum test.
  • If comparing two quantitative measurements from the same individual, the parametric procedure could be the Paired t-Test and the nonparametric procedure could be the Wilcoxon Signed-Rank test.
  • If comparing means between three or more independent groups, the parametric procedure could be the Analysis of Variance (or ANOVA) and the nonparametric procedure could be the Kruskal-Wallis test.
  • If estimating the degree of association between two quantitative variables, the parametric procedures could include the Pearson coefficient of correlation and the non-parametric procedure could include the Spearman’s Rank Correlation.
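As a concrete illustration of the first bullet above, here is a Python sketch that computes the two test statistics by hand on the same made-up samples (in practice the statistics and their p-values would come from a statistics package):

```python
import math

a = [2.1, 2.5, 2.8, 3.0, 3.2]
b = [3.1, 3.4, 3.6, 3.9, 4.2]

def t_statistic(x, y):
    """Two-sample t statistic with pooled variance (parametric: compares means)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1 / nx + 1 / ny))

def rank_sum(x, y):
    """Wilcoxon rank-sum statistic: sum of x's ranks in the combined data.
    (Assumes no tied values, as in this toy example.)"""
    combined = sorted(x + y)
    return sum(combined.index(v) + 1 for v in x)

print(round(t_statistic(a, b), 2), rank_sum(a, b))
```

The t statistic works directly with the sample means, while the rank-sum statistic discards the raw magnitudes and keeps only the ordering, which is what makes it robust to skew and outliers.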

Regarding which test to use from the list, it is best to first consider what you are trying to compare or estimate.  In scientific research, a useful process includes thinking through the potential null and alternative hypotheses.  The null hypothesis can capture an idea such as two approaches being equally effective.  If the two approaches are found to be the same, the null hypothesis stands; otherwise, an alternative hypothesis, such as the two approaches not being equally effective, may hold. Each dataset may require a different type of comparison, as well as a parametric or non-parametric test.  As illustrated above, the parametric and non-parametric tests run parallel to each other, with the choice depending on the statistical analysis that best fits the nature of the research question and the data collected.

#ParametricTesting #NonParametricTesting #AI #MachineLearning

 

Business Intelligence Belongs in the Boardroom

The lack of effective big data governance in the digital age is a growing problem. From 2013 to 2014, big data breaches increased by 64%.  In 2016 alone, 1.1 billion identities were stolen.  Experts predict that by 2020, big data related damages will cost the world $6 trillion. From machine learning to the internet of things, the boardroom is no longer exempt from responsibility as it relates to the opportunities and risks posed by increasing volumes and usage of big data.  Public interest in ethical leadership in business continues to grow due to the continuing vacuum of effective management in many of our leading organizations.

In 2017, 42 states across America introduced over 240 cybersecurity-related regulations.  With growing concerns about accountability in the digital age, the Center for Strategic and International Studies has been advocating for a National Cybersecurity Safety Board. Furthermore, US federal regulators have recently proposed a plethora of information governance accountability requirements that could bring more severe civil and criminal penalties in the future. With increased pressure from the public, those serving in governing roles are now under more heightened scrutiny than ever before.  Despite numerous peer-reviewed research findings that have demonstrated how essential high-level support is to effective information governance, many governing bodies still do not have the necessary knowledge to govern effectively.

While organizations have collected and analyzed data for many decades, big data leverages new and powerful algorithms that have changed the entire environment in which opportunities and risks can be assessed and acted upon quickly. Big data is generally more complicated, faster and more varied than traditional datasets. With the explosion of big data sources from telematics, text, time and location, radio frequency identification, smart grid, telemetry, and social networking, there is unprecedented opportunity to leverage both structured and unstructured data to gain insights into customer behaviors, desires, and opinions. With these opportunities comes new risk, as reflected by the scandals at Equifax, Yahoo, and Facebook (Cambridge Analytica). When dealing with a company’s big data assets, Boards of Directors must appropriately balance core values and strategic goals with risk management processes.

The strategic obstacles and challenges that organizations face in the era of big data frankly may be overwhelming to the Board of Directors. According to a survey by Amrop (2016) that mapped the digital competencies of 300 boards and the profiles of 3,342 board members, only “5% of board members in non-tech companies have digital competencies,” and only “43% of board members in technology companies have digital competencies.” Furthermore, the Amrop (2016) study found that only “3% of Boards have a Technology Committee.” PwC’s (2018) Annual Corporate Director Survey shared the views of 714 directors, with 76% of respondents representing companies with annual revenues of more than $1 billion, and found that there is an information governance disconnect in boardrooms.

Boards of Directors can treat information as a valuable asset, provide effective Business Intelligence (BI) governance to align BI strategy with business objectives, and implement a comprehensive BI governance framework. Given that the boardroom represents the highest level of decision-making in a company, better understanding the BI capabilities available to provide transparency for boardroom strategic decision-making and risk management is essential.

What Business Intelligence tools have you found most valuable in the boardroom?  Please leave your thoughts in the comments or message me privately.

#BI #Governance #BOD