Machine Learning
Discussed at meeting on 06 September 2018; notes taken by @Libby Megna
Machine Learning - What is it, and what can it offer biologists?
by @Liz Mandeville
Learning Objectives:
Define machine learning
Understand the distinction between supervised and unsupervised machine learning
Identify the limits of prediction and classification schemes
Definition: Machine learning enables computers to do tasks without being explicitly programmed for those tasks.
Neural networks are one type of machine learning, but they do not encompass the whole field
Machine learning includes:
linear regression
logistic regression
decision tree
support vector machines
naive Bayes
k-nearest neighbors
k-means
random forests
dimensionality reduction (e.g. PCA)
Machine learning algorithms are ubiquitous in commercial applications, e.g. Netflix, Facebook, email spam filters
Machine learning has lots of potential in biology: medical imaging (classifying MRI images or classifying cells) and wildlife management (classifying camera trap images; e.g. a paper from the UW Comp Sci dept, author possibly Clune, spelling unconfirmed)
Cool guide: A visual introduction to machine learning (super cool visualizations)
A model often doesn't perform as well on test data as it did on the training data
Supervised vs. unsupervised machine learning
Supervised ML requires labeled training data
Unsupervised ML looks for patterns within the data without labeled categories
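To make the distinction concrete, here is a minimal sketch in plain Python (not R, and not from the presentation). The measurements, the class names "A" and "B", and all function names are made up for illustration. The supervised part uses the labels to compute one centroid per class; the unsupervised part is a bare-bones k-means that never sees the labels.

```python
import math

# Hypothetical measurements (two features per specimen); values are made up.
points = [(1.0, 1.2), (1.3, 0.9), (0.8, 1.1),
          (5.0, 5.2), (5.3, 4.8), (4.7, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]  # used by the supervised method only


def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def fit_supervised(pts, labs):
    """Supervised: the labels decide which points define each class centroid."""
    classes = sorted(set(labs))
    centroids = [mean_point([p for p, lab in zip(pts, labs) if lab == c])
                 for c in classes]
    return classes, centroids


def predict(point, classes, centroids):
    """Assign a new point to the class with the nearest centroid."""
    return classes[nearest(point, centroids)]


def kmeans(pts, init_centers, n_iter=10):
    """Unsupervised: group points around centers; labels are never consulted."""
    centers = list(init_centers)
    assignment = [nearest(p, centers) for p in pts]
    for _ in range(n_iter):
        for i in range(len(centers)):
            members = [p for p, a in zip(pts, assignment) if a == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = mean_point(members)
        new_assignment = [nearest(p, centers) for p in pts]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
    return assignment, centers


classes, centroids = fit_supervised(points, labels)
print(predict((1.1, 1.0), classes, centroids))

# Starting centers are an arbitrary choice here (first and last point).
clusters, centers = kmeans(points, init_centers=[points[0], points[-1]])
print(clusters)
```

The supervised call returns a named class ("A"), while k-means only returns anonymous cluster numbers ([0, 0, 0, 1, 1, 1] for this toy data). Deciding what those clusters mean is left to the analyst.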
Classifying candy ("organisms") activity - discussion points:
Which procedure was supervised and which unsupervised? (Procedure A was supervised, Procedure B was unsupervised)
How does sample size or variation in the training set affect the outcome? (If your training dataset is different from your test dataset, accuracy may decline.)
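The last point can be sketched with a toy classifier in plain Python. Everything here is hypothetical: the single feature, the made-up values, and the idea that the second test set comes from a "different bag" whose measurements are shifted. The classifier is a simple nearest-centroid rule, chosen only because it is short.

```python
def fit_centroids(values, labels):
    """Mean feature value per class, learned from labeled training data."""
    sums, counts = {}, {}
    for v, lab in zip(values, labels):
        sums[lab] = sums.get(lab, 0.0) + v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def predict(value, centroids):
    """Assign the class whose centroid is closest to value."""
    return min(centroids, key=lambda lab: abs(value - centroids[lab]))


def accuracy(values, labels, centroids):
    """Fraction of values whose predicted class matches the true label."""
    hits = sum(predict(v, centroids) == lab for v, lab in zip(values, labels))
    return hits / len(values)


# Made-up training data: class A is small, class B is large.
train_x = [0.8, 1.0, 1.2, 2.8, 3.0, 3.2]
train_y = ["A", "A", "A", "B", "B", "B"]

# Test set that resembles the training data.
matched_x = [0.9, 1.1, 1.3, 2.7, 2.9, 3.1]
# Test set from a hypothetical different source: every value is larger.
shifted_x = [2.1, 2.3, 2.5, 3.9, 4.1, 4.3]
test_y = ["A", "A", "A", "B", "B", "B"]

centroids = fit_centroids(train_x, train_y)
acc_matched = accuracy(matched_x, test_y, centroids)
acc_shifted = accuracy(shifted_x, test_y, centroids)
print(acc_matched, acc_shifted)
```

On the matched test set every item is classified correctly. On the shifted set all class A items land on the wrong side of the learned boundary, so accuracy drops to one half even though the model itself is unchanged.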
Connections to biological problems
Would phylogeny be a supervised or unsupervised algorithm? Could make arguments for both, but Liz argues that it is unsupervised because you don't know the actual evolutionary relationships before you start.
What about using DNA barcoding to identify thousands of individuals to species? DNA barcoding for animal species relies on using a few hundred base pairs of mitochondrial DNA to identify a species.
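One possible way to frame the barcoding question (the notes do not record an answer): if a reference library of sequences with known species labels already exists, identification resembles supervised classification. The sketch below is only an illustration of that framing in plain Python. The 20-base sequences, the species names, and the `max_mismatch` cutoff are invented, and it assumes the sequences are already aligned to equal length. Real barcodes are a few hundred base pairs long.

```python
def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def identify(query, reference, max_mismatch):
    """Return (species, distance) for the closest reference sequence.

    species is None when even the best match exceeds max_mismatch,
    i.e. the query does not resemble anything in the library.
    """
    best_species, best_dist = None, None
    for species, seq in reference.items():
        d = hamming(query, seq)
        if best_dist is None or d < best_dist:
            best_species, best_dist = species, d
    if best_dist is None or best_dist > max_mismatch:
        return None, best_dist
    return best_species, best_dist


# Hypothetical labeled reference library (made-up sequences and names).
reference = {
    "species_1": "ACGTACGTACGTACGTACGT",
    "species_2": "ACGTTCGTACGAACGTACCT",
    "species_3": "TCGTACGAACGTTCGTACGT",
}

close_query = "ACGTACGTACGTACGAACGT"    # one mismatch from species_1
distant_query = "TTTTTTTTTTTTTTTTTTTT"  # unlike any reference sequence

print(identify(close_query, reference, max_mismatch=2))
print(identify(distant_query, reference, max_mismatch=2))
```

The cutoff matters: without it, every query would be assigned to some species in the library, including individuals from species the library does not contain.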
Applying machine learning to biology
Defining your question well is essential
There is a potential trade-off between prediction and mechanistic understanding: you could get good predictions but have no idea of the underlying mechanisms
What are your goals, and what are your data like?
No shortage of easy-to-use ML tools in R
nnet
randomForest
CART (e.g. via the rpart package)
kmeans (base R)
lda (MASS)
lm (base R)
prcomp (base R)
See also https://cran.r-project.org/web/views/MachineLearning.html
~~~
Random notes by Libby
I think this is a good read on the trade-off between predictive power and simplicity/mechanistic understanding: Breiman 2001
Deep learning is just a neural net with more layers
From Liz: Here is the presentation PDF and the candy-based exercise. Note that these materials were prepared for a teaching demo I had to do for a job interview.