Machine Learning
Discussed at meeting on 06 September 2018; notes taken by Libby Megna
Machine Learning - What is it, and what can it offer biologists?
Learning Objectives:
- Define machine learning
- Understand the distinction between supervised and unsupervised machine learning
- Identify the limits of prediction and classification schemes
Definition: Machine learning enables computers to do tasks without being explicitly programmed for those tasks.
Neural networks are one type of machine learning, but they do not encompass the whole field
Machine learning includes:
- linear regression
- logistic regression
- decision trees
- support vector machines
- naive Bayes
- k-nearest neighbors
- k-means
- random forest
- dimensionality reduction (e.g. PCA)
Machine learning algorithms are ubiquitous in commercial applications, e.g. Netflix, Facebook, email spam filters
Machine learning has lots of potential in biology: medical imaging (classifying MRI images or classifying cells) and wildlife management (classifying camera trap images; e.g. a paper from the UW Comp Sci dept (Clune? sp?))
Cool guide: A visual introduction to machine learning (super cool visualizations)
A model often doesn't perform as well on test data as it did on the training data
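To make the training vs. test gap concrete, here is a minimal sketch in Python (standard library only). Everything in it is an assumption for illustration, not something shown at the meeting: the data are simulated from two overlapping classes, and the model is a 1-nearest-neighbor classifier, chosen because it memorizes its training data. The names make_points, predict_1nn, and accuracy are hypothetical helpers.

```python
import random

def make_points(rng, n_per_class):
    """Simulate two overlapping 1-D classes: class 0 centred at 0, class 1 at 1."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            data.append((rng.gauss(float(label), 1.0), label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of points in data that the 1-NN model labels correctly."""
    correct = sum(1 for x, label in data if predict_1nn(train, x) == label)
    return correct / len(data)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_points(rng, 50)    # data the model is allowed to see
test = make_points(rng, 50)     # new data drawn from the same process

train_acc = accuracy(train, train)  # each training point is its own nearest neighbor
test_acc = accuracy(train, test)

print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Because every training point is its own nearest neighbor, the training accuracy is perfect by construction; the test accuracy is the more honest estimate of how the model would do on new data.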
Supervised vs. unsupervised machine learning
- Supervised ML requires labeled training data
- Unsupervised ML looks for patterns within the data without labeled categories
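The distinction can be sketched in a few lines of Python (standard library only). This is an illustrative toy, not the exercise from the meeting: the candy weights are made-up numbers, the supervised method is a simple nearest-centroid classifier, the unsupervised method is a bare-bones 1-D k-means, and all function names are hypothetical.

```python
def nearest_centroid_fit(values, labels):
    """Supervised: use the labels to compute one mean per labeled class."""
    sums, counts = {}, {}
    for v, lab in zip(values, labels):
        sums[lab] = sums.get(lab, 0.0) + v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def nearest_centroid_predict(centroids, value):
    """Assign a new value to the class whose mean is closest."""
    return min(centroids, key=lambda lab: abs(centroids[lab] - value))

def kmeans_1d(values, k=2, iterations=20):
    """Unsupervised: find k cluster centres (k >= 2) without using any labels."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(centres[i] - v))
            clusters[idx].append(v)
        # Recompute each centre; keep the old one if a cluster ends up empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    assignments = [min(range(k), key=lambda i: abs(centres[i] - v))
                   for v in values]
    return centres, assignments

# Made-up candy weights in grams (illustrative only).
weights = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
labels = ["small"] * 4 + ["large"] * 4

# Supervised: labels are given, and new items get one of those labels.
centroids = nearest_centroid_fit(weights, labels)
predicted = nearest_centroid_predict(centroids, 4.5)

# Unsupervised: only the weights are given; groups come back as numbers.
centres, assignments = kmeans_1d(weights, k=2)

print("supervised prediction for 4.5 g:", predicted)
print("unsupervised cluster assignments:", assignments)
```

Note that the unsupervised result only says which items group together; it is up to the person to decide what the clusters mean.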
Classifying candy ("organisms") activity - discussion points:
- Which procedure was supervised and which unsupervised? (Procedure A was supervised, Procedure B was unsupervised)
- How does sample size or variation in the training set affect the outcome? (If your training dataset is different from your test dataset, accuracy may decline.)
Connections to biological problems
- Would inferring a phylogeny be a supervised or unsupervised problem? You could make arguments for both, but Liz argues that it is unsupervised because you don't know the actual evolutionary relationships before you start.
- What about using DNA barcoding to identify thousands of individuals to species? DNA barcoding for animal species relies on using a few hundred base pairs of mitochondrial DNA to identify a species.
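One way to picture the barcoding question is as nearest-match classification against a labeled reference library. The Python sketch below (standard library only) is a toy under stated assumptions: the sequences are invented and only 12 bases long rather than a few hundred, they are assumed to be already aligned, the species names are placeholders, and Hamming distance stands in for whatever distance a real barcoding pipeline would use.

```python
def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def identify(query, reference):
    """Return (species, distance) for the closest reference barcode."""
    best = min(reference, key=lambda name: hamming(query, reference[name]))
    return best, hamming(query, reference[best])

# Hypothetical reference library: species name -> made-up barcode.
reference = {
    "species_A": "ACGTACGTACGT",
    "species_B": "ACGTTCGAACGA",
    "species_C": "TTGTACGGACCT",
}

# Hypothetical barcode sequenced from an unknown individual.
query = "ACGTACGTACGA"

species, distance = identify(query, reference)
print(f"closest match: {species} ({distance} mismatch(es))")
```

Framed this way, the reference library plays the role of labeled training data, so an individual from a species missing from the library would still be assigned to its nearest listed species.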
Applying machine learning to biology
- Defining your question well is essential
- There is a potential trade-off between prediction and mechanistic understanding: you could get good predictions but have no idea of the underlying mechanisms
- What are your goals, and what are your data like?
No shortage of easy-to-use ML tools in R
- nnet
- randomForest
- CART
- kmeans
- lda (MASS)
- lm (base R)
- prcomp (base R)
- See also https://cran.r-project.org/web/views/MachineLearning.html
~~~
Random notes by Libby
I think this is a good read on the trade-off between predictive power and simplicity/mechanistic understanding: Breiman 2001
Deep learning is just a neural net with more layers
From Liz: Here is the presentation PDF and the candy-based exercise. Note that these materials were prepared for a teaching demo I had to do for a job interview.