In this working group we learn about methods and ideas for confronting models with data, and about data science generally. Anyone is encouraged and welcome to attend.
In fall 2021, we will meet online again, using Zoom, Wednesdays 11-noon (Mountain Time).
This time was chosen based on a poll (http://whenisgood.net/gc33wpw) to which 57 people responded. That’s a great response. Unfortunately, it means that our best time works for 35 of us while 22 have a conflict.
We can also consider a second weekly meeting slot for the many interested people who won’t be able to make this time. People are welcome to use these webpages on Confluence to organize additional interest groups.
Meeting schedule and topics:
Fall 2021
– Organizational meeting
develop list of topics for the semester
pair and share – break out into smaller groups to share what you have been up to with respect to data science and what you’ve been wanting to learn, and to get to know other people in the group.
– Lars Kotthoff will do a brief intro to mlr3pipelines
– Dylan Perkins (ARCC/End User Support): Intro to shared computing at UW – Teton compute and storage resources (video is below, slides are in https://docs.google.com/presentation/d/145AVEOLHi22CPn0IwpLkVCNWJ7ZzGJq1fZDWt4IBE8I/edit?usp=sharing).
As a follow-up, please contact arcc-info@uwyo.edu with questions.
Please drop them a line if you are interested in testing the new browser-based graphical interface they are developing for the Teton compute resources. They would appreciate having several people test the system.
– Hands-on with Teton
Tasks you would like help with or demonstrated, from entry-level to advanced
we did one big group screen share to show:
how to configure ssh so that we provide our Teton password and two-factor validation (2FA) only once per session
how to launch a SLURM interactive session to do some text wrangling with UNIX command line tools
demonstrate two ways of using text editors to write bash scripts that can be executed on Teton; this included making a SLURM-compliant script that we submitted with sbatch
our script demo’d the use of /dev/shm, /lscratch, and /gscratch for simulation output and how to move data home and clean up after yourself at the end
We did not get to demo how to install R packages yourself. Instead, I started a Knowledge Base entry on this, which you are welcome to add to, edit, and improve.
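As a reminder of the pieces above, here is a minimal sketch of a SLURM-compliant script of the kind we submitted with sbatch. The account name and resource requests are placeholders, and mktemp stands in for the /dev/shm or /lscratch paths we used in the demo; the #SBATCH lines are read by SLURM but are plain comments to bash, so the same file also runs directly.

```shell
#!/bin/bash
#SBATCH --account=yourproject   # placeholder; use your ARCC project name
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Write intermediate output to fast node-local scratch (mktemp keeps this
# sketch runnable anywhere; on Teton you might use /dev/shm or /lscratch).
SCRATCH=$(mktemp -d)
seq 1 100 > "$SCRATCH/sim_output.txt"            # stand-in for simulation output
wc -l < "$SCRATCH/sim_output.txt" > result.txt   # summarize before leaving the node
rm -rf "$SCRATCH"                                # clean up after yourself
```

Submit with `sbatch myscript.sh`, or test it locally with `bash myscript.sh`.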
– Short demonstration of LaTeX as implemented in Overleaf, followed by a hands-on session for participants to sign up for a free account, make documents with one or more templates, and ask questions of more experienced users. Alex Buerkle will do the initial demo and will ask for helpers to assist others in the hands-on session.
Overleaf features:
the direct submission button (arxiv, biorxiv, society journals, etc.)
built-in version control
built-in multi-author, shared editing
LaTeX features:
references to figures, tables, supplementary sections
citation management and bibliography
sectioning
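The LaTeX features listed above can be sketched in a minimal document like the one below (the labels, citation key, and .bib file name are made up for illustration):

```latex
\documentclass{article}
\usepackage{natbib}
\begin{document}

\section{Results}\label{sec:results}
Allele frequencies diverged among populations
(Figure~\ref{fig:freqs}; see also \citealt{ogle2020}). % hypothetical citation key

\begin{figure}
  \centering
  % \includegraphics{freqs.pdf}  % figure file is hypothetical
  \caption{Allele frequencies by population.}\label{fig:freqs}
\end{figure}

\bibliographystyle{plainnat}
\bibliography{refs}  % refs.bib is hypothetical

\end{document}
```

In Overleaf this compiles with the built-in button; the \ref and \cite commands resolve automatically across recompiles.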
Follow-up:
– an introduction to Bayesian modeling. Eryn McFarlane and Topher Weiss-Lehman will discuss the basics of Bayesian thinking and talk about why one might want to use Bayesian methods.
– an introduction to computational Bayesian modeling (Andrew Siefert, Joshua Harrison). Why do computers help when doing Bayesian statistics? What do sampling and convergence mean? A high-level overview of the different tools one can use to do Bayesian statistics. We will finish with an illustration of a model specified with R and Stan so that folks can get an idea of the modeling process.
Animation of samplers for Bayesian modeling: https://chi-feng.github.io/mcmc-demo/app.html
HERE is a git repo that has the code for the mini-talk that Josh gave. We can keep posting Bayesian material here if we want. Feel free to submit pull requests. If you have not used git, you can go to that link, view the different files, and download them as you like.
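As a tiny illustration of what the samplers in the animation above are doing, here is a minimal random-walk Metropolis sampler in base R for a simple allele-frequency problem (a sketch I put together, with a flat prior and a hand-picked proposal width; not from Josh's talk):

```r
# Random-walk Metropolis for p given x = 160 successes in n = 200 trials.
set.seed(1)
log_post <- function(p) dbinom(160, 200, p, log = TRUE)  # flat prior on (0,1)

n_iter <- 5000
chain <- numeric(n_iter)
p_cur <- 0.5                                   # arbitrary starting value
for (i in 1:n_iter) {
  p_prop <- p_cur + rnorm(1, 0, 0.05)          # propose a nearby value
  if (p_prop > 0 && p_prop < 1 &&
      log(runif(1)) < log_post(p_prop) - log_post(p_cur)) {
    p_cur <- p_prop                            # accept; otherwise keep current
  }
  chain[i] <- p_cur
}
mean(chain[-(1:1000)])                         # posterior mean after burn-in, near 160/200
```

Plotting `chain` as a trace plot is the quickest visual convergence check; the Stan and JAGS tools we will discuss automate all of this.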
– Breakout groups for hands-on and Q&A regarding Bayesian models for parameter estimation and inference. Request a group below. We’ll add one or two more at the beginning of the meeting.
…
– Reproducible research with R, Git, LaTeX, etc. Jessi Rick & others welcome to join in
let Jessi know if you have workflows/ideas that you’d like to add to the discussion
– …
In the queue to place on the schedule:
Reproducible research with workflow managers like Nextflow, and visit from bioinformatician at Wyoming Department of Health
Wish list for topics for fall 2021
See page that lists previous and potential people to invite to talk about careers in non-academic data science (requires login).
Intro to shared computing at UW – Teton compute and storage resources, with demos by request (how to …?) (on the schedule)
Machine learning
More with mlr3 - I’ve (J. Harrison) been using the software for simple ML tasks on manageable data, but am curious how to scale up to larger data and even if mlr3 is the right tool for that task.
Scaling up machine learning. Would like to do a hands-on project on some really large data (leaving this vague, so we can find something useful to the broader group).
A more introductory overview of machine learning for beginners. Maybe someone can give some background for part of a group meeting, and several of us can share how we are implementing machine learning (or would like to) in our own research? Joshua Harrison
Text mining intro - a basic primer of questions and tools and maybe a follow-up if anyone is interested in digging deeper. Maybe with a focus on how the methods have been applied to biological questions, rather than social science questions.
Machine learning pipelines – I (LK) could do a brief intro to mlr3pipelines (https://mlr3pipelines.mlr-org.com) (on the schedule)
Bayesian methods that don’t involve MCMC: variational inference, INLA, ABC
A primer of what Bayesian stats are and why one might want to use them.
Bayesian machine learning/neural networks
STAN (super basic, please?)
Overview of various statistical tests (when to use what for what purpose, etc.)
Practicing with loops and apply functions in R
Regular expressions tutorials/practice
Approximate Bayesian computation, see https://www.pnas.org/content/104/6/1760
Parallelization in R
Intermediate bash tips & tricks
Bayesian multilevel modeling (using the brms R package or other methods)
Nonlinear modelling (frequentist and Bayesian approaches)
Compositional data analysis, potentially as applied to microbiome data
Functional programming in R
replicable data cleaning and manipulation
Intro to replicability in data science (e.g., GitHub, etc.)
Best practices/how-to for sharing code, data, etc. (e.g., on GitHub)
Database management
Webscraping
Interactive plotting
Spatial analysis
A primer on Overleaf, LaTeX, R Markdown, and integration of these tools.
Math for machine learning/statistics: linear algebra, probability, calculus
Spatial data
Series of short (5-10 minute) presentations by someone on their research with Q&A afterwards
Effective story-telling and visualization for scientific research with applications in R
How to make a dashboard, with an example for hands-on learning.
General tools for API calls in Python or R.
Machine learning collaborative group (see HERE for more)
Causal inference (see this page if you’re interested in a reading group)
Spring 2021
10 February 2021 – organizational meeting
17 February 2021 – Introduction to mlr3 – Damir Pulatov (UW Computer Science graduate student)
In advance of our meeting, please read this very short introduction to the use of R6 in R (one form of object-oriented programming in R) from the mlr3 book (~5 minutes).
24 February 2021 – profiling and comparison of code running on data.table versus data.frame.
– Alex Buerkle will present a short primer on Bayesian estimation and inference. We will do a bit of hands-on with posterior probability distributions. Beyond conceptual and math foundations, we will talk about computational methods for parameter estimation (MCMC, HMC, variational inference).
Below is R code to accompany the beta distribution of allele frequencies in a population. This uses the closed-form solution for P(p|x,n) ∝ P(x|p,n)·P(p), where the product of a binomial likelihood and a beta prior is a new beta distribution.
(160+1)/(160+1 + 200-160+1)        # expectation with P(p)=beta(1,1)
(160+0.1)/(160+0.1 + 200-160+0.1)  # expectation with P(p)=beta(0.1,0.1)

p <- seq(0, 1, 0.001)
plot(p, dbeta(p, shape1=160+1, shape2=40+1), type="l")

par(mfrow=c(2,1))
plot(p, dbeta(p, shape1=160+1, shape2=40+1), type="l", xlim=c(0.5, 1), col="red")
abline(v=qbeta(c(0.025, 0.975), shape1=161, shape2=41), col="red")
abline(v=qbeta(c(0.025, 0.975), shape1=16+1, shape2=4+1), col="blue")
lines(p, dbeta(p, shape1=16+1, shape2=4+1), col="blue")

par(mfrow=c(3,1))
plot(p, dbeta(p, 1, 1), type="l")      # beta(1,1)
plot(p, dbeta(p, 0.1, 0.1), type="l")  # beta(0.1, 0.1)
plot(p, dbeta(p, 160+1, 200-160+1), type="l")
– No meeting. Consider reading Chapters 1 & 2 of Statistical Rethinking (2nd edition) if the introduction to Bayesian inference last week was new to you, or if you want a refresher. Chapter 2 is particularly good. If you want code in various languages from the second edition, you can find it here (R, python, Julia, etc.).
– Alex Buerkle will lead a discussion and some hands-on work to help learn how to marginalize discrete parameters in hierarchical Bayesian models, in JAGS and Stan. The example application pertains to modeling the distribution of a type of organism across sampling sites. The idea applies to any detection problem, where a thing might exist at a point or not (a Bernoulli variable); many ecologists are interested in these ‘occupancy models’. For example, this type of model could apply to monitoring for an invasive species, like zebra mussels or New Zealand mud snails, at monitoring sites. The data can be discrete, but discrete parameters that are inferred from the data cannot be used directly in a Stan model (because discrete distributions are not differentiable).
In advance, please read over Marginalized occupancy models: a how-to for JAGS and Stan, and try out the code to the extent that it interests you. Please come with questions. We will use this webpage and code as our basis for discussion. Note, JAGS models look like R code, but are not and are instead a model specification.
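To see concretely what “marginalizing over the discrete occupancy state” means, here is a sketch of the marginalized likelihood for a single site in plain R (the function name and example numbers are my own, not from the linked how-to): with occupancy probability psi and per-visit detection probability p, summing over the unobserved occupied/unoccupied state z gives P(y) = psi · prod(p^y (1−p)^(1−y)) + (1−psi) · [all y = 0].

```r
# Marginalized occupancy likelihood for one site's detection history y (0/1 vector):
# an occupied site (prob psi) yields Bernoulli(p) detections on each visit;
# an unoccupied site (prob 1 - psi) can only yield an all-zero history.
loglik_site <- function(y, psi, p) {
  log(psi * prod(dbinom(y, 1, p)) + (1 - psi) * as.numeric(all(y == 0)))
}

# Three visits, no detections: 0.6 * 0.5^3 + 0.4 = 0.475
exp(loglik_site(c(0, 0, 0), psi = 0.6, p = 0.5))
```

This sum over z is exactly what the JAGS version samples explicitly and what the Stan version must express in closed form.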
Additional resources:
If you are totally new to Bayesian analysis, consider reading Chapters 1 & 2 of Statistical Rethinking, as I suggested for last week
For a more explicit and mathematical notation-heavy presentation of many of the same ideas, see A step-by-step guide to marginalizing over discrete parameters for ecologists using Stan
Here is some of Mikey’s Stan code to run a pathogen prevalence model (same model structure as occupancy) that accounts for both false positive and false negative error rates. It’s written a little differently than how we discussed things yesterday, because I took advantage of the log_mix function, which I found more efficient when using the two detection parameters.
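For reference, Stan’s log_mix(theta, la, lb) returns log(theta·exp(la) + (1−theta)·exp(lb)), computed stably on the log scale. Here is a sketch of an R equivalent using the log-sum-exp trick (my own translation for illustration, not Mikey’s code):

```r
# Stable two-component log mixture: log(theta*exp(la) + (1-theta)*exp(lb)).
# Factoring out the max term m avoids underflow for very negative la, lb.
log_mix <- function(theta, la, lb) {
  m <- max(la + log(theta), lb + log1p(-theta))
  m + log(exp(la + log(theta) - m) + exp(lb + log1p(-theta) - m))
}

log_mix(0.3, log(0.5), log(0.1))  # equals log(0.3*0.5 + 0.7*0.1) = log(0.22)
```

This is the same mixture that appears in the marginalized occupancy likelihood, with theta playing the role of psi.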
– Continuation of Bayesian modeling – Discussion of a paper: Ensuring identifiability in hierarchical mixed effects Bayesian models. Ogle and Barber 2020. Ecological Applications. 30(7):e02159. (pdf).
– Nominally spring break at UW. Initially I overlooked that this week is part of spring break, so I have bumped our scheduled discussion to April 7 and we will not plan to meet this week.
– Discussion of part of Chapter 5 (Model-Agnostic Methods) of the book Interpretable Machine Learning (most recently updated on 22 March 2021!). Please read the lead-in to the chapter. Additionally, please read sections 5.7-5.10 (or 5.7 and 5.9 only if pressed for time and skim the others).
related to this topic, see A translucent box: interpretable machine learning in ecology – Lucas – Ecological Monographs (this paper is not as long as you’d expect based on the journal)
– presentation by John Calder on Bayesian mixture modeling. Until recently John was a postdoc at UW, now he works as a data scientist for MX (a financial services company).
– Lars Kotthoff will present a brief intro on automated machine learning – bring your hyperparameters.
Below is R code that we’ll use as basis for discussion – first doing some hyperparameter tuning “manually”, then with mlr and mlrMBO.
For comparing different learners, see https://mlrmbo.mlr-org.com and https://mlr3book.mlr-org.com/benchmarking.html. For avoiding overfitting of hyperparameters, see nested resampling in the mlr3 book.
A free online automated machine learning course (with Lars among the instructors) just came online.
# tune two parameters of the rpart classification tree learner
require(rpart)
train.indices = sample(1:150, 150 * 2/3)
test.indices = setdiff(1:150, train.indices)
evalParams = function(...) {
  model = rpart(Species ~ ., iris[train.indices, ], ...)
  preds = predict(model, newdata = iris[test.indices, -5], type = "class")
  return(list(pars = list(...),
              acc = sum(preds == iris[test.indices, "Species"]) / length(preds)))
}
pars = expand.grid(minbucket = 1:20, minsplit = 1:20)
res = lapply(1:nrow(pars), function(i) do.call(evalParams, as.list(pars[i,])))
best = which.max(sapply(res, function(x) x$acc))
res[[best]]

# ...and now with mlr/MBO
# adapted from https://mlrmbo.mlr-org.com/articles/supplementary/machine_learning_with_mlrmbo.html
require(mlr)
require(mlrMBO)
# tune the same parameters
par.set = makeParamSet(
  makeIntegerParam("minbucket", 1, 20),
  makeIntegerParam("minsplit", 1, 20)
)
ctrl = makeMBOControl()
tune.ctrl = makeTuneControlMBO(mbo.control = ctrl, budget = 10)
res = tuneParams(makeLearner("classif.rpart"), iris.task, cv3,
                 par.set = par.set, control = tune.ctrl)
plot(cummin(getOptPathY(res$opt.path)), type = "l", ylab = "mmce", xlab = "iteration")
– Work and help session
people will document, edit, call out unclear sections, or request documentation
how to use teton
how to do specific scientific computation tasks (on teton or elsewhere).
people will get or give assistance or a demonstration of how to do specific things
we will also use break out rooms to assist with specific tasks (please invite your colleagues to help and get help).
– 🌟 Demo day 🌟
we will do some short demos of different technologies
cloud computing with AWS – we can spin up some different disk images using AWS EC2. There are 400 different instances to choose from. Ideally several of us will get free sessions started on different instances and report back. The interest here is that many companies do their compute in the cloud rather than on shared academic systems. We are fortunate that UW supports our very good shared system within ARCC.
⭐ if this looks useful for you to experiment with, please sign up for an AWS account. This is free, but they do use a credit card to verify your identity. There is a fair amount of information to share to get set up, and it takes several minutes after all the info is entered, so perhaps do this in advance.
example of running R on a linux instance or alternatively https://www.louisaslett.com/RStudio_AMI/
AWS Cloud Credit for Research Program – grants program, students are eligible
Regarding the question of easier access to cloud computing tools, you might be interested in the Discovery Environment at Cyverse, in addition to the more standard cloud computing tools that are available for free to academic researchers.
if there is interest:
Docker – demo a docker image of MISO, the LIMS software for the GTL.
wrap_slurm.pl – a program to launch many SLURM jobs on the cluster that need different inputs (a complement to SLURM arrays).
… other requests or suggestions …
⭐ Thanks everyone for participating this spring semester. Let’s call this semester complete and plan to return in the fall semester, very possibly in person in Aven Nelson 210. Feel free to use the data science Confluence space over the summer break to share and organize things. All the best for your research over the summer and for recharging any depleted energy reserves.
Ideas for spring 2021
Many additional ideas can be found on the Previous meetings archive and The "queue" pages.
Short research topic presentations with a key methodological or technical illustration from group members. Please indicate your interest in giving a presentation and leading a discussion below.
Reading and study groups
Who would be interested in discussing the 2nd edition of Statistical Rethinking?: Alex Buerkle, Joanna Blaszczak, Andrew Siefert, Erin Bentley, Shannon E Albeke, Bridger Huhn, …
Other topics?
Host non-academic data scientists to visit our group and discuss careers, technical things, and other topics of mutual interest
Possible statistical modeling and machine learning topics
Genetic programming and symbolic regression in R: A friendly introduction to RGP, which discusses genetic programming using the RGP package (which is archived on CRAN and is no longer being developed, but is probably useful for learning)
Hands-on and conceptual learning about sparse models (including the following paper by Runge et al.)
Read and discuss: Detecting and quantifying causal associations in large nonlinear time series datasets. Runge et al., Sci. Adv. 2019; 5 : eaau4996 27 November 2019
Continue reading a few sections from Interpretable Machine Learning
read some sections of Machine Learning: a Probabilistic Perspective by Kevin Patrick Murphy? It is available for free as an ebook from UW library. Alternatively, he is completing two additional books, including Probabilistic Machine Learning: An Introduction.
Non-centered parameterizations
Simple, univariate models
Hierarchical models (including understanding Cholesky factorization)
Non-multivariate normal alternatives to non-centered hierarchical models
Wondering about talking with ARCC about methods for reducing I/O bottlenecks. I run into this most often with large raster datasets and multithreading.
AWS - a primer and practice usage. A lot of jobs want people to have experience with AWS. Maybe we could set up a UWYO account that lets people practice and see what it is about.
Docker - At some point Jason did a show and tell on Docker (search for Docker on this website to find his pages), but it might be worth revisiting how to use containers from a practical standpoint. What I mean is: how to load up and use containers that others have built. Most of us, when starting a new job, would probably be using existing Docker containers instead of making our own. I assume using these is pretty straightforward, but I have never actually done it. It would be useful to do so.