Previous meetings archive
The schedule for the current time period can be found here.
Fall 2020
The semester has started. Welcome everyone.
What shall we do for the working group during the pandemic? Are you up for another video meeting? I (Alex) am up for it and would be happy to participate or lead, but we’d need to poll for a different time. Drop your thoughts here, or in the Comments below.
Spring 2020
5 February 2020 – Check-in meeting with updates from participants, and wish list for semester topics (see list below).
12 February 2020 – @Alex Buerkle will lead some regular expression gymnastics. For these data wrangling exercises, we will work with regular expressions and | (pipe) and a few UNIX friends (we will save cut, sed, awk, uniq, and sort for another day). Please bring a laptop.
19 February 2020 – @Dylan Perkins (End User Support Manager, ARCC) will give a presentation on "containers and software environments" and lead some hands-on with Conda Environments.
26 February 2020 – @Alex Buerkle will lead some additional regular expression gymnastics. Please enter requests and suggestions on the linked page.
4 March 2020 – Lars Kothoff will lead a discussion of compiled versus interpreted languages, how interpreted languages regularly rely on compiled code, what steps compilers use to optimize code execution, and what really happens when you compile a STAN model and run it from python or R.
For those with less of a computer science background that would like a bit more information about the abstractions required to code from binary (the 0s and 1s we discussed) to interpreted languages, I (Jason) found Crash Course's Computer Science playlist to be super interesting. It's 41 videos, but they do kind of an amazing job of linking our day-to-day interactions with computers to the fundamental elements that make computers actually work.
11 March 2020 – Given that this is an optional meeting, in response to the anticipated spread of COVID19, we are going to suspend meetings of the working group for now. We will see how things look after spring break and evaluate the rest of the semester's schedule. Meanwhile carry on with your other work and stay well.
18 March 2020 – Spring break
future – Hierarchical statistical modeling of mixtures 1 (@Alex Buerkle will lead) – conceptual, graphical, and mathematical introduction to one or a few examples of mixture models
Mixture models are used when we want to assign cases (individuals, samples) to source populations (stone artifacts assigned to known or unobserved source quarries, migrant animals assigned to a birth place), or when we want to assign fractions of an observation to sources (fraction of a diet to different diet items: plants, animals, etc.; ancestry of an individual to different human populations).
future – Hierarchical statistical modeling of mixtures 2 (@Alex Buerkle or someone from his lab will lead) – coding and using a model in JAGS
future – Hierarchical statistical modeling of mixtures 3 (@Alex Buerkle or someone from his lab will lead) – marginalizing the discrete parameters, in JAGS, then Stan
Future topics for spring 2020:
Statistical modeling topics
Genetic programming and symbolic regression in R: A friendly introduction to RGP, which discusses genetic programming using the RGP package (which is archived on CRAN and is no longer being developed, but is probably useful for learning)
Marginalizing discrete parameters in hierarchical Bayesian models (think mixture modeling, for stable isotopes, or population genetics), in JAGS and STAN. See Rao-Blackwellization and discrete parameters in Stan
Hands-on and conceptual learning about sparse models (including following paper by Runge et al.)
Read and discuss: Detecting and quantifying causal associations in large nonlinear time series datasets. Runge et al., Sci. Adv. 2019; 5 : eaau4996 27 November 2019
Continue reading Interpretable Machine Learning
Low-level computing information
What does compilation mean? Interpreted versus compiled code and languages.
Optimization
Simple introduction to Stan focusing on regularization and mixing models.
More visualization with d3 and JavaScript
Teton usage
How to use?
Machine learning
Deep learning – Something that combines ML with inferential modeling
Limitations of deep learning
Is anyone besides @Alex Buerkle interested reading and discussing: Machine Learning: a Probabilistic Perspective by Kevin Patrick Murphy? It is available for free as an ebook from UW library. Here are the leading pages of the book, with table of contents. Has anyone already read it?
Optimization
What is optimization?
How do ecologists vs other disciplines view optimization?
What is an objective function and how does it relate to optimization?
How can we use different objective functions to penalize a problem for different outcomes of interest?
How does least squares or the maximum likelihood relate to the concept of an objective function?
Some optimization algorithms of interest:
Simulated annealing
Integer Linear Programming
Mixed Integer Nonlinear Programming
A list of different optimization problems: https://neos-guide.org/content/optimization-tree-alphabetical
A list of algorithms by optimization type: https://neos-guide.org/content/algorithms-by-type
Fall 2019
A few of our meetings will likely build off of a workshop at UW, an Introduction to data science and machine learning (September 24-26th). Lars Kotthoff will be organizing the workshop, with funding from the EPSCoR project. The principal instructor will be Bernd Bischl, professor of Statistics at LMU Munich and lead author of the mlr package, which is an umbrella package for implementations of machine learning methods in R
We will organize one or more meetings on the topic of fundamentals of business for data scientists (types of business ownership, venture capital, rounds and types of stock offerings, mergers and acquisitions, etc.)
We are interested in having a series of meetings on technologies for visualization. One suggestion was to structure this with an initial, in-person demo/tutorial, a set of tasks to do as homework, followed by a debrief on the homework, with extensions. We could look at D3, Vega, Vega Lite, Voyager 2, etc, many of which have some relationship to the Interactive Data Lab at the University of Washington.
If you have suggestions for people to invite, please add them to our page for suggesting visitors.
4 September – let’s watch a 33 minute video about visualization technologies together and then discuss it and what next steps we might want to take in learning about powerful visualization tools. This is a keynote talk by Jeffrey Heer (https://homes.cs.washington.edu/~jheer/) of the University of Washington.
As a follow-up, here is a more recent paper by the IDL group: Critical Reflections on Visualization Authoring Systems
11 September – Hands-on experimentation with Vega, Vega-Lite, Voyager, and Lyra. If you have time before the meeting, browse one of these in advance and we'll use our meeting to experiment with each and share experiences and impressions with others.
18 September – Hands-on experimentation for specification of interactive graphics with shiny, htmlwidgets, D3 (examples), and observable. Please bring any show-and-tell examples you have or find, and link them below.
Interactive shiny app that displays Historial temperature at U.S. weather stations
networkD3 R package
An ebook on javascript in the UW library: An Introduction to HTML and JavaScript : for Scientists and Engineers
25 September – No meeting. Many people are attending the Introduction to data science and ML workshop
2 October – Debrief about previous week's workshop.
9 October – Browse three books in advance and come ready to discuss what parts we might want to read and discuss as a group in future meetings
Advanced Statistical Computing, by Roger Peng
Efficient R programming, by Colin Gillespie and Robin Lovelace
Advanced R, by Hadley Wickham
16 October – Read 1.6 (Benchmarking and profiling) and Chapter 3 (Efficient programming) from Efficient R Programming, and work examples. We'll highlight interesting bits that we found in our reading.
This is also an interesting blog on efficient accumulation of data from a function in R: http://www.win-vector.com/blog/2015/07/efficient-accumulation-in-r/
This video (Python) is also really interesting in terms of a use case for how recursion and memoization can be used to drastically improve the efficiency of certain functions (e.g., Fibonacci sequence):
For more on memoization in R: GitHub - r-lib/memoise: Easy memoisation for R
A bit more about a kind of recursion, tail recursion, in R (not yet sure how this is different from regular recursion, but it looks interesting): https://tailrecursion.com/wondr/posts/tail-recursion-in-r.html
23 October - Demo + hands-on session. Remember to bring your laptops with code you want to vectorize or parallelize!
Look at the future package in R and its many flavors
@Simon will talk about how to effectively use the package on Teton
Helpful tip to check if your code is running in parallel, the system + user time >> elapsed time when using microbenchmark() or system.time()
Specify plan(multiprocess) and options(mc.cores=n) before using a function that parallelizes your code. For example, future_apply() or furrr::map_dfr()
30 October - We will read different sections from our favorite books (Advanced Statistical Computing, Efficient R programming, Advanced R) and bring lessons to share with everyone else. Please share below what you will be reading. Also please bring by instances of your code that you optimized to run faster, to share with the group.
@Vivaswat Shastry: Chapter 7 (Efficient optimization)
6 November - We will discuss optimization techniques from Chapter 3 of Advanced Statistical Computing. Please bring questions from your reading of the chapter. We can look at the code for these algorithmic implementations during the session.
An old-fashioned PDF (http://www.cs.cmu.edu/~15859n/RelatedWork/painless-conjugate-gradient.pdf) explaining the intuition behind steepest descent and conjugate gradient. A bit mathy but the plots are very useful in learning the mechanism behind the math.
3Blue1Brown also provides a nice visual intuition of how gradient/steepest descent works in the context of machine learning (but can be generalized to just about any other problem):
13 November – EM algorithm
20 November – Let's read parts of the first two chapters of Interpretable Machine Learning (free and online) and discuss them as a group. I think you can readily skip section 1.2 of Chapter, but otherwise, let's read all of Chapter 1 and Chapter 2.
27 November – no meeting, enjoy Thanksgiving Break
4 December – Continue reading Interpretable Machine Learning, in this case Chapters 3 & 4, where it looks like we will get into applications.
11 December – Continue reading Interpretable Machine Learning, in this case Chapter 5 (5.1-5.3).
Spring 2019
13 February – Organization and plan for the semester. Further discussion and work on improving our beginner's guide to use the Teton system. We have only scratched the surface from our whiteboard from our 13 September meeting (see below).
20 February – HPC: An introduction to using SLURM on Teton (UW's high performance computing system) w/ @Alex Buerkle.
27 February – Reproducible research: Using R with Rmarkdown/git/LaTeX/Overleaf w/ @Jessi Rick + discussion of reproducibility. Please feel free to share your own reproducible workflows.
We'll be walking through this tutorial that I've created for interfacing between R and Overleaf via Git
Update: I have uploaded some of my Overleaf templates and the tutorial to my GitHub page (GitHub - jessicarick/resources: Tutorials, templates, etc. for other students ) – please make use of these if they'll make your life easier!
6 March – Reproducible research: Using
make
for reproducibility w/ @Joshua Harrison + more discussion.Resources: A Quick Guide to Organizing Computational Biology Projects
minimal make A minimal tutorial on make (by Karl Broman)
Why use make – by Mike Bostock
13 March – Hands-on session to put reproducible research topics into action, or to get/give one-on-one help with other computing tasks (how to login to and navigate teton, how to submit SLURM jobs, how to set up an initial LaTeX document).
20 March – Spring break
27 March – Discuss A Quick Guide to Organizing Computational Biology Projects and different approaches to organizing the work of individuals, research groups, and larger collaboratives.
3 April – another Hands-on session to put data science into action, to get/give one-on-one help with other computing tasks (how to login to and navigate teton, how to submit SLURM jobs, how to set up an initial LaTeX document).
10 April – in person Q&A with a non-academic data scientist: Fawn Hornsby, Data Infrastructure Lead and Biometrician at WEST, Inc in Laramie (MS in statistics from UW).
17 April – Q&A with a non-academic data scientist: Johan Grahnen, Senior Data Scientist at Microsoft Cloud and AI (PhD in Molecular and Cellular Life Sciences at UW, 2012).
24 April – in person Q&A with a newly non-academic data scientist: Marie-Agnes Tellier, Senior Environmental Statistician at Trihydro Corporation in Laramie (PhD in Statistics at UW, 2018).
1 May – @Joshua Harrison and @Vivaswat Shastry will lead a tutorial and Q&A about the use of git for version control.
8 May – Semester wrap-up, with hands-on help and discussion of what we learned from our Q&As with non-academic data scientists.
Fall 2018
6 Sept – Presentation by @Liz Mandeville and discussion: Machine learning
13 Sept – working meeting in which we'll brainstorm any and all questions about research computing resources at UW, and work to document them and answer some of them in the Knowledge base.
20 Sept – Q&A with a non-academic data scientist: Joseph Murray, PhD in electrical engineering (mid 2000s, applied machine learning) who now works as a principal scientist for FICO (credit scores and services to financial clients. Joe works on methods to detect financial fraud and money laundering).
27 Sept – working meeting related to how to's and questions about research computing resources at UW. Some groups want to step through some demo's of different tasks, and others can translate some of our work from the 13 Sept from the whiteboard to the Knowledge Base.
4 Oct – Q&A with a non-academic data scientist: Alison Appling of the USGS (Tuscon, Arizona); PhD in ecology
11 Oct – another working meeting related to how to's and questions about research computing resources at UW
18 Oct – either another working meeting about research computing, or a conversation with a non-academic data scientist (in this case with a bit more emphasis about their work and future directions in data science; waiting to get confirmation from outside scientist)
25 Oct – short round-robin of topics of interest (20 min), followed by a working demo of how to submit 20 jobs to SLURM on teton using a script.
1 Nov – Q&A with a non-academic data scientist: Justin Abold-Labreche (U.S. Internal Revenue Service; Ph.D. Law / Criminology, Oxford University). Published interview.
8 Nov – Working session: bring a data science problem to share, come to help solve problems, or come listen in and encourage.
15 Nov – Discuss a cool paper: Reverse-engineering ecological theory from data (pdf)
29 Nov – Q&A with a non-academic data scientist: Jimena's friend’s (Sergio Ballester) software developing company that deploys data-gathering drones to improve industrial and agriculture practices (e.g.sugar/coffee/pineapple plantations) in Costa Rica (https://www.indigoia.com), among others.
6 Dec – Watch and discuss a cool talk on Youtube: Mike Bostock – Design is a search problem. We talked about a few different things afterwards, but here are some links to some tech that came up:
minimal make A minimal tutorial on make (by Karl Broman)
Why use make – by Mike Bostock ... We should probably experiment with make a bit. Oh, and I remembered correctly, make is old for software (born in 1977).
Giant set of D3 examples and Bostock's most recent endeavor, interactive javascript notebooks at observable
Another make example, this time to facilitate linking r scripts and latex documents. This example is old, but probably still worthwhile.
Presentations
Data sets of interest
See Google's search tool for data sets (new in September 2018)
Q&A with non-academic data scientists:
Wish list for future topics:
A couple of ideas:
htmlwidgets
: http://www.htmlwidgets.org/.shiny
(http://shiny.rstudio.com/gallery/) + hosting on a website.shiny
in a presentation.Rcpp
.dplyr.
Previously:
On April 4th, let's discuss the following paper: Deep Learning for Population Genetic Inference (Sheehan S, Song YS (2016) PLoS Comput Biol 12(3): e1004845. https://doi.org/10.1371/journal.pcbi.1004845). It provides details of methods, an application of deep learning in biology, and a comparison to Approximate Bayesian Computation. A complementary paper is Supervised Machine Learning for Population Genetics: A New Paradigm, which looks interesting and is a contemporary review, but has fewer details.
No meeting on June 6th, we'll resume on June 13th with coding a linear model and MH from scratch.
On May 30th, we'll discuss this short intro to Bayesian analysis, with its example of how to code Metropolis-Hastings updates in R. Beyond this, in future meetings this summer, we'll build on this first step and code our own MCMC for parameters of a linear model in R, add a reversible jump feature for model selection in the MCMC, and learn how to use Rcpp(and probably RcppGSL or just GSL directly) and code an equivalent MCMC model in C.
On June 13th, we took an initial and somewhat disorganized look at a linear model using MCMC and by-hand Metropolis-Hastings. In this case, this is equivalent to a two-way ANOVA, where we're considering the effect of genotype and planting treatment on phenotype. After our meeting, Alex cleaned up the code some more (including a more interesting simulation; so now there are two) and here's a clearer version. We will likely return to this next week and extend it in some way (e.g., simulating and modeling a continuous covariate). mcmclm.R. Once we're comfortable with this, we'll do a sparse linear model with reversible jump MCMC.
On June 20th, we kept working on mcmclm.R and found that we were doing a terrible job of estimating sigma with the simulated data. The updated version of mcmclm.R fixes this by changing to a uniform prior for sigma.
For June 27th, we'll look at mcmclm.R again, modify it for continuous covariates and work toward reversible jump MCMC, to find evidence for and estimate sparse models. (Alex is going to be away)
Alex will be away July 4 and July 11th too.