Confronting models with data

 

 

 

In this working group we learn about methods and ideas for confronting models with data, and about data science generally. Everyone is welcome and encouraged to attend.

Meeting schedule and topics:

In the Spring 2023 semester, we will meet in person again, Fridays 1pm in EN 2101.

Spring 2023

  • Feb 3, 2023 We’ll have an informal planning meeting. Come along and let us know what you’d like to see in the Data Science Meetup. There’ll be muffins.

  • Feb 24, 2023 “Finding Ice Age houses in Colorado and other problems in archaeological data analysis”, presented by Todd Surovell (Anthropology).

  • Mar 3, 2023 “Using adaptive prior regularization to quantify species interactions in diverse communities”, presented by Christopher Weiss-Lehman (Botany).

  • Mar 24, 2023 “Genomic approaches to studying North American snake diversity”, presented by Sean Harrington (INBRE).

  • Apr 7, 2023 “Adventures in hierarchical Bayesian modeling”, presented by Andrew Siefert (Botany). For this one, we’ll be on Zoom at https://uwyo.zoom.us/j/95012886827

  • Apr 14, 2023 “Efficiently Mobilizing Data for Decision Making -- Implications for Data Science Curriculum Design”, presented by Timothy Robinson (Maths and Stats).

  • Apr 28, 2023 “Data in Hydrology - More than just Streamflow”, presented by Fabian Nippgen and Salar Jarhan (Ecosystems Science and Management).

  • May 5, 2023 "Leveraging Data of Large-Scale High-Fidelity Numerical Simulations in Wind Energy Applications", presented by Andrew Kirby (School of Computing).

Fall 2022

We’re going to try something different this semester and have a series of presentations from faculty and students across UW to see the breadth of data science (in the widest sense). The format is a short-ish talk/demo, followed by questions – we want this to be as interactive as possible! In the Fall 2022 semester, we met at 1 p.m. Fridays (Mountain Time) on Zoom.

If you’re interested in contributing a talk/demo/?, get in touch with @Lars Kotthoff.

  • Aug 26, 2022 “Physics-informed neural-networks (PINN) for solving linear elastic solid mechanics problems: a comparison with finite element method”, presented by Min Lin and Xiang Zhang (Mechanical Engineering).

  • Sep 2, 2022 “Quantitative Structure-Property Relationships of COFs for Desalination”, presented by Ali Davies and Laura de Sousa Oliveira (Chemistry).

  • Sep 9, 2022 “Rapid Introduction to Spatial R”, presented by Shannon Albeke (WyGISC). You can download the script and data here: Albeke_Example

  • Sep 16, 2022 “Hands-On Introduction to Deep Learning”, presented by Mehdi Nourelahi (EECS).

  • Sep 23, 2022 “Flow and transport in large-scale complex permeable media”, presented by Saman Aryana (Chemical and Biomedical Engineering). Note that this will be at 2pm instead of 1pm!

  • Sep 30, 2022 “Dynamic Analysis of Ecological Restoration of damaged landscapes with Chronosequences, and building Models to Generate Alternatives for policy makers”, presented by Roger Coupal (Agricultural and Applied Economics).

  • Oct 7, 2022 “Deep learning methods for geophysical inverse problems and applications to subsurface characterization”, presented by Dario Grana (Geology and Geophysics) and Mingliang Liu (Stanford).

  • Oct 14, 2022 “Connecting Data Science and the Sustainable Built Environment”, presented by Liping Wang (Civil and Architectural Engineering).

  • Oct 21, 2022 No talk.

  • Oct 28, 2022 No talk, again.

  • Nov 4, 2022 No ta… wait, no, actually there will be a talk. “Data-intensive freshwater ecology”, presented by Sarah Collins (Zoology and Physiology).

  • Nov 11, 2022 “Divide and Conquer: Partitioning the Sky with the Hierarchical Equal Area isoLatitude Pixelisation”, presented by Adam Myers (Physics and Astronomy).

  • Dec 2, 2022 (moved from Nov 18, 2022) “Invasion informatics! Harnessing the big data revolution to overcome invasive species”, presented by Kelsey Brock (Plant Sciences).

Spring 2022

  • Jan 19, 2022 – Building an R package (for CRAN, GitHub, or Bioconductor).

  • Jan 26, 2022 – In advance of our meeting, please watch https://youtu.be/dag9l0GFci8 from the 2021 useR conference and come ready to discuss. This 40-minute talk (plus ~15 minutes of Q&A) covers a diversity of topics and provides some food for thought. See also The United States Research Software Engineer Association and their job board.

  • Feb 2, 2022 – @Damir Pulatov – How to write unit tests for your R code and why you should do it.

    • Brief shoutout - useR is an R conference and there is a call to offer tutorials (link goes to the useR page) at the conference. Proposals are due by Feb 2015. Might be a way to ‘level up’ for some of us. Suggested topics are broad and introductory (e.g., git with R, data visualization basics, etc.)

  • Feb 9, 2022 – … R, Iteration in R – loops, apply, map functions, functional(s), and programming in R. @Eryn McFarlane – back to the very basics. link here

  • Feb 16, 2022 – No meeting this week. Consider watching this recorded talk instead: https://www.youtube.com/watch?v=B7TBfJrofQM

  • Feb 23, 2022 – visit with Nick Anderson, security engineer at Facebook, undergrad degree in math from UW, Wyoming native. We will learn about his education and career path and ask him questions like (please bring your own too):

    1. what advice would you give to yourself as a finishing undergrad at UW, what do you wish you had known at that point?

    2. what things would you encourage students and postdocs to do to prepare themselves for a career in security, or more broadly a career in a technology company?

    3. what are some of the main areas in which technology companies deal with security?

    4. what is your area of speciality within security and how has this changed over time?

    5. what are areas that you are excited to learn more about in security?

    6. in what ways do machine learning experts and data scientists contribute to security teams?

  • Mar 2, 2022 – Development environments (a.k.a. pimp my rIDE)

    • Discussion of last week’s meeting with Nick Anderson.

    • Code editors – Share, demo, try out different IDEs / editors: Visual Studio Code, BBEdit, emacs (read and write files remotely with built-in ssh support: /ssh:teton:tmp.txt; split screens and multiple windows on same buffer), vi, RStudio, Atom, Eclipse.

  • Mar 9, 2022 – Some learning about Apache Arrow. In advance, please watch the short talk by Wes McKinney (25 min, demo starts around 18 min) and look over this short introduction.

  • Mar 16, 2022 – Spring break, no meeting.

  • Mar 23, 2022 – Interactive visualizations with Shiny – demo and hands-on – @Heili Lowman. Here is the Shiny app that @Heili Lowman built in the demo.

  • Mar 30, 2022 and remainder of spring semester – we are going to put the working group on hold for the rest of the semester and will see about regrouping again in fall 2022. Thanks everyone for participating and sharing your interest and enthusiasm for learning things together. Don’t forget the lesson about getting accustomed to failing. Get out there and push yourself and see what you can do. And have fun doing it!
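
For the Feb 9 topic on iteration in R, a minimal sketch of the same task done with a for loop, with vapply, and with Map might look like the following. It uses base R only; the data frame and variable names are made up for illustration and are not taken from the session.

```r
# Three base-R ways to iterate over the columns of a data frame.
# The data are hypothetical and chosen only for illustration.
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 5))

# 1. for loop with a preallocated, named output vector
means_loop <- numeric(ncol(df))
names(means_loop) <- names(df)
for (j in seq_along(df)) {
  means_loop[j] <- mean(df[[j]])
}

# 2. vapply: like sapply, but the type and length of each result is declared
means_vapply <- vapply(df, mean, FUN.VALUE = numeric(1))

# 3. Map: iterate over two inputs in parallel and return a list;
#    here each column is centered on its own mean
centered <- Map(function(col, m) col - m, df, means_vapply)
```

The loop and vapply versions give the same named vector of column means. Map is the base-R counterpart of the two-input map functions found in add-on packages.
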


We are inviting some speakers to join us this semester, with a few different potential topics including computer security, value systems associated with data sharing, data science in biotechnology companies, careers in data science, etc. If you have suggestions or requests, please add them here or in the wishlist below.

Wish list for topics

Please add and edit this list, including expanding the plan and volunteering to lead on one of these topics.

  • Short (5-10 minute) presentations by group members on data science related to their research with Q&A afterwards

  • See page that lists previous and potential people to invite to talk about careers in non-academic data science (requires login).

  • Reproducible research with workflow managers like Nextflow, and visit from bioinformatician at Wyoming Department of Health

  • Compositional data analysis

  • Machine learning

    • Scaling up machine learning. Would like to do a hands-on project on some really large data (leaving this vague, so we can find something useful to the broader group).

    • A more introductory overview of machine learning for beginners. Maybe someone can give some background for part of a group meeting, and several of us can share how we are implementing machine learning (or would like to) in our own research? @Joshua Harrison

    • Bayesian machine learning/neural networks

  • Computational statistics

    • Bayesian multilevel modeling (using the brms R package or other methods)

    • Bayesian methods that don’t involve MCMC: variational inference, INLA, ABC

    • Nonlinear modelling (frequentist and Bayesian approaches)

    • Compositional data analysis, potentially as applied to microbiome data – transformations, and considerations (useful for those working with 16S sequence data), or other bioinformatics-related topics if there is enough demand. @Scott Klasek @Melissa DeSiervo I could provide perspective / share some stuff about compositional data that is not 16S related. I’ve been working with this “ecotraj” package in R to perform some analyses on speed and directionality of community shifts https://cran.r-project.org/web/packages/ecotraj/vignettes/IntroductionETA.html

    • Spatial analysis

  • Text mining intro - a basic primer of questions and tools and maybe a follow-up if anyone is interested in digging deeper. Maybe with a focus on how the methods have been applied to biological questions, rather than social science questions.

  • Linux

    • Intermediate bash tips & tricks

    • replicable data cleaning and manipulation

  • How to make your own SQL database

  • Webscraping

  • Visualization

    • Interactive plots

    • How to make an html-based dashboard, with an example for hands-on learning.

    • Effective story-telling and visualization for scientific research with applications in R

  • Math for machine learning/statistics: linear algebra, probability, calculus


Fall 2021


Spring 2021

  • 10 February 2021 – organizational meeting

  • 17 February 2021 – Introduction to mlr3 – Damir Pulatov (UW Computer Science graduate student)

  • 24 February 2021 – profiling and comparison of code running on data.table versus data.frame.

  • Mar 3, 2021 – @Alex Buerkle will present a short primer on Bayesian estimation and inference. We will do a bit of hands-on with posterior probability distributions. Beyond conceptual and math foundations, we will talk about computational methods for parameter estimation (MCMC, HMC, variational inference).

Below is R code to accompany the discussion of the beta distribution of allele frequencies in a population. It uses the closed-form solution for P(p|x,n) ∝ P(x|p,n)*P(p), where the product of a binomial likelihood and a beta prior is proportional to a new beta distribution.

    (160+1)/(160+1 + 200-160+1)        # expectation with P(p)=beta(1,1)
    (160+0.1)/(160+0.1 + 200-160+0.1)  # expectation with P(p)=beta(0.1,0.1)

    p<-seq(0, 1, 0.001)
    plot(p, dbeta(p, shape1=160+1, shape2=40+1), type="l")

    par(mfrow=c(2,1))
    plot(p, dbeta(p, shape1=160+1, shape2=40+1), type="l", xlim=c(0.5, 1), col="red")
    abline(v=qbeta(c(0.025, 0.975), shape1=161, shape2=41), col="red")
    abline(v=qbeta(c(0.025, 0.975), shape1=16 + 1, shape2=4+1), col="blue")
    lines(p, dbeta(p, shape1=16+1, shape2=4+1), col="blue")

    par(mfrow=c(3,1))
    plot(p, dbeta(p, 1, 1), type="l")      # beta(1,1)
    plot(p, dbeta(p, 0.1, 0.1), type="l")  # beta(0.1, 0.1)
    plot(p, dbeta(p, 160+1, 200-160+1), type="l")

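
As a rough check on the closed-form result, the sketch below approximates the same posterior on a grid and compares it with the beta density. It reuses the counts from the code above (160 of 200) with a beta(1,1) prior. The grid spacing and variable names are choices made for this illustration and were not part of the original session.

```r
# Grid approximation of the posterior for an allele frequency p,
# compared with the closed-form beta posterior (illustrative sketch).
x <- 160; n <- 200   # observed count and sample size, as in the code above
a <- 1; b <- 1       # beta(1, 1) prior

width <- 0.001
p_grid <- seq(width / 2, 1 - width / 2, by = width)  # midpoints on (0, 1)

# Unnormalized posterior: binomial likelihood times beta prior
unnorm <- dbinom(x, size = n, prob = p_grid) * dbeta(p_grid, a, b)

# Normalize so the grid values integrate to one
grid_post <- unnorm / sum(unnorm * width)

# Closed-form posterior: beta(x + a, n - x + b)
exact_post <- dbeta(p_grid, x + a, n - x + b)

max_abs_diff <- max(abs(grid_post - exact_post))  # should be near zero
grid_mean <- sum(p_grid * grid_post * width)      # posterior mean from the grid
exact_mean <- (x + a) / (n + a + b)               # same expectation as computed above
```

Changing a and b to 0.1 repeats the comparison for the beta(0.1, 0.1) prior; the grid uses midpoints so that the prior density is never evaluated at exactly 0 or 1.
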
  • Mar 10, 2021 – No meeting. Consider reading Chapters 1 & 2 of Statistical Rethinking (2nd edition) if the introduction to Bayesian inference last week was new to you, or if you want a refresher. Chapter 2 is particularly good. If you want code in various languages from the second edition, you can find it here (R, Python, Julia, etc.).

  • Mar 17, 2021 – @Alex Buerkle will lead a discussion and some hands-on work on how to marginalize discrete parameters in hierarchical Bayesian models, in JAGS and Stan. The example application models the distribution of a type of organism across sampling sites. The idea applies to any detection problem where a thing might or might not exist at a point (a Bernoulli variable), and many ecologists know it as an ‘occupancy model’. For example, this type of model could apply to monitoring for an invasive species, like zebra mussels or New Zealand mud snails, at monitoring sites. The data can be discrete, but discrete parameters that are inferred from the data cannot be used directly in a Stan model (Stan’s sampler relies on gradients of the log posterior with respect to the parameters, and these are not defined for discrete parameters).

  • Mar 24, 2021 – Continuation of Bayesian modeling – Discussion of a paper: Ensuring identifiability in hierarchical mixed effects Bayesian models. Ogle and Barber 2020. Ecological Applications. 30(7):e02159. (pdf).

  • Mar 31, 2021 – Nominally spring break at UW. Initially I overlooked that Mar 31, 2021 is part of spring break, so I have bumped our scheduled discussion to April 7 and we will not plan to meet this week.

  • Apr 7, 2021 – Discussion of part of Chapter 5 (Model-Agnostic Methods) of the book Interpretable Machine Learning (most recently updated on 22 March 2021!). Please read the lead-in to the chapter. Additionally, please read sections 5.7-5.10 (or 5.7 and 5.9 only if pressed for time and skim the others).

  • Apr 14, 2021 – presentation by @John Calder on Bayesian mixture modeling. Until recently John was a postdoc at UW, now he works as a data scientist for MX (a financial services company).

  • Apr 21, 2021 – @Lars Kotthoff will present a brief intro on automated machine learning – bring your hyperparameters.

    # tune two parameters of the rpart classification tree learner by grid search
    library(rpart)

    # hold out one third of iris for testing
    train.indices = sample(1:150, 150 * 2/3)
    test.indices = setdiff(1:150, train.indices)

    # fit a tree with the given parameters; return them with the test accuracy
    evalParams = function(...) {
      model = rpart(Species~., iris[train.indices, ], ...)
      preds = predict(model, newdata = iris[test.indices, -5], type = "class")
      return(list(pars = list(...), acc = sum(preds == iris[test.indices, "Species"]) / length(preds)))
    }

    pars = expand.grid(minbucket = 1:20, minsplit = 1:20)
    res = lapply(1:nrow(pars), function(i) do.call(evalParams, as.list(pars[i,])))
    best = which.max(sapply(res, function(x) x$acc))
    res[[best]]

    # ...and now with mlr/MBO
    # adapted from https://mlrmbo.mlr-org.com/articles/supplementary/machine_learning_with_mlrmbo.html
    library(mlr)
    library(mlrMBO)

    # tune same parameters
    par.set = makeParamSet(
      makeIntegerParam("minbucket", 1, 20),
      makeIntegerParam("minsplit", 1, 20)
    )
    ctrl = makeMBOControl()
    tune.ctrl = makeTuneControlMBO(mbo.control = ctrl, budget = 10)
    res = tuneParams(makeLearner("classif.rpart"), iris.task, cv3, par.set = par.set, control = tune.ctrl)
    plot(cummin(getOptPathY(res$opt.path)), type = "l", ylab = "mmce", xlab = "iteration")

  • Apr 28, 2021 – Work and help session

    • people will document, edit, call out unclear sections, or request documentation

      • how to use teton

      • how to do specific scientific computation tasks (on teton or elsewhere).

    • people will get or give assistance or a demonstration of how to do specific things

      • we will also use break out rooms to assist with specific tasks (please invite your colleagues to help and get help).

  • May 5, 2021 Demo day

    • we will do some short demos of different technologies

      • cloud computing with AWS – we can spin up some different disk images using AWS EC2. There are 400 different instances to choose from. Ideally several of us will get free sessions started on different instances and report back. The interest here is that many companies do their compute in the cloud rather than on shared academic systems. We are fortunate that UW supports our very good shared system within ARCC.

      • Regarding the question of easier access to cloud computing tools, you might be interested in the Discovery Environment at Cyverse, in addition to the more standard cloud computing tools that are available for free to academic researchers.

      • if there is interest:

        • Docker – demo a docker image of MISO, the LIMS software for the GTL.

        • wrap_slurm.pl – a program to launch many SLURM jobs on the cluster that need different inputs (a complement to SLURM arrays).

        • … other requests or suggestions …

  • Thanks everyone for participating this spring semester. Let’s call this semester complete and plan to return in the fall semester, very possibly in person in Aven Nelson 210. Feel free to use the data science Confluence space over the summer break to share and organize things. All the best for your research over the summer and for recharging any depleted energy reserves.
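
The Mar 17 topic above, marginalizing a discrete parameter, can be sketched in plain R. The example below is an assumption-laden illustration, not the material used in the meeting: it simulates data from a simple occupancy model, sums the latent occupied/empty state out of the likelihood, and fits the two remaining continuous parameters by maximum likelihood with optim rather than with JAGS or Stan. All parameter values and names are hypothetical.

```r
# Occupancy model with the discrete state z marginalized out (sketch).
# z_i ~ Bernoulli(psi)        : is site i occupied? (latent, discrete)
# y_i ~ Binomial(K, z_i * p)  : detections in K visits to site i
set.seed(1)
n_sites <- 200; K <- 5
psi_true <- 0.6; p_true <- 0.3   # hypothetical values used only to simulate data
z <- rbinom(n_sites, 1, psi_true)
y <- rbinom(n_sites, K, z * p_true)

# Log-likelihood with z summed out:
#   P(y) = psi * Binomial(y | K, p) + (1 - psi) * [y == 0]
# theta holds psi and p on the logit scale so that optim is unconstrained.
marg_loglik <- function(theta, y, K) {
  psi <- plogis(theta[1])
  p <- plogis(theta[2])
  lik_occupied <- psi * dbinom(y, K, p)  # site occupied, y detections
  lik_empty <- (1 - psi) * (y == 0)      # site empty, so y must be 0
  sum(log(lik_occupied + lik_empty))
}

# Maximize the marginal likelihood over the continuous parameters only
fit <- optim(c(0, 0), function(theta) -marg_loglik(theta, y, K))
est <- plogis(fit$par)
names(est) <- c("psi", "p")
est
```

The same sum over the two states of z is what would be written in the model block of a Stan program, where the latent state cannot be declared as a parameter.
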


Ideas for spring 2021
  • Many additional ideas can be found on the Previous meetings archive and The "queue" pages.

  • Short research topic presentations with a key methodological or technical illustration from group members. Please indicate your interest in giving a presentation and leading a discussion below.

  • Reading and study groups

    • Who would be interested in discussing the 2nd edition of Statistical Rethinking?: @Alex Buerkle @Joanna Blaszczak @Andrew Siefert @Erin Bentley @Shannon E Albeke @Bridger Huhn

    • Other topics?

  • Host non-academic data scientists to visit our group and discuss careers, technical things, and other topics of mutual interest

  • Possible statistical modeling and machine learning topics

    • Genetic programming and symbolic regression in R: A friendly introduction to RGP, which discusses genetic programming using the RGP package (which is archived on CRAN and is no longer being developed, but is probably useful for learning)

  • Hands-on and conceptual learning about sparse models (including the following paper by Runge et al.)

  • Read and discuss: Detecting and quantifying causal associations in large nonlinear time series datasets. Runge et al., Sci. Adv. 2019; 5: eaau4996 (27 November 2019)

  • Continue reading a few sections from Interpretable Machine Learning

  • Read some sections of Machine Learning: a Probabilistic Perspective by Kevin Patrick Murphy? It is available for free as an ebook from UW library. Alternatively, he is completing two additional books, including Probabilistic Machine Learning: An Introduction.

  • Non-centered parameterizations

    • Simple, univariate models

    • Hierarchical models (including understanding Cholesky factorization)

    • Non-multivariate normal alternatives to non-centered hierarchical models

  • Wondering about talking with ARCC about methods for reducing IO bottlenecks. I run into this most often with large raster datasets with multithreading.

  • AWS - primer and practice usage. A lot of jobs want people to have experience with AWS. Maybe we could set up a UWYO account that lets people practice and see what it is about.

  • Docker - At some point Jason did a show and tell on Docker (search for Docker on this website to find his pages), but it might be worth revisiting how to use containers, from a practical standpoint. What I mean by that is, how to load up and use containers that others will have built. Most of us, when starting a new job, probably would be using existing docker containers instead of making our own. I assume using these is pretty straightforward, but have never actually done it. Would be useful to do so.