...
In this working group we learn about methods and ideas for confronting models with data, and about data science generally. Anyone is encouraged and welcome to attend.
...
Meeting schedule and topics:
In
...
the Spring 2023 semester, we will meet
...
This was based on a poll (http://whenisgood.net/gc33wpw), to which 57 people responded. That’s a great response. Unfortunately, it means that our best time works for 35 of us and 22 have a conflict.
We can also consider a second weekly meeting slot for the many interested people who won’t be able to make this time. People are welcome to use these webpages on Confluence to organize additional interest groups.
Meeting schedule and topics:
Fall 2021
…
– Reproducible research with R, Git, LaTeX, etc. Jessi Rick & others welcome to join in
let Jessi know if you have workflows/ideas that you’d like to add to the discussion
– …
In the queue to place on the schedule:
Reproducible research with workflow managers like Nextflow, and visit from bioinformatician at Wyoming Department of Health
Wish list for topics for fall 2021
See page that lists previous and potential people to invite to talk about careers in non-academic data science (requires login).
(on the schedule)
Intro to shared computing at UW – Teton compute and storage resources, with demos by request (how to …?)
Machine learning
...
More with mlr3 - I’ve (J. Harrison) been using the software for simple ML tasks on manageable data, but am curious how to scale up to larger data and even if mlr3 is the right tool for that task.
...
Scaling up machine learning. Would like to do a hands-on project on some really large data (leaving this vague, so we can find something useful to the broader group).
...
A more introductory overview of machine learning for beginners. Maybe someone can give some background for part of a group meeting, and several of us can share how we are implementing machine learning (or would like to) in our own research? Joshua Harrison
...
Text mining intro - a basic primer of questions and tools and maybe a follow-up if anyone is interested in digging deeper. Maybe with a focus on how the methods have been applied to biological questions, rather than social science questions.
...
(on the schedule) Machine learning pipelines – I (LK) could do a brief intro to mlr3pipelines (https://mlr3pipelines.mlr-org.com).
...
Bayesian methods that don’t involve MCMC: variational inference, INLA, ABC
...
A primer of what Bayesian stats are and why one might want to use them.
...
Bayesian machine learning/neural networks
...
STAN (super basic, please?)
...
Overview of various statistical tests (when to use what for what purpose, etc.)
...
Practicing with loops and apply functions in R
...
Regular expressions tutorials/practice
...
Approximate Bayesian computation, see https://www.pnas.org/content/104/6/1760
...
Parallelization in R
...
Intermediate bash tips & tricks
...
Bayesian multilevel modeling (using the brms R package or other methods)
...
Nonlinear modelling (frequentist and Bayesian approaches)
...
Compositional data analysis, potentially as applied to microbiome data
...
Functional programming in R
...
replicable data cleaning and manipulation
...
Intro to replicability in data science (i.e. github, etc)
Best practices/how-to for sharing code, data, etc. (e.g., on GitHub)
...
Database management
...
Webscraping
...
Interactive plotting
...
Spatial analysis
...
A primer of Overleaf, LaTeX, R Markdown and integration of these tools.
...
Math for machine learning/statistics: linear algebra, probability, calculus
...
Spatial data
...
Series of short (5-10 minute) presentations by someone on their research with Q&A afterwards
...
Effective story-telling and visualization for scientific research with applications in R
...
How to make a dashboard, with an example for hands-on learning.
...
General tools for API calls in Python or R.
...
Machine learning collaborative group (see HERE for more)
...
in person again, Fridays 1pm in EN 2101.
Spring 2023
We’ll have an informal planning meeting. Come along and let us know what you’d like to see in the Data Science Meetup. There’ll be muffins.
“Finding Ice Age houses in Colorado and other problems in archaeological data analysis”, presented by Todd Surovell (Anthropology).
“Using adaptive prior regularization to quantify species interactions in diverse communities”, presented by Christopher Weiss-Lehman (Botany).
“Genomic approaches to studying North American snake diversity”, presented by Sean Harrington (INBRE).
“Adventures in hierarchical Bayesian modeling”, presented by Andrew Siefert (Botany). For this one, we’ll be on zoom at https://uwyo.zoom.us/j/95012886827
“Efficiently Mobilizing Data for Decision Making -- Implications for Data Science Curriculum Design”, presented by Timothy Robinson (Maths and Stats).
“Data in Hydrology - More than just Streamflow”, presented by Fabian Nippgen and Salar Jarhan (Ecosystems Science and Management).
"Leveraging Data of Large-Scale High-Fidelity Numerical Simulations in Wind Energy Applications", presented by Andrew Kirby (School of Computing).
Fall 2022
We’re going to try something different this semester and have a series of presentations from faculty and students across UW to see the breadth of data science (in the widest sense). The format is a short-ish talk/demo, followed by questions – we want this to be as interactive as possible! In the Fall 2022 semester, we met at 1 p.m. Fridays (Mountain Time) on Zoom.
If you’re interested in contributing a talk/demo/?, get in touch with Lars Kotthoff.
“Physics-informed neural-networks (PINN) for solving linear elastic solid mechanics problems: a comparison with finite element method”, presented by Min Lin and Xiang Zhang (Mechanical Engineering).
“Quantitative Structure-Property Relationships of COFs for Desalination”, presented by Ali Davies and Laura de Sousa Oliveira (Chemistry).
“Rapid Introduction to Spatial R”, presented by Shannon Albeke (WyGISC). You can download the script and data here: Albeke_Example
“Hands-On Introduction to Deep Learning”, presented by Mehdi Nourelahi (EECS).
“Flow and transport in large-scale complex permeable media”, presented by Saman Aryana (Chemical and Biomedical Engineering). Note that this will be at 2pm instead of 1pm!
“Dynamic Analysis of Ecological Restoration of damaged landscapes with Chronosequences, and building Models to Generate Alternatives for policy makers”, presented by Roger Coupal (Agricultural and Applied Economics).
“Deep learning methods for geophysical inverse problems and applications to subsurface characterization”, presented by Dario Grana (Geology and Geophysics) and Mingliang Liu (Stanford).
“Connecting Data Science and the Sustainable Built Environment”, presented by Liping Wang (Civil and Architectural Engineering).
No talk.
No talk, again.
No ta… wait, no, actually there will be a talk. “Data-intensive freshwater ecology”, presented by Sarah Collins (Zoology and Physiology).
“Divide and Conquer: Partitioning the Sky with the Hierarchical Equal Area isoLatitude Pixelisation”, presented by Adam Myers (Physics and Astronomy).
18 Nov 2022 – “Invasion informatics! Harnessing the big data revolution to overcome invasive species”, presented by Kelsey Brock (Plant Sciences).
Spring 2022
– building an R package (for CRAN, GitHub, or Bioconductor).
how to make a minimal R package in the style expected by CRAN. We might want to work through https://r-pkgs.org/whole-game.html from https://r-pkgs.org (a book focused on this topic).
Here is a quick tutorial, that is similar to the “whole game” tutorial linked above, but with a few twists.
hosting options and considerations
– In advance of our meeting, please watch https://youtu.be/dag9l0GFci8 from the 2021 useR conference and come ready to discuss. This 40 minute talk (plus ~15 minutes of Q&A) covers a diversity of topics and provides some food for thought. See The United States Research Software Engineer Association and their job board.
– Damir Pulatov – How to write unit tests for your R code and why you should do it.
Brief shoutout - useR is an R conference and there is a call to offer tutorials (link goes to the useR page) at the conference. Proposals are due by Feb 2015. Might be a way to ‘level up’ for some of us. Suggested topics are broad and introductory (e.g., git with R, data visualization basics, etc.)
– … R, Iteration in R – loops, apply, map functions, Functional(s) and programming in R Eryn McFarlane - back to very basics. link here
– No meeting this week. Consider watching this recorded talk instead: https://www.youtube.com/watch?v=B7TBfJrofQM
Not related, but I found this article on the use of AI insightful – https://www.technologyreview.com/2021/07/30/1030329/machine-learning-ai-failed-covid-hospital-diagnosis-pandemic/
– visit with Nick Anderson, security engineer at Facebook, undergrad degree in math from UW, Wyoming native. We will learn about his education and career path and ask him questions like (please bring your own too):
what advice would you give to yourself as a finishing undergrad at UW, what do you wish you had known at that point?
what things would you encourage students and postdocs to do to prepare themselves for a career in security, or more broadly a career in a technology company?
what are some of the main areas in which technology companies deal with security?
what is your area of speciality within security and how has this changed over time?
what are areas that you are excited learning more about in security?
in what ways do machine learning experts and data scientists contribute to security teams?
– Development environments (a.k.a. pimp my rIDE)
Discussion of last week’s meeting with Nick Anderson.
Code editors – Share, demo, try out different IDEs / editors: Visual Studio Code, BBEdit, emacs (read and write files remotely with built-in ssh support: /ssh:teton:tmp.txt; split screens and multiple windows on same buffer), vi, RStudio, Atom, Eclipse.
– Some learning about Apache Arrow – 🌟 in advance, please watch short talk by Wes McKinney (25 min, demo starts around 18 min) and look over this short introduction.
we will discuss and share thoughts on the video and short introduction
We can experiment with these simple examples. https://arrow.apache.org/docs/r/
We will also step through this vignette, which is part of the R package ‘arrow’. It calls for working with 37 GB of data from AWS S3. Consequently, I suggest we skip trying to run the code ourselves, but instead we step through the vignette and work to understand what it is doing without executing it ourselves.
Extras:
– 🌴 spring break 🏖 – no meeting ⛷
– Interactive visualizations with Shiny – demo and hands-on – Heili Lowman – here is the shiny app that Heili Lowman built in the demo.
For inspiration when building your own Shiny applications, there are lots of examples online in the Shiny Gallery. There are also a number of examples of Shiny Dashboards, if you prefer that layout.
Additional tutorials are available here and here, and lots more details are available in the textbook Mastering Shiny by Hadley Wickham.
The Shiny Widget Gallery provides all the code necessary to add any kind of widget you like. The ‘shinyWidgets’ package also has more customizable widgets if you would like to offer the end user more freedom in their selections.
Today’s lesson was adapted from a lesson created by Allison Horst for the R Ladies Santa Barbara chapter (GitHub repo for the lesson here). A copy of today’s script ('app.R') can be found below.
Alex's example with >12,000 possible plots that are calculated on the fly. This was my first Shiny app, which I use for teaching.
and remainder of spring semester – we are going to put the working group on hold for the rest of the semester and will see about regrouping again in fall 2022. Thanks everyone for participating and sharing your interest and enthusiasm for learning things together. Don’t forget the lesson about getting accustomed to failing. Get out there and push yourself and see what you can do. And have fun doing it!
...
We are inviting some speakers to join us this semester, with a few different potential topics including computer security, value systems associated with data sharing, data science in biotechnology companies, careers in data science, etc. If you have suggestions or requests, please add them here or in the wishlist below.
Wish list for topics
Please add and edit this list, including expanding the plan and volunteering to lead on one of these topics.
Short (5-10 minute) presentations by group members on data science related to their research with Q&A afterwards
See page that lists previous and potential people to invite to talk about careers in non-academic data science (requires login).
Reproducible research with workflow managers like Nextflow, and visit from bioinformatician at Wyoming Department of Health
Compositional data analysis,
Machine learning
Scaling up machine learning. Would like to do a hands-on project on some really large data (leaving this vague, so we can find something useful to the broader group).
A more introductory overview of machine learning for beginners. Maybe someone can give some background for part of a group meeting, and several of us can share how we are implementing machine learning (or would like to) in our own research? Joshua Harrison
Bayesian machine learning/neural networks
Computational statistics
Bayesian multilevel modeling (using the brms R package or other methods)
Bayesian methods that don’t involve MCMC: variational inference, INLA, ABC
Nonlinear modelling (frequentist and Bayesian approaches)
Compositional data analysis, potentially as applied to microbiome data – transformations, and considerations (useful for those working with 16S sequence data), or other bioinformatics-related topics if there is enough demand. Scott Klasek Melissa DeSiervo I could provide perspective / share some stuff about compositional data that is not 16S related. I’ve been working with this “ecotraj” package in R to perform some analyses on speed and directionality of community shifts https://cran.r-project.org/web/packages/ecotraj/vignettes/IntroductionETA.html
Spatial analysis
Text mining intro - a basic primer of questions and tools and maybe a follow-up if anyone is interested in digging deeper. Maybe with a focus on how the methods have been applied to biological questions, rather than social science questions.
Linux
Intermediate bash tips & tricks
replicable data cleaning and manipulation
How to make your own SQL database
Webscraping
Visualization
Interactive plots
How to make an html-based dashboard, with an example for hands-on learning.
Effective story-telling and visualization for scientific research with applications in R
Math for machine learning/statistics: linear algebra, probability, calculus
...
Fall 2021
– Organizational meeting
develop list of topics for the semester
pair and share – break out into smaller groups and share what you have been up to with respect to data science, what you’ve been wanting to learn, and learn about other people in the group.
– Lars Kotthoff will do a brief intro to mlr3pipelines
– Dylan Perkins (ARCC/End User Support): Intro to shared computing at UW – Teton compute and storage resources (video is below, slides are in https://docs.google.com/presentation/d/145AVEOLHi22CPn0IwpLkVCNWJ7ZzGJq1fZDWt4IBE8I/edit?usp=sharing).
As a follow-up, please contact arcc-info@uwyo.edu with questions.
Please drop them a line if you are interested in participating in some testing of the new browser-based, graphical interface they are developing to the teton compute resources. They would appreciate several people testing the system.
– hands on teton
Tasks you would like help with or demonstrated, from entry-level to advanced
we did one big group screen share to show:
how to configure ssh so that we can provide Teton password and validation (2FA) once per session
how to launch a SLURM interactive session to do some text wrangling with UNIX command line tools
demonstrate two ways of using text editors to write bash scripts that can be executed on teton; this included making a SLURM-compliant script that we submitted with sbatch
our script demo’d the use of /dev/shm, /lscratch, and /gscratch for simulation output and how to move data home and clean up after yourself at the end
We did not get to demo how to install R packages yourself. Instead, I started a Knowledge Base entry on this, which you are welcome to add to, edit, and improve.
– Short demonstration of LaTeX as implemented in Overleaf, followed by hands-on session for participants to sign-up for a free account, make documents with one or more templates, and ask questions of more experienced users. Alex Buerkle will do initial demo and will ask for helpers to assist others in hands-on session.
Overleaf features:
the direct submission button (arxiv, biorxiv, society journals, etc.)
built-in version control
built-in multi-author, shared editing
LaTeX features
references to figures, tables, supplementary sections
citation management and bibliography
sectioning
Follow-up:
– an introduction to Bayesian modeling. Eryn McFarlane and Topher Weiss-Lehman will discuss the basis of Bayesian thinking and talk about why one might want to use Bayesian methods.
– an introduction to computational Bayesian modeling (Andrew Siefert, Joshua Harrison). Why do computers help when doing Bayesian statistics? What do sampling and convergence mean? A high-level overview of the different tools one can use to do Bayesian statistics. Finish with an illustration of a model specified with R and Stan so that folks can get an idea of the modeling process.
Animation of samplers for Bayesian modeling: https://chi-feng.github.io/mcmc-demo/app.html
HERE is a git repo that has the code for the little mini-talk that Josh gave. We can keep posting Bayesian stuff here if we want. Feel free to do pull requests. If you have not used git, you can go to that link and view the different files and download them as you like.
– Breakout groups for hands-on and Q&A regarding Bayesian models for parameter estimation and inference. Request a group below. We’ll add one or two more at the beginning of the meeting.
– Regular expressions tutorial and practice – Alex will lead and we will do hands-on work in break out groups. Regular expression gymnastics - fall 2021
We will use R and https://regex101.com as sandboxes for learning
Regular expressions are a tool for parsing, modifying, and wrangling text and are present in many programming languages, command-line tools, and text editors. They are very useful for extracting information from text fields and for reformatting text for input to different software.
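To make the two uses mentioned above concrete (extracting fields and reformatting text), here is a minimal sketch. The meeting itself used R and regex101.com; this sketch uses Python's standard re module instead. The sample labels and their site/date/replicate layout are invented purely for illustration, and parse_label and reformat_date are hypothetical helper names.

```python
import re

# Hypothetical sample labels; the site_date_replicate layout is invented for illustration.
labels = [
    "siteA_2021-09-14_rep1",
    "siteB_2021-10-02_rep12",
]

# Named groups capture each field; \d{4}-\d{2}-\d{2} matches a YYYY-MM-DD date.
pattern = re.compile(
    r"^(?P<site>[A-Za-z]+)_(?P<date>\d{4}-\d{2}-\d{2})_rep(?P<rep>\d+)$"
)


def parse_label(label):
    """Return a dict of the extracted fields, or None if the label does not match."""
    match = pattern.match(label)
    return match.groupdict() if match else None


def reformat_date(label):
    """Rewrite a YYYY-MM-DD date inside a label as DD/MM/YYYY using backreferences."""
    return re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", label)


parsed = [parse_label(label) for label in labels]
reformatted = [reformat_date(label) for label in labels]

for fields, new_label in zip(parsed, reformatted):
    print(fields, new_label)
```

The same patterns can be pasted into regex101.com to see which part of each label every group matches. R's regex functions use a very similar pattern syntax, though backslashes need doubling inside ordinary R strings.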
– Reproducible research with R, Git, LaTeX, etc. Jessi Rick
We’ll discuss creating dynamic documents (i.e., where data, code, analysis, and output are all in one place, with automated updating if any of them are modified) as a way of making research more reproducible. While there are numerous options that all lead toward this end, I’ll be introducing how to do this when R is used for analysis and LaTeX/Overleaf is being used for document creation.
Link to the Overleaf tutorial I’ll be working from: https://www.overleaf.com/read/fxmzqhndjdxv
Useful links:
General intro to reproducible research ideas/tools: https://doi.org/10.1002/bes2.1801
Using knitr from within Overleaf: https://www.overleaf.com/learn/latex/Knitr
More general intros to using knitr with LaTeX: https://kbroman.org/knitr_knutshell/pages/latex.html and http://edrub.in/ARE212/latexKnitr.html
Tutorial for syncing between RStudio and Overleaf using Git/GitHub: https://github.com/jessicarick/resources/blob/master/tutorials/R_Overleaf_Integration.pdf
Tutorial for writing manuscripts directly in RMarkdown (no LaTeX necessary): https://jhemberger.github.io/posts/posts/r-markdown-manuscripts/
Joshua Harrison's tutorial on how to use Make to integrate analyses + manuscript into one reproducible workflow: How to use Make with R and Latex
– A bit more hierarchical modeling in STAN (why, yes, it is machine learning 🚀 ).
in advance of our meeting, please watch this 11 minute video that illustrates the benefit of using hierarchical models for priors and the information sharing that results. https://youtu.be/dNZQrcAjgXQ.
The analyses in the video are in the Hierarchical directory in https://github.com/MaggieLieu/STAN_tutorials.git. You can clone the repo with git clone https://github.com/MaggieLieu/STAN_tutorials.git
The directory contains an Rmarkdown file and a result file that is rendered to html. In our meeting we will break into groups, discuss the video, experiment with the code, and share experiences.
Here are some more tutorials that break down the theory of hierarchical models a bit more, if there are other folks (like me!) who got a little lost in the weeds yesterday: https://www.youtube.com/watch?v=VssgU4Ey7ss and, for a less math-y version, https://www.youtube.com/watch?v=SMWleVKO9ZM
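As a rough illustration of the information sharing that hierarchical models provide, here is a small sketch of partial pooling in Python (not the R/Stan code from the video or the repo). It assumes a normal-normal model in which the population mean mu and the between-group standard deviation tau are already known; a real hierarchical model would estimate both from the data. All numbers and the name partial_pool are made up for illustration.

```python
def partial_pool(group_means, group_ses, mu, tau):
    """Precision-weighted compromise between each group's own mean and the population mean.

    Assumes a normal-normal model with known population mean `mu` and known
    between-group standard deviation `tau` (a simplification for illustration).
    """
    pooled = []
    for ybar, se in zip(group_means, group_ses):
        w_data = 1.0 / se**2    # precision of the group's own estimate
        w_prior = 1.0 / tau**2  # precision of the population-level prior
        pooled.append((w_data * ybar + w_prior * mu) / (w_data + w_prior))
    return pooled


# Invented example: one precisely measured group and one noisy group.
group_means = [2.0, 10.0]
group_ses = [1.0, 4.0]
estimates = partial_pool(group_means, group_ses, mu=5.0, tau=2.0)
print(estimates)
```

The noisy group (standard error 4.0) is pulled much further toward the population mean than the precisely measured group. This is the shrinkage that a hierarchical prior produces.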
– No group meeting. Please put requests for topics for the remaining semester meetings in the time slots below.
– No meeting, give thanks and take Thanksgiving Break.
– Drop-in, social meeting, to discuss ad hoc and other topics (below)
Running shiny apps, web-hosting shiny apps, and other interactive visualizations
– Hosting webpages at github.io (or bitbucket.io) from scratch. Alex Buerkle will lead and will appreciate input from others.
Here’s an example of how a company makes it easy to embed html in your own documents: https://help.twitter.com/en/using-twitter/embed-twitter-feed . This can obviously be done in a Content Management System (CMS), but they are doing it by making a button available to you to insert some html. Similarly, here’s the same thing to embed YouTube videos in your page: https://support.google.com/youtube/answer/171780?hl=en
another reason to consider html rather than markdown (or a CMS), is that you can readily embed D3 and related tools directly inside of pages for dynamic visualizations. To make this point, I made my index.html more interesting and have not deleted it from GitHub, yet.
And, here, straight from 1998, is a very easy introduction to html (from before css existed). Beyond that, here is the official specification of html and lots of tutorials and modern stuff.
Thanks everyone for participating this semester. Have a good break and see you in January.
...
Spring 2021
10 February 2021 – organizational meeting
17 February 2021 – Introduction to mlr3 – Damir Pulatov (UW Computer Science graduate student)
In advance of our meeting, please read this very short introduction to the use of R6 in R (one form of object-oriented programming in R) from the mlr3 book (~5 minutes).
24 February 2021 – profiling and comparison of code running on data.table versus data.frame.
– Alex Buerkle will present a short primer on Bayesian estimation and inference. We will do a bit of hands-on with posterior probability distributions. Beyond conceptual and math foundations, we will talk about computational methods for parameter estimation (MCMC, HMC, variational inference).
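For anyone who wants to poke at a posterior distribution by hand, below is a minimal random-walk Metropolis sampler, one of the simplest MCMC methods. It is a sketch in Python rather than the material used in the meeting. The data (7 successes in 20 trials), the flat Beta(1, 1) prior, the step size, and the burn-in length are all assumptions chosen for illustration.

```python
import math
import random


def log_posterior(theta, successes, trials):
    """Unnormalized log posterior: binomial likelihood with a flat Beta(1, 1) prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)


def metropolis(successes, trials, n_steps=20000, step_sd=0.2, seed=1):
    """Random-walk Metropolis sampler for the success probability theta."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, trials)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, successes, trials)
        # Accept with probability min(1, posterior ratio); otherwise stay put.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(successes=7, trials=20)
kept = samples[2000:]  # discard an (assumed) burn-in period
estimate = sum(kept) / len(kept)
print(round(estimate, 3))
```

With a flat prior the exact posterior is Beta(8, 14), whose mean is 8/22, so the sampled mean can be checked against that analytic value. Changing step_sd or the starting value is a quick way to see how mixing and convergence are affected.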
...
Many additional ideas can be found on the Previous meetings archive and The "queue" pages.
Short research topic presentations with a key methodological or technical illustration from group members. Please indicate your interest in giving a presentation and leading a discussion below.
Reading and study groups
Who would be interested in discussing the 2nd edition of Statistical Rethinking?: Alex Buerkle, Joanna Blaszczak, Andrew Siefert, Erin Bentley, Shannon E Albeke, Bridger Huhn …
Other topics?
Host non-academic data scientists to visit our group and discuss careers, technical things, and other topics of mutual interest
Possible statistical modeling and machine learning topics
Genetic programming and symbolic regression in R: A friendly introduction to RGP, which discusses genetic programming using the RGP package (which is archived on CRAN and is no longer being developed, but is probably useful for learning)
Hands-on and conceptual learning about sparse models (including following paper by Runge et al.)
Read and discuss: Detecting and quantifying causal associations in large nonlinear time series datasets. Runge et al., Sci. Adv. 2019; 5 : eaau4996 27 November 2019
Continue reading a few sections from Interpretable Machine Learning
read some sections of Machine Learning: a Probabilistic Perspective by Kevin Patrick Murphy? It is available for free as an ebook from UW library. Alternatively, he is completing two additional books, including Probabilistic Machine Learning: An Introduction.
Non-centered parameterizations
Simple, univariate models
Hierarchical models (including understanding Cholesky factorization)
Non-multivariate normal alternatives to non-centered hierarchical models
Wondering about talking with ARCC about methods for reducing IO bottlenecks. I run into this most often with large Raster datasets with multithreading
AWS - primer and practice usage. A lot of jobs want people to have experience with AWS. Maybe we could set up a UWYO account that lets people practice and see what it is about.
Docker - At some point Jason did a show and tell on Docker (search for Docker on this website to find his pages), but it might be worth revisiting how to use containers, from a practical standpoint. What I mean by that is, how to load up and use containers that others will have built. Most of us, when starting a new job, probably would be using existing docker containers instead of making our own. I assume using these is pretty straightforward, but have never actually done it. Would be useful to do so.
...