Blog from September, 2018

Hi All,

Before DSC, there was a Slack channel used by some UWyo students and faculty who wanted to share information about R and other data science stuff, or just ask quick questions of folks who might be online. I’m wondering if there is still a need for that kind of thing. Or maybe that is just too many apps, etc. for everyone? I don’t know, I’m just going to experiment with this and see what happens.

For those not familiar with Slack, it is more-or-less a souped-up messaging environment for sharing not just text, but files, calendars, etc. This includes integration with things like Confluence, GitHub, Trello, and Google’s Backup and Sync. Its functionality makes it a pretty common tool in data science shops, as well as other businesses focused on managing teams across a diffuse network.

For those interested, the Slack channel is: wy-user.slack.com (WY-UseR!). The only requirement for signing in to the service is an email with a uwyo.edu domain. Once you’ve signed in, you have access to a number of channels and conversations that have happened over the last couple of years. The default channel is #general, but you’ll have to search for additional channels that might be of interest.

I’ve also added a #dsc channel that is linked to the DSC space. So you should see any updates from DSC in that channel. If you are signed in to DSC from Slack, you should also be able to interact with DSC content from Slack, which is kind of nifty.

The other day, Alex Buerkle brought to my attention that Windows 10 is now Linux-capable, such that anyone with the Fall Creators Update can install a very light form of Linux (e.g., Ubuntu) without having to partition their drive. That is, the Windows Subsystem for Linux (WSL) can be activated, allowing one access to a Linux terminal. As someone who has created dual-boot machines before (not for a while, so I’ve forgotten everything), this sounded magical, as I recall the process being a serious pain in the butt. However, I just installed WSL with Ubuntu and R in about an hour (okay, maybe two, because I had to troubleshoot the R installation – see below), and thought I’d share some resources and comments, in case anyone else wanted to do something similar.

So, the easiest part of this whole process was activating WSL and then installing Ubuntu. This legitimately took 15 minutes. To do that, I followed these instructions: http://blog.revolutionanalytics.com/2017/12/r-in-the-windows-subsystem-for-linux.html.

Oddly, the hardest part was setting up the install so that I could load R. Not sure if I was making it harder on myself or not, but when I got to the part about the key in the link above, I didn’t really know what was going on (still kinda don’t, but hey, it works). However, I was able to find a site that provided guidance on how to set up a key, which was surprisingly complicated. Here is the site: http://packaging.ubuntu.com/html/getting-set-up.html.

One of the weirdest parts of setting up the package system was having to register my key with launchpad.net, as outlined in the link immediately above. Part of this process required me to upload a verification value to launchpad via a txt file I’d saved to my computer, so I had to interact with the Windows file system from Ubuntu. Eventually I figured out that /mnt/c/RWorkspace/texty.txt in Ubuntu is equivalent to C:\RWorkspace\texty.txt in Windows.

At one point, when I was trying to paste in my key to register it, I got an error about not having access to dirmngr. Eventually it went away, but I’m not sure if that was because I’d properly registered my key at that point, or because I ran some code from this GitHub issue. Sharing, in case it is useful.

Other than that, the install was pretty smooth! And I now have an Ubuntu terminal on my Windows machine, where I can happily use R interactively at the command line, or run R scripts from the command line.

One weird thing about working with the terminal in Windows, that I think is worth mentioning, is pasting text. You can’t just ctrl+v to paste something you’ve copied elsewhere. Instead, you have to right-click wherever you want to paste the text. Apparently there is a good reason for this behavior, but I find it odd. Copying works the same as normal (ctrl+c). Happy coding!

If you are graduating this semester, or not TA’ing next semester, you may be interested in The Data Incubator’s Data Science Fellowship, which sounds a bit like a crash course in buzzwords, but also looks pretty interesting.

From the official website (applications aren’t open yet, but the deadline seems to be Oct 17th): https://www.thedataincubator.com/fellowship.html?ch=em&ref=54c871f53b5c#apply

A slightly more informative version from ECOLOG-L: https://listserv.umd.edu/cgi-bin/wa?A2=ind1809c&L=ecolog-l&P=12756

Explore NEON Workshop

Hi all,

Just wanted to share with you a workshop announcement for NEON (National Ecological Observatory Network), down in Boulder. The focus is on learning how to use NEON data, but it seems like there are some general data science skills in there, such as using an API (application programming interface) to access the NEON database, and an emphasis on reproducibility. All expenses paid.

Also note from the application page:

We encourage diverse candidates to apply, including people of color, women and non-binary individuals, veterans, and individuals with disabilities.

Check out the schedule for the workshop here: https://www.neonscience.org/explore-neon-grads-2018.

My experience with parallel computing is pretty limited - my Windows machine can’t fork, and I’ve only worked on local problems (as opposed to node- or network-based problems). However, the few times I’ve worked with parallel computing have turned out pretty well, though I didn’t find the tools for parallel computing in R to be very intuitive. The two main libraries I’ve used thus far are parallel and foreach. The parallel library is part of base R and actually pretty good if you understand functionals, such as the *apply() family (e.g., apply(), lapply()) of functions, as the functions in parallel use similar syntax (e.g., mclapply()).
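
For anyone who hasn’t tried it, here is a minimal sketch of what that similarity looks like (just a toy function that sleeps; note that mclapply() relies on forking, so on Windows it only works with mc.cores = 1, i.e., sequentially):

library(parallel)

#A toy "expensive" function: sleep for a second, then square the input.
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

#Serial version, using a base functional.
lapply(1:4, slow_square)

#Parallel version: same syntax, plus the number of workers. This relies on
# forking, so it errors with mc.cores > 1 on Windows.
mclapply(1:4, slow_square, mc.cores = 4)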

On the other hand, foreach is geared more towards for loops – which I kind of like, because I find they are more explicit and thus easier for me to follow in a lot of cases. One added benefit of foreach (at least I can’t figure out if there is a way to do this with the parallel package) is that you can deal with more complex “topologies” for how workers (e.g., the nodes performing a task or function) organize themselves, using a combination of the %dopar% and %:% infix operators (Go here for more on what the heck an “infix” operator is). For more on how to nest parallel processes using foreach, see the sketch below and https://cran.r-project.org/web/packages/foreach/vignettes/nested.pdf. Such topologies allow one to control which parts of code are run in parallel and which parts are run sequentially. However, a downside to this behavior is that these topologies often have to be hard-coded into a script or function, unless you make some really flexible code. For both packages, working across networks does not seem very easy (or maybe it is and I just haven’t put in the time to figure it out).
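
To make the %:% / %dopar% combination a little more concrete, here is a minimal sketch of a nested foreach loop, assuming the doParallel package is installed to provide the parallel backend:

library(foreach)
library(doParallel)

#Register a backend with a couple of workers for %dopar% to use.
cl <- makeCluster(2)
registerDoParallel(cl)

#The %:% operator nests the inner loop inside the outer one; swapping %dopar%
# for %do% is all it takes to run the same nested loop sequentially.
nested_out <- foreach(i = 1:2, .combine = rbind) %:%
  foreach(j = 1:3, .combine = c) %dopar% {
    i * j
  }

stopCluster(cl)
nested_out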

Enter: The future

The future package seems to simplify a lot of the issues above. For example, you can explicitly set the behavior of nested functions outside of their execution using the plan() function. For example, plan(strategy = list(multiprocess, sequential)) would run the “top” function in parallel, while any function nested within the “top” function that has been set up with future would run in series. Simply changing the call to plan(strategy = list(multiprocess, multiprocess)) would result in both functions running in parallel (though you’ll need to reason as to whether that makes sense for your problem). The topology depth seems to only be limited by the number of cores, nodes, etc. one has access to.
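
Here is a minimal sketch of that idea using future() and value() directly (a toy example; the sleep just stands in for real work):

library(future)

#Outer futures follow the first strategy (multiprocess); any futures created
# inside them would follow the second strategy (sequential).
plan(strategy = list(multiprocess, sequential))

fits <- lapply(1:4, function(i) {
  future({
    Sys.sleep(1)
    i^2
  })
})

#Collect the results once the background workers finish.
vapply(fits, value, numeric(1))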

Another aspect worth mentioning from the above snippets is the “multiprocess” argument. This is a generic argument that can be run on either Windows or Unix-based systems. That is, plan() will figure out which system is being used and execute the appropriate strategy. For example, on my Windows machine, it will open up multiple R sessions in the background, whereas on a Mac it will use multiple cores. This is great for programming, because then you don’t have to build OS-specific functions or tests, because they’ve already been built.
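
In other words, something like the following should behave sensibly on either OS (the number of workers it reports will obviously depend on your machine):

library(future)

#multiprocess resolves to background R sessions on Windows and forked
# processes on Mac/Linux; the calling code stays the same either way.
plan(multiprocess)

#How many workers future thinks it has available.
availableCores()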

One last feature of future that is worth mentioning briefly is that it appears to play nicely with job/workload schedulers, like Slurm, which (apparently) is what the Teton HPC system uses at UWyo.
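
I haven’t tried this on Teton myself, but my understanding is that the future.batchtools package is what links future to schedulers like Slurm. A hypothetical sketch (the template file name here is just a placeholder; the cluster admins would have the real one) might look like:

library(future.batchtools)

#Each future becomes a Slurm job, submitted using the batchtools template file.
plan(batchtools_slurm, template = "slurm.tmpl")

#Downstream code (future(), furrr functions, etc.) shouldn't need to change.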

For more, check out this overview of future here: https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html.

Tidyverse integration: future + purrr = furrr

I know not everyone is on board with the tidyverse, but I just find these functions more intuitive and easier to learn, even though I started as a base user. I will say that one thing I never quite wrapped my head around as a base user was the functionals (e.g., apply, lapply, sapply), but for some reason the equivalent functions from tidyverse’s purrr package just make sense to me. In part, this is probably because I recently sat down with the functionals chapter of Hadley Wickham’s Advanced R book, which has some excellent explanations of both the functionals concept, as well as how to use the purrr package – better than purrr's vignettes, in my opinion. Find the latest version of the functionals chapter here: https://adv-r.hadley.nz/functionals.html.
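
For anyone curious, here is a tiny side-by-side of the base and purrr versions of the same functional (toy example only):

library(purrr)

#Base R functional: apply a function to each element of a vector or list.
lapply(1:3, function(x) x^2)

#purrr equivalent; ~ .x^2 is shorthand for an anonymous function.
map(1:3, ~ .x^2)

#Type-stable variants return a vector of a known type instead of a list.
map_dbl(1:3, ~ .x^2)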

In the context of purrr, Davis Vaughn built the furrr package, which integrates purrr and future functionalities. furrr has blown my mind with how easy (relatively speaking) it is to implement parallel computing.

Below is some code providing “toy” examples of using the furrr package. The first example provides a situation in which parallel processing produces a pretty good speed-up. In the second example, the non-parallel version of the code is better.

library(future)
library(furrr)
library(tidyverse)

#This sets the maximum number of processes to 6. This works well on my computer,
# because it doesn't saturate my computing power, allowing me to do other work
# while R is doing its thing.
options(mc.cores = 6)


#Good parallel example----
#In this example I'm simply going to put the R session to sleep for a few 
# seconds to simulate a function that takes a few seconds to run.
#You should note that there is a ~3x speed-up between the serial and parallel
# versions of the code.

#Create the data to use with the two different versions of `map()`.
  #Set the seed, making the example reproducible.
set.seed(18092312)
times <- rnorm(n = 5, mean = 3, sd = 1) %>%
  #Make sure the data is greater than 0.
  #The "." is short hand for the output from the above rnorm() function.
  abs(.) %>%
  #Convert the output from the above two lines to a list, which is what `map()`
  # is expecting.
  as.list(.)

#Run the serial version of the function.
system.time(map(.x = times, .f = Sys.sleep))
   user  system elapsed 
   0.00    0.00   16.03 
   
  #Note that the above function is equivalent to the following for loop:
  # system.time(
  #   for(i in seq_along(times)) {
  #     Sys.sleep(times[[i]])
  #   }
  # )

#Now run the parallel version.
  #Make sure the plan is set to run as multiprocess, which defaults to opening
  # multiple R sessions on my machine.
plan(multiprocess)
system.time(future_map(.x = times, .f = Sys.sleep))
   user  system elapsed 
   0.05    0.00    5.96 

#Bad parallel example----
#In this example I'm going to illustrate that the "overhead" of parallelization
# can actually slow down performance when the functions are pretty simple.
#This example simply takes the letters vector, randomly samples from it, and
# then adds a number to the end of the letter. It is a "toy" example simply
# meant to illustrate functionality.

#Make the data.
little_letters <- letters %>%
  #Sample from the 26 lower case letters, with replacement, 10,000 times.
  sample(x = ., size = 1e4, replace = TRUE) %>%
  as.list(.)
  #Make an equivalent list with numbers.
little_numbers <- seq_along(little_letters) %>%
  #Convert the numbers to characters.
  as.character() %>%
  as.list(.)

#Combine the letters and numbers using a sequential strategy.
system.time(serial_output <- map2(.x = little_letters,
  .y = little_numbers,
  #Below is the `purrr` version of an anonymous function, which is a pretty
  # convenient way to do a simple function without having to explicitly make
  # that function. Not entirely necessary here, but I wanted to provide an
  # example of its use.
  ~ paste(.x, .y, sep = "_")))
   user  system elapsed 
   0.03    0.00    0.03 

  #Note that the equivalent for loop could be written as:
  # system.time({
  #   serial_output <- vector("list", length = length(little_letters))
  #   for(i in seq_along(little_letters)) {
  #     serial_output[[i]] <- paste(little_letters[[i]], little_numbers[[i]],
  #       sep = "_")
  #   }
  # })

#Now let's use the parallel version.
plan(multiprocess)
system.time(parallel_multi <- future_map2(.x = little_letters,
  .y = little_numbers,
  ~ paste(.x, .y, sep = "_")))
   user  system elapsed 
   0.13    0.06    2.05 

  #Sometimes I manually stop the background processes, to make sure all the
  # "workers" have "closed".
  future:::ClusterRegistry("stop")

  #That doesn't look very good compared to the regular `map2()` function.

#What happens if we run the parallel version using a sequential strategy, which
# should be equivalent to the regular version of `map2()`?
plan(sequential)
  #Note that I don't use an anonymous function this time.
system.time(parallel_seq <- future_map2(.x = little_letters,
  .y = little_numbers, .f = paste, sep = "_"))
  
   user  system elapsed 
   0.16    0.00    0.14 
  
  #It improved quite a bit, but it's still slower than just `map2()`.