Wednesday, May 20, 2015

Teaching an R course? Use analogsea to run your customized RStudio in Digital Ocean!

Two years ago I taught an introductory R/Shiny course here at The Jackson Lab. We all learnt a lot. Unfortunately, not so much about Shiny itself as about incompatibilities between its versions and the trouble of installing it on some machines.

And it is not only my experience. If you look into the forums of Rafael Irizarry's MOOC courses, a great many questions are just about installation or incompatibilities of R packages. The solution has existed for a long time: run R in the cloud. However, customizing virtual machines (like Amazon EC2) used to be a nontrivial task.

In this post I will show how a few lines of R code can start a customized RStudio docklet in the cloud and email login credentials to course participants. The participants do not need to install R or the required packages. Moreover, it is guaranteed that they all run exactly the same software. All they need is a decent web browser to access RStudio Server.

RStudio server login

Running RStudio in Digital Ocean with R/analogsea

So how complicated is it today to start your RStudio in the cloud? It is (almost) a one-liner:
  1. If you do not have a Digital Ocean account, get one. You should receive a $10 promotional credit (= 1 regular machine running without interruption for 1 month):
    https://www.digitalocean.com/
    (full disclosure: if you create your account using the link above I might get an extra credit)
  2. Install the analogsea package from GitHub. Make sure to create a Digital Ocean personal access token and set it in R as the DO_PAT environment variable. Also create your personal SSH key and upload it to Digital Ocean.
  3. And now it is really easy:

    library(analogsea)
    # Sys.setenv(DO_PAT = "*****")  # set your access token

    # start your machine in Digital Ocean
    d <- docklet_create(size = getOption("do_size", "512mb"))
    # run RStudio on machine 'd' (rocker/rstudio docker image)
    d %>% docklet_rstudio()
The last line should open your browser with the RStudio login page (user "rstudio", password "rstudio"). If it does not, use summary(d) to get the IP address of your machine and go to http://your_machine:8787
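
Two small checks can save some head-scratching around the steps above. The following is only a sketch: has_do_token() and rstudio_url() are hypothetical helper names, not part of analogsea, and the IP address in the usage line is a documentation placeholder that you would replace with the one reported by summary(d).

```r
# Hypothetical helper: TRUE when a non-empty DO_PAT environment variable
# is visible to the current R session.
has_do_token <- function() {
  nzchar(Sys.getenv("DO_PAT"))
}

# Hypothetical helper: address of RStudio Server on a machine with the given IP.
# Port 8787 is the one used in the post.
rstudio_url <- function(ip, port = 8787) {
  sprintf("http://%s:%d", ip, as.integer(port))
}

if (!has_do_token()) {
  message("DO_PAT is not set; use Sys.setenv(DO_PAT = \"...\") before creating droplets.")
}

# Placeholder IP address, replace with the one from summary(d)
rstudio_url("203.0.113.10")
```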

It will cost you ~$0.01 per hour ($5 per month, May 2015). When you are done, do not forget to destroy your Digital Ocean machine (droplet_delete(d)). At the end, make sure that you have successfully killed all your machines, either by logging in to Digital Ocean or by calling droplets() in R.
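
To get a feeling for the price of a whole course, the arithmetic can be wrapped in a few lines of R. This is a rough estimate only: course_cost() is a hypothetical helper, the hourly rate is derived from the $5 per month figure above, and a 30-day month is assumed.

```r
# Hypothetical helper: estimated cost (USD) of one droplet per participant.
# monthly_price: price of one droplet per month ($5 in May 2015, see above).
# Assumption: a month counts as 30 * 24 = 720 hours.
course_cost <- function(n_participants, hours, monthly_price = 5) {
  hourly_price <- monthly_price / (30 * 24)
  n_participants * hours * hourly_price
}

# 20 participants, droplets kept alive for 8 hours
round(course_cost(20, 8), 2)  # about 1.11
```

Even if the machines are left running overnight, the bill stays small as long as they are deleted afterwards.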

Customized RStudio images

What if the default RStudio image is not good enough for you because you insist that your package be pre-installed? For example, your package, like DOQTL, may have many dependencies that take a long time to download (org.Hs.eg.db, org.Mm.eg.db, ...).

You can still use analogsea to run your Digital Ocean machines, but you need to prepare your own customized docker image in advance. First create an account on Docker.com and get yourself introduced to the Dockerfile syntax. Then link your Docker account to your GitHub account as described here.

I had been afraid of that because my knowledge of docker is somewhat limited. It was actually far easier than I expected: see a Dockerfile for RStudio with DOQTL pre-installed.


Also, see the Dockerfile of the rocker/hadleyverse image, with Hadley Wickham's packages preinstalled, for more inspiration.

Start virtual machine, pull and run customized RStudio image, email credentials

Finally, suppose you have created your customized docker image (like simecek/doqtl-docker). For each participant of your course, you want to start a virtual machine, pull this image, run it and email the IP address (and credentials) to the participant.

The code below does just that. There are several ways to send emails from R, and this program uses the sendmailR package. I split the code into several for-loops, so if something goes wrong there is a better chance of catching it.
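
A minimal sketch of such a script, under stated assumptions, might look as follows. The participant table, the helper names compose_message() and run_course(), the sender address and the SMTP server are all hypothetical. The way the IP address is read from the droplet object (networks$v4[[1]]$ip_address) is also an assumption about its structure; if it does not hold, take the address from summary(d) instead. Nothing is started until run_course() is called.

```r
# Sketch only: names, addresses and the IP lookup are assumptions (see above).

participants <- data.frame(
  name  = c("Alice", "Bob"),                        # hypothetical participants
  email = c("alice@example.org", "bob@example.org"),
  stringsAsFactors = FALSE
)
image <- "simecek/doqtl-docker"                     # customized docker image

# Text of the email for one participant (no network access needed).
compose_message <- function(name, ip, user = "rstudio", password = "rstudio",
                            port = 8787) {
  sprintf("Dear %s,\n\nyour RStudio: http://%s:%d\nuser: %s\npassword: %s\n",
          name, ip, as.integer(port), user, password)
}

run_course <- function(participants, image, from, smtp_server) {
  stopifnot(requireNamespace("analogsea", quietly = TRUE),
            requireNamespace("sendmailR", quietly = TRUE))
  n <- nrow(participants)
  machines <- vector("list", n)

  # Loop 1: start one docker-enabled droplet per participant
  for (i in seq_len(n)) {
    machines[[i]] <- analogsea::docklet_create()
  }

  # Loop 2: pull the image and start RStudio on port 8787
  # (leading spaces are explicit in case the arguments are pasted together as-is)
  for (i in seq_len(n)) {
    analogsea::docklet_pull(machines[[i]], image)
    analogsea::docklet_run(machines[[i]], " -d", " -p 8787:8787",
                           paste0(" ", image))
  }

  # Loop 3: look up each IP address and email it to the participant
  for (i in seq_len(n)) {
    d  <- analogsea::droplet(machines[[i]]$id)      # refresh droplet info
    ip <- d$networks$v4[[1]]$ip_address             # assumption, see lead-in
    sendmailR::sendmail(
      from    = from,
      to      = sprintf("<%s>", participants$email[i]),
      subject = "Your RStudio login",
      msg     = compose_message(participants$name[i], ip),
      control = list(smtpServer = smtp_server)
    )
  }
  invisible(machines)
}

# Usage (creates billable droplets, so it is left commented out):
# machines <- run_course(participants, image,
#                        from = "<teacher@example.org>",
#                        smtp_server = "smtp.example.org")
```

Keeping the three loops separate means that a failure in, say, the email step does not force you to recreate the machines; you can rerun only the loop that failed.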



Final thoughts and links

Docker can be installed on many systems (including Windows). So the course participants should be able to use your customized RStudio image on their own servers or laptops even after the end of the course. Another advantage is fully reproducible code: R syntax will change, packages will come and go, but your docker image will stay functional as long as some docker client exists.

If you feel overwhelmed and all you need is to run RStudio in the cloud (no analogsea, no customization), I would recommend RStudio in the cloud for dummies, 2014/2015 edition as a good start and a tool sufficient for most practical applications.

Of course, teaching an R course is not the only reason to run R on Digital Ocean. I started my first droplet three months ago to host my Shiny apps. For $5-$10/month, Digital Ocean gives you functionality similar to what shinyapps.io offers for $99/month. If interested, see How to get your very own RStudio Server and Shiny Server with DigitalOcean.


UPDATE 10/14/2015: We used Docker / Digital Ocean to teach the Short Course on Systems Genetics. All data, scripts (incl. data transfer, printing the table of DO machines and emailing participants) and 3 docker images are available at

https://github.com/churchill-lab/sysgen2015

UPDATE 9/9/2015: We used Docker / Digital Ocean for a tutorial on kallisto, DOQTL and DESeq2. All materials available at

2 comments:

  1. Do you create a separate droplet for each user (seems like it might get expensive) or create multiple users in the same server (could become slow if all use at the same time)? Thanks for this writeup.

  2. Good point. I am creating a separate droplet for each user. I agree it might be cheaper to put several users on one machine and just use different ports. It depends on what you want to do, how much memory you need...
