Hadley Wickham’s dplyr tutorial at useR! 2014, Part 1

Hadley Wickham (perhaps you’ve heard of his work) presented a two-hour workshop on dplyr at this year’s useR! conference at UCLA. This tutorial was a highlight of the week-long conference for me, and working on this tutorial video has also made me appreciate how versatile the dplyr package can be. It clearly is the chef’s knife of data science tools.

Hadley’s presentation was just under two hours long, and the edited footage, with breaks omitted, gives us 90 minutes of wisdom and inspiration. I’ve split this tutorial into two relatively even parts for your learning convenience. If this is your first attempt at learning dplyr, I suggest concentrating on the basics presented here in Part 1 before moving on to next week’s video. Two great pieces of advice to follow during this tutorial come from some of the R greats:

1) One of Martin Maechler’s rules of good R programming practice is to never copy and paste. Always try to type the commands yourself; go line by line through the code and do your best to understand why it is what it is.

2) In his introduction, Hadley Wickham provides a gem that I want to highlight here.

Whenever you’re learning a new tool, for a long time you’re going to suck… But the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.

Part 1 (this video) covers the following topics:

  1. An introduction, a bit of theory, and a description of the data
  2. Single table verbs (filter/select/arrange/mutate/summarise) and grouped summaries
  3. Data pipelines
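
As a rough illustration of topics 2 and 3 (this is my own sketch, not taken from Hadley’s scripts), here is how the five single table verbs and a grouped summary can be chained into one pipeline. It assumes the built-in mtcars data set rather than the flights data used in the video; the 100 hp threshold and the wt_kg column are hypothetical choices made only for this example.

```r
library(dplyr)

# mtcars ships with R, so nothing needs to be downloaded (assumption:
# this stands in for the flights data used in the video)
cars <- as_tibble(mtcars)

result <- cars %>%
  filter(hp > 100) %>%                        # keep rows that match a condition
  select(mpg, cyl, hp, wt) %>%                # keep only the columns we need
  mutate(wt_kg = wt * 1000 * 0.4536) %>%      # add a column (wt is in 1000 lbs)
  group_by(cyl) %>%                           # one group per cylinder count
  summarise(n = n(), mean_mpg = mean(mpg)) %>% # collapse each group to one row
  arrange(desc(mean_mpg))                     # sort groups, best mileage first

print(result)
```

Reading the pipeline top to bottom, each %>% passes the result of one step in as the first argument of the next, which is why no intermediate variables are needed.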

I designed this video to be as user-friendly as possible, in hopes of inspiring newcomers to R and RStudio alike. Hadley’s talk was geared towards an intermediate/advanced audience, so I’ve added my own annotations (in light blue) as quick tips for beginners. As you’ll see, Hadley’s workshop often took short breaks for “homework”. I strongly urge you to pause the video during each problem set and attempt to figure it out on your own before proceeding to the answers. There are also several occasions where Hadley goes off-script from the “dplyr-tutorial.pdf” and tweaks his own solution to the problem sets with answers from the crowd. Don’t worry if the answers in the PDF don’t match the video – remember that there are many different ways of programming in R, and part of the learning process is finding your own style. Most importantly, when you get stuck, don’t forget to consult our amazing #rstats community, available via Twitter, StackOverflow, Reddit, and various other places across the internet.

Note: I did not have access to Hadley’s console while editing this video, so the console overlays you’ll see are my best attempts to recreate the code he is using. For this reason, any errors are certainly mine and not Hadley’s.

To give you time to digest Part 1 before embarking on Part 2 of this tutorial, we will be publishing Part 2 next week. That video will cover grouped mutate/filter and window functions, joins via two table verbs, the “do” function, and databases. Feel free to provide feedback on this tutorial in the comments below, or via my Twitter at @timothy_phan.

Hadley’s scripts from this tutorial can be accessed here. Press “Download as .zip” in the top right corner to download the entire directory. Happy learning, and remember: figuring out how to teach yourself new concepts is essential to improving as a data scientist.

Good luck, and stay tuned for Part 2 next week!

  1. Chris - October 14, 2014

    Can I just say that this is fantastic! Thank YOU!

  2. Hafeez - October 14, 2014

    Hi Tim: Thanks for capturing this tutorial. I have a question: do you know where I can obtain the data Hadley is using for the demo (flights, weather, planes)? I don’t know if it’s implied somewhere.
    Thanks in advance

  3. Lawrence Wu - October 17, 2014


    The data actually comes with the dplyr package.


  4. MIke - October 25, 2014

    Part 2 coming soon?

  5. Cristian - November 3, 2014

    Hello my friend and thank you very much for your strong effort!
    What about Part 2? I look forward to seeing it!!

  6. Antonios Koutsourelis - April 14, 2015

    Thanks for the tutorial. I’ve been using dplyr for a long time now and it’s extremely powerful for my daily R modelling and data manipulation processes!

    At the moment I’m interested in replacing “for” loops when possible, using dplyr package and the “do” command.
    I have the following script :

    ## split initial dataset based on a grouping variable/column
    ## and save each (new) dataset as a different .csv file

    data.frame(mtcars) %>%
    group_by(cyl) %>%
    do(d=data.frame(.)) %>%
    do(write.csv(.$d, paste0("data_cyl_",.$cyl,".csv")))

    Seems to work, as I can see the .csv files created in my workspace, but it also returns the following error:

    Error: Results are not data frames at positions: 1, 2, 3

    Any ideas or thoughts?

    PS: I can always use the lapply command with lists, but I’d like to see how I could use this approach….

  7. Satish Babu - May 8, 2015

    Thanks very much for the videos. dplyr and tidyr are great tools (side-by-side with others in the gg* series), but they have a fairly steep learning curve (as Wickham points out). These sessions are therefore great beacons of knowledge for some of us, particularly as they are by the author himself.

    Thank you again for taking the time off to capture, annotate, and put this up online. Much appreciated!
