Data science goes to college with DataFest

Below is the first of several exciting data science developments for the younger generation, happening right here in Los Angeles. This project is unique because it engages undergraduates in real data analysis, something that happens quite rarely in a classroom. If projects like this catch on as a new trend, we’re going to have some real competition for our jobs in a few years!



2014 UCLA DataFest contestants

DataFest is an event for undergraduates that began at UCLA and is now spreading across the country. Each year, we find an interesting “data sponsor” (i.e., an active company willing to share its data) and give students 48 hours to come up with interesting insights. We provide the food, coffee, energy drinks, and wifi; they do the data cleaning, exploratory data analysis, and presentations. This is similar to the ‘hackathons’ many in our group are familiar with, but with a data twist.

In addition to the data and fuel, we provide the contestants with plenty of help. There are always roving graduate students to consult with, and we recruit “VIP consultants” (working data science practitioners) to come by throughout the event. Most of the students use R for their analysis, although it’s not a formal requirement. Statistics majors at UCLA will have seen R in one of their required courses, but many students haven’t worked with messy, real datasets before DataFest, so they often need technical help. Our VIP consultants had worried that they wouldn’t be able to answer students’ questions. However, asking “why are you trying to do that?” as a follow-up usually got to the root of the problem and enabled these mentors to point students toward a simpler path.

Past years’ data sponsors of DataFest have included the LAPD, eHarmony, and GridPoint. Typically, we give three prizes: “Best Insight,” “Best Visualization,” and “Best Use of External Data.” This year’s UCLA DataFest winners for Best Visualization were Lev Golod and Jonhngyun Lee. Golod and Lee created a graphic that showed when GridPoint users were misusing air conditioning (e.g., running the AC when the outside temperature was below 60 degrees). See that winning entry here or see all this year’s winners on the UCLA DataFest website.

This year, a number of higher education institutions held their own DataFests, including Duke and the Five Colleges of Western Massachusetts (Amherst, Hampshire, Mount Holyoke, Smith, and UMass Amherst).

For more on DataFest, check out this post from FiveThirtyEight, this writeup from the American Statistical Association (note that the ASA is taking over sponsorship of the event as of next year), or my post about DataFest and Hadley Wickham’s R packages. If you’d like to be involved next year (donating data, swag, prizes, money, or your time as a VIP consultant), contact me or Rob Gould.

With projects like DataFest, the future of data science seems bright! Send us a note about other projects that make you feel inspired and optimistic; we’d love to hear about them!

