Set up

To find other tutorials for this class, go to the main website, https://ds112-lendway.netlify.app/.

Welcome to your first tutorial for this class, COMP/STAT 112: Introduction to Data Science! As you work through the different sections, there will be videos for you to watch (both embedded YouTube videos and links to the videos on Voicethread), files for you to download, and exercises for you to work through. The solutions to the exercises are usually provided, but in order to get the most out of these tutorials, you should work through the exercises and only look at the solutions if you get really stuck. You could also work through the exercises in your own R Markdown file in order to keep the results permanently. If you do that, start the file with the three code chunks I talk about below. Then copy and paste the questions into your document and put your solutions in R code chunks.

If you haven’t done so already, please go through the R Basics document.

When you start your own document, you should have the following three code chunks at the top of your R Markdown file:

  1. Options that control what happens to the R code chunks.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
  2. Libraries that are used and other settings, like a theme you would like to use throughout the document. If you have not yet installed the libraries you are going to use, you will first have to install them. Go to the Packages tab (top of lower right box) and choose Install. You can then list the packages you would like to install. Alternatively, you can use the install.packages() function in the console and write the name of each of the packages you want to install. Some packages (like my gardenR package) need to be installed in a special way, using the install_github() function in the remotes library. To install gardenR, uncomment (delete the hashtags from the front of) the two commented lines of code below and run them. Then either delete those two lines or comment them out again. You only need to install packages once, although you will need to re-install them if you upgrade to a new version of R. You need to load them with the library() statements each time you use them. There is a good analogy with lights: installing the package is like putting the light in the socket, and loading the package is like turning the light on.
library(tidyverse)         # for graphing and data cleaning
library(lubridate)         # for working with dates
library(palmerpenguins)    # for palmer penguin data
# library(remotes)        # for installing package from GitHub
# remotes::install_github("llendway/gardenR") # run if package is not already installed
library(gardenR)           # for Lisa's garden data
theme_set(theme_minimal()) # my favorite ggplot theme
  3. Load data that will be used. Data from packages can be loaded using the data() function. Data outside of a package can be loaded in different ways, depending on where it is and what type of data it is. Later in the course, we will learn different functions that can be used to read in data from other places.
# Palmer Penguins data from palmerpenguins library
data("penguins")

# Lisa's garden data from gardenR library
data("garden_harvest")
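
The install-once, load-every-time distinction from step 2 can be sketched as follows. This is only an illustration, not part of the required setup: the names `needed`, `is_installed`, and `missing_pkgs` are made up for this example, and the install line is left commented out so nothing gets installed unless you choose to run it.

```r
# CRAN packages this tutorial loads (gardenR comes from GitHub instead)
needed <- c("tidyverse", "lubridate", "palmerpenguins")

# requireNamespace() returns TRUE if a package is installed, without attaching it
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]
missing_pkgs   # character(0) means everything is already installed

# Install once (uncomment to run), then load with library() in every session
# install.packages(missing_pkgs)
```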

Motivation

Before jumping into teaching you some Data Science skills in R, I want to give you some motivation. I picked three graphs I’ve recently seen on Twitter. These are all responses to #TidyTuesday, which you’ll be participating in very soon! Read more about it here if you’re curious. There are many definitions of Data Science, but I broadly like to think of it as using data to tell a story. These three graphs are a small sample of doing just that.

One of my favorite Data Visualizers on Twitter:

One of my former students (and your preceptor!):

A #TidyTuesday newcomer:

Learning goals

After this tutorial, you should be able to do the following.

  • Construct the “Five Named Graphs” using ggplot2 functions.
  • Add labels to graphs.
  • Change the theme of a graph.
  • Interpret or explain the graph you created.
  • Use the six main dplyr functions to begin “wrangling” data.
  • Pipe (%>%) together a sequence of dplyr functions to answer a question.
  • Combine dplyr verbs and ggplot() functions to wrangle and plot data.

Data

We will use two different datasets throughout this tutorial.

Palmer Penguins

The Palmer Penguins dataset is from the palmerpenguins library. The data we will use is called penguins. You can read about it within R by typing ?penguins in the console.

Let’s do some basic exploration of the data. The code below uses the dim() function to find the dimensions of the dataset - the number of rows and columns.

dim(penguins)
## [1] 344   8

And we use the head() function to view the first 6 rows of the data.

head(penguins)

Lisa’s Garden Data

The garden_harvest dataset contains data that I collected from my personal garden in the summer of 2020. You can view the original Google Sheet here. Each row in the data is a “harvest” for a variety of a vegetable. So a vegetable might have multiple rows on a given day, especially if it is something I eat twice a day (lettuce) or there are many different varieties of it (tomatoes).

I fondly refer to my garden as the “Jungle Garden” because by the end of the summer all the plants are creeping out of their beds and it can be quite the adventure walking through it. Take a look at the video below for an in-depth tour of the garden and details around how I collect the data.

Voicethread: Jungle Garden tour

Let’s also get an overview of this dataset.

Use the dim() function to find the number of cases and variables in the dataset.

Use the glimpse() function to show the first few cases of each of the variables and see the type of variable.
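
To show what these two functions return, here is a small sketch that applies them to the penguins data rather than garden_harvest, so the garden version is still yours to try. It assumes the dplyr and palmerpenguins packages are installed.

```r
library(dplyr)           # provides glimpse()
library(palmerpenguins)  # provides the penguins data

dim(penguins)      # number of rows (cases), then number of columns (variables)
glimpse(penguins)  # one line per variable: its type and its first few values
```

Swapping in garden_harvest (after loading gardenR) should give the same kind of overview for the garden data.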

Creating graphs with ggplot()

Now, let’s get ready to plot some data! The concept map below provides an overview of the functions you will be learning, how they relate to one another, and what they do.

First, watch the video below that introduces the ggplot() syntax. You can download the slides. They will open in a web browser, and you can press the letter “p” to go to presentation mode and see my notes that go along with them.

Voicethread: Intro to ggplot()

Next, watch the video below that walks through some examples in RStudio. You can practice along with me by downloading the R Markdown file and working through the problems. If you do that, you will likely get somewhat different results than you see in the video when using the garden_harvest data because I made the videos when I was still in the midst of collecting data :)

Voicethread: ggplot() demo

Lastly, watch this short video about common mistakes. Hopefully you won’t make them, but admittedly I sometimes still do.

Voicethread: ggplot() mistakes
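
As a small illustration of the ggplot() syntax covered in the videos above, here is a sketch of one plot built up layer by layer. The choice of variables, the label text, and the object name `p` are mine for this example and are not taken from the videos.

```r
library(ggplot2)         # part of the tidyverse
library(palmerpenguins)  # provides the penguins data

# Map variables to aesthetics inside aes(), then add layers with +
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +                                   # one point per penguin
  labs(title = "Penguin body mass vs. flipper length",
       x = "Flipper length (mm)",
       y = "Body mass (g)") +                      # readable labels
  theme_minimal()                                  # change the overall look

p  # printing the object draws the plot (rows with missing values are dropped with a warning)
```

A very common mistake is putting the + at the start of the next line instead of at the end of the current one, which makes R stop reading the plot too early.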

More with ggplot()!

I couldn’t resist giving a few more tips and tricks for creating beautiful plots using ggplot()!

View the video below:

Voicethread: More with ggplot()

And follow along with the code!

Resources

Your turn!

Now you have the tools you need to begin creating your own plots. As you work through these exercises, it will be helpful to have the Data Visualization with ggplot2 cheatsheet open. Find the cheatsheet here or, from within RStudio, go to Help -> Cheatsheets and click on Data Visualization with ggplot2.

Exercise 1a: Scatterplots

Use the penguins data to create a scatterplot of bill_length_mm (x-axis) vs. bill_depth_mm (y-axis). I have started the code for you. How would you describe the relationship?

# Fill in each ___ blank to complete the plot
penguins %>% 
  ggplot(___(x = ___, 
             y = ___)) +
  geom____()

Exercise 1b: Scatterplots

Now use the code you wrote in the previous exercise but color the points by species. How does this change how you described the relationship before?

CHALLENGE: Scatterplots

Now use the code you wrote in the previous exercise but make the points smaller and more transparent.

Exercise 2a: Histograms

Create a histogram of the flipper_length_mm.

Exercise 2b: Histograms

Add a facet to the previous histogram so there is a different histogram for each species. Make it so there is one column of plots. How would you compare the distributions?

Exercise 3a: Barplots

Create a barplot that shows the number of penguins for each year. Fill in the bars with the color lightblue.

CHALLENGE: Barplots

The code below creates a new dataset called tomatoes. Use the tomatoes dataset to create a barplot that shows the number of days that each tomato variety has been harvested. Make the bars horizontal, fill them in with the color tomato4, and order them from most to least (hint: use fct_infreq() and fct_rev()). Also give the plot nice labels.

tomatoes <- garden_harvest %>% 
  filter(vegetable == "tomatoes") 

Exercise 4: Boxplots

Use boxplots to compare the flipper_length_mm by species. Make the boxplots horizontal. How does this graph compare to the faceted histogram you made above? What are the strengths and weaknesses of each type of graph?

Exercise 5a: Line graphs

The code below creates a dataset (tomatoes_wt_date) that has the total weight in grams of tomatoes (daily_wt_g) for each date. Use that to create a line graph of the weight of tomatoes harvested each day.

tomatoes_wt_date <- garden_harvest %>% 
  filter(vegetable == "tomatoes") %>% 
  group_by(date) %>% 
  summarize(daily_wt_g = sum(weight))

Exercise 5b: Line graphs

The code below creates a dataset (tomato_variety_daily) that has the total weight in grams of each variety of tomato (daily_wt_g) for each date. Use that to create a line graph of the weight of tomatoes harvested each day, with a separate line in a different color for each variety. What are some ways you might improve this graph?

tomato_variety_daily <- garden_harvest %>% 
  filter(vegetable == "tomatoes") %>% 
  group_by(date, variety) %>% 
  summarize(daily_wt_g = sum(weight))

Wrangling data with dplyr functions

Next, you will learn how to wrangle and manipulate data using six dplyr functions. There are many other functions we can use (and we will!) but these six will get us pretty far, especially when combined. The concept map below shows the six functions I will introduce and what they are used for.

First, watch the video below that introduces the dplyr functions. Again, you can download the slides. They will open in a web browser, and you can press the letter “p” to go to presentation mode and see my notes that go along with them.

Voicethread: Intro to dplyr

To recap, the six main dplyr verbs are summarized below.

Images from R Studio Cheatsheets: https://rstudio.com/resources/cheatsheets/
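
Here is a sketch of how several verbs can be piped together to answer one question: which penguin species is heaviest on average? I am assuming the six verbs meant here are select(), filter(), mutate(), group_by(), summarize(), and arrange(); the new variable names (`body_mass_kg`, `mean_mass_kg`, `penguin_summary`) are made up for this example.

```r
library(dplyr)
library(palmerpenguins)

penguin_summary <- penguins %>%
  select(species, body_mass_g) %>%                  # keep only these variables
  filter(!is.na(body_mass_g)) %>%                   # keep rows with a recorded mass
  mutate(body_mass_kg = body_mass_g / 1000) %>%     # add a new variable
  group_by(species) %>%                             # one group per species
  summarize(mean_mass_kg = mean(body_mass_kg)) %>%  # collapse each group to one row
  arrange(desc(mean_mass_kg))                       # sort from heaviest to lightest

penguin_summary
```

Each line takes the result of the line before it, so you can read the pipe from top to bottom as a sequence of steps.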

The main dplyr functions … illustrated

Here, I illustrate (I will not pretend to be an artist) some of the dplyr functions to highlight their main uses … and hopefully make you smile. The made-up dataset, called data (I know, it’s a terrible name), has three variables: pet is the type of pet the student owns, either a cat, a dog, or a fish (in this dataset, every student owns exactly one pet); n_classes is the number of classes the student is taking in this quarter; and hours_hw is the number of hours the student spends doing homework in a week. Descriptions are found below each illustration.

The raw data

We choose variables with select()
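
In code, choosing variables might look like the sketch below. The values in this small dataset are hypothetical, since only the three variable names come from the description above, and `pet_hours` is a name I made up for the result.

```r
library(dplyr)

# Hypothetical values; only the variable names come from the text
data <- tibble(
  pet       = c("cat", "dog", "fish", "dog"),
  n_classes = c(4, 3, 4, 5),
  hours_hw  = c(12, 9, 15, 20)
)

# Keep the pet and hours_hw columns; n_classes is dropped
pet_hours <- data %>%
  select(pet, hours_hw)

pet_hours
```

select() changes which columns are kept but never which rows, so pet_hours still has one row per student.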