ggplot() and dplyr tutorial
To find other tutorials for this class, go to the main website, https://ds112-lendway.netlify.app/.
Welcome to your first tutorial for this class, COMP/STAT 112: Introduction to Data Science! As you work through the different sections, there will be videos for you to watch (both embedded YouTube videos and links to the videos on Voicethread), files for you to download, and exercises for you to work through. The solutions to the exercises are usually provided, but in order to get the most out of these tutorials, you should work through the exercises and only look at the solutions if you get really stuck. You could also work through the exercises in your own R Markdown file in order to keep the results permanently. If you do that, start the file with the three code chunks I talk about below. Then copy and paste the questions into your document and put your solutions in R code chunks.
If you haven’t done so already, please go through the R Basics document.
When you start your own document, you should have the following three code chunks at the top of your R Markdown file:
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
If you have not installed a package yet, use the install.packages() function in the console and write the name of each of the packages you want to install. Some packages (like my gardenR package) need to be installed in a special way using the install_github() function in the remotes library - uncomment (delete the hashtags from the front) those two lines of code to install the library. Then either delete those two lines or comment them again. You only need to install packages once, although you will need to re-install them if you upgrade to a new version of R. You need to load them with the library() statements each time you use them. There is a good analogy with lights: installing the package is like putting the light in the socket, and loading the package is like turning the light on.
library(tidyverse) # for graphing and data cleaning
library(lubridate) # for working with dates
library(palmerpenguins) # for palmer penguin data
# library(remotes) # for installing package from GitHub
# remotes::install_github("llendway/gardenR") # run if package is not already installed
library(gardenR) # for Lisa's garden data
theme_set(theme_minimal()) # my favorite ggplot theme
Data that comes with a package can be loaded using the data() function. Data outside of a package can be loaded in different ways depending on where it is and what type of data it is. Later in the course, we will learn different functions that can be used to read in data from other places.
# Palmer Penguins data from palmerpenguins library
data("penguins")
# Lisa's garden data from gardenR library
data("garden_harvest")
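As a minimal sketch of the install-once, load-every-session idea described above: the requireNamespace() check is an assumption of this example (the three chunks do not require it), and palmerpenguins is used only because it installs from CRAN in the ordinary way.

```r
# Install only if the package is missing (normally done once, in the console).
# requireNamespace() checks for the package without loading it.
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins")
}

# Load the package in every new R session ("turning the light on").
library(palmerpenguins)

# Load a dataset that ships with the package, then check its size.
data("penguins")
dim(penguins)
```

If the package is already installed, the if block does nothing, so the chunk can be rerun safely.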
Before jumping into teaching you some Data Science skills in R, I want to give you some motivation. I picked three graphs I’ve recently seen on Twitter. These are all responses to #TidyTuesday, which you’ll be participating in very soon! Read more about it here if you’re curious. There are many definitions of Data Science, but I broadly like to think of it as using data to tell a story. These three graphs are just a small sample of doing just that.
One of my favorite Data Visualizers on Twitter:
And here's the #makingof of this week's #TidyTuesday submission #dataviz #rstats https://t.co/zPvjs4KdaH pic.twitter.com/iqTuOFpP4b
— Georgios Karamanis (@geokaramanis) April 18, 2020
One of my former students (and your preceptor!):
This wk's @R4DScommunity #TidyTuesday: guess what a centered dot-plot of astronauts in space by year and nation looks a lot like?
— lil bobby tables 🐳 (@robert_b_) July 15, 2020
A space station in mid-orbit (or Cloud City)! #RStats #r4ds #DataScience #DataViz #tidyverse #ggplot2 pic.twitter.com/hqW7KLWmsn
A #TidyTuesday newcomer:
My first #TidyTuesday! I decided to K.I.S.S. and focus on aesthetic for my first week. Thank you @kllycttn for pointing me to the futurevisions palettes!
— Kelly Morrow McCarthy (@KellyMM_neuro) August 20, 2020
GitHub: https://t.co/S5YP0pFlvq
futurevisions: https://t.co/h0dfUYFOqi pic.twitter.com/7hqsz7cwdb
After this tutorial, you should be able to do the following.
Create plots using ggplot2 functions.
Use dplyr functions to begin “wrangling” data.
Pipe (%>%) together a sequence of dplyr functions to answer a question.
Combine dplyr verbs and ggplot() functions to wrangle and plot data.
We will use two different datasets throughout this tutorial.
The Palmer Penguins dataset is from the palmerpenguins library. The data we will use is called penguins. You can read about it within R by typing ?penguins in the console.
Let’s do some basic exploration of the data. The code below uses the dim() function to find the dimensions of the dataset - the number of rows and columns.
dim(penguins)
## [1] 344 8
And we use the head() function to view the first 6 rows of the data.
head(penguins)
The garden_harvest data contains data that I collected from my personal garden in the summer of 2020. You can view the original Google Sheet here. Each row in the data is a “harvest” for a variety of a vegetable. So, vegetables might have multiple rows on a day, especially if they are things I eat twice a day (lettuce) or there are many different varieties of the vegetable (tomatoes).
I fondly refer to my garden as the “Jungle Garden” because by the end of the summer all the plants are creeping out of their beds and it can be quite the adventure walking through it. Take a look at the video below for an in-depth tour of the garden and details around how I collect the data.
Voicethread: Jungle Garden tour
Let’s also get an overview of this dataset.
Use the dim() function to find the number of cases and variables in the dataset.
Use the glimpse() function to show the first few cases of each of the variables and see the type of variable.
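To show what these two functions return without giving away the garden_harvest answers, here is a small sketch on the penguins data instead. It assumes the tidyverse and palmerpenguins packages are installed.

```r
library(tidyverse)       # glimpse() is available once the tidyverse is loaded
library(palmerpenguins)  # provides the penguins data

# dim() returns the number of rows (cases) and columns (variables).
dim(penguins)

# glimpse() prints one line per variable: its name, its type,
# and the first few values, so wide data stays readable.
glimpse(penguins)
```

Compared with head(), which prints whole rows, glimpse() turns the data on its side so every variable is visible even when there are many columns.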
ggplot()
Now, let’s get ready to plot some data! The concept map below provides an overview of the functions you will be learning, how they relate to one another, and what they do.
First, watch the video below that introduces the ggplot() syntax. You can download the slides. They will open in a web browser, and you can press the letter “p” to go to presentation mode and see my notes that go along with them.
Voicethread: Intro to ggplot()
Next, watch the video below that walks through some examples in R Studio. You can practice along with me by downloading the R Markdown file and working through the problems. If you do that, you will likely get somewhat different results than you see in the video when using the garden_harvest data because I made the videos when I was still in the midst of collecting data :)
Lastly, watch this short video about common mistakes. Hopefully you won’t make them, but admittedly I sometimes still do.
ggplot()!
I couldn’t resist giving a few more tips and tricks for creating beautiful plots using ggplot()!
View the video below:
Voicethread: More with ggplot()
And follow along with the code!
Now you have the tools you need to begin creating your own plots. As you work through these exercises, it will be helpful to have the Data Visualization with ggplot2 cheatsheet open. Find the cheatsheet here or, from within R Studio, go to Help > Cheatsheets and click on Data Visualization with ggplot2.
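Before starting the exercises, it may help to see the general shape of a ggplot() call in one place. The sketch below is only an illustration: the variables, point size, transparency, and labels are choices made for this example, not answers to any exercise.

```r
library(tidyverse)
library(palmerpenguins)

p <- penguins %>%
  # aes() maps variables in the data to visual properties of the plot.
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g,
             color = island)) +
  # Settings outside aes() are fixed values applied to every point.
  geom_point(size = 1.5, alpha = 0.6) +
  # One panel per island, stacked in a single column.
  facet_wrap(vars(island), ncol = 1) +
  labs(title = "Body mass vs. flipper length",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Island")

p  # printing the object draws the plot
```

Expect a warning about rows removed because of missing values; a few penguins have no measurements recorded. The key distinction is that color = island sits inside aes() because it depends on the data, while size and alpha sit outside because they are constants.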
Use the penguins data to create a scatterplot of bill_length_mm (x-axis) vs. bill_depth_mm (y-axis). I have started the code for you. How would you describe the relationship?
penguins %>%
  ggplot(___(x = ___,
             y = ___)) +
  geom____()
Now use the code you wrote in the previous exercise but color the points by species. How does this change how you described the relationship before?
Now use the code you wrote in the previous exercise but make the points smaller and more transparent.
Create a histogram of flipper_length_mm.
Add a facet to the previous histogram so there is a different histogram for each species. Make it so there is one column of plots. How would you compare the distributions?
Create a barplot that shows the number of penguins for each year. Fill in the bars with the color lightblue.
The code below creates a new dataset called tomatoes. Use the tomatoes dataset to create a barplot that shows the number of days that each tomato variety has been harvested. Make the bars horizontal, fill them in with the color tomato4, and order them from most to least (hint: use fct_infreq() and fct_rev()). Also give the plot nice labels.
tomatoes <- garden_harvest %>%
  filter(vegetable == "tomatoes")
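The fct_infreq() and fct_rev() hint above can be seen in action on a different dataset. This is a sketch on penguins rather than tomatoes, so the fill color and labels are choices made for this example only.

```r
library(tidyverse)
library(palmerpenguins)

# fct_infreq() reorders the levels from most to least frequent.
levels(fct_infreq(penguins$species))

# On a vertical axis the first level is drawn at the bottom, so
# fct_rev() flips the order to put the most frequent level on top.
p <- penguins %>%
  ggplot(aes(y = fct_rev(fct_infreq(species)))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of penguins by species",
       x = "Number of penguins",
       y = NULL)

p
```

Mapping the variable to y instead of x is what makes the bars horizontal here; this works in recent versions of ggplot2, and coord_flip() is an older alternative.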
Use boxplots to compare the flipper_length_mm by species. Make the boxplots horizontal. How does this graph compare to the faceted histogram you made above? What are the strengths and weaknesses of each type of graph?
The code below creates a dataset (tomatoes_wt_date) that has the weight in grams of tomatoes (daily_wt_g) for each date. Use that to create a line graph of the weight of tomatoes harvested each day.
tomatoes_wt_date <- garden_harvest %>%
  filter(vegetable == "tomatoes") %>%
  group_by(date) %>%
  summarize(daily_wt_g = sum(weight))
The code below creates a dataset (tomato_variety_daily) that has the weight in grams of each variety of tomato (daily_wt_g) for each date. Use that to create a line graph of the weight of tomatoes harvested each day, where there is a separate line for each variety, in a different color. What are some ways you might improve this graph?
tomato_variety_daily <- garden_harvest %>%
  filter(vegetable == "tomatoes") %>%
  group_by(date, variety) %>%
  summarize(daily_wt_g = sum(weight))
dplyr functions
Next, you will learn how to wrangle and manipulate data using six dplyr functions. There are many other functions we can use (and we will!) but these six will get us pretty far, especially when combined. The concept map below shows the six functions I will introduce and what they are used for.
First, watch the video below that introduces the dplyr functions. Again, you can download the slides. They will open in a web browser, and you can press the letter “p” to go to presentation mode and see my notes that go along with them.
To recap, the six main dplyr verbs are summarized below.
Images from R Studio Cheatsheets: https://rstudio.com/resources/cheatsheets/
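As a rough companion to the summary, here is one short call per verb on the penguins data. This sketch assumes the six verbs are select(), filter(), mutate(), arrange(), group_by(), and summarize(); the text here names only some of them, so treat the full list as an assumption. The new column names (body_mass_kg, mean_mass_g) are made up for this example.

```r
library(tidyverse)
library(palmerpenguins)

# select(): keep only some variables (columns).
penguins %>% select(species, body_mass_g)

# filter(): keep only the cases (rows) that meet a condition.
penguins %>% filter(species == "Gentoo")

# mutate(): add a new variable computed from existing ones.
penguins %>% mutate(body_mass_kg = body_mass_g / 1000)

# arrange(): sort the rows, here from heaviest to lightest.
penguins %>% arrange(desc(body_mass_g))

# group_by() + summarize(): one summary row per group.
mass_by_species <- penguins %>%
  group_by(species) %>%
  summarize(mean_mass_g = mean(body_mass_g, na.rm = TRUE))

mass_by_species
```

None of these calls change penguins itself. Each returns a new data frame, which is why the last result is saved with <- before it is printed.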
dplyr functions … illustrated
Here, I illustrate (I will not pretend to be an artist) some of the dplyr functions to highlight their main uses … and hopefully make you smile. In the made-up dataset, called data (I know, it’s a terrible name), there are three variables: pet is the type of pet the student owns - a cat, a dog, or a fish (they all own a pet and only one in this dataset); n_classes is the number of classes the student is taking in this quarter; and hours_hw is the number of hours the student spends doing homework in a week. Descriptions are found below each illustration.
The raw data
We choose variables with select()
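For readers who want to try this in code, here is a small sketch of a dataset shaped like the one in the illustration. The variable names come from the description above, but the four rows of values are invented for this example and are not the data from the drawings.

```r
library(tidyverse)

# Hypothetical values; only the variable names come from the tutorial.
data <- tibble(
  pet       = c("cat", "dog", "fish", "dog"),
  n_classes = c(4, 3, 5, 4),
  hours_hw  = c(10, 12, 20, 8)
)

# select() keeps the named variables and drops the rest.
selected <- data %>%
  select(pet, hours_hw)

selected
```

The result has the same number of rows as data but only the two chosen columns.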