Welcome to your first tutorial for this class, COMP/STAT 112: Introduction to Data Science! As you work through the different sections, there will be videos for you to watch (both embedded YouTube videos and links to the videos on Voicethread), files for you to download, and exercises for you to work through. The solutions to the exercises are usually provided, but in order to get the most out of these tutorials, you should work through the exercises and only look at the solutions if you get really stuck. You could also work through the exercises in your own R Markdown file in order to keep the results permanently. If you do that, start the file with the three code chunks I talk about below. Then copy and paste the questions into your document and put your solutions in R code chunks.
If you haven’t done so already, please go through the R Basics document.
When you start your own document, you should have the following three code chunks at the top of your R Markdown file:
Options that control what happens to the R code chunks.
Libraries that are used and other settings, like a theme you would like to use throughout the document. If you have not yet installed the libraries you are going to use, you will first have to install them. Go to the Packages tab (top of lower right box) and choose Install. You can then list the packages you would like to install. Alternatively, you can use the install.packages() function in the console and write the name of each of the packages you want to install. Some packages (like my gardenR package) need to be installed in a special way using the install_github() function in the remotes library. To install gardenR, uncomment those two lines of code (delete the hashtags from the front) and run them. Then either delete those two lines or comment them out again. You only need to install packages once, although you will need to re-install them if you upgrade to a new version of R. You need to load them with the library() statements each time you use them. There is a good analogy with lights: installing the package is like putting the light in the socket, and loading the package is like turning the light on.
library(tidyverse) # for graphing and data cleaning
library(lubridate) # for working with dates
library(palmerpenguins) # for palmer penguin data
# library(remotes) # for installing package from GitHub
# remotes::install_github("llendway/gardenR") # run if package is not already installed
library(gardenR) # for Lisa's garden data
theme_set(theme_minimal()) # my favorite ggplot theme
Load data that will be used. Data from packages can be loaded using the data() function. Data outside of a package can be loaded in different ways depending where it is and what type of data it is. Later in the course, we will learn different functions that can be used to read in data from other places.
# Palmer Penguins data from palmerpenguins library
data("penguins")
# Lisa's garden data from gardenR library
data("garden_harvest")
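The difference between installing a package (once) and loading it (every session) can be sketched in base R. This is only an illustration: the helper name missing_packages is made up for this example and is not part of any package.

```r
# Hypothetical helper (not from any package): report which of the
# requested packages are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "lubridate", "palmerpenguins")
to_install <- missing_packages(needed)
to_install

# Installing is done once, usually in the console:
# install.packages(to_install)

# Loading is done with library() every time you start a new session:
# library(tidyverse)
```

If to_install comes back empty, everything in needed is already installed and only the library() statements are required.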
Motivation
Before jumping into teaching you some Data Science skills in R, I want to give you some motivation. I picked three graphs I’ve recently seen on Twitter. These are all responses to #TidyTuesday which you’ll be participating in very soon! Read more about it here if you’re curious. There are many definitions of Data Science but I broadly like to think of it as using data to tell a story. These three graphs are just a small sample of doing just that.
My first #TidyTuesday! I decided to K.I.S.S. and focus on aesthetic for my first week. Thank you @kllycttn for pointing me to the futurevisions palettes!
After this tutorial, you should be able to do the following.
Construct the “Five Named Graphs” using ggplot2 functions.
Add labels to graphs.
Change the theme of a graph.
Interpret or explain the graph you created.
Use the six main dplyr functions to begin “wrangling” data.
Pipe (%>%) together a sequence of dplyr functions to answer a question.
Combine dplyr verbs and ggplot() functions to wrangle and plot data.
Data
We will use two different datasets throughout this tutorial.
Palmer Penguins
The Palmer Penguins dataset is from the palmerpenguins library. The data we will use is called penguins. You can read about it within R by typing ?penguins in the console.
Let’s do some basic exploration of the data. The code below uses the dim() function to find the dimensions of the dataset - the number of rows and columns.
dim(penguins)
## [1] 344 8
And we use the head() function to view the first 6 rows of the data.
head(penguins)
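The same exploration functions can be tried on a tiny made-up dataset, which makes it easy to check the output by eye. The data frame toy and its values are assumptions for illustration only; glimpse() comes from dplyr.

```r
library(dplyr)

# A small made-up data frame (not the penguins data)
toy <- tibble(
  id    = 1:8,
  group = rep(c("a", "b"), times = 4)
)

dim(toy)      # number of rows, then number of columns
head(toy)     # first 6 rows by default
head(toy, 3)  # first 3 rows
glimpse(toy)  # one line per variable, showing its type and first values
```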
Lisa’s Garden Data
The garden_harvest data contains data that I have collected from my personal garden in the summer of 2020. You can view the original google sheet here. Each row in the data is a “harvest” for a variety of a vegetable. So, vegetables might have multiple rows on a day, especially if they are things I eat twice a day (lettuce) or there are many different varieties of the vegetable (tomatoes).
I fondly refer to my garden as the “Jungle Garden” because by the end of the summer all the plants are creeping out of their beds and it can be quite the adventure walking through it. Take a look at the video below for an in-depth tour of the garden and details around how I collect the data.
Use the dim() function to find the number of cases and variables in the dataset.
Use the glimpse() function to show the first few cases of each of the variables and see the type of variable.
Creating graphs with ggplot()
Now, let’s get ready to plot some data! The concept map below provides an overview of the functions you will be learning, how they relate to one another, and what they do.
First, watch the video below that introduces the ggplot() syntax. You can download the slides. They will open in a web browser, and you can press the letter “p” to go to presentation mode and see my notes that go along with them.
Next, watch the video below that walks through some examples in R Studio. You can practice along with me by downloading the R Markdown file and working through the problems. If you do that, you will likely get somewhat different results than you see in the video when using the garden_harvest data because I made the videos when I was still in the midst of collecting data :)
Now you have the tools you need to begin creating your own plots. As you work through these exercises, it will be helpful to have the Data Visualization with ggplot2 cheatsheet open. Find the cheatsheet here or, from within R Studio, go to Help –> Cheatsheets and click on Data Visualization with ggplot2.
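As a warm-up before the exercises, here is a rough sketch of how a ggplot is built up in layers: data and aesthetic mappings, a geom, labels, and a theme. The data frame toy and all of its values are made up for illustration.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6),
  y     = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
  group = rep(c("a", "b"), each = 3)
)

# Start with the data and the aesthetic mappings, then add layers with +
p <- ggplot(toy, aes(x = x, y = y, color = group)) +
  geom_point(size = 2, alpha = 0.7) +   # the geometry to draw
  labs(title = "A made-up scatterplot", # labels
       x = "x value",
       y = "y value",
       color = "Group") +
  theme_classic()                       # theme for this one plot

p
```

Changing geom_point() to another geom, or theme_classic() to another theme, changes the look of the plot without touching the data or the mappings.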
Exercise 1a: Scatterplots
Use the penguins data to create a scatterplot of bill_length_mm (x-axis) vs. bill_depth_mm (y-axis). I have started the code for you. How would you describe the relationship?
penguins %>%
ggplot( (x = ,
y = )) +
geom_()
Exercise 1b: Scatterplots
Now use the code you wrote in the previous exercise but color the points by species. How does this change how you described the relationship before?
CHALLENGE: Scatterplots
Now use the code you wrote in the previous exercise but make the points smaller and more transparent.
Exercise 2a: Histograms
Create a histogram of the flipper_length_mm.
Exercise 2b: Histograms
Add a facet to the previous histogram so there is a different histogram for each species. Make it so there is one column of plots. How would you compare the distributions?
Exercise 3a: Barplots
Create a barplot that shows the number of penguins for each year. Fill in the bars with the color lightblue.
CHALLENGE: Barplots
The code below creates a new dataset called tomatoes. Use the tomatoes dataset to create a barplot that shows the number of days that each tomato variety has been harvested. Make the bars horizontal, fill them in with the color tomato4, and order them from most to least (hint: use fct_infreq() and fct_rev()). Also give the plot nice labels.
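Separately from the tomatoes code, the two functions in the hint can be tried on a made-up factor first to see what they do to the order of the levels. The vector x below is an assumption for illustration; the functions come from the forcats package, which is loaded with the tidyverse.

```r
library(forcats)

# Made-up factor: "c" appears 3 times, "b" twice, "a" once
x <- factor(c("a", "b", "b", "c", "c", "c"))

levels(x)                       # alphabetical by default
levels(fct_infreq(x))           # most frequent level first
levels(fct_rev(fct_infreq(x)))  # order reversed: least frequent level first
```

On a discrete y-axis, ggplot draws the first level at the bottom, which is why reversing the frequency order can be useful for horizontal bars.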
Exercise 4: Boxplots
Use boxplots to compare the flipper_length_mm by species. Make the boxplots horizontal. How does this graph compare to the faceted histogram you made above? What are the strengths and weaknesses of each type of graph?
Exercise 5a: Line graphs
The code below creates a dataset (tomatoes_wt_date) that has the weight in grams of tomatoes (daily_wt_g) for each date. Use that to create a linegraph of the weight of tomatoes harvested each day.
Exercise 5b: Line graphs
The code below creates a dataset (tomato_variety_daily) that has the weight in grams of each variety of tomato (daily_wt_g) for each date. Use that to create a linegraph of the weight of tomatoes harvested each day, where there is a separate line for each variety, in a different color. What are some ways you might improve this graph?
Next, you will learn how to wrangle and manipulate data using six dplyr functions. There are many other functions we can use (and we will!) but these six will get us pretty far, especially when combined. The concept map below shows the six functions I will introduce and what they are used for.
First, watch the video below that introduces the dplyr functions. Again, you can download the slides. They will open in a web browser, and you can press the letter “p” to go to presentation mode and see my notes that go along with them.
Here, I illustrate (I will not pretend to be an artist) some of the dplyr functions to highlight their main uses … and hopefully make you smile. In the made up dataset, called data (I know, it’s a terrible name), there are three variables: pet is the type of pet the student owns - a cat, a dog, or a fish (they all own a pet and only one in this dataset); n_classes is the number of classes the student is taking in this quarter; and hours_hw is the number of hours the student spends doing homework in a week. Descriptions are found below each illustration.
The raw data
We choose variables with select()
We add variables (keeping the same number of rows) with mutate(). Inside the mutate() function we need to tell it the details of how to compute the new variable.
We choose rows with filter()
We order rows with arrange()
We group data with group_by(), which changes the data “internally” but won’t make it look different. If you find yourself saying “for each” to describe something you want to do, you will probably want to use group_by() first.
We often use summarize() after a group_by(). This will add a new variable and decrease the number of rows based on how we summarize the data. Like mutate(), we also need to give the details of how to compute the new variable inside the function.
We can also group_by() more than one variable which will group observations by each combination of values of those variables.
Here is an example of using a summarize() after a group_by() that includes two variables. See the extra section on group_by() and summarize() for more information.
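The verbs illustrated above can also be tried in code. This is a sketch on a made-up version of the data dataset: only the variable names (pet, n_classes, hours_hw) come from the description, and all of the values are assumptions.

```r
library(dplyr)

# Made-up values; only the variable names come from the description above
data <- tibble(
  pet       = c("cat", "dog", "fish", "dog", "cat", "dog"),
  n_classes = c(4, 3, 4, 3, 4, 5),
  hours_hw  = c(10, 12, 8, 20, 6, 15)
)

data %>% select(pet, hours_hw)                        # choose variables
data %>% mutate(hw_per_class = hours_hw / n_classes)  # add a variable
data %>% filter(pet == "dog")                         # choose rows
data %>% arrange(desc(hours_hw))                      # order rows

# "For each" type of pet, compute the average homework hours
avg_hw <- data %>%
  group_by(pet) %>%
  summarize(mean_hours_hw = mean(hours_hw))
avg_hw

# Grouping by two variables: one row per observed combination
by_pet_classes <- data %>%
  group_by(pet, n_classes) %>%
  summarize(n_students = n(), .groups = "drop")
by_pet_classes
```

Notice that select(), mutate(), filter(), and arrange() return the same kind of rows as the input, while summarize() collapses each group to a single row.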
Useful functions and operators
This table shows the logical operators often used with the filter() verb.
| Operator | Meaning |
|---|---|
| == | Equal to |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal to |
| <= | Less than or equal to |
| != | Not equal to |
| %in% | in |
| is.na | is a missing value (NA) |
| !is.na | is not a missing value |
| & | and |
| \| | or |
The table below shows common functions you would use when you add a variable using mutate(). Find more in the dplyr cheatsheet linked in Resources.
| Function | Meaning |
|---|---|
| +, -, *, /, ^ | Arithmetic operations |
| >, <, etc. | Logical operators (see above) |
| ifelse() | Used to create binary variable from non-binary |
| lag() | Offset elements by 1 |
| cumsum() | Cumulative sum |
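Here is a sketch of these functions used inside mutate(), on a made-up data frame (toy and its values are assumptions for illustration).

```r
library(dplyr)

# Made-up data: one amount per day
toy <- tibble(
  day    = 1:5,
  amount = c(100, 250, 0, 400, 150)
)

toy_more <- toy %>%
  mutate(
    doubled       = amount * 2,                              # arithmetic
    size          = ifelse(amount > 200, "large", "small"),  # binary variable
    previous      = lag(amount),                             # value from the row before
    change        = amount - lag(amount),                    # difference from the row before
    running_total = cumsum(amount)                           # cumulative sum
  )

toy_more
```

The first value of previous is NA because there is no earlier row to take a value from.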
The table below shows common functions you would use when you add a summary variable using summarize(). Find more in the dplyr cheatsheet linked in Resources. With many of these functions, you can add na.rm = TRUE to remove missing values.
| Function | Meaning |
|---|---|
| n() | number of rows |
| mean() | mean |
| median() | median |
| sum() | sum |
| sd() | standard deviation |
| IQR() | interquartile range |
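A sketch of these functions inside summarize(), including what na.rm = TRUE changes. The data frame toy and its values are made up for illustration.

```r
library(dplyr)

# Made-up data with one missing value
toy <- tibble(amount = c(100, 250, NA, 400, 150))

toy_summary <- toy %>%
  summarize(
    n_rows        = n(),                           # counts rows, including the NA row
    mean_with_na  = mean(amount),                  # NA, because a value is missing
    mean_amount   = mean(amount, na.rm = TRUE),    # missing value removed first
    median_amount = median(amount, na.rm = TRUE),
    total_amount  = sum(amount, na.rm = TRUE),
    sd_amount     = sd(amount, na.rm = TRUE),
    iqr_amount    = IQR(amount, na.rm = TRUE)
  )

toy_summary
```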
Demo video
Next, watch the video below that walks through some examples in R Studio. Just like with the ggplot() material, you can practice the dplyr problems along with me by downloading the R Markdown file and working through them. If you do that, you will likely get somewhat different results than you see in the video when using the garden_harvest data because the data has changed since I made the video. At the time of the video, I was still collecting data from the garden :)
In the demo video, I don’t think I did enough to emphasize the grouping behavior and how it differs depending on which function comes after it. I highlight some key behaviors and give some recommendations below:
When group_by() is followed by a summarize(), the data will be grouped one level higher than in the group_by(). For example, in the code below the group_vars() function tells us by which variables the data are grouped. In the first set of code, they are grouped by vegetable in the end; but in the second, they are grouped by date in the end. This also means that if there is only one grouping variable, after a summarize(), the data are no longer grouped. This is illustrated in the third piece of code (note that it outputs character(0) which means there isn’t one).
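The three cases described above can be reproduced on a small made-up data frame. This is a sketch: toy_harvest and its values are assumptions, borrowing only the column names vegetable, date, and weight from garden_harvest.

```r
library(dplyr)

# Made-up data with the same column names as garden_harvest
toy_harvest <- tibble(
  vegetable = c("beans", "beans", "peas", "peas"),
  date      = as.Date(c("2020-07-01", "2020-07-02", "2020-07-01", "2020-07-02")),
  weight    = c(100, 150, 40, 60)
)

# Grouped by vegetable, then date: still grouped by vegetable afterwards
groups_1 <- toy_harvest %>%
  group_by(vegetable, date) %>%
  summarize(total_weight = sum(weight)) %>%
  group_vars()
groups_1

# Grouped by date, then vegetable: still grouped by date afterwards
groups_2 <- toy_harvest %>%
  group_by(date, vegetable) %>%
  summarize(total_weight = sum(weight)) %>%
  group_vars()
groups_2

# Only one grouping variable: no grouping left afterwards
groups_3 <- toy_harvest %>%
  group_by(vegetable) %>%
  summarize(total_weight = sum(weight)) %>%
  group_vars()
groups_3
```

When more than one grouping variable is used, summarize() also prints a message saying how the output is grouped, which is a helpful reminder of this behavior.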
It is important to pay close attention to how data are grouped after using group_by() followed by another function. The behavior of the grouping depends on which function follows the group_by(). Use the group_vars() function to check. You can always ungroup() and then use group_by() again, or add a group_by() to explicitly ensure the data are grouped in the way you intend them to be.
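For contrast with summarize(), here is a sketch of what happens when mutate() follows group_by(): the grouping stays in place until you remove it with ungroup(). The data frame toy_harvest and its values are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
toy_harvest <- tibble(
  vegetable = c("beans", "beans", "peas", "peas"),
  weight    = c(100, 150, 40, 60)
)

# mutate() after group_by() computes within each group ...
with_share <- toy_harvest %>%
  group_by(vegetable) %>%
  mutate(share_of_vegetable = weight / sum(weight))
with_share

# ... and the data are still grouped afterwards
group_vars(with_share)

# ungroup() removes the grouping
group_vars(ungroup(with_share))
```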
Select vegetable, date, and weight from the garden_harvest data. I have started the code for you below.
garden_harvest #What do I need to put here?
select()
Exercise 2a: mutate()
Add a variable for weight in kilograms, weight_kg. One kilogram is 1000 grams. I started the code below.
garden_harvest #What do I need to put here?
mutate()
Exercise 2b: mutate()
Keep the weight_kg variable from the previous problem and also add a variable to the garden_harvest data called day_of_week that returns the day of the week. HINT: Use the function wday() and add an argument to that function that is label=TRUE.
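To see what the hinted function does before using it inside mutate(), it can be tried on a single date. This is a small sketch using the lubridate package; the date is arbitrary.

```r
library(lubridate)

d <- ymd("2020-07-04")  # an arbitrary date

wday(d)                # day of the week as a number (by default, 1 = Sunday)
wday(d, label = TRUE)  # day of the week as an ordered factor of abbreviated names
```

Because the labeled version is an ordered factor, the days stay in weekday order (rather than alphabetical order) when they are used in tables and plots.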
Exercise 3a: filter()
Filter the garden_harvest data to observations that have weights less than 50 grams.
Exercise 3b: filter()
Filter the garden_harvest data to peas and beans with weights larger than 40 grams.
Exercise 4a: arrange()
Order the observations in the garden_harvest data from largest to smallest weight.
Exercise 4b: arrange()
Order the observations in the garden_harvest data from largest to smallest weight on each date.
Exercise 5a: summarize()
Find the total weight in grams and how many rows of data are in the garden_harvest data.
Exercise 5b: summarize() with group_by()
Find the total weight in grams harvested for each date. After doing the summarize(), how are the data grouped?
Exercise 6: combining dplyr verbs
I love tomatoes. Well, truthfully, I love things made out of tomatoes - spaghetti sauce, salsa, soups, and even ketchup. I always want to know which variety of tomato is most productive. In this exercise, start with the garden_harvest data, filter to tomatoes, find the total weight for each variety, compute a new variable to convert the weights from grams to pounds, and lastly sort the data from largest to smallest total weight in pounds. Which variety is best? Is there any information missing? Think about (and check!) how your data are grouped each step of the way.
Exercise 7: combining dplyr verbs and ggplot()
I’m curious if there are certain days during the week where I harvest more or less. In this exercise, start with the garden_harvest data and find the daily harvest in grams for each date. Then create two new variables: (1) the daily harvest in pounds and (2) the day of the week. Finally, plot the data so that there is a boxplot of the daily harvest in pounds for each day of the week (on the y-axis).
Exercise solutions
Exercise 1a: Scatterplots
penguins %>%
ggplot(aes(x = bill_length_mm,
y = bill_depth_mm)) +
geom_point()
Exercise 1b: Scatterplots
penguins %>%
ggplot(aes(x = bill_length_mm,
y = bill_depth_mm,
color = species)) +
geom_point()
CHALLENGE: Scatterplots
penguins %>%
ggplot(aes(x = bill_length_mm,
y = bill_depth_mm,
color = species)) +
geom_point(alpha = .5, size = .5)
CHALLENGE: Barplots
tomatoes %>%
ggplot(aes(y=fct_rev(fct_infreq(variety)))) +
geom_bar(fill = "tomato4") +
labs(title = "Tomatoes",
subtitle = "# of days each variety has been harvested",
x = "",
y = "")
Exercise 4: Boxplots
penguins %>%
ggplot(aes(x = flipper_length_mm, y = species)) +
geom_boxplot()
Exercise 5a: Line graphs
tomatoes_wt_date %>%
ggplot(aes(x = date, y = daily_wt_g)) +
geom_line()
Exercise 5b: Line graphs
tomato_variety_daily %>%
ggplot(aes(x = date, y = daily_wt_g, color = variety)) +
geom_line()