Intro2R

.title[
# Intro2R
]
.subtitle[
## Introduction to R Programming
]
.author[
### <a href="https://xp-song.github.io">Xiao Ping (XP) Song</a>
]
.institute[
### <a href="mailto:xp.song@u.nus.edu" class="email">xp.song@u.nus.edu</a> Course materials: <a href="https://github.com/xp-song/Intro2R" class="uri">https://github.com/xp-song/Intro2R</a>
]
.date[
### updated 2023-07-30
]

---

# Before we begin...

1. Navigate to course webpage and read background information  
*https://github.com/xp-song/Intro2R*

2. Ensure that you have installed __R__ on your computer, followed by __RStudio__  
*(follow links under '_Instructions_' section of webpage)*

3. [Download](https://github.com/xp-song/Intro2R/archive/master.zip) workshop materials  
*(green button on webpage)*

---

# Outline

__[About](#about)__  
 
[Getting Started](#Getstarted)

[General Syntax](#gensyntax)

[Data Structures](#structures)

[Functions](#functions)

[The _tidyverse_](#tidyverse)

[Useful Resources](#resources)

---

background-image: url(https://www.r-project.org/logo/Rlogo.svg)
background-size: 100px
background-position: 95% 15%
class: left

# What is R?

- Programming language and software environment with a command line interface

- RStudio is often used as a software client

???

- Cmd line interface is text-based
  
  - Excel: interact with the software using a GUI

- RStudio: E.g., Microsoft Word vs. text editor

- In this course, we will work entirely within RStudio

- Both R and RStudio are open source software

- Huge library of packages created by the R community

???

- Open source:

- Anyone can contribute, easy to look for help online
  
- *Packages*: Like apps/toolboxes
  - E.g., these course materials

---
class: left

# About this crash course

## What it IS
- Designed for those with minimal coding experience
- Give you a taste of what R can do

???
- ... hopefully give you some ideas on how you can use R for your own projects
  - Slide deck 1: Look at the basics (language, structure/syntax) 
  - Slide deck 2: Dive right into analysing data

## What it is NOT
- A substitute for practising the fundamentals of the language
- A lesson in statistics

???
- I have provided links to more resources in your notes.. check them out if you're interested

---

# Outline

[About](#about)  
 
__[Getting Started](#Getstarted)__

[General Syntax](#gensyntax)

[Data Structures](#structures)

[Functions](#functions)

[The _tidyverse_](#tidyverse)

[Useful Resources](#resources)

---

# Course materials

On your computer, navigate to downloaded folder

- _/notes_
- _/data_
- _PDF slide deck_
- _Intro2R.Rproj_

???

- Notes: Can refer
- Output: when we generate figures/graphs
- Open the RStudio Project file. This boots up RStudio (demo)

---

# RStudio Client

- __Console:__ Command line input/output  
- __Script editor:__ View/edit files that contain code
- __Environment/History__  
- __Files/Plots/Packages/Help/Viewer__

???

- Console: real-time interaction (inputs AND OUTPUTS)
  - All you would see if you ran R without RStudio!
  - TRY doing some calculations

- Create new .R script (File > New File > R script) > Script editor
  - Write some calculations > Cmd+Enter
  - Code sent to the console!
  - Comments # are skipped 
  - Scripts: edit/save/share our code with others!

---

# R Notebooks

## What are R Notebooks?

- R Notebooks (a.k.a. [R Markdown Notebooks](http://rmarkdown.rstudio.com)) are files ending with '_.Rmd_'.

- Compared to basic '_.R_' scripts, they allow us to:  
  - Write normal text alongside code
  - Interact with code within a single document
  - Generate (i.e. '_knit_') different types of files

???
- R Notebooks = R Markdown files

- E.g., All materials in this course were created from _.Rmd_ files.

- `install.packages("rmarkdown")` (already comes installed with RStudio)

**Try creating a new R Notebook `File > New File > R Notebook`**

---

# R Notebooks

<img src="images/notebook.png" width="700" style="display: block; margin: auto;" />
- __Header section:__ specify document [parameters](https://bookdown.org/yihui/rmarkdown/html-document.html) 
- __Normal text__ 
- __Code chunk:__ write code and specify code [parameters](https://bookdown.org/yihui/rmarkdown/r-code.html)

???
- Follow instructions

- Code chunk: Results appear beneath the chunk
  - plot(): our very first function (tool in toolbox)
  - Name of function always followed by brackets
  - Look up a function with `?`

**Save our new file as _'myanalysis.Rmd'_**

???

- *myanalysis.Rmd*: use for the rest of the lecture

---

# RStudio Projects

**Try creating a new RStudio Project `File > New Project > New Directory > New Project`**

---

# RStudio Projects

## What are RStudio Projects?

- RStudio Projects help organise your work into separate 'R sessions'.

- Each project has its own workspace a.k.a. 'working directory' (separate configuration, history, etc.)

???
- Workspace = Table in my office 
- How do I know which is my table?

- The location of the '_.Rproj_' file defines the 'working directory'

- **Type `getwd()` in the _console_ of our new project**
 
 - This returns the absolute path to our working directory 
 e.g., `/Users/<computer_username>/Desktop/test`
 
???

- E.g., tissue paper
- Likewise, the .Rproj file tells R that this is my workspace (Address: see window header)

- Also type `getwd()` in the *Intro2R* project

- *Delete test project*

---

# RStudio Projects

## 🌟 Best Practice

- Use _relative_ paths in your script, based on the _.Rproj_ file location

- **Try reading in data in your R Notebook** 
 `read.csv("<path to Intro2R folder>/Intro2R/data/ozone_data.csv")`
 `read.csv("data/ozone_data.csv")`
 
- Keep all project items in the working directory
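
A minimal sketch of why relative paths travel better. It assumes a project folder with a `data/` subfolder; the temporary folder `demo_project` and the tiny file `demo.csv` are made up for illustration and are not part of the course materials.

```r
# Create a throwaway "project" folder with a data/ subfolder (illustration only)
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(project_dir, "data", "demo.csv"),
          row.names = FALSE)

# Pretend the project is open: its folder becomes the working directory
old_wd <- setwd(project_dir)

rel_path <- file.path("data", "demo.csv")  # relative path: same on every computer
abs_path <- normalizePath(rel_path)        # absolute path: specific to this computer

demo <- read.csv(rel_path)                 # resolved against the working directory
nrow(demo)

setwd(old_wd)                              # restore the original working directory
```

Printing `abs_path` on two different computers would give two different results, while `rel_path` stays the same.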

???

- By using RStudio Projects... don't need to write out initial path

- Relative path: Scripts work across different computers

- **IMPT: getwd() in console may be diff from in code chunk in notebook**

- If .Rmd is in subfolders! `../`

---

# Outline

[About](#about)  
 
[Getting Started](#Getstarted)

__[General Syntax](#gensyntax)__

[Data Structures](#structures)

[Functions](#functions)

[The _tidyverse_](#tidyverse)

[Useful Resources](#resources)

---

# General Syntax

.left-column[ 
### Operators
]
.right-column[ 
__Arithmetic operators:__  
  
**E.g., Evaluate the following expression**  
$$
\frac{(1+2) \times (4-5)}{50}
$$
]

???

- Refer to your notes for the list of operators

- We will only go through some operators now

--
.right-column[

```r
(1+2)*(4-5)/50
```

```
## [1] -0.06
```
]

---

# General Syntax

.left-column[ 
### Operators
]
.right-column[ 
__Logical operators:__  
  
**E.g., Check if `1e3` is greater than or equal to `1*10^3`**  
]

--
.right-column[

```r
1e3 >= 1*10^3
```

```
## [1] TRUE
```
]

???
- `1e3` = Scientific/exponential notation
- 1000 can be written as 1e3

---

# General Syntax

.left-column[ 
### Operators
### Variables
]
.right-column[ 
__Variables are named objects used to store data__
- `<-` is used to assign values to variable names in R (E.g., `x <- 4`)
- Print variables by name (`x` vs. `"x"`)
- Assigning data to an existing variable overwrites it (`x <- 10`)
]
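.right-column[
A minimal sketch of the three points above. The name `X` and the values are arbitrary and only serve as an illustration.

```r
x <- 4     # assign the value 4 to the name x
x          # prints the stored value
"x"        # prints the text "x", not the variable

x <- 10    # assigning again overwrites the old value
X <- 99    # names are case sensitive: X and x are different variables
x + X
```
]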

???
- Arrow shortcut `Option/Alt`+`-`
- `=` produces a similar result.. but reserve it for other uses
- Let's assign a number to `hello`
- New variable in environment
- "hello" vs. hello (text in quotation)

- Case sensitive!

--
.right-column[ 
**🌟Best Practice**  
- Clear and consistent names
- Avoid numbers/symbols/whitespace
]

---

# General Syntax

.left-column[ 
### Operators
### Variables
]
.right-column[ 
__Data types and examples:__
- Numeric (`3.142`), Integer (`5L`)
- Character (`"hello"`)
- Logical (`TRUE, FALSE`)
- Complex
]

???
- Your own names..

.right-column[ 
**Check the data type of each variable using the functions `is.numeric()`, `is.integer()` and `is.character()`**
]
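.right-column[
One possible way to run these checks, sketched with made-up variables (`my_height`, `my_age`, `my_name` and `is_member` are hypothetical names, not from the course notes):

```r
my_height <- 1.75      # numeric (stored as a double)
my_age    <- 20L       # integer: note the L suffix
my_name   <- "Susan"   # character
is_member <- TRUE      # logical

class(my_height); class(my_age); class(my_name); class(is_member)

is.integer(my_age)     # TRUE
is.numeric(my_age)     # TRUE as well: integers also count as numeric
is.integer(my_height)  # FALSE: 1.75 is stored as a double
is.character(my_name)  # TRUE
```
]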

???
- Age remember to put L (integer) - both numeric and integer

---

# Outline

[About](#about)  
 
[Getting Started](#Getstarted)

[General Syntax](#gensyntax)

__[Data Structures](#structures)__

[Functions](#functions)

[The _tidyverse_](#tidyverse)

[Useful Resources](#resources)

---

# Data Structures

.left-column[ 
### Vectors
]
.right-column[
__About vectors:__  
- Linear collection of data 
- Must be of the _same_ data type
]
--
.right-column[
**Assign a _vector_ of names to the variable `name`**  
(use the concatenate function `c()`)

```r
name <- c("Me", "Tom", "Dick", "Harry", "Susan") # character vector
```
]

???
- *COERCION*: What if we add a number? Treated as "character" (Environment: "chr")

--
.right-column[
**Assign a _vector_ of numbers to the variable `age`**

```r
age <- c(20, 25, 30, 35, 40) # numeric vector
```
]
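.right-column[
Because a vector holds a single data type, R silently converts (coerces) mixed inputs. A small sketch with made-up values:

```r
mixed_chr <- c("Tom", 25)            # the number is coerced to text
mixed_chr
class(mixed_chr)                     # "character"

mixed_num <- c(20, 25, TRUE, FALSE)  # logicals are coerced to numbers
mixed_num                            # TRUE becomes 1, FALSE becomes 0
class(mixed_num)                     # "numeric"
```
]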
???

- For numeric vectors:

- What if we add characters/text? COERCE  
  - What if we add LOGICAL data? (COERCE: 1 = TRUE, 0 = FALSE)

---

# Data Structures

.left-column[ 
### Vectors
]
.right-column[
__About vectors:__
- Linear collection of data 
- Must be of the _same_ data type
- _Operations in R are vectorised_
]
--
.right-column[
**Subtract 5 from the vector `age`**

```r
age-5
```

```
## [1] 15 20 25 30 35
```

**Add together two vectors**

```r
age+age
```

```
## [1] 40 50 60 70 80
```
]
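.right-column[
Vectorised operations also work on vectors of different lengths, in which case the shorter one is reused ("recycled"). A short sketch with made-up vectors:

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# The shorter vector is recycled: 1+10, 2+20, 3+10, 4+20
long_vec + short_vec

# Lengths 4 and 3 do not divide evenly: R still returns a result,
# but adds a warning about the object lengths
long_vec + c(10, 20, 30)
```
]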

???

- If 2 vectors differ in length, R recycles the shorter vector!

- R returns a warning only if the longer length is not a multiple of the shorter length

---

# Data Structures

.left-column[ 
### Vectors
### Lists
]
.right-column[
__About lists:__ 
- Linear collection of data 
- Can contain different _types_ and _structures_ of data
]

???

- Lists are very flexible!

--
.right-column[
**Create a list with a mix of data types and variables**

```r
myteam <- list(name, age, "Group 1", 2019)
```
]

---

# Data Structures

.left-column[ 
### Vectors
### Lists
]
.right-column[
__About lists:__ 
- Linear collection of data 
- Can contain different _types_ and _structures_ of data
]
.right-column[

```r
myteam
```

```
## [[1]]
## [1] "Me"    "Tom"   "Dick"  "Harry" "Susan"
## 
## [[2]]
## [1] 20 25 30 35 40
## 
## [[3]]
## [1] "Group 1"
## 
## [[4]]
## [1] 2019
```
]
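.right-column[
The `[[1]]` and `[1]` labels in printed lists hint at how elements are extracted. A small sketch using a made-up list (`demo_list` is a hypothetical name):

```r
team_names <- c("Me", "Tom", "Dick")
team_ages  <- c(20, 25, 30)
demo_list  <- list(team_names, team_ages, "Group 1")

demo_list[[1]]         # the whole first element (a character vector)
demo_list[[1]][2]      # the 2nd value inside the first element
demo_list[1]           # single brackets keep the list wrapper

class(demo_list[[1]])  # "character"
class(demo_list[1])    # "list"
```
]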
???
- Elements in a list are indexed with [[]]
- Sub-indices are in [] (nested like tree branches)

---

# Data Structures

.left-column[ 
### Vectors
### Lists
### Factors
]
.right-column[
__About factors:__ 
- A special kind of vector that represents categorical data with discrete levels
]
???
- E.g., Eye color, M/F (binary), SA to SD (ordinal)

--
.right-column[
**Let's code the sex of each person in the variable `name`**  
(use the functions `factor()` and `c()`)

```r
sex <- factor(c("M","M","M","M","F")) 
sex
```

```
## [1] M M M M F
## Levels: F M
```
]

---

# Data Structures

.right-column[

```r
perform <- factor(c("High", "Low", "Med", "Med", "High"))
perform
```

```
## [1] High Low  Med  Med  High
## Levels: High Low Med
```
What is wrong with this output?
]
???
- Order: Alphabetical unless specified

---

# Data Structures

.right-column[
__About factors:__  
- A special kind of vector that represents categorical data with discrete levels
]
.right-column[
**Define the order using the `levels=` argument in `factor()`**

```r
perform <- factor(c("High", "Low", "Med", "Med", "High"), 
 levels = c("Low", "Med", "High"))
perform
```

```
## [1] High Low  Med  Med  High
## Levels: Low Med High
```
]
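.right-column[
To see what the level order changes, it can help to look at the codes R stores behind a factor. A minimal sketch (`perform_demo` is a hypothetical copy of the vector above):

```r
perform_demo <- factor(c("High", "Low", "Med", "Med", "High"),
                       levels = c("Low", "Med", "High"))

levels(perform_demo)      # the order used in tables and plots
as.integer(perform_demo)  # underlying codes: Low = 1, Med = 2, High = 3
table(perform_demo)       # counts are reported in level order
```
]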
???
- Add arguments to the function (`?functionname`)

---

# Data Structures

.left-column[ 
### Vectors
### Lists
### Factors
### Matrices
]
.right-column[
__About matrices:__ 
- Tabular data (rows & columns)
- Must be of the _same_ data type
]
???
- 2D/rectangular data
- E.g., Image processing (each pixel has a value), spatial data..
--
.right-column[
**Create a 4 by 3 matrix of sequential numbers** 
Use `matrix()` and the `:` operator to create a sequence

```r
m <- matrix(1:12, nrow = 4)
m
```

```
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
```
]
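.right-column[
By default `matrix()` fills values column by column. The sketch below compares this with the `byrow = TRUE` argument (`m_demo` and `m_byrow` are hypothetical names):

```r
m_demo  <- matrix(1:12, nrow = 4)                # filled column by column
m_byrow <- matrix(1:12, nrow = 4, byrow = TRUE)  # filled row by row

dim(m_demo)     # number of rows, then number of columns
m_demo[2, 3]    # row 2, column 3 of the column-wise matrix
m_byrow[2, 3]   # same position in the row-wise matrix
```
]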

---

# Data Structures

.left-column[ 
### Vectors
### Lists
### Factors
### Matrices
### Dataframes
]
.right-column[
__About dataframes:__ 
- Tabular data (rows & columns)
- Rows represent data entries, columns represent different variables
]
--
.right-column[
**Import the dataset `ozone_data.csv` into your R Notebook using `read.csv()`**

```r
ozone <- read.csv("data/ozone_data.csv") # column headers in first row
```
]

---

# Data Structures

.right-column[

```r
head(ozone) #print first few rows
```

```
##   rad temp wind ozone
## 1 190   67  7.4    41
## 2 118   72  8.0    36
## 3 149   74 12.6    12
## 4 313   62 11.5    18
## 5 299   65  8.6    23
## 6  99   59 13.8    19
```
]
--
.right-column[
**Check the dimensions of `ozone`**

```r
dim(ozone)
```

```
## [1] 111   4
```
]

???

- Tabular data in R: rows, cols

---

# Data Structures

.left-column[ 
### Vectors
### Lists
### Factors
### Matrices
### Dataframes
]
.right-column[
**Check the names of `ozone` using `dimnames()`, `rownames()` and `colnames()`**

```r
dimnames(ozone)
```

```
## [[1]]
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
##  [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
##  [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
##  [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
##  [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
##  [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
##  [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
##  [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
##  [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111"
## 
## [[2]]
## [1] "rad"   "temp"  "wind"  "ozone"
```
]
???
- Returns a list!
- Try using `rownames()` and `colnames()`

---

# Data Structures

.right-column[

```r
ozone$temp
```

```
##   [1] 67 72 74 62 65 59 61 69 66 68 58 64 66 57 68 62 59 73 61 61 67 81 79 76 82
##  [26] 90 87 82 77 72 65 73 76 84 85 81 83 83 88 92 92 89 73 81 80 81 82 84 87 85
##  [51] 74 86 85 82 86 88 86 83 81 81 81 82 89 90 90 86 82 80 77 79 76 78 78 77 72
##  [76] 79 81 86 97 94 96 94 91 92 93 93 87 84 80 78 75 73 81 76 77 71 71 78 67 76
## [101] 68 82 64 71 81 69 63 70 75 76 68
```
]

---

# Data Structures

.right-column[

```r
team_details <- data.frame(name, age, sex, perform)
team_details
```

```
##    name age sex perform
## 1    Me  20   M    High
## 2   Tom  25   M     Low
## 3  Dick  30   M     Med
## 4 Harry  35   M     Med
## 5 Susan  40   F    High
```
]

---

# Back to operators...

## Subsetting in R

**Extract the 5th element in the vector `name`**

```r
name[5]
```

```
## [1] "Susan"
```
--
**Extract the 4th element of the column `age` in the dataframe `team_details`**  
.small[_Remember: use `$` to extract columns by their name_]

```r
team_details$age[4]
```

```
## [1] 35
```

---

# Back to operators...

## Subsetting in R

**Extract the element in the 2nd row and 4th col in `team_details`**

```r
team_details[2,4]
```
--
**Extract 2nd row and all cols in `team_details`**

```r
team_details[2,]
```
--

**Extract the 4th col and all rows except the 2nd in `team_details`**

```r
team_details[-2,4]
```

---

# Back to operators...

## Subsetting in R  
**Extract rows 1 to 3 in `team_details`**

```r
team_details[1:3,]
```

```
##   name age sex perform
## 1   Me  20   M    High
## 2  Tom  25   M     Low
## 3 Dick  30   M     Med
```
???
Hint: you can use the colon (`:`) to indicate a sequence of numbers

**Extract rows 1 and 3 in `team_details`**

```r
team_details[c(1,3),]
```

```
##   name age sex perform
## 1   Me  20   M    High
## 3 Dick  30   M     Med
```

---

# Back to operators...

## Subsetting in R: Quick test!⚡️  
**Load the built-in dataset `mtcars` using `data(mtcars)`**

**Extract data on cars with a fuel efficiency of at least 20 mpg and more than 108 hp**

```
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
```
???
- This is the answer you want to end up with

- Refer to 'Operators' in notes
- Each row is a type of car
- We want all the variables (leave col blank)

- **Hint:** `mtcars[  & , ]`

- **Hint:** `mtcars[mtcars$mpg >= 20 & , ]`

???

- Answer: `mtcars[mtcars$mpg >= 20 & mtcars$hp > 108, ]`
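
One way to walk through the answer step by step is to build each condition as its own logical vector first. This is only a sketch; the names `efficient`, `powerful` and `picked` are made up.

```r
data(mtcars)                    # built-in dataset

efficient <- mtcars$mpg >= 20   # one TRUE/FALSE per car
powerful  <- mtcars$hp > 108

sum(efficient & powerful)       # number of cars meeting both conditions

picked <- mtcars[efficient & powerful, ]  # keep matching rows, all columns
rownames(picked)
```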

---

# Outline

[About](#about)  
 
[Getting Started](#Getstarted)

[General Syntax](#gensyntax)

[Data Structures](#structures)

__[Functions](#functions)__

[The _tidyverse_](#tidyverse)

[Useful Resources](#resources)

---

# Functions

.left-column[ 
### Overview
]
.right-column[
__About functions:__
- Functions have inputs and outputs
- Look up details about the function with `?<functionname>` 
]

???

- Tools in toolbox (packages): Orange squeezer, juicer, blender
  - input = orange
  - output = orange juice, smoothie
  
--
.right-column[
**E.g., Plot the performance distribution in `team_details`**

```r
plot(team_details$perform)
```

<img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="50%" style="display: block; margin: auto;" />
]
???
- **Plotting functions**

---

# Functions

.right-column[

```r
mean(team_details$age)
```

```
## [1] 30
```
]

???
- **Statistical functions**
- Basic functions: refer to notes
--
.right-column[
**E.g., Find the number of people (rows) in `team_details`**

```r
nrow(team_details)
```

```
## [1] 5
```
]

---

# Functions

.left-column[ 
### Overview
### User-defined
]
.right-column[
__General structure when defining a custom function:__

```r
functionname <- function(inputs){
 # calculations...
 output
 }
```
]
???

- You define the inputs and outputs
- All calculations are within squiggly brackets
--
.right-column[
**Subsequent calls to the function:**  
`functionname(inputs)`
]
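.right-column[
A concrete instance of this structure, as a minimal sketch (`to_minutes` is a hypothetical helper, not part of the course materials):

```r
# Hypothetical helper: convert a duration in seconds to minutes
to_minutes <- function(seconds){
  minutes <- seconds / 60    # calculation
  minutes                    # the last expression is returned as the output
}

to_minutes(90)               # call with a single number
to_minutes(c(60, 120, 600))  # works on a whole vector too
```
]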

???

- Function becomes an object in your environment

---

# Functions

.left-column[ 
### Overview
### User-defined
]
.right-column[
__E.g., Load `data/grades.csv` and assign it the name `grades`__

```
##     subject grade grade_point credits
## 1      Math     A         4.5       5
## 2   English     B         3.5       5
## 3 Economics     C         2.0       4
## 4  Mandarin    B+         4.0       5
## 5     Music     F         1.0       0
## 6   History    C+         2.5       5
## 7   Intro2R    A+         5.0       1
```
]

---

# Functions

.left-column[ 
### Overview
### User-defined
]
.right-column[
**Manually calculate the GPA in R using the formula:**

`$$\frac{ \sum_{i=1}^{n} gradepoint_i \times credits_i}{ \sum_{i=1}^{n} credits_i}$$`
]
.right-column[

```r
sum(grades$grade_point * grades$credits) / sum(grades$credits)
```

```
## [1] 3.42
```
]

???
- What is the answer?
- Vectorised multiplication in R

---

# Functions

.left-column[ 
### Overview
### User-defined
]
.right-column[
__Create a function named `scorer` that:__
- Takes a dataframe as input
- Outputs a calculation based on the colnames `grade_point` and `credits`

```r
scorer <- function(x){
 sum(x$grade_point*x$credits) / sum(x$credits)
}
```
]
???
- refer to the named input within your function
--
.right-column[

```r
scorer(grades) #use function
```

```
## [1] 3.42
```
]
???
- If you have many dfs of grades from diff people, you can just use this function to calc the GPA for all of them

---

# Functions

.left-column[ 
### Overview
### User-defined
### Loops
]
.right-column[
__About loop functions:__ 
- Loop functions repeat code a set number of times, with an index such as `i` tracking each repetition
- Most common type: `for` loop
]
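.right-column[
A minimal `for` loop, sketched with made-up values (`squares` is a hypothetical name):

```r
squares <- numeric(3)   # empty container for the results

for(i in 1:3){
  squares[i] <- i^2     # i takes the values 1, 2 and 3 in turn
}

squares
```
]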

---

# Functions

.right-column[

**Prepare our data inputs to the `for` loop:**

Get the grades of other team members within `/data` folder

```r
grades_tom <- read.csv("data/grades_tom.csv")
grades_dick <- read.csv("data/grades_dick.csv")
grades_harry <- read.csv("data/grades_harry.csv")
grades_susan <- read.csv("data/grades_susan.csv")
```

]
--
.right-column[
__Put all these dataframes into a list named `team_grades`__

```r
team_grades <- list(grades, grades_tom, grades_dick, grades_harry, grades_susan)
```
]

---

# Functions

.left-column[ 
### Overview
### User-defined
### Loops
]
.right-column[
**For every item (person) in the list `team_grades`, use the function `scorer()` and append results to new column "GPA" in `team_details`**

```r
for(i in seq_along(team_grades)){ 
  team_details$GPA[i] <- scorer(team_grades[[i]]) 
}
# the named object "i" changes in value with each iteration of the loop
```
]
???
- "For" loop - same general structure as a function
- Value of `i` changes through each iteration
- In this case, there are 5 iterations (code is looped 5 times)
- `read.csv("data/team_details.csv")` if you don't have the object in environment
--
.right-column[
**Who has the best grades in the team?** 
]

--
.right-column[

```
##    name age sex perform      GPA
## 1    Me  20   M    High 3.420000
## 2   Tom  25   M     Low 3.710526
## 3  Dick  30   M     Med 4.342105
## 4 Harry  35   M     Med 5.000000
## 5 Susan  40   F    High 4.342105
```
]

???
Harry

---

# Functions

.right-column[

`lapply(x, FUN)`: Apply a function on each element of `x`, returns a _list_  
`apply(x, MARGIN, FUN)`: Apply a function to tabular data by rows (`1`), cols (`2`), or both `c(1,2)`
]
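.right-column[
A sketch of `lapply()` on a list of data frames. The two grade tables and the function `weighted_gpa()` are made up for illustration; `sapply()` is a base R relative of `lapply()` that is not covered in these slides.

```r
# Two made-up grade tables (illustration only, not the course data)
grades_a <- data.frame(grade_point = c(4, 3), credits = c(5, 5))
grades_b <- data.frame(grade_point = c(5, 2), credits = c(4, 1))
demo_grades <- list(a = grades_a, b = grades_b)

# Credit-weighted average of grade points
weighted_gpa <- function(x){
  sum(x$grade_point * x$credits) / sum(x$credits)
}

gpa_list <- lapply(demo_grades, weighted_gpa)  # returns a list
gpa_vec  <- sapply(demo_grades, weighted_gpa)  # simplifies to a named vector
gpa_list
gpa_vec
```
]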
--
.right-column[
**E.g., Find the mean value for _each_ numeric column in `team_details`**  
]

--
.right-column[

```r
apply(team_details[,c(2,5)], 2, mean) #apply mean() function across columns
```

```
##       age       GPA 
## 30.000000  4.162947
```
]
???
- Average age & GPA of people in my team
- Other examples of loops in the notes

- range() of factors are alphabetical (don't mean anything)

---

# Functions

.right-column[
__Quick test!⚡️__  
  
**Calculate the mean for each numeric variable in `data(mtcars)`**

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```
]
???
- 1st few rows
--
.right-column[
**Answer:**

```r
apply(mtcars, 2, mean)
```

```
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
...
```
]
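.right-column[
There is usually more than one route to the same answer. As a sketch, the base R functions `sapply()` and `colMeans()` should give the same column means here (the object names are made up):

```r
data(mtcars)                            # built-in dataset

means_apply  <- apply(mtcars, 2, mean)  # loop over columns (MARGIN = 2)
means_sapply <- sapply(mtcars, mean)    # a data frame is a list of columns
means_base   <- colMeans(mtcars)        # dedicated shortcut for column means

all.equal(means_apply, means_base)
all.equal(means_sapply, means_base)
```
]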

---

# Outline

[About](#about)  
 
[Getting Started](#Getstarted)

[General Syntax](#gensyntax)

[Data Structures](#structures)

[Functions](#functions)

__[The *tidyverse*](#tidyverse)__

[Useful Resources](#resources)

---

# The _tidyverse_

The [_tidyverse_](https://www.tidyverse.org/packages/) is a collection of packages commonly used for data science. The packages share the same design philosophy, grammar and data structures.

???

- Packages related to the tidyverse are designed to follow a specific workflow
  - E.g., how code is written... how data is structured... provides an intuitive way to work
- Designed to make the workflow of analysis consistent & reproducible

---

# The _tidyverse_

**Install the _tidyverse_ collection of packages**

```r
install.packages("tidyverse", dependencies = TRUE) # don't forget the quotes
```

.small[.footnote[
**Note:**
- If you are asked to restart R, select the Option 'Yes' once
- Enter `n` if you get the following prompt: `Do you want to attempt to install these from sources?`  
]]

???
- Dependencies: Some packages depend on other packages
- If stuck in restart loop, click no
- If lib not writable, use personal library (admin permissions?)

**Load these packages into R**

```r
library(tidyverse) # quotes are optional here
```

---

# The _tidyverse_

### _Tidy_ syntax

**The pipe operator `%>%`**  
- From the [magrittr](https://magrittr.tidyverse.org) package
- Frequently used to manipulate data in stages/sequence 
- Shortcut: _Ctrl (Cmd) + Shift + M_

.small[.footnote[
**Note:** R 4.1.0 introduced the native pipe operator `|>`. It can be used without installing/loading any packages. However, note that it will not work in earlier versions of R. Differences between the two pipe operators are explained [here](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) 
]]

???
- We've loaded the data, examined it briefly
- Before we continue, let me introduce you to the pipe operator...
- Used very often in **converting data**

__For example:__

```r
round(exp(diff(log(x))), 1) # using nested brackets

x %>% # using the pipe operator
  log() %>%
  diff() %>%
  exp() %>%
  round(1)
```
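
To run a comparison like this, `x` needs actual values. The sketch below uses a made-up vector and the native pipe `|>`, so it assumes R 4.1.0 or later and needs no packages.

```r
x <- c(1, 2, 4, 8)                     # made-up values for illustration

nested <- round(exp(diff(log(x))), 1)  # nested brackets, read inside-out

piped <- x |>                          # native pipe, read top to bottom
  log() |>
  diff() |>
  exp() |>
  round(1)

identical(nested, piped)               # both routes give the same result
```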
???

- Very readable, follows a logical sequence

---

# The _tidyverse_

### _Tidy_ data

- Tabular data (2D)  
- Each variable is a column & each observation is a row  
- Can be in long or wide format
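
A small base R sketch of the same information in wide and in long format. The scores are made up; in the tidyverse, `tidyr::pivot_longer()` and `tidyr::pivot_wider()` convert between the two.

```r
# Made-up scores for two people (illustration only)
wide <- data.frame(name    = c("Tom", "Susan"),
                   math    = c(4.5, 5.0),
                   english = c(3.5, 4.0))

# Long format: one row per person-subject combination
long <- data.frame(name        = rep(wide$name, times = 2),
                   subject     = rep(c("math", "english"), each = nrow(wide)),
                   grade_point = c(wide$math, wide$english))

dim(wide)   # fewer rows, one column per subject
dim(long)   # more rows, one column holding all the scores
```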

???
- Which is wide/long?

---

# The _tidyverse_

.right-column[

__Load example survey data as _tibbles_<sup>1</sup> using `readr::read_csv()`__ (Source: [Kaggle](https://www.kaggle.com/kaggle/kaggle-survey-2018))

```r
survey <- read_csv("data/kaggle-survey-2018_mcq.csv", skip = 1)
```

<img src="images/previewdata.png" width="100%" style="display: block; margin: auto;" />
.small[.footnote[[1] [Tibbles](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) are dataframes with stricter rules that avoid hassle/errors often associated with conventional dataframes]]
]

???
- `::` can help you rmbr which package
- Note: 2 rows of headers - account for when we import data 
- `skip = 1` Skip the first line of the .csv file

---

# The _tidyverse_

.right-column[

__Examine the first few rows of `survey`__

```r
head(survey)
```

```
## # A tibble: 6 × 395
##   `Duration (in seconds)` What is your gender? - Select…¹ What is your gender?…²
##                     <dbl> <chr>                                            <dbl>
## 1                     710 Female                                              -1
## 2                     434 Male                                                -1
## 3                     718 Female                                              -1
## 4                     621 Male                                                -1
## 5                     731 Male                                                -1
## 6                    1142 Male                                                -1
## # ℹ abbreviated names: ¹`What is your gender? - Selected Choice`,
## #   ²`What is your gender? - Prefer to self-describe - Text`
## # ℹ 392 more variables: `What is your age (# years)?` <chr>,
...
```

]

???
- notice it says `tibble` with the df dimensions
- data type of each col also shown

---

# The _tidyverse_

.left-column[ 
### Import
]
.right-column[
**Compare `readr::read_csv()` with `read.csv()` from base R**

```r
survey2 <- read.csv("data/kaggle-survey-2018_mcq.csv", skip = 1)
head(survey2)
```

```
##   Duration..in.seconds. What.is.your.gender....Selected.Choice
## 1                   710                                 Female
## 2                   434                                   Male
## 3                   718                                 Female
## 4                   621                                   Male
## 5                   731                                   Male
## 6                  1142                                   Male
...
```
]

???
- Colnames: Whitespace & symbols replaced with `.` (Rmbr what we mentioned about best practices?)
- We'll use `read_csv()` in our analysis

---

# The _tidyverse_

.right-column[

__Column names have unusual characters__

```r
head(colnames(survey))
```

```
## [1] "Duration (in seconds)"                                                                                          
## [2] "What is your gender? - Selected Choice"                                                                         
## [3] "What is your gender? - Prefer to self-describe - Text"                                                          
## [4] "What is your age (# years)?"                                                                                    
## [5] "In which country do you currently reside?"                                                                      
## [6] "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?"
```
]

???
- White spaces; long colnames
- Analysing colnames as entire sentences is not very feasible at scale (we should abbreviate the colnames)

.right-column[
__Abbreviate the colname `Duration (in seconds)` to `duration` using `dplyr::rename()`__

```r
survey <- survey %>%
 rename(duration = `Duration (in seconds)`)
```
]

???
- A little housekeeping...  
- Have to wrap colname with backticks (because of white spaces)

---

# The _tidyverse_

.left-column[ 
### Import
### Tidy
]
.right-column[
__Change the units from seconds to minutes using `dplyr::mutate()`__

```r
survey <- survey %>%
 mutate(duration = duration/60) # overwrite the column with values in minutes
```
]
.right-column[

```r
head(survey$duration)
```

```
## [1] 11.833333  7.233333 11.966667 10.350000 12.183333 19.033333
```
]

---

# The _tidyverse_

.left-column[ 
### Import
### Tidy
### Wrangle
]
.right-column[
__Subset rows using `dplyr::filter()` as an alternative to subset operators `[` and `]` in base R__
]

.right-column[
__E.g., Subset (filter) the data to respondents who took < 30 minutes to complete the survey__

```r
survey %>% 
 filter(duration < 30)
```

```
## # A tibble: 17,757 × 395
##    duration What is your gender?…¹ What is your gender?…² What is your age (# …³
##       <dbl> <chr>                                   <dbl> <chr>
##  1    11.8  Female                                     -1 45-49
##  2     7.23 Male                                       -1 30-34
##  3    12.0  Female                                     -1 30-34
##  4    10.4  Male                                       -1 35-39
##  5    12.2  Male                                       -1 22-24
...
```
]
???
- What is the base R operation? `survey[survey$duration < 30, ]`

---

# The _tidyverse_

.right-column[

__Use `group_by()`, `summarise()` and `arrange()` from the [dplyr](https://dplyr.tidyverse.org) package to summarise (aggregate) data__

```r
ctry_breakdown <- survey %>%
 rename(country = `In which country do you currently reside?`) %>% # simplify colname
 group_by(country) %>%
 summarise(count = n()) %>% # create new col that counts each group size using n()
 arrange(-count) # arrange by the colname 'count' in descending order 
ctry_breakdown
```

```
## # A tibble: 58 × 2
##    country                  count
##    <chr>                    <int>
##  1 United States of America  4716
##  2 India                     4417
##  3 China                     1644
##  4 Other                     1036
##  5 Russia                     879
##  6 Brazil                     736
##  7 Germany                    734
...
```
]

---

# The _tidyverse_

.left-column[ 
### Import
### Tidy
### Wrangle
### Plot
]
.right-column[
__Plot a histogram using the `ggplot2::ggplot()` function__

Three basic steps:  
1. Provide _data_  
2. Assign your data _variables_ to _aesthetics_  
3. Assign the graphical _primitives_

```r
survey %>% # data
  ggplot(aes(duration)) + # map variable to aesthetic
  geom_histogram() # graphical primitive
```
]

???
- A very popular package used for data visualisation
- Show in console!
- In console: `?ggplot`; try editing the `bins=` argument 
- Everything within ggplot() function also follows pipeline approach (+)
  - Everything tt follows the `+` sign is part of the `ggplot` func

---

# The _tidyverse_

.right-column[

```r
survey %>% 
  ggplot(aes(duration)) + 
  geom_histogram()
```

<img src="slides_files/figure-html/unnamed-chunk-65-1.png" width="70%" style="display: block; margin: auto;" />
]
???
- warning about the binwidth - very skewed data

---

# The _tidyverse_

.right-column[

```r
survey %>% 
  ggplot(aes(duration)) + 
  geom_histogram() +
  geom_vline(xintercept = median(survey$duration)) # add median value
```

<img src="slides_files/figure-html/unnamed-chunk-66-1.png" width="70%" style="display: block; margin: auto;" />
]

???
- We add another graphical primitive (vertical line)
- Super long tail (respondent left browser open?)

---

# The _tidyverse_

.right-column[

```r
survey %>% 
  ggplot(aes(duration)) + 
  geom_histogram() +
  geom_vline(xintercept = median(survey$duration)) +
  scale_x_log10() # address extreme x-values
```

<img src="slides_files/figure-html/unnamed-chunk-67-1.png" width="70%" style="display: block; margin: auto;" />
]

???
- Transform the x-axis logarithmically - squeeze in that tail

---

# The _tidyverse_

.right-column[

```r
survey %>% 
  ggplot(aes(duration)) + 
  geom_histogram(bins = 50) +
  geom_vline(xintercept = median(survey$duration), linetype = 2) + 
  
  scale_x_log10(breaks = c(2, 5, 10, 20, 60, 1440)) + 
  
  labs(x = "Duration (mins)", y = "Number of respondents") + #change axis labels
  ggtitle("Most respondents took 15-20 min to complete survey") #add figure title
```

<img src="slides_files/figure-html/unnamed-chunk-68-1.png" width="70%" style="display: block; margin: auto;" />
]
???
- You can customise the plot to design it to your liking
- This is an eg. of how you can analyse 1 variable in your dataframe

---

# It's your turn!

__Explore and visualise `data(diamonds, package = "ggplot2")`__

---

# Quick exercise ⚡️

__Filter for diamonds with a Premium cut that cost less than $3000__

_Expected output:_

```
## # A tibble: 6,757 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
##  2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
##  3  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
##  4  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
##  5  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
##  6  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
##  7  0.29 Premium F     SI1      62.4    58   403  4.24  4.26  2.65
##  8  0.22 Premium E     VS2      61.6    58   404  3.93  3.89  2.41
##  9  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
## 10  0.3  Premium J     SI2      59.3    61   405  4.43  4.38  2.61
## # ℹ 6,747 more rows
```

---

# Quick exercise ⚡️

__Plot a histogram of price for all diamonds__

_Expected output:_
<img src="slides_files/figure-html/unnamed-chunk-71-1.svg" width="70%" style="display: block; margin: auto;" />

---

# Quick exercise ⚡️

__Plot a scatter diagram of the price, carat and cut for all diamonds__

_Expected output:_

<img src="images/diamond_scatter.jpeg" width="50%" style="display: block; margin: auto;" />
???
- 3 variables

---

# Questions?

[About](#about)  
 
[Getting Started](#Getstarted)

[General Syntax](#gensyntax)

[Data Structures](#structures)

[Functions](#functions)

[The _tidyverse_](#tidyverse)

[Useful Resources](#resources)

---

# Useful Resources
__Online tutorials__
- [R for Data Science](https://r4ds.hadley.nz)
- [Quick R](https://www.statmethods.net)
- [Learn the tidyverse](https://www.tidyverse.org/learn/)
- [R markdown cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/)

__Online Q&A__  
- [Stack Overflow](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)
- [How to ask a good question online](https://stackoverflow.com/help/how-to-ask)
- Remember to check your `sessionInfo()` when troubleshooting!

__Others__  
- [Use R/RStudio from an external drive](https://github.com/ClaudiaBrauer/A-very-short-introduction-to-R/blob/master/documents/Portable_versions_of_R_and_RStudio.pdf) (if you don't have admin rights to install software)