class: center, middle, inverse, title-slide, title-slide .title[ # Intro2R ] .subtitle[ ## Introduction to R Programming ] .author[ ###
Xiao Ping (XP) Song
] .institute[ ###
xp.song@u.nus.edu
Course materials:
https://github.com/xp-song/Intro2R
] .date[ ### updated 2023-07-30 ] --- layout: true <div class="my-footer"><span>© XP Song</span></div> --- class: inverse, center, middle <style type="text/css"> pre { background: #00000000; max-width: 100%; overflow-x: auto; } </style> # Before we begin... 1. Navigate to course webpage and read background information *https://github.com/xp-song/Intro2R* 2. Ensure that you have installed __R__ on your computer, followed by __R Studio__ *(follow links under '_Instructions_' section of webpage)* 3. [Download](https://github.com/xp-song/Intro2R/archive/master.zip) workshop materials *(green button on webpage)* --- class: inverse, center, middle name: about # Outline __[About](#about)__ [Getting Started](#Getstarted) [General Syntax](#gensyntax) [Data Structures](#structures) [Functions](#functions) [The _tidyverse_](#tidyverse) [Useful Resources](#resources) --- background-image: url(https://www.r-project.org/logo/Rlogo.svg) background-size: 100px background-position: 95% 15% class: left # What is R? - Programming language and software environment with a command line interface - RStudio is often used as a software client ??? - Cmd line interface is text-based - Excel: interact with the software using a GUI - Rstudio: E.g., Microsoft Word vs. text editor - In this course, we will work entirely within RStudio -- - Both R and RStudio are open source software - Huge library of packages created by the R community ??? - Open source: - Anyone can contribute, easy to look for help online - *Packages*: Like apps/toolboxes - E.g., these course materials --- class: left # About this crash course ## What it IS - Designed for those with minimal coding experience - Give you a taste of what R can do ??? - ... 
hopefully give you some ideas on how you can use R for your own projects - Slide deck 1: Look at the basics (language, structure/syntax) - Slide deck 2: Dive right into analysing data -- ## What it is NOT - A substitute for practicing the fundamentals of the language - A lesson in statistics ??? - I have provided links to more resources in your notes.. check them out if you're interested --- class: inverse, center, middle name: Getstarted # Outline [About](#about) __[Getting Started](#Getstarted)__ [General Syntax](#gensyntax) [Data Structures](#structures) [Functions](#functions) [The _tidyverse_](#tidyverse) [Useful Resources](#resources) --- class: inverse, center # Course materials <br> On your computer, navigate to the downloaded folder <br> _/notes_ <sup>1</sup> _/data_ _PDF slide deck_ _Intro2R.Rproj_ <br> .small[.center[[1] View in your web browser by opening the '_.html_' files]] ??? - Notes: Can refer - Output: when we generate figures/graphs - Open the RStudio Project file. This boots up RStudio (demo) --- class: left # R Studio Client <img src="images/rstudioclient.png" width="600" style="display: block; margin: auto;" /> - __Console:__ Command line input/output - __Script editor:__ View/edit files that contain code - __Environment/History__ - __Files/Plots/Packages/Help/Viewer__ ??? - Console: real-time interaction (inputs AND OUTPUTS) - All you would see if you ran R without RStudio! - TRY doing some calculations - Create new .R script (File > New File > R script) > Script editor - Write some calculations > Cmd+Enter - Code sent to the console! - Comments # are skipped - Scripts: edit/save/share our code with others! --- class: left # R Notebooks ## What are R Notebooks? - R Notebooks (a.k.a. [R Markdown Notebooks](http://rmarkdown.rstudio.com)) are files ending with '_.Rmd_'. - Compared to basic '_.R_' scripts, they allow us to: - Write normal text alongside code - Interact with code within a single document - Generate (i.e. 
'_knit_') different types of files ??? - R Notebooks = R Markdown files - E.g., All materials in this course were created from _.Rmd_ files. - `install.packages("rmarkdown")` (already comes installed with RStudio) -- <br> **Try creating a new R Notebook `File > New File > R Notebook`** --- class: left # R Notebooks <img src="images/notebook.png" width="700" style="display: block; margin: auto;" /> - __Header section:__ specify document [parameters](https://bookdown.org/yihui/rmarkdown/html-document.html) - __Normal text__ - __Code chunk:__ write code and specify code [parameters](https://bookdown.org/yihui/rmarkdown/r-code.html) ??? - Follow instructions - Code chunk: Results appear beneath the chunk - plot(): our very first function (tool in toolbox) - Name of function always followed by brackets - Look up a function with `?` -- **Save our new file as _'myanalysis.Rmd'_** ??? - *myanalysis.Rmd*: use for the rest of the lecture --- class: left # RStudio Projects **Try creating a new RStudio Project `File > New Project > New Directory > New Project`** <img src="images/new project.png" width="300" style="display: block; margin: auto;" /> --- class: left # RStudio Projects ## What are RStudio Projects? - RStudio Projects help organise your work into separate 'R sessions'. - Each project has its own workspace a.k.a. 'working directory' (separate configuration, history, etc.) ??? - Workspace = Table in my office - How do I know which is my table? -- - The location of the '_.RProj_' file defines the 'working directory' - **Type `getwd()` in the _console_ of our new project** - This returns the absolute path to our working directory e.g., `/Users/<computer_username>/Desktop/test` ??? 
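A minimal sketch of how `getwd()` and a relative path fit together. The path `data/ozone_data.csv` is the one used in these course materials; whether it exists depends on which project is open, so the sketch checks first instead of assuming.

```r
# Where is R currently looking for files?
wd <- getwd()
print(wd)

# A relative path is resolved against the working directory
rel_path <- file.path("data", "ozone_data.csv")
abs_path <- file.path(wd, rel_path)
print(abs_path)

# Check before reading, so the script fails gently on another computer
if (file.exists(rel_path)) {
  ozone <- read.csv(rel_path)
} else {
  message("File not found; is the Intro2R project open?")
}
```

The script only ever mentions `rel_path`, so it keeps working when the project folder is moved or shared.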
- E.g., tissue paper - Likewise, .RProj file tells R that this is my workspace (Address: see window header) - Also type `getwd()` in the *Intro2R* project - *Delete test project* --- class: left # RStudio Projects ## 🌟 Best Practice - Use _relative_ paths in your script, based on _.RProj_ file location - **Try reading in data in your R Notebook** `read.csv("<path to Intro2R folder>/Intro2R/data/ozone_data.csv")` `read.csv("data/ozone_data.csv")` - Keep all project items in the working directory ??? - By using RStudio Projects... don't need to write out initial path - Relative path: Scripts work across different computers - **IMPT: getwd() in console may be diff from in code chunk in notebook** - If .Rmd is in subfolders! `../` --- class: inverse, center, middle name: gensyntax # Outline [About](#about) [Getting Started](#Getstarted) __[General Syntax](#gensyntax)__ [Data Structures](#structures) [Functions](#functions) [The _tidyverse_](#tidyverse) [Useful Resources](#resources) --- class: left # General Syntax .left-column[ ### Operators ] .right-column[ __Arithmetic operators:__ **E.g., Solve the following equation** $$ \frac{ (1+2) * (4-5)}{50} $$ ] ??? - Refer to your notes for the list of operators - We will only go through some operators now -- .right-column[ ```r (1+2)*(4-5)/50 ``` ``` ## [1] -0.06 ``` ] --- class: left # General Syntax .left-column[ ### Operators ] .right-column[ __Logical operators:__ **E.g., Check if `1e3` is larger or equal to `1*10^3`** ] -- .right-column[ ```r 1e3 >= 1*10^3 ``` ``` ## [1] TRUE ``` ] ??? - `1e3` = Scientific/exponential notation - 1000 can be written as 1e3 --- class: left # General Syntax .left-column[ ### <span style="color:grey">Operators</span> ### Variables ] .right-column[ __Variables are named objects used to store data__ - `<-` is used to assign variable names in R (E.g., `x <- 4`) - Print variables by name (`x` vs. `"x"`) - Assigning data to an existing variable overwrites it (`x <- 10`) ] ??? 
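A short illustration of the three points on this slide: assigning, printing by name, and overwriting. The name `x` follows the slide; nothing else is assumed.

```r
x <- 4     # assign the value 4 to the name x
x          # typing the name prints the stored value
"x"        # with quotes this is just text, not the variable

x <- 10    # assigning again overwrites the old value
x + 1      # uses the new value of x

# Names are case sensitive: X would be a different name from x
```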
- Arrow shortcut `Option/Alt`+`-` - `=` produces a similar result.. but reserve it for other uses - Let's assign a number to `hello` - New variable in environment - "hello" vs. hello (text in quotation) - Case sensitive! -- .right-column[ **🌟Best Practice** - Clear and consistent names - Avoid numbers/symbols/whitespace ] --- class: left # General Syntax .left-column[ ### <span style="color:grey">Operators</span> ### Variables ] .right-column[ __Data types and examples:__ - Numeric (`3.142`), Integer (`5L`) - Character (`"hello"`) - Logical (`TRUE, FALSE`) - Complex ] -- .right-column[ **Let's assign new variables `name`, `age`, and `weight`** ] ??? - Your own names.. -- .right-column[ **Check the data type for each variable using the function `is.numeric()`, `is.integer()`, `is.character()`** ] ??? - Age remember to put L (integer) - both numeric and integer --- class: inverse, center, middle name: structures # Outline [About](#about) [Getting Started](#Getstarted) [General Syntax](#gensyntax) __[Data Structures](#structures)__ [Functions](#functions) [The _tidyverse_](#tidyverse) [Useful Resources](#resources) --- class: left # Data Structures .left-column[ ### Vectors ] .right-column[ __About vectors:__ - Linear collection of data - Must be of the _same_ data type ] -- .right-column[ **Assign a _vector_ of names to the variable `name`** (use the concatenate function `c()`) ```r name <- c("Me", "Tom", "Dick", "Harry", "Susan") # character vector ``` ] ??? - *COERCION*: What if we add a number? Treated as "character" (Environment: "chr") -- .right-column[ **Assign a _vector_ of numbers to the variable `age`** ```r age <- c(20, 25, 30, 35, 40) # numeric vector ``` ] ??? - For numeric vectors: - What if we add characters/text? COERCE - What if we add LOGICAL data? 
(COERCE: 1 = TRUE, 0 = FALSE) --- class: left # Data Structures .left-column[ ### Vectors ] .right-column[ __About vectors:__ - Linear collection of data - Must be of the _same_ data type - _Operations in R are vectorised_ ] -- .right-column[ **Subtract 5 from the vector `age`** ```r age-5 ``` ``` ## [1] 15 20 25 30 35 ``` **Add together two vectors** ```r age+age ``` ``` ## [1] 40 50 60 70 80 ``` ] ??? - If 2 vectors differ in length, R recycles the shorter vector! - R returns a warning only if the longer length is not a multiple of the shorter --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### Lists ] .right-column[ __About lists:__ - Linear collection of data - Can contain different _types_ and _structures_ of data ] ??? - Lists are very flexible! -- .right-column[ **Create a list with a mix of data types and variables** ```r myteam <- list(name, age, "Group 1", 2019) ``` ] --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### Lists ] .right-column[ __About lists:__ - Linear collection of data - Can contain different _types_ and _structures_ of data ] .right-column[ ```r myteam ``` ``` ## [[1]] ## [1] "Me" "Tom" "Dick" "Harry" "Susan" ## ## [[2]] ## [1] 20 25 30 35 40 ## ## [[3]] ## [1] "Group 1" ## ## [[4]] ## [1] 2019 ``` ] ??? - Elements in a list are indexed with [[]] - Sub-indices are in [] (nested like tree branches) --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### Factors ] .right-column[ __About factors:__ - A special kind of vector that represents categorical data with discrete levels ] ??? 
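A small sketch of what 'categorical data with discrete levels' looks like in practice. The vector `eye` is made up for illustration.

```r
# Hypothetical categorical data
eye <- factor(c("brown", "blue", "brown", "green"))

levels(eye)      # the distinct categories, sorted alphabetically by default
as.integer(eye)  # each value is stored as the position of its level
table(eye)       # number of observations per level
```

Seeing the integer codes helps explain why the order of the levels matters.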
- E.g., Eye color, M/F (binary), SA to SD (ordinal) -- .right-column[ **Let's code the sex of each person in the variable `name`** (use the functions `factor()` and `c()`) ```r sex <- factor(c("M","M","M","M","F")) sex ``` ``` ## [1] M M M M F ## Levels: F M ``` ] --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### Factors ] .right-column[ __About factors:__ - A special kind of vector that represents categorical data with discrete levels ] .right-column[ **Let's code the performance of each person in `name`** ```r perform <- factor(c("High", "Low", "Med", "Med", "High")) perform ``` ``` ## [1] High Low Med Med High ## Levels: High Low Med ``` What is wrong with this output? ] ??? - Order: Alphabetical unless specified --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### Factors ] .right-column[ __About factors:__ - A special kind of vector that represents categorical data with discrete levels ] .right-column[ **Define the order using the `levels=` argument in `factor()`** ```r perform <- factor(c("High", "Low", "Med", "Med", "High"), levels = c("Low", "Med", "High")) perform ``` ``` ## [1] High Low Med Med High ## Levels: Low Med High ``` ] ??? - Add arguments to the function (`?functionname`) --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### <span style="color:grey">Factors</span> ### Matrices ] .right-column[ __About matrices:__ - Tabular data (rows & columns) - Must be of the _same_ data type ] ??? - 2D/rectangular data - E.g., Image processing (each pixel has a value), spatial data.. 
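A quick sketch of the 'same data type' rule for matrices. The object names `m_mixed` and `m_num` are made up for illustration.

```r
# Mixing numbers and text: every element is coerced to character
m_mixed <- matrix(c(1, "a", 3, 4), nrow = 2)
m_mixed
typeof(m_mixed)   # "character"

# All numbers: the matrix stays numeric
m_num <- matrix(c(1, 2, 3, 4), nrow = 2)
typeof(m_num)     # "double"
```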
-- .right-column[ **Create a 4 by 3 matrix of sequential numbers** Use `matrix()` and the `:` operator to create a sequence ```r m <- matrix(1:12, nrow = 4) m ``` ``` ## [,1] [,2] [,3] ## [1,] 1 5 9 ## [2,] 2 6 10 ## [3,] 3 7 11 ## [4,] 4 8 12 ``` ] --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### <span style="color:grey">Factors</span> ### <span style="color:grey">Matrices</span> ### Dataframes ] .right-column[ __About dataframes:__ - Tabular data (rows & columns) - Rows represent data entries, columns represent different variables ] -- .right-column[ **Import the dataset `ozone_data.csv` into your R Notebook using `read.csv()`** ```r ozone <- read.csv("data/ozone_data.csv") # column headers in first row ``` ] --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### <span style="color:grey">Factors</span> ### <span style="color:grey">Matrices</span> ### Dataframes ] .right-column[ **View the first few rows of `ozone`** ```r head(ozone) #print first few rows ``` ``` ## rad temp wind ozone ## 1 190 67 7.4 41 ## 2 118 72 8.0 36 ## 3 149 74 12.6 12 ## 4 313 62 11.5 18 ## 5 299 65 8.6 23 ## 6 99 59 13.8 19 ``` ] -- .right-column[ **Check the dimensions of `ozone`** ```r dim(ozone) ``` ``` ## [1] 111 4 ``` ] ??? 
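A sketch of how `dim()` relates to `nrow()` and `ncol()`, using a small made-up dataframe `df` so that it runs without the course data. The values are borrowed from the first three rows of `ozone` shown above.

```r
# A small stand-in dataframe (not the full course data)
df <- data.frame(temp = c(67, 72, 74), wind = c(7.4, 8.0, 12.6))

dim(df)    # rows first, then columns
nrow(df)   # number of rows only
ncol(df)   # number of columns only
str(df)    # column names and data types at a glance
```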
- Tabular data in R: rows, cols --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### <span style="color:grey">Factors</span> ### <span style="color:grey">Matrices</span> ### Dataframes ] .right-column[ **Check the names of `ozone` using `dimnames()`, `rownames()` and `colnames()`** ```r dimnames(ozone) ``` ``` ## [[1]] ## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" ## [13] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" ## [25] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" ## [37] "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" ## [49] "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60" ## [61] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" ## [73] "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" ## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" ## [97] "97" "98" "99" "100" "101" "102" "103" "104" "105" "106" "107" "108" ## [109] "109" "110" "111" ## ## [[2]] ## [1] "rad" "temp" "wind" "ozone" ``` ] ??? - Returns a list! 
- Try using `rownames()` and `colnames()` --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### <span style="color:grey">Factors</span> ### <span style="color:grey">Matrices</span> ### Dataframes ] .right-column[ **Extract data by colnames using `$`** (output is a vector) ```r ozone$temp ``` ``` ## [1] 67 72 74 62 65 59 61 69 66 68 58 64 66 57 68 62 59 73 61 61 67 81 79 76 82 ## [26] 90 87 82 77 72 65 73 76 84 85 81 83 83 88 92 92 89 73 81 80 81 82 84 87 85 ## [51] 74 86 85 82 86 88 86 83 81 81 81 82 89 90 90 86 82 80 77 79 76 78 78 77 72 ## [76] 79 81 86 97 94 96 94 91 92 93 93 87 84 80 78 75 73 81 76 77 71 71 78 67 76 ## [101] 68 82 64 71 81 69 63 70 75 76 68 ``` ] --- class: left # Data Structures .left-column[ ### <span style="color:grey">Vectors</span> ### <span style="color:grey">Lists</span> ### <span style="color:grey">Factors</span> ### <span style="color:grey">Matrices</span> ### Dataframes ] .right-column[ **Create a dataframe with the vectors `name`,`sex`, `age` and `perform`** ```r team_details <- data.frame(name, age, sex, perform) team_details ``` ``` ## name age sex perform ## 1 Me 20 M High ## 2 Tom 25 M Low ## 3 Dick 30 M Med ## 4 Harry 35 M Med ## 5 Susan 40 F High ``` ] --- class: left # Back to operators... ## Subsetting in R **Extract the 5th element in the vector `name`** ```r name[5] ``` ``` ## [1] "Susan" ``` -- **Extract the 4th element of the column `age` in the dataframe `team_details`** .small[_Remember: use `$` to extract columns by their name_] ```r team_details$age[4] ``` ``` ## [1] 35 ``` --- class: left # Back to operators... 
## Subsetting in R **Extract the element in the 2nd row and 4th col in `team_details`** ```r team_details[2,4] ``` -- **Extract 2nd row and all cols in `team_details`** ```r team_details[2,] ``` -- **Extract the 4th col and all rows except the 2nd in `team_details`** ```r team_details[-2,4] ``` --- class: left # Back to operators... ## Subsetting in R **Extract rows 1 to 3 in `team_details`** ```r team_details[1:3,] ``` ``` ## name age sex perform ## 1 Me 20 M High ## 2 Tom 25 M Low ## 3 Dick 30 M Med ``` ??? Hint: use the colon operator `:` to indicate a sequence of numbers -- **Extract rows 1 and 3 in `team_details`** ```r team_details[c(1,3),] ``` ``` ## name age sex perform ## 1 Me 20 M High ## 3 Dick 30 M Med ``` --- class: left # Back to operators... ## Subsetting in R: Quick test!⚡️ **Load the built-in dataset `data(mtcars)`** -- **Extract data on cars with a fuel efficiency of at least 20 mpg and a power output of more than 108 hp** ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 ## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 ``` ??? - This is the answer you want to end up with - Refer to 'Operators' in notes - Each row is a type of car - We want all the variables (leave col blank) -- - **Hint:** `mtcars[ & , ]` -- - **Hint:** `mtcars[mtcars$mpg >= 20 & , ]` ??? 
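Before attempting the quick test, it may help to see how a logical condition selects elements. This sketch reuses the values of the `age` vector from earlier; the name `older` is made up for illustration.

```r
age <- c(20, 25, 30, 35, 40)   # same values as the `age` vector used earlier

older <- age > 25              # a logical vector, one TRUE/FALSE per element
older

age[older]                     # keeps the positions where the condition is TRUE
age[age > 25 & age < 40]       # `&` requires both conditions to be TRUE
```

The same idea carries over to dataframes, where the logical vector goes in the row position before the comma.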
- Answer: `mtcars[mtcars$mpg >= 20 & mtcars$hp > 108, ]` --- class: inverse, center, middle name: functions # Outline [About](#about) [Getting Started](#Getstarted) [General Syntax](#gensyntax) [Data Structures](#structures) __[Functions](#functions)__ [The _tidyverse_](#tidyverse) [Useful Resources](#resources) --- class: left # Functions .left-column[ ### Overview ] .right-column[ __About functions:__ - Functions have inputs and outputs - Look up details about the function with `?<functionname>` ] ??? - Tools in toolbox (packages): Orange squeezer, juicer, blender - input = orange - output = orange juice, smoothie -- .right-column[ **E.g., Plot the performance distribution in `team_details`** ```r plot(team_details$perform) ``` <img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="50%" style="display: block; margin: auto;" /> ] ??? - **Plotting functions** --- class: left # Functions .left-column[ ### Overview ] .right-column[ **E.g., Find the mean age in `team_details`** ```r mean(team_details$age) ``` ``` ## [1] 30 ``` ] ??? - **Statistical functions** - Basic functions: refer to notes -- .right-column[ **E.g., Find the number of people (rows) in `team_details`** ```r nrow(team_details) ``` ``` ## [1] 5 ``` ] --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### User-defined ] .right-column[ __General structure when defining a custom function:__ ```r functionname <- function(inputs){ # calculations... output } ``` ] ??? - You define the inputs and outputs - All calculations are within squiggly brackets -- .right-column[ **Subsequent calls to the function:** `functionname(inputs)` ] ??? 
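A worked instance of the general structure above. The function `to_celsius()` and its input name `temp_f` are hypothetical, chosen only to illustrate inputs, calculations and output.

```r
# Hypothetical example: convert Fahrenheit to Celsius
to_celsius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9   # calculation
  temp_c                            # the last expression is the output
}

to_celsius(212)          # one input value
to_celsius(c(32, 50))    # also works on a vector, since the arithmetic is vectorised
```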
- Function becomes an object in your environment --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### User-defined ] .right-column[ __E.g., Load `data/grades.csv` and assign it the name `grades`__ ``` ## subject grade grade_point credits ## 1 Math A 4.5 5 ## 2 English B 3.5 5 ## 3 Economics C 2.0 4 ## 4 Mandarin B+ 4.0 5 ## 5 Music F 1.0 0 ## 6 History C+ 2.5 5 ## 7 Intro2R A+ 5.0 1 ``` ] --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### User-defined ] .right-column[ **Manually calculate the GPA in R using the formula:** `$$\frac{ \sum_{i=1}^{n} gradepoint_i \times credits_i}{ \sum_{i=1}^{n} credits_i}$$` ] -- .right-column[ ```r sum(grades$grade_point * grades$credits) / sum(grades$credits) ``` ``` ## [1] 3.42 ``` ] ??? - What is the answer? - Vectorised multiplication in R --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### User-defined ] .right-column[ __Create a function named `scorer` that:__ - Takes a dataframe as input - Outputs a calculation based on the colnames `grade_point` and `credits` ```r scorer <- function(x){ sum(x$grade_point*x$credits) / sum(x$credits) } ``` ] ??? - refer to the named input within your function -- .right-column[ ```r scorer(grades) #use function ``` ``` ## [1] 3.42 ``` ] ??? 
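A self-contained version of the example above, for trying `scorer()` without `data/grades.csv`. The dataframe is rebuilt by hand from the table shown on the earlier slide (the `grade` letter column is left out because the calculation does not use it).

```r
# Rebuild the grades table shown on the earlier slide
grades <- data.frame(
  subject     = c("Math", "English", "Economics", "Mandarin",
                  "Music", "History", "Intro2R"),
  grade_point = c(4.5, 3.5, 2.0, 4.0, 1.0, 2.5, 5.0),
  credits     = c(5, 5, 4, 5, 0, 5, 1)
)

# Credit-weighted mean of the grade points
scorer <- function(x) {
  sum(x$grade_point * x$credits) / sum(x$credits)
}

scorer(grades)   # 3.42
```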
- If you have many dfs of grades from diff people, you can just use this function to calc the GPA for all of them --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### User-defined ### Loops ] .right-column[ __About loop functions:__ - Loop functions repeat code `i` number of times - Most common type: `for` loop ] --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### User-defined ### Loops ] .right-column[ **Prepare our data inputs to the `for` loop:** Get the grades of other team members within `/data` folder ```r grades_tom <- read.csv("data/grades_tom.csv") grades_dick <- read.csv("data/grades_dick.csv") grades_harry <- read.csv("data/grades_harry.csv") grades_susan <- read.csv("data/grades_susan.csv") ``` ] -- .right-column[ __Put all these dataframes into a list named `team_grades`__ ```r team_grades <- list(grades, grades_tom, grades_dick, grades_harry, grades_susan) ``` ] --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### User-defined ### Loops ] .right-column[ **For every item (person) in the list `team_grades`, use the function `scorer()` and append results to new column "GPA" in `team_details`** ```r for(i in 1:length(team_grades)){ team_details$GPA[i] <- scorer(team_grades[[i]]) } #the named object "i" changes in value with iteration of the loop ``` ] ??? - "For" loop - same general structure as a function - Value of `i` changes through each iteration - In this case, there are 5 iterations (code is looped 5 times) - `read.csv("data/team_details.csv")` if you don't have the object in environment -- .right-column[ **Who has the best grades in the team?** ] -- .right-column[ ``` ## name age sex perform GPA ## 1 Me 20 M High 3.420000 ## 2 Tom 25 M Low 3.710526 ## 3 Dick 30 M Med 4.342105 ## 4 Harry 35 M Med 5.000000 ## 5 Susan 40 F High 4.342105 ``` ] ??? 
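A sketch of the same loop pattern that collects results in a plain vector. The two small grade tables are made up to stand in for the CSV files, and `seq_along()` is used here as a common alternative to `1:length()` because it also behaves sensibly when the list is empty.

```r
scorer <- function(x) {
  sum(x$grade_point * x$credits) / sum(x$credits)
}

# Two tiny made-up grade tables standing in for the CSV files
team_grades <- list(
  data.frame(grade_point = c(4, 3), credits = c(5, 5)),
  data.frame(grade_point = c(5, 3), credits = c(4, 4))
)

# Pre-allocate the result, then fill one element per iteration
gpa <- numeric(length(team_grades))
for (i in seq_along(team_grades)) {
  gpa[i] <- scorer(team_grades[[i]])
}
gpa   # 3.5 4.0
```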
Harry --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### <span style="color:grey">User-defined</span> ### Loops ] .right-column[ **Examples of Loop functions in base R** `lapply(x, FUN)`: Apply a function on each element of `x`, returns a _list_ `apply(x, MARGIN, FUN)`: Apply a function to tabular data by rows (`1`), cols (`2`), or both `c(1,2)` ] -- .right-column[ **E.g., Find the mean value for _each_ numeric column in `team_details`** ] -- .right-column[ ```r apply(team_details[,c(2,5)], 2, mean) #apply mean() function across columns ``` ``` ## age GPA ## 30.000000 4.162947 ``` ] ??? - Average age & GPA of people in my team - Other examples of loops in the notes - range() of factors are alphabetical (don't mean anything) --- class: left # Functions .left-column[ ### <span style="color:grey">Overview</span> ### <span style="color:grey">User-defined</span> ### Loops ] .right-column[ __Quick test!⚡️__ **Calculate the mean for each numeric variable in `data(mtcars)`** ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 ``` ] ??? - 1st few rows -- .right-column[ **Answer:** ```r apply(mtcars, 2, mean) ``` ``` ## mpg cyl disp hp drat wt qsec ## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 ... 
``` ] --- class: inverse, center, middle name: tidyverse # Outline [About](#about) [Getting Started](#Getstarted) [General Syntax](#gensyntax) [Data Structures](#structures) [Functions](#functions) __[The *tidyverse*](#tidyverse)__ [Useful Resources](#resources) --- class: left # The _tidyverse_ The [_tidyverse_](https://www.tidyverse.org/packages/) is a collection of packages commonly used for data science; they share the same design philosophy, grammar and data structures <img src="images/tidyverse.png" width="500" style="display: block; margin: auto;" /> .small[.center[*Core packages loaded automatically with* `library(tidyverse)`]] ??? - Packages related to the tidyverse are designed to follow a specific workflow - E.g., how code is written... how data is structured... provides an intuitive way to work - Designed to make the workflow of analysis consistent & reproducible --- class: left # The _tidyverse_ **Install the _tidyverse_ collection of packages** ```r install.packages("tidyverse", dependencies = TRUE) # don't forget the quotes ``` .small[.footnote[ **Note:** - If you are asked to restart R, select the option 'Yes' once - Enter `n` if you get the following prompt: `Do you want to attempt to install these from sources?` ]] ??? - Dependencies: Some packages depend on other packages - If stuck in restart loop, click no - If lib not writable, use personal library (admin permissions?) -- **Load these packages into R** ```r library(tidyverse) # quotes are optional here ``` --- class: left # The _tidyverse_ ### _Tidy_ syntax **The pipe operator `%>%`** - From the [magrittr](https://magrittr.tidyverse.org) package - Frequently used to manipulate data in stages/sequence - Shortcut: _Ctrl (Cmd) + Shift + M_ .small[.footnote[ **Note:** R 4.1.0 introduced the native pipe operator `|>`. It can be used without installing/loading any packages. However, note that it will not work in earlier versions of R. 
Differences between the two pipe operators are explained [here](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) ]] ??? - We've loaded the data, examined it briefly - Before we continue, let me introduce you to the pipe operator... - Used very often in **converting data** -- __For example:__ ```r round(exp(diff(log(x))), 1) # using nested brackets x %>% # using the pipe operator log() %>% diff() %>% exp() %>% round(1) ``` ??? - Very readable, follows a logical sequence --- class: left # The _tidyverse_ ### _Tidy_ data - Tabular data (2D) - Each variable is a column & each observation is a row - Can be in long or wide format <img src="images/tidy.jpg" width="700" style="display: block; margin: auto;" /> ??? - Which is wide/long? --- # The _tidyverse_ .left-column[ ### Import ] .right-column[ __Load example survey data as _tibbles_<sup>1</sup> using `readr::read_csv()`__ <br>(Source: [Kaggle](https://www.kaggle.com/kaggle/kaggle-survey-2018)) ```r survey <- read_csv("data/kaggle-survey-2018_mcq.csv", skip = 1) ``` <img src="images/previewdata.png" width="100%" style="display: block; margin: auto;" /> .small[.footnote[[1] [Tibbles](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) are dataframes with stricter rules that avoid hassle/errors often associated with conventional dataframes]] ] ??? - `::` helps you remember which package a function comes from - Note: 2 rows of headers - account for when we import data - `skip = 1` Skip the first line of the .csv file --- # The _tidyverse_ .left-column[ ### Import ] .right-column[ __Examine the first few rows of `survey`__ ```r head(survey) ``` ``` ## # A tibble: 6 × 395 ## `Duration (in seconds)` What is your gender? - Select…¹ What is your gender?…² ## <dbl> <chr> <dbl> ## 1 710 Female -1 ## 2 434 Male -1 ## 3 718 Female -1 ## 4 621 Male -1 ## 5 731 Male -1 ## 6 1142 Male -1 ## # ℹ abbreviated names: ¹`What is your gender? - Selected Choice`, ## # ²`What is your gender? 
- Prefer to self-describe - Text` ## # ℹ 392 more variables: `What is your age (# years)?` <chr>, ... ``` ] ??? - notice it says `tibble` with the df dimensions - data type of each col also shown --- class: left # The _tidyverse_ .left-column[ ### Import ] .right-column[ **Compare `readr::read_csv()` with `read.csv()` from base R** ```r survey2 <- read.csv("data/kaggle-survey-2018_mcq.csv", skip = 1) head(survey2) ``` ``` ## Duration..in.seconds. What.is.your.gender....Selected.Choice ## 1 710 Female ## 2 434 Male ## 3 718 Female ## 4 621 Male ## 5 731 Male ## 6 1142 Male ... ``` ] ??? - Colnames: Whitespace & symbols replaced with `.` (Remember what we mentioned about best practices?) - We'll use `read_csv()` in our analysis --- class: left # The _tidyverse_ .left-column[ ### <span style="color:grey">Import</span> ### Tidy ] .right-column[ __Column names have unusual characters__ ```r head(colnames(survey)) ``` ``` ## [1] "Duration (in seconds)" ## [2] "What is your gender? - Selected Choice" ## [3] "What is your gender? - Prefer to self-describe - Text" ## [4] "What is your age (# years)?" ## [5] "In which country do you currently reside?" ## [6] "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?" ``` ] ??? - White spaces; long colnames - Analysing colnames as entire sentences is not very feasible at scale (we should abbreviate the colnames) -- .right-column[ __Abbreviate the colname `Duration (in seconds)` to `duration` using `dplyr::rename()`__ ```r survey <- survey %>% rename(duration = `Duration (in seconds)`) ``` ] ??? - A little housekeeping... 
- Have to wrap colname with backticks (because of white spaces) --- class: left # The _tidyverse_ .left-column[ ### <span style="color:grey">Import</span> ### Tidy ] .right-column[ __Change the units from seconds to minutes using `dplyr::mutate()`__ ```r survey <- survey %>% mutate(duration = duration/60) # overwrite the colname ``` ] -- .right-column[ __Print out first few rows of `survey$duration`__ ```r head(survey$duration) ``` ``` ## [1] 11.833333 7.233333 11.966667 10.350000 12.183333 19.033333 ``` ] --- class: left # The _tidyverse_ .left-column[ ### <span style="color:grey">Import</span> ### <span style="color:grey">Tidy</span> ### Wrangle ] .right-column[ __Subset rows using `dplyr::filter()` as an alternative to subset operators `[` and `]` in base R__ ] -- .right-column[ __E.g., Subset (filter) the data to respondents who took < 30 minutes to complete the survey__ ```r survey %>% filter(duration < 30) ``` ``` ## # A tibble: 17,757 × 395 ## duration What is your gender?…¹ What is your gender?…² What is your age (# …³ ## <dbl> <chr> <dbl> <chr> ## 1 11.8 Female -1 45-49 ## 2 7.23 Male -1 30-34 ## 3 12.0 Female -1 30-34 ## 4 10.4 Male -1 35-39 ## 5 12.2 Male -1 22-24 ... ``` ] ??? - What is the base R operation? 
`survey[survey$duration < 30, ]` --- class: left # The _tidyverse_ .left-column[ ### <span style="color:grey">Import</span> ### <span style="color:grey">Tidy</span> ### Wrangle ] .right-column[ __Use `group_by()`, `summarize()` and `arrange()` from the [dplyr](https://dplyr.tidyverse.org) package to summarise (aggregate) data__ ```r ctry_breakdown <- survey %>% rename(country = `In which country do you currently reside?`) %>% # simplify colname group_by(country) %>% summarise(count = n()) %>% # create new col that counts each group size using n() arrange(-count) # arrange by the colname 'count' in descending order ctry_breakdown ``` ``` ## # A tibble: 58 × 2 ## country count ## <chr> <int> ## 1 United States of America 4716 ## 2 India 4417 ## 3 China 1644 ## 4 Other 1036 ## 5 Russia 879 ## 6 Brazil 736 ## 7 Germany 734 ... ``` ] --- class: left # The _tidyverse_ .left-column[ ### <span style="color:grey">Import</span> ### <span style="color:grey">Tidy</span> ### <span style="color:grey">Wrangle</span> ### Plot ] .right-column[ __Plot a histogram using the `ggplot2::ggplot()` function__ Three basic steps: 1. Provide _data_ 2. Assign your data _variables_ to _aesthetics_ 3. Assign the graphical _primitives_ ```r survey %>% # data ggplot(aes(duration)) + # map variable to aesthetic geom_histogram() # graphical primitive ``` ] ??? - A very popular package used for data visualisation - Show in console! 
- In console: `?ggplot`; try editing the `bins=` argument
- Everything within the `ggplot()` function also follows the pipeline approach (`+`)
- Everything that follows the `+` sign is part of the `ggplot` function

---
class: left

# The _tidyverse_

.left-column[
### <span style="color:grey">Import</span>
### <span style="color:grey">Tidy</span>
### <span style="color:grey">Wrangle</span>
### Plot
]

.right-column[
__Plot a histogram using the `ggplot2::ggplot()` function__

```r
survey %>%
  ggplot(aes(duration)) +
  geom_histogram()
```

<img src="slides_files/figure-html/unnamed-chunk-65-1.png" width="70%" style="display: block; margin: auto;" />
]

???

- warning about the binwidth
- very skewed data

---
class: left

# The _tidyverse_

.left-column[
### <span style="color:grey">Import</span>
### <span style="color:grey">Tidy</span>
### <span style="color:grey">Wrangle</span>
### Plot
]

.right-column[
__Plot a histogram using the `ggplot2::ggplot()` function__

```r
survey %>%
  ggplot(aes(duration)) +
  geom_histogram() +
  geom_vline(xintercept = median(survey$duration)) # add median value
```

<img src="slides_files/figure-html/unnamed-chunk-66-1.png" width="70%" style="display: block; margin: auto;" />
]

???

- We add another graphical primitive (vertical line)
- Super long tail (respondent left browser open?)

---
class: left

# The _tidyverse_

.left-column[
### <span style="color:grey">Import</span>
### <span style="color:grey">Tidy</span>
### <span style="color:grey">Wrangle</span>
### Plot
]

.right-column[
__Plot a histogram using the `ggplot2::ggplot()` function__

```r
survey %>%
  ggplot(aes(duration)) +
  geom_histogram() +
  geom_vline(xintercept = median(survey$duration)) +
  scale_x_log10() # address extreme x-values
```

<img src="slides_files/figure-html/unnamed-chunk-67-1.png" width="70%" style="display: block; margin: auto;" />
]

???
- Transform the x-axis logarithmically
- squeeze in that tail

---
class: left

# The _tidyverse_

.left-column[
### <span style="color:grey">Import</span>
### <span style="color:grey">Tidy</span>
### <span style="color:grey">Wrangle</span>
### Plot
]

.right-column[
__Plot a histogram using the `ggplot2::ggplot()` function__

```r
survey %>%
  ggplot(aes(duration)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = median(survey$duration), linetype = 2) +
  scale_x_log10(breaks = c(2, 5, 10, 20, 60, 1440)) +
  labs(x = "Duration (mins)", y = "Number of respondents") + # change axis labels
  ggtitle("Most respondents took 15-20 min to complete survey") # add figure title
```

<img src="slides_files/figure-html/unnamed-chunk-68-1.png" width="70%" style="display: block; margin: auto;" />
]

???

- You can customise the plot to design it to your liking
- This is an example of how you can analyse one variable in your data frame

---
class: inverse, left, center, middle

# It's your turn!

__Explore and visualise <span style="color:black">`data(diamonds, package = "ggplot2")`</span>__

<br>
<br>

.small[_Hint: Use <span style="color:black">`summary()`</span> to examine the dataset_]

---
class: left

# Quick exercise ⚡️

__Filter for diamonds with a Premium cut that cost less than $3000__

--

_Expected output:_

```
## # A tibble: 6,757 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
##  2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
##  3  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
##  4  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
##  5  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
##  6  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
##  7  0.29 Premium F     SI1      62.4    58   403  4.24  4.26  2.65
##  8  0.22 Premium E     VS2      61.6    58   404  3.93  3.89  2.41
##  9  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
## 10  0.3  Premium J     SI2      59.3    61   405  4.43  4.38  2.61
## # ℹ 6,747 more rows
```

---
class: left

# Quick
exercise ⚡️

__Plot a histogram of price for all diamonds__

--

_Expected output:_

<img src="slides_files/figure-html/unnamed-chunk-71-1.svg" width="70%" style="display: block; margin: auto;" />

---
class: left

# Quick exercise ⚡️

__Plot a scatter diagram of the price, carat and cut for all diamonds__

--

_Expected output:_

<img src="images/diamond_scatter.jpeg" width="50%" style="display: block; margin: auto;" />

???

- 3 variables

---
class: inverse, center, middle
name: resources

# Questions?

[About](#about)

[Getting Started](#Getstarted)

[General Syntax](#gensyntax)

[Data Structures](#structures)

[Functions](#functions)

[The _tidyverse_](#tidyverse)

[Useful Resources](#resources)

---
class: left

# Useful Resources

__Online tutorials__

- [R for Data Science](https://r4ds.hadley.nz)
- [Quick R](https://www.statmethods.net)
- [Learn the tidyverse](https://www.tidyverse.org/learn/)
- [R markdown cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/)

__Online Q&A__

- [Stack Overflow](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)
- [How to ask a good question online](https://stackoverflow.com/help/how-to-ask)
- Remember to check your `sessionInfo()` when troubleshooting!

__Others__

- [Use R/RStudio from an external drive](https://github.com/ClaudiaBrauer/A-very-short-introduction-to-R/blob/master/documents/Portable_versions_of_R_and_RStudio.pdf) (if you don't have admin rights to install software)