1 How to use this file

This file was created with Quarto, a publishing system whose documents let you review and execute all of the R code on this webpage.
To test a chunk of code, click the “copy” icon in the upper-right corner of the chunk block.
- Try copying the following code:
  a = 1 + 1
  b = a + 1
  print(b)
To review the whole file, click </> Code next to the title of this page, then find View Source and click it. You can then paste the content into a newly created Quarto document.

2 Getting Started with R

2.1 What is R?

R is a powerful programming language and environment specifically designed for statistical computing and graphics. It’s free, open-source, and has a vast ecosystem of packages for data analysis, visualization, and machine learning.

2.2 Why R?

- Free and Open Source: No licensing costs
- Extensive Package Ecosystem: Over 18,000 packages available
- Excellent for Statistics: Built by statisticians, for statisticians
- Great Visualization: ggplot2 and other packages for beautiful graphics
- Reproducible Research: R Markdown and Quarto for literate programming
- Active Community: Large, helpful community of users

2.3 R vs RStudio

- R: The programming language and computing environment
- RStudio: An integrated development environment (IDE) that makes R easier to use

3 Suggestions for Working in R

# R comments begin with a # -- there are no multiline comments

# RStudio highlights syntax as you type (colors below are from the default theme
#   and differ in other themes)
#   GREEN: Comments and character values in single or double quotes
#   BLUE: Functions and keywords
#   BLACK: Variable names and values

# You can use the tab key to complete object names, functions, and arguments

# R is case sensitive. That means R and r are two different things.

# Good naming conventions:
#   - Use descriptive names: my_data instead of x
#   - Use underscores or dots: my_data or my.data
#   - Avoid spaces and special characters (except . and _)
#   - Don't start with numbers: 1data is invalid, data1 is valid

4 Basic Data Types in R

# R has several basic data types:

# 1. Numeric (double) - decimal numbers
numeric_value <- 3.14
class(numeric_value)

# 2. Integer - whole numbers
integer_value <- 42L  # The L suffix makes it an integer
class(integer_value)

# 3. Character (string) - text
character_value <- "Hello, R!"
class(character_value)

# 4. Logical (boolean) - TRUE/FALSE
logical_value <- TRUE
class(logical_value)

# 5. Complex - complex numbers
complex_value <- 3 + 4i
class(complex_value)

# Check the type of any object
typeof(numeric_value)
is.numeric(numeric_value)
is.character(character_value)

5 R Functions

# In R, nearly everything that happens is a function call

# The print function prints the contents of what is inside to the console
print(x = 10)

# The terms inside the function are called the arguments; here print takes x
#   To find help with what the arguments are use:
?print

# Each function returns an object
print(x = 10)

# You can determine what type of object is returned by using the class function
class(print(x = 10))

# Function syntax: function_name(argument1, argument2, ...)
# Examples of common functions:
sqrt(16)           # Square root
abs(-5)            # Absolute value
round(3.14159, 2)  # Round to 2 decimal places
length(c(1,2,3,4)) # Length of a vector
sum(c(1,2,3,4))    # Sum of values
mean(c(1,2,3,4))   # Mean of values
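
As a small sketch of the function syntax above, the block below defines a function and calls it with positional and named arguments. The name `square_plus` and its `add` argument are hypothetical and chosen only for illustration; they are not part of base R.

```r
# A hypothetical helper: squares x, then adds an optional offset
square_plus <- function(x, add = 0) {
  x^2 + add
}

square_plus(3)               # 'add' falls back to its default of 0
square_plus(3, 1)            # arguments matched by position
square_plus(add = 1, x = 3)  # arguments matched by name, in any order
```

Naming arguments makes calls easier to read once a function takes more than one or two inputs.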

6 Vectors - The Building Blocks

Vectors are the most basic data structure in R. They are one-dimensional arrays that can contain multiple elements of the same type (e.g., all numbers, all text, or all logical values).

6.1 Creating Vectors

# Use the c() function (combine) to create vectors
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE)

# Display the vectors
numeric_vector
character_vector
logical_vector
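
Because a vector holds only one type, mixing types inside c() makes R convert every element to a common type. The sketch below shows this; `mixed_vector` is a made-up name used only for illustration.

```r
# Mixing types in c() silently converts everything to one common type
mixed_vector <- c(1, "apple", TRUE)
mixed_vector
class(mixed_vector)   # character

# Logical values become numbers when combined with numbers
class(c(1, TRUE))     # numeric
```

Checking class() after building a vector is a quick way to catch an unintended conversion.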

6.2 Creating Sequences

# Using the colon operator for simple sequences
sequence <- 1:10
sequence

# You can also create descending sequences
10:1

6.3 Using seq() for More Control

# seq() gives you more control over sequences
# Create a sequence from 1 to 10, incrementing by 2
seq(from = 1, to = 10, by = 2)

# Create a sequence with exactly 5 equally-spaced values between 1 and 10
seq(1, 10, length.out = 5)

6.4 Repeating Values with rep()

# Repeat a single value multiple times
rep(5, times = 3)

# Repeat an entire vector multiple times
rep(c(1, 2), times = 3)

# Repeat each element multiple times before moving to the next
rep(c(1, 2), each = 3)

6.5 Vector Operations

# R performs operations element-wise on vectors
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)

# Element-wise addition
x + y

# Element-wise multiplication
x * y

# Element-wise exponentiation
x^2

# You can also perform operations with a single value (it is recycled across every element)
x + 10
x * 2
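
The same recycling applies when the two vectors simply have different lengths. The sketch below uses made-up vectors (`short_vector`, `long_vector`) to show how the shorter one is reused.

```r
short_vector <- c(10, 20)
long_vector <- c(1, 2, 3, 4)

# The shorter vector is reused: (1 + 10, 2 + 20, 3 + 10, 4 + 20)
recycled_sum <- long_vector + short_vector
recycled_sum

# If the longer length is not a multiple of the shorter one, R still
# recycles but issues a warning, e.g. c(1, 2, 3) + c(10, 20)
```

Recycling is convenient for single values but can hide mistakes, so it is worth checking lengths with length() when two vectors are supposed to line up.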

7 Categorical/Factor Vectors (Factors)

Factors are used for categorical variables in R. They store both the values and the levels (categories), which is essential for statistical analysis and plotting. R uses factors to understand categorical variables properly.

7.1 Creating Basic Factors

# Create a factor from a character vector
gender <- c("Male", "Female", "Male", "Female", "Male")
gender_factor <- factor(gender)
gender_factor

# Check the levels (categories)
levels(gender_factor)

# See how many observations in each category
table(gender_factor)

7.2 Creating Factors with Specific Levels

# You can specify the order of levels explicitly
# This is useful when you want a specific order for plotting or analysis
education <- c("High School", "College", "Graduate", "High School")
education_factor <- factor(education,
                          levels = c("High School", "College", "Graduate"))
education_factor

# View the levels in the order you specified
levels(education_factor)

7.3 Ordered Factors (Ordinal Data)

# Use ordered = TRUE for ordinal data (categories with a meaningful order)
satisfaction <- c("Low", "Medium", "High", "Medium", "Low")
satisfaction_ordered <- factor(satisfaction,
                               levels = c("Low", "Medium", "High"),
                               ordered = TRUE)
satisfaction_ordered

# Notice the < signs indicating the order
print(satisfaction_ordered)

7.4 Factor Operations and Summaries

# Get frequency counts
table(gender_factor)

7.5 Converting Between Data Types

# Convert factor back to character
as.character(gender_factor)

# Convert to numeric (gives you the underlying level numbers, not always useful)
as.numeric(gender_factor)

# Be careful: converting numeric to factor
age_values <- c(25, 30, 35, 25, 40, 30)
age_factor <- factor(age_values)
age_factor  # Notice it treats each unique number as a separate category
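
A common follow-up problem is getting the original numbers back out of such a factor. The sketch below reuses the `age_values` example; `level_codes` and `recovered_ages` are names introduced here for illustration.

```r
age_values <- c(25, 30, 35, 25, 40, 30)
age_factor <- factor(age_values)

# as.numeric() alone returns the level codes, not the original ages
level_codes <- as.numeric(age_factor)
level_codes

# Go through character first to recover the original numbers
recovered_ages <- as.numeric(as.character(age_factor))
recovered_ages
```

The level codes are just positions in levels(age_factor), which is why they look nothing like the ages.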

7.6 Grouping Continuous Data into Categories

7.6.1 Method 1: Using cut() Function

# cut() is ideal for dividing continuous data into intervals
ages <- c(22, 25, 30, 35, 40, 45, 50, 55, 60, 65)

age_categories <- cut(ages,
                      breaks = c(0, 30, 50, 100),  # Define the breakpoints
                      labels = c("Young", "Middle", "Senior"),  # Label each interval
                      include.lowest = TRUE)  # Include the lowest value in the first interval
age_categories

# Check the distribution
table(age_categories)

7.6.2 Method 2: Using ifelse() for Custom Grouping

# ifelse() gives you more control over custom conditions
ages <- c(22, 25, 30, 35, 40, 45, 50, 55, 60, 65)

age_groups_custom <- ifelse(ages < 30, "Young",
                            ifelse(ages < 50, "Middle", "Senior"))

# Convert to an ordered factor
age_groups_factor <- factor(age_groups_custom,
                            levels = c("Young", "Middle", "Senior"),
                            ordered = TRUE)
age_groups_factor

# View the distribution
table(age_groups_factor)
summary(age_groups_factor)
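
Method 1 and Method 2 as written treat boundary ages differently: cut() closes each interval on the right by default, so an age of exactly 30 lands in "Young", while the ifelse() version puts it in "Middle". The sketch below compares the two and shows that `right = FALSE` makes cut() agree with the ifelse() rules. The object names are made up for illustration.

```r
ages <- c(22, 25, 30, 35, 40, 45, 50, 55, 60, 65)
group_labels <- c("Young", "Middle", "Senior")

# Default cut(): intervals are closed on the right, so 30 falls in "Young"
cut_right_closed <- cut(ages, breaks = c(0, 30, 50, 100), labels = group_labels)

# right = FALSE closes intervals on the left, so 30 falls in "Middle"
cut_left_closed <- cut(ages, breaks = c(0, 30, 50, 100), labels = group_labels,
                       right = FALSE)

ifelse_groups <- ifelse(ages < 30, "Young",
                        ifelse(ages < 50, "Middle", "Senior"))

table(cut_right_closed)
table(cut_left_closed)
all(as.character(cut_left_closed) == ifelse_groups)
```

Whichever rule you choose, it helps to test a value that sits exactly on a breakpoint.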

7.7 Why Use Factors?

Factors are essential because they:

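- Help R recognize categorical data in statistical models (e.g., ANOVA, regression)
- Control the order of categories in plots and tables
- Store each category as an integer code plus one set of labels, which can be more compact than repeated character strings
- Prevent typos from creating unintended new categories (values outside the specified levels become NA)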

8 R Objects

# Each object can be saved into the R environment (the workspace here)
#   You can save the results of a function call to a variable of any name
MyObject = print(x = 10)
class(MyObject)

# You can view the objects you have saved in the Environment tab in RStudio
# Or type their name
MyObject

# There are literally thousands of types of objects in R (you can create them),
#   but for our course we will mostly be working with data frames (more later)

# The process of saving the results of a function to a variable is called 
#   assignment. There are several ways you can assign function results to 
#   variables:

# The equals sign takes the result from the right-hand side and assigns it to
#   the variable name on the left-hand side:
MyObject = print(x = 10)

# The <- (Alt "-" in RStudio) functions like the equals (right to left)
MyObject2 <- print(x = 10)

identical(MyObject, MyObject2)

# The -> assigns from left to right:
print(x = 10) -> MyObject3

# identical() compares exactly two objects, so check the third one separately
identical(MyObject, MyObject3)

# Best practice: Use <- for assignment (more explicit)
# Use = only for function arguments

9 Working with Data Structures

9.1 Lists

# Lists can contain elements of different types
my_list <- list(
  name = "John",
  age = 30,
  scores = c(85, 90, 78),
  passed = TRUE
)

# Accessing list elements
my_list$name
my_list[["age"]]
my_list[[3]]

# Lists are very flexible and useful for complex data structures
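
The difference between single and double brackets is easy to miss, so here is a minimal sketch using the same list. The `city` element is a made-up field added only to show assignment.

```r
my_list <- list(name = "John", age = 30, scores = c(85, 90, 78), passed = TRUE)

# Single brackets return a smaller list
age_as_list <- my_list["age"]
class(age_as_list)

# Double brackets (or $) return the element itself
age_value <- my_list[["age"]]
class(age_value)

# Add or change an element by assigning to a name
my_list$city <- "Unknown"   # "city" is a made-up field for illustration
names(my_list)
```

Use double brackets or $ when you want to compute with the value, and single brackets when you want a sub-list.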

9.2 Matrices

# Matrices are 2-dimensional arrays with the same data type
my_matrix <- matrix(1:12, nrow = 3, ncol = 4)
my_matrix

# Creating matrices from vectors
matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)

# Matrix operations
matrix1 <- matrix(1:4, nrow = 2)
matrix2 <- matrix(5:8, nrow = 2)
matrix1 + matrix2
matrix1 * matrix2  # Element-wise multiplication
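
Element-wise multiplication with * is not the same as matrix multiplication. As a brief sketch with the same two matrices, the `%*%` operator performs the matrix product; `matrix_product` is a name introduced here.

```r
matrix1 <- matrix(1:4, nrow = 2)
matrix2 <- matrix(5:8, nrow = 2)

# %*% performs matrix multiplication (rows of matrix1 times columns of matrix2)
matrix_product <- matrix1 %*% matrix2
matrix_product

# t() transposes a matrix; dim() reports rows and columns
t(matrix1)
dim(matrix_product)
```

Note that matrix() fills values column by column unless you set byrow = TRUE.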

10 Importing and Exporting Data

- A data frame is an R object that stores data in a rectangular (table) format. Each column represents a variable and can be of a different type (e.g., numeric, character, factor). Each row represents an observation or case.
- We will start by importing data from a comma-separated values (csv) file.
- We will use the read.csv() function. Here, the argument stringsAsFactors = FALSE prevents R from automatically converting character strings into factors (categorical variables), giving us more control over data types. Since R 4.0.0 this is already the default, but stating it explicitly keeps the code behaving the same way on older versions.
- We can use the here::here() function to create reliable file paths that work across different operating systems and project structures.

# You can also set the working directory using setwd().
# For example, to set it to your home folder:
# setwd("~")

getwd()  # Get current working directory
dir()    # List files in current directory

# The following might give an error if the file path is not correct from your current directory:
HeightsData = read.csv(file = "heights.csv",
                       stringsAsFactors = FALSE)
HeightsData

# Note: Windows users need to use either forward slashes (/) or
# double backslashes (\\) in file paths. Single backslashes (\) don't work in R.
# Example: "C:/Users/name/file.csv" or "C:\\Users\\name\\file.csv"

# To view your data in RStudio, you can either:
# 1) Double-click the data frame in the Environment tab, or
# 2) Use the View() function
# View(HeightsData)

# You can access individual variables (columns) using the $ operator:
HeightsData$ID

# To read SPSS files, we need the foreign package.
# The foreign package comes pre-installed with R (no need to use install.packages()).
library(foreign)

# The read.spss() function imports an SPSS file.
# Setting to.data.frame = TRUE converts it to an R data frame (rather than a list)
WideData = read.spss(file = "wide.sav", 
                     to.data.frame = TRUE)
WideData
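
Exporting works as the mirror image of importing. The sketch below writes a small made-up data frame to a temporary csv file with write.csv() and reads it back; `example_data` and its values are invented for illustration and are not the course data.

```r
# A small made-up data frame (not the course data)
example_data <- data.frame(ID = 1:3, HeightIN = c(64.5, 70.2, 68.0))

# Write to a temporary file so nothing in your project is overwritten
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, file = csv_path, row.names = FALSE)

# Read it back and confirm the values survived the round trip
reimported_data <- read.csv(file = csv_path, stringsAsFactors = FALSE)
reimported_data
all.equal(example_data, reimported_data)
```

Setting row.names = FALSE avoids writing an extra unnamed column of row numbers, which would otherwise reappear as a variable on import.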

11 Working with Data Frames

# Data frames are the most common data structure for statistical analysis
# They are like spreadsheets with rows (observations) and columns (variables)

# Basic data frame operations
dim(HeightsData)        # Dimensions (rows, columns)
nrow(HeightsData)       # Number of rows
ncol(HeightsData)       # Number of columns
names(HeightsData)      # Column names
str(HeightsData)        # Structure of the data frame
head(HeightsData)       # First 6 rows
tail(HeightsData)       # Last 6 rows
summary(HeightsData)    # Summary statistics

# Accessing data frame elements
HeightsData[1, 2]       # Row 1, Column 2
HeightsData[1:5, ]      # Rows 1-5, all columns
HeightsData[, "ID"]     # All rows, column named "ID"
HeightsData$ID          # Same as above (preferred method)

# Subsetting data frames
subset(HeightsData, HeightIN > 70)
HeightsData[HeightsData$HeightIN > 70, ]
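
If heights.csv is not available, the same subsetting ideas can be tried on a toy data frame. The values below are made up, and the real file may contain different columns; only the `ID` and `HeightIN` names are taken from the examples above.

```r
# Made-up values; the real heights.csv may have different columns
toy_heights <- data.frame(
  ID = 1:5,
  HeightIN = c(62, 71, 68, 74, 66)
)

# Rows where HeightIN exceeds 70, keeping all columns
tall_rows <- toy_heights[toy_heights$HeightIN > 70, ]
tall_rows

# Combine conditions with & (and) or | (or)
mid_rows <- subset(toy_heights, HeightIN >= 66 & HeightIN <= 70)
mid_rows

# Keep only one column for the matching rows
toy_heights[toy_heights$HeightIN > 70, "ID"]
```

The comma inside the square brackets matters: the condition goes before it (rows) and the column selection after it.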

11.1 Exercise

Obtain the following information from WideData
- Dimensions (rows, columns)
- Number of rows
- Number of columns
- Column names
- Structure of the data frame
- First 6 rows
- Last 6 rows
- Summary statistics

12 Merging R data frame objects

# The WideData and HeightsData have the same set of ID numbers.
# We can use the merge() function to merge them into a single data frame.
# Here, x is the name of the left-side data frame and y is the name of the
# right-side data frame. The arguments by.x and by.y specify the variable(s)
# by which we will merge:
AllData = merge(x = WideData, y = HeightsData, by.x = "ID", by.y = "ID")
AllData

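## Method 2: Use dplyr. The native pipe |> requires R 4.1.0 or later.
## In RStudio, Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac) inserts a pipe: %>% by default,
## or |> once "Use native pipe operator" is enabled in the Code options.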
library(dplyr)
WideData |> 
  left_join(HeightsData, by = "ID")

# Different types of joins:
# left_join(): Keep all rows from left table
# right_join(): Keep all rows from right table  
# inner_join(): Keep only rows that appear in both tables
# full_join(): Keep all rows from both tables
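
By default merge() keeps only IDs found in both data frames (an inner join), so it matches left_join() only when the ID sets are identical, as they are here. The sketch below uses two made-up tables to show how the `all.x` and `all` arguments of merge() reproduce the other join types in base R.

```r
# Made-up tables: ID 3 appears only on the left, ID 4 only on the right
left_table <- data.frame(ID = c(1, 2, 3), DV = c(10, 8, 14))
right_table <- data.frame(ID = c(1, 2, 4), HeightIN = c(64, 70, 68))

inner_result <- merge(left_table, right_table, by = "ID")               # IDs 1, 2
left_result  <- merge(left_table, right_table, by = "ID", all.x = TRUE) # IDs 1, 2, 3
full_result  <- merge(left_table, right_table, by = "ID", all = TRUE)   # IDs 1, 2, 3, 4

inner_result
left_result
full_result
```

Rows without a match are filled with NA in the columns that come from the other table.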

13 Transforming Wide to Long

# Sometimes, certain packages require repeated measures data to be in a long
# format (where each measurement is on a separate row rather than in separate columns). 

library(dplyr) # contains variable selection

## Wrong Way (pivoting DV and Age separately creates unwanted combinations)
AllDataLong <- AllData |> 
  tidyr::pivot_longer(starts_with("DVTime"), names_to = "DV", values_to = "DV_Value") |> 
  tidyr::pivot_longer(starts_with("AgeTime"), names_to = "Age", values_to = "Age_Value") 

OnePerson <- AllDataLong  |> 
  filter(ID == "1")

OnePerson

## Correct Way (pivot both variables together, then separate and widen properly)
AllDataLong <- AllData |> 
  tidyr::pivot_longer(c(starts_with("DVTime"), starts_with("AgeTime"))) |> 
  tidyr::separate(name, into = c("Variable", "Time"), sep = "Time") |> 
  tidyr::pivot_wider(names_from = "Variable", values_from = "value")

OnePerson <- AllDataLong |> 
  filter(ID == "1")
OnePerson

# Understanding data reshaping:
# Wide format: Each time point has its own column
# Long format: Time points are in rows, with a time variable
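
For comparison, the same wide-to-long idea can be sketched in base R with reshape(), which is a different technique from the tidyr pipeline above. The data frame below is made up, with only two time points, and its column names merely imitate the DVTime/AgeTime pattern.

```r
# Made-up wide data with two time points
wide_data <- data.frame(
  ID = c(1, 2),
  DVTime1 = c(10, 8),   DVTime2 = c(12, 11),
  AgeTime1 = c(20, 19), AgeTime2 = c(21, 20)
)

long_data <- reshape(
  wide_data,
  direction = "long",
  idvar = "ID",
  varying = list(c("DVTime1", "DVTime2"), c("AgeTime1", "AgeTime2")),
  v.names = c("DV", "Age"),
  timevar = "Time",
  times = 1:2
)

long_data          # one row per person per time point
nrow(long_data)    # 2 people x 2 time points
```

A quick row count is a useful check after any reshape: the long data should have (number of people) x (number of time points) rows, which is what the "wrong way" above fails to produce.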

13.1 Exercise

13.1.1 Practice: Wide to Long with dplyr

In this exercise, you will practice reshaping repeated-measures data from wide format to long format using a dplyr pipeline (with tidyr functions).

- Create the small wide data frame shown below.
- Reshape it to long format so that you have four columns: id, time, dv, and age.
- Compute the mean of dv by time as a verification step.

# Load packages
library(dplyr)
library(tidyr)

# 1) Start from a small wide toy data set
toy_wide <- tibble::tribble(
  ~id, ~dv_time1, ~dv_time2, ~dv_time3, ~age_time1, ~age_time2, ~age_time3,
   1,        10,        12,        15,         20,         21,         22,
   2,         8,        11,        11,         19,         20,         21,
   3,        14,        13,        16,         21,         22,         23
)

# 2) YOUR TURN: Convert to long using a single dplyr pipeline
#    Goal columns: id, time (1/2/3), dv, age
#    Hints:
#      - Use pivot_longer() on both dv_ and age_ columns together
#      - Separate the column name into variable (dv/age) and time (1/2/3)
#      - Use pivot_wider() to spread variable back into dv and age columns

13.1.1.1 Optional solution

toy_long <- toy_wide |> 
  pivot_longer(
    cols = c(starts_with("dv_"), starts_with("age_")),
    names_to = "name",
    values_to = "value"
  ) |> 
  separate(name, into = c("variable", "time"), sep = "_time") |> 
  pivot_wider(names_from = variable, values_from = value) |> 
  mutate(time = as.integer(time))

toy_long

toy_long |> 
  group_by(time) |> 
  summarize(mean_dv = mean(dv, na.rm = TRUE), .groups = "drop")

14 Data Manipulation with dplyr

# The dplyr package provides an intuitive set of functions for data manipulation

# Select columns
AllData |> 
  select(ID, starts_with("DV"))

# Filter rows
AllData |> 
  filter(ID < 5)

# Arrange rows
AllData |> 
  arrange(ID)

# Create new variables
AllData |> 
  mutate(
    DV_avg = (DVTime1 + DVTime2 + DVTime3) / 3,
    DV_range = DVTime3 - DVTime1
  )

# Group and summarize
AllDataLong |> 
  group_by(Time) |> 
  summarize(
    mean_DV = mean(DV, na.rm = TRUE),
    sd_DV = sd(DV, na.rm = TRUE),
    n = n()
  )

15 Gathering Descriptive Statistics

# The psych package provides convenient functions for computing descriptive statistics.
## If you haven't installed it yet, run: install.packages("psych")
library(psych)

# Use describe() to get comprehensive descriptive statistics for all variables:
DescriptivesWide = describe(AllData)
DescriptivesWide

DescriptivesLong = describe(AllDataLong)
DescriptivesLong

# Use describeBy() to compute descriptive statistics separately for each group:
DescriptivesLongID = describeBy(AllDataLong, group = AllDataLong$ID)
DescriptivesLongID

# Basic descriptive statistics without packages:
mean(AllDataLong$DV, na.rm = TRUE)
median(AllDataLong$DV, na.rm = TRUE)
sd(AllDataLong$DV, na.rm = TRUE)
var(AllDataLong$DV, na.rm = TRUE)
min(AllDataLong$DV, na.rm = TRUE)
max(AllDataLong$DV, na.rm = TRUE)
quantile(AllDataLong$DV, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

16 Transforming Data

# You can transform data by creating new variables. 
# na.rm = TRUE keeps the mean from becoming NA if any Age value is missing
AllDataLong$AgeC = AllDataLong$Age - mean(AllDataLong$Age, na.rm = TRUE)

# You can also use functions to create new variables. Here we create new terms
#   using the function for significant digits:
AllDataLong$AgeYear = signif(x = AllDataLong$Age, digits = 2)
AllDataLong$AgeDecade = signif(x = AllDataLong$Age, digits = 1)
head(AllDataLong)

# Common data transformations:
# Centering: subtract mean
# Standardizing: (x - mean) / sd
# Log transformation: log(x)
# Square root: sqrt(x)
# Recoding: ifelse(condition, value_if_true, value_if_false)

# Example: Create standardized variables
AllDataLong$DV_z <- scale(AllDataLong$DV)
AllDataLong$Age_z <- scale(AllDataLong$Age)
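
To see what centering and standardizing do, the sketch below works through both by hand on a made-up vector `x` and compares the result with scale(). Note that scale() returns a one-column matrix, so wrapping it in as.numeric() gives a plain vector, which is often easier to store as a data frame column.

```r
x <- c(2, 4, 4, 6, 9)   # made-up values

centered <- x - mean(x)
standardized <- (x - mean(x)) / sd(x)

mean(centered)       # zero, up to rounding error
sd(standardized)     # one, up to rounding error
all.equal(standardized, as.numeric(scale(x)))
```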

17 Basic Plotting

# R has excellent plotting capabilities

# Base R plotting
hist(AllDataLong$DV, main = "Distribution of DV", xlab = "DV Values")
boxplot(DV ~ Time, data = AllDataLong, main = "DV by Time")
plot(AllDataLong$Age, AllDataLong$DV, main = "DV vs Age")

# Using ggplot2 (more modern and flexible)
# If you have not installed the package yet, run install.packages("ggplot2")
library(ggplot2)

# Histogram
ggplot(AllDataLong, aes(x = DV)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of DV", x = "DV Values", y = "Count")

# Boxplot
ggplot(AllDataLong, aes(x = Time, y = DV)) +
  geom_boxplot() +
  labs(title = "DV by Time")

# Scatter plot
ggplot(AllDataLong, aes(x = Age, y = DV)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "DV vs Age")

hw0_feedback <- read.csv(here::here("teaching/2025-01-13-Experiment-Design/Lecture01", "hw0_feedback.csv"))
table(hw0_feedback$Feedback)

18 Control Structures

# Conditional statements (if-else)
x <- 10
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than or equal to 5")
}

## Alternative method using the ifelse() function (vectorized)
## Pass the values themselves; ifelse() returns the chosen one, which R then prints
ifelse(x > 5,
       "x is greater than 5",
       "x is less than or equal to 5")

# For loops (repeat code a specific number of times)
for (i in 1:5) {
  print(paste("Iteration", i))
}

# While loops (repeat code while a condition is TRUE)
i <- 1
while (i <= 5) {
  print(paste("While iteration", i))
  i <- i + 1
}

# Apply functions (often more concise and "R-like" than explicit loops)
numbers <- 1:10
sapply(numbers, function(x) x^2)  # Returns a vector
lapply(numbers, function(x) x^2)  # Returns a list
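
The difference between if and ifelse() shows up once the input has more than one element. The sketch below uses a made-up `scores` vector, and also shows vapply(), a stricter relative of sapply() that requires you to state the expected result type.

```r
scores <- c(3, 7, 10)   # made-up values

# ifelse() checks every element and returns a vector of the same length
score_labels <- ifelse(scores > 5, "high", "low")
score_labels

# Arithmetic is already vectorized, so no loop or apply call is needed
scores^2

# vapply() is like sapply() but checks the type and length of each result
squared_scores <- vapply(scores, function(s) s^2, numeric(1))
squared_scores
```

For simple arithmetic, the vectorized form (scores^2) is usually the clearest choice; apply-style functions are most useful when each step is more involved.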

19 Working with Missing Data

# R uses NA (Not Available) to represent missing data
# Check for missing values in a variable
is.na(AllDataLong$DV)              # Returns TRUE/FALSE for each value
sum(is.na(AllDataLong$DV))        # Count the number of missing values
complete.cases(AllDataLong)        # Check which rows have no missing data

# Remove rows that contain any missing data
AllDataLong_complete <- na.omit(AllDataLong)
# Alternative method (same result):
AllDataLong_complete <- AllDataLong[complete.cases(AllDataLong), ]

# Replace missing values with the mean (simple imputation)
AllDataLong$DV_imputed <- ifelse(is.na(AllDataLong$DV), 
                                 mean(AllDataLong$DV, na.rm = TRUE), 
                                 AllDataLong$DV)
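
The same steps can be tried on a small vector without loading the course data. The values in `dv_with_gaps` are made up for illustration.

```r
dv_with_gaps <- c(10, NA, 14, 12, NA)   # made-up values

sum(is.na(dv_with_gaps))          # number of missing values
mean(dv_with_gaps)                # NA, because missing values propagate
mean(dv_with_gaps, na.rm = TRUE)  # mean of the observed values

dv_imputed <- ifelse(is.na(dv_with_gaps),
                     mean(dv_with_gaps, na.rm = TRUE),
                     dv_with_gaps)
dv_imputed
```

Mean imputation is simple but shrinks the variability of the variable, so treat it as a quick illustration rather than a recommended analysis strategy.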

20 Best Practices and Tips

# 1. Always use meaningful variable names
# 2. Comment your code
# 3. Use consistent formatting
# 4. Check your data after importing
# 5. Save your work regularly
# 6. Use version control (Git)
# 7. Write reproducible code
# 8. Use packages for common tasks
# 9. Learn to use help documentation
# 10. Practice regularly!

# Useful keyboard shortcuts in RStudio:
# Ctrl+Enter (Cmd+Enter on Mac): Run the current line or selected code
# Ctrl+Shift+Enter (Cmd+Shift+Enter on Mac): Run the entire script
# Ctrl+Shift+M (Cmd+Shift+M on Mac): Insert a pipe operator (%>% by default; |> if "Use native pipe operator" is enabled)
# Ctrl+Shift+C (Cmd+Shift+C on Mac): Comment or uncomment selected lines
# Ctrl+Shift+R (Cmd+Shift+R on Mac): Insert a code section header

21 Getting Help

# R has excellent help documentation
?mean                    # Help for a function
??"regression"          # Search the installed help pages for "regression"
help(mean)              # Same as ?mean
example(mean)           # Run examples for a function

# Online resources:
# - R Documentation: https://www.rdocumentation.org/
# - Stack Overflow: https://stackoverflow.com/questions/tagged/r
# - R-bloggers: https://www.r-bloggers.com/
# - RStudio Community: https://community.rstudio.com/

# Installing and loading packages
install.packages("package_name")  # Install once
library(package_name)             # Load each session
require(package_name)             # Like library(), but returns FALSE with a warning (instead of an error) if the package is missing

--- title: "Example 01: Make Friends with R" execute: eval: false format: html: toc: true toc_float: true toc_depth: 2 toc_collapsed: true number-sections: true code-tools: true code-fold: show code-summary: "Hide the code" --- # How to use this file 1. This file was created using Quarto, a type of document that allows you to review and execute all R code on this webpage. 2. To test a certain chunk of code, click the "copy" icon in the upper right corner of the chunk block (see screenshot below) - ![](figures/R-copy-paste.png) - Try copying the following code ```{r} a = 1 + 1 b = a + 1 print(b) ``` 3. To review the whole file, click `</> Code` next to the title of this page. Find `View Source` and click the button. Then, you can paste the content into a newly created Quarto Document. ![](figures/code-copy-paste.png) # Getting Started with R ## What is R? R is a powerful programming language and environment specifically designed for statistical computing and graphics. It's free, open-source, and has a vast ecosystem of packages for data analysis, visualization, and machine learning. ## Why R? 
- **Free and Open Source**: No licensing costs - **Extensive Package Ecosystem**: Over 18,000 packages available - **Excellent for Statistics**: Built by statisticians, for statisticians - **Great Visualization**: ggplot2 and other packages for beautiful graphics - **Reproducible Research**: R Markdown and Quarto for literate programming - **Active Community**: Large, helpful community of users ## R vs RStudio - **R**: The programming language and computing environment ![](images/paste-3.png) - **RStudio**: An integrated development environment (IDE) that makes R easier to use ![](images/paste-2.png) # Suggestion in R ```{r} #| eval: FALSE # R comments begin with a # -- there are no multiline comments # RStudio helps you build syntax # GREEN: Comments and character values in single or double quotes # BLUE: Functions and keywords # BLACK: Variable names and values # You can use the tab key to complete object names, functions, and arguments # R is case sensitive. That means R and r are two different things. # Good naming conventions: # - Use descriptive names: my_data instead of x # - Use underscores or dots: my_data or my.data # - Avoid spaces and special characters (except . and _) # - Don't start with numbers: 1data is invalid, data1 is valid ## install some neccessary packages install.packages("dplyr") install.packages("foreign") install.packages("tidyr") install.packages("psych") ``` # Basic Data Types in R ```{r} #| eval: FALSE # R has several basic data types: # 1. Numeric (double) - decimal numbers numeric_value <- 3.14 class(numeric_value) # 2. Integer - whole numbers integer_value <- 42L # The L suffix makes it an integer class(integer_value) # 3. Character (string) - text character_value <- "Hello, R!" class(character_value) # 4. Logical (boolean) - TRUE/FALSE logical_value <- TRUE class(logical_value) # 5. 
Complex - complex numbers complex_value <- 3 + 4i class(complex_value) # Check the type of any object typeof(numeric_value) is.numeric(numeric_value) is.character(character_value) ``` # R Functions ```{r} # In R, every statement is a function # The print function prints the contents of what is inside to the console print(x = 10) # The terms inside the function are called the arguments; here print takes x # To find help with what the arguments are use: ?print # Each function returns an object print(x = 10) # You can determine what type of object is returned by using the class function class(print(x = 10)) # Function syntax: function_name(argument1, argument2, ...) # Examples of common functions: sqrt(16) # Square root abs(-5) # Absolute value round(3.14159, 2) # Round to 2 decimal places length(c(1,2,3,4)) # Length of a vector sum(c(1,2,3,4)) # Sum of values mean(c(1,2,3,4)) # Mean of values ``` # Getting Help ```{r} #| eval: FALSE # R has excellent help documentation ?mean # Help for a function ??"regression" # Search for functions containing "regression" help(mean) # Same as ?mean example(mean) # Run examples for a function # Online resources: # - R Documentation: https://www.rdocumentation.org/ # - Stack Overflow: https://stackoverflow.com/questions/tagged/r # - R-bloggers: https://www.r-bloggers.com/ # - RStudio Community: https://community.rstudio.com/ # Installing and loading packages install.packages("package_name") # Install once library(package_name) # Load each session require(package_name) # Alternative to library() ``` # Vectors - The Building Blocks Vectors are the most basic data structure in R. They are one-dimensional arrays that can contain multiple elements of the same type (e.g., all numbers, all text, or all logical values). 
## Creating Vectors ```{r} #| eval: FALSE # Use the c() function (combine) to create vectors numeric_vector <- c(1, 2, 3, 4, 5) character_vector <- c("apple", "banana", "cherry") logical_vector <- c(TRUE, FALSE, TRUE) # Display the vectors numeric_vector character_vector logical_vector ``` ## Creating Sequences ```{r} #| eval: FALSE # Using the colon operator for simple sequences sequence <- 1:10 sequence # You can also create descending sequences 10:1 ``` ## Using seq() for More Control ```{r} #| eval: FALSE # seq() gives you more control over sequences # Create a sequence from 1 to 10, incrementing by 2 seq(from = 1, to = 10, by = 2) # Create a sequence with exactly 5 equally-spaced values between 1 and 10 seq(1, 10, length.out = 5) ``` ## Repeating Values with rep() ```{r} #| eval: FALSE # Repeat a single value multiple times rep(5, times = 3) # Repeat an entire vector multiple times rep(c(1, 2), times = 3) # Repeat each element multiple times before moving to the next rep(c(1, 2), each = 3) ``` ## Vector Operations ```{r} #| eval: FALSE # R performs operations element-wise on vectors x <- c(1, 2, 3, 4, 5) y <- c(10, 20, 30, 40, 50) # Element-wise addition x + y # Element-wise multiplication x * y # Element-wise exponentiation x^2 # You can also perform operations with a single value (vectorization) x + 10 x * 2 ``` # Categorical/Factor Vectors (Factors) Factors are used for categorical variables in R. They store both the values and the levels (categories), which is essential for statistical analysis and plotting. R uses factors to understand categorical variables properly. 
## Creating Basic Factors ```{r} #| eval: FALSE # Create a factor from a character vector gender <- c("Male", "Female", "Male", "Female", "Male") gender_factor <- factor(gender) gender_factor # Check the levels (categories) levels(gender_factor) # See how many observations in each category table(gender_factor) ``` ## Creating Factors with Specific Levels ```{r} #| eval: FALSE # You can specify the order of levels explicitly # This is useful when you want a specific order for plotting or analysis education <- c("High School", "College", "Graduate", "High School") education_factor <- factor(education, levels = c("High School", "College", "Graduate")) education_factor # View the levels in the order you specified levels(education_factor) ``` ## Ordered Factors (Ordinal Data) ```{r} #| eval: FALSE # Use ordered = TRUE for ordinal data (categories with a meaningful order) satisfaction <- c("Low", "Medium", "High", "Medium", "Low") satisfaction_ordered <- factor(satisfaction, levels = c("Low", "Medium", "High"), ordered = TRUE) satisfaction_ordered # Notice the < signs indicating the order print(satisfaction_ordered) ``` ## Factor Operations and Summaries ```{r} #| eval: FALSE # Get frequency counts table(gender_factor) ``` ## Converting Between Data Types ```{r} #| eval: FALSE # Convert factor back to character as.character(gender_factor) # Convert to numeric (gives you the underlying level numbers, not always useful) as.numeric(gender_factor) # Be careful: converting numeric to factor age_values <- c(25, 30, 35, 25, 40, 30) age_factor <- factor(age_values) age_factor # Notice it treats each unique number as a separate category ``` ## Grouping Continuous Data into Categories ### Method 1: Using cut() Function ```{r} #| eval: FALSE # cut() is ideal for dividing continuous data into intervals ages <- c(22, 25, 30, 35, 40, 45, 50, 55, 60, 65) age_categories <- cut(ages, breaks = c(0, 30, 50, 100), # Define the breakpoints labels = c("Young", "Middle", "Senior"), # Label 
each interval include.lowest = TRUE) # Include the lowest value in the first interval age_categories # Check the distribution table(age_categories) ``` ### Method 2: Using ifelse() for Custom Grouping ```{r} #| eval: FALSE # ifelse() gives you more control over custom conditions ages <- c(22, 25, 30, 35, 40, 45, 50, 55, 60, 65) age_groups_custom <- ifelse(ages < 30, "Young", ifelse(ages < 50, "Middle", "Senior")) # Convert to an ordered factor age_groups_factor <- factor(age_groups_custom, levels = c("Young", "Middle", "Senior"), ordered = TRUE) age_groups_factor # View the distribution table(age_groups_factor) summary(age_groups_factor) ``` ## Why Use Factors? Factors are essential because they: - Help R recognize categorical data in statistical models (e.g., ANOVA, regression) - Control the order of categories in plots and tables - Store data more efficiently than character strings - Prevent typos from creating unintended new categories # R Objects ```{r} # Each object can be saved into the R environment (the workspace here) # You can save the results of a function call to a variable of any name MyObject = print(x = 10) class(MyObject) # You can view the objects you have saved in the Environment tab in RStudio # Or type their name MyObject # There are literally thousands of types of objects in R (you can create them), # but for our course we will mostly be working with data frames (more later) # The process of saving the results of a function to a variable is called # assignment. 
# There are several ways you can assign function results to
# variables:

# The equals sign takes the result from the right-hand side and assigns it to
# the variable name on the left-hand side:
MyObject = print(x = 10)

# The <- (Alt "-" in RStudio, Option "-" on Mac) functions like the equals (right to left)
MyObject2 <- print(x = 10)
identical(MyObject, MyObject2)

# The -> assigns from left to right:
print(x = 10) -> MyObject3
# identical() compares exactly two objects, so the third one is checked separately
identical(MyObject, MyObject3)

# Best practice: Use <- for assignment (more explicit)
# Use = only for function arguments
```

# Working with Data Structures

## Lists

```{r}
#| eval: FALSE
# Lists can contain elements of different types
my_list <- list(
  name = "John",
  age = 30,
  scores = c(85, 90, 78),
  passed = TRUE
)

# Accessing list elements
my_list$name
my_list[["age"]]
my_list[[3]]

# Lists are very flexible and useful for complex data structures
```

## Matrices

```{r}
#| eval: FALSE
# Matrices are 2-dimensional arrays with the same data type
my_matrix <- matrix(1:12, nrow = 3, ncol = 4)
my_matrix

# Creating matrices from vectors
matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

# Matrix operations
matrix1 <- matrix(1:4, nrow = 2)
matrix2 <- matrix(5:8, nrow = 2)
matrix1 + matrix2
matrix1 * matrix2 # Element-wise multiplication
```

# Importing and Exporting Data

- A data frame is an R object that stores data in a rectangular (table) format. Each column represents a variable and can be of different types (e.g., numeric, character, factor). Each row represents an observation or case.
- We will start by importing data from a comma-separated values (csv) file.
- We will use the read.csv() function. Here, the argument `stringsAsFactors = FALSE` prevents R from automatically converting character strings into factors (categorical variables), giving us more control over data types.
- We can use the `here::here()` function to create reliable file paths that work across different operating systems and project structures.
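The import step described in the bullets above can be tried without any course files. The chunk below is a minimal sketch: the data frame `toy_heights`, its columns, and the temporary file `toy_path` are hypothetical stand-ins for your own data, and only base R functions (`tempfile()`, `write.csv()`, `read.csv()`) are used.

```{r}
# Hypothetical example data (not part of the course files)
toy_heights <- data.frame(
  ID = 1:3,
  Name = c("Ann", "Ben", "Cy"),
  HeightIN = c(64.5, 70.2, 68.0)
)

# Export: write to a temporary CSV file; row.names = FALSE avoids an extra column
toy_path <- tempfile(fileext = ".csv")
write.csv(toy_heights, file = toy_path, row.names = FALSE)

# Import: character columns stay character because stringsAsFactors = FALSE
toy_back <- read.csv(file = toy_path, stringsAsFactors = FALSE)
str(toy_back)

# For comparison, stringsAsFactors = TRUE converts Name into a factor
toy_factor <- read.csv(file = toy_path, stringsAsFactors = TRUE)
class(toy_back$Name)
class(toy_factor$Name)
```

Inside a project, a path built with `here::here()` could take the place of `toy_path`, for example `here::here("data", "heights.csv")`, where the `data` folder is only an assumption about how the project is organized.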
```{r}
#| error: true
#| eval: false
# You can also set the working directory using setwd().
# For example, to set it to your home folder:
# setwd("~")
getwd() # Get current working directory
dir()   # List files in current directory
```

```{r}
# The following might give an error if the file path is not correct from your current directory:
HeightsData = read.csv(file = "heights.csv", stringsAsFactors = FALSE)
HeightsData
```

```{r}
# Note: Windows users need to use either forward slashes (/) or
# double backslashes (\\) in file paths. Single backslashes (\) don't work in R.
# Example: "C:/Users/name/file.csv" or "C:\\Users\\name\\file.csv"

# To view your data in RStudio, you can either:
# 1) Double-click the data frame in the Environment tab, or
# 2) Use the View() function
# View(HeightsData)

# You can access individual variables (columns) using the $ operator:
HeightsData$ID

# To read SPSS files, we need the foreign package.
# The foreign package comes pre-installed with R (no need to use install.packages()).
library(foreign)

# The read.spss() function imports an SPSS file.
# Setting to.data.frame = TRUE converts it to an R data frame (rather than a list)
WideData = read.spss(file = "wide.sav", to.data.frame = TRUE)
WideData
```

# Working with Data Frames

```{r}
#| eval: FALSE
# Data frames are the most common data structure for statistical analysis
# They are like spreadsheets with rows (observations) and columns (variables)

# Basic data frame operations
dim(HeightsData)     # Dimensions (rows, columns)
nrow(HeightsData)    # Number of rows
ncol(HeightsData)    # Number of columns
names(HeightsData)   # Column names
str(HeightsData)     # Structure of the data frame
head(HeightsData)    # First 6 rows
tail(HeightsData)    # Last 6 rows
summary(HeightsData) # Summary statistics

# Accessing data frame elements
HeightsData[1, 2]    # Row 1, Column 2
HeightsData[1:5, ]   # Rows 1-5, all columns
HeightsData[, "ID"]  # All rows, column named "ID"
HeightsData$ID       # Same as above (preferred method)

# Subsetting data frames
subset(HeightsData, HeightIN > 70)
HeightsData[HeightsData$HeightIN > 70, ]
```

## Exercise

- Obtain the following information from WideData
  - Dimensions (rows, columns)
  - Number of rows
  - Number of columns
  - Column names
  - Structure of the data frame
  - First 6 rows
  - Last 6 rows
  - Summary statistics

# Merging R data frame objects

```{r}
# The WideData and HeightsData have the same set of ID numbers.
# We can use the merge() function to merge them into a single data frame.
# Here, x is the name of the left-side data frame and y is the name of the
# right-side data frame.
# The arguments by.x and by.y specify the variable(s)
# by which we will merge:
AllData = merge(x = WideData, y = HeightsData, by.x = "ID", by.y = "ID")
AllData

## Method 2: Use dplyr method (in RStudio, the pipe operator can be typed using
## Ctrl+Shift+M on Windows or Cmd+Shift+M on Mac; it inserts |> when the native
## pipe option is enabled in the RStudio settings, otherwise %>%)
library(dplyr)
WideData |>
  left_join(HeightsData, by = "ID")

# Different types of joins:
# left_join(): Keep all rows from left table
# right_join(): Keep all rows from right table
# inner_join(): Keep only rows that appear in both tables
# full_join(): Keep all rows from both tables
```

# Transforming Wide to Long

```{r}
# Sometimes, certain packages require repeated measures data to be in a long
# format (where each measurement is on a separate row rather than in separate columns).
library(dplyr) # contains variable selection

## Wrong Way (pivoting DV and Age separately creates unwanted combinations)
AllDataLong <- AllData |>
  tidyr::pivot_longer(starts_with("DVTime"), names_to = "DV", values_to = "DV_Value") |>
  tidyr::pivot_longer(starts_with("AgeTime"), names_to = "Age", values_to = "Age_Value")

OnePerson <- AllDataLong |> filter(ID == "1")
OnePerson

## Correct Way (pivot both variables together, then separate and widen properly)
AllDataLong <- AllData |>
  tidyr::pivot_longer(c(starts_with("DVTime"), starts_with("AgeTime"))) |>
  tidyr::separate(name, into = c("Variable", "Time"), sep = "Time") |>
  tidyr::pivot_wider(names_from = "Variable", values_from = "value")

OnePerson <- AllDataLong |> filter(ID == "1")
OnePerson

# Understanding data reshaping:
# Wide format: Each time point has its own column
# Long format: Time points are in rows, with a time variable
```

## Exercise

### Practice: Wide to Long with dplyr

In this exercise, you will practice reshaping repeated-measures data from wide format to long format using a dplyr pipeline (with tidyr functions).

1. Create the small wide data frame shown below.
2.
   Reshape it to long format so that you have four columns: `id`, `time`, `dv`, and `age`.
3. Compute the mean of `dv` by `time` as a verification step.

```{r}
# Load packages
library(dplyr)
library(tidyr)

# 1) Start from a small wide toy data set
toy_wide <- tibble::tribble(
  ~id, ~dv_time1, ~dv_time2, ~dv_time3, ~age_time1, ~age_time2, ~age_time3,
  1, 10, 12, 15, 20, 21, 22,
  2,  8, 11, 11, 19, 20, 21,
  3, 14, 13, 16, 21, 22, 23
)

# 2) YOUR TURN: Convert to long using a single dplyr pipeline
# Goal columns: id, time (1/2/3), dv, age
# Hints:
# - Use pivot_longer() on both dv_ and age_ columns together
# - Separate the column name into variable (dv/age) and time (1/2/3)
# - Use pivot_wider() to spread variable back into dv and age columns
```

#### Optional solution

```{r}
#| code-fold: true
toy_long <- toy_wide |>
  pivot_longer(
    cols = c(starts_with("dv_"), starts_with("age_")),
    names_to = "name",
    values_to = "value"
  ) |>
  separate(name, into = c("variable", "time"), sep = "_time") |>
  pivot_wider(names_from = variable, values_from = value) |>
  mutate(time = as.integer(time))

toy_long

toy_long |>
  group_by(time) |>
  summarize(mean_dv = mean(dv, na.rm = TRUE), .groups = "drop")
```

# Data Manipulation with dplyr

```{r}
#| eval: FALSE
# The dplyr package provides an intuitive set of functions for data manipulation

# Select columns
AllData |> select(ID, starts_with("DV"))

# Filter rows
AllData |> filter(ID < 5)

# Arrange rows
AllData |> arrange(ID)

# Create new variables
AllData |>
  mutate(
    DV_avg = (DVTime1 + DVTime2 + DVTime3) / 3,
    DV_range = DVTime3 - DVTime1
  )

# Group and summarize
AllDataLong |>
  group_by(Time) |>
  summarize(
    mean_DV = mean(DV, na.rm = TRUE),
    sd_DV = sd(DV, na.rm = TRUE),
    n = n()
  )
```

# Gathering Descriptive Statistics

```{r}
# The psych package provides convenient functions for computing descriptive statistics.
## If you haven't installed it yet, run: install.packages("psych")
library(psych)

# Use describe() to get comprehensive descriptive statistics for all variables:
DescriptivesWide = describe(AllData)
DescriptivesWide
DescriptivesLong = describe(AllDataLong)
DescriptivesLong

# Use describeBy() to compute descriptive statistics separately for each group:
DescriptivesLongID = describeBy(AllDataLong, group = AllDataLong$ID)
DescriptivesLongID

# Basic descriptive statistics without packages:
mean(AllDataLong$DV, na.rm = TRUE)
median(AllDataLong$DV, na.rm = TRUE)
sd(AllDataLong$DV, na.rm = TRUE)
var(AllDataLong$DV, na.rm = TRUE)
min(AllDataLong$DV, na.rm = TRUE)
max(AllDataLong$DV, na.rm = TRUE)
quantile(AllDataLong$DV, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```

# Transforming Data

```{r}
# You can transform data by creating new variables.
AllDataLong$AgeC = AllDataLong$Age - mean(AllDataLong$Age)

# You can also use functions to create new variables. Here we create new terms
# using the function for significant digits:
AllDataLong$AgeYear = signif(x = AllDataLong$Age, digits = 2)
AllDataLong$AgeDecade = signif(x = AllDataLong$Age, digits = 1)
head(AllDataLong)

# Common data transformations:
# Centering: subtract mean
# Standardizing: (x - mean) / sd
# Log transformation: log(x)
# Square root: sqrt(x)
# Recoding: ifelse(condition, value_if_true, value_if_false)

# Example: Create standardized variables
# scale() returns a one-column matrix; as.numeric() keeps the new columns as plain vectors
AllDataLong$DV_z <- as.numeric(scale(AllDataLong$DV))
AllDataLong$Age_z <- as.numeric(scale(AllDataLong$Age))
```

# Basic Plotting

```{r}
#| eval: FALSE
# R has excellent plotting capabilities

# Base R plotting
hist(AllDataLong$DV, main = "Distribution of DV", xlab = "DV Values")
boxplot(DV ~ Time, data = AllDataLong, main = "DV by Time")
plot(AllDataLong$Age, AllDataLong$DV, main = "DV vs Age")

# Using ggplot2 (more modern and flexible)
# If you have not installed the package yet, run install.packages("ggplot2")
library(ggplot2)

# Histogram
ggplot(AllDataLong, aes(x = DV)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of DV", x = "DV Values", y = "Count")

# Boxplot
ggplot(AllDataLong, aes(x = Time, y = DV)) +
  geom_boxplot() +
  labs(title = "DV by Time")

# Scatter plot
ggplot(AllDataLong, aes(x = Age, y = DV)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "DV vs Age")
```

```{r}
hw0_feedback <- read.csv(here::here("teaching/2025-01-13-Experiment-Design/Lecture01", "hw0_feedback.csv"))
table(hw0_feedback$Feedback)
```

# Control Structures

```{r}
#| eval: FALSE
# Conditional statements (if-else)
x <- 10
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than or equal to 5")
}

## Alternative method using ifelse() function (vectorized)
## ifelse() returns the chosen value, so no print() is needed inside it
ifelse(x > 5, "x is greater than 5", "x is less than or equal to 5")

# For loops (repeat code a specific number of times)
for (i in 1:5) {
  print(paste("Iteration", i))
}

# While loops (repeat code while a condition is TRUE)
i <- 1
while (i <= 5) {
  print(paste("While iteration", i))
  i <- i + 1
}

# Apply functions (often more concise and "R-like" than explicit loops)
numbers <- 1:10
sapply(numbers, function(x) x^2) # Returns a vector
lapply(numbers, function(x) x^2) # Returns a list
```

# Working with Missing Data

```{r}
#| eval: FALSE
# R uses NA (Not Available) to represent missing data

# Check for missing values in a variable
is.na(AllDataLong$DV)        # Returns TRUE/FALSE for each value
sum(is.na(AllDataLong$DV))   # Count the number of missing values
complete.cases(AllDataLong)  # Check which rows have no missing data

# Remove rows that contain any missing data
AllDataLong_complete <- na.omit(AllDataLong)

# Alternative method (same result):
AllDataLong_complete <- AllDataLong[complete.cases(AllDataLong), ]

# Replace missing values with the mean (simple imputation)
AllDataLong$DV_imputed <- ifelse(is.na(AllDataLong$DV),
                                 mean(AllDataLong$DV, na.rm = TRUE),
                                 AllDataLong$DV)
```

# Best Practices and Tips

```{r}
#| eval: FALSE
# 1.
#    Always use meaningful variable names
# 2. Comment your code
# 3. Use consistent formatting
# 4. Check your data after importing
# 5. Save your work regularly
# 6. Use version control (Git)
# 7. Write reproducible code
# 8. Use packages for common tasks
# 9. Learn to use help documentation
# 10. Practice regularly!

# Useful keyboard shortcuts in RStudio:
# Ctrl+Enter (Cmd+Enter on Mac): Run the current line or selected code
# Ctrl+Shift+Enter (Cmd+Shift+Enter on Mac): Run the entire script
# Ctrl+Shift+M (Cmd+Shift+M on Mac): Insert the pipe operator (|> or %>%, depending on RStudio settings)
# Ctrl+Shift+C (Cmd+Shift+C on Mac): Comment or uncomment selected lines
# Ctrl+Shift+R (Cmd+Shift+R on Mac): Insert a code section header
```
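Tips 7 and 9 above can be tried directly in the console. The chunk below is a small sketch using only base R; the seed value `123` and the object names `draw1` and `draw2` are arbitrary choices made for illustration.

```{r}
# Tip 9: look up documentation (run these interactively in the console)
# ?mean              # open the help page for mean()
# help("read.csv")   # same idea, using the help() function
args(sd)             # show the arguments a function accepts

# Tip 7: make random results reproducible by setting a seed first
set.seed(123)        # 123 is an arbitrary value
draw1 <- rnorm(5)
set.seed(123)        # resetting the same seed repeats the same draws
draw2 <- rnorm(5)
identical(draw1, draw2)

# Tip 7 (continued): record the R version and loaded packages with your results
sessionInfo()
```

Setting the seed once at the top of a script is usually enough; it is reset here only to show that the same seed reproduces the same numbers.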