How to do Data Cleaning in R

This blog is trying to elaborate steps for cleaning the data. Since datasets varied, this blog could not cover all. Depedent on the data you’re using, different methods should be used.

Libraries

I use tidyverse as my main tool to clean the data. Tidyverse is a very useful package created by Hadley. It includes several sub-packages, such as dplyr (data manipulation), tidyr (data transforming), readr (data import), ggplot2 (data visulization) etc. If you haven’t installed this package yet, please run install.packages("tidyverse") in your R console.

library(tidyverse)

Step 1: Import Data

If your data is csv format, you could use read.csv() to import the data into R. Be careful to add stringsAsFactors = FALSE argument in the function, or all string variables will automate convert to factor by default. This will lead to some issues when you do further checking. # Configuration of Academic # Documentation: https://sourcethemes.com/academic/ # # This file is formatted using TOML syntax - learn more at https://learnxinyminutes.com/docs/toml/ # Each configuration section is defined by a name in square brackets (e.g. [outputs]).

dat1 <- read.csv("../../static/files/nfl_2010-2017.csv", stringsAsFactors = FALSE)

Step 2: Initial Check

Perform some initial check before doing further manipulation. This step is to let you get familiar with you data and have a big picture on what you need to do next.

Step 2.1: check variables

str() is a very useful fuction in R base package which provides you sample sizes, number of variabels, variables names, variabile types and several sample responses. In the sample data, there are 81,525 observations and 23 variables.

str(dat1)
## 'data.frame':    81525 obs. of  23 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ name        : chr  "Duce Staley" "Lamar Smith" "Tiki Barber" "Stephen Davis" ...
##  $ team        : chr  "PHI" "MIA" "NYG" "WAS" ...
##  $ game_year   : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ game_week   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ rush_att    : int  26 27 13 23 28 27 30 14 15 10 ...
##  $ rush_yds    : int  201 145 144 133 124 119 110 109 88 87 ...
##  $ rush_avg    : num  7.7 5.4 11.1 5.8 4.4 4.4 3.7 7.8 5.9 8.7 ...
##  $ rush_tds    : int  1 1 2 1 1 0 1 0 0 1 ...
##  $ rush_fumbles: int  0 1 0 0 1 0 0 0 0 0 ...
##  $ rec         : int  4 1 3 4 6 4 6 2 2 NA ...
##  $ rec_yds     : int  61 12 25 37 40 32 34 3 20 NA ...
##  $ rec_avg     : num  15.3 12 8.3 9.3 6.7 8 5.7 1.5 10 NA ...
##  $ rec_tds     : int  0 0 0 0 1 0 1 0 0 NA ...
##  $ rec_fumbles : int  0 0 0 0 0 0 0 0 0 NA ...
##  $ pass_att    : int  NA NA NA NA NA NA NA NA NA 41 ...
##  $ pass_yds    : int  NA NA NA NA NA NA NA NA NA 290 ...
##  $ pass_tds    : int  NA NA NA NA NA NA NA NA NA 2 ...
##  $ int         : int  NA NA NA NA NA NA NA NA NA 0 ...
##  $ sck         : int  NA NA NA NA NA NA NA NA NA 2 ...
##  $ pass_fumbles: int  NA NA NA NA NA NA NA NA NA 0 ...
##  $ rate        : num  NA NA NA NA NA ...
##  $ position    : chr  "RB" "RB" "RB" "RB" ...

step 2.2: check missing values and ranges

Missing values could be checked one by one variables or in case-level. Knowing which one or more variablea have high missing values will help you think about the reasons.

summary(dat1)
##        X             name               team             game_year   
##  Min.   :    1   Length:81525       Length:81525       Min.   :2000  
##  1st Qu.:20382   Class :character   Class :character   1st Qu.:2004  
##  Median :40763   Mode  :character   Mode  :character   Median :2009  
##  Mean   :40763                                         Mean   :2009  
##  3rd Qu.:61144                                         3rd Qu.:2013  
##  Max.   :81525                                         Max.   :2017  
##                                                                      
##    game_week         rush_att        rush_yds         rush_avg     
##  Min.   : 1.000   Min.   : 0.00   Min.   :-34.00   Min.   :-34.00  
##  1st Qu.: 4.000   1st Qu.: 1.00   1st Qu.:  4.00   1st Qu.:  2.00  
##  Median : 9.000   Median : 4.00   Median : 15.00   Median :  3.50  
##  Mean   : 8.592   Mean   : 6.96   Mean   : 28.91   Mean   :  4.05  
##  3rd Qu.:13.000   3rd Qu.:10.00   3rd Qu.: 42.00   3rd Qu.:  5.30  
##  Max.   :16.000   Max.   :43.00   Max.   :296.00   Max.   : 77.00  
##                   NA's   :47710   NA's   :47710    NA's   :47710   
##     rush_tds      rush_fumbles        rec            rec_yds      
##  Min.   :0.00    Min.   :0.00    Min.   : 0.000   Min.   :-22.00  
##  1st Qu.:0.00    1st Qu.:0.00    1st Qu.: 1.000   1st Qu.:  8.00  
##  Median :0.00    Median :0.00    Median : 2.000   Median : 21.00  
##  Mean   :0.21    Mean   :0.08    Mean   : 2.716   Mean   : 31.23  
##  3rd Qu.:0.00    3rd Qu.:0.00    3rd Qu.: 4.000   3rd Qu.: 45.00  
##  Max.   :5.00    Max.   :3.00    Max.   :21.000   Max.   :329.00  
##  NA's   :47710   NA's   :47710   NA's   :16722    NA's   :16722   
##     rec_avg          rec_tds       rec_fumbles       pass_att    
##  Min.   :-22.00   Min.   :0.000   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:  5.10   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:21.00  
##  Median :  9.00   Median :0.000   Median :0.000   Median :30.00  
##  Mean   : 10.15   Mean   :0.188   Mean   :0.032   Mean   :27.64  
##  3rd Qu.: 13.50   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:37.00  
##  Max.   : 99.00   Max.   :4.000   Max.   :2.000   Max.   :68.00  
##  NA's   :16722    NA's   :16722   NA's   :16722   NA's   :71044  
##     pass_yds        pass_tds          int             sck       
##  Min.   :-11.0   Min.   :0.00    Min.   :0.0     Min.   : 0.00  
##  1st Qu.:126.0   1st Qu.:0.00    1st Qu.:0.0     1st Qu.: 1.00  
##  Median :204.0   Median :1.00    Median :1.0     Median : 2.00  
##  Mean   :193.1   Mean   :1.16    Mean   :0.8     Mean   : 1.88  
##  3rd Qu.:267.0   3rd Qu.:2.00    3rd Qu.:1.0     3rd Qu.: 3.00  
##  Max.   :527.0   Max.   :7.00    Max.   :7.0     Max.   :12.00  
##  NA's   :71044   NA's   :71044   NA's   :71044   NA's   :71044  
##   pass_fumbles        rate          position        
##  Min.   :0.00    Min.   :  0.00   Length:81525      
##  1st Qu.:0.00    1st Qu.: 60.20   Class :character  
##  Median :0.00    Median : 82.10   Mode  :character  
##  Mean   :0.15    Mean   : 81.57                     
##  3rd Qu.:0.00    3rd Qu.:103.20                     
##  Max.   :3.00    Max.   :158.30                     
##  NA's   :71044   NA's   :71044

In this sample, variables rush_att, rush_yds, rush_avg, rush_tds, rush_fumbles have 47,710 missing values; variables rec, rec_yds, rec_avg, rec_tds, rec_fumbles have 16,722 missing values, variables pass_att, pass_yds, pass_tds, int, sck, pass_fumbles, rate have 71,044 missing values.

Also, look at the range (Min., Max.) of variables in summary output, sometimes the responses may exceed expected range. You may have to delete those cases in this situation.

step 2.3: check first and last cases

Sometimes you may find the second row of data including some information you don’t want. For example, Qualtrics survey data will put some background log information in row 2. Thus, it’s better to check the head and tail of dataset. You can use functions below:

head(dat1)
##   X           name team game_year game_week rush_att rush_yds rush_avg
## 1 1    Duce Staley  PHI      2000         1       26      201      7.7
## 2 2    Lamar Smith  MIA      2000         1       27      145      5.4
## 3 3    Tiki Barber  NYG      2000         1       13      144     11.1
## 4 4  Stephen Davis  WAS      2000         1       23      133      5.8
## 5 5 Edgerrin James  IND      2000         1       28      124      4.4
## 6 6  Priest Holmes  BAL      2000         1       27      119      4.4
##   rush_tds rush_fumbles rec rec_yds rec_avg rec_tds rec_fumbles pass_att
## 1        1            0   4      61    15.3       0           0       NA
## 2        1            1   1      12    12.0       0           0       NA
## 3        2            0   3      25     8.3       0           0       NA
## 4        1            0   4      37     9.3       0           0       NA
## 5        1            1   6      40     6.7       1           0       NA
## 6        0            0   4      32     8.0       0           0       NA
##   pass_yds pass_tds int sck pass_fumbles rate position
## 1       NA       NA  NA  NA           NA   NA       RB
## 2       NA       NA  NA  NA           NA   NA       RB
## 3       NA       NA  NA  NA           NA   NA       RB
## 4       NA       NA  NA  NA           NA   NA       RB
## 5       NA       NA  NA  NA           NA   NA       RB
## 6       NA       NA  NA  NA           NA   NA       RB
tail(dat1)
##           X           name team game_year game_week rush_att rush_yds
## 81520 81520     Jared Goff   LA      2017        15       NA       NA
## 81521 81521    Andy Dalton  CIN      2017        15       NA       NA
## 81522 81522 Trevor Siemian  DEN      2017        15       NA       NA
## 81523 81523  A.J. McCarron  CIN      2017        15       NA       NA
## 81524 81524 Derek Anderson  CAR      2017        16       NA       NA
## 81525 81525  Johnny Hekker   LA      2017        16       NA       NA
##       rush_avg rush_tds rush_fumbles rec rec_yds rec_avg rec_tds
## 81520       NA       NA           NA  NA      NA      NA      NA
## 81521       NA       NA           NA  NA      NA      NA      NA
## 81522       NA       NA           NA  NA      NA      NA      NA
## 81523       NA       NA           NA  NA      NA      NA      NA
## 81524       NA       NA           NA  NA      NA      NA      NA
## 81525       NA       NA           NA  NA      NA      NA      NA
##       rec_fumbles pass_att pass_yds pass_tds int sck pass_fumbles rate
## 81520          NA       21      120        2   1   2            0 93.4
## 81521          NA       22      113        0   2   3            0 27.3
## 81522          NA        9       67        0   1   2            0 39.8
## 81523          NA        6       19        0   0   0            0 56.9
## 81524          NA        1        0        0   0   0            0 39.6
## 81525          NA        1        0        0   0   0            0 39.6
##       position
## 81520       QB
## 81521       QB
## 81522       QB
## 81523       QB
## 81524    WR/TE
## 81525    WR/TE

The R output provide first 6 cases and last 6 case. You can use head(dat1, 10) to output first 10 cases.

Step 3: Select and rename Variables

After initial checks, some basic data cleaning could be done. You may want to select some variables you want and remove others. You can use select function to do that:

dat2 <- dat1 %>% select(name, team, pass_att)

I selected 3 variables (name, team, pass_att) from the dat1 and assign 3-variables data to “dat2”.

head(dat2)
##             name team pass_att
## 1    Duce Staley  PHI       NA
## 2    Lamar Smith  MIA       NA
## 3    Tiki Barber  NYG       NA
## 4  Stephen Davis  WAS       NA
## 5 Edgerrin James  IND       NA
## 6  Priest Holmes  BAL       NA

To rename the variables’ names, you could use set_names() from purrr packages.

dat2 %>% purrr::set_names(nm = "Players", "Team", "Pass_Attribute") %>% head()
##          Players Team Pass_Attribute
## 1    Duce Staley  PHI             NA
## 2    Lamar Smith  MIA             NA
## 3    Tiki Barber  NYG             NA
## 4  Stephen Davis  WAS             NA
## 5 Edgerrin James  IND             NA
## 6  Priest Holmes  BAL             NA
dat2 %>% purrr::set_names(nm = "V1", "V2", "V3") %>% head()
##               V1  V2 V3
## 1    Duce Staley PHI NA
## 2    Lamar Smith MIA NA
## 3    Tiki Barber NYG NA
## 4  Stephen Davis WAS NA
## 5 Edgerrin James IND NA
## 6  Priest Holmes BAL NA

Step 4: Remove missing values

If you prefer cases with no missing cases at all. You can use the function below.

dat_complete_cases <- dat2 %>% 
  filter_all(all_vars(!is.na(.)))

summary(dat_complete_cases)
##      name               team              pass_att    
##  Length:10481       Length:10481       Min.   : 0.00  
##  Class :character   Class :character   1st Qu.:21.00  
##  Mode  :character   Mode  :character   Median :30.00  
##                                        Mean   :27.64  
##                                        3rd Qu.:37.00  
##                                        Max.   :68.00

Or if you want remove the cases whose Pass_Attribute is missing, you can use:

dat_partialcomplete_cases <- dat2 %>% 
  filter(!is.na(pass_att))

summary(dat_complete_cases)
##      name               team              pass_att    
##  Length:10481       Length:10481       Min.   : 0.00  
##  Class :character   Class :character   1st Qu.:21.00  
##  Mode  :character   Mode  :character   Median :30.00  
##                                        Mean   :27.64  
##                                        3rd Qu.:37.00  
##                                        Max.   :68.00

To be continued…

Related