Manipulating dataframes means many things to many researchers: we often select certain observations (rows) or variables (columns), group the data by one or more variables, or calculate summary statistics.

observations = rows
variables = columns

We can do all of these operations using normal base R:

base R:
* grouping variables
* calculating stats

# download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv", destfile = "data/gapminder-FiveYearData.csv")

gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv", header = TRUE)
str(gapminder)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])

## [1] 2193.755

mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])

## [1] 7136.11

mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])

## [1] 7902.15

This is tedious because of the repetition. Repeating yourself costs time and can introduce bugs.

dplyr package

The dplyr package provides useful functions for manipulating dataframes efficiently, and its grammar is often easier to read than base R.

We will cover the 5 most commonly used functions, plus the pipe (%>%) used to combine them:
1. select()
2. filter()
3. group_by()
4. summarize()
5. mutate()

# install dplyr package:
install.packages('dplyr')

# load package
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Using select() function

If we want to move forward with only a few of the variables (columns) in our dataframe, we can use the select() function. It keeps only the variables you select.

year_country_gdp <- select(gapminder, year, country, gdpPercap)

head(year_country_gdp) #use head to get a sample

##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

year_country_gdp now contains only the year, country, and gdpPercap variables. Above we used the normal function-call grammar.
The real benefit of dplyr is combining several functions using pipes (%>%). Since we are introducing pipes, let's repeat what we did in the previous code using a pipe.
In RStudio, the keyboard shortcut for the pipe is CTRL + SHIFT + M (or CMD + SHIFT + M on macOS).

# redo the variable using pipes
year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

head(year_country_gdp) #using head to get a sample

##   year     country gdpPercap
## 1 1952 Afghanistan  779.4453
## 2 1957 Afghanistan  820.8530
## 3 1962 Afghanistan  853.1007
## 4 1967 Afghanistan  836.1971
## 5 1972 Afghanistan  739.9811
## 6 1977 Afghanistan  786.1134

Step by step:
1. summon the gapminder dataframe
2. pass it on, using the pipe symbol %>%, to the select() function
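The pipe simply hands its left-hand side to the function on its right as the first argument. A minimal sketch of that equivalence, assuming a made-up data frame (`toy` and its values are illustrative and not part of gapminder):

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country = c("A", "B", "C"),
  year    = c(1952, 1957, 1962),
  pop     = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Nested call: select() receives the data frame as its first argument
nested <- select(toy, year, country)

# Piped call: %>% supplies `toy` as that first argument
piped <- toy %>% select(year, country)

identical(nested, piped)
```

Both forms should produce the same result; the piped form just reads left to right.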

Using filter()

If we continue the example above, but this time keep only European countries, we can combine select() and filter():

year_country_gdp_euro <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

head(year_country_gdp_euro) # use head for sample

##   year country gdpPercap
## 1 1952 Albania  1601.056
## 2 1957 Albania  1942.284
## 3 1962 Albania  2312.889
## 4 1967 Albania  2760.197
## 5 1972 Albania  3313.422
## 6 1977 Albania  3533.004

Challenge 1

http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/

What’s happening:
1. First we pass the gapminder dataframe to filter().
2. Then we pass the filtered version of the gapminder dataframe to select().
3. Note: the order of operations is very important in this case. If select() were used first, filter() would not be able to find the variable continent, since we would have removed it in the previous step.
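To see why the order matters, here is a small sketch on a made-up data frame (the names and values in `toy` are purely illustrative). The second pipeline is wrapped in tryCatch() so the expected error is captured instead of stopping the script:

```r
library(dplyr)

# Hypothetical toy data, not the real gapminder values
toy <- data.frame(
  country   = c("A", "B", "C"),
  continent = c("Europe", "Africa", "Europe"),
  gdpPercap = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# filter() first, then select(): continent is still there when we filter
ok <- toy %>%
  filter(continent == "Europe") %>%
  select(country, gdpPercap)

# select() first drops continent, so the later filter() cannot find it
failed <- tryCatch(
  {
    toy %>%
      select(country, gdpPercap) %>%
      filter(continent == "Europe")
    FALSE
  },
  error = function(e) TRUE
)

nrow(ok)
failed
```

This assumes no other object called `continent` exists in your workspace; if one did, filter() would silently use it instead of raising an error.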

Using group_by() and summarize()

As mentioned at the beginning, we want to reduce the error-prone repetitiveness of what can be done with base R.
We haven’t done that yet, since we would have to repeat the last code chunk for each continent.
Instead of filter(), which only passes on observations that meet your criterion (continent == "Europe"),
we can use group_by(), which essentially uses every unique value that you could have used in filter().

# take a look at gapminder str()
str(gapminder)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

str(gapminder %>% group_by(continent))

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##  - attr(*, "vars")= chr "continent"
##  - attr(*, "drop")= logi TRUE
##  - attr(*, "indices")=List of 5
##   ..$ : int  24 25 26 27 28 29 30 31 32 33 ...
##   ..$ : int  48 49 50 51 52 53 54 55 56 57 ...
##   ..$ : int  0 1 2 3 4 5 6 7 8 9 ...
##   ..$ : int  12 13 14 15 16 17 18 19 20 21 ...
##   ..$ : int  60 61 62 63 64 65 66 67 68 69 ...
##  - attr(*, "group_sizes")= int  624 300 396 360 24
##  - attr(*, "biggest_group_size")= int 624
##  - attr(*, "labels")='data.frame':   5 obs. of  1 variable:
##   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##   ..- attr(*, "vars")= chr "continent"
##   ..- attr(*, "drop")= logi TRUE

Note that the structure of the dataframe returned by group_by() (a grouped_df) is not the same as the original gapminder (a data.frame).
A grouped_df can be thought of as a list where each item is a data.frame containing only the rows that correspond to a particular value of continent.

show graphic - gapminder %>% group_by()

Using group_by() with summarize()
This allows us to create new variable(s) by applying functions to each of the continent-specific data frames.
That is, using group_by() we split the original gapminder dataframe into multiple pieces,
then summarize() runs functions such as mean() or sd() on each piece.

gdp_bycontinents <- gapminder %>% 
  group_by(continent) %>%
  summarize(mean_gdpPercap=mean(gdpPercap))

gdp_bycontinents

## # A tibble: 5 x 2
##   continent mean_gdpPercap
##   <fctr>             <dbl>
## 1 Africa              2194
## 2 Americas            7136
## 3 Asia                7902
## 4 Europe             14469
## 5 Oceania            18622

This allowed us to calculate the mean gdpPercap for each continent without repeating the code for every continent.
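The "split into pieces, then apply a function to each piece" idea can be sketched by hand with base R and compared with the dplyr version. This is only an illustration on made-up data (`toy` is hypothetical), not a description of how dplyr works internally:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  continent = c("X", "X", "Y", "Y", "Y"),
  gdpPercap = c(1, 3, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Base R: split into one data frame per continent, then apply mean() to each
pieces <- split(toy, toy$continent)
base_means <- sapply(pieces, function(piece) mean(piece$gdpPercap))

# dplyr: group_by() does the splitting, summarize() applies and recombines
dplyr_means <- toy %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

base_means
dplyr_means
```

Here `pieces` is literally a list of data frames, one per continent, which is a handy mental model for what a grouped_df represents.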

Challenge 2

The function group_by() allows us to group by multiple variables. For example, let’s group by year and continent.

gdp_bycontinents_byyear <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            sd_gdpPercap = sd(gdpPercap),
            mean_pop = mean(pop),
            sd_pop = sd(pop))

gdp_bycontinents_byyear

## # A tibble: 60 x 6
## # Groups: continent [?]
##    continent  year mean_gdpPercap sd_gdpPercap mean_pop   sd_pop
##    <fctr>    <int>          <dbl>        <dbl>    <dbl>    <dbl>
##  1 Africa     1952           1253          983  4570010  6317450
##  2 Africa     1957           1385         1135  5093033  7076042
##  3 Africa     1962           1598         1462  5702247  7957545
##  4 Africa     1967           2050         2848  6447875  8985505
##  5 Africa     1972           2340         3287  7305376 10130833
##  6 Africa     1977           2586         4142  8328097 11585184
##  7 Africa     1982           2482         3243  9602857 13456243
##  8 Africa     1987           2283         2567 11054502 15277484
##  9 Africa     1992           2282         2644 12674645 17562719
## 10 Africa     1997           2379         2821 14304480 19873013
## # ... with 50 more rows
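Grouping by two variables gives one summary row per combination that occurs in the data, and the result stays grouped by the first variable (that is what the "Groups: continent" line in the output refers to). A small sketch on made-up data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  continent = c("X", "X", "X", "Y", "Y", "Y"),
  year      = c(1952, 1952, 1957, 1952, 1957, 1957),
  pop       = c(1, 3, 5, 10, 20, 40),
  stringsAsFactors = FALSE
)

by_both <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_pop = mean(pop))

nrow(by_both)        # one row per continent-year combination in the data
group_vars(by_both)  # the grouping that remains after summarize()
```

Recent dplyr versions may also print a message about the remaining grouping; it is informational and does not change the result.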

Using mutate()

We can also create new variables prior to (or even after) summarizing information using mutate().

gdp_pop_bycontinents_byyear <- gapminder %>%
  mutate(gdp_billion=gdpPercap*pop/10^9) %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            sd_gdpPercap = sd(gdpPercap),
            mean_pop = mean(pop),
            sd_pop = sd(pop),
            mean_gdp_billion = mean(gdp_billion),
            sd_gdp_billion = sd(gdp_billion))

gdp_pop_bycontinents_byyear

## # A tibble: 60 x 8
## # Groups: continent [?]
##    continent  year mean_gdpPercap sd_gdpPercap mean_pop sd_pop mean… sd_g…
##    <fctr>    <int>          <dbl>        <dbl>    <dbl>  <dbl> <dbl> <dbl>
##  1 Africa     1952           1253          983  4570010 6.32e⁶  5.99  11.4
##  2 Africa     1957           1385         1135  5093033 7.08e⁶  7.36  14.5
##  3 Africa     1962           1598         1462  5702247 7.96e⁶  8.78  17.2
##  4 Africa     1967           2050         2848  6447875 8.99e⁶ 11.4   23.2
##  5 Africa     1972           2340         3287  7305376 1.01e⁷ 15.1   30.4
##  6 Africa     1977           2586         4142  8328097 1.16e⁷ 18.7   38.1
##  7 Africa     1982           2482         3243  9602857 1.35e⁷ 22.0   46.6
##  8 Africa     1987           2283         2567 11054502 1.53e⁷ 24.1   51.4
##  9 Africa     1992           2282         2644 12674645 1.76e⁷ 26.3   55.1
## 10 Africa     1997           2379         2821 14304480 1.99e⁷ 30.0   63.0
## # ... with 50 more rows
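The example above uses mutate() before summarizing. As a sketch of the "or after" case, the pipeline below adds one column before grouping and another after summarizing. The data frame `toy` and the column names `total_pop` and `pop_million` are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  continent = c("X", "X", "Y", "Y"),
  gdpPercap = c(100, 300, 1000, 3000),
  pop       = c(2e6, 4e6, 1e6, 3e6),
  stringsAsFactors = FALSE
)

result <- toy %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%  # new column before grouping
  group_by(continent) %>%
  summarize(mean_gdp_billion = mean(gdp_billion),
            total_pop = sum(pop)) %>%
  mutate(pop_million = total_pop / 10^6)            # new column after summarizing

result
```

The first mutate() works row by row on the raw data, while the second works on the one-row-per-continent summary.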

This was just a basic introduction to dplyr. It is a powerful package, and if you use R regularly it is worth learning more about it.

End dplyr lesson

Introduce Tidyr Lesson

It is often said that 80% of data analysis is spent on cleaning and preparing data. And it’s not just a first step: it must be repeated many times over the course of an analysis as new problems come to light or new data is collected.
The principles of tidy data provide a standard way to organise data values within a dataset (a collection of values).

Researchers often want to manipulate their data from the wide to the long format, or vice versa.

The long format is where:
* each column is a variable
* each row is an observation

In the long format, you usually have 1 column for the observed variable, and the other columns are ID variables.

In the wide format:
* each row is often a site/subject/patient, and you have multiple observation variables containing the same type of data
* these can be either repeated observations over time, or observations of multiple variables (or a mix of both)

You may find data input simpler in the wide format, and some other applications may prefer it.
However, many R functions have been designed assuming you have long format data.

This lesson will introduce transforming your data between these formats.

These data formats mainly affect readability.
For humans, the wide format is often more intuitive, since we can often see more of the data on the screen due to its shape.
However, the long format is more machine readable and is closer to the formatting of databases.
The ID variables in our dataframes are similar to the fields in a database, and the observed variables are like the database values.
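To make the two shapes concrete, here is the same made-up information stored both ways. The data frames `toy_wide` and `toy_long` are hypothetical and use only base R:

```r
# Wide format: one row per country, one column per year of the same measurement
toy_wide <- data.frame(
  country  = c("A", "B"),
  pop_1952 = c(10, 20),
  pop_1957 = c(11, 22),
  stringsAsFactors = FALSE
)

# Long format: one row per observation, with ID columns (country, year)
# and a single column holding the observed values
toy_long <- data.frame(
  country = c("A", "B", "A", "B"),
  year    = c(1952, 1952, 1957, 1957),
  pop     = c(10, 20, 11, 22),
  stringsAsFactors = FALSE
)

dim(toy_wide)
dim(toy_long)
```

Both hold the same four population values; the wide version spreads them across columns, the long version stacks them in rows.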

# install tidyr and dplyr
#install.packages("tidyr")
#install.packages("dplyr")

# load packages 
library("tidyr")
library("dplyr")

First, let’s look at the structure of our original gapminder dataframe:

gapminder$country <- as.character(gapminder$country)
gapminder$continent <- as.character(gapminder$continent)
str(gapminder)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Sometimes, as with the gapminder dataset, we have multiple types of observed data.
It’s somewhere in between the purely long and wide data formats:
we have 3 ID variables (continent, country, year)
and 3 observation variables (pop, lifeExp, gdpPercap).
This intermediate format is preferable in most cases, despite not having ALL observations in 1 column, given that all 3 observation variables have different units.
There are few operations that would need us to stretch out this dataframe any longer (i.e. 4 ID variables and 1 observation variable).
Many R functions are vector based, and you usually do not want to do mathematical operations on values with different units.
For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful, since it would return the mean of values with 3 incompatible units.
The solution is to first manipulate the data, either by grouping (see the lesson on dplyr) or by changing the structure of the dataframe.
Note: some plotting functions in R actually work better with wide format data.
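A quick sketch of the units problem on a made-up, purely long data frame (`toy_long` and its values are hypothetical). Taking one mean over everything mixes people with years of life, while grouping by observation type first keeps the units apart:

```r
library(dplyr)

# Hypothetical purely long data: two observation types with different units
toy_long <- data.frame(
  country   = c("A", "B", "A", "B"),
  obs_type  = c("pop", "pop", "lifeExp", "lifeExp"),
  obs_value = c(1000000, 3000000, 50, 70),
  stringsAsFactors = FALSE
)

# A single mean mixes head counts and years, so the number means nothing
mean(toy_long$obs_value)

# Grouping by observation type first keeps each unit separate
by_type <- toy_long %>%
  group_by(obs_type) %>%
  summarize(mean_value = mean(obs_value))

by_type
```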

From wide to long format with gather()

Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. your own research data) will never be so well organized. Here let’s start with the wide format version of the gapminder dataset.
We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we use the stringsAsFactors argument for read.csv() to disable that.

download wide gapminder

#download.file('https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv', 'data/gapminder_wide.csv')

gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)

## 'data.frame':    142 obs. of  38 variables:
##  $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
##  $ gdpPercap_1957: num  3014 3828 960 918 617 ...
##  $ gdpPercap_1962: num  2551 4269 949 984 723 ...
##  $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
##  $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
##  $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
##  $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
##  $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
##  $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
##  $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
##  $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
##  $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
##  $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
##  $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
##  $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
##  $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
##  $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
##  $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
##  $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
##  $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
##  $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
##  $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
##  $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
##  $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
##  $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
##  $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
##  $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
##  $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
##  $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
##  $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
##  $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
##  $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
##  $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
##  $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
##  $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
##  $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...

The first step towards getting our nice intermediate data format is to convert from the wide to the long format.

The tidyr function gather() will ‘gather’ your observation variables into a single variable.

gap_long <- gap_wide %>%
  gather(obstype_year, obs_values, starts_with('pop'),
         starts_with('lifeExp'), starts_with('gdpPercap'))
str(gap_long)

## 'data.frame':    5112 obs. of  4 variables:
##  $ continent   : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country     : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ obstype_year: chr  "pop_1952" "pop_1952" "pop_1952" "pop_1952" ...
##  $ obs_values  : num  9279525 4232095 1738315 442308 4469979 ...

Here we have used piping syntax, which is similar to what we were doing with dplyr.
TIP: tidyr and dplyr are compatible, and you can use a mix of tidyr and dplyr functions by piping them together.
Inside gather() we first name the new column for the new ID variable (obstype_year), then the new amalgamated observation variable (obs_values),
then the names of the old observation variables.
We could have typed out all the observation variables, but as in the select() function in dplyr, we can use the starts_with() helper to select all variables that start with the desired character string.
gather() also allows the alternative syntax of using the - symbol to identify which variables are not to be gathered (i.e. the ID variables):

gap_long <- gap_wide %>% gather(obstype_year, obs_value, -continent, -country)
str(gap_long)

## 'data.frame':    5112 obs. of  4 variables:
##  $ continent   : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country     : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ obstype_year: chr  "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
##  $ obs_value   : num  2449 3521 1063 851 543 ...
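The two ways of telling gather() which columns to use can be compared on a small made-up data frame (`toy_wide` is hypothetical). Naming the columns to gather and naming the ID columns to leave alone should yield the same rows:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy wide data
toy_wide <- data.frame(
  country      = c("A", "B"),
  pop_1952     = c(10, 20),
  pop_1957     = c(11, 22),
  lifeExp_1952 = c(40, 50),
  stringsAsFactors = FALSE
)

# Name the columns to gather ...
by_inclusion <- toy_wide %>%
  gather(obstype_year, obs_value, starts_with("pop"), starts_with("lifeExp"))

# ... or name the ID columns to leave alone
by_exclusion <- toy_wide %>%
  gather(obstype_year, obs_value, -country)

# Sort before comparing, since row order follows the order the columns were picked
sorted_incl <- by_inclusion %>% arrange(country, obstype_year)
sorted_excl <- by_exclusion %>% arrange(country, obstype_year)
all.equal(sorted_incl, sorted_excl)
```

Note that in tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(); the older functions still work, and this lesson sticks with them.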

This may seem trivial with this dataset, but sometimes you have 1 ID variable and 40 observation variables with irregular variable names.
That flexibility is a big time saver!
Now obstype_year contains 2 pieces of information: the observation type (pop, lifeExp, or gdpPercap) and the year.
We can use the separate() function to split the character strings into multiple variables.

gap_long <- gap_long %>% separate(obstype_year,into=c('obs_type','year'),sep="_")
gap_long$year <- as.integer(gap_long$year)
str(gap_long)

## 'data.frame':    5112 obs. of  5 variables:
##  $ continent: chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country  : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ obs_type : chr  "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
##  $ year     : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
##  $ obs_value: num  2449 3521 1063 851 543 ...
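As an alternative to converting the year with as.integer() afterwards, separate() has a `convert` argument that asks it to guess the column types of the new pieces. A minimal sketch on made-up data (`toy_long` is hypothetical):

```r
library(tidyr)

# Hypothetical toy long data with a combined "type_year" column
toy_long <- data.frame(
  country      = c("A", "A", "B"),
  obstype_year = c("pop_1952", "lifeExp_1957", "pop_1957"),
  obs_value    = c(10, 40, 22),
  stringsAsFactors = FALSE
)

# Split "pop_1952" into "pop" and 1952; convert = TRUE turns the year into an integer
toy_sep <- separate(toy_long, obstype_year,
                    into = c("obs_type", "year"),
                    sep = "_", convert = TRUE)

str(toy_sep)
```

Doing the conversion explicitly, as in the lesson code, makes the intended type more visible; `convert = TRUE` is just shorter.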

head(gap_long, n=25)

##    continent                  country  obs_type year obs_value
## 1     Africa                  Algeria gdpPercap 1952 2449.0082
## 2     Africa                   Angola gdpPercap 1952 3520.6103
## 3     Africa                    Benin gdpPercap 1952 1062.7522
## 4     Africa                 Botswana gdpPercap 1952  851.2411
## 5     Africa             Burkina Faso gdpPercap 1952  543.2552
## 6     Africa                  Burundi gdpPercap 1952  339.2965
## 7     Africa                 Cameroon gdpPercap 1952 1172.6677
## 8     Africa Central African Republic gdpPercap 1952 1071.3107
## 9     Africa                     Chad gdpPercap 1952 1178.6659
## 10    Africa                  Comoros gdpPercap 1952 1102.9909
## 11    Africa          Congo Dem. Rep. gdpPercap 1952  780.5423
## 12    Africa               Congo Rep. gdpPercap 1952 2125.6214
## 13    Africa            Cote d'Ivoire gdpPercap 1952 1388.5947
## 14    Africa                 Djibouti gdpPercap 1952 2669.5295
## 15    Africa                    Egypt gdpPercap 1952 1418.8224
## 16    Africa        Equatorial Guinea gdpPercap 1952  375.6431
## 17    Africa                  Eritrea gdpPercap 1952  328.9406
## 18    Africa                 Ethiopia gdpPercap 1952  362.1463
## 19    Africa                    Gabon gdpPercap 1952 4293.4765
## 20    Africa                   Gambia gdpPercap 1952  485.2307
## 21    Africa                    Ghana gdpPercap 1952  911.2989
## 22    Africa                   Guinea gdpPercap 1952  510.1965
## 23    Africa            Guinea-Bissau gdpPercap 1952  299.8503
## 24    Africa                    Kenya gdpPercap 1952  853.5409
## 25    Africa                  Lesotho gdpPercap 1952  298.8462

From long to intermediate format with spread()

Let’s use the opposite of gather() to spread our observation variables back out.
We can then spread gap_long to the original intermediate format or to the widest format.
Let’s start with the intermediate format.

gap_normal <- gap_long %>% spread(obs_type,obs_value)
dim(gap_normal)

## [1] 1704    6

dim(gapminder)

## [1] 1704    6

names(gap_normal)

## [1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"

names(gapminder)

## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Now we’ve got an intermediate dataframe gap_normal with the same dimensions as the original gapminder, but the order of the variables is different.

Let’s fix that before checking if they are all.equal().

gap_normal <- gap_normal[,names(gapminder)]
all.equal(gap_normal,gapminder)

## [1] "Component \"country\": 1704 string mismatches"              
## [2] "Component \"pop\": Mean relative difference: 1.634504"      
## [3] "Component \"continent\": 1212 string mismatches"            
## [4] "Component \"lifeExp\": Mean relative difference: 0.203822"  
## [5] "Component \"gdpPercap\": Mean relative difference: 1.162302"

head(gap_normal)

##   country year      pop continent lifeExp gdpPercap
## 1 Algeria 1952  9279525    Africa  43.077  2449.008
## 2 Algeria 1957 10270856    Africa  45.685  3013.976
## 3 Algeria 1962 11000948    Africa  48.303  2550.817
## 4 Algeria 1967 12760499    Africa  51.407  3246.992
## 5 Algeria 1972 14760787    Africa  54.518  4182.664
## 6 Algeria 1977 17152804    Africa  58.014  4910.417

head(gapminder)

##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134

We’re almost there: the values match, but the original was sorted by country, continent, then year.

gap_normal <- gap_normal %>% arrange(country,continent,year)
all.equal(gap_normal,gapminder)

## [1] TRUE

Great! We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code.
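The same two fixes, column order and then row order, can be tried on a small made-up data frame to see how all.equal() reacts at each step (`original` and `shuffled` are hypothetical). Resetting the row names at the end is a precaution added for this sketch, not something the lesson code needed:

```r
library(dplyr)

# Hypothetical toy data
original <- data.frame(
  country = c("A", "A", "B"),
  year    = c(1952L, 1957L, 1952L),
  pop     = c(10, 11, 20),
  stringsAsFactors = FALSE
)

# Same values, but with rows and columns shuffled
shuffled <- original[c(3, 1, 2), c("pop", "country", "year")]

# Step 1: restore the column order; the rows still differ
restored <- shuffled[, names(original)]
same_before_sort <- isTRUE(all.equal(restored, original))

# Step 2: restore the row order, then reset row names as a precaution
restored <- restored %>% arrange(country, year)
rownames(restored) <- NULL
same_after_sort <- isTRUE(all.equal(restored, original))

same_before_sort
same_after_sort
```

Wrapping all.equal() in isTRUE() is useful in scripts, because on a mismatch all.equal() returns a description of the differences rather than FALSE.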

Now let’s convert the long format all the way back to the wide.

In the wide format, we will keep country and continent as ID variables and spread the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year).
We need appropriate labels for all our new variables (time*metric combinations),
and we also need to unify our ID variables to simplify the process of defining gap_wide.

gap_temp <- gap_long %>% unite(var_ID,continent,country,sep = "_")
str(gap_temp)

## 'data.frame':    5112 obs. of  4 variables:
##  $ var_ID   : chr  "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
##  $ obs_type : chr  "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
##  $ year     : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
##  $ obs_value: num  2449 3521 1063 851 543 ...

gap_temp <- gap_long %>%
  unite(ID_var,continent,country,sep = "_") %>%
  unite(var_names,obs_type,year,sep="_")
str(gap_temp)

## 'data.frame':    5112 obs. of  3 variables:
##  $ ID_var   : chr  "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
##  $ var_names: chr  "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
##  $ obs_value: num  2449 3521 1063 851 543 ...
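unite() and separate() are opposites: one pastes columns together with a separator, the other splits them apart again. A small round-trip sketch on made-up data (`toy` is hypothetical):

```r
library(tidyr)

# Hypothetical toy data
toy <- data.frame(
  continent = c("X", "X", "Y"),
  country   = c("A", "B", "C"),
  obs_value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Paste the two ID columns into one, joined by "_"
united <- unite(toy, ID_var, continent, country, sep = "_")
united$ID_var

# separate() reverses it, provided no value itself contains "_"
round_trip <- separate(united, ID_var, c("continent", "country"), sep = "_")
all.equal(round_trip, toy)
```

The caveat in the last comment matters with real data: if a country or continent name contained an underscore, splitting on "_" would produce too many pieces, so a different separator would be needed.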

Using unite(), we now have a single ID variable, which is a combination of continent and country, and we have defined our variable names.
We’re now ready to pipe in spread():

gap_wide_new <- gap_long %>%
  unite(ID_var,continent,country,sep="_") %>%
  unite(var_name,obs_type,year,sep="_") %>%
  spread(var_name,obs_value)
str(gap_wide_new)

## 'data.frame':    142 obs. of  37 variables:
##  $ ID_var        : chr  "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
##  $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
##  $ gdpPercap_1957: num  3014 3828 960 918 617 ...
##  $ gdpPercap_1962: num  2551 4269 949 984 723 ...
##  $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
##  $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
##  $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
##  $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
##  $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
##  $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
##  $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
##  $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
##  $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
##  $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
##  $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
##  $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
##  $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
##  $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
##  $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
##  $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
##  $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
##  $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
##  $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
##  $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
##  $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
##  $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
##  $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
##  $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
##  $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
##  $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
##  $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
##  $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
##  $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
##  $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
##  $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
##  $ pop_2002      : num  31287142 10866106 7026113 1630347 12251209 ...
##  $ pop_2007      : num  33333216 12420476 8078314 1639131 14326203 ...

Now we have a great ‘wide’ format dataframe, but the ID_var could be more usable. Let’s separate it into 2 variables with separate():

gap_wide_betterID <- separate(gap_wide_new,ID_var,c("continent","country"), sep = "_")
gap_wide_betterID <- gap_long %>%
  unite(ID_var,continent,country,sep="_") %>%
  unite(var_name, obs_type,year,sep ="_") %>%
  spread(var_name,obs_value) %>%
  separate(ID_var,c("continent","country"),sep = "_")
str(gap_wide_betterID)

## 'data.frame':    142 obs. of  38 variables:
##  $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
##  $ gdpPercap_1957: num  3014 3828 960 918 617 ...
##  $ gdpPercap_1962: num  2551 4269 949 984 723 ...
##  $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
##  $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
##  $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
##  $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
##  $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
##  $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
##  $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
##  $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
##  $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
##  $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
##  $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
##  $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
##  $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
##  $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
##  $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
##  $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
##  $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
##  $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
##  $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
##  $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
##  $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
##  $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
##  $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
##  $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
##  $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
##  $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
##  $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
##  $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
##  $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
##  $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
##  $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
##  $ pop_2002      : num  31287142 10866106 7026113 1630347 12251209 ...
##  $ pop_2007      : num  33333216 12420476 8078314 1639131 14326203 ...

Check for errors using all.equal():

all.equal(gap_wide,gap_wide_betterID)

## [1] TRUE
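The whole wide to long to wide round trip can also be sketched in miniature. This assumes a made-up data frame with a single ID column (`toy_wide` is hypothetical), so no unite() or separate() step is needed:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy wide data
toy_wide <- data.frame(
  country  = c("A", "B"),
  pop_1952 = c(10, 20),
  pop_1957 = c(11, 22),
  stringsAsFactors = FALSE
)

# Wide to long: stack every non-ID column
toy_long <- toy_wide %>% gather(obstype_year, obs_value, -country)

# Long to wide: spread the stacked values back into one column per name
toy_back <- toy_long %>% spread(obstype_year, obs_value)

all.equal(toy_back, toy_wide)
```

If the comparison reports differences with your own data, row order and column order are the first things to check, as in the gapminder example above.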

Remember: dplyr and tidyr are the Swiss army knives of R.

dplyr:

filter
select
mutate

tidyr:

gather
spread
separate

Data wrangling cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Tidyr wide to long with plot example:

iris[1:2,]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

df <- gather(iris, variable, value, -Species) # gather() the observation variables (columns) into a single value column, keeping Species as the ID

df[1:2, ]

##   Species     variable value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9

str(df)

## 'data.frame':    600 obs. of  3 variables:
##  $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ variable: chr  "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Length" ...
##  $ value   : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

library("ggplot2")

ggplot(df, aes(Species, value, fill=variable)) +
  geom_bar(stat = "identity", position ="dodge")

dplyr-tidyr lesson

Reid Otsuji

1/21/2018

