Manipulation of dataframes means many things to many researchers: we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s):
or we even calculate summary statistics. We can do these operations using the normal base R operations:
base-r: * grouping variables * calculate stats
# download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv", destfile = "data/gapminder-FiveYearData.csv")
gapminder <- read.csv(file = "data/gapminder-FiveYearData.csv", header = TRUE)
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent =="Americas","gdpPercap"])
## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
## [1] 7902.15
This is difficult because of repetition. repetition cost time, and could introduce bugs.
useful functions for manipulation dataframes efficiently, dplyr grammar might be easier to read
5 most commonly used functions and (%>%) pipe to combine them 1. select() 2. filter() 3. group_by() 4. summarize() 5. mutate()
# install dplyr package:
install.packages('dplyr')
# load package
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if we wanted to move forward with only a few of the variables (columns) in our dataframe we could use the select() function. This will keep only the variables you select
year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp) #use head to get a sample
## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
year_country_gdp
shows variable only contains the year, country, and gdpPercap. We used normal grammar.
the benefit of dplyr is using it to combine several functions using pipes. (%>%
) since we are introducing pipes, let’s repeat what we’ve done in the previous code using pipes:
pipe short cut CTRL + SHIFT + M (or CMD + SHIFT + M for OSX)
# redoing varialble using pipes
year_country_gdp <- gapminder %>% select(year, country, gdpPercap )
head(year_country_gdp) #using head to get a sample
## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
step by step: 1. summon gapminder dataframe 2. pass it on using the pipe symbol %>% to the select() function
If we continue the code example, but this time only with European countries we can combine select() and filter()
year_country_gdp_euro <- gapminder %>%
filter(continent =="Europe") %>%
select(year,country,gdpPercap)
head(year_country_gdp_euro) # use head for sample
## year country gdpPercap
## 1 1952 Albania 1601.056
## 2 1957 Albania 1942.284
## 3 1962 Albania 2312.889
## 4 1967 Albania 2760.197
## 5 1972 Albania 3313.422
## 6 1977 Albania 3533.004
http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/
What’s happening: 1. first pass the gapminder dataframe to the filter()
2. then pass the filtered version of the gapminder dataframe to the select()
3. Note: order of operations is very important in this case. ** if select() was used first, filter would not be able to find the variable continent since we removed it in the previous step.**
filter()
, which will only pass observations that meet your criteria: continent =="Europe"
group_by()
wich will use every unique criteria that you could have used in filter# take a look at gapminder str()
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
str(gapminder %>% group_by(continent))
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## - attr(*, "vars")= chr "continent"
## - attr(*, "drop")= logi TRUE
## - attr(*, "indices")=List of 5
## ..$ : int 24 25 26 27 28 29 30 31 32 33 ...
## ..$ : int 48 49 50 51 52 53 54 55 56 57 ...
## ..$ : int 0 1 2 3 4 5 6 7 8 9 ...
## ..$ : int 12 13 14 15 16 17 18 19 20 21 ...
## ..$ : int 60 61 62 63 64 65 66 67 68 69 ...
## - attr(*, "group_sizes")= int 624 300 396 360 24
## - attr(*, "biggest_group_size")= int 624
## - attr(*, "labels")='data.frame': 5 obs. of 1 variable:
## ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
## ..- attr(*, "vars")= chr "continent"
## ..- attr(*, "drop")= logi TRUE
continent
group_by()
with summarize()
this will allow us to create new variable(s) by using functions that repeat for each of the continent specific data frames.
e.g. using group_by() we split original gapminder dataframe into multiple pieces
then run functions e.g. mean()
, sd()
or summarize()
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
gdp_bycontinents
## # A tibble: 5 x 2
## continent mean_gdpPercap
## <fctr> <dbl>
## 1 Africa 2194
## 2 Americas 7136
## 3 Asia 7902
## 4 Europe 14469
## 5 Oceania 18622
The fucntion group_by() allows us to group by multiple variables. e.g. let’s group by year
and contient
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop))
gdp_bycontinents_byyear
## # A tibble: 60 x 6
## # Groups: continent [?]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1253 983 4570010 6317450
## 2 Africa 1957 1385 1135 5093033 7076042
## 3 Africa 1962 1598 1462 5702247 7957545
## 4 Africa 1967 2050 2848 6447875 8985505
## 5 Africa 1972 2340 3287 7305376 10130833
## 6 Africa 1977 2586 4142 8328097 11585184
## 7 Africa 1982 2482 3243 9602857 13456243
## 8 Africa 1987 2283 2567 11054502 15277484
## 9 Africa 1992 2282 2644 12674645 17562719
## 10 Africa 1997 2379 2821 14304480 19873013
## # ... with 50 more rows
Skip to: ## mutate ()
we can also create new variables prior to (or after) summarizing information using mutate()
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion=gdpPercap*pop/10^9) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop),
mean_gdp_billion=mean(gdp_billion),
sd_gdp_billion=sd(gdp_billion))
gdp_pop_bycontinents_byyear
## # A tibble: 60 x 8
## # Groups: continent [?]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop mean… sd_g…
## <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1253 983 4570010 6.32e⁶ 5.99 11.4
## 2 Africa 1957 1385 1135 5093033 7.08e⁶ 7.36 14.5
## 3 Africa 1962 1598 1462 5702247 7.96e⁶ 8.78 17.2
## 4 Africa 1967 2050 2848 6447875 8.99e⁶ 11.4 23.2
## 5 Africa 1972 2340 3287 7305376 1.01e⁷ 15.1 30.4
## 6 Africa 1977 2586 4142 8328097 1.16e⁷ 18.7 38.1
## 7 Africa 1982 2482 3243 9602857 1.35e⁷ 22.0 46.6
## 8 Africa 1987 2283 2567 11054502 1.53e⁷ 24.1 51.4
## 9 Africa 1992 2282 2644 12674645 1.76e⁷ 26.3 55.1
## 10 Africa 1997 2379 2821 14304480 1.99e⁷ 30.0 63.0
## # ... with 50 more rows
This was just a basic introduction to dplyr, it’s powerful and if you use R learn more about it.
It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected.
The principles of tidy data provide a standard way to organise data values within a dataset (collection of values).
researchers often want to manipulate their data from the wide
to long
format, or vise versa.
The long format is where: * each column is a variable * each row is an observation
In the long format, you have 1 column for the observed variable
and the other columns are ID variables
for the wide
format: * each row is often a site/subject/or patient
and you have multiple observation variables containing the same type of data * these can be either repeated observations over time
, or observations of multiple variables` (or a mix of both)
you may find data input may be simpler or some other applications may prefer the wide
format.
many of R functions have been designed assuming you have long
format data.
** This lesson will introduce transforming your data. **
These data formats maily affect readbility.
for humans, the wide format is often more intuitive sicne we can often see more of the data on the screen due to its shape.
however, the long format is more machine readable and is closer to formatting of databases
The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values
# install tidyr and dplyr
#install.packages("tidyr")
#install.packages("dplyr")
# load packages
library("tidyr")
library("dplyr")
gapminder$country <- as.character(gapminder$country)
gapminder$continent <- as.character(gapminder$continent)
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
long
and wide
data formatsID variables
(continent
, country
, year
)pop
, lifeExp
, gdpPercap
)There are few operations that would need us to stretch out this dataframe any longer (i.e. 4 ID variables and 1 Observation variable).
dplyr
), or we change the structure of the dataframe.Note: Some plotting functions in R actually work better in the wide format data.
Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. your own research data) will never be so well organized. Here let’s start with the wide format version of the gapminder dataset.
We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we use the stringsAsFactors argument for read.csv() to disable that.
#download.file('https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv', 'data/gapminder_wide.csv')
gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)
## 'data.frame': 142 obs. of 38 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
## $ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
* The first step towards getting our nice intermediate data format is to first convert from the wide to the long format.
gather()
will ‘gather’ your observation variables into a single variable.gap_long <- gap_wide %>%
gather(obstype_year, obs_values, starts_with('pop'),
starts_with('lifeExp'), starts_with('gdpPercap'))
str(gap_long)
## 'data.frame': 5112 obs. of 4 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: chr "pop_1952" "pop_1952" "pop_1952" "pop_1952" ...
## $ obs_values : num 9279525 4232095 1738315 442308 4469979 ...
TIP: tidyr and dplyr are compatible and you can use a mix of tidyr and dplyr functions by piping them together.
gather()
we first name the new column for the new ID variable (obstype_year
) the name for the new amalgamated observation variable (obs_value
)we could have typed out all the observation variables, but as in the select()
function in dplyr, we can use the starts_with()
argument to select all variables that starts with the desired character string.
gather also allows the alternative syntax fo using the -
symbol to identfy which variables are not to be gathered (i.e. ID variables
)
gap_long <- gap_wide %>% gather(obstype_year,obs_value,-continent,-country)
str(gap_long)
## 'data.frame': 5112 obs. of 4 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: chr "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
## $ obs_value : num 2449 3521 1063 851 543 ...
this may seem trival with this dataset, but sometimes you have 1 ID variable and 40 Observations variables with irregular variable names
flexibility is a big time saver!
now obstype_year
contains 2 pieces of information, the observation type(pop, lifeExp, or gdpPercap) and the year.
we can use the separate()
function to split the character strings into multiple variables
gap_long <- gap_long %>% separate(obstype_year,into=c('obs_type','year'),sep="_")
gap_long$year <- as.integer(gap_long$year)
str(gap_long)
## 'data.frame': 5112 obs. of 5 variables:
## $ continent: chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obs_type : chr "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
## $ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
## $ obs_value: num 2449 3521 1063 851 543 ...
head(gap_long, n=25)
## continent country obs_type year obs_value
## 1 Africa Algeria gdpPercap 1952 2449.0082
## 2 Africa Angola gdpPercap 1952 3520.6103
## 3 Africa Benin gdpPercap 1952 1062.7522
## 4 Africa Botswana gdpPercap 1952 851.2411
## 5 Africa Burkina Faso gdpPercap 1952 543.2552
## 6 Africa Burundi gdpPercap 1952 339.2965
## 7 Africa Cameroon gdpPercap 1952 1172.6677
## 8 Africa Central African Republic gdpPercap 1952 1071.3107
## 9 Africa Chad gdpPercap 1952 1178.6659
## 10 Africa Comoros gdpPercap 1952 1102.9909
## 11 Africa Congo Dem. Rep. gdpPercap 1952 780.5423
## 12 Africa Congo Rep. gdpPercap 1952 2125.6214
## 13 Africa Cote d'Ivoire gdpPercap 1952 1388.5947
## 14 Africa Djibouti gdpPercap 1952 2669.5295
## 15 Africa Egypt gdpPercap 1952 1418.8224
## 16 Africa Equatorial Guinea gdpPercap 1952 375.6431
## 17 Africa Eritrea gdpPercap 1952 328.9406
## 18 Africa Ethiopia gdpPercap 1952 362.1463
## 19 Africa Gabon gdpPercap 1952 4293.4765
## 20 Africa Gambia gdpPercap 1952 485.2307
## 21 Africa Ghana gdpPercap 1952 911.2989
## 22 Africa Guinea gdpPercap 1952 510.1965
## 23 Africa Guinea-Bissau gdpPercap 1952 299.8503
## 24 Africa Kenya gdpPercap 1952 853.5409
## 25 Africa Lesotho gdpPercap 1952 298.8462
spread()
gather
to spread our observation variables back outgap_long()
to the original intermediate format or the widest formatgap_normal <- gap_long %>% spread(obs_type,obs_value)
dim(gap_normal)
## [1] 1704 6
dim(gapminder)
## [1] 1704 6
names(gap_normal)
## [1] "continent" "country" "year" "gdpPercap" "lifeExp" "pop"
names(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
*Now we’ve got an intermediate dataframe gap_normal
with the same dimensions as the original gapminder
, but the order of the variables is different
.
all.equal()
.gap_normal <- gap_normal[,names(gapminder)]
all.equal(gap_normal,gapminder)
## [1] "Component \"country\": 1704 string mismatches"
## [2] "Component \"pop\": Mean relative difference: 1.634504"
## [3] "Component \"continent\": 1212 string mismatches"
## [4] "Component \"lifeExp\": Mean relative difference: 0.203822"
## [5] "Component \"gdpPercap\": Mean relative difference: 1.162302"
head(gap_normal)
## country year pop continent lifeExp gdpPercap
## 1 Algeria 1952 9279525 Africa 43.077 2449.008
## 2 Algeria 1957 10270856 Africa 45.685 3013.976
## 3 Algeria 1962 11000948 Africa 48.303 2550.817
## 4 Algeria 1967 12760499 Africa 51.407 3246.992
## 5 Algeria 1972 14760787 Africa 54.518 4182.664
## 6 Algeria 1977 17152804 Africa 58.014 4910.417
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
country
, continent
, then year
.gap_normal <- gap_normal %>% arrange(country,continent,year)
all.equal(gap_normal,gapminder)
## [1] TRUE
** Now lets convert the long all the way back to the wide. **
in the wide format, we will keep country and continent as ID variables and spread the observations across the 3 metrics (pop
, lifeExp
, gdpPercap
) and time (year
)
we need appropriate lables for all our new variables (time*metric combinations
)
we also need to unify our ID varialbes to simplify the process of defining gap_wide
gap_temp <- gap_long %>% unite(var_ID,continent,country,sep = "_")
str(gap_temp)
## 'data.frame': 5112 obs. of 4 variables:
## $ var_ID : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
## $ obs_type : chr "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
## $ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
## $ obs_value: num 2449 3521 1063 851 543 ...
gap_temp <- gap_long %>%
unite(ID_var,continent,country,sep = "_") %>%
unite(var_names,obs_type,year,sep="_")
str(gap_temp)
## 'data.frame': 5112 obs. of 3 variables:
## $ ID_var : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
## $ var_names: chr "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
## $ obs_value: num 2449 3521 1063 851 543 ...
unite()
we now have a single ID variable which is a combination of continent
,country
,and we have defined variable names.spread()
gap_wide_new <- gap_long %>%
unite(ID_var,continent,country,sep="_") %>%
unite(var_name,obs_type,year,sep="_") %>%
spread(var_name,obs_value)
str(gap_wide_new)
## 'data.frame': 142 obs. of 37 variables:
## $ ID_var : chr "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
## $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : num 31287142 10866106 7026113 1630347 12251209 ...
## $ pop_2007 : num 33333216 12420476 8078314 1639131 14326203 ...
ID_var
could be more usable, let’s separate it into 2 variables with separate()
gap_wide_betterID <- separate(gap_wide_new,ID_var,c("continent","country"), sep = "_")
gap_wide_betterID <- gap_long %>%
unite(ID_var,continent,country,sep="_") %>%
unite(var_name, obs_type,year,sep ="_") %>%
spread(var_name,obs_value) %>%
separate(ID_var,c("continent","country"),sep = "_")
str(gap_wide_betterID)
## 'data.frame': 142 obs. of 38 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : num 31287142 10866106 7026113 1630347 12251209 ...
## $ pop_2007 : num 33333216 12420476 8078314 1639131 14326203 ...
all.equals()
all.equal(gap_wide,gap_wide_betterID)
## [1] TRUE
remember dplyr and tidyr * are the Swiss army knives of R.
[data wrangling - cheat sheet]https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
iris[1:2,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
df <- gather(iris, variable, value, -Species) #gather() observation variables (rows) into a single variable df
df[1:2, ]
## Species variable value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
str(df)
## 'data.frame': 600 obs. of 3 variables:
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ variable: chr "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Length" ...
## $ value : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
library("ggplot2")
ggplot(df, aes(Species, value, fill=variable)) +
geom_bar(stat = "identity", position ="dodge")