class: center, middle, inverse, title-slide

# R Summer Workshop
## Data Wrangling and Manipulation
### Alex Stephenson
### UC Berkeley
### 6/29/2020
---

# Owls and Apologies

.center[
<img src="img/owl.jpg" width="500px" height="400px" />
]

---

# Learning Objectives

By the end of this workshop we will hopefully have learned:

--

1. General principles of data wrangling

--

2. Basic, but powerful, tools for reshaping data

--

3. How to apply these principles to actual research projects (plus heteroskedastic error corrections!)

---

# What packages do I need to follow along?

This talk is completely in **R**. We will spend our time using functions from the following packages:

`tidyverse`: The absolute workhorse king for data wrangling in R

`haven`: The best way to read Stata files into R

`estimatr`: The easiest way to get HC* error corrections

`lmtest` and `sandwich`: For when estimatr fails us

---

# Data Wrangling

Working with data means answering three questions

--

What do we want to do?

--

How do we instruct a computer to do that thing?

--

How do we execute that thing in a reasonable amount of time?

---

# Tidy Data

A way to help us get started is to constrain our data to be tidy

Tidy data means:

- Each variable is saved in its own column
- Each observation is saved in its own row
- Each cell is a single value

Tidy data makes the analysis of data **much** easier for you and for your future self

---

# The Verbs of Data Wrangling

Data wrangling has a few major "verbs". The most common ones in everyday data analysis are:

`arrange()`: Order observations according to a condition

`filter()`: Pick observations of interest

`group_by()`: Organize observations into groups

`mutate()`: Make a new variable or change an existing variable according to a condition

`select()`: Include or exclude a variable of interest

`summarise()`: Get a summary of a variable

`pivot_longer()` / `pivot_wider()`: Change the shape of a dataset

A sketch of a few of these verbs in action follows on the next slide.
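---

# Aside: The Verbs in Action

A minimal sketch of the verbs chaining together with the pipe. The `grades` tibble here is made up purely for illustration; it is not one of the workshop datasets.

```r
library(tidyverse)

# A small hypothetical dataset: test scores for students in two sections
grades <- tibble(
  student = c("A", "B", "C", "D", "E", "F"),
  section = c(1, 1, 1, 2, 2, 2),
  score   = c(88, 94, 71, 83, 90, 65)
)

grades %>%
  filter(score >= 70) %>%                            # keep passing students only
  mutate(z = (score - mean(score)) / sd(score)) %>%  # standardize against the full sample
  group_by(section) %>%                              # organize into section groups
  summarise(avg_score = mean(score), n = n()) %>%    # one summary row per group
  arrange(desc(avg_score))                           # order sections by average score
```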
---

# Reshaping Untidy Data Example 1

Suppose we have the following data, based on the standard format of the World Bank WDI

```
##   country y2010 y2011 y2012
## 1     USA   100    85    NA
## 2  Mexico    90    95    99
```

This data frame is an example of *wide* data: the values of a single variable (year) are spread across several columns. For almost everything we do in R, we prefer *long* data.

---

# Reshaping Our Untidy Data Set

The tidyverse package to reshape data is called tidyr. The key functions are `pivot_longer()` and `pivot_wider()`. `pivot_longer()` converts wide data to long. `pivot_wider()` does the reverse.

--

```r
ex1_l <- ex1_w %>%
  pivot_longer(-country, names_to = "year", values_to = "count") %>%
  mutate(year = str_replace_all(year, "y", "") %>%
           trimws() %>%
           as.integer()) #<<
ex1_l
```

```
## # A tibble: 6 x 3
##   country  year count
##   <chr>   <int> <dbl>
## 1 USA      2010   100
## 2 USA      2011    85
## 3 USA      2012    NA
## 4 Mexico   2010    90
## 5 Mexico   2011    95
## 6 Mexico   2012    99
```

The next slide sketches the reverse trip with `pivot_wider()`.
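---

# Aside: Reversing the Pivot

We only need `pivot_longer()` today, but here is a quick sketch of the return trip. Assuming the `ex1_l` tibble from the last slide, `pivot_wider()` rebuilds the wide layout, and `names_prefix` restores the `y` prefix on the column names.

```r
ex1_w2 <- ex1_l %>%
  pivot_wider(names_from = year,    # column whose values become new column names
              values_from = count,  # column that fills in the cells
              names_prefix = "y")   # rebuild names like y2010, y2011, ...
ex1_w2
```

This should reproduce `ex1_w` up to column types, since we turned `year` into an integer on the way down.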
---

# Reshaping Data Example 2

Sometimes our data was made by evil people who put multiple pieces of information in a column name.

```
##   country year employ_r_1519 unemploy_r_1519 retire_r_1519
## 1     USA 2010            50              45            70
## 2  Mexico 2010            60              50            80
```

Here, the authors of this dataset include the type of employment and age in the same variable name.

---

# Fix Untidy Data 2

Our basic structure remains the same. Here we use two tricks to grab columns faster. We can specify the columns we want to pivot with `cols`. We can use `names_sep` to split up a column name.

```r
ex2_l <- ex2_w %>%
  pivot_longer(cols = employ_r_1519:retire_r_1519, #<<
               names_to = c("status", "type", "age"),
               names_sep = "_",
               values_to = "value")
```

```
## # A tibble: 6 x 6
##   country  year status   type  age   value
##   <chr>   <dbl> <chr>    <chr> <chr> <dbl>
## 1 USA      2010 employ   r     1519     50
## 2 USA      2010 unemploy r     1519     45
## 3 USA      2010 retire   r     1519     70
## 4 Mexico   2010 employ   r     1519     60
## 5 Mexico   2010 unemploy r     1519     50
## 6 Mexico   2010 retire   r     1519     80
```

---

# Wrangling Incorrect Data

Data entry is fraught with challenges. As an example, consider Fearon and Laitin 2003.

.center[
<img src="img/Fearon.png" width="500px" height="400px" />
]

Focus on Model 1 of this table.

---

# A Failed Replication

```r
# Replication file for Fearon and Laitin 2003
fearon <- read_dta(here("Workshop1/data/fearon.dta"))

# Main model specification
tryCatch({
  f1 <- glm(onset ~ warl + gdpenl + lpopl1 + lmtnest + ncontig + Oil +
              nwstate + instab + polity2l + ethfrac + relfrac,
            data = fearon,
            family = "binomial")
}, error = function(e){
  print("Replication fail due to y greater than 1")
})
```

```
## [1] "Replication fail due to y greater than 1"
```

---

# Examining the Error

Fearon and Laitin's data fails to replicate because of a data entry error.

```
##      onset
##  Min.   :0.00000
##  1st Qu.:0.00000
##  Median :0.00000
##  Mean   :0.01725
##  3rd Qu.:0.00000
##  Max.   :4.00000
```

The problem entry is Russia 1946

```r
fearon %>%
  filter(onset == 4) %>%
  select(country, year, onset)
```

```
## # A tibble: 1 x 3
##   country  year onset
##   <chr>   <dbl> <dbl>
## 1 RUSSIA   1946     4
```

---

# Fixing the Error

```r
fearon <- fearon %>%
  mutate(onset = case_when(
    onset == 4 ~ 1,
    TRUE ~ onset))

# Now this will work
f1 <- glm(onset ~ warl + gdpenl + lpopl1 + lmtnest + ncontig + Oil +
            nwstate + instab + polity2l + ethfrac + relfrac,
          data = fearon,
          family = "binomial")
tidy(f1)
```

```
## # A tibble: 12 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)  -6.73      0.736     -9.15  5.76e-20
##  2 warl         -0.954     0.314     -3.04  2.40e- 3
##  3 gdpenl       -0.344     0.0718    -4.79  1.67e- 6
##  4 lpopl1        0.263     0.0727     3.62  2.99e- 4
##  5 lmtnest       0.219     0.0848     2.58  9.83e- 3
##  6 ncontig       0.443     0.274      1.62  1.06e- 1
##  7 Oil           0.858     0.279      3.07  2.13e- 3
##  8 nwstate       1.71      0.339      5.05  4.45e- 7
##  9 instab        0.618     0.235      2.63  8.62e- 3
## 10 polity2l      0.0209    0.0168     1.24  2.14e- 1
## 11 ethfrac       0.166     0.373      0.446 6.56e- 1
## 12 relfrac       0.285     0.509      0.560 5.75e- 1
```

---

class: center, middle

# Using the Verbs in a Real Analysis

The first demonstration of a basic workflow with our verbs is a replication of Dan Altman's "The Evolution of Territorial Conquest after 1945 and the Limits of the Norm of Territorial Integrity"

.center[
<img src="img/altman.png" width="400px" height="450px" />
]

---

# Replicating Altman 2020

We will do it as a coefficient plot

.center[
<img src="dataWrangling101_files/figure-html/unnamed-chunk-13-1.png" width="500px" />
]

---

# Step 1: Read in Data

We use `haven::read_dta()` to read in Stata files.

```r
## Use read_dta() from haven to import a .dta file
conquest <- read_dta(here("Conquest REP.dta"))
```

---

# Step 2: Wrangle Data

Using `mutate()`, we apply a log scale to two variables, recode two variables, and create two new variables. Note the `filter()` call to subset to just the rows where `entire == 0` and `retaliatory == 0`.

```r
## Generate variables used in regression specifications
data <- conquest %>%
  mutate(logmilex = log(milex),
         logmilexchal = log(milexadv),
         logmilexshare = logmilex / (logmilexchal + logmilex),
         rspolity2 = polity2 + 10,
         rspolity2adv = polity2adv + 10,
         rspolity2int = rspolity2 * rspolity2adv) %>%
  filter(entire == 0, #<<
         retaliatory == 0) #<<
```

---

# Step 3: Fit Models

We will fit three models. For exposition, here is the first model.

```r
## Logistic model fit for Model 1
m1 <- glm(cowwar ~ popul + ethnic + island + colony + logmilexshare +
            rspolity2 + rspolity2adv + rspolity2int,
          data = data,
          family = "binomial")

## Replicate clustered Stata errors
o1 <- coeftest(m1, #<<
               vcov. = vcovCL(m1, #<<
                              type = "HC0", #<<
                              cluster = data$confnum)) %>% #<<
  tidy() %>%
  mutate(model = "Model 1")
```

The highlighted code is the R implementation of Stata's `cluster()` option for logistic models.

---

# Step 3: Fit Models

For output, we want to put our models into one data frame.

```r
## Bind results into a single data frame
results <- bind_rows(list(o1, o2, o3)) %>%
  mutate(conf.low = estimate - std.error,
         conf.high = estimate + std.error) %>%
  # We don't need the intercept so filter it out
  filter(term != "(Intercept)") %>%
  # Not a necessary step. The following is done to match the Altman paper
  mutate(term = factor(term,
                       levels = c("tcstratloc", "tcresource", "rspolity2int",
                                  "rspolity2adv", "rspolity2", "logmilexshare",
                                  "colony", "island", "ethnic", "garrison",
                                  "popul")))
```

---

# Step 4: Report Results

I'm partial to graphs whenever possible. What follows is example ggplot2 code.

```r
ggplot(results, aes(x = term, y = estimate, group = model)) +
  geom_point() +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  scale_x_discrete(labels = c("Strategic Location", "Natural Resources",
                              "Regime Type Interaction", "Challenger Regime Type",
                              "Defender Regime Type", "Defender Power Share",
                              "Colony", "Island", "Ethnic Motive", "Garrison",
                              "Population")) +
  coord_flip() +
  facet_wrap(~model) +
  ylim(-15, 15) +
  ylab("") +
  xlab("") +
  theme_classic() +
  labs(title = "Conditions Under Which Conquest Attempts More Often Lead to War",
       subtitle = "Table 6",
       caption = "Altman 2020")
```

---

# Part 2: Robust SEs in R the Easy Way

---

# Robust SEs (Conceptual)

Standard errors are always wrong

Your data never has errors distributed with a constant variance

There is practically never any harm in assuming that the errors are not distributed with a constant variance

Everyone is going to ask to see your regressions with heteroskedasticity-consistent standard errors

---

# Robust SEs in R

Base R does not compute HC errors by default. This means that we require additional packages.

- Yes, this is annoying.
- Yes, it used to be easier to do this in Stata than in R.

Fortunately, the `estimatr` package more or less solves all of your problems if you are using OLS as an estimation strategy

---

# Conceptual Example

Imagine we are running an experiment with 1000 people and a pre-treatment covariate x, which represents a pre-test score

```
##     ID         x          y0       y1 treat_indicator           y
## 1 0001 0.9148060  1.94394676 2.943947               0  1.94394676
## 2 0002 0.9370754  1.85185028 2.851850               1  2.85185028
## 3 0003 0.2861395  0.28368327 1.283683               1  1.28368327
## 4 0004 0.8304476  0.96645718 1.966457               0  0.96645718
## 5 0005 0.6417455 -0.07840803 0.921592               0 -0.07840803
## 6 0006 0.5190959  0.32097162 1.320972               0  0.32097162
```

A sketch of how data like this can be simulated follows on the next slide.
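---

# Aside: Simulating Data Like This

The code that generated `fake_data` is not shown in the deck, so the sketch below is **not** the original simulation. It is one data-generating process consistent with the printout (every row has `y1 - y0 = 1` and `x` between 0 and 1); the seed, noise distribution, and constant treatment effect are all assumptions.

```r
library(tidyverse)

set.seed(42)  # any seed; the original seed is unknown

n <- 1000
fake_data <- tibble(
  ID = sprintf("%04d", 1:n),                  # zero-padded IDs like "0001"
  x  = runif(n),                              # pre-test score on [0, 1]
  y0 = x + rnorm(n),                          # potential outcome under control
  y1 = y0 + 1,                                # assumed constant treatment effect of 1
  treat_indicator = rbinom(n, 1, 0.5),        # random assignment to treatment
  y  = if_else(treat_indicator == 1, y1, y0)  # observed outcome
)
```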
---

# Conceptual Example Continued

The estimatr package provides `lm_robust()` to fit linear models with heteroskedasticity-robust standard errors.

```r
lm_vanilla <- lm_robust(y ~ treat_indicator + x,
                        data = fake_data,
                        se_type = "HC0")

lm_stata <- lm_robust(y ~ treat_indicator + x,
                      data = fake_data,
                      se_type = "stata")
```

You can even replicate Stata's standard errors with a single argument.

---

# Conceptual Example Continued

<table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;">
<caption>Statistical models</caption>
<thead>
<tr><th style="padding-left: 5px;padding-right: 5px;"> </th><th style="padding-left: 5px;padding-right: 5px;">Model 1</th><th style="padding-left: 5px;padding-right: 5px;">Model 2</th></tr>
</thead>
<tbody>
<tr style="border-top: 1px solid #000000;"><td style="padding-left: 5px;padding-right: 5px;">(Intercept)</td><td style="padding-left: 5px;padding-right: 5px;">-0.11</td><td style="padding-left: 5px;padding-right: 5px;">-0.11</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(0.07)</td><td style="padding-left: 5px;padding-right: 5px;">(0.07)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">treat_indicator</td><td style="padding-left: 5px;padding-right: 5px;">1.03<sup>***</sup></td><td style="padding-left: 5px;padding-right: 5px;">1.03<sup>***</sup></td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(0.06)</td><td style="padding-left: 5px;padding-right: 5px;">(0.06)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">x</td><td style="padding-left: 5px;padding-right: 5px;">1.14<sup>***</sup></td><td style="padding-left: 5px;padding-right: 5px;">1.14<sup>***</sup></td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(0.11)</td><td style="padding-left: 5px;padding-right: 5px;">(0.11)</td></tr>
<tr style="border-top: 1px solid #000000;"><td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td><td style="padding-left: 5px;padding-right: 5px;">0.27</td><td style="padding-left: 5px;padding-right: 5px;">0.27</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">Adj. R<sup>2</sup></td><td style="padding-left: 5px;padding-right: 5px;">0.26</td><td style="padding-left: 5px;padding-right: 5px;">0.26</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td><td style="padding-left: 5px;padding-right: 5px;">1000</td><td style="padding-left: 5px;padding-right: 5px;">1000</td></tr>
<tr style="border-bottom: 2px solid #000000;"><td style="padding-left: 5px;padding-right: 5px;">RMSE</td><td style="padding-left: 5px;padding-right: 5px;">1.00</td><td style="padding-left: 5px;padding-right: 5px;">1.00</td></tr>
</tbody>
<tfoot>
<tr><td style="font-size: 0.8em;" colspan="3"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td></tr>
</tfoot>
</table>

The next slide sketches the same correction built by hand with `lmtest` and `sandwich`.
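---

# Aside: The Same Correction by Hand

`lm_robust()` only fits linear models. When estimatr fails us (as with the logistic models earlier), `lmtest` plus `sandwich` rebuild the same correction by hand. A minimal sketch, assuming the `fake_data` from before:

```r
library(lmtest)
library(sandwich)

# Fit with plain lm(), then swap in a heteroskedasticity-consistent
# covariance matrix at testing time
fit <- lm(y ~ treat_indicator + x, data = fake_data)
coeftest(fit, vcov. = vcovHC(fit, type = "HC0"))
```

The point estimates and HC0 standard errors should match `lm_vanilla` from the previous slides.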
---

# Benedetto, Hix, and Mastrorocco 2020

Benedetto, Hix, and Mastrorocco (BHM) study the electoral history of social democratic parties in Europe.

They are interested in the correlates of social democratic party vote shares in different time periods during the 20th century

---

# Replicating BHM

For replication, we will focus on their first model of correlates between 1918 and 1939

.center[
<img src="img/bhm.png" width="2835" />
]

---

# Data Cleaning

```r
BHM <- read_dta(here("Workshop1/data/BHM_SocialDemocratsDatabase_Final.dta"))

BHM_wrangled <- BHM %>%
  group_by(country_code, election_year) %>%
  mutate(dup = ifelse(n() == 1, 0, n())) %>%
  ungroup() %>%
  filter(year >= 1918, year <= 1939, #<<
         dup == 0) #<<
```

A more compact version of the dedup step is sketched on the next slide.
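---

# Aside: A Shorter Dedup

Not part of the original replication, just an equivalent formulation: `dplyr::add_count()` collapses the `group_by()` / `mutate()` / `ungroup()` dance into one call. The `n_elections` column name is arbitrary.

```r
## Equivalent dedup using add_count(): n_elections counts the rows
## sharing a country_code/election_year pair
BHM_wrangled2 <- BHM %>%
  add_count(country_code, election_year, name = "n_elections") %>%
  filter(n_elections == 1,
         year >= 1918, year <= 1939)
```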
---

# Model Fit

In Stata, BHM fit the following:

```
xtreg sd_vote_share pubspend_comb elsyst_dm_log sd_gov_single sd_gov_coal_pm sd_gov_coal_other turnout d1920s d1930s if year>=1918 & year<=1939, fe cluster(country_code)
```

The equivalent syntax in R using `estimatr::lm_robust()`:

```r
# Previous data wrangling means we have already subsetted to
# appropriate years
o4 <- lm_robust(sd_vote_share ~ pubspend_comb + elsyst_dm_log +
                  sd_gov_single + sd_gov_coal_pm + sd_gov_coal_other +
                  turnout + d1920s,
                data = BHM_wrangled,
                fixed_effects = ~country_code,
                clusters = country_code,
                se_type = "stata") %>%
  # though for annoying reasons we probably want CR0
  tidy()
```

---

# Model Fit 2

We get the same coefficients and slightly different clustered standard errors.

<table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;">
<caption>Statistical models</caption>
<thead>
<tr><th style="padding-left: 5px;padding-right: 5px;"> </th><th style="padding-left: 5px;padding-right: 5px;">Model 1</th></tr>
</thead>
<tbody>
<tr style="border-top: 1px solid #000000;"><td style="padding-left: 5px;padding-right: 5px;">Public Spending (% of GDP)</td><td style="padding-left: 5px;padding-right: 5px;">-0.10</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(0.14)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">District Magnitude (log)</td><td style="padding-left: 5px;padding-right: 5px;">-2.34</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(5.46)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">SD party in gov't (single-party)</td><td style="padding-left: 5px;padding-right: 5px;">-1.11</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(1.88)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">SD party in gov't (coalition-PM)</td><td style="padding-left: 5px;padding-right: 5px;">0.53</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(3.52)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">SD party in gov't (coalition-junior)</td><td style="padding-left: 5px;padding-right: 5px;">-0.46</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(3.32)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">Turnout</td><td style="padding-left: 5px;padding-right: 5px;">0.27</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(0.13)</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">Decade Dummy</td><td style="padding-left: 5px;padding-right: 5px;">-2.13</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;"> </td><td style="padding-left: 5px;padding-right: 5px;">(1.31)</td></tr>
<tr style="border-top: 1px solid #000000;"><td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td><td style="padding-left: 5px;padding-right: 5px;">0.89</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">Adj. R<sup>2</sup></td><td style="padding-left: 5px;padding-right: 5px;">0.85</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td><td style="padding-left: 5px;padding-right: 5px;">77</td></tr>
<tr><td style="padding-left: 5px;padding-right: 5px;">RMSE</td><td style="padding-left: 5px;padding-right: 5px;">4.69</td></tr>
<tr style="border-bottom: 2px solid #000000;"><td style="padding-left: 5px;padding-right: 5px;">N Clusters</td><td style="padding-left: 5px;padding-right: 5px;">16</td></tr>
</tbody>
<tfoot>
<tr><td style="font-size: 0.8em;" colspan="2"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td></tr>
</tfoot>
</table>

---

# Data Wrangling and Manipulation

Thanks for participating. Code, data, and the presentation are available at https://github.com/asteves/ircp_workshops