Lab 2 - R Code#
Group 3: Dube, V., Garay, E. Guerrero, J., Villalba, M.
Multicollinearity#
Multicollinearity occurs when two or more predictors in a regression model are highly correlated with one another, inflating the variance of the estimated coefficients. To understand how multicollinearity affects our regression, we can examine the composition of the variance of our estimates.
Suppose some Data Generating Process follows:
\begin{equation*} Y = X_1\beta_1 +\epsilon \end{equation*}
Considering the partitioned regression model:
\begin{align*} Y &= X\beta + e \\ Y &= X_1\beta_1 + X_2\beta_2 + e \end{align*}
We know that the OLS estimator will solve this equation:
\begin{align*} (X'X)\hat{\beta} &= X'Y \\ \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{bmatrix} &= \begin{bmatrix} X_1'Y \\ X_2'Y \end{bmatrix} \end{align*}
By the Frisch–Waugh–Lovell theorem, this yields:
\begin{align*} \hat{\beta_1} &= (X_1'M_2X_1)^{-1}X_1'M_2Y \end{align*}
Where \(M_2 = I - X_2(X_2'X_2)^{-1}X_2'\) is the annihilator matrix, i.e. the orthogonal projection onto the space orthogonal to the columns of \(X_2\).
Note that \(M_2\) is symmetric and idempotent, and that premultiplying any variable by it yields the residual from regressing that variable on \(X_2\). For an arbitrary variable \(Z\):
\begin{align*} M_2Z &= (I - X_2(X_2'X_2)^{-1}X_2')Z \\ &= Z - X_2(X_2'X_2)^{-1}X_2'Z \\ &= Z - X_2\hat{\omega} \\ &= Z - \hat{Z} \\ &= e_{Z} \end{align*}
Where \(e_{Z}\) and \(\hat{\omega} \equiv (X_2'X_2)^{-1}X_2'Z\) come from the regression: $\( Z = X_2\hat{\omega} + e_{Z}\)$
In a sense, the \(M_2\) matrix cleanses or "filters out" our \(Z\) variable, keeping only the part that is orthogonal to \(X_2\).
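This property can be verified numerically. The following is a minimal sketch with simulated data (the sample size, regressors, and seed are arbitrary choices):

# Check that M2 %*% Z equals the residuals from regressing Z on X2
set.seed(123)
n  <- 50
X2 <- cbind(1, rnorm(n), rnorm(n))                       # arbitrary regressors (with a constant)
Z  <- rnorm(n)
M2 <- diag(n) - X2 %*% solve(t(X2) %*% X2) %*% t(X2)     # annihilator matrix
e_Z <- residuals(lm(Z ~ X2 - 1))                         # residuals from regressing Z on X2
all.equal(as.vector(M2 %*% Z), unname(e_Z))              # TRUE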
Returning to \(\hat{\beta_1}\):
\begin{align*} \hat{\beta_1} &= (X_1'M_2X_1)^{-1}X_1'M_2Y \\ &= (X_1'M_2X_1)^{-1}X_1'M_2(X_1\beta_1 + \epsilon) \\ &= \beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2\epsilon \end{align*}
This has important implications for the conditional variance of \(\hat{\beta_1}\):
\begin{align*} Var(\hat{\beta_1}|X) &= Var(\beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2\epsilon|X) \\ &= Var((X_1'M_2X_1)^{-1}X_1'M_2\epsilon|X) \\ &= E[((X_1'M_2X_1)^{-1}X_1'M_2\epsilon)((X_1'M_2X_1)^{-1}X_1'M_2\epsilon)'|X] \\ &= E[(X_1'M_2X_1)^{-1}X_1'M_2\epsilon\epsilon'M_2'X_1(X_1'M_2'X_1)^{-1}|X] \\ &= (X_1'M_2X_1)^{-1}X_1'M_2E[\epsilon\epsilon'|X]M_2'X_1(X_1'M_2'X_1)^{-1} \end{align*}
Under the traditional assumption that \(E[\epsilon\epsilon'|X] = \sigma^2I\):
\begin{align*} Var(\hat{\beta_1}|X) &= \sigma^2(X_1'M_2X_1)^{-1}X_1'M_2M_2'X_1(X_1'M_2'X_1)^{-1} \\ &= \sigma^2(X_1'M_2'X_1)^{-1} \end{align*}
Recall that, using the auxiliary regression of \(X_1\) on \(X_2\), the variance of \(X_1\) can be decomposed into two non-negative components:
\begin{align*} X_1 &= X_2\alpha + v \\ Var(X_1) &= Var(X_2\alpha) + Var(v) \\ Var(X_1) - Var(X_2\alpha) &= Var(v) \\ E[X_1'X_1] - Var(X_2\alpha) &= E[X_1'M_2'X_1] \end{align*}
Thus, necessarily: $\(E[X_1'M_2X_1] \leq E[X_1'X_1]\)$
Altogether this means: $\(\sigma^2(X_1'X_1)^{-1} \leq \sigma^2(X_1'M_2'X_1)^{-1}\)$
This shows that controlling for the irrelevant variables \(X_2\) will in fact increase the variance of \(\hat{\beta_1}\) by limiting us to the “usable” variance of \(X_1\) which is orthogonal to \(X_2\).
Suppose we want to estimate the impact of years of schooling on future salary. Imagine as well that we have a vast array of possible control variables at our disposal. Someone who is not familiar with the concept of multicollinearity might think that, to avoid any possibility of omitted variable bias and ensure consistency, it is best to control for everything we can. We now know this is not the case and that this approach can inadvertently introduce multicollinearity.
Consider that we have as a possible control variable the total number of courses taken by each student. Intuitively, years of schooling are likely to correlate strongly with the total number of courses taken (more years in school typically leads to more courses completed), so controlling for this variable may result in the problem described above, inflating the variance of the estimated coefficients and potentially distorting our understanding of the true effect of schooling on salary.
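A small simulation makes the point concrete. This is only a sketch with made-up coefficients: in the simulated DGP only schooling matters, and courses is just a noisy multiple of schooling.

# Variance inflation from controlling for a highly correlated, redundant variable
set.seed(123)
n <- 500
schooling <- rnorm(n, mean = 12, sd = 2)
courses   <- 8 * schooling + rnorm(n)                 # almost collinear with schooling
salary    <- 1000 * schooling + rnorm(n, sd = 5000)   # courses play no role in the DGP
m_short <- lm(salary ~ schooling)                     # correct specification
m_long  <- lm(salary ~ schooling + courses)           # adds the redundant control
# Compare the standard errors on schooling: the long model's is much larger
summary(m_short)$coefficients["schooling", "Std. Error"]
summary(m_long)$coefficients["schooling", "Std. Error"]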
The same rationale applies to many other settings. For instance, imagine estimating the impact of marketing expenditure on sales: controlling for a variable such as the number of marketing campaigns will likely cause the same issue.
Perfectly collinear regressors#
A special case of multicollinearity arises when one variable is an exact linear combination of other variables in our dataset; these variables are then not merely highly correlated but perfectly collinear.
Considering the partitioned regression model:
\begin{align*} Y &= X_1\beta_1 + X_2\beta_2 + \epsilon \\ X_1 &= X_2\alpha \end{align*}
where the second equation is deterministic.
We know that the OLS estimator will solve this equation:
\begin{align*} (X'X)\hat{\beta} &= X'Y \\ \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{bmatrix} &= \begin{bmatrix} X_1'Y \\ X_2'Y \end{bmatrix} \end{align*}
Substituting \(X_1 = X_2 \alpha\) on the \((X'X)\) matrix:
\begin{align*} (X'X) &= \begin{bmatrix} (X_2\alpha)'X_2\alpha & (X_2\alpha)'X_2 \\ X_2'(X_2\alpha) & X_2'X_2 \end{bmatrix} \\ &= \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \end{align*}
This yields the determinant:
\begin{align*} \det (X'X) &= \det\left( \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right) \end{align*}
Transforming the matrix using row operations (premultiplying by a unit triangular block matrix, whose determinant is one, leaves the determinant unchanged):
\begin{align*} \det (X'X) &= \det\left( \begin{bmatrix} I & -\alpha' \\ 0 & I \end{bmatrix} \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right) \\ &= \det\left( \begin{bmatrix} 0 & 0 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right) \end{align*}
Since the first block row of the transformed matrix is zero, we conclude that: \begin{align*} \det (X'X) &= 0 \end{align*}
Because of this, \((X'X)\) is not invertible and the OLS normal equations do not have a unique solution.
Practical application#
We can illustrate this theoretical result with a short practical application. We will create a dataset simulating the regressors that we may want to include in the estimation of a linear model.
The first nine variables are drawn from a standard normal distribution.
set.seed(0)
# 10 observations of 9 standard normal regressors (note that the object reuses the name "matrix")
matrix <- matrix(rnorm(90), nrow=10, ncol=9)
matrix
1.262954285 | 0.7635935 | -0.22426789 | -0.2357066 | 1.7579031 | 0.26613736 | 0.35872890 | 0.01915639 | -0.7970895 |
-0.326233361 | -0.7990092 | 0.37739565 | -0.5428883 | 0.5607461 | -0.37670272 | -0.01104548 | 0.25733838 | 1.2540831 |
1.329799263 | -1.1476570 | 0.13333636 | -0.4333103 | -0.4527840 | 2.44136463 | -0.94064916 | -0.64901008 | 0.7721422 |
1.272429321 | -0.2894616 | 0.80418951 | -0.6494716 | -0.8320433 | -0.79533912 | -0.11582532 | -0.11916876 | -0.2195156 |
0.414641434 | -0.2992151 | -0.05710677 | 0.7267507 | -1.1665705 | -0.05487747 | -0.81496871 | 0.66413570 | -0.4248103 |
-1.539950042 | -0.4115108 | 0.50360797 | 1.1519118 | -1.0655906 | 0.25014132 | 0.24226348 | 1.10096910 | -0.4189801 |
-0.928567035 | 0.2522234 | 1.08576936 | 0.9921604 | -1.5637821 | 0.61824329 | -1.42509839 | 0.14377148 | 0.9969869 |
-0.294720447 | -0.8919211 | -0.69095384 | -0.4295131 | 1.1565370 | -0.17262350 | 0.36594112 | -0.11775360 | -0.2757780 |
-0.005767173 | 0.4356833 | -1.28459935 | 1.2383041 | 0.8320471 | -2.22390027 | 0.24841265 | -0.91206837 | 1.2560188 |
2.404653389 | -1.2375384 | 0.04672617 | -0.2793463 | -0.2273287 | -1.26361438 | 0.06528818 | -1.43758624 | 0.6466744 |
However, the 10th variable is a linear combination (the sum) of variables 1, 5 and 9.
# Append a 10th column equal to the sum of columns 1, 5 and 9
coefficients <- c(1, 1, 1)
selected_columns <- c(1, 5, 9)
matrix <- cbind(matrix, matrix[, selected_columns] %*% coefficients)
matrix
1.262954285 | 0.7635935 | -0.22426789 | -0.2357066 | 1.7579031 | 0.26613736 | 0.35872890 | 0.01915639 | -0.7970895 | 2.2237678 |
-0.326233361 | -0.7990092 | 0.37739565 | -0.5428883 | 0.5607461 | -0.37670272 | -0.01104548 | 0.25733838 | 1.2540831 | 1.4885958 |
1.329799263 | -1.1476570 | 0.13333636 | -0.4333103 | -0.4527840 | 2.44136463 | -0.94064916 | -0.64901008 | 0.7721422 | 1.6491575 |
1.272429321 | -0.2894616 | 0.80418951 | -0.6494716 | -0.8320433 | -0.79533912 | -0.11582532 | -0.11916876 | -0.2195156 | 0.2208704 |
0.414641434 | -0.2992151 | -0.05710677 | 0.7267507 | -1.1665705 | -0.05487747 | -0.81496871 | 0.66413570 | -0.4248103 | -1.1767394 |
-1.539950042 | -0.4115108 | 0.50360797 | 1.1519118 | -1.0655906 | 0.25014132 | 0.24226348 | 1.10096910 | -0.4189801 | -3.0245207 |
-0.928567035 | 0.2522234 | 1.08576936 | 0.9921604 | -1.5637821 | 0.61824329 | -1.42509839 | 0.14377148 | 0.9969869 | -1.4953622 |
-0.294720447 | -0.8919211 | -0.69095384 | -0.4295131 | 1.1565370 | -0.17262350 | 0.36594112 | -0.11775360 | -0.2757780 | 0.5860385 |
-0.005767173 | 0.4356833 | -1.28459935 | 1.2383041 | 0.8320471 | -2.22390027 | 0.24841265 | -0.91206837 | 1.2560188 | 2.0822988 |
2.404653389 | -1.2375384 | 0.04672617 | -0.2793463 | -0.2273287 | -1.26361438 | 0.06528818 | -1.43758624 | 0.6466744 | 2.8239991 |
As the theory predicts, the resulting 10×10 data matrix has a determinant of zero. Thus, it is singular and R returns an error when we try to invert it.
det(matrix)
solve(matrix)
Error in solve.default(matrix): Lapack routine dgesv: system is exactly singular: U[10,10] = 0
Traceback:
1. solve(matrix)
2. solve.default(matrix)
Nonetheless, this is not always the case when running the same experiment in other programming languages. For instance, Python and Julia report a determinant extremely close to zero but not exactly zero, and are therefore able to return a "supposed" inverse of the matrix. This might seem to contradict what we proved above; however, it is not a contradiction of the theory but rather an illustration of how different computational environments deal with the inherent limitations of floating-point arithmetic.
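In R we can diagnose the same issue more robustly by looking at the numerical rank and the condition number of the matrix rather than the raw determinant (both `qr()` and `kappa()` are base R functions):

# The QR decomposition detects the rank deficiency numerically
qr(matrix)$rank          # 9 rather than 10: one column is redundant
# The condition number is enormous, another symptom of (near) singularity
kappa(matrix, exact = TRUE)

This is also why `lm()` typically reports an `NA` coefficient for a perfectly collinear regressor instead of failing: its pivoted QR decomposition drops the redundant column.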
Analyzing RCT data with precision adjustment#
# Load the Pennsylvania reemployment bonus experiment data
df <- read.table("path_to/penn_jae.dat", header=TRUE, sep="", dec=".", fill=TRUE)
library(dplyr)
penn <- filter(df, tg %in% c(2, 0))            # keep the control group (tg == 0) and treatment group 2
penn$tg[penn$tg == 2] <- 1                     # recode treatment group 2 as a 0/1 indicator
names(penn)[names(penn) == "tg"] <- "T2"       # rename the treatment indicator to T2
penn$dep <- as.factor(as.character(penn$dep))  # number of dependents as a factor
Histogram#
library(dplyr)
treat <- filter(penn, T2 == 1) %>%
select(inuidur1)
notreat <- filter(penn, T2 == 0) %>%
select(inuidur1)
library(ggplot2)
ggplot(treat, aes(x=inuidur1)) +
geom_histogram(bins=15, fill="gray", colour="black", aes(y = after_stat(density))) +
xlim(0, 60) +
ggtitle("Distribution of weeks unemployed for treated individuals") +
xlab("Weeks unemployed") +
ylab("Probability Density")
#1 Classical 2-Sample Approach (CL)
alpha <- 0.05
# Calculate the t-statistic and p-value using the Welch two-sample t-test
t_test_result <- t.test(treat$inuidur1, notreat$inuidur1, alternative = "two.sided", var.equal = FALSE)
p_value <- t_test_result$p.value
reject_H0 <- p_value < alpha
# Print results
alpha
p_value
reject_H0
# Simple linear regression model using log of duration:
ols_cl <- lm(log(inuidur1) ~ T2, data = penn)
# To display the summary of the regression model:
summary(ols_cl)
#2 Classical Linear Regression Adjustment (CRA)
# Specifying the model with interaction terms
# In R, ^2 applied to a sum of terms in a formula includes all main effects and their pairwise interactions
ols_cra <- lm(log(inuidur1) ~ T2 + (female + black + othrace + dep + q2 + q3 + q4 + q5 + q6 + agelt35 + agegt54 + durable + lusd + husd)^2, data = penn)
# To display the summary of the regression model
summary(ols_cra)
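To see concretely what `^2` does inside a formula, here is a quick check on a hypothetical toy data frame (the names `a`, `b`, `c` are purely illustrative):

# ^2 on a sum of terms expands to all main effects plus all pairwise interactions
toy <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5))
colnames(model.matrix(~ (a + b + c)^2, data = toy))
# "(Intercept)" "a" "b" "c" "a:b" "a:c" "b:c"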
# If you use stargazer or another package to create nice tables (similar to regtable in Julia):
if("stargazer" %in% rownames(installed.packages()) == FALSE) {install.packages("stargazer")}
library(stargazer)
stargazer(ols_cra, type = "text", title = "CRA model", single.row = TRUE)
#3 Interactive Regression Adjustment (IRA)
# R function to demean a matrix
desv_mean <- function(a) {
apply(a, 2, function(x) x - mean(x))
}
# Specifying the controls and generating the model matrix
# (model.matrix ignores the response in reg1 and keeps only the right-hand-side terms)
reg1 <- T2 ~ (female + black + othrace + dep + q2 + q3 + q4 + q5 + q6 + agelt35 + agegt54 + durable + lusd + husd)^2
X <- model.matrix(reg1, data = penn)[, -1]   # drop the intercept column
# Applying the demean function so that the coefficient on T2 can be read as the ATE
X <- desv_mean(X)
# Treatment-control interactions, renamed so they do not duplicate the control names
W <- penn$T2 * X
colnames(W) <- paste0("T2_", colnames(X))
# Assembling the estimation data: log(inuidur1), T2, X and (T2*X)
base <- data.frame(inuidur1 = log(penn$inuidur1), T2 = penn$T2, X, W)
# Fitting the interactive regression model on all columns of base
ols_ira <- lm(inuidur1 ~ ., data = base)
# Printing summary of the model
summary(ols_ira)
# If you use 'stargazer' for generating tables, similar to regtable in Julia
if("stargazer" %in% rownames(installed.packages()) == FALSE) {install.packages("stargazer")}
library(stargazer)
stargazer(ols_ira, type = "text", title = "Interactive model", single.row = TRUE)
#4 Interactive Regression Adjustment Using Lasso (IRA using Lasso)
# Load necessary libraries
library(dplyr)
X <- model.matrix(~ (female + black + othrace + dep + q2 + q3 + q4 + q5 + q6 + agelt35 + agegt54 + durable + lusd + husd)^2, data = penn)
X <- desv_mean(X[, -1])            # drop the intercept column and demean the controls
D <- data.frame(T2 = penn$T2)      # treatment indicator (the model matrix above contains only controls)
X <- cbind(X, D$T2 * X)            # join controls (X) + T2*X "interactive"
Y <- penn %>% select(inuidur1)
Y$inuidur1 <- log(Y$inuidur1)      # apply log transformation to inuidur1
library(glmnet)
X_matrix <- as.matrix(X)
D_matrix <- as.matrix(D$T2)
Y_matrix <- as.matrix(Y$inuidur1)
# Fit the Lasso model for D and compute its residuals by hand
# (glmnet fits do not store residuals, so we use predict())
D_fit <- glmnet(X_matrix, D_matrix, alpha = 1, lambda = 1.1)
D_resid <- as.numeric(D_matrix - predict(D_fit, newx = X_matrix))
# Fit the Lasso model for Y and compute its residuals
Y_fit <- glmnet(X_matrix, Y_matrix, alpha = 1, lambda = 1.1)
Y_resid <- as.numeric(Y_matrix - predict(Y_fit, newx = X_matrix))
# Partialling out (Frisch-Waugh-Lovell): regress the Y residuals on the D residuals
Lasso_ira <- lm(Y_resid ~ D_resid)
# View summary of the model
summary(Lasso_ira)
A crash course in good and bad controls#
In this section, we will explore different scenarios where we need to decide whether the inclusion of a control variable, denoted by Z, will help (or not) to improve the estimation of the average treatment effect (ATE) of treatment X on outcome Y. In the causal diagrams, observed variables are represented by solid lines, while unobserved variables are represented by dashed lines.
# install.packages("librarian", quiet = T)
# install.packages("ggdag")
librarian::shelf(
dagitty, tibble, stargazer, ggplot2, ggdag,
quiet = T
)
theme_set(theme_void())
# cleans workspace
rm(list = ls())
Good control (Blocking back-door paths)#
Model 1
We will assume that X measures whether or not the student attends an extra tutoring session, which affects the student's grade (Y). There is another observable variable, the student's hours of sleep (Z), that impacts both X and Y. Theory says that when controlling for Z, we block the back-door path from X to Y. Accordingly, in the second regression the coefficient of X is closer to the true one (2.923 ≈ 3).
g <- dagitty("dag {
X <- Z -> Y
X -> Y
}")
coordinates(g) <- list(
x=c(X=1, Z=1.5, Y=2),
y=c(X=1, Z=0 ,Y=1))
ggdag(g)
# Generate data
set.seed(24)
n <- 1000
Z <- rnorm(n)
X <- 5 * Z + rnorm(n)
Y <- 3 * X + 1.5 * Z + rnorm(n)
d <- tibble(X=X, Y=Y, Z=Z)
# Regressions and summary results
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2,
type = "text",
column.labels = c("NoControl", "UsingControl")
)
Model 2
We will assume that X stands for police salaries, which affect the crime rate (Y). There is another observable variable, the supply of police officers (Z), that impacts X but not Y. Additionally, there is an unobservable variable (U), the preference for maintaining civil order, that affects Z and Y. Theory says that when controlling for Z, we block the back-door path from X to Y that runs through the unobservable variable. Accordingly, in the second regression the coefficient of X is essentially equal to the true one (0.5).
g <- dagitty("dag {
X <- Z <- U -> Y
X -> Y
}")
coordinates(g) <- list(
x=c(X=1, Z=2,U=3, Y=4),
y=c(X=2, Z=1, U=0, Y=2))
ggdag(g)
set.seed(24)
n <- 1000
U <- rnorm(n)
Z <- 7 * U + rnorm(n)
X <- 2 * Z + rnorm(n)
Y <- 0.5 * X + 0.2 * U + rnorm(n)
# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z, U=U)
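The regression comparison described above is not shown in this cell; a minimal sketch mirroring the Model 1 code (using the same `stargazer` call) would be:

# Regressions and summary results
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2,
    type = "text",
    column.labels = c("NoControl", "UsingControl")
)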
Bad Control (M-bias)#
Model 7
Let us suppose that X stands for a job training program aimed at reducing unemployment. A first unobserved confounder, the planning effort and design quality of the program (U1), directly affects participation in the program (X) and the proximity of program sites (the bad control Z). A second unobserved confounder (U2), the soft skills of the unemployed, affects employment status (Y) and the likelihood of being near a job training program (Z). This is why including Z in the second regression moves the coefficient on X further away from the true value.
g <- dagitty("dag {
X <- U1 -> Z <- U2 -> Y
X -> Y
}")
coordinates(g) <- list(
x = c(X=1, U1=1, Z=2, U2=3, Y=3),
y = c(X=1, U1=0, Z=0, U2=0, Y=1))
ggdag(g)
set.seed(1)
n <- 1000
U1 <- rnorm(n)
U2 <- rnorm(n)
Z <- 0.3 * U1 + 0.9 * U2 + rnorm(n)
X <- 4 * U1 + rnorm(n)
Y <- 3 * X + U2 + rnorm(n)
# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z, U1=U1, U2=U2)
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2,
type = "text",
column.labels = c("NoControl", "UsingControl")
)
Neutral Control (possibly good for precision)#
Model 8
In this scenario, we will assume that X represents the implementation of a new government policy that provides subsidies and guidance to small companies. Another variable, Z, stands for the inflation rate. Both X and Z affect Y, which represents the country's GDP growth rate. Even though Z does not impact X, its inclusion improves the precision of the ATE estimator (8.523 is closer to 8.6).
g <- dagitty("dag {X -> Y <-Z}")
coordinates(g) <- list(
x = c(X=1, Y=2, Z=3),
y = c(X=1, Y=1, Z=0))
ggdag(g)
set.seed(24)
n <- 1000
Z <- rnorm(n)
X <- rnorm(n)
Y <- 8.6 * X + 5 * Z + rnorm(n)
# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z)
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2,
type = "text",
column.labels = c("NoControl", "UsingControl"))
Bad Controls (Bias amplification)#
Model 10
Let us assume that X measures the implementation of a housing program for young adults buying their first house, which impacts average housing prices (Y). There is another observable variable, Z, which measures the expenditure of the program and affects only X. There is also an unobservable variable (U), representing young adults' preference to move out of their parents' house, that impacts X and Y but not Z. Therefore, including Z "amplifies the bias" coming from U, so the ATE estimator will be worse. We can see that in the second regression the estimate (0.835) is farther from the true value (0.8).
g <- dagitty("dag {
Z -> X <- U -> Y
X -> Y
}")
coordinates(g) <- list(
x = c(Z=0.5, X=1, U=1.5, Y=2),
y = c(Z=0, X=1, U=0.5, Y=1))
ggdag(g)
set.seed(24)
n <- 1000
Z <- rnorm(n)
U <- rnorm(n)
X <- 3 * Z + 6 * U + rnorm(n)
Y <- 0.8 * X + 0.2 * U + rnorm(n)
# Create a dataframe
d <- tibble(X=X, Y=Y, Z=Z, U=U)
# Regressions and summary results
lm_1 <- lm(Y ~ X, d)
lm_2 <- lm(Y ~ X + Z, d)
stargazer(lm_1, lm_2,
type = "text",
column.labels = c("NoControl", "UsingControl"))