Lab 2 - Julia Code#

Group 3: Dube, V., Garay, E., Guerrero, J., Villalba, M.

Multicollinearity#

Multicollinearity occurs when two or more predictors in a regression model are highly correlated with one another, inflating the variance of the estimated coefficients. To understand how multicollinearity affects our regression, we can examine the composition of the variance of our estimates.

Suppose some Data Generating Process follows:

\begin{equation*} Y = X_1\beta_1 +\epsilon \end{equation*}

Considering the partitioned regression model:

\begin{align*}
Y &= X\beta + e \\
Y &= X_1\beta_1 + X_2\beta_2 + e
\end{align*}

We know that the OLS estimator will solve this equation:

\begin{align*}
(X'X)\hat{\beta} &= X'Y \\
\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}
\begin{bmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{bmatrix}
&= \begin{bmatrix} X_1'Y \\ X_2'Y \end{bmatrix}
\end{align*}

This, because of the Frisch-Waugh-Lovell Theorem, yields:

\begin{align*}
\hat{\beta_1} &= (X_1'M_2X_1)^{-1}X_1'M_2Y
\end{align*}

Where \(M_2 = I - X_2(X_2'X_2)^{-1}X_2'\) is the projection matrix onto the space orthogonal to \(X_2\).

Note that \(M_2\) is symmetric and idempotent, and that premultiplying any variable by it yields the residual from regressing that variable on \(X_2\). For an arbitrary variable \(Z\):

\begin{align*}
M_2Z &= (I - X_2(X_2'X_2)^{-1}X_2')Z \\
&= Z - X_2(X_2'X_2)^{-1}X_2'Z \\
&= Z - X_2\hat{\omega} \\
&= Z - \hat{Z} \\
&= e_{Z}
\end{align*}

Where \(e_{Z}\) and \(\hat{\omega} \equiv (X_2'X_2)^{-1}X_2'Z\) come from the regression \(Z = X_2\hat{\omega} + e_{Z}\).

In a sense, the \(M_2\) matrix cleanses or “filters out” our \(Z\) variable, keeping only the part which is orthogonal to \(X_2\).
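To make this concrete, here is a minimal Julia sketch (using arbitrary simulated data, not part of the lab's dataset) checking that premultiplying by \(M_2\) reproduces the residuals from regressing \(Z\) on \(X_2\):

using LinearAlgebra, Random

Random.seed!(1)
n  = 100
X2 = hcat(ones(n), randn(n, 2))      # arbitrary regressors X_2 (with a constant)
Z  = randn(n)                        # an arbitrary variable Z

M2 = I - X2 * inv(X2' * X2) * X2'    # annihilator ("residual-maker") matrix of X_2
resid_direct = M2 * Z                # M_2 * Z
resid_ols    = Z - X2 * (X2 \ Z)     # residuals from regressing Z on X_2

maximum(abs.(resid_direct .- resid_ols))   # ≈ 0, up to floating-point error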

Returning to \(\hat{\beta_1}\):

\begin{align*}
\hat{\beta_1} &= (X_1'M_2X_1)^{-1}X_1'M_2Y \\
&= (X_1'M_2X_1)^{-1}X_1'M_2(X_1\beta_1 + \epsilon) \\
&= \beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2\epsilon
\end{align*}

This has important implications for the conditional variance of \(\hat{\beta_1}\):

\begin{align*}
Var(\hat{\beta_1}|X) &= Var(\beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2\epsilon|X) \\
&= Var((X_1'M_2X_1)^{-1}X_1'M_2\epsilon|X) \\
&= E[((X_1'M_2X_1)^{-1}X_1'M_2\epsilon)((X_1'M_2X_1)^{-1}X_1'M_2\epsilon)'|X] \\
&= E[(X_1'M_2X_1)^{-1}X_1'M_2\epsilon\epsilon'M_2'X_1(X_1'M_2'X_1)^{-1}|X] \\
&= (X_1'M_2X_1)^{-1}X_1'M_2E[\epsilon\epsilon'|X]M_2'X_1(X_1'M_2'X_1)^{-1}
\end{align*}

Under the traditional assumption that \(E[\epsilon\epsilon'|X] = \sigma^2I\):

\begin{align*}
Var(\hat{\beta_1}|X) &= \sigma^2(X_1'M_2X_1)^{-1}X_1'M_2M_2'X_1(X_1'M_2'X_1)^{-1} \\
&= \sigma^2(X_1'M_2'X_1)^{-1}
\end{align*}

Recall that the variance of \(X_1\) can be decomposed into two non-negative components:

\begin{align*}
X_1 &= X_2\alpha + v \\
Var(X_1) &= Var(X_2\alpha) + Var(v) \\
Var(X_1) - Var(X_2\alpha) &= Var(v) \\
E[X_1'X_1] - Var(X_2\alpha) &= E[X_1'M_2'X_1]
\end{align*}

Thus, necessarily: \(E[X_1'M_2X_1] \leq E[X_1'X_1]\)

Altogether this means: \(\sigma^2(X_1'X_1)^{-1} \leq \sigma^2(X_1'M_2'X_1)^{-1}\)

This shows that controlling for the irrelevant variables \(X_2\) will in fact increase the variance of \(\hat{\beta_1}\) by limiting us to the “usable” variance of \(X_1\) which is orthogonal to \(X_2\).
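The following short simulation (a sketch with made-up coefficients, not the lab's data) illustrates the point: adding an irrelevant control x2 that is highly correlated with x1 leaves the point estimate of \(\beta_1\) roughly unchanged but inflates its standard error, here by a factor of roughly \(\sqrt{1/(1-0.95^2)} \approx 3\):

using Random, DataFrames, GLM

Random.seed!(123)
n  = 1_000
x2 = randn(n)
x1 = 0.95 .* x2 .+ sqrt(1 - 0.95^2) .* randn(n)   # x1 highly correlated with x2
y  = 2 .* x1 .+ randn(n)                          # x2 is irrelevant for y

df_sim = DataFrame(y = y, x1 = x1, x2 = x2)

se_short = stderror(lm(@formula(y ~ x1), df_sim))[2]       # SE of the x1 coefficient without the control
se_long  = stderror(lm(@formula(y ~ x1 + x2), df_sim))[2]  # SE of the x1 coefficient controlling for x2

se_short, se_long   # the second standard error is noticeably larger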

Suppose we want to estimate the impact of years of schooling on future salary. Imagine as well that we have a vast array of possible control variables at our disposal. Someone who is not familiar with the concept of multicollinearity might think that, to avoid any possibility of omitted variable bias and ensure consistency, it is best to control for everything we can. We now know this is not the case and that this approach can inadvertently introduce multicollinearity.

Consider that we have as a possible control variable the total number of courses taken by each student. Intuitively, years of schooling are likely to correlate strongly with the total number of courses taken (more years in school typically means more courses completed), so controlling for this variable may cause the problem described above, inflating the variance of the estimated coefficients and potentially distorting our understanding of the true effect of schooling on salary.

The same rationale applies to many other settings. For instance, imagine estimating the impact of marketing expenditure on sales: controlling for a variable such as the number of marketing campaigns will likely cause the same issue.

Perfectly collinear regressors#

A special case of multicollinearity arises when one variable is an exact linear combination of other variables in our dataset: these variables are not merely highly correlated, they are perfectly collinear.

Considering the partitioned regression model:

\begin{align*}
Y &= X_1\beta_1 + X_2\beta_2 + \epsilon \\
X_1 &= X_2 \alpha
\end{align*}

where the second equation is deterministic.

We know that the OLS estimator will solve this equation:

\begin{align*}
(X'X)\hat{\beta} &= X'Y \\
\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}
\begin{bmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{bmatrix}
&= \begin{bmatrix} X_1'Y \\ X_2'Y \end{bmatrix}
\end{align*}

Substituting \(X_1 = X_2 \alpha\) in the \((X'X)\) matrix:

\begin{align*}
(X'X) &= \begin{bmatrix} (X_2\alpha)'X_2\alpha & (X_2\alpha)'X_2 \\ X_2'(X_2\alpha) & X_2'X_2 \end{bmatrix} \\
&= \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix}
\end{align*}

This yields the determinant:

\begin{align*}
\det (X'X) &= \det\left( \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right)
\end{align*}

Transforming the matrix using block row operations (the premultiplying matrix is block upper triangular with unit diagonal blocks, so it has determinant one and leaves the determinant unchanged):

\begin{align*}
\det (X'X) &= \det\left( \begin{bmatrix} I & -\alpha' \\ 0 & I \end{bmatrix} \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right) \\
&= \det\left( \begin{bmatrix} 0 & 0 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right)
\end{align*}

From this, we can see that:

\begin{align*}
\det (X'X) &= 0
\end{align*}

Because of this, \((X'X)\) is not invertible and the OLS normal equations have no unique solution.

Practical application#

We can easily illustrate the theory above with a practical application. We will create a dataset simulating the regressors that we may want to include in the estimation of a linear model.

The first 9 variables are drawn from a standard normal distribution.

using LinearAlgebra
using Random

Random.seed!(0)
matrix = randn(10, 9)
10×9 Matrix{Float64}:
 -0.231909   -0.269885  -1.07458   …  -0.57796     0.27262    0.849341
  0.94039     0.578004  -1.38197      -1.78676     1.23521   -0.210716
  0.596762    1.16034    0.951853     -1.36705     1.11072   -0.143171
  1.99782     0.287888  -0.292291      0.660498    1.85158   -0.121578
 -0.0515618  -0.441193   0.328946      0.0371086  -0.131821   0.47354
  1.17224     0.134576  -0.495999  …   0.877644   -0.406776  -0.0923911
 -1.69681     0.165543   0.22277       1.79296    -0.379297  -0.550621
 -2.11615    -1.0721    -1.66125      -1.35785    -0.499403  -0.474732
  0.558366   -1.26016    1.92524      -1.48326    -0.635276   0.237598
 -0.86473     0.196408  -1.04847       0.475453    2.28402    0.100588

However, the 10th variable is a linear combination (the sum) of variables 1, 5 and 9.

coefficients = [1, 1, 1]
selected_columns = [1, 5, 9]
matrix = hcat(matrix, matrix[:, selected_columns] * coefficients)
10×10 Matrix{Float64}:
 -0.231909   -0.269885  -1.07458   …   0.27262    0.849341    0.100641
  0.94039     0.578004  -1.38197       1.23521   -0.210716   -0.828557
  0.596762    1.16034    0.951853      1.11072   -0.143171   -0.510355
  1.99782     0.287888  -0.292291      1.85158   -0.121578    3.1357
 -0.0515618  -0.441193   0.328946     -0.131821   0.47354    -0.951298
  1.17224     0.134576  -0.495999  …  -0.406776  -0.0923911   0.815082
 -1.69681     0.165543   0.22277      -0.379297  -0.550621   -1.00174
 -2.11615    -1.0721    -1.66125      -0.499403  -0.474732   -2.84605
  0.558366   -1.26016    1.92524      -0.635276   0.237598    2.11939
 -0.86473     0.196408  -1.04847       2.28402    0.100588   -0.864591

As we saw in theory, this should make the determinant of the matrix zero, rendering X singular and yielding an error message when we try to invert it.

det(matrix)
-2.548968811760904e-14
inv_matrix = inv(matrix)
10×10 Matrix{Float64}:
  4.21177e15   3.21803e15   2.84373e15  …   5.35458e15   2.48399e15
  0.35937      0.106139     0.159937       -0.497513    -0.629165
 -1.9799      -1.79593     -0.971094       -2.79778     -1.85628
  0.382724    -0.282216     0.588743        0.183997     0.0466157
  4.21177e15   3.21803e15   2.84373e15      5.35458e15   2.48399e15
 -3.02412     -2.49823     -2.12259     …  -4.6385      -2.89822
 -1.40691     -1.3032      -0.896793       -2.0195      -0.996493
 -1.28789     -0.920547    -0.753483       -1.62345     -0.682825
  4.21177e15   3.21803e15   2.84373e15      5.35458e15   2.48399e15
 -4.21177e15  -3.21803e15  -2.84373e15     -5.35458e15  -2.48399e15

We can see that this is not the case. Is this a contradiction of our theoretical proof? Julia, like Python, yields a determinant extremely close to zero but not exactly equal to zero, so it is able to compute a "supposed" inverse of the matrix. This would seem to contradict what we proved in theory; however, the issue is rooted in the way these languages handle floating-point values. It is therefore not a contradiction of the theory, but rather an illustration of how computational environments deal with the inherent limitations of floating-point arithmetic.
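Rather than relying on det() or inv(), a more reliable way to diagnose this situation (sketched below on the same matrix) is to inspect the numerical rank or the condition number, both of which make the collinearity explicit:

# Numerically safer collinearity diagnostics
rank(matrix)   # expected to be 9 rather than 10: the columns are linearly dependent
cond(matrix)   # an enormous condition number signals a (numerically) singular matrix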

Analyzing RCT data with Precision Adjustment#

using DelimitedFiles   # readdlm for the raw .dat file
using DataFrames
using Queryverse
using GLM
using StatsModels
using Combinatorics
using CategoricalArrays
using HypothesisTests
using StatsBase
using Lasso
using TypedTables
using MacroTools
using NamedArrays
using DataTables
using Latexify
using PrettyTables
using TexTables        # regtable
using Plots            # histogram, scatter, hline!
using HDMjl            # rlasso, rlasso_arg
mat, head = readdlm("../../../data/penn_jae.dat", Float64, header=true)
df = DataFrame(mat, vec(head))
penn = filter(row -> row[:tg] in [2, 0], df)

# Treatment group n°2 becomes the treated group
replace!(penn.tg, 2 => 1)
rename!(penn, "tg" => "T2")

# dep: from float to string
penn[!,:dep] = string.(penn[!,:dep])

# dep variable in categorical format
penn[!,:dep] = categorical(penn[!,:dep])

Histogram#

treat = filter(row -> row[:T2] in [1], penn)
treat = select(treat, [:inuidur1])
treatmat = Matrix(treat)

notreat = filter(row -> row[:T2] in [0], penn)
notreat = select(notreat, [:inuidur1])
notreatmat = Matrix(notreat)

Treated individuals

histogram(treatmat, bins=15, normalize=:pdf, color=:gray)
xlims!(0,60)
title!("Distribution of weeks unemployed for treated individuals")
xlabel!("Weeks unemployed")
ylabel!("P(x)")

Control individuals

histogram(notreatmat, bins=15, normalize=:pdf, color=:gray)
xlims!(0,60)
title!("Distribution of weeks unemployed for control individuals")
xlabel!("Weeks unemployed")
ylabel!("P(x)")

1 Classical 2-Sample Approach (CL)

alpha = 0.05

# Welch (unequal-variance) two-sample t-test for the difference in means
p_value = pvalue(UnequalVarianceTTest(vec(treatmat), vec(notreatmat)))

reject_H0 = p_value < alpha
alpha, p_value, reject_H0
ols_cl = lm(@formula(log(inuidur1) ~ T2), penn)

table1 = regtable( "No adjustment model" => ols_cl)  
#coefic=coef(ols_cl)

2 Classical Linear Regression Adjustment (CRA)

reg2 = @formula(log(inuidur1) ~ T2 + (female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2)
reg2 = apply_schema(reg2, schema(reg2, penn))

ols_cra = lm(reg2, penn)
table2 = regtable("CRA model" => ols_cra)

3 Interactive Regression Adjustment (IRA)

# demean function: subtracts each column's mean
function desv_mean(a)
    A = mean(a, dims = 1)
    M = zeros(Float64, size(a, 1), size(a, 2))

    for i in 1:size(a, 2)
        M[:, i] = a[:, i] .- A[i]
    end
    return M
end
reg1 = @formula(T2 ~ (female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2)
reg1 = apply_schema(reg1, schema(reg1, penn))
# Model matrix & demean
X = StatsModels.modelmatrix(reg1.rhs, penn)
X = desv_mean(X)  # matrix format
Y = select(penn, [:inuidur1, :T2])  # select inuidur1 and T2

X = DataFrame(hcat(X, Matrix(select(penn, [:T2])) .* X), :auto)  # join X and (T2*X)

base = hcat(Y, X)  # join inuidur1, T2, X and (T2*X)

base.inuidur1 = log.(base.inuidur1)  # log(inuidur1)

terms = term.(names(base))  # term.() turns every column name into a model term

#interactive regression model

ols_ira  = lm(terms[1] ~ sum(terms[2:end]), base)


table3 = regtable("Interactive model" => ols_ira)

#terms[1] : the first term, i.e. the outcome of interest
#sum(terms[2:end]) : the remaining terms enter as regressors in the linear regression

4 Interactive Regression Adjustment Using Lasso (IRA using Lasso)

X = StatsModels.modelmatrix(reg2.rhs, penn)
X = desv_mean(X)
D = DataFrame([X[:, 1]], :auto)  # treatment variable

rename!(D, Dict(:x1 => :T2))  # rename x1 -> T2

X = DataFrame(hcat(X[:, 2:end], X[:, 1] .* X[:, 2:end]), :auto)  # join controls (X) and the T2*X interactions

Y = select(penn, [:inuidur1])  # select just inuidur1

Y.inuidur1 = log.(Y.inuidur1)  # log(inuidur1)





#terms = term.(names(base)) # all terms
#model = terms[1] ~ sum(terms[2:end])
#lasso = fit(LassoModel, terms[1] ~ sum(terms[2:end]), base; standardize = false, α = 0.1)
#coef(lasso)[2]
#table3 = regtable("Lasso adjustment model" => lasso)
D_reg_0  = rlasso_arg( X, D, nothing, true, true, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )
D_resid = rlasso(D_reg_0)["residuals"]
Y_reg_0  = rlasso_arg( X, Y, nothing, true, true, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )
Y_resid = rlasso(Y_reg_0)["residuals"]
D_resid = reshape(D_resid, length(D_resid), 1)
Lasso_ira = lm(D_resid, Y_resid)

Plot coefficients#

#T2*female        1
#T2*black         2
#T2*agelt35       4
#T2*factor(dep)1  11
# Coefficient vector layout (240 coefficients):
# Intercept, T2, 120 interactions, ..., T2*female, T2*black, ..., T2*interaction
# 1        ,  2, ...             ,      122      , 123     , ...,           240
#coef(ols_ira)[1]

95% confidence

# Extract the coefficients and standard errors of the selected T2 interactions from the IRA regression
coefs = Dict("T2*fem" => coef(ols_ira)[122],
             "T2*black" => coef(ols_ira)[123],
             "T2*agelt35" => coef(ols_ira)[125],
             "T2*f(dep)1" => coef(ols_ira)[132])  

ses = Dict("T2*fem" => stderror(ols_ira)[122],
             "T2*black" => stderror(ols_ira)[123],
             "T2*agelt35" => stderror(ols_ira)[125],
             "T2*f(dep)1" => stderror(ols_ira)[132]) 

# Scatter plot with error bars (95% confidence intervals)
scatter_plot = scatter(coefs, yerr=1.96 .* collect(values(ses)), legend=false, 
    xlabel="Coefficient", ylabel="Value", title="Coefficients using IRA(95%)",
    markershape=:circle, markercolor=:blue, markerstrokecolor=:black,
    markersize=8, markeralpha=0.8)

# Horizontal line at y = 0
hline!(scatter_plot, [0], color="gray", linestyle=:dash)

90% confidence

# Extract the coefficients and standard errors of the selected T2 interactions from the IRA regression
coefs = Dict("T2*fem" => coef(ols_ira)[122],
             "T2*black" => coef(ols_ira)[123],
             "T2*agelt35" => coef(ols_ira)[125],
             "T2*f(dep)1" => coef(ols_ira)[132])  

ses = Dict("T2*fem" => stderror(ols_ira)[122],
             "T2*black" => stderror(ols_ira)[123],
             "T2*agelt35" => stderror(ols_ira)[125],
             "T2*f(dep)1" => stderror(ols_ira)[132]) 

# Scatter plot with error bars (90% confidence intervals)
scatter_plot = scatter(coefs, yerr=1.645 .* collect(values(ses)), legend=false, 
    xlabel="Coefficient", ylabel="Value", title="Coefficients using IRA(90%)",
    markershape=:circle, markercolor=:blue, markerstrokecolor=:black,
    markersize=8, markeralpha=0.8)

# Horizontal line at y = 0
hline!(scatter_plot, [0], color="gray", linestyle=:dash)

No Significant Interaction Effects:

Our analysis found no statistically significant interaction effects between the incentive program and other factors at the 90% or 95% confidence level. This suggests there’s no evidence of heterogeneity (variation in effects) within subgroups defined by these factors.

Marginally Positive Effect for Dependents = 1:

However, there’s a trend worth mentioning. Individuals with one dependent (e.g., one child) showed a marginally positive (but not statistically significant) effect. This could tentatively suggest they might be slightly more likely to remain unemployed when receiving the incentive.

Similar Trend for Under 35:

A similar trend was observed for individuals under 35.

Negative Estimates for Women and Black People:

Interestingly, women and Black people displayed negative point estimates. This means they might be more inclined to seek employment sooner upon receiving the incentive program’s benefits.

Key Points:

No statistically significant interaction effects were found between the incentive program and other factors. Individuals with one dependent and those under 35 showed a slight (not significant) tendency to remain unemployed longer with the incentive. Women and Black people seemed more likely to seek employment sooner when offered the incentive.

A crash course in good and bad controls#

In this section, we will explore different scenarios where we need to decide whether the inclusion of a control variable, denoted by Z, will help (or not) to improve the estimation of the average treatment effect (ATE) of treatment X on outcome Y. In the graphs below, observed variables are shown as turquoise nodes, while unobserved variables are shown as grey nodes.

# import Pkg; Pkg.add("GraphPlot")
# import Pkg; Pkg.add("Compose")
# import Pkg; Pkg.add("Cairo")
# import Pkg; Pkg.add("Fontconfig")
# import Pkg; Pkg.add("Colors")
# import Pkg; Pkg.add("Graphs")
# Pkg.add("CausalInference")
# Libraries
using Graphs, GraphPlot
using Compose, Cairo, Fontconfig
using Colors
using DataFrames, GLM
using Distributions   # provides Normal, used in the simulations below

import Random

Good control (Blocking back-door paths)#

Model 1

We will assume that X measures whether or not the student attends an extra tutoring session, which affects the student's grade (Y). We also have another observable variable, the student's hours of sleep (Z), which impacts both X and Y. Theory says that by controlling for Z we block the back-door path from X to Y. Thus, we see that in the second regression the coefficient of X is closer to the true one (3.03339 ≈ 3).

# Define the graph with 3 nodes
g = SimpleDiGraph(3)
add_edge!(g, 1, 2)  # X -> Y
add_edge!(g, 3, 1)  # Z -> X
add_edge!(g, 3, 2)  # Z -> Y

# Node labels for X, Y, Z
nodelabel = ["X", "Y", "Z"]

# Define node colors based on labels
nodefillc_dict = Dict("X" => colorant"turquoise",
                      "Y" => colorant"turquoise",
                      "Z" => colorant"turquoise")

nodefillc = [nodefillc_dict[label] for label in nodelabel]

# Create the graph plot
g_plot = gplot(g,
               nodefillc = nodefillc,
               nodestrokec = "black",
               nodestrokelw = 0,
               nodelabel = nodelabel,
               NODESIZE = 0.15,
               NODELABELSIZE = 10,
               edgestrokec = "black")

# Save and display the plot
draw(PNG("causal_discovery_1.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)

# Generate data
n = 1000
Z = rand(Normal(0, 1), n)
X = 5 * Z + rand(Normal(0, 1), n)
Y = 3 * X + 1.5 * Z + rand(Normal(0, 1), n)

# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y)

1,000 rows × 3 columns (Z, X, Y, all Float64); the notebook output shows the first 30 rows.
# Wrong regression: not controlling for the confounder Z
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
──────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error       t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  0.021246  0.032701      0.65    0.5160  -0.0429246  0.0854166
X            3.29515   0.00623332  528.63    <1e-99   3.28291    3.30738
──────────────────────────────────────────────────────────────────────────
# Correct regression
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + Z

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  0.0102651   0.0315753   0.33    0.7452  -0.0516965  0.0722268
X            3.03339     0.0307871  98.53    <1e-99   2.97297    3.0938
Z            1.35458     0.156251    8.67    <1e-16   1.04796    1.6612
──────────────────────────────────────────────────────────────────────────

Model 2

We will assume that X stands for police salaries, which affect the crime rate (Y). We also have another observable variable, the supply of police officers (Z), which impacts X but not Y. Additionally, we know that there is an unobservable variable (U), the preference for maintaining civil order, that affects Z and Y. Theory says that by controlling for Z we block the unobservable variable's back-door path from X to Y. Thus, in the second regression the coefficient of X is a consistent estimate of the true effect (0.5).

# Define the graph with 4 nodes: U, Z, X, Y
g = SimpleDiGraph(4)
add_edge!(g, 1, 4)  # U -> Y
add_edge!(g, 1, 2)  # U -> Z
add_edge!(g, 2, 3)  # Z -> X
add_edge!(g, 3, 4)  # X -> Y

# Node labels for U, Z, X, Y
nodelabel = ["U", "Z", "X", "Y"]

# Define node colors based on labels
nodefillc_dict = Dict("U" => colorant"lightgrey",
                      "Z" => colorant"turquoise",
                      "X" => colorant"turquoise",
                      "Y" => colorant"turquoise")

nodefillc = [nodefillc_dict[label] for label in nodelabel]

# Create the graph plot
g_plot = gplot(g,
               nodefillc = nodefillc,
               nodestrokec = "black",
               nodestrokelw = 0,
               nodelabel = nodelabel,
               NODESIZE = 0.15,
               NODELABELSIZE = 10,
               edgestrokec = "black")

# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)

# Generate data
n = 1000
U = rand(Normal(0, 1), n)
Z = 7 * U + rand(Normal(0, 1), n)
X = 2 * Z + rand(Normal(0, 1), n)
Y = 0.5 * X + 0.2 * U + rand(Normal(0, 1), n)

# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y, U = U)

1,000 rows × 4 columns (Z, X, Y, U, all Float64); the notebook output shows the first 30 rows.
# Not controlling for Z: the back-door path through U remains open
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)   Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.0337287  0.0307735    -1.10    0.2733  -0.0941169  0.0266596
X             0.514806   0.00210354  244.73    <1e-99   0.510678   0.518934
────────────────────────────────────────────────────────────────────────────
# Controlling for Z blocks the back-door path
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + Z

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  -0.0327959   0.0306374  -1.07    0.2847  -0.092917  0.0273253
X             0.417904    0.0307391  13.60    <1e-38   0.357583  0.478225
Z             0.195056    0.0617315   3.16    0.0016   0.073917  0.316194
──────────────────────────────────────────────────────────────────────────

Bad Control (M-bias)#

Model 7

Let us suppose that X stands for participation in a job training program aimed at reducing unemployment. A first unobserved confounder, the planning effort and good design of the program (U1), directly impacts participation in the training program (X) and the proximity of the program (the bad control Z). Furthermore, another unobserved confounder (U2), the soft skills of the unemployed, affects the employment status of individuals (Y) and the likelihood of being assigned to a closer training program (Z). That is why including Z in the second regression pushes the coefficient of X further away from the true one.

# Define the graph with 5 nodes: U1, U2, Z, X, Y
g = SimpleDiGraph(5)
add_edge!(g, 1, 3)  # U1 -> Z
add_edge!(g, 1, 4)  # U1 -> X
add_edge!(g, 2, 3)  # U2 -> Z
add_edge!(g, 2, 5)  # U2 -> Y
add_edge!(g, 4, 5)  # X -> Y

# Node labels for U1, U2, Z, X, Y
nodelabel = ["U1", "U2", "Z", "X", "Y"]

# Define node colors based on labels
nodefillc_dict = Dict("U1" => colorant"lightgrey",
                      "U2" => colorant"lightgrey",
                      "Z" => colorant"turquoise",
                      "X" => colorant"turquoise",
                      "Y" => colorant"turquoise")

nodefillc = [nodefillc_dict[label] for label in nodelabel]

# Create the graph plot
g_plot = gplot(g,
               nodefillc = nodefillc,
               nodestrokec = "black",
               nodestrokelw = 0,
               nodelabel = nodelabel,
               NODESIZE = 0.15,
               NODELABELSIZE = 10,
               edgestrokec = "black")

# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)

# Generate data
n = 1000
U1 = rand(Normal(0, 1), n)
U2 = rand(Normal(0, 1), n)
Z = 0.3 * U1 + 0.9 * U2 + rand(Normal(0, 1), n)
X = 4 * U1 + rand(Normal(0, 1), n)
Y = 3 * X + U2 + rand(Normal(0, 1), n)

# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y, U1 = U1, U2 = U2)

1,000 rows × 5 columns (Z, X, Y, U1, U2, all Float64); the notebook output shows the first 30 rows.
# Correct regression: the collider Z is excluded
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  -0.0121698   0.0470352   -0.26    0.7959  -0.104469  0.0801295
X             2.98862     0.0110309  270.93    <1e-99   2.96698   3.01027
───────────────────────────────────────────────────────────────────────────
# Including the bad control Z opens the path through U1 and U2 (M-bias)
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + Z

Coefficients:
───────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  0.0022004  0.0409946     0.05    0.9572  -0.0782452   0.082646
X            2.95344    0.00981324  300.96    <1e-99   2.93418     2.9727
Z            0.526533   0.0295594    17.81    <1e-61   0.468528    0.584539
───────────────────────────────────────────────────────────────────────────

Neutral Control (possibly good for precision)#

Model 8

In this scenario, we will assume that X represents the implementation of a new government policy that provides subsidies and guidance to small companies. Another variable, Z, stands for the inflation rate, and both X and Z affect Y, which represents the country's GDP growth rate. Then, even though Z does not impact X, its inclusion improves the precision of the ATE estimator (8.63339 is closer to 8.6, with a much smaller standard error).

# Define the graph with 3 nodes: Z, X, Y
g = SimpleDiGraph(3)
add_edge!(g, 1, 3)  # Z -> Y
add_edge!(g, 2, 3)  # X -> Y

# Node labels for Z, X, Y
nodelabel = ["Z", "X", "Y"]

# Define node colors based on labels
nodefillc_dict = Dict("Z" => colorant"turquoise",
                      "X" => colorant"turquoise",
                      "Y" => colorant"turquoise")

nodefillc = [nodefillc_dict[label] for label in nodelabel]

# Create the graph plot
g_plot = gplot(g,
               nodefillc = nodefillc,
               nodestrokec = "black",
               nodestrokelw = 0,
               nodelabel = nodelabel,
               NODESIZE = 0.15,
               NODELABELSIZE = 10,
               edgestrokec = "black")

# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)

# Generate data
n = 1000
Z = rand(Normal(0, 1), n)
X = rand(Normal(0, 1), n)
Y = 8.6 * X + 5 * Z + rand(Normal(0, 1), n)

# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y)

1,000 rows × 3 columns (Z, X, Y, all Float64); the notebook output shows the first 30 rows.
# Regression without Z: unbiased but less precise
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  -0.0115351    0.167397  -0.07    0.9451  -0.340025   0.316955
X             8.5183       0.163177  52.20    <1e-99   8.19809    8.83851
──────────────────────────────────────────────────────────────────────────
# Including the neutral control Z improves precision
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + Z

Coefficients:
───────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  0.0102651   0.0315753    0.33    0.7452  -0.0516965  0.0722268
X            8.63339     0.0307871  280.42    <1e-99   8.57297    8.6938
Z            5.02152     0.0305298  164.48    <1e-99   4.96161    5.08143
───────────────────────────────────────────────────────────────────────────

Bad Controls (Bias amplification)#

Model 10

Let us assume that X measures the implementation of a housing program for young adults buying their first house, which impacts average housing prices (Y). There is another observable variable, Z, that measures the expenditure of the program and affects only X. There is also an unobservable variable (U) that represents young adults' preference for moving out of their parents' house and impacts X and Y (but not Z). Therefore, including Z will "amplify the bias" that U induces on the coefficient of X, so the ATE estimator becomes worse. We can see that in the second regression the estimate (0.833083) is farther from the true value (0.8).

# Define the graph with 4 nodes: U, Z, X, Y
g = SimpleDiGraph(4)
add_edge!(g, 1, 3)  # U -> X
add_edge!(g, 1, 4)  # U -> Y
add_edge!(g, 2, 3)  # Z -> X
add_edge!(g, 3, 4)  # X -> Y

# Node labels for U, Z, X, Y
nodelabel = ["U", "Z", "X", "Y"]

# Define node colors based on labels
nodefillc_dict = Dict("U" => colorant"lightgrey",
                      "Z" => colorant"turquoise",
                      "X" => colorant"turquoise",
                      "Y" => colorant"turquoise")

nodefillc = [nodefillc_dict[label] for label in nodelabel]

# Create the graph plot
g_plot = gplot(g,
               nodefillc = nodefillc,
               nodestrokec = "black",
               nodestrokelw = 0,
               nodelabel = nodelabel,
               NODESIZE = 0.15,
               NODELABELSIZE = 10,
               edgestrokec = "black")

# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)

# Generate data
n = 1000
U = rand(Normal(0, 1), n)
Z = rand(Normal(0, 1), n)
X = 3 * Z + 6 * U + rand(Normal(0, 1), n)
Y = 0.8 * X + 0.2 * U + rand(Normal(0, 1), n)

# Create dataframe
df = DataFrame(U = U, Z = Z, X = X, Y = Y)

1,000 rows × 4 columns (U, Z, X, Y, all Float64); the notebook output shows the first 30 rows.
# Regression without Z (here Z is a bad control, so this is the preferred specification)
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)   Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.0317809  0.0309595    -1.03    0.3049  -0.0925341  0.0289723
X             0.827152   0.00444388  186.13    <1e-99   0.818432   0.835872
────────────────────────────────────────────────────────────────────────────
# Including Z amplifies the bias coming from the unobserved U
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X + Z

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)   Lower 95%   Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.0349805  0.0308707    -1.13    0.2574  -0.0955594   0.0255984
X             0.833083   0.00489468  170.20    <1e-99   0.823478    0.842688
Z            -0.0946024  0.0332626    -2.84    0.0045  -0.159875   -0.0293297
─────────────────────────────────────────────────────────────────────────────