Lab 2 - Julia Code#
Group 3: Dube, V., Garay, E. Guerrero, J., Villalba, M.
Multicollinearity#
Multicollinearity occurs when two or more predictors in a regression model are highly correlated with one another, inflating the variance of the estimated coefficients. To understand how multicollinearity affects our regression, we can examine the composition of the variance of our estimates.
Suppose some Data Generating Process follows:
\begin{equation*} Y = X_1\beta_1 +\epsilon \end{equation*}
Considering the partitioned regression model:
\begin{align*} Y &= X\beta + e \\ Y &= X_1\beta_1 + X_2\beta_2 + e \end{align*}
We know that the OLS estimator will solve this equation:
\begin{align*} (X'X)\hat{\beta} &= X'Y \\
\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{bmatrix} &= \begin{bmatrix} X_1'Y \\ X_2'Y \end{bmatrix} \end{align*}
By the Frisch–Waugh–Lovell theorem, this yields:
\begin{align*} \hat{\beta_1} &= (X_1'M_2X_1)^{-1}X_1'M_2Y \end{align*}
Where \(M_2 = I - X_2(X_2'X_2)^{-1}X_2'\) is the matrix that projects onto the orthogonal complement of the column space of \(X_2\) (the "annihilator" of \(X_2\)).
Note that \(M_2\) is symmetric and idempotent, and that premultiplying any variable by it yields the residual from regressing that variable on \(X_2\). For an arbitrary variable \(Z\):
\begin{align*} M_2Z &= (I - X_2(X_2'X_2)^{-1}X_2')Z \\ &= Z - X_2(X_2'X_2)^{-1}X_2'Z \\ &= Z - X_2\hat{\omega} \\ &= Z - \hat{Z} \\ &= e_{Z} \end{align*}
Where \(e_{Z}\) and \(\hat{\omega} \equiv (X_2'X_2)^{-1}X_2'Z\) come from the regression: $$Z = X_2\hat{\omega} + e_{Z}$$
In a sense, the \(M_2\) matrix cleanses or “filters out” our \(Z\) variable, keeping only the part which is orthogonal to \(X_2\).
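As a quick numerical check of this residual-maker property, here is a minimal sketch with simulated data (the names X2, Z, M2, the seed, and the dimensions are our own illustrative choices, not part of any dataset used later):

using LinearAlgebra, Random

Random.seed!(123)
n  = 100
X2 = randn(n, 3)                        # arbitrary regressors
Z  = X2 * [1.0, -2.0, 0.5] .+ randn(n)  # a variable partly explained by X2

M2 = I - X2 * inv(X2' * X2) * X2'       # annihilator ("residual-maker") matrix of X2

omega_hat = (X2' * X2) \ (X2' * Z)      # OLS coefficients of Z on X2
e_Z = Z - X2 * omega_hat                # residuals of that regression

maximum(abs.(M2 * Z .- e_Z))            # ≈ 0: M2*Z reproduces the residuals
maximum(abs.(X2' * (M2 * Z)))           # ≈ 0: the filtered Z is orthogonal to X2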
Returning to \(\hat{\beta_1}\):
\begin{align*} \hat{\beta_1} &= (X_1'M_2X_1)^{-1}X_1'M_2Y \\ &= (X_1'M_2X_1)^{-1}X_1'M_2(X_1\beta_1 + \epsilon) \\ &= \beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2\epsilon \end{align*}
This has important implications for the conditional variance of \(\hat{\beta_1}\):
\begin{align*} Var(\hat{\beta_1}|X) &= Var(\beta_1 + (X_1'M_2X_1)^{-1}X_1'M_2\epsilon|X) \\ &= Var((X_1'M_2X_1)^{-1}X_1'M_2\epsilon|X) \\ &= E[((X_1'M_2X_1)^{-1}X_1'M_2\epsilon)((X_1'M_2X_1)^{-1}X_1'M_2\epsilon)'|X] \\ &= E[(X_1'M_2X_1)^{-1}X_1'M_2\epsilon\epsilon'M_2'X_1(X_1'M_2'X_1)^{-1}|X] \\ &= (X_1'M_2X_1)^{-1}X_1'M_2E[\epsilon\epsilon'|X]M_2'X_1(X_1'M_2'X_1)^{-1} \end{align*}
Under the traditional assumption that \(E[\epsilon\epsilon'|X] = \sigma^2I\):
\begin{align*} Var(\hat{\beta_1}|X) &= \sigma^2(X_1'M_2X_1)^{-1}X_1'M_2M_2'X_1(X_1'M_2X_1)^{-1} \\ &= \sigma^2(X_1'M_2X_1)^{-1} \end{align*}
where the last step uses that \(M_2\) is symmetric and idempotent, so \(M_2M_2' = M_2\).
Recall that \(X_1\) can be decomposed into the part explained by \(X_2\) and the part orthogonal to it, \(v = M_2X_1\). Because \(X_2'M_2 = 0\), the cross terms vanish:
\begin{align*} X_1 &= X_2\alpha + v \\ X_1'X_1 &= \alpha'X_2'X_2\alpha + v'v \\ &= \alpha'X_2'X_2\alpha + X_1'M_2X_1 \end{align*}
Since \(\alpha'X_2'X_2\alpha\) is positive semidefinite, necessarily: $$X_1'M_2X_1 \leq X_1'X_1$$
Altogether, in the positive semidefinite ordering, this means: $$\sigma^2(X_1'X_1)^{-1} \leq \sigma^2(X_1'M_2X_1)^{-1}$$
This shows that controlling for the irrelevant regressors \(X_2\) will in fact (weakly) increase the variance of \(\hat{\beta_1}\), because it restricts us to the "usable" variance of \(X_1\), the part that is orthogonal to \(X_2\).
Suppose we want to estimate the impact of years of schooling on future salary. Imagine as well that we have a vast array of possible control variables at our disposal. Someone unfamiliar with the concept of multicollinearity might think that, to avoid any possibility of omitted variable bias and ensure consistency, it is best to control for everything we can. We now know this is not the case and that this approach can inadvertently introduce multicollinearity.
Consider that we have, as a possible control variable, the total number of courses taken by each student. Intuitively, years of schooling are likely to correlate strongly with the total number of courses taken (more years in school typically means more courses completed), so controlling for this variable may produce the problem described above, inflating the variance of the estimated coefficients and potentially distorting our understanding of the true effect of schooling on salary.
The same rationale applies to many other settings. For instance, imagine estimating the impact of marketing expenditure on sales: controlling for a variable such as the number of marketing campaigns will likely cause the same issue.
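A small simulation makes the variance inflation concrete (this is our own sketch; the names X1, X2, Y, the seed, and all numeric values are made up for illustration). Here X2 has no effect on Y but is strongly correlated with X1, and including it inflates the standard error of the estimated coefficient on X1:

using Random, DataFrames, GLM

Random.seed!(42)
n  = 500
X1 = randn(n)
X2 = 0.95 .* X1 .+ 0.3 .* randn(n)    # irrelevant for Y but highly correlated with X1
Y  = 3 .* X1 .+ randn(n)              # true DGP: Y depends on X1 only

df = DataFrame(Y = Y, X1 = X1, X2 = X2)

short = lm(@formula(Y ~ X1), df)      # uses all of the variance of X1
long  = lm(@formula(Y ~ X1 + X2), df) # uses only the part of X1 orthogonal to X2

# Both estimates of the X1 coefficient are close to 3, but the standard error
# is noticeably larger once the redundant control X2 is included
stderror(short)[2], stderror(long)[2]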
Perfectly collinear regressors#
A special case of multicollinearity arises when a variable is an exact linear combination of other variables in our dataset; the variables are then not merely highly correlated but perfectly collinear.
Considering the partitioned regression model:
\begin{align*} Y &= X_1\beta_1 + X_2\beta_2 + \epsilon \\ X_1 &= X_2 \alpha \end{align*}
where the second equation is deterministic.
We know that the OLS estimator will solve this equation:
\begin{align*} (X'X)\hat{\beta} &= X'Y \\
\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat{\beta_1} \\ \hat{\beta_2} \end{bmatrix} &= \begin{bmatrix} X_1'Y \\ X_2'Y \end{bmatrix} \end{align*}
Substituting \(X_1 = X_2 \alpha\) into the \((X'X)\) matrix:
\begin{align*} (X'X) &= \begin{bmatrix} (X_2\alpha)'X_2\alpha & (X_2\alpha)'X_2 \\ X_2'(X_2\alpha) & X_2'X_2 \end{bmatrix} \\ &= \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \end{align*}
This yields the determinant:
\begin{align*} \det (X'X) &= \det\left( \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right) \end{align*}
Transforming the matrix with a row operation (premultiplying by a block-triangular matrix whose determinant is 1, so the determinant is unchanged): \begin{align*} \det (X'X) &= \det\left( \begin{bmatrix} I & -\alpha' \\ 0 & I \end{bmatrix} \begin{bmatrix} \alpha'X_2'X_2\alpha & \alpha'X_2'X_2 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right) \\ &= \det\left( \begin{bmatrix} 0 & 0 \\ X_2'X_2\alpha & X_2'X_2 \end{bmatrix} \right) \end{align*}
From this we can see that: \begin{align*} \det (X'X) &= 0 \end{align*}
Because of this, \((X'X)\) is not invertible and the OLS estimator is not uniquely defined: the normal equations admit infinitely many solutions.
Practical application#
We can easily illustrate what we have explained theoretically with a practical application. We will create a dataset simulating the regressors we might want to include in a linear model.
The first 9 variables are drawn from a standard normal distribution.
using LinearAlgebra
using Random
Random.seed!(0)
matrix = randn(10, 9)
10×9 Matrix{Float64}:
-0.231909 -0.269885 -1.07458 … -0.57796 0.27262 0.849341
0.94039 0.578004 -1.38197 -1.78676 1.23521 -0.210716
0.596762 1.16034 0.951853 -1.36705 1.11072 -0.143171
1.99782 0.287888 -0.292291 0.660498 1.85158 -0.121578
-0.0515618 -0.441193 0.328946 0.0371086 -0.131821 0.47354
1.17224 0.134576 -0.495999 … 0.877644 -0.406776 -0.0923911
-1.69681 0.165543 0.22277 1.79296 -0.379297 -0.550621
-2.11615 -1.0721 -1.66125 -1.35785 -0.499403 -0.474732
0.558366 -1.26016 1.92524 -1.48326 -0.635276 0.237598
-0.86473 0.196408 -1.04847 0.475453 2.28402 0.100588
However, the 10th variable is a linear combination (the sum) of variables 1, 5 and 9.
coefficients = [1, 1, 1]
selected_columns = [1, 5, 9]
matrix = hcat(matrix, matrix[:, selected_columns] * coefficients)
10×10 Matrix{Float64}:
-0.231909 -0.269885 -1.07458 … 0.27262 0.849341 0.100641
0.94039 0.578004 -1.38197 1.23521 -0.210716 -0.828557
0.596762 1.16034 0.951853 1.11072 -0.143171 -0.510355
1.99782 0.287888 -0.292291 1.85158 -0.121578 3.1357
-0.0515618 -0.441193 0.328946 -0.131821 0.47354 -0.951298
1.17224 0.134576 -0.495999 … -0.406776 -0.0923911 0.815082
-1.69681 0.165543 0.22277 -0.379297 -0.550621 -1.00174
-2.11615 -1.0721 -1.66125 -0.499403 -0.474732 -2.84605
0.558366 -1.26016 1.92524 -0.635276 0.237598 2.11939
-0.86473 0.196408 -1.04847 2.28402 0.100588 -0.864591
As we saw in the theory, this should make the determinant of our data matrix zero, rendering it singular, so trying to invert it should produce an error.
det(matrix)
-2.548968811760904e-14
inv_matrix = inv(matrix)
10×10 Matrix{Float64}:
4.21177e15 3.21803e15 2.84373e15 … 5.35458e15 2.48399e15
0.35937 0.106139 0.159937 -0.497513 -0.629165
-1.9799 -1.79593 -0.971094 -2.79778 -1.85628
0.382724 -0.282216 0.588743 0.183997 0.0466157
4.21177e15 3.21803e15 2.84373e15 5.35458e15 2.48399e15
-3.02412 -2.49823 -2.12259 … -4.6385 -2.89822
-1.40691 -1.3032 -0.896793 -2.0195 -0.996493
-1.28789 -0.920547 -0.753483 -1.62345 -0.682825
4.21177e15 3.21803e15 2.84373e15 5.35458e15 2.48399e15
-4.21177e15 -3.21803e15 -2.84373e15 -5.35458e15 -2.48399e15
We can see that this is not the case. Is this a contradiction of our theoretical proof? Julia, like Python, yields a determinant extremely close to zero but not exactly equal to zero, so it is still able to compute a "supposed" inverse of the matrix. This would seem to contradict what we proved in theory; however, it is not a contradiction but rather an illustration of how computational environments deal with the inherent limitations of floating-point arithmetic.
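In practice, rather than relying on det() or inv(), which are sensitive to floating-point rounding, it is safer to inspect the numerical rank and condition number of the matrix we just built (rank, cond and svdvals are standard LinearAlgebra functions; this is only a quick diagnostic sketch):

using LinearAlgebra

rank(matrix)              # 9: only 9 of the 10 columns are linearly independent
cond(matrix)              # an enormous condition number flags (near-)perfect collinearity
minimum(svdvals(matrix))  # the smallest singular value is numerically zero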
Analyzing RCT data with Precision Adjustment#
using DataFrames
using DelimitedFiles   # readdlm, used to load the dataset below
using FilePaths
using Queryverse
using GLM
using StatsModels
using Combinatorics
using CategoricalArrays
using HypothesisTests
using StatsBase
using Lasso
using TypedTables
using MacroTools
using NamedArrays
using DataTables
using Latexify
using PrettyTables
using TexTables
using Plots            # histogram, scatter, hline! and the other plotting calls below
using HDMjl
mat, head = readdlm("../../../data/penn_jae.dat", header=true, Float64)
mat
df = DataFrame(mat, vec(head))
penn = filter(row -> row[:tg] in [2,0], df)
# Treatment group n°2
replace!(penn.tg, 2 => 1)
rename!(penn, "tg" => "T2")
# from float to string
penn[!,:dep] = string.(penn[!,:dep])
# dep variable in categorical format
penn[!,:dep] = categorical(penn[!,:dep])
Histogram#
treat = filter(row -> row[:T2] in [1], penn)
treat= select(treat, [:inuidur1])
treatmat=Matrix(treat)
notreat = filter(row -> row[:T2] in [0], penn)
notreat= select(notreat, [:inuidur1])
notreatmat=Matrix(notreat)
Treated individuals
histogram(treatmat, bins=15, normalize=:pdf, color=:gray)
xlims!(0,60)
title!("Distribution of weeks unemployed for treated individuals")
xlabel!("Weeks unemployed")
ylabel!("P(x)")
Control individuals
histogram(notreatmat, bins=15, normalize=:pdf, color=:gray)
xlims!(0,60)
title!("Distribution of weeks unemployed for control individuals")
xlabel!("Weeks unemployed")
ylabel!("P(x)")
1 Classical 2-Sample Approach (CL)
alpha=0.05
# Calculate the t-statistic and p-value
p_value = pvalue(UnequalVarianceTTest(vec(treatmat), vec(notreatmat)))
reject_H0 = p_value < alpha
alpha, p_value, reject_H0
ols_cl = lm(@formula(log(inuidur1) ~ T2), penn)
table1 = regtable( "No adjustment model" => ols_cl)
#coefic=coef(ols_cl)
2 Classical Linear Regression Adjustment (CRA)
reg2 = @formula(log(inuidur1) ~ T2 + (female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2)
reg2 = apply_schema(reg2, schema(reg2, penn))
ols_cra = lm(reg2, penn)
table2 = regtable("CRA model" => ols_cra)
3 Interactive Regression Adjustment (IRA)
# demean function
function desv_mean(a)
    A = mean(a, dims = 1)                       # column means
    M = zeros(Float64, size(a, 1), size(a, 2))
    for i in 1:size(a, 2)
        M[:, i] = a[:, i] .- A[i]               # subtract each column's mean
    end
    return M
end
reg1 = @formula(T2 ~ (female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2)
reg1 = apply_schema(reg1, schema(reg1, penn))
# Matrix Model & demean
X = StatsModels.modelmatrix(reg1.rhs,penn)
X = desv_mean(X) # matrix format
Y = select(penn, [:inuidur1, :T2]) # select inuidur1 and T2
X = DataFrame(hcat(X, Matrix(select(penn, [:T2])) .* X), :auto) # join X and (T2*X)
base = hcat(Y, X) # join inuidur1, T2, X and (T2*X)
base.inuidur1 = log.(base.inuidur1) # log(inuidur1)
terms = term.(names(base)) # term.() turns every column name into a model term
#interactive regression model
ols_ira = lm(terms[1] ~ sum(terms[2:end]), base)
table3 = regtable("Interactive model" => ols_ira)
#terms[1] : selects the first variable, in this case the outcome of interest
#sum(terms[2:end]) : the remaining variables enter as regressors in the linear regression
4 Interactive Regression Adjustment Using Lasso (IRA using Lasso)
X = StatsModels.modelmatrix(reg2.rhs,penn)
X = desv_mean(X)
D = DataFrame([X[:,1]], :auto) # Treatment variable
rename!(D, Dict(:x1 => :T2)) # rename x1 -> T2
X = DataFrame(hcat(X[:,2:end], X[:,1] .* X[:,2:end]), :auto) # join controls (X) with the interactions T2*X
Y = select(penn, [:inuidur1]) #select just inuidur1
Y.inuidur1 = log.(Y.inuidur1) # log(inuidur1)
#terms = term.(names(base)) # all terms
#model = terms[1] ~ sum(terms[2:end])
#lazso = fit(LassoModel,terms[1] ~ sum(terms[2:end]), base; standardize=false, α = 0.1)
#coef(lasso)[2]
#table3 = regtable("Lasso adjustment model" => lasso)
D_reg_0 = rlasso_arg( X, D, nothing, true, true, true, false, false,
nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )
D_resid = rlasso(D_reg_0)
D_resid = rlasso(D_reg_0)["residuals"]
Y_reg_0 = rlasso_arg( X, Y, nothing, true, true, true, false, false,
nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )
Y_resid = rlasso(Y_reg_0)["residuals"]
D_resid = reshape(D_resid, length(D_resid), 1)
Lasso_ira = lm(D_resid, Y_resid)
Plot coefficients#
#T2*female 1
#T2*black 2
#T2*agelt35 4
#T2*factor(dep)1 11
# 23 var:
# Intercept, T2, 120INTERACTIONS ...,T2*female,T2*black, ... T2*INTERACTION
# 1 , 2, ... 122, 123 , 124 , ... 240
#coef(ols_ira)[1]
95% confidence
# Extract the coefficients and standard errors of the selected interaction terms from the IRA regression
coefs = Dict("T2*fem" => coef(ols_ira)[122],
"T2*black" => coef(ols_ira)[123],
"T2*agelt35" => coef(ols_ira)[125],
"T2*f(dep)1" => coef(ols_ira)[132])
ses = Dict("T2*fem" => stderror(ols_ira)[122],
"T2*black" => stderror(ols_ira)[123],
"T2*agelt35" => stderror(ols_ira)[125],
"T2*f(dep)1" => stderror(ols_ira)[132])
# Scatter plot with error bars
scatter_plot = scatter(coefs, yerr=1.96 .* collect(values(ses)), legend=false,
xlabel="Coefficient", ylabel="Value", title="Coefficients using IRA(95%)",
markershape=:circle, markercolor=:blue, markerstrokecolor=:black,
markersize=8, markeralpha=0.8)
# Horizontal line at y = 0
hline!(scatter_plot, [0], color="gray", linestyle=:dash)
90% confidence
# Extract the coefficients and standard errors of the selected interaction terms from the IRA regression
coefs = Dict("T2*fem" => coef(ols_ira)[122],
"T2*black" => coef(ols_ira)[123],
"T2*agelt35" => coef(ols_ira)[125],
"T2*f(dep)1" => coef(ols_ira)[132])
ses = Dict("T2*fem" => stderror(ols_ira)[122],
"T2*black" => stderror(ols_ira)[123],
"T2*agelt35" => stderror(ols_ira)[125],
"T2*f(dep)1" => stderror(ols_ira)[132])
# Scatter plot with error bars
scatter_plot = scatter(coefs, yerr=1.645 .* collect(values(ses)), legend=false,
xlabel="Coefficient", ylabel="Value", title="Coefficients using IRA(90%)",
markershape=:circle, markercolor=:blue, markerstrokecolor=:black,
markersize=8, markeralpha=0.8)
# Horizontal line at y = 0
hline!(scatter_plot, [0], color="gray", linestyle=:dash)
No Significant Interaction Effects:
Our analysis found no statistically significant interaction effects between the incentive program and other factors at the 90% or 95% confidence level. This suggests there’s no evidence of heterogeneity (variation in effects) within subgroups defined by these factors.
Marginally Positive Effect for Dependents = 1:
However, there’s a trend worth mentioning. Individuals with one dependent (e.g., one child) showed a marginally positive (but not statistically significant) effect. This could tentatively suggest they might be slightly more likely to remain unemployed when receiving the incentive.
Similar Trend for Under 35:
A similar trend was observed for individuals under 35.
Negative Estimates for Women and Black People:
Interestingly, women and Black people displayed negative point estimates. This means they might be more inclined to seek employment sooner upon receiving the incentive program’s benefits.
Key Points:
No statistically significant interaction effects were found between the incentive program and other factors. Individuals with one dependent and those under 35 showed a slight (not significant) tendency to remain unemployed longer with the incentive. Women and Black people seemed more likely to seek employment sooner when offered the incentive.
A crash course in good and bad controls#
In this section, we will explore different scenarios where we need to decide whether the inclusion of a control variable, denoted by Z, will help (or not) to improve the estimation of the average treatment effect (ATE) of treatment X on outcome Y. The effect of observed variables will be represented by a solid line, while that of unobserved variables will be represented by a dashed line.
# import Pkg; Pkg.add("GraphPlot")
# import Pkg; Pkg.add("Compose")
# import Pkg; Pkg.add("Cairo")
# import Pkg; Pkg.add("Fontconfig")
# import Pkg; Pkg.add("Colors")
# import Pkg; Pkg.add("Graphs")
# Pkg.add("CausalInference")
# Libraries
using Graphs, GraphPlot
using Compose, Cairo, Fontconfig
using Colors
using DataFrames, GLM
using Distributions   # Normal, used to simulate the data below
import Random
Good control (Blocking back-door paths)#
Model 1
We will assume that X measures whether or not the student attends an extra tutoring session, which affects the student's grade (Y). Then we have another observable variable, the student's hours of sleep (Z), that affects both X and Y. Theory says that when we control for Z, we block the back-door path from X to Y. Thus we see that, in the second regression, the coefficient on X is closer to the true one (3.03339 ≈ 3).
# Define the graph with 3 nodes
g = SimpleDiGraph(3)
add_edge!(g, 1, 2) # X -> Y
add_edge!(g, 3, 1) # Z -> X
add_edge!(g, 3, 2) # Z -> Y
# Node labels for X, Y, Z
nodelabel = ["X", "Y", "Z"]
# Define node colors based on labels
nodefillc_dict = Dict("X" => colorant"turquoise",
"Y" => colorant"turquoise",
"Z" => colorant"turquoise")
nodefillc = [nodefillc_dict[label] for label in nodelabel]
# Create the graph plot
g_plot = gplot(g,
nodefillc = nodefillc,
nodestrokec = "black",
nodestrokelw = 0,
nodelabel = nodelabel,
NODESIZE = 0.15,
NODELABELSIZE = 10,
edgestrokec = "black")
# Save and display the plot
draw(PNG("causal_discovery_1.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)
# Generate data
n = 1000
Z = rand(Normal(0, 1), n)
X = 5 * Z + rand(Normal(0, 1), n)
Y = 3 * X + 1.5 * Z + rand(Normal(0, 1), n)
# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y)
| | Z | X | Y |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
1 | 0.121698 | -0.646629 | -2.17627 |
2 | -0.0329104 | 2.15083 | 5.12964 |
3 | -1.86593 | -8.56013 | -28.4291 |
4 | -0.402004 | -3.37725 | -10.4183 |
5 | 0.298821 | 2.03012 | 5.97373 |
6 | 1.88838 | 8.48632 | 29.592 |
7 | -0.998066 | -4.56851 | -14.6436 |
8 | -1.11586 | -4.5261 | -15.5884 |
9 | -0.558549 | -2.6828 | -10.8751 |
10 | 0.376278 | 2.72978 | 8.67289 |
11 | 1.73838 | 7.59541 | 25.222 |
12 | -0.222023 | -1.21483 | -3.54479 |
13 | 1.05565 | 5.00518 | 15.7835 |
14 | -0.293296 | -2.40651 | -8.40434 |
15 | -0.653213 | -4.72798 | -14.8385 |
16 | -0.426636 | -1.59728 | -5.77415 |
17 | -0.875714 | -6.23685 | -20.3148 |
18 | -0.625727 | -2.71617 | -7.12168 |
19 | -1.90824 | -10.5502 | -35.426 |
20 | -0.753929 | -1.1718 | -5.8239 |
21 | -1.46926 | -5.98103 | -21.324 |
22 | 1.36688 | 4.56468 | 15.0488 |
23 | 2.25611 | 11.473 | 37.9483 |
24 | 0.617481 | 2.54765 | 8.29642 |
25 | 0.39143 | 0.59849 | 1.8718 |
26 | -2.14499 | -11.1892 | -37.0807 |
27 | 0.833313 | 4.25772 | 15.3056 |
28 | -0.947989 | -4.69399 | -17.6018 |
29 | -0.0380226 | 0.180093 | -1.74691 |
30 | 0.581013 | 3.39011 | 12.4434 |
⋮ | ⋮ | ⋮ | ⋮ |
# Wrong regression: not controlling for the confounder Z
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.021246 0.032701 0.65 0.5160 -0.0429246 0.0854166
X 3.29515 0.00623332 528.63 <1e-99 3.28291 3.30738
──────────────────────────────────────────────────────────────────────────
# Correct regression
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + Z
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.0102651 0.0315753 0.33 0.7452 -0.0516965 0.0722268
X 3.03339 0.0307871 98.53 <1e-99 2.97297 3.0938
Z 1.35458 0.156251 8.67 <1e-16 1.04796 1.6612
──────────────────────────────────────────────────────────────────────────
Model 2
We will assume that X stands for police salaries, which affect the crime rate (Y). Then we have another observable variable, the supply of police officers (Z), that affects X but not Y directly. Additionally, there is an unobservable variable (U), the preference for maintaining civil order, that affects Z and Y. Theory says that controlling for Z blocks the back-door path X ← Z ← U → Y, so the adjusted estimate is centered on the true effect (0.5), whereas the unadjusted estimate (0.515) is biased upward. Note, however, that because X and Z are almost perfectly collinear in this simulation, the adjusted estimate is far noisier (its standard error jumps from 0.002 to 0.031), which is why this particular draw lands at 0.42.
# Define the graph with 4 nodes: U, Z, X, Y
g = SimpleDiGraph(4)
add_edge!(g, 1, 4) # U -> Y
add_edge!(g, 1, 2) # U -> Z
add_edge!(g, 2, 3) # Z -> X
add_edge!(g, 3, 4) # X -> Y
# Node labels for U, Z, X, Y
nodelabel = ["U", "Z", "X", "Y"]
# Define node colors based on labels
nodefillc_dict = Dict("U" => colorant"lightgrey",
"Z" => colorant"turquoise",
"X" => colorant"turquoise",
"Y" => colorant"turquoise")
nodefillc = [nodefillc_dict[label] for label in nodelabel]
# Create the graph plot
g_plot = gplot(g,
nodefillc = nodefillc,
nodestrokec = "black",
nodestrokelw = 0,
nodelabel = nodelabel,
NODESIZE = 0.15,
NODELABELSIZE = 10,
edgestrokec = "black")
# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)
# Generate data
n = 1000
U = rand(Normal(0, 1), n)
Z = 7 * U + rand(Normal(0, 1), n)
X = 2 * Z + rand(Normal(0, 1), n)
Y = 0.5 * X + 0.2 * U + rand(Normal(0, 1), n)
# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y, U = U)
| | Z | X | Y | U |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
1 | -0.403233 | -1.2254 | -2.2511 | 0.121698 |
2 | 2.08501 | 2.89654 | 1.13287 | -0.0329104 |
3 | -12.292 | -24.5338 | -12.1042 | -1.86593 |
4 | -4.18126 | -8.04607 | -4.90671 | -0.402004 |
5 | 2.62777 | 4.69066 | 2.44994 | 0.298821 |
6 | 12.2631 | 25.8266 | 13.0174 | 1.88838 |
7 | -6.56464 | -12.5702 | -6.84891 | -0.998066 |
8 | -6.75783 | -13.8519 | -7.02647 | -1.11586 |
9 | -3.7999 | -9.58863 | -4.76023 | -0.558549 |
10 | 3.48233 | 6.8838 | 4.36792 | 0.376278 |
11 | 11.0722 | 21.9725 | 10.9841 | 1.73838 |
12 | -1.65888 | -2.88502 | -1.31234 | -0.222023 |
13 | 7.11649 | 13.4175 | 7.98643 | 1.05565 |
14 | -2.9931 | -6.73107 | -3.86902 | -0.293296 |
15 | -6.03441 | -11.7435 | -5.76025 | -0.653213 |
16 | -2.45055 | -5.24346 | -2.72531 | -0.426636 |
17 | -7.98828 | -16.2672 | -8.1003 | -0.875714 |
18 | -3.96763 | -5.96982 | -2.35263 | -0.625727 |
19 | -14.3667 | -29.6464 | -15.1005 | -1.90824 |
20 | -2.67965 | -6.53693 | -4.24367 | -0.753929 |
21 | -8.91955 | -19.0161 | -8.64774 | -1.46926 |
22 | 7.29844 | 13.9013 | 6.88151 | 1.36688 |
23 | 15.9852 | 32.1156 | 16.2189 | 2.25611 |
24 | 3.78261 | 7.29248 | 2.91277 | 0.617481 |
25 | 1.38135 | 2.25189 | 1.29255 | 0.39143 |
26 | -15.4791 | -31.254 | -16.412 | -2.14499 |
27 | 5.92434 | 13.1312 | 8.51129 | 0.833313 |
28 | -6.58996 | -15.2778 | -6.27984 | -0.947989 |
29 | 0.104048 | -2.02206 | -0.578553 | -0.0380226 |
30 | 4.55213 | 10.5058 | 3.95193 | 0.581013 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
# Regression without Z: the back-door path X <- Z <- U -> Y stays open
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) -0.0337287 0.0307735 -1.10 0.2733 -0.0941169 0.0266596
X 0.514806 0.00210354 244.73 <1e-99 0.510678 0.518934
────────────────────────────────────────────────────────────────────────────
# Controlling for Z blocks the back-door path through U
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + Z
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) -0.0327959 0.0306374 -1.07 0.2847 -0.092917 0.0273253
X 0.417904 0.0307391 13.60 <1e-38 0.357583 0.478225
Z 0.195056 0.0617315 3.16 0.0016 0.073917 0.316194
──────────────────────────────────────────────────────────────────────────
Bad Control (M-bias)#
Model 7
Let us suppose that X stands for a job training program aimed at reducing unemployment. There is a first unobserved confounder, the planning effort and design quality of the program (U1), that directly affects both participation in the program (X) and the proximity of job programs (Z, which will be the bad control). Furthermore, we have another unobserved confounder (U2), the soft skills of the unemployed, that affects both employment status (Y) and the likelihood of being near a job training program (Z). Z is therefore a collider: conditioning on it opens the path X ← U1 → Z ← U2 → Y, which is why including Z in the second regression moves the estimated coefficient on X further from the true one (2.953 vs. 2.989, with a true value of 3).
# Define the graph with 5 nodes: U1, U2, Z, X, Y
g = SimpleDiGraph(5)
add_edge!(g, 1, 3) # U1 -> Z
add_edge!(g, 1, 4) # U1 -> X
add_edge!(g, 2, 3) # U2 -> Z
add_edge!(g, 2, 5) # U2 -> Y
add_edge!(g, 4, 5) # X -> Y
# Node labels for U1, U2, Z, X, Y
nodelabel = ["U1", "U2", "Z", "X", "Y"]
# Define node colors based on labels
nodefillc_dict = Dict("U1" => colorant"lightgrey",
"U2" => colorant"lightgrey",
"Z" => colorant"turquoise",
"X" => colorant"turquoise",
"Y" => colorant"turquoise")
nodefillc = [nodefillc_dict[label] for label in nodelabel]
# Create the graph plot
g_plot = gplot(g,
nodefillc = nodefillc,
nodestrokec = "black",
nodestrokelw = 0,
nodelabel = nodelabel,
NODESIZE = 0.15,
NODELABELSIZE = 10,
edgestrokec = "black")
# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)
# Generate data
n = 1000
U1 = rand(Normal(0, 1), n)
U2 = rand(Normal(0, 1), n)
Z = 0.3 * U1 + 0.9 * U2 + rand(Normal(0, 1), n)
X = 4 * U1 + rand(Normal(0, 1), n)
Y = 3 * X + U2 + rand(Normal(0, 1), n)
# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y, U1 = U1, U2 = U2)
| | Z | X | Y | U1 | U2 |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 |
1 | -1.51203 | -1.17595 | -5.82291 | 0.121698 | -1.25512 |
2 | 0.800487 | -0.440459 | 1.23247 | -0.0329104 | 2.31538 |
3 | 0.182939 | -6.92784 | -20.038 | -1.86593 | 0.769514 |
4 | -1.03465 | -2.41129 | -8.13353 | -0.402004 | -1.36723 |
5 | 0.00718558 | 1.24013 | 4.10467 | 0.298821 | 0.53602 |
6 | 1.00698 | 7.27991 | 19.8046 | 1.88838 | -0.955557 |
7 | 0.639282 | -4.35645 | -12.425 | -0.998066 | 0.421824 |
8 | 0.276855 | -4.34078 | -11.7458 | -1.11586 | 1.0532 |
9 | -2.05745 | -2.0884 | -6.33028 | -0.558549 | 0.109943 |
10 | 0.79557 | 2.35588 | 7.04324 | 0.376278 | 0.848389 |
11 | -0.637144 | 6.60372 | 17.6775 | 1.73838 | -1.09647 |
12 | 0.27188 | -0.713512 | -1.33155 | -0.222023 | -0.104714 |
13 | -0.744564 | 5.28917 | 15.6213 | 1.05565 | -0.273081 |
14 | -1.67887 | -1.61801 | -5.69842 | -0.293296 | -0.940029 |
15 | -1.18641 | -2.37069 | -6.84398 | -0.653213 | -1.46192 |
16 | 0.0119644 | -1.7248 | -5.99998 | -0.426636 | 0.535899 |
17 | -2.22583 | -3.29441 | -11.8443 | -0.875714 | -1.85828 |
18 | 2.14893 | -1.74549 | -2.56299 | -0.625727 | 0.412462 |
19 | -2.39359 | -7.52863 | -25.1647 | -1.90824 | -1.00899 |
20 | 0.934262 | -3.84013 | -7.58954 | -0.753929 | 2.59785 |
21 | -0.389081 | -4.72286 | -12.8162 | -1.46926 | 1.36527 |
22 | -2.32821 | 5.12497 | 12.7869 | 1.36688 | -2.26971 |
23 | 0.995178 | 8.73438 | 25.0159 | 2.25611 | 0.192415 |
24 | -0.573286 | 1.61296 | 5.24715 | 0.617481 | -0.539758 |
25 | -1.61618 | 1.65404 | 2.68183 | 0.39143 | -1.35866 |
26 | -1.35704 | -8.93594 | -28.2335 | -2.14499 | -0.464224 |
27 | 1.61451 | 5.11229 | 15.2319 | 0.833313 | 0.0911555 |
28 | -2.34086 | -2.24332 | -6.4752 | -0.947989 | 0.0459587 |
29 | -1.90838 | 0.287992 | 2.68483 | -0.0380226 | 0.370206 |
30 | 2.01237 | 0.906884 | 2.54461 | 0.581013 | 0.485038 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
# Regression without Z (here the correct specification)
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) -0.0121698 0.0470352 -0.26 0.7959 -0.104469 0.0801295
X 2.98862 0.0110309 270.93 <1e-99 2.96698 3.01027
───────────────────────────────────────────────────────────────────────────
# Controlling for the collider Z opens the path X <- U1 -> Z <- U2 -> Y (bad control)
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + Z
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 0.0022004 0.0409946 0.05 0.9572 -0.0782452 0.082646
X 2.95344 0.00981324 300.96 <1e-99 2.93418 2.9727
Z 0.526533 0.0295594 17.81 <1e-61 0.468528 0.584539
───────────────────────────────────────────────────────────────────────────
Neutral Control (possibly good for precision)#
Model 8
In this scenario, we will assume that X represents the implementation of a new government policy providing subsidies and guidance to small companies. There is another variable, Z, that stands for the inflation rate. Both X and Z affect Y, which represents the country's GDP growth rate. Then, even though Z does not affect X, including it improves the precision of the ATE estimator: the standard error of the X coefficient falls from 0.163 to 0.031, and the point estimate (8.63339) is closer to the true value of 8.6.
# Define the graph with 3 nodes: Z, X, Y
g = SimpleDiGraph(3)
add_edge!(g, 1, 3) # Z -> Y
add_edge!(g, 2, 3) # X -> Y
# Node labels for Z, X, Y
nodelabel = ["Z", "X", "Y"]
# Define node colors based on labels
nodefillc_dict = Dict("Z" => colorant"turquoise",
"X" => colorant"turquoise",
"Y" => colorant"turquoise")
nodefillc = [nodefillc_dict[label] for label in nodelabel]
# Create the graph plot
g_plot = gplot(g,
nodefillc = nodefillc,
nodestrokec = "black",
nodestrokelw = 0,
nodelabel = nodelabel,
NODESIZE = 0.15,
NODELABELSIZE = 10,
edgestrokec = "black")
# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)
# Generate data
n = 1000
Z = rand(Normal(0, 1), n)
X = rand(Normal(0, 1), n)
Y = 8.6 * X + 5 * Z + rand(Normal(0, 1), n)
# Create dataframe
df = DataFrame(Z = Z, X = X, Y = Y)
| | Z | X | Y |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
1 | 0.121698 | -1.25512 | -10.6044 |
2 | -0.0329104 | 2.31538 | 18.4743 |
3 | -1.86593 | 0.769514 | -2.66167 |
4 | -0.402004 | -1.36723 | -13.4518 |
5 | 0.298821 | 0.53602 | 5.539 |
6 | 1.88838 | -0.955557 | 2.52455 |
7 | -0.998066 | 0.421824 | -0.803581 |
8 | -1.11586 | 1.0532 | 3.14196 |
9 | -0.558549 | 0.109943 | -3.83607 |
10 | 0.376278 | 0.848389 | 9.09667 |
11 | 1.73838 | -1.09647 | -0.909568 |
12 | -0.222023 | -0.104714 | -1.57793 |
13 | 1.05565 | -0.273081 | 2.11428 |
14 | -0.293296 | -0.940029 | -10.2956 |
15 | -0.653213 | -1.46192 | -15.5133 |
16 | -0.426636 | 0.535899 | 2.13319 |
17 | -0.875714 | -1.85828 | -20.6504 |
18 | -0.625727 | 0.412462 | 2.38397 |
19 | -1.90824 | -1.00899 | -19.1316 |
20 | -0.753929 | 2.59785 | 17.3942 |
21 | -1.46926 | 1.36527 | 3.21796 |
22 | 1.36688 | -2.26971 | -13.3806 |
23 | 2.25611 | 0.192415 | 13.0805 |
24 | 0.617481 | -0.539758 | -1.82726 |
25 | 0.39143 | -1.35866 | -10.2382 |
26 | -2.14499 | -0.464224 | -15.013 |
27 | 0.833313 | 0.0911555 | 6.23298 |
28 | -0.947989 | 0.0459587 | -6.44253 |
29 | -0.0380226 | 0.370206 | 0.763502 |
30 | 0.581013 | 0.485038 | 8.47793 |
⋮ | ⋮ | ⋮ | ⋮ |
# Regression without Z (unbiased, but less precise)
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) -0.0115351 0.167397 -0.07 0.9451 -0.340025 0.316955
X 8.5183 0.163177 52.20 <1e-99 8.19809 8.83851
──────────────────────────────────────────────────────────────────────────
# Adding Z improves the precision of the estimate
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + Z
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 0.0102651 0.0315753 0.33 0.7452 -0.0516965 0.0722268
X 8.63339 0.0307871 280.42 <1e-99 8.57297 8.6938
Z 5.02152 0.0305298 164.48 <1e-99 4.96161 5.08143
───────────────────────────────────────────────────────────────────────────
Bad Controls (Bias amplification)#
Model 10
Let us assume that X measures the implementation of a housing program for young adults buying their first house, which affects average housing prices (Y). There is another observable variable, Z, that measures the expenditure of the program and affects only X. There is also an unobservable variable (U), representing young adults' preference for moving out of their parents' house, that affects both X and Y. Including Z "amplifies the bias" coming from U, so the ATE estimator gets worse: in the second regression the estimate (0.833083) is farther from the true value (0.8) than the unadjusted estimate (0.827152).
# Define the graph with 4 nodes: U, Z, X, Y
g = SimpleDiGraph(4)
add_edge!(g, 1, 3) # U -> X
add_edge!(g, 1, 4) # U -> Y
add_edge!(g, 2, 3) # Z -> X
add_edge!(g, 3, 4) # X -> Y
# Node labels for U, Z, X, Y
nodelabel = ["U", "Z", "X", "Y"]
# Define node colors based on labels
nodefillc_dict = Dict("U" => colorant"lightgrey",
"Z" => colorant"turquoise",
"X" => colorant"turquoise",
"Y" => colorant"turquoise")
nodefillc = [nodefillc_dict[label] for label in nodelabel]
# Create the graph plot
g_plot = gplot(g,
nodefillc = nodefillc,
nodestrokec = "black",
nodestrokelw = 0,
nodelabel = nodelabel,
NODESIZE = 0.15,
NODELABELSIZE = 10,
edgestrokec = "black")
# Save and display the plot
draw(PNG("causal_graph.png", 13cm, 13cm), g_plot)
g_plot
# Set seed
Random.seed!(24)
# Generate data
n = 1000
U = rand(Normal(0, 1), n)
Z = rand(Normal(0, 1), n)
X = 3 * Z + 6 * U + rand(Normal(0, 1), n)
Y = 0.8 * X + 0.2 * U + rand(Normal(0, 1), n)
# Create dataframe
df = DataFrame(U = U, Z = Z, X = X, Y = Y)
| | U | Z | X | Y |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
1 | 0.121698 | -1.25512 | -3.4541 | -4.40168 |
2 | -0.0329104 | 2.31538 | 5.4752 | 4.06476 |
3 | -1.86593 | 0.769514 | -8.83687 | -6.90681 |
4 | -0.402004 | -1.36723 | -6.19727 | -5.84149 |
5 | 0.298821 | 0.53602 | 2.83611 | 2.3735 |
6 | 1.88838 | -0.955557 | 9.76405 | 7.91532 |
7 | -0.998066 | 0.421824 | -4.16386 | -3.89489 |
8 | -1.11586 | 1.0532 | -3.87183 | -3.19797 |
9 | -0.558549 | 0.109943 | -5.0103 | -3.97415 |
10 | 0.376278 | 0.848389 | 4.72197 | 4.7036 |
11 | 1.73838 | -1.09647 | 6.96902 | 5.57311 |
12 | -0.222023 | -0.104714 | -1.21355 | -0.840665 |
13 | 1.05565 | -0.273081 | 4.69919 | 5.03703 |
14 | -0.293296 | -0.940029 | -5.32473 | -4.76326 |
15 | -0.653213 | -1.46192 | -7.97976 | -6.27229 |
16 | -0.426636 | 0.535899 | -1.29447 | -1.13916 |
17 | -0.875714 | -1.85828 | -11.1198 | -8.86252 |
18 | -0.625727 | 0.412462 | -0.551544 | 0.191041 |
19 | -1.90824 | -1.00899 | -15.3895 | -12.5889 |
20 | -0.753929 | 2.59785 | 2.09235 | 0.698675 |
21 | -1.46926 | 1.36527 | -5.8968 | -3.85711 |
22 | 1.36688 | -2.26971 | 0.69661 | 0.488127 |
23 | 2.25611 | 0.192415 | 14.2591 | 11.5684 |
24 | 0.617481 | -0.539758 | 1.81287 | 0.716825 |
25 | 0.39143 | -1.35866 | -2.23822 | -1.62397 |
26 | -2.14499 | -0.464224 | -14.5583 | -12.4317 |
27 | 0.833313 | 0.0911555 | 6.55582 | 7.19036 |
28 | -0.947989 | 0.0459587 | -7.64789 | -4.75927 |
29 | -0.0380226 | 0.370206 | -1.34767 | -0.645662 |
30 | 0.581013 | 0.485038 | 6.34273 | 3.77322 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
# Regression without the bias-amplifying control Z
lm(@formula(Y ~ X), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X
Coefficients:
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) -0.0317809 0.0309595 -1.03 0.3049 -0.0925341 0.0289723
X 0.827152 0.00444388 186.13 <1e-99 0.818432 0.835872
────────────────────────────────────────────────────────────────────────────
lm(@formula(Y ~ X + Z), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
Y ~ 1 + X + Z
Coefficients:
─────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) -0.0349805 0.0308707 -1.13 0.2574 -0.0955594 0.0255984
X 0.833083 0.00489468 170.20 <1e-99 0.823478 0.842688
Z -0.0946024 0.0332626 -2.84 0.0045 -0.159875 -0.0293297
─────────────────────────────────────────────────────────────────────────────