Lab 1 - Julia Code#

Math Demonstrations#

1. Prove the Frisch-Waugh-Lovell theorem#

Given the model:

\[ \begin{align*} y &= D \beta_1 + W \beta_2 + \mu \end{align*} \]

where \(y\) is an \(n \times 1\) vector, \(D\) is an \(n \times k_1\) matrix, \(\beta_1\) is a \(k_1 \times 1\) vector, \(W\) is an \(n \times k_2\) matrix, \(\beta_2\) is a \(k_2 \times 1\) vector, and \(\mu\) is an \(n \times 1\) vector of error terms.

We will show that \(\beta_1\) equals the coefficient \(\phi\) in the following residual-on-residual regression, where the residuals \(\epsilon_y\) and \(\epsilon_D\) are defined below:

\[ \begin{align*} \epsilon_y &= \epsilon_D \phi + \xi \end{align*} \]

Running \(y\) on \(W\), we get:

\[ \begin{align*} y &= W\hat{\alpha}_1 + \epsilon_y \iff \epsilon_y = y - W\hat{\alpha}_1 \end{align*} \]

Similarly, running \(D\) on \(W\) gives us:

\[ \begin{align*} D &= W\hat{\alpha}_2 + \epsilon_D \iff \epsilon_D = D - W\hat{\alpha}_2 \end{align*} \]

Running \(\epsilon_y\) on \(\epsilon_D\):

\[\begin{split} \begin{align*} y - W \hat{\alpha}_1 &= (D - W \hat{\alpha}_2) \phi + \xi \\ y &= W \hat{\alpha}_1 + (D - W \hat{\alpha}_2) \phi + \xi \\ y &= W \hat{\alpha}_1 + D \phi - W \hat{\alpha}_2 \phi + \xi \\ y &= D \phi + W (\hat{\alpha}_1 - \hat{\alpha}_2 \phi) + \xi \end{align*} \end{split}\]

Since the OLS residuals \(\epsilon_D\) and \(\xi\) are each orthogonal to \(W\) (and \(\xi\) is orthogonal to \(\epsilon_D\) by construction), this last equation is exactly the OLS fit of the original model. Comparing it with the original model, we can see that:

\[\begin{split} \begin{align*} \beta_1 &= \phi \\ \beta_2 &= \hat{\alpha}_1 - \hat{\alpha}_2 \phi \\ \mu &= \xi \end{align*} \end{split}\]
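As a quick numerical sanity check of the theorem, we can simulate data and compare the full-regression coefficient with the residual-on-residual coefficient. The sketch below uses a made-up data-generating process purely for illustration.

# numerical check of the FWL theorem on simulated data (illustrative sketch)
using Random, LinearAlgebra
Random.seed!(1)

n = 500
d = randn(n)                        # variable of interest
W = [ones(n) randn(n, 3)]           # controls, including an intercept
y = 0.5 .* d .+ W * [1.0, -0.3, 0.2, 0.7] .+ randn(n)

beta_full = [d W] \ y               # full OLS; the first entry is beta_1
eps_y = y - W * (W \ y)             # residuals from regressing y on W
eps_d = d - W * (W \ d)             # residuals from regressing d on W
phi = eps_d \ eps_y                 # OLS of eps_y on eps_d

println(beta_full[1], " ≈ ", phi)   # the two estimates coincide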

2. Show that the Conditional Expectation Function minimizes expected squared error#

Given the model:

\[ \begin{align*} Y &= m(X) + e \end{align*} \]

where \(m(X)\) represents the conditional expectation of \(Y\) given \(X\). Let’s define an arbitrary model:

\[ \begin{align*} Y &= g(X) + w \end{align*} \]

where \(g(X)\) represents any function of \(X\).

Working with the expected squared error from the arbitrary model:

\[\begin{split} \begin{align*} E[(Y-g(X))^2] &= E[(Y-m(X) + m(X)-g(X))^2] \\ &= E[(Y-m(X))^2 + 2(Y-m(X))(m(X)-g(X)) + (m(X)-g(X))^2] \\ &= E[e^2] + 2E[(Y-m(X))(m(X)-g(X))] + E[(m(X)-g(X))^2] \end{align*} \end{split}\]

Using the law of iterated expectations:

\[ \begin{align*} E[(Y-g(X))^2] &= E[e^2] + 2E[E[(Y-m(X))(m(X)-g(X)) \mid X]] + E[(m(X)-g(X))^2] \end{align*} \]

Since \(m(X)\) and \(g(X)\) are functions of \(X\), the term \((m(X)-g(X))\) can be thought of as constant when conditioning on \(X\). Thus:

\[ \begin{align*} E[(Y-g(X))^2] &= E[e^2] + 2E[E[Y-m(X) \mid X](m(X)-g(X))] + E[(m(X)-g(X))^2] \end{align*} \]

It is important to note that \(E[Y-m(X) \mid X] = 0\) by definition of \(m(X)\), so we get:

\[ \begin{align*} E[(Y-g(X))^2] &= E[e^2] + E[(m(X)-g(X))^2] \end{align*} \]

Because the second term in the equation is always non-negative, the expression is minimized when \(g(X)\) equals \(m(X)\), in which case:

\[ \begin{align*} E[(Y-g(X))^2] &= E[e^2] \end{align*} \]
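This result can also be illustrated numerically. In the sketch below we simulate data where \(m(x) = x^2\) (a made-up choice for illustration) and compare the mean squared error of the CEF against two other candidate predictors.

# illustration: the CEF attains the smallest mean squared error (sketch)
using Random, Statistics
Random.seed!(2)

n = 100_000
X = randn(n)
Y = X .^ 2 .+ randn(n)              # here m(X) = E[Y | X] = X^2 and Var(e) = 1

mse(g) = mean((Y .- g.(X)) .^ 2)

println("g(x) = x^2  : ", mse(x -> x^2))     # ≈ 1 = E[e^2]
println("g(x) = x    : ", mse(x -> x))
println("g(x) = 2x^2 : ", mse(x -> 2 * x^2))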

Replication 1 - Code#

In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and addressed the question of how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze whether there is a difference in pay between men and women (the gender wage gap). The gender wage gap may partly reflect discrimination against women in the labor market, or it may partly reflect a selection effect, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

\[\begin{split} \begin{align} \log(Y) &= \beta'X + \epsilon\\ &= \beta_1 D + \beta_2' W + \epsilon, \end{align} \end{split}\]

where \(D\) is the indicator of being female (\(1\) if female and \(0\) otherwise) and the \(W\)’s are controls explaining variation in wages. Because wages enter in logarithms, \(\beta_1\) measures the relative difference in the pay of men and women: for small values, \(\exp(\beta_1) - 1 \approx \beta_1\).

Data Analysis#

We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set.

using Pkg

Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("Plots")
Pkg.add("CategoricalArrays")

using DataFrames
using Dates
using Plots
using Statistics, RData # RData lets us read data stored in R format
using CategoricalArrays # categorical data
rdata_read = load("../../../data/wage2015_subsample_inference.RData")
data = rdata_read["data"]
names(data)
println("Number of Rows : ", size(data)[1],"\n","Number of Columns : ", size(data)[2],) #rows and columns
Number of Rows : 5150
Number of Columns : 20

Variable description

  • occ : occupational classification

  • ind : industry classification

  • lwage : log hourly wage

  • sex : gender (1 if female, 0 if male)

  • shs : some high school

  • hsg : high school graduate

  • scl : some college

  • clg : college graduate

  • ad : advanced degree

  • ne : Northeast

  • mw : Midwest

  • so : South

  • we : West

  • exp1 : experience

describe(data)

20 rows × 7 columns

|    | variable | mean | min | median | max | nmissing | eltype |
|----|----------|------|-----|--------|-----|----------|--------|
| 1  | wage  | 23.4104  | 3.02198 | 19.2308 | 528.846 | 0 | Float64 |
| 2  | lwage | 2.97079  | 1.10591 | 2.95651 | 6.2707  | 0 | Float64 |
| 3  | sex   | 0.444466 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 4  | shs   | 0.023301 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 5  | hsg   | 0.243883 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 6  | scl   | 0.278058 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 7  | clg   | 0.31767  | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 8  | ad    | 0.137087 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 9  | mw    | 0.259612 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 10 | so    | 0.296505 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 11 | we    | 0.216117 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 12 | ne    | 0.227767 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 13 | exp1  | 13.7606  | 0.0 | 10.0 | 47.0    | 0 | Float64 |
| 14 | exp2  | 3.01893  | 0.0 | 1.0  | 22.09   | 0 | Float64 |
| 15 | exp3  | 8.23587  | 0.0 | 1.0  | 103.823 | 0 | Float64 |
| 16 | exp4  | 25.118   | 0.0 | 1.0  | 487.968 | 0 | Float64 |
| 17 | occ   |          | 10  |      | 1e+05   | 0 | CategoricalValue{String, UInt16} |
| 18 | occ2  |          | 1   |      | 22      | 0 | CategoricalValue{String, UInt8}  |
| 19 | ind   |          | 370 |      | 1e+05   | 0 | CategoricalValue{String, UInt8}  |
| 20 | ind2  |          | 2   |      | 22      | 0 | CategoricalValue{String, UInt8}  |

To start our (causal) analysis, we compare the sample means given gender:

Z = select(data, ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"])

data_female = filter(row -> row.sex == 1, data)
Z_female = select(data_female,["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] )

data_male = filter(row -> row.sex == 0, data)
Z_male = select(data_male,["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"] )

means = DataFrame( variables = names(Z), All = describe(Z, :mean)[!,2], Men = describe(Z_male,:mean)[!,2], Female = describe(Z_female,:mean)[!,2])

12 rows × 4 columns

|    | variables | All | Men | Female |
|----|-----------|-----|-----|--------|
| 1  | lwage | 2.97079  | 2.98783   | 2.94948   |
| 2  | sex   | 0.444466 | 0.0       | 1.0       |
| 3  | shs   | 0.023301 | 0.0318071 | 0.0126693 |
| 4  | hsg   | 0.243883 | 0.294303  | 0.180865  |
| 5  | scl   | 0.278058 | 0.273331  | 0.283967  |
| 6  | clg   | 0.31767  | 0.293953  | 0.347313  |
| 7  | ad    | 0.137087 | 0.106606  | 0.175186  |
| 8  | ne    | 0.227767 | 0.22195   | 0.235037  |
| 9  | mw    | 0.259612 | 0.259     | 0.260376  |
| 10 | so    | 0.296505 | 0.298148  | 0.294452  |
| 11 | we    | 0.216117 | 0.220902  | 0.210135  |
| 12 | exp1  | 13.7606  | 13.784    | 13.7313   |

In particular, the table above shows that the difference in average log wage between women and men is equal to \(-0.038\):

mean(Z_female[:,:lwage]) - mean(Z_male[:,:lwage])
-0.03834473367441493

Thus, the unconditional gender wage gap is about \(3.8\)% for the group of never-married workers (women get paid less on average in our sample). We also observe that never-married working women are relatively more educated than working men and have less work experience.

This unconditional (predictive) effect of gender equals the coefficient \(\beta\) in the univariate OLS regression of \(Y\) on \(D\):

\[ \begin{align} \log(Y) &=\beta D + \epsilon. \end{align} \]

We verify this by running an OLS regression in Julia.

# install all the packages that we may need
Pkg.add("GLM") # package to run regression models
Pkg.add("StatsPlots")
Pkg.add("MLBase")
Pkg.add("Tables")
Pkg.add("CovarianceMatrices") # robust standard errors
# Load the installed packages
using DataFrames
using CSV
using Tables
using GLM
using CovarianceMatrices
nocontrol_model = lm(@formula(lwage ~ sex), data)
nocontrol_est = GLM.coef(nocontrol_model)[2]
nocontrol_se = GLM.coeftable(nocontrol_model).cols[2][2]

nocontrol_se1 = stderror(HC1(), nocontrol_model)[2]
println("The estimated gender coefficient is ", nocontrol_est ," and the corresponding robust standard error is " ,nocontrol_se1)
The estimated gender coefficient is -0.03834473367440746 and the corresponding robust standard error is 0.015905023733117172

Next, we run an OLS regression of \(Y\) on \((D,W)\) to control for the effect of covariates summarized in \(W\):

\[ \begin{align} \log(Y) &=\beta_1 D + \beta_2' W + \epsilon. \end{align} \]

Here, we are considering the flexible model from the previous lab. Hence, \(W\) controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.

Let us run the OLS regression with controls.

OLS regression with controls#

flex = @formula(lwage ~ sex + (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
control_model = lm(flex , data)
control_est = GLM.coef(control_model)[2]
control_se = GLM.coeftable(control_model).cols[2][2]
control_se1 = stderror( HC0(), control_model)[2]
0.015000474421752112
control_model 
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

lwage ~ 1 + sex + exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + exp1 & shs + exp1 & hsg + exp1 & scl + exp1 & clg + exp1 & occ2 + exp1 & ind2 + exp1 & mw + exp1 & so + exp1 & we + exp2 & shs + exp2 & hsg + exp2 & scl + exp2 & clg + exp2 & occ2 + exp2 & ind2 + exp2 & mw + exp2 & so + exp2 & we + exp3 & shs + exp3 & hsg + exp3 & scl + exp3 & clg + exp3 & occ2 + exp3 & ind2 + exp3 & mw + exp3 & so + exp3 & we + exp4 & shs + exp4 & hsg + exp4 & scl + exp4 & clg + exp4 & occ2 + exp4 & ind2 + exp4 & mw + exp4 & so + exp4 & we

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                        Coef.  Std. Error      t  Pr(>|t|)    Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)       3.86026       0.428619    9.01    <1e-18    3.01998     4.70055
sex              -0.0695532     0.015218   -4.57    <1e-05   -0.0993874  -0.039719
exp1             -0.0677247     0.151976   -0.45    0.6559   -0.365665    0.230215
exp2              1.63629       1.69093     0.97    0.3332   -1.67868     4.95127
exp3             -0.915474      0.688025   -1.33    0.1834   -2.26431     0.433363
exp4              0.142936      0.0907569   1.57    0.1153   -0.0349885   0.32086
shs              -0.123309      0.906832   -0.14    0.8918   -1.90111     1.65449
hsg              -0.528902      0.197756   -2.67    0.0075   -0.916593   -0.141212
scl              -0.292058      0.126016   -2.32    0.0205   -0.539105   -0.0450112
clg              -0.0411641     0.0703862  -0.58    0.5587   -0.179153    0.0968245
occ2: 2           0.16134       0.129724    1.24    0.2137   -0.0929781   0.415657
occ2: 3           0.210151      0.168677    1.25    0.2129   -0.120532    0.540835
occ2: 4           0.070857      0.183717    0.39    0.6997   -0.28931     0.431024
occ2: 5          -0.396008      0.18854    -2.10    0.0357   -0.76563    -0.026385
occ2: 6          -0.231061      0.186966   -1.24    0.2166   -0.597599    0.135476
occ2: 7           0.314725      0.194152    1.62    0.1051   -0.0658997   0.69535
occ2: 8          -0.187542      0.169299   -1.11    0.2680   -0.519443    0.14436
occ2: 9          -0.339027      0.16723    -2.03    0.0427   -0.666873   -0.0111811
occ2: 10          0.0209545     0.156498    0.13    0.8935   -0.285852    0.327761
occ2: 11         -0.642418      0.30909    -2.08    0.0377   -1.24837    -0.036463
occ2: 12         -0.0674774     0.252049   -0.27    0.7889   -0.561605    0.426651
occ2: 13         -0.232978      0.231538   -1.01    0.3144   -0.686896    0.22094
occ2: 14          0.256201      0.322673    0.79    0.4272   -0.376382    0.888784
occ2: 15         -0.193858      0.259508   -0.75    0.4551   -0.702611    0.314894
occ2: 16         -0.0551256     0.147066   -0.37    0.7078   -0.34344     0.233189
occ2: 17         -0.415609      0.136114   -3.05    0.0023   -0.682454   -0.148764
occ2: 18         -0.482217      1.04435    -0.46    0.6443   -2.52962     1.56518
occ2: 19         -0.257941      0.332522   -0.78    0.4380   -0.909832    0.39395
occ2: 20         -0.30102       0.234102   -1.29    0.1986   -0.759965    0.157925
occ2: 21         -0.427181      0.220649   -1.94    0.0529   -0.859751    0.00538904
occ2: 22         -0.869453      0.297522   -2.92    0.0035   -1.45273    -0.286176
ind2: 3          -1.24737       0.645494   -1.93    0.0534   -2.51282     0.018092
ind2: 4          -0.0948281     0.463602   -0.20    0.8379   -1.0037      0.81404
ind2: 5          -0.529386      0.434599   -1.22    0.2232   -1.38139     0.322623
ind2: 6          -0.622169      0.434723   -1.43    0.1524   -1.47442     0.230082
ind2: 7          -0.50475       0.502477   -1.00    0.3152   -1.48983     0.48033
ind2: 8          -0.729544      0.467401   -1.56    0.1186   -1.64586     0.186771
ind2: 9          -0.802533      0.425246   -1.89    0.0592   -1.63621     0.0311395
ind2: 10         -0.580584      0.480878   -1.21    0.2274   -1.52332     0.362151
ind2: 11         -0.985235      0.448157   -2.20    0.0280   -1.86382    -0.106647
ind2: 12         -0.737578      0.424326   -1.74    0.0822   -1.56945     0.0942913
ind2: 13         -1.01833       0.482654   -2.11    0.0349   -1.96455    -0.0721096
ind2: 14         -0.586017      0.415903   -1.41    0.1589   -1.40137     0.229339
ind2: 15         -0.380136      0.590852   -0.64    0.5200   -1.53847     0.778198
ind2: 16         -0.570391      0.438658   -1.30    0.1936   -1.43036     0.289575
ind2: 17         -0.820184      0.425985   -1.93    0.0542   -1.6553      0.0149362
ind2: 18         -0.76136       0.423829   -1.80    0.0725   -1.59225     0.0695337
ind2: 19         -0.881282      0.456567   -1.93    0.0536   -1.77636     0.0137945
ind2: 20         -0.909902      0.44842    -2.03    0.0425   -1.78901    -0.0307984
ind2: 21         -0.758653      0.44058    -1.72    0.0851   -1.62239     0.105081
ind2: 22         -0.404077      0.432873   -0.93    0.3506   -1.2527      0.444548
mw                0.110683      0.0814463   1.36    0.1742   -0.0489879   0.270355
so                0.0224244     0.0743855   0.30    0.7631   -0.123404    0.168253
we               -0.0215659     0.0841591  -0.26    0.7978   -0.186556    0.143424
exp1 & shs       -0.191998      0.195541   -0.98    0.3262   -0.575346    0.191349
exp1 & hsg       -0.0173433     0.0572279  -0.30    0.7619   -0.129536    0.0948491
exp1 & scl       -0.0664505     0.043373   -1.53    0.1256   -0.151481    0.01858
exp1 & clg       -0.0550346     0.0310279  -1.77    0.0762   -0.115863    0.00579393
exp1 & occ2: 2   -0.0736239     0.0501108  -1.47    0.1418   -0.171863    0.0246157
exp1 & occ2: 3   -0.0714859     0.0637688  -1.12    0.2623   -0.196501    0.0535296
exp1 & occ2: 4   -0.0723997     0.0747715  -0.97    0.3330   -0.218985    0.0741859
exp1 & occ2: 5    0.0946732     0.0794005   1.19    0.2332   -0.0609873   0.250334
exp1 & occ2: 6   -0.0348928     0.0712136  -0.49    0.6242   -0.174503    0.104718
exp1 & occ2: 7   -0.227934      0.078486   -2.90    0.0037   -0.381802   -0.074066
exp1 & occ2: 8   -0.0727459     0.0645883  -1.13    0.2601   -0.199368    0.0538762
exp1 & occ2: 9    0.0274143     0.0669517   0.41    0.6822   -0.103841    0.15867
exp1 & occ2: 10   0.00756285    0.0581715   0.13    0.8966   -0.106479    0.121605
exp1 & occ2: 11   0.101422      0.100509    1.01    0.3130   -0.0956214   0.298466
exp1 & occ2: 12  -0.0862744     0.0874768  -0.99    0.3241   -0.257768    0.0852193
exp1 & occ2: 13   0.00671485    0.0761825   0.09    0.9298   -0.142637    0.156067
exp1 & occ2: 14  -0.136915      0.0974458  -1.41    0.1601   -0.327953    0.0541221
exp1 & occ2: 15  -0.0400425     0.0898931  -0.45    0.6560   -0.216273    0.136188
exp1 & occ2: 16  -0.0539314     0.0520926  -1.04    0.3006   -0.156056    0.0481934
exp1 & occ2: 17   0.0147277     0.0467903   0.31    0.7530   -0.0770023   0.106458
exp1 & occ2: 18   0.10741       0.471844    0.23    0.8199   -0.817616    1.03244
exp1 & occ2: 19   0.0047165     0.106074    0.04    0.9645   -0.203237    0.21267
exp1 & occ2: 20   0.0243156     0.0743274   0.33    0.7436   -0.121399    0.170031
exp1 & occ2: 21   0.0791776     0.0696947   1.14    0.2560   -0.0574551   0.21581
exp1 & occ2: 22   0.109325      0.0880828   1.24    0.2146   -0.063357    0.282006
exp1 & ind2: 3    0.475889      0.222748    2.14    0.0327    0.0392024   0.912576
exp1 & ind2: 4    0.0147304     0.15711     0.09    0.9253   -0.293276    0.322737
exp1 & ind2: 5    0.125699      0.153163    0.82    0.4119   -0.174569    0.425966
exp1 & ind2: 6    0.154027      0.152429    1.01    0.3123   -0.144801    0.452856
exp1 & ind2: 7    0.102925      0.178694    0.58    0.5647   -0.247396    0.453245
exp1 & ind2: 8    0.235767      0.16892     1.40    0.1629   -0.0953924   0.566926
exp1 & ind2: 9    0.135908      0.148949    0.91    0.3616   -0.156098    0.427914
exp1 & ind2: 10   0.151258      0.164434    0.92    0.3577   -0.171107    0.473622
exp1 & ind2: 11   0.317489      0.159002    2.00    0.0459    0.0057729   0.629204
exp1 & ind2: 12   0.259109      0.151059    1.72    0.0864   -0.0370341   0.555252
exp1 & ind2: 13   0.339609      0.166924    2.03    0.0420    0.0123634   0.666855
exp1 & ind2: 14   0.144141      0.147799    0.98    0.3295   -0.145612    0.433894
exp1 & ind2: 15  -0.0568181     0.234985   -0.24    0.8089   -0.517494    0.403858
exp1 & ind2: 16   0.0847295     0.155043    0.55    0.5848   -0.219223    0.388682
exp1 & ind2: 17   0.172887      0.151328    1.14    0.2533   -0.123784    0.469557
exp1 & ind2: 18   0.15654       0.149417    1.05    0.2948   -0.136385    0.449464
exp1 & ind2: 19   0.15161       0.162085    0.94    0.3496   -0.166149    0.46937
exp1 & ind2: 20   0.132663      0.156688    0.85    0.3972   -0.174516    0.439842
exp1 & ind2: 21   0.21909       0.155505    1.41    0.1589   -0.0857694   0.52395
exp1 & ind2: 22   0.114581      0.152343    0.75    0.4520   -0.184079    0.413241
exp1 & mw        -0.0279931     0.0296572  -0.94    0.3453   -0.0861345   0.0301484
exp1 & so        -0.00996775    0.0266868  -0.37    0.7088   -0.0622858   0.0423503
exp1 & we         0.00630768    0.0301417   0.21    0.8342   -0.0527835   0.0653989
exp2 & shs        1.90051       1.45025     1.31    0.1901   -0.94263     4.74364
exp2 & hsg        0.117164      0.550973    0.21    0.8316   -0.962989    1.19732
exp2 & scl        0.621792      0.462999    1.34    0.1793   -0.285892    1.52948
exp2 & clg        0.409675      0.380217    1.08    0.2813   -0.335721    1.15507
exp2 & occ2: 2    0.663217      0.552322    1.20    0.2299   -0.419581    1.74602
exp2 & occ2: 3    0.641546      0.710278    0.90    0.3664   -0.750918    2.03401
exp2 & occ2: 4    0.974842      0.865535    1.13    0.2601   -0.721994    2.67168
exp2 & occ2: 5   -0.977882      0.973799   -1.00    0.3153   -2.88696     0.9312
exp2 & occ2: 6    0.105086      0.800227    0.13    0.8955   -1.46372     1.67389
exp2 & occ2: 7    3.14071       0.938942    3.34    0.0008    1.29996     4.98146
exp2 & occ2: 8    0.671088      0.719208    0.93    0.3508   -0.738881    2.08106
exp2 & occ2: 9    0.0231977     0.762914    0.03    0.9757   -1.47246     1.51885
exp2 & occ2: 10  -0.269229      0.640527   -0.42    0.6743   -1.52495     0.986491
exp2 & occ2: 11  -1.08165       1.00576    -1.08    0.2822   -3.05339     0.890081
exp2 & occ2: 12   0.832374      0.934125    0.89    0.3729   -0.998929    2.66368
exp2 & occ2: 13  -0.220981      0.772846   -0.29    0.7749   -1.73611     1.29414
exp2 & occ2: 14   0.751116      0.927255    0.81    0.4180   -1.06672     2.56895
exp2 & occ2: 15  -0.0326858     0.940912   -0.03    0.9723   -1.87729     1.81192
exp2 & occ2: 16   0.363581      0.550955    0.66    0.5093   -0.716537    1.4437
exp2 & occ2: 17  -0.265929      0.486113   -0.55    0.5844   -1.21893     0.687071
exp2 & occ2: 18  -2.56088       5.17009    -0.50    0.6204  -12.6966      7.57482
exp2 & occ2: 19  -0.129176      1.06169    -0.12    0.9032   -2.21056     1.95221
exp2 & occ2: 20  -0.33233       0.722907   -0.46    0.6457   -1.74955     1.08489
exp2 & occ2: 21  -0.91          0.685411   -1.33    0.1843   -2.25371     0.433714
exp2 & occ2: 22  -0.855054      0.827941   -1.03    0.3018   -2.47819     0.768082
exp2 & ind2: 3   -5.93689       2.40679    -2.47    0.0137  -10.6553     -1.2185
exp2 & ind2: 4   -1.10534       1.7102     -0.65    0.5181   -4.4581      2.24741
exp2 & ind2: 5   -2.01492       1.69192    -1.19    0.2337   -5.33184     1.302
exp2 & ind2: 6   -2.22777       1.68169    -1.32    0.1853   -5.52464     1.06909
exp2 & ind2: 7   -1.46481       2.01379    -0.73    0.4670   -5.41274     2.48312
exp2 & ind2: 8   -2.94799       1.85954    -1.59    0.1130   -6.59353     0.697541
exp2 & ind2: 9   -1.77962       1.64712    -1.08    0.2800   -5.00872     1.44948
exp2 & ind2: 10  -2.19733       1.77386    -1.24    0.2155   -5.6749      1.28024
exp2 & ind2: 11  -3.87768       1.76374    -2.20    0.0280   -7.3354     -0.419966
exp2 & ind2: 12  -3.16904       1.68194    -1.88    0.0596   -6.46639     0.128306
exp2 & ind2: 13  -3.9652        1.81307    -2.19    0.0288   -7.51963    -0.410767
exp2 & ind2: 14  -2.07833       1.64904    -1.26    0.2076   -5.31118     1.15452
exp2 & ind2: 15   0.191169      2.60754     0.07    0.9416   -4.92078     5.30311
exp2 & ind2: 16  -1.32659       1.71856    -0.77    0.4402   -4.69574     2.04257
exp2 & ind2: 17  -2.20029       1.68372    -1.31    0.1913   -5.50113     1.10055
exp2 & ind2: 18  -2.20062       1.65666    -1.33    0.1841   -5.44842     1.04718
exp2 & ind2: 19  -1.93085       1.78767    -1.08    0.2802   -5.43548     1.57377
exp2 & ind2: 20  -1.94673       1.7244     -1.13    0.2590   -5.32732     1.43387
exp2 & ind2: 21  -3.11274       1.72379    -1.81    0.0710   -6.49214     0.266666
exp2 & ind2: 22  -1.85783       1.68495    -1.10    0.2703   -5.1611      1.44543
exp2 & mw         0.200561      0.317291    0.63    0.5273   -0.421472    0.822594
exp2 & so         0.0544354     0.281566    0.19    0.8467   -0.49756     0.606431
exp2 & we         0.00127174    0.320787    0.00    0.9968   -0.627615    0.630159
exp3 & shs       -0.672124      0.442663   -1.52    0.1290   -1.53994     0.195693
exp3 & hsg       -0.0179937     0.208318   -0.09    0.9312   -0.426389    0.390402
exp3 & scl       -0.199788      0.185519   -1.08    0.2816   -0.563488    0.163912
exp3 & clg       -0.102523      0.164365   -0.62    0.5328   -0.424752    0.219706
exp3 & occ2: 2   -0.20394       0.221139   -0.92    0.3565   -0.637471    0.22959
exp3 & occ2: 3   -0.236962      0.287037   -0.83    0.4091   -0.799683    0.325759
exp3 & occ2: 4   -0.436696      0.352017   -1.24    0.2148   -1.12681     0.253415
exp3 & occ2: 5    0.38853       0.411886    0.94    0.3456   -0.418951    1.19601
exp3 & occ2: 6    0.0484737     0.329353    0.15    0.8830   -0.597205    0.694152
exp3 & occ2: 7   -1.39493       0.405011   -3.44    0.0006   -2.18893    -0.600926
exp3 & occ2: 8   -0.20539       0.289573   -0.71    0.4782   -0.773082    0.362302
exp3 & occ2: 9   -0.090966      0.314335   -0.29    0.7723   -0.707203    0.525271
exp3 & occ2: 10   0.185475      0.257556    0.72    0.4715   -0.319451    0.690401
exp3 & occ2: 11   0.393155      0.381776    1.03    0.3032   -0.355296    1.14161
exp3 & occ2: 12  -0.220256      0.366021   -0.60    0.5474   -0.93782     0.497308
exp3 & occ2: 13   0.0950356     0.290437    0.33    0.7435   -0.474351    0.664422
exp3 & occ2: 14  -0.144393      0.334162   -0.43    0.6657   -0.799501    0.510714
exp3 & occ2: 15   0.147708      0.364519    0.41    0.6853   -0.566913    0.862328
exp3 & occ2: 16  -0.0378548     0.215129   -0.18    0.8603   -0.459604    0.383894
exp3 & occ2: 17   0.15105       0.187808    0.80    0.4213   -0.217138    0.519238
exp3 & occ2: 18   1.40844       1.88525     0.75    0.4550   -2.28748     5.10437
exp3 & occ2: 19   0.0923425     0.404231    0.23    0.8193   -0.700131    0.884816
exp3 & occ2: 20   0.180699      0.265208    0.68    0.4957   -0.339227    0.700626
exp3 & occ2: 21   0.377908      0.255303    1.48    0.1389   -0.1226      0.878417
exp3 & occ2: 22   0.285506      0.298421    0.96    0.3388   -0.299532    0.870544
exp3 & ind2: 3    2.66658       0.98075     2.72    0.0066    0.743872    4.58929
exp3 & ind2: 4    0.729843      0.687981    1.06    0.2888   -0.618908    2.07859
exp3 & ind2: 5    0.994225      0.684243    1.45    0.1463   -0.347199    2.33565
exp3 & ind2: 6    1.06414       0.680095    1.56    0.1177   -0.269148    2.39743
exp3 & ind2: 7    0.708909      0.833796    0.85    0.3952   -0.925705    2.34352
exp3 & ind2: 8    1.23409       0.748347    1.65    0.0992   -0.233001    2.70119
exp3 & ind2: 9    0.828731      0.66759     1.24    0.2145   -0.480045    2.13751
exp3 & ind2: 10   1.04482       0.706672    1.48    0.1393   -0.340577    2.43021
exp3 & ind2: 11   1.68776       0.716215    2.36    0.0185    0.283655    3.09186
exp3 & ind2: 12   1.37345       0.683557    2.01    0.0446    0.0333676   2.71352
exp3 & ind2: 13   1.63767       0.72593     2.26    0.0241    0.214519    3.06082
exp3 & ind2: 14   1.01629       0.671452    1.51    0.1302   -0.300056    2.33264
exp3 & ind2: 15   0.187948      1.02997     0.18    0.8552   -1.83125     2.20715
exp3 & ind2: 16   0.688968      0.696803    0.99    0.3228   -0.677078    2.05501
exp3 & ind2: 17   1.00855       0.683699    1.48    0.1402   -0.331803    2.34891
exp3 & ind2: 18   1.06056       0.672523    1.58    0.1149   -0.257887    2.37901
exp3 & ind2: 19   0.895987      0.72256     1.24    0.2150   -0.520555    2.31253
exp3 & ind2: 20   0.976894      0.695582    1.40    0.1603   -0.386758    2.34055
exp3 & ind2: 21   1.44152       0.699648    2.06    0.0394    0.069898    2.81314
exp3 & ind2: 22   0.968788      0.68285     1.42    0.1560   -0.369903    2.30748
exp3 & mw        -0.0625771     0.124129   -0.50    0.6142   -0.305926    0.180772
exp3 & so        -0.0115842     0.108422   -0.11    0.9149   -0.224139    0.200971
exp3 & we        -0.0124875     0.125138   -0.10    0.9205   -0.257813    0.232838
exp4 & shs        0.0777418     0.0475427   1.64    0.1021   -0.0154632   0.170947
exp4 & hsg        0.000491255   0.0265964   0.02    0.9853   -0.0516497   0.0526322
exp4 & scl        0.021076      0.0245289   0.86    0.3903   -0.0270117   0.0691637
exp4 & clg        0.00786949    0.0227528   0.35    0.7295   -0.0367363   0.0524753
exp4 & occ2: 2    0.0176389     0.0289257   0.61    0.5420   -0.0390683   0.0743462
exp4 & occ2: 3    0.0303057     0.0376552   0.80    0.4210   -0.0435153   0.104127
exp4 & occ2: 4    0.0584146     0.0457704   1.28    0.2019   -0.0313159   0.148145
exp4 & occ2: 5   -0.0515181     0.0549489  -0.94    0.3485   -0.159243    0.0562063
exp4 & occ2: 6   -0.0170182     0.0440847  -0.39    0.6995   -0.103444    0.0694076
exp4 & occ2: 7    0.190535      0.0558757   3.41    0.0007    0.0809939   0.300077
exp4 & occ2: 8    0.0196522     0.0379084   0.52    0.6042   -0.0546653   0.0939697
exp4 & occ2: 9    0.0190014     0.0421099   0.45    0.6518   -0.0635528   0.101556
exp4 & occ2: 10  -0.0333347     0.0338825  -0.98    0.3252   -0.0997595   0.0330901
exp4 & occ2: 11  -0.0465914     0.0479018  -0.97    0.3308   -0.1405      0.0473175
exp4 & occ2: 12   0.0110212     0.0470536   0.23    0.8148   -0.0812249   0.103267
exp4 & occ2: 13  -0.0136895     0.0358988  -0.38    0.7030   -0.0840673   0.0566883
exp4 & occ2: 14   0.00555824    0.0400331   0.14    0.8896   -0.0729245   0.084041
exp4 & occ2: 15  -0.0327444     0.0462379  -0.71    0.4789   -0.123391    0.0579026
exp4 & occ2: 16  -0.00897062    0.0275729  -0.33    0.7449   -0.0630259   0.0450847
exp4 & occ2: 17  -0.0256735     0.0239306  -1.07    0.2834   -0.0725881   0.0212412
exp4 & occ2: 18  -0.212137      0.2204     -0.96    0.3358   -0.64422     0.219946
exp4 & occ2: 19  -0.0169398     0.0513428  -0.33    0.7415   -0.117595    0.083715
exp4 & occ2: 20  -0.0296125     0.0323353  -0.92    0.3598   -0.0930042   0.0337791
exp4 & occ2: 21  -0.0524577     0.0317251  -1.65    0.0983   -0.114653    0.00973765
exp4 & occ2: 22  -0.0350646     0.0360687  -0.97    0.3310   -0.105775    0.0356463
exp4 & ind2: 3   -0.385179      0.132907   -2.90    0.0038   -0.645735   -0.124623
exp4 & ind2: 4   -0.120948      0.089958   -1.34    0.1789   -0.297306    0.0554102
exp4 & ind2: 5   -0.144105      0.0897994  -1.60    0.1086   -0.320152    0.0319425
exp4 & ind2: 6   -0.152611      0.0892689  -1.71    0.0874   -0.327618    0.0223961
exp4 & ind2: 7   -0.100199      0.11194    -0.90    0.3708   -0.319652    0.119253
exp4 & ind2: 8   -0.160966      0.097978   -1.64    0.1005   -0.353047    0.0311143
exp4 & ind2: 9   -0.117808      0.0877821  -1.34    0.1796   -0.2899      0.0542842
exp4 & ind2: 10  -0.148284      0.0918416  -1.61    0.1065   -0.328335    0.0317665
exp4 & ind2: 11  -0.232296      0.0944506  -2.46    0.0139   -0.417462   -0.0471307
exp4 & ind2: 12  -0.187291      0.0899985  -2.08    0.0375   -0.363728   -0.0108538
exp4 & ind2: 13  -0.215562      0.0946011  -2.28    0.0227   -0.401022   -0.0301012
exp4 & ind2: 14  -0.148352      0.0884992  -1.68    0.0937   -0.32185     0.0251456
exp4 & ind2: 15  -0.0532195     0.131382   -0.41    0.6854   -0.310786    0.204347
exp4 & ind2: 16  -0.104434      0.0916252  -1.14    0.2544   -0.28406     0.0751928
exp4 & ind2: 17  -0.142735      0.0899315  -1.59    0.1125   -0.319041    0.0335711
exp4 & ind2: 18  -0.154625      0.0885883  -1.75    0.0810   -0.328298    0.019048
exp4 & ind2: 19  -0.126959      0.0948784  -1.34    0.1809   -0.312963    0.059045
exp4 & ind2: 20  -0.146855      0.0911188  -1.61    0.1071   -0.325489    0.0317784
exp4 & ind2: 21  -0.203262      0.0920972  -2.21    0.0274   -0.383814   -0.0227102
exp4 & ind2: 22  -0.148095      0.0897937  -1.65    0.0992   -0.324131    0.0279408
exp4 & mw         0.00624394    0.0158699   0.39    0.6940   -0.0248681   0.037356
exp4 & so         0.000314457   0.0136275   0.02    0.9816   -0.0264016   0.0270305
exp4 & we         0.00176845    0.0159602   0.11    0.9118   -0.0295206   0.0330575
────────────────────────────────────────────────────────────────────────────────────
println("Coefficient for OLS with controls " , control_est, "robust standard error:", control_se1)
Coefficient for OLS with controls -0.06955320329637253robust standard error:0.015000474421752112

The estimated regression coefficient \(\beta_1\approx-0.0696\) measures how our linear prediction of wage changes if we set the gender variable \(D\) from 0 to 1, holding the controls \(W\) fixed. We can call this the predictive effect (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of size \(4\)% for women increases to about \(7\)% after controlling for worker characteristics.

Next, we use the Frisch-Waugh-Lovell theorem from the lecture, partialling out the linear effect of the controls via OLS.

Partialling-Out using OLS#

# models
# model for Y
flex_y = @formula(lwage ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
# model for D
flex_d = @formula(sex ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))

# partialling-out the linear effect of W from Y
t_Y = residuals(lm(flex_y, data))

# partialling-out the linear effect of W from D
t_D = residuals(lm(flex_d, data))

data_res = DataFrame(t_Y = t_Y, t_D = t_D )
# regression of Y on D after partialling-out the effect of W

partial_fit = lm(@formula(t_Y ~ t_D), data_res)

partial_est = GLM.coef(partial_fit)[2]

# standard error
partial_se = GLM.coeftable(partial_fit).cols[2][2]

partial_se1 = stderror( HC0(), partial_fit)[2]

# confidence interval
GLM.confint(partial_fit)[2,:]
2-element Vector{Float64}:
 -0.09867142357486275
 -0.04043498301882939
println("Coefficient for D via partiallig-out ", partial_est, " robust standard error:", control_se1 )
Coefficient for D via partiallig-out -0.06955320329684607 robust standard error:0.015000474421752112

Again, the estimated coefficient measures the linear predictive effect (PE) of \(D\) on \(Y\) after taking out the linear effect of \(W\) on both of these variables. This coefficient equals the estimated coefficient from the OLS regression with controls.

We know that the partialling-out approach works well when the dimension of \(W\) is low in relation to the sample size \(n\). When the dimension of \(W\) is relatively high, we need to use variable selection or penalization for regularization purposes.

In the following, we illustrate the partialling-out approach using lasso instead of OLS.
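As a preview, here is a minimal sketch of lasso-based partialling-out. It assumes the GLMNet.jl package is installed (it is not used elsewhere in this lab); the field and function names below follow GLMNet.jl's API.

# sketch: partialling-out with cross-validated lasso (assumes GLMNet.jl)
using GLMNet

W_mat = modelmatrix(lm(flex_y, data))[:, 2:end]  # controls; drop the intercept (glmnet adds its own)
y_vec = data.lwage
d_vec = data.sex

function lasso_residuals(y, X)
    cv = glmnetcv(X, y)                          # lasso path with cross-validation
    best = argmin(cv.meanloss)                   # lambda with the smallest CV loss
    return y .- GLMNet.predict(cv.path, X)[:, best]
end

data_res_lasso = DataFrame(t_Y = lasso_residuals(y_vec, W_mat),
                           t_D = lasso_residuals(d_vec, W_mat))
lm(@formula(t_Y ~ t_D), data_res_lasso)          # final OLS of the lasso residuals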

Summarize the results#

DataFrame(modelos = [ "Without controls", "full reg", "partial reg" ], 
Estimate = [nocontrol_est,control_est, partial_est], 
StdError = [nocontrol_se1,control_se1, partial_se1])

3 rows × 3 columns

|   | modelos | Estimate | StdError |
|---|---------|----------|----------|
| 1 | Without controls | -0.0383447 | 0.015905  |
| 2 | full reg         | -0.0695532 | 0.0150005 |
| 3 | partial reg      | -0.0695532 | 0.0150005 |

It is worth noting that controlling for worker characteristics increases the gender wage gap from less than 4% to about 7%. The controls we used in our analysis include 5 educational attainment indicators (less than high school, high school graduate, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast), a quartic in experience (first, second, third, and fourth powers), and 22 occupation and 23 industry indicators.

Keep in mind that the predictive effect (PE) does not only measure discrimination (the causal effect of being female); it may also reflect selection effects of unobserved differences in covariates between men and women in our sample.

Next, we try an “extra” flexible model, where we take interactions of all controls, giving us about 1,000 controls.

“Extra” flexible model#

import Pkg
Pkg.add("StatsModels")
Pkg.add("Combinatorics")
Pkg.add("IterTools")
# We have to extend StatsModels internally (with help from Combinatorics/IterTools),
# because Julia does not understand (a formula)^2: it treats it as a single term
# rather than as interactions between the variables.
# this code fixes the problem mentioned above
using StatsModels, Combinatorics, IterTools

combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.FullRank, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)
extra_flex = @formula(lwage ~  sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2)

control_fit = lm(extra_flex, data)
control_est = GLM.coef(control_fit)[2]

println("Number of Extra-Flex Controls: ", size(modelmatrix(control_fit))[2] -1) #minus the intercept
println("Coefficient for OLS with extra flex controls ", control_est)

#std error
control_se = GLM.stderror(control_fit)[2];
Number of Extra-Flex Controls: 979
Coefficient for OLS with extra flex controls -0.061270463792332606

Cross-Validation in Lasso Regression - Manual Implementation Task#

# import Pkg; Pkg.add("ScikitLearn")
# import Pkg; Pkg.add("StatsBase")
# import Pkg; Pkg.add("CSVFiles")
# import Pkg; Pkg.add("RDatasets")
# import Pkg; Pkg.add("MLJ")
using Pkg
Pkg.add("Transformers")

using Pkg
using RData
using DataFrames
using RDatasets
using Random
using StatsBase
using GLM
using PyCall
using Statistics
using Plots
using CSV
using Transformers

import PyPlot
import DataFrames
import StatsModels
import Random
import ScikitLearn: cross_val_score

1. Data Preparation

Load the March Supplement of the U.S. Current Population Survey, year 2015. (wage2015_subsample_inference.Rdata)

rdata_read = RData.load("../../data/wage2015_subsample_inference.RData")
data = rdata_read["data"]
names(data)
20-element Vector{String}:
 "wage"
 "lwage"
 "sex"
 "shs"
 "hsg"
 "scl"
 "clg"
 "ad"
 "mw"
 "so"
 "we"
 "ne"
 "exp1"
 "exp2"
 "exp3"
 "exp4"
 "occ"
 "occ2"
 "ind"
 "ind2"
# a quick description of the data
describe(data)

20 rows × 7 columns

|    | variable | mean | min | median | max | nmissing | eltype |
|----|----------|------|-----|--------|-----|----------|--------|
| 1  | wage  | 23.4104  | 3.02198 | 19.2308 | 528.846 | 0 | Float64 |
| 2  | lwage | 2.97079  | 1.10591 | 2.95651 | 6.2707  | 0 | Float64 |
| 3  | sex   | 0.444466 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 4  | shs   | 0.023301 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 5  | hsg   | 0.243883 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 6  | scl   | 0.278058 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 7  | clg   | 0.31767  | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 8  | ad    | 0.137087 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 9  | mw    | 0.259612 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 10 | so    | 0.296505 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 11 | we    | 0.216117 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 12 | ne    | 0.227767 | 0.0 | 0.0  | 1.0     | 0 | Float64 |
| 13 | exp1  | 13.7606  | 0.0 | 10.0 | 47.0    | 0 | Float64 |
| 14 | exp2  | 3.01893  | 0.0 | 1.0  | 22.09   | 0 | Float64 |
| 15 | exp3  | 8.23587  | 0.0 | 1.0  | 103.823 | 0 | Float64 |
| 16 | exp4  | 25.118   | 0.0 | 1.0  | 487.968 | 0 | Float64 |
| 17 | occ   |          | 10  |      | 1e+05   | 0 | CategoricalValue{String, UInt16} |
| 18 | occ2  |          | 1   |      | 22      | 0 | CategoricalValue{String, UInt8}  |
| 19 | ind   |          | 370 |      | 1e+05   | 0 | CategoricalValue{String, UInt8}  |
| 20 | ind2  |          | 2   |      | 22      | 0 | CategoricalValue{String, UInt8}  |
# flexible model
# flex = @formula(lwage ~ sex + shs + hsg + scl + clg + occ2 + ind2 + mw + so + we + (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + occ2 + ind2 + mw + so + we))
# flex_results_0 = lm(flex, data)
# # Get exogenous variables (X matrix) from the flexible model
# X = Matrix(modelmatrix(flex_results_0))

# # Set endogenous variable (response variable)
# y = data[!, "lwage"]
# split a DataFrame into random train/test subsets (default 70% train)
function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    data[train_idx,:], data[test_idx,:]
end

iris = dataset("datasets", "iris")
X_train,X_test = partitionTrainTest(iris, 0.7) # 70% train
Output: a 105×5 train DataFrame and a 45×5 test DataFrame, i.e. a 70/30 split of the 150 iris rows (columns SepalLength, SepalWidth, PetalLength, PetalWidth, Species).
using MLJ

# `StandardScaler` is not defined in Julia; MLJ's Standardizer plays the same role
scaler = Standardizer()

# Fit the scaler on the training data only
mach = fit!(machine(scaler, X_train))
X_train = MLJ.transform(mach, X_train)

# Transform the test data with the same fitted scaler
X_test = MLJ.transform(mach, X_test)

2. Define a Range of Alpha (Lambda in our equation) Values

We create an array of alpha values to iterate over; these are the different regularization parameters we test. We started by testing values from 0.1 to 0.5 and found that the cross-validation MSE kept decreasing as alpha increased, so we moved on to larger values.

# Create alphas using a loop
alphas = [45 + i * 5 for i in 0:4]

3. Partition the Dataset for k-Fold Cross-Validation

We divide the dataset into 5 subsets (or folds). Since we are working with a regression task (predicting the log of wage), plain K-fold splitting is appropriate; we shuffle the data and fix a random seed for a reproducible output. Rather than calling sklearn's K-Fold cross-validator from Julia, we implement the splitter ourselves below, mimicking its interface.

Source: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

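A minimal pure-Julia splitter, written as a sketch; `kfold_indices` and its `seed` keyword are our own names, not a library API. It shuffles the row indices with a fixed seed and returns one (train, validation) index pair per fold:

using Random

# Return k (train_indices, val_indices) pairs over a shuffled 1:n, with a fixed seed
function kfold_indices(n, k; seed = 24)
    idx = randperm(MersenneTwister(seed), n)
    folds = [idx[i:k:end] for i in 1:k]
    return [(setdiff(idx, f), f) for f in folds]
end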
# Create the 5 folds (they are rebuilt on the training split before the final fit in step 7)
kf = kfold_indices(length(y), 5)

4. Lasso Regression Implementation

Implement a function to fit a Lasso Regression model given a training dataset and an alpha value. The function should return the model’s coefficients and intercept.

using LinearAlgebra

"""
    lasso_regression(X_train, y_train, alpha, iterations=100, learning_rate=0.01)

Fits a Lasso Regression model by (sub)gradient descent.

# Arguments
- `X_train`: training features
- `y_train`: target values
- `alpha`: regularization parameter (L1 penalty)
- `iterations`: number of gradient-descent iterations (default: 100)
- `learning_rate`: gradient-descent step size (default: 0.01)

# Returns
- `W`: model coefficients (weights)
- `b`: model intercept (bias)
"""
function lasso_regression(X_train, y_train, alpha, iterations=100, learning_rate=0.01)
    m, n = size(X_train)
    W = zeros(n)
    b = 0.0

    for _ in 1:iterations
        Y_pred = X_train * W .+ b
        dW = zeros(n)
        for j in 1:n
            # Subgradient of the L1 penalty: +alpha or -alpha depending on the sign of W[j]
            if W[j] > 0
                dW[j] = (-2 * dot(X_train[:, j], y_train - Y_pred) + alpha) / m
            else
                dW[j] = (-2 * dot(X_train[:, j], y_train - Y_pred) - alpha) / m
            end
        end
        db = -2 * sum(y_train - Y_pred) / m
        W .-= learning_rate * dW
        b -= learning_rate * db
    end

    return W, b
end
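As a quick sanity check that the routine behaves like a Lasso (coefficients on irrelevant features shrink toward zero), one can run it on synthetic data. Everything below (`X_demo`, `w_true`, the chosen alpha, iteration count, and step size) is an illustrative assumption, not part of the lab:

using Random, LinearAlgebra

Random.seed!(1)
X_demo = randn(200, 5)                      # 200 observations, 5 features
w_true = [1.5, 0.0, -2.0, 0.0, 0.5]         # sparse ground-truth weights
y_demo = X_demo * w_true .+ 0.1 .* randn(200)

# With enough iterations the coefficients on the two zero features should end up near zero
W_hat, b_hat = lasso_regression(X_demo, y_demo, 0.5, 2000, 0.05)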

5. Cross-Validation Loop and 6. Selection of Optimal Alpha

We implement a for loop that fits the Lasso regression for every alpha and every fold, and we select the value of alpha that minimizes the average MSE across the folds.

function cross_validate_lasso(X_train, y_train, alphas, kf)
    avg_mse_values = Float64[]
    best_alpha = nothing
    min_avg_mse = Inf

    for alpha in alphas
        mse_list = Float64[]
        for (train_index, val_index) in kf
            X_train_fold, X_val_fold = X_train[train_index, :], X_train[val_index, :]
            y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

            # Train Lasso regression model with the current alpha
            W, b = lasso_regression(X_train_fold, y_train_fold, alpha)
            y_pred_val = X_val_fold * W .+ b

            # Calculate MSE for this fold
            mse_fold = mean((y_val_fold .- y_pred_val).^2)
            push!(mse_list, mse_fold)
        end

        # Calculate average MSE across all folds
        avg_mse = mean(mse_list)
        push!(avg_mse_values, avg_mse)
        println("Alpha=$(round(alpha, digits=1)), Average MSE: $(round(avg_mse, digits=5))")

        # Update the best alpha and minimum average MSE
        if avg_mse < min_avg_mse
            min_avg_mse = avg_mse
            best_alpha = alpha
        end
    end

    println("Best Alpha: $(round(best_alpha, digits=1)), Minimum Average MSE: $(round(min_avg_mse, digits=5))")

    # Plot the cross-validated MSE for each alpha value
    plt = plot(alphas, avg_mse_values, marker = :o, linestyle = :solid, xlabel = "Alpha", ylabel = "Average MSE",
               title = "Cross-Validated MSE for Different Alpha Values", legend = false, grid = true)
    xticks!(plt, alphas)
    display(plt)

    # Return the selected alpha so the final fit can use it
    return best_alpha, avg_mse_values
end

7. Model Training and Evaluation

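The code in this step assumes train/test objects for the wage data, while the split demonstrated above used the iris sample. Here is a minimal bridging sketch (the 70/30 ratio, the names `train_idx`/`test_idx`, and the seed are our assumptions) that splits the flexible-model matrix `X` and response `y`, rebuilds the folds on the training split, and recovers `best_alpha` from the cross-validation function:

using Random, Statistics

# 70/30 split of the wage design matrix (assumed setup for the final fit)
perm = randperm(MersenneTwister(24), length(y))
cut = floor(Int, 0.7 * length(y))
train_idx, test_idx = perm[1:cut], perm[(cut + 1):end]
X_train, X_test = X[train_idx, :], X[test_idx, :]
y_train, y_test = y[train_idx], y[test_idx]

# Rebuild the folds on the training split and select the best alpha
kf = kfold_indices(length(y_train), 5)
best_alpha, avg_mse_values = cross_validate_lasso(X_train, y_train, alphas, kf)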
# Train Lasso regression model with the best alpha
W, b = lasso_regression(X_train, y_train, best_alpha)

# Make predictions on test data
y_pred = X_test * W .+ b
lasso_corr = cor(y_test, y_pred)
lasso_mae = mean(abs.(y_test .- y_pred))
lasso_mse = mean((y_test .- y_pred).^2)

# Print results
println("Correlation: $(round(lasso_corr, digits=4))")
println("MAE: $(round(lasso_mae, digits=4))")
println("MSE: $(round(lasso_mse, digits=4))")

8. Report Results

We began by selecting the variables for a flexible model, one that includes interactions. After cross-validation, the alpha value that produced the lowest mean squared error (MSE) was 60. We then trained the model on all the available training data with this alpha and, using the fitted coefficients, predicted lwage on the test set. The resulting MSE on the test data was 0.3569, lower than the cross-validated MSE on the training data (0.41906). This indicates that the model is not overfitting, and the squared correlation (R²) between predictions and actual values was about 50%.
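Since the code above prints the raw correlation, the R² quoted in this paragraph can be recovered as its square (a one-line sketch; `lasso_corr` comes from step 7):

r_squared = lasso_corr^2   # squared correlation, ≈ 0.50 per the report above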

Replication#

Wage
using Statistics
using Plots

# Lorenz curve: cumulative share of total wages vs. cumulative share of the population
wage_sorted = sort(data.wage)
cumulative_share = cumsum(wage_sorted) ./ sum(wage_sorted)

# Cumulative wage share at each population decile
population_fraction = collect(0:0.1:1)
n_obs = length(wage_sorted)
salary_fraction = [p == 0 ? 0.0 : cumulative_share[round(Int, p * n_obs)] for p in population_fraction]

# Plot the Lorenz curve
plot(population_fraction, salary_fraction, linewidth=2, color=:blue, label="")
plot!([0, 1], [0, 1], linestyle=:dash, linewidth=1, color=:red, label="Equality line",
    title="Lorenz Curve for Salaries",
    xlabel="Cumulative fraction of the population",
    ylabel="Cumulative fraction of salaries",
    legend=:bottomright)
Lwage
using Plots
using StatsPlots

# Set the default style
default(; legend = false)

# Histogram of log wages
histogram(data.lwage, bins=30, color=:lightblue, linecolor=:black, normalize=true,
    xlabel="Lwage", ylabel="Density", title="Histogram of Log Wages")

# Overlay a kernel density estimate
density!(data.lwage, color=:red, linewidth=1)
Sex
using DataFrames
using Statistics
using Plots

# Count observations by sex
proportions = combine(groupby(data, :sex), nrow => :count)

# Total number of observations
total_obs = sum(proportions.count)

# Assign the labels 'Male' and 'Female'
proportions[!, :sex_label] = ifelse.(proportions.sex .== 0, "Male", "Female")

# Compute the percentages
proportions[!, :percentage] = proportions.count ./ total_obs * 100

# Plot the proportions without a legend
display(bar(proportions.sex_label, proportions[!, :percentage],
    xlabel = "Sex", ylabel = "Percentage",
    title = "Proportion of Males and Females",
    ylims = (0, 100),
    bar_width = 0.5,
    fmt = :png))

# Show the total number of observations
println("Total observations: ", total_obs)
Dummies college
using DataFrames
using Statistics
using Plots

# Create the 'Education_Status' variable (covering all five education dummies,
# so that the high-school categories are not mislabeled as 'Advanced Degree')
data[!, "Education_Status"] .= ifelse.(data.shs .== 1, "Some High School",
                              ifelse.(data.hsg .== 1, "High School Graduate",
                              ifelse.(data.scl .== 1, "Some College",
                              ifelse.(data.clg .== 1, "College Graduate", "Advanced Degree"))))

# Frequency of each education level
edu_freq = combine(groupby(data, :Education_Status), nrow => :Frequency)

# Total number of observations
total_obs = sum(edu_freq.Frequency)

# Percentage of each level
edu_freq[!, :Percentage] = edu_freq.Frequency / total_obs * 100

# Build the slice labels and draw the pie chart
labels = string.(edu_freq.Education_Status, ": ", round.(edu_freq.Percentage, digits=1), "%")
pie(labels, edu_freq.Frequency,
    title="Education Status Distribution",
    legend=:topright,
    fmt=:png,
    aspect_ratio=1)

# Add the total number of observations
annotate!(1, -1.2, text("Total observations: $total_obs", 10, :center))
Experience Higher Education
using DataFrames
using Plots
using StatsPlots

# Keep only individuals with at least some college, to match the title
higher_ed = data[(data.scl .== 1) .| (data.clg .== 1) .| (data.ad .== 1), :]

plt = plot(size=(800, 600))
boxplot!(higher_ed.exp1, title="Distribution of Experience in Individuals with Higher Education",
    ylabel="Experience (exp1)", legend=false)
Coefficients for different models
using DataFrames
using GLM
using Statistics
using Plots


# Fit the auxiliary models for partialling out (Frisch-Waugh-Lovell)
flex_y = @formula(lwage ~ exp1+exp2+exp3+exp4+scl+clg+ad+occ2+ind2+mw+so+we+(exp1+exp2+exp3+exp4)*(scl+clg+ad+occ2+ind2+mw+so+we)) # model for Y
flex_d = @formula(sex ~ exp1+exp2+exp3+exp4+scl+clg+ad+occ2+ind2+mw+so+we+(exp1+exp2+exp3+exp4)*(scl+clg+ad+occ2+ind2+mw+so+we))  # model for D

# Partial out the linear effect of W from Y
t_Y = residuals(lm(flex_y, data))
# Partial out the linear effect of W from D
t_D = residuals(lm(flex_d, data))
# Collect the residuals in a new DataFrame
residuals_df = DataFrame(t_Y=t_Y, t_D=t_D)

# Regress the Y residuals on the D residuals (note the order: t_Y on t_D, per FWL)
partial_fit = lm(@formula(t_Y ~ t_D), residuals_df)

# Fit the basic and fully controlled models
basic_fit = lm(@formula(lwage ~ sex + exp1 + scl + clg + ad + mw + so + we + occ2 + ind2), data)

control_fit = lm(@formula(lwage ~ sex + exp1+exp2+exp3+exp4+scl+clg+ad+occ2+ind2+mw+so+we+(exp1+exp2+exp3+exp4)*(scl+clg+ad+occ2+ind2+mw+so+we)), data)

# Extract the sex coefficients and standard errors from the three regressions
# (for partial_fit the coefficient of interest is the second one, the slope on t_D)
model_names = ["Basic", "Controls", "Partialling out"]
coefs = [coef(basic_fit)[2], coef(control_fit)[2], coef(partial_fit)[2]]
ses = [stderror(basic_fit)[2], stderror(control_fit)[2], stderror(partial_fit)[2]]

# Scatter plot with 95% error bars
scatter_plot = scatter(model_names, coefs, yerr=1.96 .* ses, legend=false,
    xlabel="Model", ylabel="Coefficient", title="Effect of sex on lwage",
    markershape=:circle, markercolor=:blue, markerstrokecolor=:black,
    markersize=8, markeralpha=0.8)

# Horizontal line at y=0
hline!(scatter_plot, [0], color="gray", linestyle=:dash)

# Show the plot
display(scatter_plot)
# The coefficient on the gender variable, which captures the predicted effect of being female on salary, is
# negative in the basic model. This suggests that, on average, women earn lower salaries than men. However,
# after adding controls such as work experience and educational level, the coefficient becomes less negative.
#
# This change in the gender coefficient can be explained by the control variables capturing some of the
# variability in salaries that was previously, and incorrectly, attributed to gender. It suggests that factors
# beyond gender influence salaries, and that the impact of gender is less pronounced once these other
# variables are taken into account. Moreover, both the FWL (partialling-out) approach and including the
# control variables directly in the regression yield coefficient estimates for the variable of interest that
# reflect its net impact on the dependent variable, once the effects of the other explanatory variables have
# been taken into account.

Women and men graphs (quartic)#

import Pkg
Pkg.add("StatsModels")
Pkg.add("Loess")
using DataFrames
using Statistics
using Plots

# Assume a DataFrame called data with columns lwage, exp4, and sex

# Build a separate DataFrame for men
data_male = filter(row -> row.sex == 0, data)

# Compute the mean of lwage for each unique value of exp4 among men
mean_lwage_male = combine(groupby(data_male, :exp4), :lwage => mean => :mean_lwage)

# Sort the DataFrame by exp4
sort!(mean_lwage_male, :exp4)

# Plot the mean lwage against exp4 for men
plot(mean_lwage_male.exp4, mean_lwage_male.mean_lwage, color=:blue, label="Mean lwage (Male)",
    xlabel="Experience (exp4)", ylabel="Mean lwage", title="Mean lwage by exp4 (Men)",
    marker=:circle, linewidth=2, linestyle=:solid)
using DataFrames
using Plots
using Loess

# Build a separate DataFrame for men only
data_male = filter(row -> row.sex == 0, data)

# Extract the independent and dependent variables for the LOESS model
exp4_male = data_male.exp4
lwage_male = data_male.lwage

# Fit a LOESS model for men with a span of 0.95 for less smoothing
loess_model_male = loess(exp4_male, lwage_male, span=0.95)

# Generate predictions from the LOESS model
exp4_range_male = range(minimum(exp4_male), maximum(exp4_male), length=500)
lwage_pred_male = predict(loess_model_male, exp4_range_male)

# Plot the smoothed line for men
plot(exp4_range_male, lwage_pred_male, color=:red, label="Smoothed lwage (Male)")
xlabel!("Experience (exp4)")
ylabel!("lwage")
title!("Smoothed lwage by exp4 (Men)")
xlims!(0, 40)  # Limit the x axis to 0-40
# Build a separate DataFrame for women
data_female = filter(row -> row.sex == 1, data)

# Compute the mean of lwage for each unique value of exp4 among women
mean_lwage_female = combine(groupby(data_female, :exp4), :lwage => mean => :mean_lwage)

# Sort the DataFrame by exp4
sort!(mean_lwage_female, :exp4)

# Plot the mean lwage against exp4 for women
plot(mean_lwage_female.exp4, mean_lwage_female.mean_lwage, color=:red, label="Mean lwage (Female)",
    xlabel="Experience (exp4)", ylabel="Mean lwage", title="Mean lwage by exp4 (Women)",
    marker=:circle, linewidth=2, linestyle=:solid)
using DataFrames
using Plots
using Loess

# Build a separate DataFrame for women only
data_female = filter(row -> row.sex == 1, data)

# Extract the independent and dependent variables for the LOESS model
exp4_female = data_female.exp4
lwage_female = data_female.lwage

# Fit a LOESS model for women with a span of 0.95 for less smoothing
loess_model_female = loess(exp4_female, lwage_female, span=0.95)

# Generate predictions from the LOESS model
exp4_range_female = range(minimum(exp4_female), maximum(exp4_female), length=500)
lwage_pred_female = predict(loess_model_female, exp4_range_female)

# Plot the smoothed line for women
plot(exp4_range_female, lwage_pred_female, color=:red, label="Smoothed lwage (Female)")
xlabel!("Experience (exp4)")
ylabel!("lwage")
title!("Smoothed lwage by exp4 (Women)")
xlims!(0, 40)  # Limit the x axis to 0-40
using DataFrames
using Plots
using Loess

# Build a separate DataFrame for women only
data_female = filter(row -> row.sex == 1, data)

# Compute the mean of lwage for each unique value of exp4 among women
mean_lwage_female = combine(groupby(data_female, :exp4), :lwage => mean => :mean_lwage)

# Sort the DataFrame by exp4
sort!(mean_lwage_female, :exp4)

# Extract the independent and dependent variables for the LOESS model
exp4_female = data_female.exp4
lwage_female = data_female.lwage

# Fit a LOESS model for women with a span of 0.95 for less smoothing
loess_model_female = loess(exp4_female, lwage_female, span=0.95)

# Generate predictions from the LOESS model
exp4_range_female = range(minimum(exp4_female), maximum(exp4_female), length=500)
lwage_pred_female = predict(loess_model_female, exp4_range_female)

# Plot the smoothed line together with the per-exp4 mean points for women
plot(exp4_range_female, lwage_pred_female, color=:red, label="Smoothed lwage (Female)", linewidth=2)
scatter!(mean_lwage_female.exp4, mean_lwage_female.mean_lwage, color=:red, label="", marker=:circle)
xlabel!("Experience (exp4)")
ylabel!("lwage")
title!("Mean lwage and smoothed line by exp4 (Women)")
xlims!(0, 40)  # Limit the x axis to 0-40