Lab 5 - Python Code#

Authors: Valerie Dube, Erzo Garay, Juan Marcos Guerrero y Matias Villalba

Replication and Data analysis#

# Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import hdmpy as hdm

from sklearn.base import BaseEstimator
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LassoCV, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

1. Descriptives#

1.1. Descriptive table#

# Import data and see first observations
df = pd.read_csv("../../data/processed_esti.csv")
df.head()
y w gender_female gender_male gender_transgender ethnicgrp_asian ethnicgrp_black ethnicgrp_mixed_multiple ethnicgrp_other ethnicgrp_white partners1 postlaunch msm age imd_decile
0 1 1 0 1 0 0 0 1 0 0 0 1 0 27 5
1 0 0 0 1 0 0 0 0 0 1 0 0 0 19 6
2 0 1 0 1 0 0 1 0 0 0 0 1 0 26 4
3 0 0 1 0 0 0 0 0 0 1 1 0 0 20 2
4 1 1 1 0 0 1 0 0 0 0 0 1 0 24 3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1739 entries, 0 to 1738
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   y                         1739 non-null   int64
 1   w                         1739 non-null   int64
 2   gender_female             1739 non-null   int64
 3   gender_male               1739 non-null   int64
 4   gender_transgender        1739 non-null   int64
 5   ethnicgrp_asian           1739 non-null   int64
 6   ethnicgrp_black           1739 non-null   int64
 7   ethnicgrp_mixed_multiple  1739 non-null   int64
 8   ethnicgrp_other           1739 non-null   int64
 9   ethnicgrp_white           1739 non-null   int64
 10  partners1                 1739 non-null   int64
 11  postlaunch                1739 non-null   int64
 12  msm                       1739 non-null   int64
 13  age                       1739 non-null   int64
 14  imd_decile                1739 non-null   int64
dtypes: int64(15)
memory usage: 203.9 KB
control = df[df['w'] == 0].drop('y', axis=1)
treatment = df[df['w'] == 1].drop('y', axis=1)
def get_descriptive_stats(group, column):
    if column == 'age':
        count = group[column].count()
    else:
        count = (group[column] == 1).sum()
    mean = group[column].mean()
    std = group[column].std()
    return count, mean, std

variables = df.columns.drop(['w', 'y'])
control_stats = {var: get_descriptive_stats(control, var) for var in variables}
treatment_stats = {var: get_descriptive_stats(treatment, var) for var in variables}

control_df = pd.DataFrame(control_stats, index=['count', 'mean', 'std']).T
treatment_df = pd.DataFrame(treatment_stats, index=['count', 'mean', 'std']).T

control_df.columns = pd.MultiIndex.from_product([['Control'], control_df.columns])
treatment_df.columns = pd.MultiIndex.from_product([['Treatment'], treatment_df.columns])

combined_df = pd.concat([control_df, treatment_df], axis=1)
formatted_table = combined_df[['Control', 'Treatment']].round(2)

print("Table 1: Descriptive Statistics and Balance\n")
print(formatted_table)
Table 1: Descriptive Statistics and Balance

                         Control              Treatment             
                           count   mean   std     count   mean   std
gender_female              475.0   0.58  0.49     541.0   0.59  0.49
gender_male                342.0   0.42  0.49     377.0   0.41  0.49
gender_transgender           1.0   0.00  0.03       3.0   0.00  0.06
ethnicgrp_asian             45.0   0.06  0.23      66.0   0.07  0.26
ethnicgrp_black             76.0   0.09  0.29      74.0   0.08  0.27
ethnicgrp_mixed_multiple    76.0   0.09  0.29      78.0   0.08  0.28
ethnicgrp_other             14.0   0.02  0.13       9.0   0.01  0.10
ethnicgrp_white            607.0   0.74  0.44     694.0   0.75  0.43
partners1                  239.0   0.29  0.46     277.0   0.30  0.46
postlaunch                 387.0   0.47  0.50     512.0   0.56  0.50
msm                        113.0   0.14  0.35     114.0   0.12  0.33
age                        818.0  23.05  3.59     921.0  23.16  3.54
imd_decile                  37.0   3.48  1.49      36.0   3.46  1.47

The observations for each variable are generally balanced between the control and treatment groups. Additionally, most participants are white, with an average age of approximately 23. The mean IMD decile scores are around 3.5, indicating that participants in both groups tend to come from more deprived areas.

1.2. Descriptive graphs#

ax = sns.barplot(
    data=df.groupby('w')['gender_male'].value_counts(normalize=True).to_frame().set_axis(['prop'], axis=1),
    x="gender_male",
    y="prop",
    hue="w",
    )

ax.yaxis.set_major_formatter("{x:.0%}")
ax.set_xlabel("Gender (male=1, female or transgender=0)")
ax.legend(title='')
new_labels = ['Control', 'Treatment']  # Replace with your desired labels
for t, l in zip(ax.get_legend().texts, new_labels):
    t.set_text(l)

plt.show()
../../_images/3277bbd9be5cbea7d8a6baa496ab971511654d71a3066c43918604b2d17e6ac8.png

As we saw in section 1.1., there is a similar percentage of males and females participants in each treatment group in the sample.

ax = sns.barplot(
    data=df.groupby('w')['partners1'].value_counts(normalize=True).to_frame().set_axis(['prop'], axis=1),
    x="partners1",
    y="prop",
    hue="w",
    )

ax.yaxis.set_major_formatter("{x:.0%}")
ax.set_xlabel("Number of partners (one partner=1)")
ax.legend(title='')
new_labels = ['Control', 'Treatment']  # Replace with your desired labels
for t, l in zip(ax.get_legend().texts, new_labels):
    t.set_text(l)

plt.show()
../../_images/ebd0012693260dca030cb9f8131b9a7c27813b206030d96288ffed5b43c2b39b.png

In a similar manner, we note an equal percentage of participants with 2 or more sexual partners as well as those with 1 sexual partner in the last 12 months from the beginning of the study, per treatment or control group.

ax = sns.histplot(data=df, x="age", hue="w", element="poly", stat="percent", common_norm=False)

new_labels = ['Control', 'Treatment']  # Replace with your desired labels
for t, l in zip(ax.get_legend().texts, new_labels):
    t.set_text(l)

plt.show()
../../_images/b688ee229920a21be30c973133074013a68f46631ecc5590eff91a75f628b2f6.png

We can see a higher percentage of participants aged between 23 and 27 in the treatment group. Also, there is a higher perceptange of participants aged between 21 and 22 and 27 and 29 in the control group.

ax = sns.histplot(data=df, x="imd_decile", hue="w", element="poly", discrete=True, stat="percent", common_norm=False)

new_labels = ['Control', 'Treatment']  # Replace with your desired labels
for t, l in zip(ax.get_legend().texts, new_labels):
    t.set_text(l)

plt.show()
../../_images/64da9d82bb47ad20e36d05cbe934c8a92b02ab6f0a0e3f4cfcfbbba1226fafd8.png

In the case of the IMD decile, an index that measures poverty in the UK, we can see the same proportion of participants in each group.

2. Linear Regression analysis#

2.1. Regression 1: \(y = \beta_0 + \beta_1 T + \epsilon\)#

model_1 = "y ~ w"
est_1 = smf.ols(formula=model_1 , data=df).fit()

print(est_1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.077
Model:                            OLS   Adj. R-squared:                  0.076
Method:                 Least Squares   F-statistic:                     144.5
Date:                Wed, 07 Aug 2024   Prob (F-statistic):           4.96e-32
Time:                        21:44:28   Log-Likelihood:                -1112.9
No. Observations:                1739   AIC:                             2230.
Df Residuals:                    1737   BIC:                             2241.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.2115      0.016     13.174      0.000       0.180       0.243
w              0.2652      0.022     12.021      0.000       0.222       0.308
==============================================================================
Omnibus:                    38241.926   Durbin-Watson:                   1.995
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              217.327
Skew:                           0.532   Prob(JB):                     6.43e-48
Kurtosis:                       1.634   Cond. No.                         2.69
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We found that the adjusted R-squared is 0.076 and the ATE is approximately 0.27. This means that the treatment explains only 7.6% of the increase in the STI tests between the control and treatment groups. Additionally, receiving the treatment (i.e. being invited to use the internet-based sexual health service) increases, on average, the probability of taking an STI test by 26.5%.

2.2. Regression 2: \(y = \beta_0 + \beta_1 T + \beta_2 X + \epsilon\)#

model_2 = "y ~ w + age + gender_female + ethnicgrp_white + ethnicgrp_black + ethnicgrp_mixed_multiple + partners1 + imd_decile + msm"
est_2 = smf.ols(formula=model_2 , data=df).fit()

print(est_2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.102
Method:                 Least Squares   F-statistic:                     22.94
Date:                Wed, 07 Aug 2024   Prob (F-statistic):           3.06e-37
Time:                        21:44:28   Log-Likelihood:                -1084.3
No. Observations:                1739   AIC:                             2189.
Df Residuals:                    1729   BIC:                             2243.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                   -0.1807      0.088     -2.042      0.041      -0.354      -0.007
w                            0.2622      0.022     12.044      0.000       0.220       0.305
age                          0.0149      0.003      4.787      0.000       0.009       0.021
gender_female                0.0905      0.025      3.629      0.000       0.042       0.139
ethnicgrp_white              0.0499      0.041      1.209      0.227      -0.031       0.131
ethnicgrp_black             -0.0292      0.054     -0.538      0.591      -0.136       0.077
ethnicgrp_mixed_multiple    -0.0315      0.054     -0.587      0.557      -0.137       0.074
partners1                   -0.0661      0.024     -2.714      0.007      -0.114      -0.018
imd_decile                  -0.0045      0.007     -0.609      0.543      -0.019       0.010
msm                          0.0033      0.037      0.089      0.929      -0.069       0.075
==============================================================================
Omnibus:                     2495.421   Durbin-Watson:                   1.997
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              193.943
Skew:                           0.524   Prob(JB):                     7.69e-43
Kurtosis:                       1.743   Cond. No.                         215.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Compared to the previous regression, when we include additional variables (age, gender, ethnic group, number of sexual partners, the indicator for Randomised after SH:24 made publicly available, the indicator for men who have sex with men, and the socioeconomic level measured by deciles), we observe an increase in the adjusted R-squared of approximately 0.3 percentage points (from 0.1). Additionally, the ATE decreases by 0.003 percentage points, indicating a negative bias.

It is worth mentioning that we omit gender_male to avoid collinearity with gender_female. Similarly, we exclude the less relevant ethnic group (i.e. the one with fewer observations) to prevent the same issue.

2.3. Regression 3: \(y = \beta_0 + \beta_1 T + \beta_2 X + \epsilon\) (Double Lasso variable selection)#

# Assuming df is already defined
X = df[["gender_female", "ethnicgrp_white", "ethnicgrp_black", "gender_transgender", "msm", "partners1", "age", "imd_decile"]]
w = df['w']
y = df['y']

# Fit LassoCV model to find the optimal alpha (lambda)
cv_model3a = LassoCV(alphas=None, cv=10, max_iter=10000).fit(X, w)
cv_model3b = LassoCV(alphas=None, cv=10, max_iter=10000).fit(X, y)

# Fit Lasso model with the optimal alpha
model3a = Lasso(alpha=cv_model3a.alpha_).fit(X, w)
model3b = Lasso(alpha=cv_model3b.alpha_).fit(X, y)

# Identify non-zero coefficients from Lasso models
significant_vars3a = X.columns[model3a.coef_ != 0]
significant_vars3b = X.columns[model3b.coef_ != 0]

# Combine the significant variables from both models
significant_vars = pd.Index(np.unique(significant_vars3a.append(significant_vars3b)))

# Extract the significant variables for regression
X_significant = X[significant_vars]

# Add 'w' to the significant variables
X_significant_with_w = X_significant.copy()
X_significant_with_w['w'] = w

# Add constant term for intercept in the new significant variable set
X_significant_with_w = sm.add_constant(X_significant_with_w)

# Fit linear regression model on significant variables plus 'w' against 'y'
model3 = sm.OLS(y, X_significant_with_w).fit()

# Print summary of the model
print("Summary for the final model with dependent variable 'y':")
print(model3.summary())
Summary for the final model with dependent variable 'y':
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.106
Model:                            OLS   Adj. R-squared:                  0.103
Method:                 Least Squares   F-statistic:                     34.38
Date:                Wed, 07 Aug 2024   Prob (F-statistic):           2.06e-39
Time:                        21:44:28   Log-Likelihood:                -1084.5
No. Observations:                1739   AIC:                             2183.
Df Residuals:                    1732   BIC:                             2221.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              -0.2026      0.081     -2.508      0.012      -0.361      -0.044
age                 0.0149      0.003      4.824      0.000       0.009       0.021
ethnicgrp_white     0.0710      0.025      2.805      0.005       0.021       0.121
gender_female       0.0888      0.022      3.977      0.000       0.045       0.133
imd_decile         -0.0042      0.007     -0.569      0.570      -0.019       0.010
partners1          -0.0660      0.024     -2.725      0.006      -0.114      -0.019
w                   0.2626      0.022     12.078      0.000       0.220       0.305
==============================================================================
Omnibus:                     2501.289   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              194.167
Skew:                           0.524   Prob(JB):                     6.87e-43
Kurtosis:                       1.743   Cond. No.                         177.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We observe that the ATE using double lasso is akin to the OLS with confounders, along with the adjusted R-squared.

2.4. Coefficient Comparison#

# Assuming model1, model2, model3 are your fitted statsmodels regression models
# Extract the coefficients for resW
b1 = est_1.params['w']
b2 = est_2.params['w']
b3 = model3.params['w']

# Extract the standard errors for resW
se1 = est_1.bse['w']
se2 = est_2.bse['w']
se3 = model3.bse['w']

# Labels for the models
model_labels = ['OLS', 'OLS w. controls', 'Double Lasso']

# Coefficients and standard errors
coefficients = [b1, b2, b3]
errors = [se1, se2, se3]

# Create the figure and axis
fig, ax = plt.subplots()

# Plot the coefficients with error bars
ax.errorbar(model_labels, coefficients, yerr=errors, fmt='o', capsize=5, capthick=2, marker='s', markersize=7, linestyle='None')

# Add title and labels
ax.set_title('Comparison of Coefficients from Different Models')
ax.set_xlabel('Model')
ax.set_ylabel('Coefficient')

# Add grid for better readability
ax.grid(True)

# Show the plot
plt.show()
C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_10508\2861833790.py:23: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "o" (-> marker='o'). The keyword argument will take precedence.
  ax.errorbar(model_labels, coefficients, yerr=errors, fmt='o', capsize=5, capthick=2, marker='s', markersize=7, linestyle='None')
../../_images/461c8be82b8ccdaabf22fca00253d6ad761da2bd832fe80fc6b56be23380ce38.png

In general, the three ATEs are very similar, but we consider that the models that include the cofounders (OLS with controls and DL) are better estimated.

3. Non-Linear Methods DML#

def dml(X, D, y, modely, modeld, *, nfolds, classifier=False, time = None, clu = None, cluster = True):
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds
    yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y
    # out-of-fold predictions for D
    # use predict or predict_proba dependent on classifier or regressor for D
    if classifier:
        Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    else:
        Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)
    # calculate outcome and treatment residuals
    resy = y - yhat
    resD = D - Dhat

    if cluster:
      # final stage ols clustered
      dml_data = pd.concat([clu, pd.Series(time), pd.Series(resy, name = 'resy'), pd.Series(resD, name = 'resD')], axis=1)

    else:
      # final stage ols nonclustered
      dml_data = pd.concat([pd.Series(resy, name = 'resy'), pd.Series(resD, name = 'resD')], axis=1)

    if cluster:
      # clustered standard errors
      ols_mod = smf.ols(formula = 'resy ~ 1 + resD', data = dml_data).fit(cov_type='cluster', cov_kwds={"groups": dml_data['CountyCode']})

    else:
      # regular ols
      ols_mod = smf.ols(formula = 'resy ~ 1 + resD', data = dml_data).fit()

    point = ols_mod.params[1]
    stderr = ols_mod.bse[1]
    epsilon = ols_mod.resid

    return point, stderr, yhat, Dhat, resy, resD, epsilon
def summary(point, stderr, yhat, Dhat, resy, resD, epsilon, X, D, y, *, name):
    '''
    Convenience summary function that takes the results of the DML function
    and summarizes several estimation quantities and performance metrics.
    '''
    return pd.DataFrame({'estimate': point, # point estimate
                         'stderr': stderr, # standard error
                         'lower': point - 1.96*stderr, # lower end of 95% confidence interval
                         'upper': point + 1.96*stderr, # upper end of 95% confidence interval
                         'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y
                         'rmse D': np.sqrt(np.mean(resD**2)) # RMSE of model that predicts treatment D
                         }, index=[name])
df['intercept'] = 1
y = df["y"]
d = df["w"]
x = df[df.columns[~df.columns.isin(['y','w'])]]

3.1. DML with RLasso#

class RLasso(BaseEstimator):

    def __init__(self, *, post=True):
        self.post = post

    def fit(self, X, y):
        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)
        return self

    def predict(self, X):
        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])

lasso_model = lambda: RLasso(post=False)
# DML with RLasso:
modely = make_pipeline(StandardScaler(), RLasso(post=False))
modeld = make_pipeline(StandardScaler(), RLasso(post=False))
result_RLasso = dml(x,d,y, modely, modeld, nfolds=10, classifier=False, cluster = False)
table_RLasso = summary(*result_RLasso, x,d,y, name = 'Lasso')
print(table_RLasso)
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\Matias Villalba\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\externals\loky\process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "C:\Users\Matias Villalba\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\externals\loky\process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Matias Villalba\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Matias Villalba\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\parallel.py", line 129, in __call__
    return self.function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Matias Villalba\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 1378, in _fit_and_predict
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Matias Villalba\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Matias Villalba\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_10508\3396256270.py", line 7, in fit
NameError: name 'hdmpy' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
Cell In[19], line 4
      2 modely = make_pipeline(StandardScaler(), RLasso(post=False))
      3 modeld = make_pipeline(StandardScaler(), RLasso(post=False))
----> 4 result_RLasso = dml(x,d,y, modely, modeld, nfolds=10, classifier=False, cluster = False)
      5 table_RLasso = summary(*result_RLasso, x,d,y, name = 'Lasso')
      6 print(table_RLasso)

Cell In[15], line 3, in dml(X, D, y, modely, modeld, nfolds, classifier, time, clu, cluster)
      1 def dml(X, D, y, modely, modeld, *, nfolds, classifier=False, time = None, clu = None, cluster = True):
      2     cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds
----> 3     yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y
      4     # out-of-fold predictions for D
      5     # use predict or predict_proba dependent on classifier or regressor for D
      6     if classifier:

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    207 try:
    208     with config_context(
    209         skip_parameter_validation=(
    210             prefer_skip_nested_validation or global_skip_validation
    211         )
    212     ):
--> 213         return func(*args, **kwargs)
    214 except InvalidParameterError as e:
    215     # When the function is just a wrapper around an estimator, we allow
    216     # the function to delegate validation to the estimator, but we replace
    217     # the name of the estimator by the name of the function in the error
    218     # message to avoid confusion.
    219     msg = re.sub(
    220         r"parameter of \w+ must be",
    221         f"parameter of {func.__qualname__} must be",
    222         str(e),
    223     )

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py:1293, in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, params, pre_dispatch, method)
   1290 # We clone the estimator to make sure that all the folds are
   1291 # independent, and that it is pickle-able.
   1292 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
-> 1293 predictions = parallel(
   1294     delayed(_fit_and_predict)(
   1295         clone(estimator),
   1296         X,
   1297         y,
   1298         train,
   1299         test,
   1300         routed_params.estimator.fit,
   1301         method,
   1302     )
   1303     for train, test in splits
   1304 )
   1306 inv_test_indices = np.empty(len(test_indices), dtype=int)
   1307 inv_test_indices[test_indices] = np.arange(len(test_indices))

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\parallel.py:67, in Parallel.__call__(self, iterable)
     62 config = get_config()
     63 iterable_with_config = (
     64     (_with_config(delayed_func, config), args, kwargs)
     65     for delayed_func, args, kwargs in iterable
     66 )
---> 67 return super().__call__(iterable_with_config)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\parallel.py:2007, in Parallel.__call__(self, iterable)
   2001 # The first item from the output is blank, but it makes the interpreter
   2002 # progress until it enters the Try/Except block of the generator and
   2003 # reach the first `yield` statement. This starts the aynchronous
   2004 # dispatch of the tasks to the workers.
   2005 next(output)
-> 2007 return output if self.return_generator else list(output)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\parallel.py:1650, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1647     yield
   1649     with self._backend.retrieval_context():
-> 1650         yield from self._retrieve()
   1652 except GeneratorExit:
   1653     # The generator has been garbage collected before being fully
   1654     # consumed. This aborts the remaining tasks if possible and warn
   1655     # the user if necessary.
   1656     self._exception = True

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\parallel.py:1754, in Parallel._retrieve(self)
   1747 while self._wait_retrieval():
   1748 
   1749     # If the callback thread of a worker has signaled that its task
   1750     # triggered an exception, or if the retrieval loop has raised an
   1751     # exception (e.g. `GeneratorExit`), exit the loop and surface the
   1752     # worker traceback.
   1753     if self._aborting:
-> 1754         self._raise_error_fast()
   1755         break
   1757     # If the next job is not ready for retrieval yet, we just wait for
   1758     # async callbacks to progress.

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\parallel.py:1789, in Parallel._raise_error_fast(self)
   1785 # If this error job exists, immediatly raise the error by
   1786 # calling get_result. This job might not exists if abort has been
   1787 # called directly or if the generator is gc'ed.
   1788 if error_job is not None:
-> 1789     error_job.get_result(self.timeout)

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\parallel.py:745, in BatchCompletionCallBack.get_result(self, timeout)
    739 backend = self.parallel._backend
    741 if backend.supports_retrieve_callback:
    742     # We assume that the result has already been retrieved by the
    743     # callback thread, and is stored internally. It's just waiting to
    744     # be returned.
--> 745     return self._return_or_raise()
    747 # For other backends, the main thread needs to run the retrieval step.
    748 try:

File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\parallel.py:763, in BatchCompletionCallBack._return_or_raise(self)
    761 try:
    762     if self.status == TASK_ERROR:
--> 763         raise self._result
    764     return self._result
    765 finally:

NameError: name 'hdmpy' is not defined

The message treatment providing information about Internet-accessed sexually transmitted infection testing predicts an increase in the probability that a person will get tested by 25.91 percentage points compared to receiving information about nearby clinics offering in-person testing. By providing both groups with information about testing, we mitigate the potential reminder effect, as both groups are equally prompted to consider testing. This approach allows us to isolate the impact of the type of information “Internet-accessed testing” versus “in-person clinic testing” on the likelihood of getting tested. Through randomized assignment, we establish causality rather than mere correlation, confirming that the intervention’s effect is driven by the unique advantages of Internet-accessed testing, such as increased privacy, reduced embarrassment, and convenience

3.2. DML with Regression Trees#

modely = make_pipeline(StandardScaler(), DecisionTreeRegressor(ccp_alpha=0.001, min_samples_leaf=5, random_state=123))
modeld = make_pipeline(StandardScaler(), DecisionTreeRegressor(ccp_alpha=0.001, min_samples_leaf=5, random_state=123))
result_RT = dml(x,d,y, modely, modeld, nfolds=10, classifier=False, cluster = False)
table_RT= summary(*result_RT, x,d,y, name = 'RT')
print(table_RT)
    estimate    stderr     lower     upper    rmse y    rmse D
RT  0.249831  0.021872  0.206961  0.292701  0.472259  0.499638
C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_23460\2498640246.py:30: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  point = ols_mod.params[1]
C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_23460\2498640246.py:31: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  stderr = ols_mod.bse[1]

The message treatment providing information about Internet-accessed sexually transmitted infection testing predicts an increase in the probability that a person will get tested by 24.98 percentage points compared to receiving information about nearby clinics offering in-person testing. By providing both groups with information about testing, we mitigate the potential reminder effect, as both groups are equally prompted to consider testing. This approach allows us to isolate the impact of the type of information “Internet-accessed testing” versus “in-person clinic testing” on the likelihood of getting tested. Through randomized assignment, we establish causality rather than mere correlation, confirming that the intervention’s effect is driven by the unique advantages of Internet-accessed testing, such as increased privacy, reduced embarrassment, and convenience

3.3. DML Boosting Trees#

modely = make_pipeline(StandardScaler(), GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=123))
modeld = make_pipeline(StandardScaler(), GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=123))
result_GBT = dml(x,d,y, modely, modeld, nfolds=10, classifier=False, cluster = False)
table_GBT = summary(*result_GBT, x,d,y, name = 'GBT')
print(table_GBT)
     estimate    stderr     lower     upper    rmse y    rmse D
GBT  0.246373  0.021669  0.203902  0.288843  0.475893  0.508383
C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_23460\2498640246.py:30: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  point = ols_mod.params[1]
C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_23460\2498640246.py:31: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  stderr = ols_mod.bse[1]

The message treatment providing information about Internet-accessed sexually transmitted infection testing predicts an increase in the probability that a person will get tested by 24.64 percentage points compared to receiving information about nearby clinics offering in-person testing. By providing both groups with information about testing, we mitigate the potential reminder effect, as both groups are equally prompted to consider testing. This approach allows us to isolate the impact of the type of information “Internet-accessed testing” versus “in-person clinic testing” on the likelihood of getting tested. Through randomized assignment, we establish causality rather than mere correlation, confirming that the intervention’s effect is driven by the unique advantages of Internet-accessed testing, such as increased privacy, reduced embarrassment, and convenience

3.4. DML with Random Forest#

modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))
modeld = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))
result_RF = dml(x,d,y, modely, modeld, nfolds=10, classifier=False, cluster = False)
table_RF = summary(*result_RF, x,d,y, name = 'RF')
print(table_RF)
    estimate    stderr     lower     upper    rmse y    rmse D
RF  0.241558  0.021379  0.199655  0.283461  0.472297  0.511585
C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_23460\2498640246.py:30: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  point = ols_mod.params[1]
C:\Users\Matias Villalba\AppData\Local\Temp\ipykernel_23460\2498640246.py:31: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  stderr = ols_mod.bse[1]

The message treatment providing information about Internet-accessed sexually transmitted infection testing predicts an increase in the probability that a person will get tested by 24.16 percentage points compared to receiving information about nearby clinics offering in-person testing. By providing both groups with information about testing, we mitigate the potential reminder effect, as both groups are equally prompted to consider testing. This approach allows us to isolate the impact of the type of information “Internet-accessed testing” versus “in-person clinic testing” on the likelihood of getting tested. Through randomized assignment, we establish causality rather than mere correlation, confirming that the intervention’s effect is driven by the unique advantages of Internet-accessed testing, such as increased privacy, reduced embarrassment, and convenience

3.6.#

# Extract coefficients and standard errors
coefficients = [result_RLasso[0], result_RT[0], result_GBT[0], result_RF[0]]
standard_errors = [result_RLasso[1], result_RT[1], result_GBT[1], result_RF[1]]
methods = ['RLasso', 'Regression Trees', 'Boosting Trees', 'Regression Forest']

# Plotting the coefficients with error bars
plt.figure(figsize=(10, 6))
plt.errorbar(methods, coefficients, yerr=standard_errors, fmt='o', capsize=5, capthick=2)
plt.xlabel('Regression Methods')
plt.ylabel('Coefficients')
plt.title('Comparison of Coefficients Across Regression Methods')
plt.grid(True)
plt.show()
../../_images/5829b328f948c674c110e4d1c6e24452ad1d72f62e94a7e9e8c05d1ec06aa300.png
results_table = pd.concat([table_RLasso, table_RT, table_GBT, table_RF], axis=0)
print(results_table)
       estimate    stderr     lower     upper    rmse y    rmse D
Lasso  0.259067  0.021722  0.216492  0.301642  0.471041  0.500225
RT     0.249831  0.021872  0.206961  0.292701  0.472259  0.499638
GBT    0.246373  0.021669  0.203902  0.288843  0.475893  0.508383
RF     0.241558  0.021379  0.199655  0.283461  0.472297  0.511585

To choose the best model, we must compare the RMSEs of the outcome variable Y. In this case, the model with the lowest RMSE for Y is generated by Lasso (0.471041), whereas the lowest for the treatment is generated by Regression Trees (0.4983734). Therefore, DML could be employed with Y cleaned using Lasso and the treatment using Regression Trees.