Enhancing Survey Microdata with Administrative Records: A Novel Approach to Microsimulation Dataset Construction
Authors: Nikhil Woodruff, Max Ghenis
Abstract: We combine the demographic detail of the Current Population Survey (CPS) with the tax precision of the IRS Public Use File (PUF) to create an enhanced microsimulation dataset. Our method uses quantile regression forests to transfer income and tax variables from the PUF to demographically-similar CPS households. We create a synthetic CPS-structured dataset using PUF tax information, stack it alongside the original CPS records, then use dropout-regularized gradient descent to reweight households toward administrative targets from IRS Statistics of Income, Census population estimates, and program participation data. This preserves the CPS’s granular demographic and geographic information while leveraging the PUF’s tax reporting accuracy. The enhanced dataset provides a foundation for analyzing federal tax policy, state tax systems, and benefit programs. We release both the enhanced dataset and our open-source enhancement procedure to support transparent policy analysis.
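The reweighting step named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name, the squared relative error loss, the log-weight parameterization, and the way dropout is applied (hiding a random share of households at each step and rescaling the rest) are all assumptions made for illustration, and the households and targets are synthetic.

```python
import math
import random

def reweight(features, targets, init_weights, steps=1000, lr=5.0,
             dropout=0.05, seed=0):
    """Hypothetical sketch: gradient descent on log-weights so that weighted
    column totals of `features` move toward `targets`."""
    rng = random.Random(seed)
    n, k = len(features), len(targets)
    log_w = [math.log(w) for w in init_weights]  # log scale keeps weights positive
    keep = 1.0 - dropout
    for _ in range(steps):
        # Dropout (assumed form): hide random households, rescale the others.
        mask = [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(n)]
        w = [math.exp(u) * m for u, m in zip(log_w, mask)]
        est = [sum(w[i] * features[i][j] for i in range(n)) for j in range(k)]
        # Derivative of the mean squared relative error w.r.t. each estimate.
        d_est = [2.0 * (est[j] - targets[j]) / targets[j] ** 2 / k
                 for j in range(k)]
        for i in range(n):
            if w[i] == 0.0:
                continue  # dropped households receive no update this step
            grad = w[i] * sum(d_est[j] * features[i][j] for j in range(k))
            log_w[i] -= lr * grad
    return [math.exp(u) for u in log_w]

# Synthetic example: 50 households, two targets (population and total income).
rng = random.Random(42)
incomes = [rng.uniform(10_000, 100_000) for _ in range(50)]
features = [[1.0, inc] for inc in incomes]
true_w = [rng.uniform(10, 30) for _ in range(50)]  # used only to build targets
targets = [sum(w * f[j] for w, f in zip(true_w, features)) for j in range(2)]
init_w = [10.0] * 50

def max_rel_error(weights):
    """Largest absolute relative gap between weighted totals and targets."""
    return max(abs(sum(w * f[j] for w, f in zip(weights, features))
                   / targets[j] - 1.0) for j in range(2))

new_w = reweight(features, targets, init_w)
print(max_rel_error(init_w), max_rel_error(new_w))
```

Optimizing log-weights rather than raw weights is one simple way to keep every household weight positive. The dropout noise discourages the fit from leaning on a handful of households, which is the usual motivation for this kind of regularization.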
Seminar Notes
Venue
SGE 2025
Objective
To combine the public-use CPS and IRS PUF files into a single analytical dataset
Importance
Policy analysis - taxes and benefits jointly affect household incentives. First open dataset with administrative-quality tax data
Data & Key Variables
CPS ASEC - limited tax information, detailed demographics. Income topcoded
IRS PUF - no demographics or state identifiers. Not updated frequently (using 2015 data here)
Methodology
Use machine learning (quantile regression forests) to combine CPS and IRS PUF, starting with CPS demographics and program data
Impute PUF tax variables, predict housing costs from ACS, estimate prior year earnings
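The imputation idea can be sketched with a toy forest written in plain Python. The key mechanism is that each leaf keeps the raw outcome values of the donor records that fall into it, so a prediction is a conditional distribution from which any quantile can be read, not a single mean. Everything here is an assumption for illustration: the donor data are synthetic, the predictors (age, wages) and outcome (capital gains) are hypothetical, and pooling leaf values across trees is a simplification of the weighting used in full quantile regression forests. It is not the microimpute package's API.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def grow(X, y, rng, min_leaf=5):
    """Grow one regression tree; a leaf is a sorted list of outcome values."""
    if len(y) < 2 * min_leaf:
        return sorted(y)
    best = None
    for _ in range(8):  # try a few random candidate splits
        f = rng.randrange(len(X[0]))
        t = X[rng.randrange(len(X))][f]
        left = [i for i in range(len(y)) if X[i][f] <= t]
        right = [i for i in range(len(y)) if X[i][f] > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        score = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if best is None or score < best[0]:
            best = (score, f, t, left, right)
    if best is None:
        return sorted(y)
    _, f, t, left, right = best
    return (f, t,
            grow([X[i] for i in left], [y[i] for i in left], rng, min_leaf),
            grow([X[i] for i in right], [y[i] for i in right], rng, min_leaf))

def leaf_values(tree, x):
    """Walk down to the leaf containing x (internal nodes are tuples)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def fit_forest(X, y, n_trees=30, seed=0):
    """Fit each tree on a bootstrap sample of the donor records."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]
        forest.append(grow([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict_quantile(forest, x, q):
    """Pool leaf values across trees and read off the q-th quantile."""
    pooled = sorted(v for tree in forest for v in leaf_values(tree, x))
    return pooled[min(int(q * len(pooled)), len(pooled) - 1)]

# Synthetic "donor" file standing in for the PUF: [age, wages] -> capital gains.
rng = random.Random(1)
donor_X = [[rng.randint(18, 80), rng.uniform(10_000, 200_000)]
           for _ in range(300)]
donor_y = [0.1 * wages + rng.gauss(0, 1000) for _, wages in donor_X]
forest = fit_forest(donor_X, donor_y)

# Impute for a hypothetical CPS-like record by drawing a random quantile.
record = [40, 60_000]
imputed = predict_quantile(forest, record, rng.random())
print(imputed)
```

Drawing a random quantile per record, instead of always taking the median, is what lets imputed values reproduce the spread of the donor distribution.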
Results
Matching and OLS are very susceptible to overfitting; random forests perform much better
Enhanced dataset outperforms the CPS on 63% of targets and the IRS PUF on 71% of targets
PolicyEngine.org - Interactive database & microimpute package