Associations between long-term exposure to air pollution and kidney function utilizing electronic healthcare records: a cross-sectional study | Environmental Health
Study population
We defined the sampling frame for this study to include patients with available electronic health records (EHRs) containing information on kidney health. Specifically, we include those with available serum creatinine laboratory test results and ICD codes relevant to CKD (available in Supplemental Table 1). We then separate this population into two distinct groups. The first group consists of all individuals with reported lab values for serum creatinine. We will use this group to analyze associations between air pollution and eGFRcr continuously. The second group is restricted to patients with two measures of eGFRcr < 60 mL/min per 1.73 m2 > 90 days apart and/or an ICD code indicating a physician diagnosis of CKD III-V. This group we consider as a positive case for CKD. These positive cases will be matched with controls for analysis.
To accomplish this, we utilize data from EHRs in the in Environmental Protection Agency’s Clinical and Archived Records Research for Environmental Studies (EPA CARES) [19, 20]. Our sampling frame from the EPA CARES population is a random sample of 19,989 individuals (504,406 unique visits), who were seen at a UNCHCS affiliated hospital or clinic from January 1st, 2004, to December 31st, 2016. Any participants with implausible demographic information (e.g., older than 110 years, BMI above 50, etc.) were removed prior to any analysis as it is likely these values were introduced during errors in entering information in electronic health records. Additionally, we removed any individuals who did not reside in North Carolina (n = 403). For both groups (continuous outcome and binary), we linked the 1-year average PM2.5, O3, and NO2 prior to the date of the (group 1) serum creatinine laboratory tests or (group 2) the second eGFRcr value < 60 mL/min per 1.73 m2 or ICD code, whichever occurs earliest.
Assessment of renal function
For the eGFR analyses, we use serum creatinine levels to assess kidney function as our first outcome. We estimated eGFRcr using the 2021 CKD-EPI equation for serum creatinine:
$$\texteGFR_\textcr = 142 \times min(\textS_\textcr/\upkappa,\ 1)^\upalpha \times max(\textS_\textcr/\upkappa,\ 1)^\text-1.200 \times \text0.9938^\textAge \times \text1.012\ [if\ female]$$
This equation was updated in 2021 to no longer include race in estimates of eGFRcr. Here, Scr is serum creatinine in mg/dL, κ is 0.7 for females and 0.9 for males, α is -0.241 for females and -0.302 for males, the min and max represent the minimum or maximum of the specified measurement or 1 [21]. We used the Tukey method to remove outliers, eliminating serum creatinine values more than 1.5 standard deviations above Q3 or those 1.5 standard deviations below Q1 before we calculate eGFRcr [22]. Information on UACR and serum cystatin C were not included in this analysis as they were not available in the data. The analyses focused on eGFRcr as a continuous outcome includes all recorded measures. Individuals may be included multiple times in the same dataset.
For our second outcome of interest, we designate an e-phenotype using similar methods described in previous research focused on kidney health utilizing EHRs [23, 24]. We consider a positive case of CKD stage III-V if a patient presents with two eGFRcr measures < 60 mL/min per 1.73 m2 greater than 90 days apart or has an ICD code indicating physician diagnosis. If patient data contains both types of diagnoses, we take the earliest diagnostic date. If a patient only has eGFRcr measures, we take the second measure as the diagnostic date. ICD-9 codes include 585.3 – 585.6 and ICD-10 codes include N-18.3 – N18.6. We use this method as many people living with CKD are unknowingly living with CKD and may not be diagnosed by a physician. Using similar methods, Paik et al. 2021 achieved positive predictive values > 80% [23]. One-year annual air pollutant averages for the preceding 365 days are linked to the exact serum creatinine laboratory date as exposures.
Matching identified CKD cases and controls
For our CKD analyses, to ensure that our sampling was robust against bias, we matched each case to four controls (1:4) who were never diagnosed or identified as having CKD by e-phenotyping. We performed this matching using the ‘MatchIt’ package in RStudio. This package allowed for matching cases and controls based on designated input variables to produce more robust results with less sensitivity to assumptions. We matched according to propensity scores based on diagnosis date, age, race, and sex. For controls, who do not have a diagnosis date, we match on the closest hospital visit (Supplementary Fig. 2). We matched on the ‘optimal’ controls using the propensity score generating by matching variables. To ensure that we were not selecting matches from different geographic regions of the state, potentially introducing confounding, we compared cases and control percentages taken from the eight climate divisions of the state (see Supplementary information) [25]. This matching was only done for patients with identified CKD. We then calculate differences in dates between cases and controls to ensure that we are sampling from similar time frames.
Exposure assessment
For PM2.5 and NO2 data, we used an ensemble model constructed by Di et al. that incorporates satellite aerosol measures, land-use regression, chemical transport models, and meteorological data [26]. This model incorporates three machine learning algorithms that predict pollutant concentrations in 1 × 1 km grids for the entirety of the contiguous Unites States. This model has been cross validated with an R2 of 0.89 (for the US Middle Atlantic Region) and shows accurate performance up to concentrations of approximately 60 µg/m3 or less [26]. The CARES patient data has the primary addresses of patients which we link to the appropriate 1 × 1 km grid. Where primary addresses were not successfully geocoded, we matched patients to the 1 × 1 km grid cell of the centroid of their primary residence ZIP code. O3 data come from the 12 km2 Community Multiscale Air Quality Modeling System (CMAQ) model; specifically, we use averaged 8-h maximum concentrations for O3 and averaged 24-h for NO2 [27]. CMAQ utilizes hourly measured pollutant data along with meteorological information to estimate pollutant concentrations at the census tract level. For all three pollutants, we estimate annual averages for all included patients.
Covariates
We chose covariates based on previously published research examining associations between air pollution and renal function [28]. We include individual-level sociodemographic information of age, race (Caucasian, African American, other), and sex as factors. We created the ‘other’ race category as there were too few patients of other racial backgrounds that were not Caucasian or African American to include separately in models. Clinical diagnosis of both diabetes and hypertension were included in descriptive statistics based on ICD-9 and ICD-10 codes (250.x and E11.x for diabetes and 401.9 and I10 for primary hypertension) (full list of ICD codes available in Supplementary information). However, these were excluded from models as they are both likely mediators of kidney function and onset of CKD. Event-specific instances of these diseases (e.g., pregnancy induced hypertension) were not included in this analysis. We adjusted for the following 2010 census/2013 5-year ACS variables at the block group level: income, percent older housing (built before 1979), percent living in poverty, urbanicity, and percent of the population on public assistance, all as continuous covariates. Education (percent with a bachelor’s degree or higher) and median price of housing were included in descriptive statistics but excluded from final models due to high collinearity (r >|0.7|) with income. Lastly, climate zones (identified from climatechange.nc.gov) were included as regional adjustment for unmeasured factors that differ between regions in North Carolina as a factor in our models. Smoking status and body mass index (BMI) were not included in the main analyses as they were not recorded for a large portion of patients and only reported as secondary analyses.
Statistical analyses
We analyzed associations between eGFRcr and air pollutants using linear mixed models, presenting unadjusted and fully adjusted models, with a random intercept for patient ID. We first calculated descriptive statistics for patients. Following this we calculated Pearson correlations between PM2.5, O3, and NO2 to examine the relationship between the exposures of interest. To make exposures more comparable we then calculate interquartile range (IQR) for use in the models. We controlled for the continuous census block group covariates including average income, percent older housing (built before 1979), percent living in poverty, urbanicity, and percent of the population on public assistance. Demographic covariates included age, sex, and race. As patients were more likely to be sampled from geographic regions closer to hospitals near the flagship UNC Chapel Hill hospital, we control for the climate zones (as identified by the NC Climate Division, map available in Supplementary information) in North Carolina. There are eight climate zones in North Carolina, however due to too few observations we only include zones 3–8 in our analyses (n = 15 patients removed).
We calculated eGFRcr as a continuous outcome along with serum creatinine pre-transformation as a secondary outcome. 1-year average concentrations for PM2.5, O3, and NO2 were matched to the date the laboratory test for serum creatinine was completed. We then calculated IQRs for each pollutant during the 1-year period to make them more comparable. Our fully adjusted models included age and race, census block group information (median income, % older housing, % poverty, urbanicity, % on public assistance), geographic region, and exposures (PM2.5, O3, and NO2) along with unrestricted natural cubic spline adjustment for long-term temporal variations with the number of splines based on the Aikake information criteria. We present only the results of multipollutant models, information on single pollutant models is available in Supplementary table S3.
We conducted unconditional multiple logistic regression to estimate odds ratios between first indication date of CKD and air quality for 1-year prior to diagnosis comparing our cases and controls (results of conditional are available in Table S4) [29]. The census block group, demographic, and comorbidity covariates included in our multiple logistic regression models were the same as those included in the linear mixed models. All analyses and visualizations were completed using SAS software version 9.4 and RStudio 4.0.3 [30, 31]. In RStudio we used the package ‘matchit’ for matching cases and controls for our multiple logistic regression models [32].
Sensitivity analysis
Body mass index (BMI) and smoking status were not reported for all patients in the random sample and were used in secondary analyses to ensure that their inclusion did not alter the linear mixed models described previously. BMI is available for n = 18,639 (n = 4,834 patients) serum creatinine lab measures and smoking status is available for n = 30,913 (n = 5,532 patients). Smoking status was separated into five categories including current, current/former, former, never/former, and never. Smoking status was attached to the same day serum creatinine tests were taken, or if it was not assessed that day, then the nearest prior date where smoking status was available. We ran two additional fully adjusted models with the same covariates in addition to BMI (continuous) and smoking status. For both analyses including BMI and smoking status, we calculate associations with and without the additional confounder to ensure that differences seen are not driven by underlying characteristics of the new sampling frames. We ran the fully adjusted models comparing cases of CKD to the entire random sample (available in Supplementary data). For patients without CKD, we ran two models, attaching an exposure date as both first appearance in the hospital system and median visit to ensure that results were consistent at different time points. Lastly, we include only those with street-level geocoded addresses (some patients were coded at zip code) to ensure the most accurate assignment of exposures.
Stratified analysis
We also stratified by individuals who were exposed to 1-year PM2.5 averages ≥ 12 µg/m3 and those < 12 µg/m3 [33]. This threshold was chosen based on current (2022) National Ambient Air Quality Standards (NAAQS) primary standard for PM2.5. To estimate associations of air pollution on patients with low functioning kidneys we stratify by those patients with a measure of an eGFRcr < 60 mL/min/1.73m2.
link