ML-HAPPG

ML-HAPPG (Machine Learning for Hourly Air Pollution Prediction – Global)

A Data-Driven Supervised Machine Learning Approach to Estimating Global Ambient Air Pollution Concentrations With Associated Prediction Intervals

Addressing the global challenge of ambient air pollution requires scalable, high-resolution data, yet many regions lack comprehensive monitoring infrastructure. This research introduces a machine learning framework that extends air quality estimation globally, using remote sensing, meteorological reanalysis, and emissions datasets to produce hourly pollutant concentrations at a 0.25° spatial resolution.

This work builds on the LightGBM-based framework introduced in "A framework for scalable ambient air pollution concentration estimation", adapting it for a global context. LightGBM, a fast and accurate gradient-boosted decision tree algorithm, enables this large-scale application with high predictive performance and interpretability.

Dataset Purpose

This dataset gives researchers, policymakers, urban planners, and public health authorities a basis for air quality assessments and interventions at hourly, 0.25° resolution. This granularity supports more precise studies of air pollution impacts on human health, urban resilience, and environmental justice than conventional lower-resolution approaches allow. The framework's computational efficiency and scalability also suggest it could be applied to similar pollution estimation problems, especially in regions with limited observational infrastructure.

The dataset is readily accessible and stored online, ensuring rapid retrieval and ease of use for various analytical and operational purposes. Users can confidently perform high-resolution analyses supported by validated machine-learning-driven estimates, thereby enhancing informed decision-making and targeted interventions aimed at reducing air pollution exposure and promoting sustainable urban environments.

For questions regarding this dataset, please contact Liam J Berrisford.

Models Datasets

All of the models used in the research are included within this dataset. The models have been trained to predict the mean air pollution concentration and the 5th, 50th (median), and 95th percentiles of the air pollution concentration. Each model is saved as a LightGBM booster within a .txt file, and its parameters are kept within a .json file. The code used to recreate a LightGBM object in Python that can make predictions is available via the Environment Insights Python package. The directory structure indicates, for each model, whether all of the monitoring stations were used and whether all of the feature vectors were used.
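As a minimal sketch of how a model pair might be loaded without the Environment Insights package, the snippet below reads the parameter file with the standard library and rebuilds a booster through LightGBM's `Booster(model_file=...)` constructor. The file names `no2_mean.txt` and `no2_mean.json` are hypothetical placeholders; the real names and directory layout should be taken from the dataset itself.

```python
import json
from pathlib import Path


def load_params(params_path):
    """Read the model parameters stored alongside a booster as JSON."""
    with open(params_path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def load_booster(model_path):
    """Rebuild a LightGBM booster from its saved .txt model file."""
    # Imported lazily so the JSON helper works without LightGBM installed.
    import lightgbm as lgb

    return lgb.Booster(model_file=str(model_path))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    model_path = Path("no2_mean.txt")
    params_path = Path("no2_mean.json")
    if model_path.exists() and params_path.exists():
        params = load_params(params_path)
        booster = load_booster(model_path)
        print(type(params).__name__, booster.num_trees())
```

The returned booster exposes `predict`, which expects the feature columns in the same order used during training; that order is not stated here and should be checked against the package or the model file.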

Air Pollution (Target Vector) Dataset Overview

The dataset includes hourly estimates for key ambient air pollutants: Nitrogen Dioxide (NO~2~), Ozone (O~3~), Particulate Matter with a diameter of 10 μm or less (PM~10~), Particulate Matter with a diameter of 2.5 μm or less (PM~2.5~), and Sulphur Dioxide (SO~2~). Rigorous validation was performed to assess the accuracy of model predictions for forecasting air pollution concentrations, estimating values at previously unmeasured locations, and capturing extreme pollution episodes.

Estimates were produced at the mean, as well as at the 0.05, 0.5, and 0.95 quantiles, to capture both the typical and extreme variations in air pollution concentrations. The mean provides a central tendency estimate, useful for general assessments and long-term policy planning. The quantile predictions describe the variability and uncertainty inherent in the concentration estimates: the 0.05 and 0.95 quantiles bound the range of likely pollutant levels, and the 0.5 quantile gives the median. This supports risk assessments and targeted interventions, such as public health advisories or emergency response planning during pollution peaks.
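Taken together, the 0.05 and 0.95 quantile estimates bound a nominal 90% prediction interval. As a rough illustration (not the validation procedure used in the study), the sketch below computes the interval width and the empirical coverage, meaning the share of observations that fall inside the interval. All numbers are made up for the example.

```python
def interval_width(lower, upper):
    """Width of each prediction interval (upper bound minus lower bound)."""
    return [hi - lo for lo, hi in zip(lower, upper)]


def empirical_coverage(observed, lower, upper):
    """Fraction of observations lying inside their prediction interval."""
    inside = sum(
        1 for obs, lo, hi in zip(observed, lower, upper) if lo <= obs <= hi
    )
    return inside / len(observed)


# Invented values in µg/m³, for illustration only.
q05 = [4.0, 6.5, 10.0, 12.0]        # 0.05 quantile predictions
q95 = [20.0, 25.5, 41.0, 38.0]      # 0.95 quantile predictions
measured = [12.0, 30.0, 18.5, 12.0]  # corresponding observations

widths = interval_width(q05, q95)
coverage = empirical_coverage(measured, q05, q95)
print(widths)    # [16.0, 19.0, 31.0, 26.0]
print(coverage)  # 0.75
```

Over a large sample, a well-calibrated interval would be expected to cover close to 90% of observations.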

Feature Vector Dataset Overview

The dataset developed in this study includes a comprehensive set of feature vectors used to estimate ambient air pollution concentrations across the globe. Feature vectors represent environmental conditions and phenomena known to influence air pollutant concentrations, including meteorological variables (e.g., wind speed and temperature), emissions from various human activities (e.g., traffic intensity, industrial processes) and remotely sensed air pollution measurements. Each feature was selected based on established scientific evidence linking these variables to the formation, dispersion, and accumulation of air pollutants. Incorporating such a diverse and detailed set of features enables the machine learning model to robustly capture complex spatial and temporal variations in air quality, ultimately improving the accuracy and applicability of pollution estimates.

Data Description

| Data description | |
|---|---|
| Data type | Point Estimates |
| Projection | EPSG:4326 WGS 84 (latitude/longitude) |
| Horizontal coverage | Global |
| Horizontal resolution | ~0.25° (approx. 25 km at equator) |
| Vertical coverage | Surface only |
| Vertical resolution | Single layer |
| Temporal coverage | 2022 |
| Temporal resolution | Hourly |
| File format | NetCDF |
| Update frequency | Static |

Coordinate Variables

| Name | Units | Description |
|---|---|---|
| Timestamp | N/A | Time coordinate (hourly resolution) |
| Longitude | Degrees East | Longitude of centroid (EPSG:4326) |
| Latitude | Degrees North | Latitude of centroid (EPSG:4326) |

Each NetCDF file is indexed by (Timestamp, Latitude, Longitude).
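A file with this layout can be read with any NetCDF-aware library. The sketch below assumes the `xarray` package, a hypothetical file name, and that the three coordinates are separate dimension coordinates spelled exactly as listed above (`Timestamp`, `Latitude`, `Longitude`). None of this is confirmed here, so it is worth inspecting a file with `print(ds)` first.

```python
from pathlib import Path

# Coordinate names as listed above; the exact spelling inside the files
# is an assumption.
TIME, LAT, LON = "Timestamp", "Latitude", "Longitude"


def select_point(dataset, timestamp, latitude, longitude):
    """Return all variables for one hour at the grid cell nearest a point."""
    # Exact match on the hour, then nearest-neighbour match on the centroid.
    at_time = dataset.sel({TIME: timestamp})
    return at_time.sel({LAT: latitude, LON: longitude}, method="nearest")


if __name__ == "__main__":
    path = Path("ml_happg_example.nc")  # hypothetical file name
    if path.exists():
        # Imported lazily; requires xarray and a NetCDF backend.
        import xarray as xr

        with xr.open_dataset(path) as ds:
            print(select_point(ds, "2022-07-01T12:00", 51.5, -0.1))
```

Nearest-neighbour selection is used for latitude and longitude because an arbitrary point will rarely coincide with a grid centroid.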

Data Variables

Output Variables

| Name | Units | Description |
|---|---|---|
| Global Model Grid ID | | Unique identifier for each grid cell in the global model; synthetic monitoring station locations are the grid cell centroids. |
| no2 Prediction 0.05 Quantile | µg/m^3^ | Estimated 5th percentile of modelled NO~2~ concentration. |
| no2 Prediction 0.5 Quantile | µg/m^3^ | Estimated 50th percentile (median) of modelled NO~2~ concentration. |
| no2 Prediction 0.95 Quantile | µg/m^3^ | Estimated 95th percentile of modelled NO~2~ concentration. |
| no2 Prediction Mean | µg/m^3^ | Estimated mean of modelled NO~2~ concentration. |
| o3 Prediction 0.05 Quantile | µg/m^3^ | Estimated 5th percentile of modelled O~3~ concentration. |
| o3 Prediction 0.5 Quantile | µg/m^3^ | Estimated 50th percentile (median) of modelled O~3~ concentration. |
| o3 Prediction 0.95 Quantile | µg/m^3^ | Estimated 95th percentile of modelled O~3~ concentration. |
| o3 Prediction Mean | µg/m^3^ | Estimated mean of modelled O~3~ concentration. |
| pm10 Prediction 0.05 Quantile | µg/m^3^ | Estimated 5th percentile of modelled PM~10~ concentration. |
| pm10 Prediction 0.5 Quantile | µg/m^3^ | Estimated 50th percentile (median) of modelled PM~10~ concentration. |
| pm10 Prediction 0.95 Quantile | µg/m^3^ | Estimated 95th percentile of modelled PM~10~ concentration. |
| pm10 Prediction Mean | µg/m^3^ | Estimated mean of modelled PM~10~ concentration. |
| pm2.5 Prediction 0.05 Quantile | µg/m^3^ | Estimated 5th percentile of modelled PM~2.5~ concentration. |
| pm2.5 Prediction 0.5 Quantile | µg/m^3^ | Estimated 50th percentile (median) of modelled PM~2.5~ concentration. |
| pm2.5 Prediction 0.95 Quantile | µg/m^3^ | Estimated 95th percentile of modelled PM~2.5~ concentration. |
| pm2.5 Prediction Mean | µg/m^3^ | Estimated mean of modelled PM~2.5~ concentration. |
| so2 Prediction 0.05 Quantile | µg/m^3^ | Estimated 5th percentile of modelled SO~2~ concentration. |
| so2 Prediction 0.5 Quantile | µg/m^3^ | Estimated 50th percentile (median) of modelled SO~2~ concentration. |
| so2 Prediction 0.95 Quantile | µg/m^3^ | Estimated 95th percentile of modelled SO~2~ concentration. |
| so2 Prediction Mean | µg/m^3^ | Estimated mean of modelled SO~2~ concentration. |
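The prediction variable names follow a regular pattern of pollutant plus statistic. The short sketch below generates the 20 prediction names listed above, which can be convenient when looping over variables programmatically. It assumes the names inside the files are spelled exactly as in this listing; the constant and function names are illustrative.

```python
POLLUTANTS = ["no2", "o3", "pm10", "pm2.5", "so2"]
QUANTILES = ["0.05", "0.5", "0.95"]


def output_variable_names():
    """Build the prediction variable names for every pollutant."""
    names = []
    for pollutant in POLLUTANTS:
        # Three quantile estimates followed by the mean, as in the listing.
        for quantile in QUANTILES:
            names.append(f"{pollutant} Prediction {quantile} Quantile")
        names.append(f"{pollutant} Prediction Mean")
    return names


names = output_variable_names()
print(len(names))  # 20
print(names[:4])
```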

Input Variables

| Name | Units | Description |
|---|---|---|
| 100m U Component of Wind | m/s | East–west wind component at 100 m above ground level. |
| 100m V Component of Wind | m/s | North–south wind component at 100 m above ground level. |
| 10m U Component of Wind | m/s | East–west wind component at 10 m above ground level. |
| 10m V Component of Wind | m/s | North–south wind component at 10 m above ground level. |
| 2m Dewpoint Temperature | K | Temperature at which air becomes saturated, measured at 2 m above ground level. |
| 2m Temperature | K | Air temperature at 2 m above ground level. |
| Boundary Layer Height | m | Height of the atmospheric boundary layer above ground level. |
| Downward UV Radiation at Surface | W/m² | Downward ultraviolet radiant flux received at Earth's surface. |
| Instantaneous 10m Wind Gust | m/s | Peak wind gust speed observed at 10 m above ground level over a short time interval. |
| Surface Pressure | hPa | Atmospheric pressure at ground level. |
| Total Column Rain Water | kg/m² | Vertically integrated amount of rain water in a column of air above the surface. |
| S5P NO₂ | mol/m² | Tropospheric column amount of nitrogen dioxide (NO₂) from Sentinel‑5P. |
| S5P Absorbing Aerosol Index | - | Absorbing Aerosol Index (AAI), indicating the presence of UV-absorbing aerosols in the atmosphere. |
| S5P CO | mol/m² | Total column amount of carbon monoxide (CO) retrieved from Sentinel‑5P. |
| S5P O₃ | mol/m² | Total column ozone (O₃) retrieved by Sentinel‑5P. |
| Anthropogenic Emissions Sum Sectors co | kilotonne | Total anthropogenic CO emissions from all sectors. |
| Anthropogenic Emissions Sum Sectors nox | kilotonne | Total anthropogenic NOₓ emissions from all sectors. |
| Anthropogenic Emissions Sum Sectors nmvocs | kilotonne | Total anthropogenic non-methane volatile organic compound emissions from all sectors. |
| Anthropogenic Emissions Sum Sectors other-vocs | kilotonne | Total anthropogenic emissions of other volatile organic compounds from all sectors. |
| Anthropogenic Emissions Sum Sectors so2 | kilotonne | Total anthropogenic SO₂ emissions from all sectors. |
| Biogenic Emissions Biogenic CO | kilotonne | Total biogenic CO emissions. |
| Timestamp Local | N/A | Local timestamp (adjusted using UTC offset). |
| UTC Offset | hours | Offset from UTC time (in hours). |
| Month Number | - | Integer representing the month, from 1 (January) to 12 (December). |
| Week Number | - | Integer denoting the ISO week number (1–53). |
| Day of Week Number | - | Integer representing the weekday, from 0 (Monday) to 6 (Sunday). |
| Hour Number | - | Hour of the day on a 24-hour clock, from 0 (midnight) to 23 (11 pm). |
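The calendar features at the end of this listing can be derived from a UTC timestamp and a UTC offset using only the Python standard library. The sketch below is one way to compute them that matches the stated conventions (ISO week number, Monday as 0). It is not necessarily the code used to build the dataset, and it assumes the features are taken from the local timestamp rather than the UTC one. The function name is illustrative.

```python
from datetime import datetime, timedelta


def temporal_features(timestamp_utc, utc_offset_hours):
    """Derive local-time calendar features from a UTC timestamp."""
    # Assumption: features are computed from local time, not UTC.
    local = timestamp_utc + timedelta(hours=utc_offset_hours)
    return {
        "Timestamp Local": local,
        "UTC Offset": utc_offset_hours,
        "Month Number": local.month,                # 1 (January) to 12 (December)
        "Week Number": local.isocalendar()[1],      # ISO week, 1 to 53
        "Day of Week Number": local.weekday(),      # 0 (Monday) to 6 (Sunday)
        "Hour Number": local.hour,                  # 0 to 23
    }


# 22:00 UTC on Friday 1 July 2022 with a +2 hour offset rolls over to
# midnight on Saturday 2 July in local time.
features = temporal_features(datetime(2022, 7, 1, 22, 0), 2)
print(features)
```

The example shows why the offset matters: the day of week and hour differ between the UTC and local representations of the same instant.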

Training Data Variables

The NetCDF files across this subdirectory include the additional variables listed here. For convenience, the Global Model Grid ID is also provided as an attribute.

| Name | Units | Description |
|---|---|---|
| no2 Measurement | µg/m^3^ | Measured nitrogen dioxide (NO~2~) concentration. |
| o3 Measurement | µg/m^3^ | Measured ozone (O~3~) concentration. |
| pm10 Measurement | µg/m^3^ | Measured particulate matter ≤10 µm (PM~10~) concentration. |
| pm2.5 Measurement | µg/m^3^ | Measured particulate matter ≤2.5 µm (PM~2.5~) concentration. |
| so2 Measurement | µg/m^3^ | Measured sulphur dioxide (SO~2~) concentration. |