# ML-HAPPG 

<style>
  /* 1) The wrapper enforces the “height = X% of width” */
  .iframe‐wrapper {
    position: relative;
    width: 100%;           /* full-width */
    padding-bottom: 55%;   /* ← 55% of width = height */
    height: 0;             /* collapse the wrapper’s own height (padding-bottom will “create” the height) */
  }

  /* 2) The iframe is absolutely positioned to fill the wrapper */
  .iframe‐wrapper iframe {
    position: absolute;
    top: 0;
    left: 0;
    width: 100%;    /* fill 100% of wrapper’s width */
    height: 100%;   /* fill 100% of wrapper’s (calculated) height */
    border: 0;
  }
</style>

<div class="iframe‐wrapper">
  <iframe 
    src="../../_static/mlhappg_o3.html"
    scrolling="no"
    allowfullscreen
  ></iframe>
</div>

# ML-HAPPG (Machine Learning for Hourly Air Pollution Prediction – Global)

## [A Data-Driven Supervised Machine Learning Approach to Estimating Global Ambient Air Pollution Concentrations With Associated Prediction Intervals](https://doi.org/10.1098/rsos.241288)

Addressing the global challenge of ambient air pollution requires scalable and high-resolution data. However, many regions lack comprehensive monitoring infrastructure. This research introduces a machine learning framework that extends air quality estimation globally, using remote sensing, meteorological reanalysis, and emissions datasets to produce hourly pollutant concentrations at a 0.25 degree spatial resolution across the globe.

This work builds on the LightGBM-based framework introduced in [A framework for scalable ambient air pollution concentration estimation](https://doi.org/10.1017/eds.2025.9), adapting it for a global context. LightGBM, a fast and accurate gradient-boosted decision tree algorithm, enables this large-scale application with high predictive performance and interpretability.

## Dataset Purpose

This dataset gives researchers, policymakers, urban planners, and public health authorities a basis for air quality assessments and interventions at hourly, 0.25° resolution. This granularity supports more precise studies of air pollution impacts on human health, urban resilience, and environmental justice than conventional lower-resolution approaches allow. The framework’s computational efficiency and scalability also suggest it could be applied to similar pollution estimation problems, especially in regions with limited observational infrastructure.

The dataset is stored online and readily accessible for analytical and operational use. Its validated machine learning estimates support high-resolution analyses that inform decision-making and targeted interventions aimed at reducing air pollution exposure and promoting sustainable urban environments.

For questions regarding this dataset, please contact [Liam J Berrisford](https://orcid.org/0000-0001-6578-3497).

### Models Datasets

All models used in the research are included in this dataset. They were trained to predict the mean air pollution concentration and its 5th, 50th (median), and 95th percentiles. Each model is saved as a LightGBM booster in a .txt file, with its parameters in an accompanying .json file. Code that recreates a LightGBM object in Python for making predictions is available in the [Environmental Insights](https://github.com/berrli/Environmental-Insights) Python package. The directory structure indicates whether a model was trained on all monitoring stations and whether it used all feature vectors.
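As a rough illustration of the storage layout described above, the sketch below reads a parameter .json file and recreates a booster from its .txt dump using LightGBM's own `Booster(model_file=...)` constructor. The function names and the file names in the usage comments are hypothetical, and this is not the loading code from the Environmental Insights package, which should be preferred in practice.

```python
import json


def load_model_params(params_path):
    """Read the model parameters stored in the accompanying .json file."""
    with open(params_path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def load_booster(model_path):
    """Recreate a LightGBM Booster from its saved .txt file."""
    # Imported lazily so the JSON helper can be used without LightGBM installed.
    import lightgbm as lgb

    return lgb.Booster(model_file=str(model_path))


# Example usage (hypothetical file names; use the paths from the dataset's
# directory structure instead):
# params = load_model_params("no2_mean_params.json")
# booster = load_booster("no2_mean_model.txt")
# predictions = booster.predict(feature_matrix)  # 2-D array, columns in training order
```

The feature columns passed to `predict` must match the order used in training; that order is not stated here, so check it against the package before relying on the output.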

### Air Pollution (Target Vector) Dataset Overview

The dataset includes hourly estimates for key ambient air pollutants: Nitrogen Dioxide (NO~2~), Ozone (O~3~), Particulate Matter with a diameter of 10 μm or less (PM~10~), Particulate Matter with a diameter of 2.5 μm or less (PM~2.5~), and Sulphur Dioxide (SO~2~). Rigorous validation was performed to assess the accuracy of model predictions for forecasting air pollution concentrations, estimating values at previously unmeasured locations, and capturing extreme pollution episodes.

Estimates were produced for the mean and for the 0.05, 0.5, and 0.95 quantiles to capture both typical and extreme variations in air pollution concentrations. The mean provides a central tendency estimate, useful for general assessments and long-term policy planning. The quantile predictions describe the variability and uncertainty of the estimates: the 0.5 quantile is the median, and the 0.05 and 0.95 quantiles together bound a 90% prediction interval. This lets stakeholders understand the range and extremes of pollutant levels, supporting risk assessments and targeted interventions such as public health advisories or emergency response planning during pollution peaks.
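Quantile estimates are commonly scored with the pinball (quantile) loss, which is also what LightGBM's `quantile` objective minimises. The sketch below is a plain-Python version for illustration. The text above does not state which objective or metric the authors used, so treat this as an assumption, and the function name is hypothetical.

```python
def pinball_loss(observed, predicted, quantile):
    """Mean pinball loss of `predicted` as an estimate of the given quantile."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must lie strictly between 0 and 1")
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    total = 0.0
    for y, q_hat in zip(observed, predicted):
        error = y - q_hat
        # Under-prediction is weighted by `quantile`,
        # over-prediction by (1 - quantile).
        total += quantile * error if error >= 0 else (quantile - 1.0) * error
    return total / len(observed)


# For the 0.95 quantile, predicting too low costs 19 times more than
# predicting too high by the same amount.
too_low = pinball_loss([10.0], [8.0], 0.95)
too_high = pinball_loss([10.0], [12.0], 0.95)
```

The asymmetry is what pushes a 0.95-quantile model towards the upper end of the concentration distribution and a 0.05-quantile model towards the lower end.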


### Feature Vector Dataset Overview

The dataset developed in this study includes a comprehensive set of feature vectors used to estimate ambient air pollution concentrations across the globe. Feature vectors represent environmental conditions and phenomena known to influence air pollutant concentrations, including meteorological variables (e.g., wind speed and temperature), emissions from various human activities (e.g., traffic intensity, industrial processes) and remotely sensed air pollution measurements. Each feature was selected based on established scientific evidence linking these variables to the formation, dispersion, and accumulation of air pollutants. Incorporating such a diverse and detailed set of features enables the machine learning model to robustly capture complex spatial and temporal variations in air quality, ultimately improving the accuracy and applicability of pollution estimates.


## Data Description

| **Data description**     |                                       |
|--------------------------|---------------------------------------|
| **Data type**            | Point Estimates                       |
| **Projection**           | EPSG:4326 WGS 84 (latitude/longitude) |
| **Horizontal coverage**  | Global                                |
| **Horizontal resolution**| ~0.25° (approx. 28 km at equator)     |
| **Vertical coverage**    | Surface only                          |
| **Vertical resolution**  | Single layer                          |
| **Temporal coverage**    | 2022                                  |
| **Temporal resolution**  | Hourly                                |
| **File format**          | NetCDF                                |
| **Update frequency**     | Static                                |
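The physical size of a 0.25° cell depends on latitude: the north-south extent is roughly constant, while the east-west extent shrinks with the cosine of the latitude. The sketch below works through that arithmetic on a spherical Earth. The radius constant, the function names, and the assumption that cells tile the globe without overlap are illustrative, and the dataset's actual grid layout may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation (assumption)
RESOLUTION_DEG = 0.25


def cell_size_km(latitude_deg, resolution_deg=RESOLUTION_DEG):
    """Return the (east-west, north-south) extent in km of a cell centred at a latitude."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = resolution_deg * km_per_degree
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south


def global_grid_shape(resolution_deg=RESOLUTION_DEG):
    """Number of (latitude, longitude) cells if the cells tile the globe exactly."""
    return round(180.0 / resolution_deg), round(360.0 / resolution_deg)


equator = cell_size_km(0.0)     # about 27.8 km in both directions
mid_lat = cell_size_km(60.0)    # east-west extent halves at 60 degrees
shape = global_grid_shape()     # 720 latitude rows by 1440 longitude columns
hours_in_2022 = 365 * 24        # 8760 hourly time steps in a non-leap year
```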

### Coordinate Variables

| **Name**     | **Units**        | **Description**                             |
|--------------|------------------|---------------------------------------------|
| Timestamp    | N/A              | Time coordinate (hourly resolution)         |
| Longitude    | Degrees East     | Longitude of centroid (EPSG:4326)           |
| Latitude     | Degrees North    | Latitude of centroid (EPSG:4326)            |

Each NetCDF file is indexed by (Timestamp, Latitude, Longitude).

### Data Variables

#### Output Variables

| **Name**                       | **Units**   | **Description**                                                                                                     |
|--------------------------------|-------------|---------------------------------------------------------------------------------------------------------------------|
| Global Model Grid ID           | –           | Unique identifier for each grid cell in the global model; synthetic monitoring stations sit at grid-cell centroids. |
| no2 Prediction 0.05 Quantile   | µg/m^3^     | Estimated 5th percentile of modelled NO~2~ concentration.                                                           |
| no2 Prediction 0.5 Quantile    | µg/m^3^     | Estimated 50th percentile (median) of modelled NO~2~ concentration.                                                 |
| no2 Prediction 0.95 Quantile   | µg/m^3^     | Estimated 95th percentile of modelled NO~2~ concentration.                                                          |
| no2 Prediction Mean            | µg/m^3^     | Estimated mean of modelled NO~2~ concentration.                                                                     |
| o3 Prediction 0.05 Quantile    | µg/m^3^     | Estimated 5th percentile of modelled O~3~ concentration.                                                            |
| o3 Prediction 0.5 Quantile     | µg/m^3^     | Estimated 50th percentile (median) of modelled O~3~ concentration.                                                  |
| o3 Prediction 0.95 Quantile    | µg/m^3^     | Estimated 95th percentile of modelled O~3~ concentration.                                                           |
| o3 Prediction Mean             | µg/m^3^     | Estimated mean of modelled O~3~ concentration.                                                                      |
| pm10 Prediction 0.05 Quantile  | µg/m^3^     | Estimated 5th percentile of modelled PM~10~ concentration.                                                          |
| pm10 Prediction 0.5 Quantile   | µg/m^3^     | Estimated 50th percentile (median) of modelled PM~10~ concentration.                                                |
| pm10 Prediction 0.95 Quantile  | µg/m^3^     | Estimated 95th percentile of modelled PM~10~ concentration.                                                         |
| pm10 Prediction Mean           | µg/m^3^     | Estimated mean of modelled PM~10~ concentration.                                                                    |
| pm2.5 Prediction 0.05 Quantile | µg/m^3^     | Estimated 5th percentile of modelled PM~2.5~ concentration.                                                         |
| pm2.5 Prediction 0.5 Quantile  | µg/m^3^     | Estimated 50th percentile (median) of modelled PM~2.5~ concentration.                                               |
| pm2.5 Prediction 0.95 Quantile | µg/m^3^     | Estimated 95th percentile of modelled PM~2.5~ concentration.                                                        |
| pm2.5 Prediction Mean          | µg/m^3^     | Estimated mean of modelled PM~2.5~ concentration.                                                                   |
| so2 Prediction 0.05 Quantile   | µg/m^3^     | Estimated 5th percentile of modelled SO~2~ concentration.                                                           |
| so2 Prediction 0.5 Quantile    | µg/m^3^     | Estimated 50th percentile (median) of modelled SO~2~ concentration.                                                 |
| so2 Prediction 0.95 Quantile   | µg/m^3^     | Estimated 95th percentile of modelled SO~2~ concentration.                                                          |
| so2 Prediction Mean            | µg/m^3^     | Estimated mean of modelled SO~2~ concentration.                                                                     |
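The output variable names in the table follow a regular pattern: pollutant, the word "Prediction", then the statistic. The sketch below builds those names and uses xarray's nearest-neighbour selection to pull the hourly series for one grid cell. It assumes the coordinates in the files are named exactly Timestamp, Latitude and Longitude as in the coordinate table. The helper names and the file name in the usage comments are hypothetical.

```python
POLLUTANTS = ("no2", "o3", "pm10", "pm2.5", "so2")
STATISTICS = ("0.05 Quantile", "0.5 Quantile", "0.95 Quantile", "Mean")


def prediction_variable(pollutant, statistic):
    """Build a variable name such as 'no2 Prediction 0.95 Quantile'."""
    if pollutant not in POLLUTANTS:
        raise ValueError(f"unknown pollutant: {pollutant!r}")
    if statistic not in STATISTICS:
        raise ValueError(f"unknown statistic: {statistic!r}")
    return f"{pollutant} Prediction {statistic}"


def nearest_cell_series(dataset, pollutant, statistic, latitude, longitude):
    """Return the hourly series for the grid cell nearest to a location.

    `dataset` is expected to be an xarray.Dataset indexed by
    (Timestamp, Latitude, Longitude).
    """
    variable = dataset[prediction_variable(pollutant, statistic)]
    return variable.sel(Latitude=latitude, Longitude=longitude, method="nearest")


all_names = [prediction_variable(p, s) for p in POLLUTANTS for s in STATISTICS]

# Example usage (hypothetical file name):
# import xarray as xr
# ds = xr.open_dataset("mlhappg_2022.nc")
# series = nearest_cell_series(ds, "no2", "Mean", 51.45, -2.59)
```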

#### Input Variables

| **Name**                                       | **Units**     | **Description**                                                                                                                                                      |
|------------------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 100m U Component of Wind                       | m/s           | East–west wind component at 100 m above ground level.                                                                                                                |
| 100m V Component of Wind                       | m/s           | North–south wind component at 100 m above ground level.                                                                                                              |
| 10m U Component of Wind                        | m/s           | East–west wind component at 10 m above ground level.                                                                                                                 |
| 10m V Component of Wind                        | m/s           | North–south wind component at 10 m above ground level.                                                                                                               |
| 2m Dewpoint Temperature                        | K             | Temperature at which air becomes saturated, measured at 2 m above ground level.                                                                                      |
| 2m Temperature                                 | K             | Air temperature at 2 m above ground level.                                                                                                                           |
| Boundary Layer Height                          | m             | Height of the atmospheric boundary layer above ground level.                                                                                                         |
| Downward UV Radiation at Surface               | W/m²          | Downward ultraviolet radiant flux received at Earth’s surface.                                                                                                       |
| Instantaneous 10m Wind Gust                    | m/s           | Peak wind gust speed observed at 10 m above ground level over a short time interval.                                                                                 |
| Surface Pressure                               | hPa           | Atmospheric pressure at ground level.                                                                                                                                |
| Total Column Rain Water                        | kg/m²         | Vertically integrated amount of rain water in a column of air above the surface.                                                                                     |
| S5P NO₂                                        | mol/m²        | Tropospheric column amount of nitrogen dioxide (NO₂) from Sentinel‑5P.                                                                                               |
| S5P Absorbing Aerosol Index                    | -             | Absorbing Aerosol Index (AAI), indicating the presence of UV-absorbing aerosols in the atmosphere.                                                                   |
| S5P CO                                         | mol/m²        | Total column amount of carbon monoxide (CO) retrieved from Sentinel‑5P.                                                                                              |
| S5P O₃                                         | mol/m²        | Total column ozone (O₃) retrieved by Sentinel‑5P.                                                                                                                    |
| Anthropogenic Emissions Sum Sectors co         | kilotonne     | Total anthropogenic CO emissions from all sectors.                                                                                                                   |
| Anthropogenic Emissions Sum Sectors nox        | kilotonne     | Total anthropogenic NOₓ emissions from all sectors.                                                                                                                  |
| Anthropogenic Emissions Sum Sectors nmvocs     | kilotonne     | Total anthropogenic non-methane volatile organic compound emissions from all sectors.                                                                                |
| Anthropogenic Emissions Sum Sectors other-vocs | kilotonne     | Total anthropogenic emissions of other volatile organic compounds from all sectors.                                                                                  |
| Anthropogenic Emissions Sum Sectors so2        | kilotonne     | Total anthropogenic SO₂ emissions from all sectors.                                                                                                                  |
| Biogenic Emissions Biogenic CO                 | kilotonne     | Total biogenic CO emissions.                                                                                                                                         |
| Timestamp Local                                | N/A           | Local timestamp (adjusted using UTC offset).                                                                                                                         |
| UTC Offset                                     | hours         | Offset from UTC time (in hours).                                                                                                                                     |
| Month Number                                   | -             | Integer representing the month, from 1 (January) to 12 (December).                                                                                                   |
| Week Number                                    | -             | Integer denoting the ISO week number (1–53).                                                                                                                         |
| Day of Week Number                             | -             | Integer representing the weekday, from 0 (Monday) to 6 (Sunday).                                                                                                     |
| Hour Number                                    | -             | Hour of the day on a 24-hour clock, from 0 (midnight) to 23 (11 pm).                                                                                                 |
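Two groups of input variables lend themselves to a short worked example: the U and V wind components combine into a scalar wind speed, and the temporal features follow from a timestamp and its UTC offset. The sketch below uses only the standard library. It assumes the month, week, weekday and hour features are derived from the local timestamp, which the table does not state explicitly, and the function names are hypothetical.

```python
import math
from datetime import datetime, timedelta


def wind_speed(u, v):
    """Scalar wind speed in m/s from east-west (u) and north-south (v) components."""
    return math.hypot(u, v)


def temporal_features(timestamp_utc, utc_offset_hours):
    """Derive the temporal input variables from a UTC timestamp and a UTC offset."""
    local = timestamp_utc + timedelta(hours=utc_offset_hours)
    return {
        "Timestamp Local": local,
        "UTC Offset": utc_offset_hours,
        "Month Number": local.month,              # 1 (January) to 12 (December)
        "Week Number": local.isocalendar()[1],    # ISO week number, 1 to 53
        "Day of Week Number": local.weekday(),    # 0 (Monday) to 6 (Sunday)
        "Hour Number": local.hour,                # 0 to 23
    }


speed = wind_speed(3.0, 4.0)
features = temporal_features(datetime(2022, 1, 1, 23, 0), 2)
```

Note that ISO weeks can belong to the previous calendar year: the first two days of January 2022 fall in ISO week 52 of 2021.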


#### Training Data Variables

The NetCDF files in this subdirectory include the following variables in addition to those listed above. For convenience, the Global Model Grid ID is also provided as an attribute.

| **Name**                                   | **Units**     | **Description**                                                                                                                                                      |
|--------------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| no2 Measurement                            | µg/m^3^       | Measured nitrogen dioxide (NO~2~) concentration.                                                                                                                     |
| o3 Measurement                             | µg/m^3^       | Measured ozone (O~3~) concentration.                                                                                                                                 |
| pm10 Measurement                           | µg/m^3^       | Measured particulate matter ≤10 µm (PM~10~) concentration.                                                                                                           |
| pm2.5 Measurement                          | µg/m^3^       | Measured particulate matter ≤2.5 µm (PM~2.5~) concentration.                                                                                                         |
| so2 Measurement                            | µg/m^3^       | Measured sulphur dioxide (SO~2~) concentration.                                                                                                                      |
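With measurements stored next to the predictions, one simple check is the empirical coverage of the interval between the 0.05 and 0.95 quantile predictions, which should be close to 90% if the quantile models are well calibrated. The sketch below is one way to compute it. It assumes hours without a measurement appear as NaN or `None`, and the function name is hypothetical.

```python
import math


def empirical_coverage(measurements, lower, upper):
    """Fraction of valid measurements that fall inside [lower, upper]."""
    inside = 0
    valid = 0
    for y, lo, hi in zip(measurements, lower, upper):
        if y is None or math.isnan(y):
            continue  # skip hours with no measurement
        valid += 1
        if lo <= y <= hi:
            inside += 1
    if valid == 0:
        raise ValueError("no valid measurements to evaluate")
    return inside / valid


# Toy values in µg/m³: one of the three valid measurements falls outside.
coverage = empirical_coverage(
    measurements=[10.0, 20.0, float("nan"), 50.0],
    lower=[5.0, 25.0, 0.0, 40.0],
    upper=[15.0, 30.0, 10.0, 60.0],
)
```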
