Processing Pipeline#
All processing functions share a common pattern: query a TileDB bounding box,
optionally reject ground point outliers, compute the product tile by tile, and
write results directly to an ALSZarrStore.
Important
All processing functions are idempotent by default (overwrite=False).
If data already exists for the requested variable, resolution, and year the
function returns immediately without recomputing. Pass overwrite=True to
force recomputation.
Common preprocessing#
Before any height-above-ground computation, ground points (Classification == 2)
are screened for outliers using two PDAL filters applied to the ground-point
subset only (vegetation classification is never changed):
filters.elm(Extended Local Minimum) — detects below-ground misclassifications (multipath returns, water-surface reflections).filters.outlier(statistical mode) — detects above-ground spikes in the ground class (misclassified buildings or vegetation returns).
Height-above-ground is then computed with filters.hag_delaunay (Delaunay TIN,
the same method used by the DTM pipeline) with an automatic filters.hag_nn
fallback for tiles that contain fewer than 3 ground points.
Store setup#
Create a single ALSZarrStore to hold all products for a study area:
from alsdb import ALSProvider
from alsdb.storage import ALSZarrStore
reader = ALSProvider(storage_type="local", uri="my_array")
store = ALSZarrStore("output/forest.zarr")
All processing functions accept provider, store, resolution, bbox,
and year as their first arguments. If year or bbox does not overlap any
stored data a WARNING is logged and the function returns immediately.
Canopy Height Model / DTM / DSM#
from alsdb.processing.chm import compute_chm, compute_dtm, compute_dsm, compute_all
# CHM at 1 m resolution (first returns only, max HAG per cell)
compute_chm(
provider=reader,
store=store,
resolution=1.0,
bbox=(308_000, 4_688_000, 310_000, 4_690_000),
year=2021,
)
# Large area: tiled (500 m sub-tiles, 50 m buffer, 4 workers)
compute_chm(
provider=reader,
store=store,
resolution=1.0,
year=2021,
tile_size=500.0,
tile_buffer=50.0, # buffer for accurate HAG at tile edges
n_workers=4,
)
# DTM only — choose interpolation method
compute_dtm(
provider=reader,
store=store,
resolution=1.0,
year=2021,
dtm_method="tin", # "tin" (default), "idw", or "min"
)
# Compute DTM + DSM + CHM in one tiled pass (recommended for large areas)
compute_all(
provider=reader,
store=store,
resolution=1.0,
year=2021,
tile_size=500.0,
tile_buffer=50.0,
n_workers=4,
dtm_method="tin",
)
Product descriptions:
DTM: Delaunay TIN interpolation of ground points (
Classification == 2) by default (dtm_method="tin"). IDW and minimum-Z binning are available as alternatives viadtm_method="idw"anddtm_method="min".DSM: maximum
Zof first returns per cell (no HAG normalisation).CHM: maximum
HeightAboveGroundof vegetation first returns per cell (Classification∈ {3, 4, 5},ReturnNumber == 1).
Sub-tiles with no returns remain NaN in the store.
Gap fraction and effective LAI#
from alsdb.processing.gap import LAI_K_PRESETS, compute_gap_fraction
# Gap fraction only
compute_gap_fraction(
provider=reader,
store=store,
resolution=10.0,
year=2021,
)
# Gap fraction + effective LAI with clumping correction
compute_gap_fraction(
provider=reader,
store=store,
resolution=10.0,
year=2021,
lai=True,
k=LAI_K_PRESETS["spherical"], # 0.5 — random leaf angles (default)
clumping_index=0.7, # Ω < 1 for clumped canopies (conifers)
min_density=0.5, # mask cells below 0.5 first returns m⁻²
tile_size=500.0,
n_workers=4,
)
Preset extinction coefficients for common forest types:
print(LAI_K_PRESETS)
# {"spherical": 0.5, "planophile": 0.8, "erectophile": 0.35, "conifer": 0.45}
Gap fraction estimator (MacArthur–Wilson):
where ground = Classification == 2, vegetation = Classification ∈ {3, 4, 5},
and only first returns (ReturnNumber == 1) are counted. Unclassified, building,
and noise returns do not contribute to the denominator. Cells with no classified
first returns are NaN.
Effective LAI (Beer–Lambert with clumping correction):
where k is the extinction coefficient and Ω is the element clumping index (default 1.0 = no correction). Output is capped at 15 m² m⁻² to suppress noise in near-zero gap-fraction cells.
LiDAR structural metrics#
from alsdb.processing.biomass import compute_metrics
compute_metrics(
provider=reader,
store=store,
resolution=10.0,
year=2021,
min_density=1.0, # cells below 1 pt m⁻² receive NaN for all metrics
)
Writes 16 variables to the store:
Variable |
Description |
|---|---|
|
50th percentile of vegetation HeightAboveGround (m) |
|
75th percentile (m) |
|
95th percentile (m) — commonly used as a proxy for top-of-canopy height |
|
Maximum HeightAboveGround of vegetation returns per cell (m) |
|
Mean vegetation HeightAboveGround (m) |
|
Canopy cover fraction (vegetation first returns above |
|
Total point density (all returns, pts m⁻²) |
|
Foliage Height Diversity — Shannon entropy of the vertical HAG distribution |
|
Vegetation Complexity Index — FHD normalised to [0, 1] |
|
Canopy Relief Ratio = (hmean − hmin) / (hmax − hmin) |
|
Fraction of vegetation returns with 0 < HAG ≤ 2 m |
|
Fraction with 2 < HAG ≤ 5 m |
|
Fraction with 5 < HAG ≤ 10 m |
|
Fraction with 10 < HAG ≤ 20 m |
|
Fraction with 20 < HAG ≤ 40 m |
|
Fraction with HAG > 40 m |
The min_density guard sets all 16 metrics to NaN for cells below the
threshold, suppressing unreliable estimates in data-sparse areas.
Aboveground biomass#
from alsdb.processing.biomass import compute_biomass
# Default Næsset (2002) allometric model — calibrate before scientific use
compute_biomass(
provider=reader,
store=store,
resolution=10.0,
year=2021,
tile_size=500.0,
n_workers=4,
)
Calibrating the Næsset model:
Use alsdb.processing.biomass.calibrate_naesset with field inventory plots
co-located with ALS acquisitions (≥ 20 plots required; ≥ 50 recommended):
from alsdb.processing.biomass import calibrate_naesset, naesset_model
(a, b, c), pcov = calibrate_naesset(
h95_plots, cc_plots, agb_plots, return_cov=True
)
# pcov is the 3×3 parameter covariance matrix
import numpy as np
print(f"a = {a:.3f} ± {np.sqrt(pcov[0,0]):.3f}")
compute_biomass(
provider=reader, store=store, resolution=10.0, year=2021,
model_fn=lambda m: naesset_model(m, a=a, b=b, c=c),
)
Default model (Næsset 2002 power law):
Default parameters: a=0.8, b=1.8, c=0.5. Output in Mg ha⁻¹.
Warning
The default parameters are generic placeholder values and must be calibrated against field inventory plots before using the results scientifically.
scikit-learn model via wrap_sklearn_model:
from sklearn.ensemble import RandomForestRegressor
from alsdb.processing.biomass import compute_biomass, wrap_sklearn_model
rf = RandomForestRegressor(n_estimators=200, random_state=42)
# X_train columns must match _METRIC_NAMES order (16 features by default)
rf.fit(X_train, y_agb)
compute_biomass(
provider=reader, store=store, resolution=10.0, year=2021,
model_fn=wrap_sklearn_model(rf),
)
# Explicit feature subset (must match the training column order)
compute_biomass(
provider=reader, store=store, resolution=10.0, year=2021,
model_fn=wrap_sklearn_model(rf, features=["h95", "cc", "density"]),
)
Multi-temporal change detection#
alsdb.processing.change.compute_change computes pixel-wise change between
two survey years and writes three derived products:
from alsdb.processing.change import compute_change
# CHM change 2017 → 2021; suppress sub-0.5 m noise
compute_change(
store, "chm",
year_from=2017, year_to=2021,
resolution=1.0,
min_delta=0.5, # absolute threshold — smaller changes flagged 0
pct_min_abs=0.5, # suppress % change where |year_from| < 0.5 m
)
ds = store.to_dataset(resolution=1.0)
flag = ds["chm_change_flag"].sel(time=2021) # +1 gain / −1 loss / 0 stable
delta = ds["chm_delta"].sel(time=2021) # absolute change (m)
Derived variables written per call:
Variable |
Description |
|---|---|
|
Absolute change ( |
|
Relative change (%), NaN where |
|
+1 gain, −1 loss, 0 no significant change (controlled by |
Large-footprint waveform simulation#
from alsdb.processing.waveform import simulate_waveform, simulate_batch
# Single footprint (25 m diameter, GEDI default)
result = simulate_waveform(
provider=reader,
center_x=308_500.0,
center_y=4_689_000.0,
footprint_radius=12.5,
year=2021,
beam_id="BEAM0000", # use empirical GEDI TX pulse kernel
)
print(result.rh[50]) # RH50 — height of median energy (m above ground)
print(result.rh[98]) # RH98 — equivalent to GEDI L2A rh98
print(result.cover) # canopy cover fraction
print(result.z_ground) # estimated ground elevation (m)
# Batch simulation — shots must be a DataFrame with center_x / center_y
import numpy as np, pandas as pd
xs, ys = np.meshgrid(
np.arange(308_100, 309_900, 60),
np.arange(4_688_500, 4_689_900, 60),
)
shots = pd.DataFrame({"center_x": xs.ravel(), "center_y": ys.ravel()})
results = simulate_batch(
provider=reader,
shots=shots,
year=2021,
n_workers=4,
footprint_radius=12.5,
output_path="shots_2021.parquet", # optional Parquet output
# rng=np.random.default_rng(42), # for reproducible noise (noise_std > 0)
)
Simulation algorithm:
Query all returns within the footprint circle (bounding box + circular mask).
Weight returns by the Gaussian beam profile (enabled by default).
Build a vertical histogram in 0.15 m bins.
Convolve with the empirical GEDI TX pulse kernel (per
beam_id) or a Gaussian fallback.Detect the ground return as the lowest prominent peak.
Compute cumulative relative height (RH) metrics from ground up: RH0 … RH100.
Reading results#
All products are accessible from the same store:
ds1m = store.to_dataset(resolution=1.0)
ds10m = store.to_dataset(resolution=10.0)
chm = ds1m["chm"].sel(time=2021)
agb = ds10m["biomass"].sel(time=2021)
h95 = ds10m["h95"].sel(time=2021)
crr = ds10m["crr"].sel(time=2021)
gap = ds10m["gap"].sel(time=2021)
pv_5_10 = ds10m["pv_5_10"].sel(time=2021)
# CRS is attached via rioxarray
print(ds1m.rio.crs) # e.g. "EPSG:25830"
print(store.resolutions()) # [1.0, 10.0]
print(store.variables(10.0)) # all 16 metrics + biomass + gap + lai …