Storage Architecture#
alsDB uses two complementary storage layers. Understanding how they fit together is the key to understanding the rest of the package.
Overview#
LAZ file
│
▼
PDAL (readers.las)
│ year / bbox / CRS ← LAZ header
▼
ALSTile.iter_chunks() ← optional classification filter
│ X, Y, attrs numpy arrays (1 M pts/chunk)
▼
ALSDatabase.write()
│
▼
TileDB sparse array (local / s3://)
│ dimensions: X (float64) × Y (float64) × Year (int16)
│ allows_duplicates=True (multiple returns per XY)
│
├── ALSProvider.query_bbox() → pandas DataFrame / xarray Dataset
│
└── processing.*
│ PDAL hag_delaunay + scipy binned_statistic_2d
▼
ALSZarrStore (Zarr v3, local / s3://)
├── 1m/ chm, dtm, dsm (T × ny × nx) float32
└── 10m/ gap, lai, biomass,
h50…density (T × ny × nx) float32
Layer 1: TileDB sparse array (point clouds)#
Each ingested LAZ file writes its points into a single shared TileDB sparse array. The array has three dimensions:
Dimension |
Type |
Role |
|---|---|---|
|
float64 |
UTM easting (m) |
|
float64 |
UTM northing (m) |
|
int16 |
Survey acquisition year |
All 18 standard LAS point attributes (Z, Intensity, ReturnNumber, Classification, R, G, B, …) are stored as TileDB attributes with ZSTD-9 compression. X and Y use ByteShuffle + ZSTD. Year uses DoubleDelta + ZSTD.
Why TileDB?
Sparse by design: point density varies enormously; TileDB stores only occupied cells with no wasted space.
Fragment-based writes: each ingested tile appends a new fragment. No locking, no global index rebuild. Multiple workers can write simultaneously.
Multi-temporal in one array:surveys from 2019, 2021, and 2023 coexist in the same array.
query_bbox(..., year=2021)reads only the 2021 fragments.S3-native: the array URI can be
s3://bucket/als_array; TileDB handles multipart I/O transparently.
Fragment consolidation
Each ingest call creates a new fragment. After many ingestions, fragment count grows and query overhead increases. ALSDatabase.consolidate() merges fragments to restore performance. ingest_many() calls consolidation automatically every consolidate_every tiles (default 50).
Domain bounds
The spatial domain (X and Y extents of the TileDB array) is chosen automatically from the CRS of the first ingested tile. For UTM Zone 30N (EPSG:25830) the domain spans approximately [100 000, 1 000 000] × [0, 10 000 000]. For unknown CRS a global ±10⁷ fallback is used. The domain only needs to contain all future ingested data; it is set at array creation time and cannot be changed.
Layer 2: Zarr v3 store (gridded products)#
All processing outputs (CHM, DTM, DSM, gap fraction, LAI, structural metrics, biomass) are written to an ALSZarrStore (i.e., a Zarr v3 hierarchy on disk or S3):
forest.zarr/
├── 1m/
│ ├── chm (T, ny, nx) float32 ← Canopy Height Model
│ ├── dtm (T, ny, nx) float32 ← Digital Terrain Model
│ └── dsm (T, ny, nx) float32 ← Digital Surface Model
└── 10m/
├── gap (T, ny, nx) float32 ← gap fraction
├── lai (T, ny, nx) float32 ← effective LAI
├── biomass (T, ny, nx) float32 ← aboveground biomass (Mg/ha)
├── h50 (T, ny, nx) float32 ← height percentile 50
├── h75 (T, ny, nx) float32 ← height percentile 75
├── h95 (T, ny, nx) float32 ← height percentile 95
├── hmean (T, ny, nx) float32 ← mean vegetation height
├── cc (T, ny, nx) float32 ← canopy cover fraction
└── density (T, ny, nx) float32 ← point density (pts/m²)
The T (time) axis stores survey years as integers (e.g. 2019, 2021, 2023). Each variable has an associated x and y coordinate array. Opening the store with to_dataset(resolution=1.0) returns an xarray.Dataset with CRS attached via rioxarray.
Tiling strategy#
For areas larger than a single tile, processing is split into 500 m sub-tiles (configurable via tile_size). Each sub-tile has two bounding boxes:
query_bboxInflated by
tile_buffer(default 50 m) on all sides. Used for the TileDB query and for HAG computation viafilters.hag_delaunay. The larger neighbourhood ensures the Delaunay TIN is well-constrained near tile edges.crop_bboxThe original, non-overlapping tile extent. Only pixels inside this bbox are written to the Zarr store. This prevents double-writing at tile boundaries.
Note
The 50 m buffer is important for CHM accuracy near tile edges. Without it, the ground TIN is under-constrained at the boundary and HAG values for vegetation near the edge can be wrong. DTM and DSM do not require a buffer because they do not depend on the HAG computation.
The combination of buffered query + non-overlapping write is the key design decision that allows correct, parallel, direct-to-Zarr tiled processing.