Data Ingestion
alsdb.ALSDatabase manages the entire ingestion workflow: reading LAZ files via PDAL, writing point data to the TileDB array, tracking ingestion status in a manifest, and consolidating fragments for query performance.
Overview of the ingestion workflow
1. Read the LAZ header: year, CRS, bounding box, and tile name are extracted from the file using PDAL metadata and the PNOATileName/GenericTileName parser.
2. Check the manifest: if the file has already been ingested (same filename, status "ingested"), the tile is skipped. Pass overwrite=True to force re-ingestion.
3. Read points in chunks: ALSTile.iter_chunks() yields ~1 M points at a time as structured NumPy arrays. An optional classification filter can restrict ingestion to specific LAS class codes (e.g. ground + vegetation only).
4. Write to TileDB: ALSDatabase.write() appends points to the sparse array. Each chunk creates or extends a TileDB fragment. Coordinates are clipped to the array domain bounds.
5. Update the manifest: on success the manifest entry is marked "ingested" with point count, CRS, and bbox recorded.
6. Consolidate (optional): after every consolidate_every tiles, ingest_many() triggers ALSDatabase.consolidate() to merge fragments.
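The steps above can be sketched as a small, self-contained simulation. This is not alsdb's implementation: `iter_chunks`, `ingest_tile`, the dict-based manifest, and the plain list standing in for the TileDB array are hypothetical stand-ins, chosen only to illustrate the control flow (manifest check, chunked read, class filter, append, manifest update).

```python
from itertools import islice


def iter_chunks(points, chunk_size):
    """Yield successive lists of at most chunk_size points."""
    it = iter(points)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def ingest_tile(manifest, array, filename, points, classes=None,
                overwrite=False, chunk_size=1_000_000):
    """Toy ingestion. Returns True if the tile was written, False if skipped."""
    entry = manifest.get(filename)
    if entry and entry["status"] == "ingested" and not overwrite:
        return False  # already ingested: skip the tile

    n_points = 0
    for chunk in iter_chunks(points, chunk_size):
        if classes is not None:
            # Optional classification filter, applied per chunk.
            chunk = [p for p in chunk if p["classification"] in classes]
        array.extend(chunk)  # stands in for appending to the sparse array
        n_points += len(chunk)

    # Only mark the tile as ingested once every chunk has been written.
    # (Handling of previously written points on overwrite is omitted here.)
    manifest[filename] = {"status": "ingested", "n_points": n_points}
    return True


manifest, array = {}, []
tile = [{"x": i, "classification": c} for i, c in enumerate([2, 5, 7, 2, 1])]

first = ingest_tile(manifest, array, "tile.laz", tile, classes={2, 5}, chunk_size=2)
second = ingest_tile(manifest, array, "tile.laz", tile, classes={2, 5}, chunk_size=2)
print(first, second, manifest["tile.laz"]["n_points"])  # True False 3
```

The second call is skipped because the manifest already lists the file as "ingested", which is the behaviour step 2 describes.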
Example usage
Single tile
import alsdb
from alsdb import ALSDatabase
alsdb.setup_logging()
db = ALSDatabase(storage_type="local", uri="my_array")
db.ingest("path/to/tile.laz")
Batch ingest (parallel, with consolidation)
from pathlib import Path
paths = sorted(Path("/data/als/").glob("*.laz"))
db.ingest_many(
    paths,
    max_workers=8,         # parallel LAZ readers / TileDB writers
    consolidate_every=50,  # consolidate fragments every 50 tiles
)
Filter to specific LAS classes
Pass a list of LAS class codes to ingest() or ingest_many() to restrict which points are stored. This reduces array size when only ground (class 2) and vegetation (classes 3–5: low, medium, and high vegetation) are needed:
db.ingest("tile.laz", classes=[2, 3, 4, 5])
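As an illustration of what such a filter amounts to on a single chunk, the sketch below applies a boolean mask to a structured NumPy array. The field names and sample values are assumptions for the example, not alsdb's actual schema, and alsdb may implement the filter differently.

```python
import numpy as np

# Hypothetical chunk: field names and values are made up for illustration.
chunk = np.array(
    [
        (0.5, 1.0, 310.2, 2),  # ground
        (0.7, 1.2, 318.9, 5),  # high vegetation
        (0.9, 1.1, 325.0, 6),  # building
        (1.1, 1.4, 311.0, 7),  # low point (noise)
    ],
    dtype=[("X", "f8"), ("Y", "f8"), ("Z", "f8"), ("Classification", "u1")],
)

keep_classes = [2, 3, 4, 5]

# Boolean mask: True where the point's class code is in keep_classes.
mask = np.isin(chunk["Classification"], keep_classes)
filtered = chunk[mask]

print(len(filtered), filtered["Classification"].tolist())  # 2 [2, 5]
```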
Re-ingest from scratch
db.ingest_many(paths, overwrite=True)
Inspecting the manifest
The manifest records the ingestion status of every file. It is stored as JSON alongside the TileDB array.
for entry in db.list_ingested():
    print(entry["filename"], entry["year"], entry["n_points"], entry["status"])
# Check stored CRS
print(db.stored_crs()) # e.g. "EPSG:25830"
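Manifest entries lend themselves to quick summaries. The sketch below works on a hand-written list of dicts that mimics the fields printed above; the sample values, the "failed" status, and the helper names `points_per_year` and `not_ingested` are hypothetical. If list_ingested() only returns entries with status "ingested", the status checks are simply no-ops.

```python
from collections import defaultdict

# Hypothetical entries with the same keys as the manifest fields shown above.
entries = [
    {"filename": "a.laz", "year": 2016, "n_points": 1_200_000, "status": "ingested"},
    {"filename": "b.laz", "year": 2016, "n_points": 800_000, "status": "ingested"},
    {"filename": "c.laz", "year": 2021, "n_points": 2_500_000, "status": "ingested"},
    {"filename": "d.laz", "year": 2021, "n_points": 0, "status": "failed"},
]


def points_per_year(entries):
    """Sum n_points per acquisition year over successfully ingested tiles."""
    totals = defaultdict(int)
    for entry in entries:
        if entry["status"] == "ingested":
            totals[entry["year"]] += entry["n_points"]
    return dict(totals)


def not_ingested(entries):
    """Return filenames whose status is anything other than 'ingested'."""
    return [e["filename"] for e in entries if e["status"] != "ingested"]


print(points_per_year(entries))  # {2016: 2000000, 2021: 2500000}
print(not_ingested(entries))     # ['d.laz']
```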
Fragment consolidation
Every ingested tile creates at least one new TileDB fragment. As the fragment count grows, so does the cost of opening and scanning fragments at query time. Consolidation merges fragments into larger ones:
db.consolidate()
By default, ingest_many() consolidates automatically every 50 tiles. For very large campaigns, you may prefer a larger interval followed by one final consolidation once all tiles are ingested:
db.ingest_many(paths, max_workers=8, consolidate_every=500)
db.consolidate() # final consolidation
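To see why a final consolidation helps, the arithmetic can be sketched as follows. The helper `consolidation_plan` is hypothetical, and it assumes that a periodic run is triggered after every consolidate_every-th tile and that no extra run happens at the end of the batch.

```python
def consolidation_plan(n_tiles, consolidate_every):
    """Return (periodic_runs, leftover_tiles) for a batch of n_tiles.

    leftover_tiles is the number of tiles ingested after the last periodic
    run, whose fragments stay unmerged until another consolidation.
    """
    if consolidate_every <= 0:
        raise ValueError("consolidate_every must be positive")
    return divmod(n_tiles, consolidate_every)


for every in (50, 500):
    runs, leftover = consolidation_plan(1_234, every)
    print(f"consolidate_every={every}: {runs} periodic runs, "
          f"{leftover} tiles left unconsolidated")
# consolidate_every=50: 24 periodic runs, 34 tiles left unconsolidated
# consolidate_every=500: 2 periodic runs, 234 tiles left unconsolidated
```

Under these assumptions, a larger interval means fewer periodic runs but more leftover fragments at the end, which the final db.consolidate() call then merges.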
Performance considerations
- Use max_workers=8 or more for large campaigns. Each worker reads a LAZ file and writes to TileDB independently.
- Keep consolidate_every between 50 and 200. Consolidating too often adds overhead; consolidating too rarely degrades query performance.
- On S3, set multipart_part_size in the TileDB S3 config to at least 50 MB for efficient multipart uploads.
- The TileDB domain bounds are set once at array creation. If you later ingest data outside the initial domain, points will be silently clipped. Choose domain bounds with enough margin to cover every tile you expect to ingest (see Storage Architecture).
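Because clipping is silent, it can be worth checking tile bounding boxes against the array domain before a large batch. The sketch below is one way to do that. The helper `bbox_within_domain`, the (xmin, ymin, xmax, ymax) tuple layout, and the coordinate values are assumptions for illustration; how to obtain the domain and tile bounding boxes from alsdb is not shown here.

```python
def bbox_within_domain(bbox, domain):
    """Return True if bbox lies entirely inside domain.

    Both arguments are (xmin, ymin, xmax, ymax) tuples in the same CRS.
    """
    bxmin, bymin, bxmax, bymax = bbox
    dxmin, dymin, dxmax, dymax = domain
    return dxmin <= bxmin and dymin <= bymin and bxmax <= dxmax and bymax <= dymax


# Made-up domain and tile bounding boxes in projected coordinates (metres).
domain = (100_000.0, 4_000_000.0, 800_000.0, 4_900_000.0)
tiles = {
    "inside.laz": (500_000.0, 4_400_000.0, 502_000.0, 4_402_000.0),
    "straddling.laz": (799_000.0, 4_400_000.0, 801_000.0, 4_402_000.0),
}

# Tiles that extend beyond the domain and would therefore be affected by clipping.
at_risk = [name for name, bbox in tiles.items()
           if not bbox_within_domain(bbox, domain)]
print(at_risk)  # ['straddling.laz']
```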