API Reference
Core Classes
aef_loader.AEFIndex(source: DataSource = DataSource.GCS, gcp_project: str | None = None, cache_dir: Path | None = None)
Manages the AEF GeoParquet index for efficient spatial/temporal queries.
The index contains metadata about all AEF tiles including their bounding boxes, paths, and optionally pre-fetched COG header metadata.
Supports both GCS (Google Cloud Storage) and Source Cooperative (AWS S3) backends.
Example
GCS (requires GCP project for requester-pays):
index = AEFIndex(source=DataSource.GCS, gcp_project="my-project")
await index.download()
tiles = await index.query(bbox=(-122.5, 37.5, -122.0, 38.0), years=(2020, 2023))
Source Cooperative (public, no auth required):
index = AEFIndex(source=DataSource.SOURCE_COOP)
await index.download()
Initialize AEF index manager.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | DataSource | Data source (GCS or SOURCE_COOP) | GCS |
| gcp_project | str \| None | GCP project ID for requester-pays bucket access (GCS only) | None |
| cache_dir | Path \| None | Directory for caching the index (default: /tmp) | None |
download(force: bool = False, local_path: Path | None = None) -> Path
async
Download the AEF index from cloud storage using obstore.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| force | bool | Force re-download even if cached | False |
| local_path | Path \| None | Custom path for the index file | None |
Returns:
| Type | Description |
|---|---|
| Path | Path to the downloaded index file |
load(path: Path | None = None) -> gpd.GeoDataFrame
Load the index into memory as a GeoDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path \| None | Path to index file (uses cached path if not provided) | None |
Returns:
| Type | Description |
|---|---|
| GeoDataFrame | GeoDataFrame with AEF tile metadata |
query(bbox: BoundingBox | None = None, years: int | DateRange | None = None, limit: int | None = None) -> list[AEFTileInfo]
async
Query the index for tiles matching the given criteria.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bbox | BoundingBox \| None | Bounding box filter (minx, miny, maxx, maxy) in WGS84 | None |
| years | int \| DateRange \| None | Single year or (start_year, end_year) tuple | None |
| limit | int \| None | Maximum number of tiles to return | None |
Returns:
| Type | Description |
|---|---|
| list[AEFTileInfo] | List of AEFTileInfo objects matching the query |
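The filtering that query() describes can be pictured with a small, self-contained sketch. This is not the library's implementation: TileSketch, bboxes_intersect and query_sketch are hypothetical names, and treating the (start_year, end_year) range as inclusive at both ends is an assumption.

```python
from dataclasses import dataclass

BBox = tuple[float, float, float, float]  # (minx, miny, maxx, maxy) in WGS84


@dataclass
class TileSketch:
    """Hypothetical stand-in for AEFTileInfo with only the fields used here."""
    id: str
    year: int
    bbox: BBox


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query_sketch(tiles, bbox=None, years=None, limit=None):
    """Filter tiles by bounding box and year(s), then truncate to `limit`."""
    if isinstance(years, int):
        years = (years, years)  # a single year is treated as a one-year range
    matches = []
    for tile in tiles:
        if bbox is not None and not bboxes_intersect(tile.bbox, bbox):
            continue
        # Assumption: the (start_year, end_year) range is inclusive at both ends.
        if years is not None and not (years[0] <= tile.year <= years[1]):
            continue
        matches.append(tile)
    return matches if limit is None else matches[:limit]


tiles = [
    TileSketch("a", 2020, (-123.0, 37.0, -122.0, 38.0)),
    TileSketch("b", 2024, (-123.0, 37.0, -122.0, 38.0)),
    TileSketch("c", 2021, (10.0, 10.0, 11.0, 11.0)),
]
hits = query_sketch(tiles, bbox=(-122.5, 37.5, -122.0, 38.0), years=(2020, 2023))
```

Here only tile "a" passes both filters: "b" falls outside the year range and "c" lies outside the bounding box.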
aef_loader.VirtualTiffReader(gcp_project: str | None = None)
COG reader using virtual-tiff to create virtual zarr stores.
This provides efficient COG access by:
- Creating virtual zarr stores from COGs without data duplication
- Using async I/O via obstore for cloud access (GCS and S3)
- Organizing tiles by UTM zone for proper CRS handling
- Integrating directly with xarray for data loading
The primary method is open_tiles_by_zone(), which loads tiles organized
by their native UTM zone. To combine data across zones, use
reproject_datatree() from the utils module.
Example
from aef_loader import AEFIndex, VirtualTiffReader, DataSource
from aef_loader.utils import reproject_datatree
from odc.geo.geobox import GeoBox
# Query tiles
index = AEFIndex(source=DataSource.SOURCE_COOP)
await index.download()
index.load()
tiles = await index.query(bbox=(-122.5, 37.5, -121.5, 38.5), years=(2020, 2022))
# Load by UTM zone
async with VirtualTiffReader() as reader:
    tree = await reader.open_tiles_by_zone(tiles)
# Reproject to common CRS if needed
target = GeoBox.from_bbox(bbox=(-122.5, 37.5, -121.5, 38.5), crs="EPSG:4326", resolution=0.0001)
combined = reproject_datatree(tree, target)
result = combined.compute()
Initialize the virtual TIFF reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gcp_project | str \| None | GCP project ID for requester-pays buckets (GCS only) | None |
open_tiles_by_zone(tiles: list[AEFTileInfo], ifd: int = 0, chunks: int | dict | Literal['auto'] | None = 'auto') -> DataTree
async
Open tiles and organize them by UTM zone in a DataTree.
Each UTM zone becomes a group in the DataTree, containing a Dataset
with a single 'embeddings' variable with a band dimension (A00–A63).
Both nodata and _FillValue attrs are set to -128 on each
embeddings variable so that downstream tools (odc-geo xr_reproject,
xarray) correctly identify the AEF nodata sentinel.
This is the primary method for loading AEF data. It keeps each zone's
data in its native CRS for accurate spatial operations. To combine
data across zones, use reproject_datatree() from the utils module.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tiles | list[AEFTileInfo] | List of AEFTileInfo objects from AEFIndex.query() | required |
| ifd | int | Image File Directory index (0 for full resolution) | 0 |
| chunks | int \| dict \| Literal['auto'] \| None | Chunking argument passed to open_zarr (default: 'auto'). Passing None can help avoid dask task explosions. | 'auto' |
Returns:
| Type | Description |
|---|---|
| DataTree | DataTree with one group per UTM zone, structured as shown below |

├── 10N/ → Dataset with embeddings(time, band, y, x) in EPSG:32610
├── 10S/ → Dataset with embeddings(time, band, y, x) in EPSG:32710
├── 11N/ → Dataset with embeddings(time, band, y, x) in EPSG:32611
...
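The zone labels and EPSG codes in the structure above follow the standard WGS84 / UTM numbering (326xx for the northern hemisphere, 327xx for the southern). A minimal sketch of that mapping, using a hypothetical helper utm_zone_to_epsg that is not part of aef_loader:

```python
def utm_zone_to_epsg(zone: str) -> int:
    """Map a label like '10N' or '10S' to its WGS84 / UTM EPSG code."""
    number, hemisphere = int(zone[:-1]), zone[-1].upper()
    if not 1 <= number <= 60 or hemisphere not in ("N", "S"):
        raise ValueError(f"not a UTM zone label: {zone!r}")
    # EPSG 326xx covers the northern hemisphere, 327xx the southern.
    return (32600 if hemisphere == "N" else 32700) + number


codes = {zone: utm_zone_to_epsg(zone) for zone in ("10N", "10S", "11N")}
```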
Types
aef_loader.DataSource
Bases: Enum
Data source for AEF embeddings.
aef_loader.AEFTileInfo(id: str, path: str, year: int, bbox: BoundingBox, crs_epsg: int, utm_zone: str | None = None, utm_bounds: BoundingBox | None = None, source: DataSource | None = None)
dataclass
Information about an AEF tile/scene.
as_datetime: dt.datetime
property
Get datetime (January 1st of the year).
Utility Functions
aef_loader.dequantize_aef(data: np.ndarray | xr.DataArray | xr.Dataset, divisor: float = AEF_DEQUANT_DIVISOR, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray | xr.Dataset
Dequantize AEF embeddings from int8 to float32.
AEF embeddings are stored as quantized int8 values [-127, 127]. This function converts them back to float32 [-1, 1] for use in ML pipelines.
The formula is: ((value / 127.5) ** 2) * sign(value)
NoData values (-128) are automatically converted to NaN. For DataArray
inputs, both nodata and _FillValue attrs are set to NaN on
the output so that downstream tools (odc-geo, xarray) recognise the
new fill value. All other existing attrs are preserved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | ndarray \| DataArray \| Dataset | Quantized embedding data (int8) | required |
| divisor | float | Dequantization divisor (default: 127.5) | AEF_DEQUANT_DIVISOR |
| nodata_value | int | Value to treat as nodata (default: -128) | AEF_NODATA_VALUE |
Returns:
| Type | Description |
|---|---|
| ndarray \| DataArray \| Dataset | Dequantized float32 data in range [-1, 1], with NaN for nodata |
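To make the formula concrete, here is a NumPy-only sketch of the documented arithmetic. It is an illustration rather than the library's code: dequantize_sketch is a hypothetical name, and it handles plain arrays only, not DataArray or Dataset inputs or their attrs.

```python
import numpy as np


def dequantize_sketch(q, divisor=127.5, nodata_value=-128):
    """Apply sign(q) * (q / divisor) ** 2 and turn the nodata sentinel into NaN."""
    q = np.asarray(q, dtype=np.int8)
    scaled = q.astype(np.float32) / np.float32(divisor)
    out = np.sign(scaled) * scaled**2
    out[q == nodata_value] = np.nan
    return out


values = dequantize_sketch([-128, -127, 0, 64, 127])
```

With the default divisor, the extreme codes ±127 map to roughly ±0.992, so the output sits just inside [-1, 1].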
aef_loader.quantize_aef(data: np.ndarray | xr.DataArray, divisor: float = AEF_DEQUANT_DIVISOR) -> np.ndarray | xr.DataArray
Quantize float32 embeddings to int8 for storage.
This is the inverse of dequantize_aef().
Dequantization: ((v / 127.5) ** 2) * sign(v)
Quantization (inverse): sign(v) * sqrt(|v|) * 127.5
For DataArray inputs, both nodata and _FillValue attrs are set
to -128 (AEF_NODATA_VALUE) on the output so that downstream tools
(odc-geo, xarray) recognise the int8 nodata sentinel. All other
existing attrs are preserved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | ndarray \| DataArray | Float32 embedding data in range [-1, 1] | required |
| divisor | float | Quantization divisor (default: 127.5) | AEF_DEQUANT_DIVISOR |
Returns:
| Type | Description |
|---|---|
| ndarray \| DataArray | Quantized int8 data in range [-127, 127] |
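The inverse relationship can be checked numerically. The sketch below applies the two documented formulas with plain NumPy; the rounding, the clipping to [-127, 127], and the mapping of NaN to -128 are assumptions about details the text does not spell out, and both function names are hypothetical.

```python
import numpy as np

DIVISOR = 127.5
NODATA = -128


def quantize_sketch(v):
    """sign(v) * sqrt(|v|) * 127.5, rounded and clipped into int8 [-127, 127]."""
    v = np.asarray(v, dtype=np.float64)
    scaled = np.sign(v) * np.sqrt(np.abs(v)) * DIVISOR
    # Assumption: round to nearest and clip; NaN maps to the nodata sentinel.
    scaled = np.clip(np.rint(scaled), -127, 127)
    return np.where(np.isnan(v), NODATA, scaled).astype(np.int8)


def dequantize_sketch(q):
    """The forward formula, used here only to check the round trip."""
    x = np.asarray(q, dtype=np.float64) / DIVISOR
    return np.sign(x) * x**2


codes = np.arange(-127, 128)
round_trip = quantize_sketch(dequantize_sketch(codes))
```

Under these assumptions every valid int8 code survives a dequantize/quantize round trip unchanged.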
aef_loader.mask_nodata(data: np.ndarray | xr.DataArray, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray
Mask NoData values (-128) in AEF embeddings.
NoData pixels have -128 in all channels. This function replaces NoData values with NaN for proper handling in analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | ndarray \| DataArray | AEF embedding data (int8) | required |
| nodata_value | int | Value to mask (default: -128) | AEF_NODATA_VALUE |
Returns:
| Type | Description |
|---|---|
| ndarray \| DataArray | Data with NoData values replaced by NaN |
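Because int8 cannot represent NaN, masking implies a float result. The sketch below shows that step with NumPy, plus a per-pixel check that follows from NoData pixels carrying -128 in every channel. Both helper names are hypothetical, and the float32 output dtype and the band axis position are assumptions.

```python
import numpy as np


def mask_nodata_sketch(data, nodata_value=-128):
    """Return a float32 copy of `data` with the nodata sentinel replaced by NaN."""
    data = np.asarray(data)
    # int8 cannot represent NaN, so the result has to be a float array.
    return np.where(data == nodata_value, np.nan, data).astype(np.float32)


def nodata_pixels(data, nodata_value=-128, band_axis=0):
    """Boolean mask of pixels whose every band equals the sentinel."""
    return (np.asarray(data) == nodata_value).all(axis=band_axis)


# Two bands over a 1 x 2 image; the second pixel is nodata in both bands.
tile = np.array([[[5, -128]], [[-7, -128]]], dtype=np.int8)
masked = mask_nodata_sketch(tile)
empty = nodata_pixels(tile)
```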
aef_loader.int8_to_float32(data: np.ndarray | xr.DataArray | xr.Dataset, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray | xr.Dataset
Cast int8 AEF embeddings to float32 without dequantization.
Unlike dequantize_aef(), this performs a simple type cast: int8 values become their float32 equivalents (e.g. 64 -> 64.0, not 0.252). NoData values (-128) are replaced with NaN.
For DataArray inputs, both nodata and _FillValue attrs are set
to NaN on the output. All other existing attrs are preserved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | ndarray \| DataArray \| Dataset | Quantized embedding data (int8) | required |
| nodata_value | int | Value to treat as nodata (default: -128) | AEF_NODATA_VALUE |
Returns:
| Type | Description |
|---|---|
| ndarray \| DataArray \| Dataset | Float32 data with raw int8 values preserved, NaN for nodata |
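The difference between a plain cast and dequantization is easiest to see side by side. This NumPy sketch reproduces the 64 -> 64.0 versus 0.252 contrast on a two-element array; it mirrors the documented behaviour and is not the library's code.

```python
import numpy as np

raw = np.array([64, -128], dtype=np.int8)

# Plain cast: the stored integer is kept, only the dtype and nodata change.
cast = raw.astype(np.float32)
cast[raw == -128] = np.nan

# Dequantization: the same input run through sign(v) * (v / 127.5) ** 2.
scaled = raw.astype(np.float32) / np.float32(127.5)
dequantized = np.sign(scaled) * scaled**2
dequantized[raw == -128] = np.nan
```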
aef_loader.set_aef_nodata(data: xr.DataArray | xr.Dataset, nodata: int | float = AEF_NODATA_VALUE) -> xr.DataArray | xr.Dataset
Return a copy with the nodata and _FillValue attributes set explicitly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataArray \| Dataset | Input DataArray or Dataset. | required |
| nodata | int \| float | The nodata sentinel to stamp. Use AEF_NODATA_VALUE (-128) for raw/quantized embeddings, or np.nan for dequantized float data. | AEF_NODATA_VALUE |
The input is not modified; a shallow copy (shared data, new attrs) is returned.
aef_loader.split_bands(ds: xr.Dataset, var: str = 'embeddings') -> xr.Dataset
Split a single multi-band DataArray into separate named variables (A00–A63).
This is the inverse of the compact band representation used by VirtualTiffReader.open_tiles_by_zone(). Use this when downstream code expects individual A00–A63 data variables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ds | Dataset | Dataset containing a variable with a 'band' dimension | required |
| var | str | Name of the variable to split (default: "embeddings") | 'embeddings' |
Returns:
| Type | Description |
|---|---|
| Dataset | Dataset with one variable per band (A00, A01, ..., A63) |
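The band naming can be illustrated without xarray. The sketch below splits a plain NumPy stack into a dict keyed A00 to A63; it only mirrors the idea, since the real function works on a Dataset and returns a Dataset, and split_bands_sketch and the band axis position are assumptions.

```python
import numpy as np

BAND_NAMES = [f"A{i:02d}" for i in range(64)]  # A00, A01, ..., A63


def split_bands_sketch(stack, band_axis=0):
    """Return {band name: array} from a stack with 64 bands on `band_axis`."""
    stack = np.asarray(stack)
    if stack.shape[band_axis] != len(BAND_NAMES):
        raise ValueError("expected 64 bands")
    # Move the band axis to the front so iteration yields one plane per band.
    planes = np.moveaxis(stack, band_axis, 0)
    return dict(zip(BAND_NAMES, planes))


stack = np.zeros((64, 2, 3), dtype=np.int8)
stack[63] = 7
bands = split_bands_sketch(stack)
```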
aef_loader.reproject_datatree(tree: DataTree, target_geobox: GeoBox, resampling: str = 'nearest', dst_nodata: int | float | None = None) -> xr.Dataset
Reproject all zones in a DataTree to a common target GeoBox.
This function takes a DataTree with multiple UTM zones and reprojects each zone's dataset to a common coordinate system defined by the target GeoBox. The reprojected datasets are then combined into a single dataset.
The reprojection is lazy - it builds a dask computation graph that only executes when .compute() is called. Chunks are loaded and reprojected on-demand.
For combining zones, this uses xarray's combine_first which:
- Uses values from earlier zones where available (non-NaN)
- Fills NaN regions with values from subsequent zones
- In true overlapping regions (both have valid data), earlier zones take precedence
Since overlapping regions contain reprojections of the same underlying data, values should be identical regardless of which zone they come from.
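The precedence rules can be sketched with NumPy alone. This assumes the zones already share one grid (which holds once each has been reprojected to the same GeoBox), so the coordinate alignment that xarray's combine_first also performs is left out; combine_first_sketch is a hypothetical name.

```python
import numpy as np


def combine_first_sketch(layers):
    """Merge same-shaped float arrays, preferring earlier layers where not NaN."""
    combined = np.array(layers[0], dtype=np.float32)
    for layer in layers[1:]:
        layer = np.asarray(layer, dtype=np.float32)
        # Only holes (NaN) in the running result are filled by later layers.
        combined = np.where(np.isnan(combined), layer, combined)
    return combined


zone_a = [1.0, np.nan, np.nan]
zone_b = [9.0, 2.0, np.nan]
merged = combine_first_sketch([zone_a, zone_b])
```

The first element keeps the value from zone_a even though zone_b also has data there, the second is filled from zone_b, and the third stays NaN because neither zone covers it.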
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tree | DataTree | DataTree with zone datasets as children (from open_tiles_by_zone) | required |
| target_geobox | GeoBox | Target GeoBox defining the output CRS, resolution, and extent. Can be created with GeoBox.from_bbox() or from an existing dataset. | required |
| resampling | str | Resampling method - "nearest", "bilinear", "cubic", etc. Default is "nearest" which preserves original int8 values. | 'nearest' |
| dst_nodata | int \| float \| None | Nodata value for the output. When None (default), xr_reproject reads the value from the source DataArray's nodata/_FillValue attrs. When set, the value is passed to xr_reproject and both | None |
Returns:
| Type | Description |
|---|---|
| Dataset | Combined xr.Dataset with all zones reprojected to the target GeoBox. Data variables remain as dask arrays until .compute() is called. |
Example
from odc.geo.geobox import GeoBox
# Create target geobox (e.g., 100m resolution in EPSG:4326)
target = GeoBox.from_bbox(
    bbox=(-122.5, 37.5, -121.5, 38.5),
    crs="EPSG:4326",
    resolution=0.001,  # ~100m at this latitude
)
# Reproject all zones to target
combined = reproject_datatree(tree, target)
result = combined.compute() # triggers actual reprojection
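The "~100m" figure in the example can be sanity-checked with a rough spherical approximation. The constant of about 111,320 m per degree of latitude and the helper pixel_size_m are assumptions made for this sketch; precise values depend on the ellipsoid.

```python
import math

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude


def pixel_size_m(resolution_deg, latitude_deg):
    """Approximate (north-south, east-west) pixel size in meters."""
    north_south = resolution_deg * METERS_PER_DEGREE
    # Meridians converge toward the poles, so east-west distance shrinks by cos(lat).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = pixel_size_m(0.001, 38.0)
```

At 38° latitude a 0.001° pixel comes out at roughly 111 m north-south and 88 m east-west, so "about 100 m" is a fair shorthand for a pixel that is not quite square on the ground.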