API Reference

Core Classes

`aef_loader.AEFIndex(source: DataSource = DataSource.GCS, gcp_project: str | None = None, cache_dir: Path | None = None)`

Manages the AEF GeoParquet index for efficient spatial/temporal queries.

The index contains metadata about all AEF tiles including their bounding boxes, paths, and optionally pre-fetched COG header metadata.

Supports both GCS (Google Cloud Storage) and Source Cooperative (AWS S3) backends.

Example

GCS (requires GCP project for requester-pays):

index = AEFIndex(source=DataSource.GCS, gcp_project="my-project")
await index.download()
tiles = await index.query(bbox=(-122.5, 37.5, -122.0, 38.0), years=(2020, 2023))

Source Cooperative (public, no auth required):

index = AEFIndex(source=DataSource.SOURCE_COOP)
await index.download()
tiles = await index.query(bbox=(-122.5, 37.5, -122.0, 38.0), years=(2020, 2023))

Initialize the AEF index manager.

Parameters:

Name	Type	Description	Default
`source`	`DataSource`	Data source (GCS or SOURCE_COOP)	`GCS`
`gcp_project`	`str \| None`	GCP project ID for requester-pays bucket access (GCS only)	`None`
`cache_dir`	`Path \| None`	Directory for caching the index; /tmp is used when None	`None`

`download(force: bool = False, local_path: Path | None = None) -> Path` `async`

Download the AEF index from cloud storage using obstore.

Parameters:

Name	Type	Description	Default
`force`	`bool`	Force re-download even if cached	`False`
`local_path`	`Path \| None`	Custom path for the index file	`None`

Returns:

Type	Description
`Path`	Path to the downloaded index file

`load(path: Path | None = None) -> gpd.GeoDataFrame`

Load the index into memory as a GeoDataFrame.

Parameters:

Name	Type	Description	Default
`path`	`Path \| None`	Path to index file (uses cached path if not provided)	`None`

Returns:

Type	Description
`GeoDataFrame`	GeoDataFrame with AEF tile metadata

`query(bbox: BoundingBox | None = None, years: int | DateRange | None = None, limit: int | None = None) -> list[AEFTileInfo]` `async`

Query the index for tiles matching the given criteria.

Parameters:

Name	Type	Description	Default
`bbox`	`BoundingBox \| None`	Bounding box filter (minx, miny, maxx, maxy) in WGS84	`None`
`years`	`int \| DateRange \| None`	Single year or (start_year, end_year) tuple	`None`
`limit`	`int \| None`	Maximum number of tiles to return	`None`

Returns:

Type	Description
`list[AEFTileInfo]`	List of AEFTileInfo objects matching the query
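
The filtering described above can be pictured as a bounding-box test combined with a year-range check. The sketch below is pure Python and does not call aef_loader; `Tile`, `intersects`, and `filter_tiles` are hypothetical names, and the choice of box intersection (rather than containment) and an inclusive year range are assumptions, not confirmed behaviour of `query()`.

```python
from dataclasses import dataclass

BoundingBox = tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


@dataclass
class Tile:
    """Hypothetical stand-in for AEFTileInfo with only the fields used here."""
    id: str
    year: int
    bbox: BoundingBox


def intersects(a: BoundingBox, b: BoundingBox) -> bool:
    """True when the two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_tiles(tiles, bbox=None, years=None, limit=None):
    """Assumed semantics: bbox intersection, inclusive year range, then limit."""
    if isinstance(years, int):
        years = (years, years)  # a single year is treated as a one-year range
    matches = [
        t for t in tiles
        if (bbox is None or intersects(t.bbox, bbox))
        and (years is None or years[0] <= t.year <= years[1])
    ]
    return matches if limit is None else matches[:limit]


tiles = [
    Tile("a", 2019, (-123.0, 37.0, -122.0, 38.0)),
    Tile("b", 2021, (-123.0, 37.0, -122.0, 38.0)),
    Tile("c", 2021, (10.0, 50.0, 11.0, 51.0)),
]
selected = filter_tiles(tiles, bbox=(-122.5, 37.5, -122.0, 38.0), years=(2020, 2023))
print([t.id for t in selected])
```

Only tile "b" passes both filters here: "a" falls outside the year range and "c" lies outside the box.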

`aef_loader.VirtualTiffReader(gcp_project: str | None = None)`

COG reader using virtual-tiff to create virtual zarr stores.

This provides efficient COG access by:

- Creating virtual zarr stores from COGs without data duplication
- Using async I/O via obstore for cloud access (GCS and S3)
- Organizing tiles by UTM zone for proper CRS handling
- Integrating directly with xarray for data loading

The primary method is open_tiles_by_zone(), which loads tiles organized by their native UTM zone. To combine data across zones, use reproject_datatree() from the utils module.

Example

from aef_loader import AEFIndex, VirtualTiffReader, DataSource
from aef_loader.utils import reproject_datatree
from odc.geo.geobox import GeoBox

# Query tiles
index = AEFIndex(source=DataSource.SOURCE_COOP)
await index.download()
index.load()
tiles = await index.query(bbox=(-122.5, 37.5, -121.5, 38.5), years=(2020, 2022))

# Load by UTM zone
async with VirtualTiffReader() as reader:
    tree = await reader.open_tiles_by_zone(tiles)

# Reproject to common CRS if needed
target = GeoBox.from_bbox(bbox=(-122.5, 37.5, -121.5, 38.5), crs="EPSG:4326", resolution=0.0001)
combined = reproject_datatree(tree, target)
result = combined.compute()

Initialize the virtual TIFF reader.

Parameters:

Name	Type	Description	Default
`gcp_project`	`str \| None`	GCP project ID for requester-pays buckets (GCS only)	`None`

`open_tiles_by_zone(tiles: list[AEFTileInfo], ifd: int = 0, chunks: int | dict | Literal['auto'] | None = 'auto') -> DataTree` `async`

Open tiles and organize them by UTM zone in a DataTree.

Each UTM zone becomes a group in the DataTree, containing a Dataset with a single 'embeddings' variable with a band dimension (A00–A63). Both nodata and _FillValue attrs are set to -128 on each embeddings variable so that downstream tools (odc-geo xr_reproject, xarray) correctly identify the AEF nodata sentinel.

This is the primary method for loading AEF data. It keeps each zone's data in its native CRS for accurate spatial operations. To combine data across zones, use reproject_datatree() from the utils module.

Parameters:

Name	Type	Description	Default
`tiles`	`list[AEFTileInfo]`	List of AEFTileInfo objects from AEFIndex.query()	required
`ifd`	`int`	Image File Directory index (0 for full resolution)	`0`
`chunks`	`int \| dict \| Literal['auto'] \| None`	Chunking argument passed to open_zarr. Defaults to 'auto'; passing None can help avoid dask task explosions.	`'auto'`

Returns:

Type	Description
`DataTree`	DataTree with structure:

├── 10N/ → Dataset with embeddings(time, band, y, x) in EPSG:32610
├── 10S/ → Dataset with embeddings(time, band, y, x) in EPSG:32710
├── 11N/ → Dataset with embeddings(time, band, y, x) in EPSG:32611
...

Example

tiles = await index.query(bbox=bbox, years=(2020, 2022))
async with VirtualTiffReader() as reader:
    tree = await reader.open_tiles_by_zone(tiles)
for zone in tree.children:
    ds = tree[zone].ds
    print(f"{zone}: {ds.odc.crs}, {dict(ds.sizes)}")
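
The zone labels and EPSG codes in the structure above follow the standard WGS 84 / UTM numbering: 326xx for northern zones and 327xx for southern zones. A minimal sketch of that mapping follows; `utm_zone_to_epsg` is a hypothetical helper written for illustration and is not part of aef_loader.

```python
def utm_zone_to_epsg(zone: str) -> int:
    """Map a zone label such as '10N' to its WGS 84 / UTM EPSG code."""
    number, hemisphere = int(zone[:-1]), zone[-1].upper()
    if not 1 <= number <= 60 or hemisphere not in ("N", "S"):
        raise ValueError(f"not a UTM zone label: {zone!r}")
    # EPSG:326xx covers the northern hemisphere, EPSG:327xx the southern.
    return (32600 if hemisphere == "N" else 32700) + number


for label in ("10N", "10S", "11N"):
    print(label, f"EPSG:{utm_zone_to_epsg(label)}")
```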

Types

`aef_loader.DataSource`

Bases: Enum

Data source for AEF embeddings.

`aef_loader.AEFTileInfo(id: str, path: str, year: int, bbox: BoundingBox, crs_epsg: int, utm_zone: str | None = None, utm_bounds: BoundingBox | None = None, source: DataSource | None = None)` `dataclass`

Information about an AEF tile/scene.

`as_datetime: dt.datetime` `property`

Get datetime (January 1st of the year).

Utility Functions

`aef_loader.dequantize_aef(data: np.ndarray | xr.DataArray | xr.Dataset, divisor: float = AEF_DEQUANT_DIVISOR, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray | xr.Dataset`

Dequantize AEF embeddings from int8 to float32.

AEF embeddings are stored as quantized int8 values in the range [-127, 127]. This function converts them back to float32 values in [-1, 1] for use in ML pipelines.

The formula is: ((value / 127.5) ** 2) * sign(value)

NoData values (-128) are automatically converted to NaN. For DataArray inputs, both nodata and _FillValue attrs are set to NaN on the output so that downstream tools (odc-geo, xarray) recognise the new fill value. All other existing attrs are preserved.

Parameters:

Name	Type	Description	Default
`data`	`ndarray \| DataArray \| Dataset`	Quantized embedding data (int8)	required
`divisor`	`float`	Dequantization divisor (default: 127.5)	`AEF_DEQUANT_DIVISOR`
`nodata_value`	`int`	Value to treat as nodata (default: -128)	`AEF_NODATA_VALUE`

Returns:

Type	Description
`ndarray \| DataArray \| Dataset`	Dequantized float32 data in range [-1, 1], with NaN for nodata

Example

import numpy as np
from aef_loader import dequantize_aef

quantized = np.array([127, -127, 0, -128], dtype=np.int8)
dequantized = dequantize_aef(quantized)
print(dequantized)  # [~1.0, ~-1.0, 0.0, nan]
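
To make the formula concrete, here is a scalar sketch in pure Python that applies it to one value at a time. It illustrates the arithmetic only and is not the library implementation; `dequantize_value`, `DIVISOR`, and `NODATA` are hypothetical names that mirror the documented defaults (127.5 and -128).

```python
import math

DIVISOR = 127.5  # documented default of AEF_DEQUANT_DIVISOR
NODATA = -128    # documented default of AEF_NODATA_VALUE


def dequantize_value(value: int, divisor: float = DIVISOR, nodata: int = NODATA) -> float:
    """Apply ((value / divisor) ** 2) * sign(value) to one int8 value."""
    if value == nodata:
        return math.nan  # the nodata sentinel becomes NaN
    # copysign restores the sign that squaring discards
    return math.copysign((value / divisor) ** 2, value)


for raw in (127, 64, 0, -64, -128):
    print(raw, dequantize_value(raw))
```

Because the divisor is 127.5 rather than 127, the largest code (127) maps to roughly 0.992 rather than exactly 1.0, and 64 maps to roughly 0.252.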

`aef_loader.quantize_aef(data: np.ndarray | xr.DataArray, divisor: float = AEF_DEQUANT_DIVISOR) -> np.ndarray | xr.DataArray`

Quantize float32 embeddings to int8 for storage.

This is the inverse of dequantize_aef().

Dequantization: ((v / 127.5) ** 2) * sign(v)
Quantization (inverse): sign(v) * sqrt(|v|) * 127.5

For DataArray inputs, both nodata and _FillValue attrs are set to -128 (AEF_NODATA_VALUE) on the output so that downstream tools (odc-geo, xarray) recognise the int8 nodata sentinel. All other existing attrs are preserved.

Parameters:

Name	Type	Description	Default
`data`	`ndarray \| DataArray`	Float32 embedding data in range [-1, 1]	required
`divisor`	`float`	Quantization divisor (default: 127.5)	`AEF_DEQUANT_DIVISOR`

Returns:

Type	Description
`ndarray \| DataArray`	Quantized int8 data in range [-127, 127]
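
The inverse relationship between the two formulas can be checked with a scalar round trip. This is a pure-Python sketch, not the library code: `quantize_value` and `dequantize_value` are hypothetical names, and the rounding and clipping to [-127, 127] are assumptions about how the float result is turned back into an int8 code. Only finite inputs are handled.

```python
import math

DIVISOR = 127.5  # documented default of AEF_DEQUANT_DIVISOR


def dequantize_value(q: int) -> float:
    """((q / divisor) ** 2) * sign(q) for one int8 code."""
    return math.copysign((q / DIVISOR) ** 2, q)


def quantize_value(v: float) -> int:
    """sign(v) * sqrt(|v|) * divisor, then rounded and clipped (assumed steps)."""
    scaled = math.copysign(math.sqrt(abs(v)) * DIVISOR, v)
    return max(-127, min(127, round(scaled)))


# Every valid int8 code should survive dequantize -> quantize unchanged.
assert all(quantize_value(dequantize_value(q)) == q for q in range(-127, 128))
print(quantize_value(1.0), quantize_value(-1.0), quantize_value(0.25))
```

The clip matters at the ends of the range: an input of exactly 1.0 scales to 127.5, which would otherwise round past the largest valid code.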

`aef_loader.mask_nodata(data: np.ndarray | xr.DataArray, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray`

Mask NoData values (-128) in AEF embeddings.

NoData pixels have -128 in all channels. This function replaces NoData values with NaN for proper handling in analysis.

Parameters:

Name	Type	Description	Default
`data`	`ndarray \| DataArray`	AEF embedding data (int8)	required
`nodata_value`	`int`	Value to mask (default: -128)	`AEF_NODATA_VALUE`

Returns:

Type	Description
`ndarray \| DataArray`	Data with NoData values replaced by NaN

`aef_loader.int8_to_float32(data: np.ndarray | xr.DataArray | xr.Dataset, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray | xr.Dataset`

Cast int8 AEF embeddings to float32 without dequantization.

Unlike dequantize_aef(), this performs a simple type cast: int8 values become their float32 equivalents (e.g. 64 -> 64.0, not 0.252). NoData values (-128) are replaced with NaN.

For DataArray inputs, both nodata and _FillValue attrs are set to NaN on the output. All other existing attrs are preserved.

Parameters:

Name	Type	Description	Default
`data`	`ndarray \| DataArray \| Dataset`	Quantized embedding data (int8)	required
`nodata_value`	`int`	Value to treat as nodata (default: -128)	`AEF_NODATA_VALUE`

Returns:

Type	Description
`ndarray \| DataArray \| Dataset`	Float32 data with raw int8 values preserved, NaN for nodata

`aef_loader.set_aef_nodata(data: xr.DataArray | xr.Dataset, nodata: int | float = AEF_NODATA_VALUE) -> xr.DataArray | xr.Dataset`

set_aef_nodata(
    data: xr.DataArray, nodata: int | float = ...
) -> xr.DataArray

set_aef_nodata(
    data: xr.Dataset, nodata: int | float = ...
) -> xr.Dataset

Return a copy with the nodata and _FillValue attributes set explicitly.

Parameters:

Name	Type	Description	Default
`data`	`DataArray \| Dataset`	Input DataArray or Dataset.	required
`nodata`	`int \| float`	The nodata sentinel to stamp. Use AEF_NODATA_VALUE (-128) for raw/quantized embeddings, or np.nan for dequantized float data.	`AEF_NODATA_VALUE`

The input is not modified; a shallow copy (shared data, new attrs) is returned.
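
The "shared data, new attrs" behaviour can be illustrated without xarray. In the sketch below, `Labeled` is a hypothetical stand-in for a DataArray and `stamp_nodata` is a hypothetical helper; neither is part of aef_loader, and the real function may build its copy differently.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Labeled:
    """Hypothetical stand-in for a DataArray: a data buffer plus an attrs dict."""
    data: list
    attrs: dict = field(default_factory=dict)


def stamp_nodata(obj: Labeled, nodata: float = -128) -> Labeled:
    """Return a new object sharing `data` but carrying its own attrs dict."""
    out = copy.copy(obj)  # shallow copy: `data` is shared, not duplicated
    # Build a fresh dict so the caller's attrs are left untouched.
    out.attrs = {**obj.attrs, "nodata": nodata, "_FillValue": nodata}
    return out


raw = Labeled(data=[1, -128, 3], attrs={"units": "1"})
stamped = stamp_nodata(raw)
print(stamped.attrs, raw.attrs, stamped.data is raw.data)
```

The original keeps its attrs unchanged while both objects point at the same underlying buffer, so stamping metadata costs no extra memory for the data itself.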

`aef_loader.split_bands(ds: xr.Dataset, var: str = 'embeddings') -> xr.Dataset`

Split a single multi-band DataArray into separate named variables (A00–A63).

This is the inverse of the compact band representation used by VirtualTiffReader.open_tiles_by_zone(). Use this when downstream code expects individual A00–A63 data variables.

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	Dataset containing a variable with a 'band' dimension	required
`var`	`str`	Name of the variable to split (default: "embeddings")	`'embeddings'`

Returns:

Type	Description
`Dataset`	Dataset with one variable per band (A00, A01, ..., A63)
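
The band naming can be sketched with plain Python containers. The real function operates on an xr.Dataset; here `band_names` and `split_stack` are hypothetical helpers and a list of lists stands in for the band-major array, so this shows the naming scheme only.

```python
def band_names(count: int = 64) -> list[str]:
    """Zero-padded names A00, A01, ... used for the embedding bands."""
    return [f"A{i:02d}" for i in range(count)]


def split_stack(stack: list) -> dict:
    """Turn a band-major stack into one entry per named band."""
    return dict(zip(band_names(len(stack)), stack))


# A toy 64-band stack in which every band holds a single 2-pixel row.
stack = [[band, band] for band in range(64)]
variables = split_stack(stack)
print(list(variables)[:3], list(variables)[-1], variables["A63"])
```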

`aef_loader.reproject_datatree(tree: DataTree, target_geobox: GeoBox, resampling: str = 'nearest', dst_nodata: int | float | None = None) -> xr.Dataset`

Reproject all zones in a DataTree to a common target GeoBox.

This function takes a DataTree with multiple UTM zones and reprojects each zone's dataset to a common coordinate system defined by the target GeoBox. The reprojected datasets are then combined into a single dataset.

The reprojection is lazy: it builds a dask computation graph that only executes when .compute() is called. Chunks are loaded and reprojected on demand.

For combining zones, this uses xarray's combine_first, which:

- Uses values from earlier zones where available (non-NaN)
- Fills NaN regions with values from subsequent zones
- In true overlapping regions (both have valid data), gives earlier zones precedence

Since overlapping regions contain reprojections of the same underlying data, values should be identical regardless of which zone they come from.
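
The merge rule described above can be shown on two short rows of pixels. This is a pure-Python sketch of the combine_first idea, not a call into xarray; the function and the `zone_10n` / `zone_11n` values are made up for illustration.

```python
import math


def combine_first(primary: list[float], secondary: list[float]) -> list[float]:
    """Keep values from `primary`, filling its NaN gaps from `secondary`."""
    return [s if math.isnan(p) else p for p, s in zip(primary, secondary)]


nan = math.nan
zone_10n = [1.0, 2.0, nan, nan]  # earlier zone, valid on the left
zone_11n = [nan, 9.0, 3.0, nan]  # later zone, valid on the right

merged = combine_first(zone_10n, zone_11n)
print(merged)
```

In the second pixel both zones hold valid data and the earlier zone wins; the last pixel stays NaN because neither zone covers it.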

Parameters:

Name	Type	Description	Default
`tree`	`DataTree`	DataTree with zone datasets as children (from open_tiles_by_zone)	required
`target_geobox`	`GeoBox`	Target GeoBox defining the output CRS, resolution, and extent. Can be created with GeoBox.from_bbox() or from an existing dataset.	required
`resampling`	`str`	Resampling method - "nearest", "bilinear", "cubic", etc. Default is "nearest" which preserves original int8 values.	`'nearest'`
`dst_nodata`	`int \| float \| None`	Nodata value for the output. When None (default), xr_reproject reads the value from the source DataArray's nodata/_FillValue attrs. When set, the value is passed to xr_reproject and both `nodata` and `_FillValue` attrs are stamped on each output data variable after reprojection and after the zone merge.	`None`

Returns:

Type	Description
`Dataset`	Combined xr.Dataset with all zones reprojected to the target GeoBox. Data variables remain as dask arrays until .compute() is called.

Example

from odc.geo.geobox import GeoBox
from aef_loader.utils import reproject_datatree

# Create target geobox (e.g., 100m resolution in EPSG:4326)
target = GeoBox.from_bbox(
    bbox=(-122.5, 37.5, -121.5, 38.5),
    crs="EPSG:4326",
    resolution=0.001,  # ~100m at this latitude
)

# Reproject all zones to target
combined = reproject_datatree(tree, target)
result = combined.compute()  # triggers actual reprojection
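
The "~100m" comment in the example comes from converting degrees to metres. A rough sketch of that arithmetic follows, assuming a spherical approximation in which one degree of latitude is about 111,320 m; `degrees_to_metres` and `METRES_PER_DEGREE` are hypothetical names introduced here.

```python
import math

METRES_PER_DEGREE = 111_320  # approximate length of one degree of latitude


def degrees_to_metres(resolution_deg: float, latitude_deg: float) -> tuple[float, float]:
    """Return the approximate (east-west, north-south) pixel size in metres."""
    north_south = resolution_deg * METRES_PER_DEGREE
    # Meridians converge towards the poles, so east-west size shrinks with cos(lat).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south


ew, ns = degrees_to_metres(0.001, 38.0)
print(f"{ew:.0f} m east-west, {ns:.0f} m north-south")
```

At 38° latitude a 0.001° pixel is therefore noticeably narrower east-west than north-south, which is worth keeping in mind when choosing a resolution in EPSG:4326.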
