API Reference

Core Classes

aef_loader.AEFIndex(source: DataSource = DataSource.GCS, gcp_project: str | None = None, cache_dir: Path | None = None)

Manages the AEF GeoParquet index for efficient spatial/temporal queries.

The index contains metadata about all AEF tiles including their bounding boxes, paths, and optionally pre-fetched COG header metadata.

Supports both GCS (Google Cloud Storage) and Source Cooperative (AWS S3) backends.

Example

GCS (requires GCP project for requester-pays):

index = AEFIndex(source=DataSource.GCS, gcp_project="my-project")
await index.download()
tiles = await index.query(bbox=(-122.5, 37.5, -122.0, 38.0), years=(2020, 2023))

Source Cooperative (public, no auth required):

index = AEFIndex(source=DataSource.SOURCE_COOP)
await index.download()
tiles = await index.query(bbox=(-122.5, 37.5, -122.0, 38.0), years=(2020, 2023))

Initialize AEF index manager.

Parameters:

Name Type Description Default
source DataSource

Data source (GCS or SOURCE_COOP)

GCS
gcp_project str | None

GCP project ID for requester-pays bucket access (GCS only)

None
cache_dir Path | None

Directory for caching the index (default: /tmp)

None

download(force: bool = False, local_path: Path | None = None) -> Path async

Download the AEF index from cloud storage using obstore.

Parameters:

Name Type Description Default
force bool

Force re-download even if cached

False
local_path Path | None

Custom path for the index file

None

Returns:

Type Description
Path

Path to the downloaded index file
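
The interaction between cache_dir, force, and the returned Path can be sketched as a simple "reuse unless forced" check. This is a minimal illustration under stated assumptions, not the library's implementation: the file name `aef_index.parquet`, the `cached_download` helper, and the `fake_fetch` callable are hypothetical stand-ins for the real obstore download.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_download(
    fetch: Callable[[Path], None],
    cache_dir: Path,
    filename: str = "aef_index.parquet",  # hypothetical file name
    force: bool = False,
) -> Path:
    """Return the cached file path, calling `fetch` only when needed."""
    target = cache_dir / filename
    if force or not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        fetch(target)  # stand-in for the real cloud download
    return target


calls: List[Path] = []


def fake_fetch(path: Path) -> None:
    # Records the call and writes a placeholder file instead of downloading.
    calls.append(path)
    path.write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = cached_download(fake_fetch, cache)              # downloads
    second = cached_download(fake_fetch, cache)             # served from cache
    third = cached_download(fake_fetch, cache, force=True)  # downloads again
    download_count = len(calls)
```

In this sketch only the first and the forced call reach the fetch function; the second call returns the existing path untouched.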

load(path: Path | None = None) -> gpd.GeoDataFrame

Load the index into memory as a GeoDataFrame.

Parameters:

Name Type Description Default
path Path | None

Path to index file (uses cached path if not provided)

None

Returns:

Type Description
GeoDataFrame

GeoDataFrame with AEF tile metadata

query(bbox: BoundingBox | None = None, years: int | DateRange | None = None, limit: int | None = None) -> list[AEFTileInfo] async

Query the index for tiles matching the given criteria.

Parameters:

Name Type Description Default
bbox BoundingBox | None

Bounding box filter (minx, miny, maxx, maxy) in WGS84

None
years int | DateRange | None

Single year or (start_year, end_year) tuple

None
limit int | None

Maximum number of tiles to return

None

Returns:

Type Description
list[AEFTileInfo]

List of AEFTileInfo objects matching the query
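
To make the filter semantics concrete, the following sketch applies a bounding-box intersection test, a year filter, and a limit to a list of in-memory records. It is an illustration only: `Tile` is a hypothetical, reduced stand-in for AEFTileInfo, `filter_tiles` is not a library function, and treating the year range as inclusive on both ends is an assumption the reference does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy) in WGS84


@dataclass
class Tile:  # hypothetical stand-in for AEFTileInfo
    id: str
    year: int
    bbox: BBox


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_tiles(
    tiles: List[Tile],
    bbox: Optional[BBox] = None,
    years: Union[int, Tuple[int, int], None] = None,
    limit: Optional[int] = None,
) -> List[Tile]:
    if isinstance(years, int):
        years = (years, years)  # a single year is a one-year range
    matched = []
    for tile in tiles:
        if bbox is not None and not intersects(tile.bbox, bbox):
            continue
        # Assumption: the year range is inclusive on both ends.
        if years is not None and not (years[0] <= tile.year <= years[1]):
            continue
        matched.append(tile)
    return matched if limit is None else matched[:limit]


tiles = [
    Tile("a", 2020, (-123.0, 37.0, -122.2, 37.8)),
    Tile("b", 2024, (-123.0, 37.0, -122.2, 37.8)),  # outside the year range
    Tile("c", 2021, (-100.0, 40.0, -99.0, 41.0)),   # outside the bbox
    Tile("d", 2023, (-122.1, 37.9, -121.5, 38.5)),
]
query_bbox = (-122.5, 37.5, -122.0, 38.0)
hits = [t.id for t in filter_tiles(tiles, bbox=query_bbox, years=(2020, 2023))]
first_only = [t.id for t in filter_tiles(tiles, bbox=query_bbox, years=(2020, 2023), limit=1)]
single_year = [t.id for t in filter_tiles(tiles, bbox=query_bbox, years=2020)]
```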

aef_loader.VirtualTiffReader(gcp_project: str | None = None)

COG reader using virtual-tiff to create virtual zarr stores.

This provides efficient COG access by:

- Creating virtual zarr stores from COGs without data duplication
- Using async I/O via obstore for cloud access (GCS and S3)
- Organizing tiles by UTM zone for proper CRS handling
- Integrating directly with xarray for data loading

The primary method is open_tiles_by_zone() which loads tiles organized by their native UTM zone. To combine data across zones, use reproject_datatree() from the utils module.

Example
from aef_loader import AEFIndex, VirtualTiffReader, DataSource
from aef_loader.utils import reproject_datatree
from odc.geo.geobox import GeoBox

# Query tiles
index = AEFIndex(source=DataSource.SOURCE_COOP)
await index.download()
index.load()
tiles = await index.query(bbox=(-122.5, 37.5, -121.5, 38.5), years=(2020, 2022))

# Load by UTM zone
async with VirtualTiffReader() as reader:
    tree = await reader.open_tiles_by_zone(tiles)

# Reproject to common CRS if needed
target = GeoBox.from_bbox(bbox=(-122.5, 37.5, -121.5, 38.5), crs="EPSG:4326", resolution=0.0001)
combined = reproject_datatree(tree, target)
result = combined.compute()

Initialize the virtual TIFF reader.

Parameters:

Name Type Description Default
gcp_project str | None

GCP project ID for requester-pays buckets (GCS only)

None

open_tiles_by_zone(tiles: list[AEFTileInfo], ifd: int = 0, chunks: int | dict | Literal['auto'] | None = 'auto') -> DataTree async

Open tiles and organize them by UTM zone in a DataTree.

Each UTM zone becomes a group in the DataTree, containing a Dataset with a single 'embeddings' variable with a band dimension (A00–A63). Both nodata and _FillValue attrs are set to -128 on each embeddings variable so that downstream tools (odc-geo xr_reproject, xarray) correctly identify the AEF nodata sentinel.

This is the primary method for loading AEF data. It keeps each zone's data in its native CRS for accurate spatial operations. To combine data across zones, use reproject_datatree() from the utils module.

Parameters:

Name Type Description Default
tiles list[AEFTileInfo]

List of AEFTileInfo objects from AEFIndex.query()

required
ifd int

Image File Directory index (0 for full resolution)

0
chunks int | dict | Literal['auto'] | None

Chunking specification passed through to open_zarr. Defaults to 'auto'. Passing None skips dask chunking, which can help avoid dask task-graph explosions when many tiles are opened.

'auto'

Returns:

Type Description
DataTree

DataTree with structure:

├── 10N/ → Dataset with embeddings(time, band, y, x) in EPSG:32610
├── 10S/ → Dataset with embeddings(time, band, y, x) in EPSG:32710
├── 11N/ → Dataset with embeddings(time, band, y, x) in EPSG:32611
...

Example
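import odc.geo.xr  # registers the .odc accessor, if not already imported

# `index` and `bbox` are assumed to be defined as in the earlier examples
tiles = await index.query(bbox=bbox, years=(2020, 2022))
async with VirtualTiffReader() as reader:
    tree = await reader.open_tiles_by_zone(tiles)

# Each child group is one UTM zone in its native CRS
for zone in tree.children:
    ds = tree[zone].ds
    print(f"{zone}: {ds.odc.crs}, {dict(ds.sizes)}")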
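
The zone labels and EPSG codes in the DataTree structure above follow the standard WGS 84 / UTM numbering: 326xx for northern-hemisphere zones and 327xx for southern-hemisphere zones. The helper below is a hypothetical sketch of that mapping (it is not part of aef_loader) and assumes the trailing letter is a hemisphere flag, as the listed codes suggest, rather than an MGRS latitude band.

```python
def utm_zone_to_epsg(zone: str) -> int:
    """Map a label such as '10N' or '10S' to its WGS 84 / UTM EPSG code."""
    number = int(zone[:-1])
    hemisphere = zone[-1].upper()
    if not 1 <= number <= 60 or hemisphere not in ("N", "S"):
        raise ValueError(f"not a valid UTM zone label: {zone!r}")
    base = 32600 if hemisphere == "N" else 32700
    return base + number


codes = {label: utm_zone_to_epsg(label) for label in ("10N", "10S", "11N")}
```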

Types

aef_loader.DataSource

Bases: Enum

Data source for AEF embeddings.

aef_loader.AEFTileInfo(id: str, path: str, year: int, bbox: BoundingBox, crs_epsg: int, utm_zone: str | None = None, utm_bounds: BoundingBox | None = None, source: DataSource | None = None) dataclass

Information about an AEF tile/scene.

as_datetime: dt.datetime property

Get datetime (January 1st of the year).
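
As a rough illustration of how a tile's year can be turned into a timestamp for sorting or for a time coordinate, consider the sketch below. `TileStub` is a hypothetical, reduced stand-in for AEFTileInfo, and returning a naive (timezone-unaware) datetime at midnight is an assumption.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass
class TileStub:  # hypothetical, reduced stand-in for AEFTileInfo
    id: str
    year: int

    @property
    def as_datetime(self) -> dt.datetime:
        # Assumption: naive datetime at midnight on January 1st of the year.
        return dt.datetime(self.year, 1, 1)


tiles = [TileStub("t-2022", 2022), TileStub("t-2020", 2020)]
ordered = sorted(tiles, key=lambda t: t.as_datetime)
stamps = [t.as_datetime.isoformat() for t in ordered]
```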

Utility Functions

aef_loader.dequantize_aef(data: np.ndarray | xr.DataArray | xr.Dataset, divisor: float = AEF_DEQUANT_DIVISOR, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray | xr.Dataset

Dequantize AEF embeddings from int8 to float32.

AEF embeddings are stored as quantized int8 values [-127, 127]. This function converts them back to float32 [-1, 1] for use in ML pipelines.

The formula is: ((value / 127.5) ** 2) * sign(value)

NoData values (-128) are automatically converted to NaN. For DataArray inputs, both nodata and _FillValue attrs are set to NaN on the output so that downstream tools (odc-geo, xarray) recognise the new fill value. All other existing attrs are preserved.

Parameters:

Name Type Description Default
data ndarray | DataArray | Dataset

Quantized embedding data (int8)

required
divisor float

Dequantization divisor (default: 127.5)

AEF_DEQUANT_DIVISOR
nodata_value int

Value to treat as nodata (default: -128)

AEF_NODATA_VALUE

Returns:

Type Description
ndarray | DataArray | Dataset

Dequantized float32 data in range [-1, 1], with NaN for nodata

Example
import numpy as np
from aef_loader import dequantize_aef

# int8 input: two extremes, zero, and the nodata sentinel
quantized = np.array([127, -127, 0, -128], dtype=np.int8)
dequantized = dequantize_aef(quantized)
print(dequantized)  # [~1.0, ~-1.0, 0.0, nan]
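
For readers who want to check the documented formula by hand, here is a scalar, standard-library sketch of it. It is not the library's vectorized implementation; `dequantize_value` is a hypothetical helper, and the constants simply repeat the defaults documented above.

```python
import math

DIVISOR = 127.5  # documented default of AEF_DEQUANT_DIVISOR
NODATA = -128    # documented default of AEF_NODATA_VALUE


def dequantize_value(value: int, divisor: float = DIVISOR, nodata: int = NODATA) -> float:
    """Apply ((value / divisor) ** 2) * sign(value) to one int8 value."""
    if value == nodata:
        return math.nan
    # copysign puts the sign of the input back on the squared magnitude
    return math.copysign((value / divisor) ** 2, value)


values = [dequantize_value(v) for v in (127, -127, 64, 0, -128)]
```

Because the divisor is 127.5 rather than 127, the extremes land just inside the unit interval (about ±0.992), which is why the example above shows approximate values.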

aef_loader.quantize_aef(data: np.ndarray | xr.DataArray, divisor: float = AEF_DEQUANT_DIVISOR) -> np.ndarray | xr.DataArray

Quantize float32 embeddings to int8 for storage.

This is the inverse of dequantize_aef().

Dequantization: ((v / 127.5) ** 2) * sign(v)
Quantization (inverse): sign(v) * sqrt(|v|) * 127.5

For DataArray inputs, both nodata and _FillValue attrs are set to -128 (AEF_NODATA_VALUE) on the output so that downstream tools (odc-geo, xarray) recognise the int8 nodata sentinel. All other existing attrs are preserved.

Parameters:

Name Type Description Default
data ndarray | DataArray

Float32 embedding data in range [-1, 1]

required
divisor float

Quantization divisor (default: 127.5)

AEF_DEQUANT_DIVISOR

Returns:

Type Description
ndarray | DataArray

Quantized int8 data in range [-127, 127]
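
The round trip between the two formulas can be checked with a scalar sketch. The rounding mode and the clipping to [-127, 127] are assumptions made here so that the result fits the documented output range; the library may round differently. Both helpers are hypothetical.

```python
import math

DIVISOR = 127.5  # documented default divisor


def quantize_value(value: float, divisor: float = DIVISOR) -> int:
    """sign(v) * sqrt(|v|) * divisor, rounded and clipped to [-127, 127]."""
    scaled = math.copysign(math.sqrt(abs(value)) * divisor, value)
    # Assumption: round to nearest, then clip so -128 stays reserved for nodata.
    return max(-127, min(127, round(scaled)))


def dequantize_value(value: int, divisor: float = DIVISOR) -> float:
    """((value / divisor) ** 2) * sign(value)"""
    return math.copysign((value / divisor) ** 2, value)


samples = [i / 100 for i in range(-100, 101)]
codes = [quantize_value(v) for v in samples]
max_error = max(abs(dequantize_value(q) - v) for q, v in zip(codes, samples))
```

Under these assumptions the worst-case round-trip error stays below 0.01, with the largest errors near ±1 where the squared mapping is steepest.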

aef_loader.mask_nodata(data: np.ndarray | xr.DataArray, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray

Mask NoData values (-128) in AEF embeddings.

NoData pixels have -128 in all channels. This function replaces NoData values with NaN for proper handling in analysis.

Parameters:

Name Type Description Default
data ndarray | DataArray

AEF embedding data (int8)

required
nodata_value int

Value to mask (default: -128)

AEF_NODATA_VALUE

Returns:

Type Description
ndarray | DataArray

Data with NoData values replaced by NaN
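
The masking rule can be illustrated on a single pixel's channel vector. This is a plain-Python sketch rather than the array-based library function; `mask_pixel` and `is_nodata_pixel` are hypothetical helpers, and the four-channel pixels stand in for the 64 real bands.

```python
import math

NODATA = -128  # documented default of AEF_NODATA_VALUE


def mask_pixel(pixel):
    """Replace the nodata sentinel with NaN, casting other values to float."""
    return [math.nan if v == NODATA else float(v) for v in pixel]


def is_nodata_pixel(pixel):
    """A NoData pixel carries the sentinel in every channel."""
    return all(v == NODATA for v in pixel)


valid = [12, -45, 0, 127]
empty = [NODATA] * 4

masked_valid = mask_pixel(valid)
masked_empty = mask_pixel(empty)
```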

aef_loader.int8_to_float32(data: np.ndarray | xr.DataArray | xr.Dataset, nodata_value: int = AEF_NODATA_VALUE) -> np.ndarray | xr.DataArray | xr.Dataset

Cast int8 AEF embeddings to float32 without dequantization.

Unlike dequantize_aef(), this performs a simple type cast: int8 values become their float32 equivalents (e.g. 64 -> 64.0, not 0.252). NoData values (-128) are replaced with NaN.

For DataArray inputs, both nodata and _FillValue attrs are set to NaN on the output. All other existing attrs are preserved.

Parameters:

Name Type Description Default
data ndarray | DataArray | Dataset

Quantized embedding data (int8)

required
nodata_value int

Value to treat as nodata (default: -128)

AEF_NODATA_VALUE

Returns:

Type Description
ndarray | DataArray | Dataset

Float32 data with raw int8 values preserved, NaN for nodata

aef_loader.set_aef_nodata(data: xr.DataArray | xr.Dataset, nodata: int | float = AEF_NODATA_VALUE) -> xr.DataArray | xr.Dataset

set_aef_nodata(
    data: xr.DataArray, nodata: int | float = ...
) -> xr.DataArray
set_aef_nodata(
    data: xr.Dataset, nodata: int | float = ...
) -> xr.Dataset

Return a copy with the nodata and _FillValue attributes set explicitly.

Parameters:

Name Type Description Default
data DataArray | Dataset

Input DataArray or Dataset.

required
nodata int | float

The nodata sentinel to stamp. Use AEF_NODATA_VALUE (-128) for raw/quantized embeddings, or np.nan for dequantized float data.

AEF_NODATA_VALUE

The input is not modified; a shallow copy (shared data, new attrs) is returned.
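
The "shared data, new attrs" behaviour can be pictured with a small stand-in object. `Labeled` and `with_nodata` are hypothetical and only mimic the copy semantics described above; they do not use xarray.

```python
from dataclasses import dataclass, field


@dataclass
class Labeled:  # hypothetical stand-in for a DataArray
    values: list
    attrs: dict = field(default_factory=dict)


def with_nodata(obj: Labeled, nodata=-128) -> Labeled:
    """Return a shallow copy: same values object, fresh attrs dict."""
    new_attrs = {**obj.attrs, "nodata": nodata, "_FillValue": nodata}
    return Labeled(values=obj.values, attrs=new_attrs)


src = Labeled(values=[1, 2, -128], attrs={"units": "1"})
out = with_nodata(src)
```

The copy is cheap because the values are not duplicated, and the original's attributes are left as they were.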

aef_loader.split_bands(ds: xr.Dataset, var: str = 'embeddings') -> xr.Dataset

Split a single multi-band DataArray into separate named variables (A00–A63).

This is the inverse of the compact band representation used by VirtualTiffReader.open_tiles_by_zone(). Use this when downstream code expects individual A00–A63 data variables.

Parameters:

Name Type Description Default
ds Dataset

Dataset containing a variable with a 'band' dimension

required
var str

Name of the variable to split (default: "embeddings")

'embeddings'

Returns:

Type Description
Dataset

Dataset with one variable per band (A00, A01, ..., A63)
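
The band naming scheme and the split itself can be sketched without xarray. Here a stacked array is modelled as a list with one entry per band, and `split_stack` is a hypothetical helper that illustrates the idea rather than the library's code.

```python
def band_names(count: int = 64):
    """Zero-padded band labels: A00, A01, ..., A63 for the default count."""
    return [f"A{i:02d}" for i in range(count)]


def split_stack(stacked):
    """Turn a band-major stack into a mapping of band name to band data."""
    return dict(zip(band_names(len(stacked)), stacked))


names = band_names()
# Three tiny "bands" of two pixels each, standing in for the real 64
per_band = split_stack([[1, 2], [3, 4], [5, 6]])
```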

aef_loader.reproject_datatree(tree: DataTree, target_geobox: GeoBox, resampling: str = 'nearest', dst_nodata: int | float | None = None) -> xr.Dataset

Reproject all zones in a DataTree to a common target GeoBox.

This function takes a DataTree with multiple UTM zones and reprojects each zone's dataset to a common coordinate system defined by the target GeoBox. The reprojected datasets are then combined into a single dataset.

The reprojection is lazy - it builds a dask computation graph that only executes when .compute() is called. Chunks are loaded and reprojected on-demand.

For combining zones, this uses xarray's combine_first which:

- Uses values from earlier zones where available (non-NaN)
- Fills NaN regions with values from subsequent zones
- In true overlapping regions (both have valid data), earlier zones take precedence

Since overlapping regions contain reprojections of the same underlying data, values should be identical regardless of which zone they come from.
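
The merge rule described above (earlier zone wins, later zone fills gaps) can be shown on two short rows of pixels. This is a plain-Python sketch of the element-wise rule, not a call into xarray; the `combine_first` function below is a hypothetical re-implementation for illustration.

```python
import math


def combine_first(primary, fallback):
    """Keep primary values; use fallback only where primary is NaN."""
    return [f if math.isnan(p) else p for p, f in zip(primary, fallback)]


nan = math.nan
zone_a = [1.0, nan, 3.0, nan]  # earlier zone
zone_b = [9.0, 2.0, nan, nan]  # later zone

merged = combine_first(zone_a, zone_b)
```

In the first position both zones have valid data and the earlier zone's value is kept; the last position stays NaN because neither zone covers it.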

Parameters:

Name Type Description Default
tree DataTree

DataTree with zone datasets as children (from open_tiles_by_zone)

required
target_geobox GeoBox

Target GeoBox defining the output CRS, resolution, and extent. Can be created with GeoBox.from_bbox() or from an existing dataset.

required
resampling str

Resampling method - "nearest", "bilinear", "cubic", etc. Default is "nearest" which preserves original int8 values.

'nearest'
dst_nodata int | float | None

Nodata value for the output. When None (default), xr_reproject reads the value from the source DataArray's nodata/_FillValue attrs. When set, the value is passed to xr_reproject and both nodata and _FillValue attrs are stamped on each output data variable after reprojection and after the zone merge.

None

Returns:

Type Description
Dataset

Combined xr.Dataset with all zones reprojected to the target GeoBox. Data variables remain as dask arrays until .compute() is called.

Example
from odc.geo.geobox import GeoBox

# Create target geobox (e.g., 100m resolution in EPSG:4326)
target = GeoBox.from_bbox(
    bbox=(-122.5, 37.5, -121.5, 38.5),
    crs="EPSG:4326",
    resolution=0.001,  # ~100m at this latitude
)

# Reproject all zones to target
combined = reproject_datatree(tree, target)
result = combined.compute()  # triggers actual reprojection
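
Before calling .compute(), it can help to estimate how large the target grid will be. The arithmetic below is a back-of-the-envelope sketch: it assumes the grid is simply extent divided by resolution (GeoBox.from_bbox may align or pad the extent slightly differently), uses an approximate 111,320 m per degree, and counts one byte per int8 band value.

```python
import math


def grid_shape(bbox, resolution):
    """Approximate (rows, cols) for a bbox at a given resolution in degrees."""
    minx, miny, maxx, maxy = bbox
    return round((maxy - miny) / resolution), round((maxx - minx) / resolution)


bbox = (-122.5, 37.5, -121.5, 38.5)
rows, cols = grid_shape(bbox, 0.001)

# 64 int8 bands, one byte per value, for a single year
bytes_per_year = rows * cols * 64

# Approximate pixel footprint in metres at the bbox centre latitude
METRES_PER_DEGREE = 111_320  # rough value, varies slightly with latitude
centre_lat = (bbox[1] + bbox[3]) / 2
pixel_ns_m = 0.001 * METRES_PER_DEGREE
pixel_ew_m = pixel_ns_m * math.cos(math.radians(centre_lat))
```

At 0.001 degrees this gives roughly a 1000 x 1000 grid and 64 MB of int8 data per year. Pixels are about 111 m north to south but closer to 88 m east to west at this latitude. Tightening the resolution to 0.0001 multiplies the pixel count by 100.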