Existing modules#

The Dataset class is expected to be populated by modules (see Module system). Here is a quick description of modules that are bundled with this package.

The features managed by the modules presented below inherit from abstract classes that define the API for each feature. Note that these classes are not defined through the abc module, and thus will not raise if instantiated: they are guidelines rather than strict protocols.

Note

For developers

Nevertheless it is advised to keep a common signature for module subclasses, relying on keyword arguments if necessary. This helps ensure interoperability between modules and easy substitution of module types.
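The convention can be illustrated with two entirely hypothetical module classes: because they share a keyword-based signature, one can be substituted for the other without touching the calling code.

```python
class LocalSource:
    """Hypothetical module reading from local disk."""

    def get_source(self, **kwargs):
        # keyword arguments keep the signature uniform across modules
        return "/data/" + kwargs.get("name", "default") + ".nc"


class RemoteSource:
    """Hypothetical module reading from a remote store."""

    def get_source(self, **kwargs):
        return "s3://bucket/" + kwargs.get("name", "default") + ".zarr"


def fetch(source, **kwargs):
    """Works with either module thanks to the shared signature."""
    return source.get_source(**kwargs)
```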

Parameters#

Dict#

ParamsManagerDict stores the parameters in a dictionary. Technically, it is a subclass of dict with a callback set up to void the modules' cache when a parameter is changed.

The callback is called only when setting a value that is new or different from the old one. Any in-place change to a mutable value (a list or a dict) will not register:

# Will void cache
dm.params["a"] = 0
dm.params["nested"] = {"b": 1}

# Will *not* void cache
dm.params["a"] = 0  # no change
dm.params["nested"]["b"] = 2
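The mechanism can be sketched with a minimal dict subclass (a simplified stand-in for ParamsManagerDict; the actual implementation may differ):

```python
class CallbackDict(dict):
    """Sketch: a dict that runs a callback when a value is added or changed."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache_voided = False  # stands in for voiding the modules' cache

    def __setitem__(self, key, value):
        # only a new key or a different value triggers the callback
        if key not in self or self[key] != value:
            self.cache_voided = True
        super().__setitem__(key, value)


params = CallbackDict()
params["a"] = 0              # new key: callback fires
params["a"] = 0              # unchanged value: callback skipped
params["nested"] = {"b": 1}  # new key: callback fires
params["nested"]["b"] = 2    # in-place mutation: never passes through __setitem__
```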

Section#

ParamsManagerSection stores the parameters in a Section object. The Section class is specified by the SECTION_CLS attribute and defaults to an empty Section. When initialized, the module creates a new SECTION_CLS instance and updates it with the arguments passed to it.

It can be defined with:

class MyDataset(Dataset):

    Params = ParamsManagerSection.new(MySection)

The section has a callback set up so that any modification triggers a cache void (except in-place modification of mutable values):

# Will void cache
dm.params["a"] = 0
dm.params.a = 1
dm.params.nested.b = 1
dm.params["my_list"] = [0]

# Will *not* void cache
dm.params["a"] = 1  # no change
dm.params["my_list"].append(1)

App#

ParamsManagerApp stores its parameters in an ApplicationBase object. An application must be supplied as an argument. It is copied, so any modification of the dataset parameters will not affect the original application instance.
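The copy semantics can be illustrated with plain copy.deepcopy (a stand-in for however ParamsManagerApp actually copies the application; the App class below is made up for the example):

```python
import copy


class App:
    """Hypothetical application holding one parameter."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold


original = App()
stored = copy.deepcopy(original)  # the manager keeps its own copy
stored.threshold = 0.9            # modifying the stored parameters...
# ...leaves the original application instance untouched
```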

Tip

Typechecking

The specific application class does not need to be specified, but can be type-hinted with:

class MyApp(ApplicationBase):
    ...

class MyDataset(Dataset):
    ParamsManager = ParamsManagerApp
    params_manager: ParamsManagerApp[MyApp]

Source#

The data is found from a source: one or more files on disk, or a remote data-store for instance.

Simple#

For simple cases where you do not need a full method, SimpleSource will just return its source_loc attribute.
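Its behavior amounts to the following sketch (the class name, attribute value, and method name here are illustrative, assuming the source-retrieval method simply returns the attribute):

```python
class SimpleSourceSketch:
    """Sketch of SimpleSource semantics: the source is a plain attribute."""

    source_loc = "/data/my_file.nc"  # hypothetical location

    def get_source(self):
        return self.source_loc
```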

MultiFile#

For datasets consisting of multiple files, the package provides two modules that follow the abstract class MultiFileSource. For both of them, the user should implement get_root_directory(), which returns the directory containing the files (as a path, or as a list of sub-folders that will be joined).
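The list form presumably gets joined into a single path along these lines (a sketch, not the package's actual code):

```python
import os


def resolve_root(root):
    """Sketch: accept a path as-is, or join a list of sub-folders."""
    if isinstance(root, str):
        return root
    return os.path.join(*root)
```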

Glob#

The module GlobSource can find files on disk that follow a given pattern, defined by get_glob_pattern(). Files on disk matching the pattern are cached and available at datafiles(). For instance:

class MyDataManager(DataManagerBase):

    class Source(GlobSource):
        def get_root_directory(self):
            return ["/data", self.params["user"], "subfolder"]

        def get_glob_pattern(self):
            return "SST_*.nc"

files = MyDataManager().get_source()
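The matching step can be sketched with fnmatch (GlobSource itself scans the disk and caches the result; this only illustrates the glob-pattern semantics):

```python
import fnmatch


def match_datafiles(filenames, pattern):
    """Sketch: keep only the filenames matching a glob pattern, sorted."""
    return sorted(f for f in filenames if fnmatch.fnmatch(f, pattern))
```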

FileFinder#

For a similar scenario of a dataset split across many files (for different dates, variables, or parameter values), an even more precise solution is provided by FileFinderSource. This module relies on the filefinder package to find files according to a specific filename pattern. For instance:

class MyDataManager(DataManagerBase):

    class Source(FileFinderSource):
        def get_root_directory(self):
            return ["/data", self.params["user"], "subfolder"]

        def get_filename_pattern(self):
            return "SST_%(depth:fmt=.1f)_%(Y)%(m)%(d).nc"

This module has several advantages over a simple glob pattern: its filename pattern can define parameters with specific formatting, and it can “fix” some parameters to restrict its search. With the same example as above, we can select only the files for a specific depth:

MyDataManager(depth=10.0).get_source()

If we fix all parameters we can also generate a filename for a given set of parameters:

MyDataManager(depth=10.0).source.get_filename(Y=2015, m=5, d=1)
# or equivalent:
MyDataManager(depth=10.0, Y=2015, m=5, d=1).source.get_filename()
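For the pattern above, the generated filename should look like the following sketch in plain Python (assuming standard zero-padded date formatting; filefinder handles the actual formatting internally):

```python
def sketch_filename(depth, Y, m, d):
    """Sketch: filename from the pattern "SST_%(depth:fmt=.1f)_%(Y)%(m)%(d).nc"."""
    return f"SST_{depth:.1f}_{Y:04d}{m:02d}{d:02d}.nc"
```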

See the filefinder documentation for more details on its features.

Xarray#

A collection of modules for interfacing with Xarray is available in data_assistant.data.xarray. This submodule is not imported in the top-level package, to avoid importing Xarray unless needed.

Loaders#

XarrayLoader will load either from a single file or store with open_dataset(), or from multiple files using open_mfdataset().

Options for these functions can be changed in the attributes OPEN_DATASET_KWARGS and OPEN_MFDATASET_KWARGS:

class MyDataset(Dataset):
    class Loader(XarrayLoader):
        OPEN_MFDATASET_KWARGS = dict(...)
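For instance, one might set real open_mfdataset() options (the keywords below are standard xarray arguments; whether they suit your data is up to you):

```python
# hypothetical option set for XarrayLoader.OPEN_MFDATASET_KWARGS
OPEN_MFDATASET_KWARGS = dict(
    combine="by_coords",  # merge files based on their coordinates
    parallel=True,        # open files in parallel with Dask
)
```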

Writers#

XarrayWriter allows writing to either a single file/store, or to multiple files if given a sequence of datasets. It guesses the function to use from the file extension. It currently supports Zarr and NetCDF.

Note

The write() method will automatically add metadata to the dataset attributes via add_metadata().

When writing data across multiple files or stores, if given a Dask client argument, it will use send_calls_together() to execute multiple writing operations in parallel.

Important

Doing so is not straightforward: it may fail on some filesystems with permission errors. Using the scratch filesystem on a cluster might solve this issue. See the send_calls_together() documentation for details on the implementation.

When writing to multiple files, the XarrayWriter module needs multiple datasets and their respective target files. XarraySplitWriter simplifies the writing process further by automatically splitting a dataset across files. It must be paired with a source-managing module that implements the Splitable protocol: some parameters can be left unspecified, and the dataset will be split along them; the module must also be able to return a filename given values for those unspecified parameters. FileFinderSource can be used for that purpose. For instance, we can split a dataset along its depth dimension and automatically group by month, using a dataset along the lines of:

>>> ds
<xarray.Dataset>
Dimensions:              (time: 365, depth: 50, lat: 4320, lon: 8640)
Coordinates:
* time                 (time) datetime64[ns] 2020-01-01 ... 2020-12-31
* depth                (depth) int64 0 1 5 10 20 40 ... 500 750 1000
* lat                  (lat) float32 89.98 89.94 89.9 ... -89.9 -89.94 -89.98
* lon                  (lon) float32 -180.0 -179.9 -179.9 ... 179.9 180.0
Data variables:
    temp                  (time, depth, lat, lon) float32 dask.array<chunksize=(1, 1, 4320, 8640), meta=np.ndarray>

and a data manager defined as:

class MyDataset(Dataset):

    Writer = XarraySplitWriter

    class Source(FileFinderSource):
        def get_root_directory(self):
            return "/data/directory/"

        def get_filename_pattern(self):
            """Yearly folders, date as YYYYMM and depth as integer."""
            return "%(Y)/temp_%(Y)%(m)_depth_%(depth:fmt=d).nc"

we can then simply call MyDataset().write(ds). Note that this will detect that the smallest time parameter in the filename pattern is the month, and will split the dataset appropriately using xarray.Dataset.resample(). This can be specified manually, or avoided altogether. See the XarraySplitWriter.write() documentation for details.
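The monthly grouping can be sketched with plain datetimes (XarraySplitWriter works on the xarray dataset itself via resample(); this only illustrates the grouping logic):

```python
from collections import defaultdict
from datetime import date, timedelta


def group_by_month(dates):
    """Sketch: bucket dates by (year, month), as a monthly split would."""
    groups = defaultdict(list)
    for d in dates:
        groups[(d.year, d.month)].append(d)
    return dict(groups)


# a (leap) year of daily dates falls into 12 monthly buckets
days = [date(2020, 1, 1) + timedelta(days=i) for i in range(366)]
```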

Note

If the overall write() implementation is not appropriate, the splitting process can be controlled more finely using split_by_unfixed() and split_by_time(). The “time” dimension is split separately to account for the fact that a filename pattern defines separate datetime elements (the year, the month, the day, …).