XarraySplitWriter#

class XarraySplitWriter(params=None, **kwargs)#

Bases: SplitWriterMixin, XarrayWriter

Writer for Xarray datasets in multiple files automatically.

Can automatically split a dataset into the corresponding files by communicating with a source plugin that implements the HasUnfixed protocol. This is meant to work with FileFinderManager.

The time dimension is treated separately because of its complexity, and because the user can manually specify the desired time resolution of the files (otherwise we will try to guess it from the filename pattern). Resampling is avoided when we can simply loop over the time dimension (i.e. the desired frequency equals the data frequency).

The dimension names must correspond to pattern parameters.

All the resulting writing operations, the ‘calls’, can be executed serially (default behavior) or be submitted in parallel using Dask. They can be sent all at once (chop=None, default) or limited to parallel groups of smaller size that run serially one after the other.

Both parts (splitting the xarray dataset and sending calls) can be used separately, if more flexibility is needed or a feature is missing.

Parameters:

params (t.Any | None)
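As a rough illustration of the split-then-write pipeline described above (plain strings stand in for xarray datasets; the filename pattern is a hypothetical stand-in, not the real API):

```python
# A minimal sketch of the pipeline: split, build (dataset, filename) calls,
# then execute them. The datasets and filenames here are hypothetical.
datasets = ["ds_2000-01", "ds_2000-02"]                 # after splitting
calls = [(ds, f"sst_{ds[-7:]}.nc") for ds in datasets]  # writing calls
for ds, filename in calls:                              # serial execution (default)
    pass  # each dataset would be written here, e.g. with Dataset.to_netcdf()
```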

split_by_time(ds, time_freq=True)#

Split the dataset into time groups.

If the frequency of the dataset is the same as the target one, it will not be resampled, to avoid unnecessary work. There might be false positives (offsets, maybe?), in which case you should resample manually beforehand and set time_freq to False.

Parameters:
  • time_freq (str | bool) –

    If it is a string, use it as a frequency/period for xarray.Dataset.resample(). For example M will return datasets grouped by month. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#period-aliases for details on period strings.

    If False: do not resample, just return a list with one dataset for each time index.

    If True the frequency will be guessed from the filename pattern. The smallest period present will be used.

  • params – Parameters to replace for writing data.

  • ds (Dataset)

Returns:

List of datasets

Return type:

list[Dataset]
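The skip-resampling check can be sketched with pandas (the helper name is an assumption, not part of the class):

```python
import pandas as pd

def needs_resample(times, target_freq):
    # Infer the data frequency from the time index; if it already matches
    # the target, resampling can be skipped. This comparison is also where
    # false positives (e.g. around offsets) could appear.
    return pd.infer_freq(times) != target_freq

times = pd.date_range("2000-01-01", periods=10, freq="D")
print(needs_resample(times, "D"))   # already daily: no resampling needed
print(needs_resample(times, "MS"))  # monthly target: grouping is needed
```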

split_by_unfixed(ds)#

Use parameters in the filename pattern to guess how to group.

The dataset is split into sub-datasets such that each sub-dataset corresponds to a unique combination of unfixed parameter values, which will give a unique filename.

Coordinates whose names do not correspond to an unfixed group in the filename pattern will be written in full to each file.

Parameters:

ds (Dataset)

Return type:

list[Dataset]
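The grouping idea can be sketched with the standard library (the parameter names and values below are hypothetical):

```python
from itertools import product

# Hypothetical unfixed parameters found in a filename pattern, with the
# values they take in the dataset's coordinates:
unfixed = {"member": [1, 2, 3], "scenario": ["a", "b"]}

# Each unique combination of values maps to one sub-dataset and thus
# one unique filename:
combos = list(product(*unfixed.values()))
print(len(combos))  # 6 files: one per (member, scenario) pair
```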

time_intervals_groups = {'B': 'MS', 'F': 'D', 'H': 'H', 'M': 'min', 'S': 'S', 'X': 'S', 'Y': 'YS', 'd': 'D', 'j': 'D', 'm': 'MS', 'x': 'D'}#

Correspondence between pattern element names and pandas frequencies.

The pattern names are arranged in increasing order.
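Picking the smallest period present in a pattern could look like this (the explicit coarse-to-fine ordering of frequencies below is an assumption for illustration):

```python
# Assumed ordering of pandas frequencies, from coarse to fine:
order = ["YS", "MS", "D", "H", "min", "S"]

# e.g. a pattern containing year, month and day elements maps to:
present = ["YS", "MS", "D"]

# The finest (smallest) period present decides the file frequency:
finest = max(present, key=order.index)
print(finest)  # the day frequency wins, so files are split daily
```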

to_calls(datasets, squeeze=False)#

Transform a sequence of datasets into writing calls.

A writing call is a tuple of a dataset and the filename to write it to.

Parameters:
  • squeeze (bool | str | abc.Mapping[abc.Hashable, bool | str]) –

    How to squeeze dimensions of size one. If False, dimensions are left as is. If True, squeeze. If equal to “drop”, the squeezed coordinate is dropped instead of being kept as a scalar.

    This can be configured per dimension with a mapping from dimension names to a squeeze argument.

  • datasets (abc.Sequence[xr.Dataset])

Return type:

list[CallXr]
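Resolving the per-dimension squeeze argument from a mapping can be sketched as follows (the helper is hypothetical; the False default for dimensions missing from the mapping is an assumption):

```python
def squeeze_for(dim, squeeze):
    # A mapping gives a per-dimension setting; dimensions missing from it
    # default here to False (left as is) -- an assumption. A scalar value
    # (False, True or "drop") applies to every dimension.
    if isinstance(squeeze, dict):
        return squeeze.get(dim, False)
    return squeeze

print(squeeze_for("lat", {"lat": "drop", "lon": True}))  # 'drop'
print(squeeze_for("time", {"lat": "drop"}))              # False
print(squeeze_for("time", True))                         # True
```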

write(data, target=None, time_freq=True, squeeze=False, client=None, chop=None, **kwargs)#

Write data to disk.

First split the dataset following the parameters that vary in the filename pattern. Then split each resulting dataset along the time dimension (if present), according to the time_freq argument. Dimensions of size one that remain are squeezed according to the squeeze argument value.

Each cut dataset is written to its corresponding filename. Directories will automatically be created if necessary.

Parameters:
  • data (xr.Dataset) – Data to write.

  • target (None) – Cannot be used here. Use XarrayWriter instead.

  • time_freq (str | bool) –

    If it is a string, use it as a frequency/period for xarray.Dataset.resample(). For example M will return datasets grouped by month. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#period-aliases for details on period strings. If the frequency of the dataset is the same as the target one, it will not be resampled, to avoid unnecessary work. There might be false positives (offsets, maybe?), in which case you should resample manually beforehand and set time_freq to False.

    If False: do not resample, just return a list with one dataset for each time index.

    If True the frequency will be guessed from the filename pattern. The smallest period present will be used.

  • squeeze (bool | str | abc.Mapping[abc.Hashable, bool | str]) –

    How to squeeze dimensions of size one. If False, dimensions are left as is. If True, squeeze. If equal to “drop”, the squeezed coordinate is dropped instead of being kept as a scalar.

    This can be configured per dimension with a mapping from dimension names to a squeeze argument.

  • client (Client | None) – Dask distributed.Client instance. If given, multiple write calls will be sent in parallel. See send_calls_together() for details. If left to None, the write calls will be sent serially.

  • chop (int | None) – If None (default), all calls are sent together. If chop is an integer, groups of calls of size chop (at most) will be sent one after the other, calls within each group being run in parallel.

  • kwargs – Passed to the function that writes to disk (xarray.Dataset.to_netcdf()).
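The chop behavior can be sketched as follows (the helper name is an assumption, not part of the class):

```python
def chop_calls(calls, chop=None):
    # chop=None: a single group containing all calls, sent together.
    if chop is None:
        return [calls]
    # Otherwise, groups of at most `chop` calls, sent one after the other;
    # the calls inside each group are the ones run in parallel.
    return [calls[i:i + chop] for i in range(0, len(calls), chop)]

print(chop_calls(list(range(7)), chop=3))  # [[0, 1, 2], [3, 4, 5], [6]]
```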