erlab.io.dataloader

Base functionality for implementing data loaders.

This module provides a base class LoaderBase for implementing data loaders. Data loaders are plugins used to load data from various file formats.

Each data loader is a subclass of LoaderBase that must implement several methods and attributes.

A detailed guide on how to implement a data loader can be found in the User Guide.

Classes

LoaderBase()

Base class for loader plugins.

LoaderRegistry([state])

Registry of loader plugins.

Exceptions

LoaderNotFoundError(key)

Raised when a loader is not found in the registry.

UnsupportedFileError(loader, file_path)

Raised when the loader does not support the given file extension.

ValidationError

Raised when the loaded data fails validation checks.

ValidationWarning

Issued when the loaded data fails validation checks.

class erlab.io.dataloader.LoaderBase[source]

Bases: object

Base class for loader plugins.

name: str

Name of the loader. Using a unique and descriptive name is recommended. For easy access, it is recommended to use a name that passes str.isidentifier().

Notes

  • Changing the name of a loader is not recommended as it may break existing code. Pick a simple, descriptive name that is unlikely to change.

  • Loaders with the name prefixed with an underscore are not registered.

description: str

A short description of the loader shown to users.

Added in version 3.3.0.

aliases: Iterable[str] | None = None

Alternative names for the loader.

Deprecated since version 3.3.0: Accessing loaders with aliases is deprecated and will be removed in a future version. Use the loader name instead.

extensions: ClassVar[set[str] | None] = None

File extensions supported by the loader in lowercase with the leading dot.

An UnsupportedFileError is raised if a file with an unsupported extension is passed to the loader. If None, the loader will attempt to load any file passed to it.

If the loader supports directories, the extension should be an empty string.

Added in version 3.5.1.
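
The extension check can be pictured with a minimal sketch. This is not erlab's implementation: `check_extension` is a hypothetical helper, the exception class is a local stand-in for UnsupportedFileError, and the case-insensitive comparison is an assumption based on the extensions being stored in lowercase.

```python
from pathlib import Path


class UnsupportedFileError(Exception):
    """Local stand-in for the exception of the same name."""


def check_extension(file_path, extensions):
    """Raise if the suffix of ``file_path`` is not listed in ``extensions``.

    ``extensions`` mirrors the class attribute: a set of lowercase suffixes
    with the leading dot, or None to accept every file.
    """
    if extensions is None:
        return
    suffix = Path(file_path).suffix.lower()
    if suffix not in extensions:
        raise UnsupportedFileError(
            f"{str(file_path)!r} has unsupported extension {suffix!r}"
        )


check_extension("scan_001.PXT", {".pxt", ".ibw"})  # suffix is lowercased first
check_extension("scan_dir", {""})  # a directory-like path has an empty suffix
```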

name_map: ClassVar[dict[str, str | Iterable[str]]] = {}

Dictionary that maps new coordinate or attribute names to original coordinate or attribute names. If there are multiple possible names for a single attribute, the value can be passed as an iterable.

Note

  • Non-dimension coordinates in the resulting data will try to follow the order of the keys in this mapping.

  • Original coordinate names included in this mapping will be replaced by the new names. However, original attribute names will be duplicated with the new names so that both the original and new names are present in the data after loading. This is to keep track of the original names for reference.
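
The renaming rules in the note can be illustrated with plain dictionaries. This is a rough sketch, not the library's code: `reverse_name_map` and `rename` are hypothetical helpers, and the original names in the example mapping are made up for illustration.

```python
def reverse_name_map(name_map):
    """Invert {new: original or iterable of originals} into {original: new}."""
    reversed_map = {}
    for new, originals in name_map.items():
        if isinstance(originals, str):
            originals = [originals]
        for original in originals:
            reversed_map[original] = new
    return reversed_map


def rename(coords, attrs, name_map):
    """Replace coordinate names; duplicate attributes under the new name."""
    mapping = reverse_name_map(name_map)
    new_coords = {mapping.get(key, key): value for key, value in coords.items()}
    new_attrs = dict(attrs)
    for original, new in mapping.items():
        if original in attrs:
            new_attrs[new] = attrs[original]  # the original key is kept as well
    return new_coords, new_attrs


# Hypothetical mapping: one new name may list several possible original names
name_map = {"eV": "Kinetic Energy", "hv": ("BL Energy", "Photon Energy")}
coords, attrs = rename(
    coords={"Kinetic Energy": [16.0, 16.1]},
    attrs={"BL Energy": 21.2},
    name_map=name_map,
)
```

After the call, `coords` only contains the new name `eV`, while `attrs` holds both `BL Energy` and `hv`.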

coordinate_attrs: tuple[str, ...] = ()

Attribute names (after renaming) that should be treated as coordinates.

Put any attributes that should be propagated when concatenating data here.

Notes

  • If a listed attribute is not found, it is silently skipped.

  • The attributes given here, both before and after renaming, are removed from the attributes to avoid conflicting values.

  • If an existing coordinate with the same name is already present, the existing coordinate takes precedence and the attribute is silently dropped.

See also

process_keys
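
As a simplified picture of the notes above, the following sketch promotes attributes to coordinates using plain dictionaries in place of a DataArray. `promote_attrs` is a hypothetical helper and the attribute names are made up; the handling of names before renaming is left out.

```python
def promote_attrs(coords, attrs, coordinate_attrs):
    """Move the listed attributes into the coordinate mapping."""
    coords, attrs = dict(coords), dict(attrs)
    for key in coordinate_attrs:
        if key not in attrs:
            continue  # missing attributes are skipped silently
        value = attrs.pop(key)  # removed from the attributes in either case
        if key not in coords:
            coords[key] = value  # an existing coordinate takes precedence
    return coords, attrs


coords, attrs = promote_attrs(
    coords={"beta": [0.0, 0.5]},
    attrs={"beta": 0.0, "hv": 21.2, "comment": "test"},
    coordinate_attrs=("beta", "hv", "polar"),
)
```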

average_attrs: tuple[str, ...] = ()

Names of attributes or coordinates (after renaming) that should be averaged over.

This is useful for attributes that may slightly vary between scans.

Notes

  • If a listed attribute is not found, it is silently skipped.

  • Attributes listed here are first treated as coordinates in process_keys, and then averaged in post_process.

additional_attrs: ClassVar[dict[str, str | float | datetime | Callable[[DataArray], str | float | datetime]]] = {}

Additional attributes to be added to the data after loading.

If a callable is provided, it will be called with the data as the only argument.

Notes

  • The attributes are added after renaming with process_keys, so keys will appear in the data as provided.

  • If an attribute with the same name is already present in the data, it is skipped unless the key is listed in overridden_attrs.

overridden_attrs: tuple[str, ...] = ()

Keys in additional_attrs that should override existing attributes.
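
The interplay between additional_attrs and overridden_attrs might look like the sketch below. It is an illustration under assumptions rather than the actual implementation: `apply_additional` is a hypothetical helper, the attribute names are invented, and a dictionary stands in for the loaded data.

```python
def apply_additional(existing, additional, overridden, data):
    """Return ``existing`` extended with entries from ``additional``.

    Callables are evaluated with ``data``. Keys that already exist are kept
    unless they are listed in ``overridden``.
    """
    result = dict(existing)
    for key, value in additional.items():
        if key in result and key not in overridden:
            continue
        result[key] = value(data) if callable(value) else value
    return result


data = {"pass_energy": 10}  # stand-in for the loaded DataArray
attrs = apply_additional(
    existing={"configuration": 3, "temp_sample": 20.0},
    additional={
        "configuration": 1,
        "temp_sample": 15.0,
        "analyzer_mode": lambda d: f"PE{d['pass_energy']}",
    },
    overridden=("configuration",),
    data=data,
)
```

Here `configuration` is replaced because it is listed as overridden, `temp_sample` keeps its existing value, and `analyzer_mode` is computed from the data.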

additional_coords: ClassVar[dict[str, str | float | datetime | Callable[[DataArray], str | float | datetime]]] = {}

Additional coordinates to be added to the data after loading.

If a callable is provided, it will be called with the data as the only argument.

Notes

  • The coordinates are added after renaming with process_keys, so keys will appear in the data as provided.

  • If a coordinate with the same name is already present in the data, it is skipped unless the key is listed in overridden_coords.

overridden_coords: tuple[str, ...] = ()

Keys in additional_coords that should override existing coordinates.

always_single: bool = True

Setting this to True disables implicit loading of multiple files for a single scan. This is useful for setups where each scan is always stored in a single file.

parallel_threshold: int = 30

Minimum number of files in a scan to use parallel loading. If the number of files is less than this threshold, files are loaded sequentially.

Only used when always_single is False.

skip_validate: bool = False

If True, validation checks will be skipped. If False, data will be checked with validate.

strict_validation: bool = False

If True, validation checks will raise a ValidationError on the first failure instead of warning. Useful for debugging data loaders. This has no effect if skip_validate is True.

formatters: ClassVar[dict[str, Callable]] = {}

Optional mapping from attr or coord names (after renaming) to custom formatters.

The formatters are callables that take the attribute value and return a value that can be converted to a string via value_to_string. The resulting string representations are used for human-readable display in the summary table and the information accessor.

The values returned by the formatters will be further formatted by value_to_string before being displayed.

If the key is a coordinate, the function will automatically be vectorized over every value.

Note

The formatters are only used for display purposes and do not affect the stored data.

See also

get_formatted_attr_or_coord()

The method that uses this mapping to provide human-readable values.
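
The two-stage formatting described above can be sketched as follows. The functions are simplified stand-ins, not the loader methods themselves, and the keys and lens mode labels are hypothetical.

```python
def value_to_string(val):
    """Stand-in for the final string conversion applied to every value."""
    if isinstance(val, float):
        return f"{val:.3f}"
    return str(val)


formatters = {
    # Hypothetical keys: label an integer lens mode, round a temperature
    "lens_mode": lambda v: {0: "Transmission", 1: "Angular"}.get(v, v),
    "temp_sample": lambda v: round(v, 1),
}


def formatted(attrs, key):
    """Apply the formatter for ``key`` (if any), then ``value_to_string``."""
    if key not in attrs:
        return ""  # missing keys give an empty string
    value = attrs[key]
    if key in formatters:
        value = formatters[key](value)
    return value_to_string(value)


attrs = {"lens_mode": 1, "temp_sample": 20.04, "hv": 21.2}
lens_text = formatted(attrs, "lens_mode")
temp_text = formatted(attrs, "temp_sample")
```

The stored `attrs` dictionary is left untouched; only the displayed strings change.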

summary_sort: str | None = None

Optional default column to sort the summary table by.

If None, the summary table is sorted in the order of the files returned by files_for_summary.

property summary_attrs: dict[str, str | Callable[[DataArray], Any]]

Mapping from summary column names to attr or coord names (after renaming).

If the value is a callable, it will be called with the data as the only argument. This can be used to extract values from the data that are not stored as attributes or spread across multiple attributes.

If not overridden, returns a basic mapping based on name_map.

It is highly recommended to override this property to provide a more detailed and informative summary. See existing loaders for examples.

property file_dialog_methods: dict[str, tuple[Callable, dict[str, Any]]]

Map from file dialog names to the loader method and its arguments.

Override this property in the subclass to provide support for loading data from the load menu of the ImageTool GUI.

Returns:

loader_mapping (dictionary of str to tuple of (callable, dict)) – A dictionary mapping the file dialog names to a tuple of length 2 containing the data loading function and arguments.

The keys should be the names of the file dialog options passed to setNameFilter.

The first item of the value tuple should be a callable that takes the first positional argument as a path to a file, usually self.load.

The second item should be a dictionary containing keyword arguments to be passed to the method.

Multiple key-value pairs can be returned to provide multiple options.

Example

For instance, the loader for ALS BL4 implements the following mapping which enables loading .pxt and .ibw files within ImageTool using self.load with no keyword arguments:

@property
def file_dialog_methods(self):
    return {"ALS BL4.0.3 Raw Data (*.pxt, *.ibw)": (self.load, {})}

classmethod value_to_string(val)[source]

Format the given value based on its type.

The default behavior formats the given value with erlab.utils.formatting.format_value(). Override this classmethod to change the printed format of summaries and information accessors. This method is applied after the formatters in formatters.

classmethod get_styler(df)[source]

Return a styled version of the given dataframe.

This method, along with value_to_string, determines the display formatting of the summary dataframe. Override this classmethod to change the display style.

Parameters:

df – The summary dataframe.

Returns:

pandas.io.formats.style.Styler – The styler to be displayed.

Return type:

pandas.io.formats.style.Styler

load(identifier, data_dir=None, *, chunks=None, single=False, combine=True, parallel=None, progress=True, load_kwargs=None, loader_extensions=None, **kwargs)[source]

Load ARPES data.

This method is the main entry point for loading ARPES data.

Note

This method is not meant to be overridden in subclasses.

Parameters:
  • identifier (str | PathLike | int) –

    Value that identifies a scan uniquely.

    • If a string or path-like object is given, it is assumed to be the path to the data file relative to data_dir. If data_dir is not specified, identifier is assumed to be the full path to the data file.

    • If an integer is given, it is assumed to be a number that specifies the scan number, and is used to automatically determine the path to the data file(s). In this case, the data_dir argument must be specified.

  • data_dir (str | PathLike | None, default: None) –

    Where to look for the data. Must be a path to a valid directory. This argument is required when identifier is an integer.

    When called as erlab.io.load(), this argument defaults to the value set by erlab.io.set_data_dir() or erlab.io.loader_context().

  • chunks (int | dict | Literal['auto'] | tuple[int, ...] | None, default: None) – Chunking strategy for loading data with dask for supported loaders.

  • single (bool, default: False) –

    This argument is only used when always_single is False, and identifier is given as a string or path-like object.

    If identifier points to a file that is included in a multiple file scan, the default behavior when single is False is to return data from all files in the same scan. How the data is combined is determined by the combine argument. If True, only the data from the file given is returned.

  • combine (bool, default: True) –

    Whether to attempt to combine multiple files into a single data object. If False, a list of data is returned. If True, the loader tries to combine the data into a single data object and return it. Depending on the type of each data object, the returned object can be an xarray.DataArray, an xarray.Dataset, or an xarray.DataTree.

    This argument is only used when single is False.

  • parallel (bool | None, default: None) –

    Whether to load multiple files in parallel using dask. For possible values, see load_multiple_parallel.

    This argument is only used when single is False.

  • progress (bool, default: True) –

    Whether to show a progress bar when loading multiple files.

    This argument is only used when single is False.

  • load_kwargs (dict[str, Any] | None, default: None) – Additional keyword arguments to be passed to load_single. You can also pass additional keyword arguments directly to load, and they will be dispatched to either identify or load_single based on their signatures. See the **kwargs argument for details.

  • loader_extensions (Mapping[str, Any] | None, default: None) – Temporary extensions to loader attributes, with the same keys accepted by extend_loader.

  • **kwargs – Additional keyword arguments are passed to identify and load_single based on their signatures. If a keyword argument is accepted by both methods, it is passed to identify. Use the load_kwargs argument to pass an ambiguous keyword argument to load_single.
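
The signature-based dispatch of keyword arguments can be approximated with inspect.signature. This is only a sketch of the rule stated above: `split_kwargs` and the two example methods are hypothetical, and raising TypeError for an unknown keyword is an assumption about error handling.

```python
import inspect


def split_kwargs(identify, load_single, kwargs, load_kwargs=None):
    """Split ``kwargs`` between two callables based on their signatures."""
    identify_params = inspect.signature(identify).parameters
    single_params = inspect.signature(load_single).parameters

    identify_kwargs = {}
    single_kwargs = dict(load_kwargs or {})  # explicit load_kwargs always go here
    for key, value in kwargs.items():
        if key in identify_params:
            identify_kwargs[key] = value  # wins when both methods accept the key
        elif key in single_params:
            single_kwargs[key] = value
        else:
            raise TypeError(f"unexpected keyword argument {key!r}")
    return identify_kwargs, single_kwargs


# Hypothetical loader methods used only to demonstrate the dispatch
def identify(num, data_dir, prefix=None, region=None): ...
def load_single(file_path, region=None, without_values=False): ...


to_identify, to_load_single = split_kwargs(
    identify,
    load_single,
    kwargs={"prefix": "f", "region": "A", "without_values": True},
    load_kwargs={"region": "B"},
)
```

The ambiguous `region` keyword goes to `identify`, while the value given through `load_kwargs` reaches `load_single`.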

Returns:

xarray.DataArray or xarray.Dataset or xarray.DataTree – The loaded data. If combine is False and multiple files are loaded, a list of such objects is returned instead.

Return type:

DataArray | Dataset | DataTree | list[DataArray] | list[Dataset] | list[DataTree]

Notes

  • The data_dir set by erlab.io.set_data_dir() or erlab.io.loader_context() is only used when called as erlab.io.load(). When called directly on a loader instance, the data_dir argument must be specified.

  • For convenience, the data_dir set by erlab.io.set_data_dir() or erlab.io.loader_context() is silently ignored when all of the following are satisfied:

    • identifier is an absolute path to an existing file.

    • data_dir is not explicitly provided.

    • The path created by joining data_dir and identifier does not point to an existing file.

    This way, absolute file paths can be passed directly to the loader without changing the default data directory. For instance, consider the following directory structure.

    cwd/
    ├── data/
    └── example.txt

    The following code will load ./example.txt instead of raising an error that ./data/example.txt is missing:

    import erlab

    erlab.io.set_data_dir("data")
    erlab.io.load("example.txt")

    However, if ./data/example.txt also exists, the same code will load that file instead and warn about the ambiguity. This behavior may lead to unexpected results when the directory structure is not well organized. Keep this in mind and try to keep all data files at the same level.
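
The fallback rule for the default data directory can be summarized in a small function. This is a sketch of the documented behavior, not the actual code: `resolve_path` and its `cwd` parameter are hypothetical, and the ambiguity warning is omitted.

```python
import tempfile
from pathlib import Path


def resolve_path(identifier, default_data_dir=None, data_dir=None, cwd="."):
    """Return the path that a string identifier would resolve to."""
    candidate = Path(cwd) / identifier  # the identifier taken on its own
    if data_dir is not None:
        return Path(cwd) / data_dir / identifier  # explicit data_dir always wins
    if default_data_dir is None:
        return candidate
    joined = Path(cwd) / default_data_dir / identifier
    if candidate.is_file() and not joined.is_file():
        return candidate  # the default directory is ignored
    return joined


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "example.txt").write_text("raw")
    only_outside = resolve_path("example.txt", default_data_dir="data", cwd=root)

    # Once the file also exists inside the data directory, that copy is used
    (root / "data" / "example.txt").write_text("raw")
    both_exist = resolve_path("example.txt", default_data_dir="data", cwd=root)

    results = (only_outside.relative_to(root), both_exist.relative_to(root))
```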

extend_loader(*, name_map=None, coordinate_attrs=None, average_attrs=None, additional_attrs=None, overridden_attrs=None, additional_coords=None, overridden_coords=None)[source]

Context manager that temporarily extends various loader attributes.

This context manager can be used to temporarily customize the behavior of the data loader. This is particularly useful when loading data across multiple files, where the coordinate_attrs can be extended so that the attributes in the data are promoted to coordinates and propagated when combining data across files.

For one-off loads, the same arguments can be passed to load or erlab.io.load() with the loader_extensions keyword. This keeps the extension settings attached to the load call, which is useful for generated loading code and ImageTool manager reload metadata.

Example

import erlab

erlab.io.set_loader("loader_name")

with erlab.io.extend_loader(coordinate_attrs=("scan_number",)):
    data = erlab.io.load("file_name")

data = erlab.io.load(
    "file_name",
    loader_extensions={"coordinate_attrs": ("scan_number",)},
)

See also

load

Load data with optional loader_extensions.

coordinate_attrs

The attribute that is temporarily extended.

summarize(data_dir, exclude=None, *, cache=True, display=True, rc=None)[source]

Summarize the data in the given directory.

Note

This method is not meant to be overridden in subclasses.

Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.

The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.

Parameters:
  • data_dir – Directory to summarize.

  • exclude (default: None) – A string or sequence of strings specifying glob patterns for files to be excluded from the summary. If provided, caching will be disabled.

  • cache (default: True) – Whether to use caching for the summary.

  • display (default: True) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.

  • rc (default: None) – Optional dictionary of matplotlib rcParams to override the default for the plot in the interactive summary. Plot options such as the figure size and colormap can be changed using this argument.

Returns:

pandas.DataFrame or pandas.io.formats.style.Styler or None – Summary of the data in the directory.

  • If display is False, the summary DataFrame is returned.

  • If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned.

    • If ipywidgets is installed, an interactive widget will be returned instead of None.

  • If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.

Return type:

pandas.DataFrame | pandas.io.formats.style.Styler | None

get_formatted_attr_or_coord(data, attr_or_coord_name)[source]

Return the formatted value of the given attribute or coordinate.

The value is formatted using the function specified in formatters.

Parameters:
  • data (DataArray) – The data to extract the attribute or coordinate from.

  • attr_or_coord_name (str or callable) – The name of the attribute or coordinate to extract. If a callable is passed, it is called with the data as the only argument.

Notes

  • Numpy datetime64 scalars are converted to pandas timestamps before formatting.

  • If the attribute or coordinate is not found, an empty string is returned.

load_single(file_path, **kwargs)[source]

Load a single file and return it as an xarray data structure.

All scan-specific postprocessing should be implemented in this method.

This method must be implemented to return the smallest possible data structure that represents the data in a single file. For instance, if a single file contains a single scan region, the method should return a single xarray.DataArray. If it contains multiple regions, the method should return an xarray.Dataset or xarray.DataTree depending on whether the regions can be merged without conflicts (i.e., all mutual coordinates of the regions are the same).

Subclasses may add additional keyword arguments to this method as needed, which can be passed through load using the load_kwargs argument.

If the loader supports dask-based lazy loading, it should add a chunks keyword argument to this method, which should be passed to the underlying data loading function (e.g., xarray.open_dataset(), xarray.open_datatree()).

Parameters:
  • file_path – Full path to the file to be loaded.

  • without_values – Used when creating a summary table. With this option set to True, only the coordinates and attributes of the output data are accessed so that the values can be replaced with placeholder numbers, speeding up the summary generation for lazy loading enabled file formats like HDF5 or NeXus.

Returns:

DataArray or Dataset or DataTree – The loaded data.

Notes

  • For loaders with always_single set to False, the return type of this method must be consistent across all associated files, i.e., for all files that can be returned together from identify so that they can be combined without conflicts. This should not be a problem in most cases since the data structure of associated files acquired during the same scan will be identical.

  • For xarray.DataTree objects, returned trees must be named with a unique identifier to avoid conflicts when combining.

identify(num, data_dir, **kwargs)[source]

Identify the files and coordinates for a given scan number.

This method takes a scan index and transforms it into a list of file paths and coordinates. See below for the expected behavior.

If no files are found for the given parameters, an empty list and an empty dictionary should be returned. Alternatively, return a single None to indicate a failure to identify the scan.

Parameters:
  • num – The index of the scan to identify.

  • data_dir – The directory containing the data.

Returns:

  • files (list of str or path-like) – A list of file paths.

    For scans spread over multiple files, the list must contain all files corresponding to the given scan index.

    For single-file scans, behavior depends on always_single. If True, all files matching the scan index should be returned, but only the first file will be loaded and a warning will be shown. If False, there is no way to tell whether returned files are part of a valid multiple-file scan. The loader must then ensure that only a single file is returned and issue appropriate warnings if multiple files are detected for a single-file scan. See erlab.io.plugins.merlin.MERLINLoader.identify() for an example.

  • coord_dict (dict of str to sequence) – A dictionary mapping scan axes names to scan coordinates.

    The keys must match the coordinate name conventions used by the data returned by load_single.

    • For scans spread over multiple files, the coordinates will be sequences, with each element corresponding to each file in files.

    • For single file scans or multiple file scans that have no well-defined scan axes (such as multi-region scans), an empty dictionary should be returned.
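
A possible shape for an identify implementation is sketched below for an invented file naming scheme: single-file scans are named like `f_0004.h5` and multi-file scans like `f_0003_S001.h5`. The scheme, the `prefix` argument, and the `slice_index` axis name are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def identify(num, data_dir, prefix="f"):
    """Find files named ``<prefix>_<num>[_S<slice>].h5`` (hypothetical scheme)."""
    pattern = re.compile(rf"{re.escape(prefix)}_{num:04d}(?:_S(\d+))?\.h5")
    matches = []
    for path in sorted(Path(data_dir).iterdir()):
        m = pattern.fullmatch(path.name)
        if m is not None:
            matches.append((path, m.group(1)))

    sliced = [(path, int(s)) for path, s in matches if s is not None]
    if sliced:
        # One coordinate value per file, in the same order as the file list
        files = [path for path, _ in sliced]
        return files, {"slice_index": [s for _, s in sliced]}
    # Single-file scan, or nothing found: no scan axes to report
    return [path for path, _ in matches], {}


with tempfile.TemporaryDirectory() as tmp:
    for name in ("f_0003_S001.h5", "f_0003_S002.h5", "f_0004.h5", "notes.txt"):
        (Path(tmp) / name).touch()
    multi_files, multi_coords = identify(3, tmp)
    single_files, single_coords = identify(4, tmp)
    missing = identify(5, tmp)
    multi_names = [p.name for p in multi_files]
    single_names = [p.name for p in single_files]
```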

infer_index(name)[source]

Infer the index for the given file name.

This method takes a file name with the path and extension stripped, and tries to infer the scan index from it. If the index can be inferred, it is returned along with additional keyword arguments that should be passed to load. If the index is not found, None should be returned for the index, and an empty dictionary for additional keyword arguments.

Parameters:

name (str) – The base name of the file without the path and extension.

Returns:

  • index – The inferred index if found, otherwise None.

  • additional_kwargs – Additional keyword arguments to be passed to identify when the index is found. This argument is useful when the index alone is not enough to load the data.

Return type:

tuple[int | None, dict[str, Any]]

Note

For loaders with always_single set to True, this method is unused.
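
As an illustration, an infer_index implementation for an invented naming scheme such as `f_0003_S001` could use a regular expression. The scheme and the `prefix` keyword are assumptions, not part of any existing loader.

```python
import re


def infer_index(name):
    """Extract the scan number from names like ``f_0003_S001`` (hypothetical)."""
    m = re.fullmatch(r"(?P<prefix>.+)_(?P<num>\d{4})(?:_S\d+)?", name)
    if m is None:
        return None, {}  # index could not be inferred
    return int(m["num"]), {"prefix": m["prefix"]}


index, extra = infer_index("f_0003_S001")
```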

files_for_summary(data_dir)[source]

Return a list of files that can be loaded by the loader.

This method is used to select files that can be loaded by the loader when generating a summary.

Parameters:

data_dir (str | PathLike) – The directory containing the data.

Returns:

list of str or path-like – A list of files that can be loaded by the loader.

Return type:

list[str | PathLike]

combine_attrs(variable_attrs, context=None)[source]

Combine multiple attributes into a single attribute.

This method is used as the combine_attrs argument in xarray.concat() and xarray.merge() when combining data from multiple files into a single object. By default, it has the same behavior as specifying combine_attrs='override' by taking the first set of attributes.

The method can be overridden to provide fine-grained control over how the attributes are combined, e.g., by merging dictionaries or taking the average of some attributes.

Parameters:
  • variable_attrs (Sequence[dict[str, Any]]) – A sequence of attributes to be combined.

  • context (Context | None, default: None) – The context in which the attributes are being combined. This has no effect, but is required by xarray.

Returns:

dict[str, typing.Any] – The combined attributes.

Return type:

dict[str, Any]
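
One way to override this method is sketched below as a plain function with the same arguments. It keeps the first set of attributes, as the default does, but averages a numeric attribute across files. The attribute name `temp_sample` is a hypothetical example.

```python
def combine_attrs(variable_attrs, context=None):
    """Take the first set of attributes, averaging selected numeric ones."""
    if not variable_attrs:
        return {}
    average_keys = ("temp_sample",)  # hypothetical attribute name
    combined = dict(variable_attrs[0])  # same starting point as 'override'
    for key in average_keys:
        values = [attrs[key] for attrs in variable_attrs if key in attrs]
        if values:
            combined[key] = sum(values) / len(values)
    return combined


merged = combine_attrs(
    [
        {"temp_sample": 20.0, "hv": 21.2},
        {"temp_sample": 22.0, "hv": 21.2},
    ]
)
```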

pre_combine_multiple(data_list, coord_dict)[source]

Pre-process data before combining multiple files.

This method is called only for loaders that support combining multiple files into a single object, i.e., loaders with always_single set to False. The default implementation returns the input data and coordinate dictionary unchanged.

Override this function to perform any necessary concatenation-specific pre-processing steps. The primary use case is to correct small inconsistencies in the loaded data that result in broken concatenation/combination.

For instance, ALS BL4.0.3 Merlin often produces data with the energy axis start and step values shifted by a small amount (typically on the order of μeV). This results in different energy values for the same scan in different files, leading to the data not being combined correctly. See the implementation of MERLINLoader.

Parameters:
  • data_list (list of DataArray or Dataset or DataTree) – A list of data objects to be pre-processed prior to combining.

  • coord_dict (dict of str to sequence) – A dictionary mapping coordinate names to sequences of coordinate values, as returned by identify.

Returns:

  • data_list (list of DataArray or Dataset or DataTree) – The pre-processed data objects.

  • coord_dict (dict of str to sequence) – The coordinate dictionary, with any necessary modifications made.

Return type:

tuple[list[DataArray] | list[Dataset] | list[DataTree], dict[str, Sequence]]

process_keys(data, key_mapping=None)[source]

Rename coordinates and attributes based on the given mapping.

This method is called by post_process. Extend or override it to customize the renaming behavior.

Parameters:
  • data (DataArray) – The data to be processed.

  • key_mapping (dict[str, str] | None, default: None) – A dictionary mapping original names to new names. If not provided, name_map_reversed is used.

post_process(darr)[source]

Post-process the given DataArray.

This method takes a single DataArray and applies post-processing steps such as renaming coordinates and attributes.

This method is called by post_process_general.

Parameters:

darr (DataArray) – The DataArray to be post-processed.

Returns:

DataArray – The post-processed DataArray.

Return type:

DataArray

Note

When introducing a custom post-processing step in a loader, make sure to call the parent method in the subclass implementation.

post_process_general(data)[source]

Post-process any data structure.

This method extends post_process to handle any data structure.

This method is called by load as the final step in the data loading process.

Parameters:

data (DataArray or Dataset or DataTree) –

The data to be post-processed.

  • If a DataArray, the data is post-processed using post_process.

  • If a Dataset, a new Dataset containing each data variable post-processed using post_process is returned. The attributes of the original Dataset are preserved.

  • If a DataTree, the post-processing is applied to each leaf node Dataset.

Returns:

DataArray or Dataset or DataTree – The post-processed data with the same type as the input.

Return type:

DataArray | Dataset | DataTree

classmethod validate(data)[source]

Validate the input data to ensure it is in the correct format.

Checks for the presence of all coordinates and attributes required for common analysis procedures like momentum conversion. If the data does not pass validation, a ValidationError is raised or a warning is issued, depending on the strict_validation flag. Validation is skipped for loaders with skip_validate set to True.

Parameters:

data (DataArray or Dataset or DataTree) – The data to be validated. If an xarray.Dataset or xarray.DataTree is passed, validation is performed on each data variable recursively.
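
How skip_validate and strict_validation interact can be sketched with the standard warnings module. The classes below are local stand-ins for ValidationError and ValidationWarning, and the list of required coordinates is a hypothetical placeholder rather than the actual set of checks.

```python
import warnings


class ValidationError(Exception):
    """Local stand-in for the exception of the same name."""


class ValidationWarning(UserWarning):
    """Local stand-in for the warning of the same name."""


REQUIRED_COORDS = ("eV", "alpha")  # hypothetical requirement list


def validate(coords, *, skip_validate=False, strict_validation=False):
    """Check for required coordinate names, warning or raising on failure."""
    if skip_validate:
        return
    for name in REQUIRED_COORDS:
        if name not in coords:
            message = f"Missing coordinate {name!r}"
            if strict_validation:
                raise ValidationError(message)  # stop at the first failure
            warnings.warn(message, ValidationWarning, stacklevel=2)
```

With the default flags, every failed check produces a warning; with `strict_validation=True`, the first failure raises instead.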

load_multiple_parallel(file_paths, *, parallel=None, progress=True, post_process=False, **kwargs)[source]

Load from multiple files in parallel.

Parameters:
  • file_paths (list[str]) – A list of file paths to load.

  • parallel (bool | None, default: None) –

    Whether to load data in parallel using dask.

    • If None, parallel loading is enabled only if the number of files is greater than the loader’s parallel_threshold.

    • If True, data loading will always be performed in parallel.

    • If False, data will be loaded sequentially.

  • progress (bool, default: True) – Whether to show a progress bar.

  • post_process (bool, default: False) – Whether to post-process each data object after loading.

  • **kwargs – Additional keyword arguments to be passed to load_single.

Returns:

A list of the loaded data.

Return type:

list[DataArray] | list[Dataset] | list[DataTree]
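
The choice between sequential and parallel loading can be sketched as follows. Note that this sketch swaps in a thread pool from the standard library in place of dask, purely to stay self-contained; `load_many` and `fake_load` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def load_many(file_paths, load_single, parallel=None, parallel_threshold=30):
    """Load every path, choosing sequential or parallel execution."""
    if parallel is None:
        # Default: go parallel only for scans with many files
        parallel = len(file_paths) > parallel_threshold
    if not parallel:
        return [load_single(path) for path in file_paths]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(load_single, file_paths))  # input order is kept


def fake_load(path):
    """Stand-in for load_single that returns a small dictionary."""
    return {"source": path}


sequential = load_many(["a.h5", "b.h5"], fake_load)
forced_parallel = load_many(["a.h5", "b.h5"], fake_load, parallel=True)
```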

exception erlab.io.dataloader.LoaderNotFoundError(key)[source]

Bases: Exception

Raised when a loader is not found in the registry.

class erlab.io.dataloader.LoaderRegistry(state=None)[source]

Bases: object

Registry of loader plugins.

Stores and manages data loaders. The loaders can be accessed by name in a dictionary-like manner or as an attribute.

Most public methods of this class instance can be accessed through the erlab.io namespace.

Examples

>>> import erlab
>>> "merlin" in erlab.io.loaders  # Check if MERLIN loader is registered
True
>>> list(erlab.io.loaders.keys())  # List registered loader names
['da30', 'erpes', ...]

Notes

  • Public methods are thread-safe.

  • Per-context state (current_loader and data_dir) uses contextvars so that concurrent threads/tasks do not step on each other.
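
The per-context behavior provided by contextvars can be demonstrated with a reduced example. The functions below are simplified stand-ins that only track a loader name; they are not the registry's actual methods.

```python
import contextvars
from contextlib import contextmanager

_current_loader = contextvars.ContextVar("current_loader", default=None)


def set_loader(name):
    """Set the loader name for the current context."""
    _current_loader.set(name)


@contextmanager
def loader_context(name):
    """Temporarily switch the loader, restoring the previous one on exit."""
    token = _current_loader.set(name)
    try:
        yield
    finally:
        _current_loader.reset(token)


set_loader("ssrl52")
with loader_context("merlin"):
    inside = _current_loader.get()
outside = _current_loader.get()

# A copied context starts from the current values and keeps its own changes
ctx = contextvars.copy_context()
ctx.run(set_loader, "da30")
in_copy = ctx.run(_current_loader.get)
after_copy = _current_loader.get()
```

Changes made inside the copied context do not leak back, which is the property that keeps concurrent tasks from interfering with each other.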

classmethod instance()[source]

Return a registry wrapper bound to the shared loader state.

keys()[source]

items()[source]

get(key)[source]

Get a loader instance by name or alias.

property current_loader: LoaderBase | None

Current loader.

property current_data_dir: Path | None

Directory to search for data files.

property default_data_dir: PathLike | None

Deprecated alias for current_data_dir.

Deprecated since version 3.0.0: Use current_data_dir instead.

set_loader(loader)[source]

Set the current data loader for the current context.

All subsequent calls to load will use the provided loader.

Parameters:

loader (str | LoaderBase | None) – The loader to set. It can be either a string representing the name or alias of the loader, or a valid loader class.

Example

>>> erlab.io.set_loader("merlin")
>>> dat_merlin_1 = erlab.io.load(...)
>>> dat_merlin_2 = erlab.io.load(...)

set_data_dir(data_dir)[source]

Set the default data directory for the current context.

All subsequent calls to erlab.io.load() will use the provided data_dir unless a data_dir argument is passed explicitly.

Parameters:

data_dir (str | PathLike | None) – The default data directory to use.

Note

This will only affect erlab.io.load(). If the loader’s load method is called directly, it will not use the default data directory.

loader_context(loader=None, data_dir=None)[source]

Context manager that temporarily sets the current loader and data directory.

Parameters:
  • loader (str, optional) – The name or alias of the loader to use in the context.

  • data_dir (str or os.PathLike, optional) – The data directory to use in the context.

Examples

  • Load data within a context manager:

    >>> with erlab.io.loader_context("merlin"):
    ...     dat_merlin = erlab.io.load(...)
    
  • Load data with different loaders and directories:

    >>> erlab.io.set_loader("ssrl52")
    >>> erlab.io.set_data_dir("/path/to/dir1")
    >>> dat_ssrl_1 = erlab.io.load(...)
    >>> with erlab.io.loader_context("merlin", data_dir="/path/to/dir2"):
    ...     dat_merlin = erlab.io.load(...)
    >>> dat_ssrl_2 = erlab.io.load(...)
    
load(*, single=False, combine=True, parallel=False, progress=True, load_kwargs=None, loader_extensions=None, **kwargs)[source]

Load ARPES data.

This method is the main entry point for loading ARPES data.

Note

This method is not meant to be overridden in subclasses.

Parameters:
  • identifier

    Value that identifies a scan uniquely.

    • If a string or path-like object is given, it is assumed to be the path to the data file relative to data_dir. If data_dir is not specified, identifier is assumed to be the full path to the data file.

    • If an integer is given, it is assumed to be a number that specifies the scan number, and is used to automatically determine the path to the data file(s). In this case, the data_dir argument must be specified.

  • data_dir (str | PathLike | None, default: None) –

    Where to look for the data. Must be a path to a valid directory. This argument is required when identifier is an integer.

    When called as erlab.io.load(), this argument defaults to the value set by erlab.io.set_data_dir() or erlab.io.loader_context().

  • chunks – Chunking strategy for loading data with dask for supported loaders.

  • single (bool, default: False) –

    This argument is only used when always_single is False, and identifier is given as a string or path-like object.

    If identifier points to a file that is part of a multi-file scan, the default behavior when single is False is to return data from all files in the same scan. How the data is combined is determined by the combine argument. If single is True, only the data from the given file is returned.

  • combine (bool, default: True) –

    Whether to attempt to combine multiple files into a single data object. If False, a list of data is returned. If True, the loader tries to combine the data into a single data object and return it. Depending on the type of each data object, the returned object can be an xarray.DataArray, an xarray.Dataset, or an xarray.DataTree.

    This argument is only used when single is False.

  • parallel (bool, default: False) –

    Whether to load multiple files in parallel using dask. For possible values, see load_multiple_parallel.

    This argument is only used when single is False.

  • progress (bool, default: True) –

    Whether to show a progress bar when loading multiple files.

    This argument is only used when single is False.

  • load_kwargs (dict[str, Any] | None, default: None) – Additional keyword arguments to be passed to load_single. You can also pass additional keyword arguments directly to load, and they will be dispatched to either identify or load_single based on their signatures. See the **kwargs argument for details.

  • loader_extensions (Mapping[str, Any] | None, default: None) – Temporary extensions to loader attributes, with the same keys accepted by extend_loader.

  • **kwargs – Additional keyword arguments are passed to identify and load_single based on their signatures. If a keyword argument is accepted by both methods, it is passed to identify. Use the load_kwargs argument to pass an ambiguous keyword argument to load_single.
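
The dispatch rule for **kwargs described above can be sketched with inspect.signature. This is an illustration under stated assumptions, not the library's code: the `identify` and `load_single` signatures below are hypothetical stand-ins, `split_kwargs` is an invented helper, and raising `TypeError` for an unknown keyword is an assumption.

```python
import inspect


def identify(num, data_dir, prefix=None):
    """Hypothetical stand-in for a loader's identify method."""


def load_single(file_path, without_values=False, prefix=None):
    """Hypothetical stand-in for a loader's load_single method."""


def split_kwargs(kwargs, load_kwargs=None):
    """Split extra keyword arguments between identify and load_single."""
    identify_params = inspect.signature(identify).parameters
    single_params = inspect.signature(load_single).parameters

    identify_kwargs = {}
    single_kwargs = dict(load_kwargs or {})  # explicit load_kwargs always go to load_single
    for key, value in kwargs.items():
        if key in identify_params:
            # A name accepted by both methods is given to identify.
            identify_kwargs[key] = value
        elif key in single_params:
            single_kwargs[key] = value
        else:
            # Assumption of this sketch: unknown names are rejected.
            raise TypeError(f"unexpected keyword argument {key!r}")
    return identify_kwargs, single_kwargs


# "prefix" is accepted by both stand-ins, so the bare keyword goes to identify,
# while the copy inside load_kwargs reaches load_single.
id_kw, single_kw = split_kwargs(
    {"prefix": "a", "without_values": True},
    load_kwargs={"prefix": "b"},
)
```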

Returns:

xarray.DataArray or xarray.Dataset or xarray.DataTree – The loaded data. If combine is False, a list of data objects is returned instead.

Return type:

DataArray | Dataset | DataTree | list[DataArray] | list[Dataset] | list[DataTree]

Notes

  • The data_dir set by erlab.io.set_data_dir() or erlab.io.loader_context() is only used when called as erlab.io.load(). When called directly on a loader instance, the data_dir argument must be specified.

  • For convenience, the data_dir set by erlab.io.set_data_dir() or erlab.io.loader_context() is silently ignored when all of the following are satisfied:

    • identifier is an absolute path to an existing file.

    • data_dir is not explicitly provided.

    • The path created by joining data_dir and identifier does not point to an existing file.

    This way, absolute file paths can be passed directly to the loader without changing the default data directory. For instance, consider the following directory structure.

    cwd/
    ├── data/
    └── example.txt

    The following code will load ./example.txt instead of raising an error that ./data/example.txt is missing:

    import erlab

    erlab.io.set_data_dir("data")
    erlab.io.load("example.txt")

    However, if ./data/example.txt also exists, the same code will load that file instead and warn about the ambiguity. This behavior may lead to unexpected results when the directory structure is not well organized. Keep this in mind and try to keep all data files at the same level.
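
The resolution rule in the notes above can be sketched with pathlib. This is a simplified illustration, not the library's implementation: `resolve_path` and its parameters are hypothetical, and the `cwd` argument stands in for the current working directory so that the example runs inside a temporary directory.

```python
import tempfile
import warnings
from pathlib import Path


def resolve_path(identifier, default_data_dir=None, data_dir=None, cwd="."):
    """Pick the file to load, following the rule described in the notes."""
    if data_dir is not None:
        # An explicitly provided data_dir is always honored.
        return Path(data_dir) / identifier
    direct = Path(cwd) / identifier
    if default_data_dir is None:
        return direct
    joined = Path(default_data_dir) / identifier
    if joined.is_file():
        if direct.is_file() and direct.resolve() != joined.resolve():
            warnings.warn(f"{identifier} is ambiguous; using {joined}", stacklevel=2)
        return joined
    if direct.is_file():
        # The default data directory is silently ignored.
        return direct
    return joined  # neither exists; the caller reports the missing file


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "example.txt").write_text("outside data/")

    # Only ./example.txt exists, so the default directory is ignored.
    picked = resolve_path("example.txt", default_data_dir=root / "data", cwd=root)
    picked_name = picked.relative_to(root).as_posix()

    # Once ./data/example.txt also exists, it wins and a warning is issued.
    (root / "data" / "example.txt").write_text("inside data/")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        picked = resolve_path("example.txt", default_data_dir=root / "data", cwd=root)
    ambiguous_name = picked.relative_to(root).as_posix()
    n_warnings = len(caught)
```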

extend_loader(average_attrs=None, additional_attrs=None, overridden_attrs=None, additional_coords=None, overridden_coords=None)[source]

Context manager that temporarily extends various loader attributes.

This context manager can be used to temporarily customize the behavior of the data loader. This is particularly useful when loading data across multiple files, where the coordinate_attrs can be extended so that the attributes in the data are promoted to coordinates and propagated when combining data across files.

For one-off loads, the same arguments can be passed to load or erlab.io.load() with the loader_extensions keyword. This keeps the extension settings attached to the load call, which is useful for generated loading code and ImageTool manager reload metadata.

Parameters:

Example

import erlab

erlab.io.set_loader("loader_name")

with erlab.io.extend_loader(coordinate_attrs=("scan_number",)):
    data = erlab.io.load("file_name")

data = erlab.io.load(
    "file_name",
    loader_extensions={"coordinate_attrs": ("scan_number",)},
)

See also

load

Load data with optional loader_extensions.

coordinate_attrs

The attribute that is temporarily extended.

summarize(*, cache=True, display=True, rc=None)[source]

Summarize the data in the given directory.

Note

This method is not meant to be overridden in subclasses.

Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.

The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.

Parameters:
  • data_dir – Directory to summarize.

  • exclude (default: None) – A string or sequence of strings specifying glob patterns for files to be excluded from the summary. If provided, caching will be disabled.

  • cache (default: True) – Whether to use caching for the summary.

  • display (default: True) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.

  • rc (default: None) – Optional dictionary of matplotlib rcParams to override the default for the plot in the interactive summary. Plot options such as the figure size and colormap can be changed using this argument.

Returns:

pandas.DataFrame or pandas.io.formats.style.Styler or None – Summary of the data in the directory.

  • If display is False, the summary DataFrame is returned.

  • If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned.

    • If ipywidgets is installed, an interactive widget will be returned instead of None.

  • If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.

Return type:

pandas.DataFrame | pandas.io.formats.style.Styler | None

exception erlab.io.dataloader.UnsupportedFileError(loader, file_path)[source]

Bases: Exception

Raised when the loader does not support the given file extension.

exception erlab.io.dataloader.ValidationError[source]

Bases: Exception

Raised when the loaded data fails validation checks.

exception erlab.io.dataloader.ValidationWarning[source]

Bases: UserWarning

Issued when the loaded data fails validation checks.
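
Because ValidationWarning derives from UserWarning, the standard warnings filters apply to it. The sketch below shows how such a warning can be recorded or escalated to an exception. To stay self-contained it defines a local stand-in class with the same base class, and `validate` is a hypothetical check. With erlab installed, the documented erlab.io.dataloader.ValidationWarning would take the place of the stand-in.

```python
import warnings


class ValidationWarning(UserWarning):
    """Local stand-in with the same base class as the documented warning."""


def validate(attrs):
    """Hypothetical check that warns when an expected attribute is missing."""
    if "temperature" not in attrs:
        warnings.warn("missing attribute 'temperature'", ValidationWarning, stacklevel=2)
    return attrs


# Record the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate({})
recorded = [w.category for w in caught]

# Escalate the warning to an exception, for example in a strict pipeline.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ValidationWarning)
    try:
        validate({})
        escalated = False
    except ValidationWarning:
        escalated = True
```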