erlab.io.dataloader

Base functionality for implementing data loaders.

This module provides a base class LoaderBase for implementing data loaders. Data loaders are plugins used to load data from various file formats. Every data loader that subclasses LoaderBase is automatically registered in loaders when imported.

Loaded ARPES data must contain several attributes and coordinates. See the implementation of LoaderBase.validate for details.

A detailed guide on how to implement a data loader can be found in Reading and writing data.

If additional post-processing is required, the LoaderBase.post_process() method can be extended to include the necessary functionality.
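
As a minimal sketch, a subclass could extend post_process() like this (the loader name and the added attribute are hypothetical):

import xarray as xr

from erlab.io.dataloader import LoaderBase

class ExampleLoader(LoaderBase):
    name = "example"  # hypothetical loader name

    def post_process(self, data: xr.DataArray) -> xr.DataArray:
        # Apply the default post-processing first
        data = super().post_process(data)
        # Then apply loader-specific processing, e.g. tagging an attribute
        return data.assign_attrs(custom_flag=1)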

Classes

LoaderBase()

Base class for all data loaders.

LoaderRegistry()

Registry that stores all data loaders.

RegistryBase()

Base class for the loader registry.

Exceptions

LoaderNotFoundError(key)

Raised when a loader is not found in the registry.

ValidationError

Raised when the loaded data fails validation checks.

ValidationWarning

Issued when the loaded data fails validation checks.

exception erlab.io.dataloader.LoaderNotFoundError(key)[source]

Bases: Exception

Raised when a loader is not found in the registry.

exception erlab.io.dataloader.ValidationError[source]

Bases: Exception

Raised when the loaded data fails validation checks.

exception erlab.io.dataloader.ValidationWarning[source]

Bases: UserWarning

Issued when the loaded data fails validation checks.

class erlab.io.dataloader.LoaderBase[source]

Bases: object

Base class for all data loaders.

combine_multiple(data_list, coord_dict)[source]

Combine data loaded from multiple files into a single object, using coord_dict to assign the scan coordinates of a multi-file scan.
classmethod formatter(val)[source]

Format the given value based on its type.

This method is used when formatting the cells of the summary dataframe.

Parameters:

val (object) – The value to be formatted.

Returns:

The formatted value.

Return type:

str or object

Note

The formatting applied depends on the type of the given value. Numpy arrays, lists of strings, floating-point numbers, integers, and datetime objects are supported.

The function also tries to replace the hyphen-minus sign “-” (U+002D) with the better-looking minus sign “−” (U+2212) where applicable.

  • For numpy arrays:
    • If the array has a size of 1, the value is recursively formatted using formatter(val.item()).

    • If the array can be squeezed to a 1-dimensional array, the following rules apply.

      • If the array is evenly spaced, the start, end, step, and length values are formatted and returned as a string in the format “start→end (step, length)”.

      • If the array is monotonic increasing or decreasing but not evenly spaced, the start, end, and length values are formatted and returned as a string in the format “start→end (length)”.

      • If all elements are equal, the value is recursively formatted using formatter(val[0]).

      • If the array is not monotonic, the minimum and maximum values are formatted and returned as a string in the format “min~max”.

    • Arrays that cannot be squeezed to one dimension are returned as is.

  • For lists:

    The list is grouped into runs of consecutive equal elements, and each run is formatted as “[element]×count”; the runs are joined with commas.

  • For floating-point numbers:
    • If the number is integer-valued, it is formatted as an integer using formatter(np.int64(val)).

    • Otherwise, it is formatted as a floating-point number with 4 decimal places and returned as a string.

  • For integers:

    The integer is returned as a string.

  • For datetime objects:

    The datetime object is formatted as a string in the format “%Y-%m-%d %H:%M:%S”.

  • For other types:

    The value is returned as is.

Examples

>>> formatter(np.array([0.1, 0.15, 0.2]))
'0.1→0.2 (0.05, 3)'
>>> formatter(np.array([1.0, 2.0, 2.1]))
'1→2.1 (3)'
>>> formatter(np.array([1.0, 2.1, 2.0]))
'1~2.1 (3)'
>>> formatter([1, 1, 2, 2, 2, 3, 3, 3, 3])
'[1]×2, [2]×3, [3]×4'
>>> formatter(3.14159)
'3.1416'
>>> formatter(42.0)
'42'
>>> formatter(42)
'42'
>>> formatter(datetime.datetime(2024, 1, 1, 12, 0, 0, 0))
'2024-01-01 12:00:00'
generate_summary(data_dir)[source]

Generate a dataframe summarizing the data in the given directory.

Takes a path to a directory and summarizes the data in the directory to a pandas DataFrame, much like a log file. This is useful for quickly inspecting the contents of a directory.

Parameters:

data_dir (str | PathLike) – Path to a directory.

Returns:

Summary of the data in the directory.

Return type:

pandas.DataFrame

classmethod get_styler(df)[source]

Return a styled version of the given dataframe.

This method, along with formatter, determines the display formatting of the summary dataframe. Override this method to change the display style.

Parameters:

df (pandas.DataFrame) – Summary dataframe as returned by generate_summary.

Returns:

The styler to be displayed.

Return type:

pandas.io.formats.style.Styler
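
As an illustration, a subclass could start from the default styler and build on it (the highlighting choice here is arbitrary):

import pandas as pd

from erlab.io.dataloader import LoaderBase

class ExampleLoader(LoaderBase):
    name = "example"  # hypothetical loader name

    @classmethod
    def get_styler(cls, df: pd.DataFrame):
        # Build on the default style instead of replacing it
        styler = super().get_styler(df)
        # Highlight missing cells so they stand out in the summary
        return styler.highlight_null()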

identify(num, data_dir)[source]

Identify the files and coordinates for a given scan number.

This method takes a scan index and transforms it into a list of file paths and coordinates. For scans spread over multiple files, the coordinates must be a dictionary mapping scan axis names to scan coordinates. For single-file scans, the list should contain only one file path, and the coordinates must be an empty dictionary.

Before returning, the keys of the coordinates must be renamed using the mapping given by the name_map_reversed property.

Parameters:
  • num (int) – The index of the scan to identify.

  • data_dir (str | os.PathLike) – The directory containing the data.

Returns:

  • files (list[str]) – A list of file paths.

  • coord_dict (dict[str, Iterable]) – A dictionary mapping scan axis names to scan coordinates. For scans spread over multiple files, the coordinates will be iterables corresponding to each file in the files list. For single-file scans, an empty dictionary is returned.

Return type:

tuple[list[str], dict[str, Iterable]]
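
A minimal sketch of identify() for a hypothetical setup where each scan is stored in a single file named like scan_0001.h5:

import glob
import os

from erlab.io.dataloader import LoaderBase

class ExampleLoader(LoaderBase):
    name = "example"  # hypothetical loader name

    def identify(self, num, data_dir):
        # Collect all files matching the zero-padded scan number
        files = sorted(glob.glob(os.path.join(data_dir, f"scan_{num:04d}*.h5")))
        # Single-file scans return an empty coordinate dictionary
        return files, {}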

infer_index(name)[source]

Infer the index for the given file name.

This method takes a file name, with the path and extension stripped, and tries to infer the scan index from it. If the index can be inferred, it is returned along with any additional keyword arguments that should be passed to load. If the index cannot be determined, None should be returned for the index, along with an empty dictionary of keyword arguments.

Parameters:

name (str) – The base name of the file without the path and extension.

Returns:

  • index – The inferred index if found, otherwise None.

  • additional_kwargs – Additional keyword arguments to be passed to load when the index is found. This argument is useful when the index alone is not enough to load the data.

Return type:

tuple[int | None, dict[str, Any]]

Note

This method is used to determine all files for a given scan. Hence, for loaders with always_single set to True, this method does not have to be implemented.
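
For the same hypothetical naming scheme as above, infer_index() could recover the scan number with a regular expression:

import re

from erlab.io.dataloader import LoaderBase

class ExampleLoader(LoaderBase):
    name = "example"  # hypothetical loader name

    def infer_index(self, name):
        # Extract the number from base names like "scan_0042"
        match = re.match(r"scan_(\d+)", name)
        if match is None:
            return None, {}
        return int(match.group(1)), {}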

isummarize(df=None, **kwargs)[source]

Display an interactive summary.

This method provides an interactive summary of the data using ipywidgets and matplotlib.

Parameters:
  • df (DataFrame | None) – A summary dataframe as returned by generate_summary. If None, a dataframe will be generated using summarize. Defaults to None.

  • **kwargs – Additional keyword arguments to be passed to summarize if df is None.

Note

This method requires ipywidgets to be installed; if it is not found, an ImportError is raised.

load(identifier, data_dir=None, **kwargs)[source]

Load ARPES data.

Parameters:
  • identifier (str | os.PathLike | int) – Value that uniquely identifies a scan. If a string or path-like object is given, it is assumed to be the path to the data file. If an integer is given, it is assumed to be the scan number and is used to automatically determine the path to the data file(s).

  • data_dir (str | os.PathLike | None) – Where to look for the data. If None, the default data directory will be used.

  • single – Only relevant for setups where data for a single scan is saved over multiple files. When identifier resolves to a single file within a multi-file scan and single is False (the default), a single concatenated array containing data from every file in the scan is returned. If single is True, only the data from the specified file is returned. This argument is ignored when identifier is a scan number.

  • **kwargs – Additional keyword arguments are passed to identify.

Returns:

The loaded data.

Return type:

xarray.DataArray or xarray.Dataset or list of xarray.DataArray
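
For instance, with a hypothetical loader instance loader and data directory, both of the following calls resolve to the same scan:

>>> loader.load("/path/to/data/scan_0042.h5")
>>> loader.load(42, data_dir="/path/to/data")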

load_multiple_parallel(file_paths, n_jobs=None)[source]

Load multiple files in parallel.

Parameters:
  • file_paths (list[str]) – A list of file paths to load.

  • n_jobs (int | None) – The number of jobs to run in parallel. If None, the number of jobs is set to 1 for fewer than 15 files and to -1 (all CPU cores) for 15 or more files.

Return type:

A list of the loaded data.

load_single(file_path)[source]

Load a single file and return it in an applicable format.

Any scan-specific post-processing should be implemented in this method. When a single file contains multiple regions, the method should return a single dataset whenever the data can be merged with xarray.merge without conflicts; otherwise, a list of xarray.DataArray objects should be returned.

Parameters:

file_path (str | os.PathLike) – Full path to the file to be loaded.

Returns:

The loaded data.

Return type:

xarray.DataArray or xarray.Dataset or list of xarray.DataArray
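
A minimal sketch, assuming the raw files are netCDF-compatible HDF5 files that xarray can open directly (the engine choice is an assumption):

import xarray as xr

from erlab.io.dataloader import LoaderBase

class ExampleLoader(LoaderBase):
    name = "example"  # hypothetical loader name

    def load_single(self, file_path):
        # Read the whole file into memory and return it as a Dataset
        return xr.load_dataset(file_path, engine="h5netcdf")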

post_process(data)[source]

Post-process the loaded data. Extend this method in a subclass to add loader-specific post-processing.

post_process_general(data)[source]
process_keys(data, key_mapping=None)[source]

Rename the coordinates and attributes of the given data according to the given key mapping.
static reverse_mapping(mapping)[source]

Reverse the given mapping dictionary to form a one-to-one mapping.

Parameters:

mapping (Mapping[str, str | Iterable[str]]) – The mapping dictionary to be reversed.

Example

>>> mapping = {"a": "1", "b": ["2", "3"]}
>>> reverse_mapping(mapping)
{'1': 'a', '2': 'b', '3': 'b'}
summarize(data_dir, usecache=True, *, cache=True, display=True, **kwargs)[source]

Summarize the data in the given directory.

Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.

The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.

Parameters:
  • data_dir (str | os.PathLike) – Directory to summarize.

  • usecache (bool) – Whether to use the cached summary if available. If False, the summary will be regenerated. The cache will be updated if cache is True.

  • cache (bool) – Whether to cache the summary in a pickle file in the directory. If False, no cache will be created or updated. Note that existing cache files will not be deleted, and will be used if usecache is True.

  • display (bool) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.

  • **kwargs – Additional keyword arguments to be passed to generate_summary.

Returns:

df – Summary of the data in the directory.

  • If display is False, the summary DataFrame is returned.

  • If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned.

    • If ipywidgets is installed, an interactive widget will be returned instead of None.

  • If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.

Return type:

pandas.DataFrame or pandas.io.formats.style.Styler or None

classmethod validate(data)[source]

Validate the input data to ensure it is in the correct format.

Checks for the presence of all required coordinates and attributes. If the data does not pass validation, a ValidationError is raised or a warning is issued, depending on the value of the strict_validation flag. Validation is skipped for loaders with attribute skip_validate set to True.

Parameters:

data (xr.DataArray | xr.Dataset | list[xr.DataArray | xr.Dataset]) – The data to be validated.

Raises:

ValidationError

additional_attrs: ClassVar[dict[str, str | int | float]] = {}

Additional attributes to be added to the data after loading.

additional_coords: ClassVar[dict[str, str | int | float]] = {}

Additional non-dimension coordinates to be added to the data after loading.
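
For example, a loader for a setup with a fixed photon energy might declare the following in a subclass body (names and values are illustrative):

from typing import ClassVar

from erlab.io.dataloader import LoaderBase

class ExampleLoader(LoaderBase):
    name = "example"  # hypothetical loader name

    # Added to the attributes of every loaded array
    additional_attrs: ClassVar[dict[str, str | int | float]] = {
        "configuration": 1,
    }
    # Added as non-dimension coordinates, e.g. a fixed photon energy in eV
    additional_coords: ClassVar[dict[str, str | int | float]] = {
        "hv": 21.2,
    }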

aliases: Iterable[str] | None = None

List of alternative names for the loader.

always_single: bool = True

If True, every individual scan corresponds to a single data file, and no concatenation of data from multiple files will be performed.

coordinate_attrs: tuple[str, ...] = ()

Names of attributes (after renaming) that should be treated as coordinates.

Note

Although the data loader tries to preserve the original attributes, the attributes given here, both before and after renaming, will be removed from attrs for consistency.

name: str

Name of the loader. Using a unique and descriptive name is recommended. For easy access, it is recommended to use a name that passes str.isidentifier().

name_map: ClassVar[dict[str, str | Iterable[str]]] = {}

Dictionary that maps new coordinate or attribute names to original coordinate or attribute names. If there are multiple possible names for a single attribute, the value can be passed as an iterable.
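
For example, the following hypothetical mapping, declared in a subclass body, renames "Kinetic Energy" in the raw files to eV and accepts either of two spellings for beta:

from collections.abc import Iterable
from typing import ClassVar

from erlab.io.dataloader import LoaderBase

class ExampleLoader(LoaderBase):
    name = "example"  # hypothetical loader name

    # Map standardized names (keys) to names found in the raw files (values)
    name_map: ClassVar[dict[str, str | Iterable[str]]] = {
        "eV": "Kinetic Energy",  # a single possible original name
        "beta": ("Polar", "Polar Compens"),  # multiple candidate names
    }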

property name_map_reversed: dict[str, str]

A reversed version of the name_map dictionary.

This property is useful for mapping original names to new names.

skip_validate: bool = False

If True, validation checks will be skipped.

strict_validation: bool = False

If True, validation checks will raise a ValidationError on the first failure instead of issuing a warning. Useful for debugging data loaders.

class erlab.io.dataloader.LoaderRegistry[source]

Bases: RegistryBase

get(key)[source]

Return the loader registered under the given name or alias.
load(identifier, data_dir=None, **kwargs)[source]

Load ARPES data.

Parameters:
  • identifier (str | os.PathLike | int | None) – Value that identifies a scan uniquely. If a string or path-like object is given, it is assumed to be the path to the data file. If an integer is given, it is assumed to be a number that specifies the scan number, and is used to automatically determine the path to the data file(s).

  • data_dir (str | os.PathLike | None) – Where to look for the data. If None, the default data directory will be used.

  • single – Only relevant for setups where data for a single scan is saved over multiple files. When identifier resolves to a single file within a multi-file scan and single is False (the default), a single concatenated array containing data from every file in the scan is returned. If single is True, only the data from the specified file is returned. This argument is ignored when identifier is a scan number.

  • **kwargs – Additional keyword arguments are passed to identify.

Returns:

The loaded data.

Return type:

xarray.DataArray or xarray.Dataset or list of xarray.DataArray

loader_context(loader=None, data_dir=None)[source]

Context manager for the current data loader and data directory.

Parameters:
  • loader (str, optional) – The name or alias of the loader to use in the context.

  • data_dir (str or os.PathLike, optional) – The data directory to use in the context.

Examples

  • Load data within a context manager:

    >>> with erlab.io.loader_context("merlin"):
    ...     dat_merlin = erlab.io.load(...)
    
  • Load data with different loaders and directories:

    >>> erlab.io.set_loader("ssrl52", data_dir="/path/to/dir1")
    >>> dat_ssrl_1 = erlab.io.load(...)
    >>> with erlab.io.loader_context("merlin", data_dir="/path/to/dir2"):
    ...     dat_merlin = erlab.io.load(...)
    >>> dat_ssrl_2 = erlab.io.load(...)
    
register(loader_class)[source]

Register a data loader class with the registry.
set_data_dir(data_dir)[source]

Set the default data directory for the data loader.

All subsequent calls to load will use the data_dir set here unless one is explicitly given.

Parameters:

data_dir (str | PathLike | None) – The path to a directory.

Note

This will only affect load. If the loader’s load method is called directly, it will not use the default data directory.
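
For example, assuming erlab.io re-exports this method like set_loader (the path is a placeholder):

>>> erlab.io.set_data_dir("/path/to/data")
>>> dat = erlab.io.load(1)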

set_loader(loader)[source]

Set the current data loader.

All subsequent calls to load will use the loader set here.

Parameters:

loader (str | LoaderBase | None) – The loader to set. It can be either a string representing the name or alias of the loader, or a valid loader class.

Example

>>> erlab.io.set_loader("merlin")
>>> dat_merlin_1 = erlab.io.load(...)
>>> dat_merlin_2 = erlab.io.load(...)
summarize(data_dir=None, usecache=True, *, cache=True, display=True, **kwargs)[source]

Summarize the data in the given directory.

Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.

The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.

Parameters:
  • data_dir (str | os.PathLike | None) – Directory to summarize.

  • usecache (bool) – Whether to use the cached summary if available. If False, the summary will be regenerated. The cache will be updated if cache is True.

  • cache (bool) – Whether to cache the summary in a pickle file in the directory. If False, no cache will be created or updated. Note that existing cache files will not be deleted, and will be used if usecache is True.

  • display (bool) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.

  • **kwargs – Additional keyword arguments to be passed to generate_summary.

Returns:

df – Summary of the data in the directory.

  • If display is False, the summary DataFrame is returned.

  • If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned.

    • If ipywidgets is installed, an interactive widget will be returned instead of None.

  • If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.

Return type:

pandas.DataFrame or pandas.io.formats.style.Styler or None
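
For example, assuming erlab.io re-exports summarize like load and set_loader (names and paths are placeholders):

>>> erlab.io.set_loader("merlin")
>>> erlab.io.summarize("/path/to/data")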

alias_mapping: ClassVar[dict[str, str]]

Mapping of aliases to loader names.

current_loader: LoaderBase | None

Current loader.

default_data_dir: str | PathLike | None

Default directory to search for data files.

loaders: ClassVar[dict[str, LoaderBase | type[LoaderBase]]]

Registered loaders.

class erlab.io.dataloader.RegistryBase[source]

Bases: object

Base class for the loader registry.

This class implements the singleton pattern, ensuring that only one instance of the registry is created and used throughout the application.

classmethod instance()[source]

Return the registry instance.