erlab.io.dataloader¶
Base functionality for implementing data loaders.
This module provides a base class LoaderBase for implementing data loaders. Data loaders are plugins used to load data from various file formats. Each data loader that subclasses LoaderBase is registered on import in loaders.
Loaded ARPES data must contain several attributes and coordinates. See the implementation of LoaderBase.validate for details.
A detailed guide on how to implement a data loader can be found in Reading and writing data.
If additional post-processing is required, the LoaderBase.post_process() method can be extended to include the necessary functionality.
Classes
- LoaderBase – Base class for all data loaders.
- LoaderRegistry – Base class for the loader registry.
Exceptions
- LoaderNotFoundError – Raised when a loader is not found in the registry.
- ValidationError – Raised when the loaded data fails validation checks.
- ValidationWarning – Issued when the loaded data fails validation checks.
- exception erlab.io.dataloader.LoaderNotFoundError(key)[source]¶
Bases: Exception
Raised when a loader is not found in the registry.
- exception erlab.io.dataloader.ValidationError[source]¶
Bases: Exception
Raised when the loaded data fails validation checks.
- exception erlab.io.dataloader.ValidationWarning[source]¶
Bases: UserWarning
Issued when the loaded data fails validation checks.
- class erlab.io.dataloader.LoaderBase[source]¶
Bases: object
Base class for all data loaders.
- classmethod formatter(val)[source]¶
Format the given value based on its type.
This method is used when formatting the cells of the summary dataframe.
- Parameters:
val (object) – The value to be formatted.
- Returns:
The formatted value.
- Return type:
Note
This function formats the given value based on its type. It supports formatting for various types including numpy arrays, lists of strings, floating-point numbers, integers, and datetime objects.
The function also tries to replace the Unicode hyphen-minus sign “-” (U+002D) with the better-looking Unicode minus sign “−” (U+2212) in most cases.
- For numpy arrays:
If the array has a size of 1, the value is recursively formatted using formatter(val.item()).
If the array can be squeezed to a 1-dimensional array, the following rules apply:
- If the array is evenly spaced, the start, end, step, and length values are formatted and returned as a string in the format “start→end (step, length)”.
- If the array is monotonically increasing or decreasing but not evenly spaced, the start, end, and length values are formatted and returned as a string in the format “start→end (length)”.
- If all elements are equal, the value is recursively formatted using formatter(val[0]).
- If the array is not monotonic, the minimum and maximum values are formatted and returned as a string in the format “min~max”.
For arrays with more dimensions, the array is returned as is.
- For lists:
The list is grouped by consecutive equal elements, and the count of each element is formatted and returned as a string in the format “[element]×count”.
- For floating-point numbers:
If the number is an integer, it is formatted as an integer using formatter(np.int64(val)). Otherwise, it is formatted as a floating-point number with 4 decimal places and returned as a string.
- For integers:
The integer is returned as a string.
- For datetime objects:
The datetime object is formatted as a string in the format “%Y-%m-%d %H:%M:%S”.
- For other types:
The value is returned as is.
Examples
>>> formatter(np.array([0.1, 0.15, 0.2]))
'0.1→0.2 (0.05, 3)'
>>> formatter(np.array([1.0, 2.0, 2.1]))
'1→2.1 (3)'
>>> formatter(np.array([1.0, 2.1, 2.0]))
'1~2.1 (3)'
>>> formatter([1, 1, 2, 2, 2, 3, 3, 3, 3])
'[1]×2, [2]×3, [3]×4'
>>> formatter(3.14159)
'3.1416'
>>> formatter(42.0)
'42'
>>> formatter(42)
'42'
>>> formatter(datetime.datetime(2024, 1, 1, 12, 0, 0, 0))
'2024-01-01 12:00:00'
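The array and list rules above can be sketched as follows. This is an illustrative reimplementation, not the library's actual code; fmt_num, fmt_array, and fmt_list are hypothetical names, and the size-1 and all-equal array cases are omitted for brevity:

```python
import itertools

import numpy as np


def fmt_num(val):
    # Hypothetical helper: render integer-valued floats without a decimal
    # point, other floats rounded to 4 decimal places.
    if float(val) == int(val):
        return str(int(val))
    return f"{round(float(val), 4):g}"


def fmt_array(arr):
    steps = np.diff(arr)
    if np.allclose(steps, steps[0]):
        # Evenly spaced -> "start→end (step, length)"
        return f"{fmt_num(arr[0])}→{fmt_num(arr[-1])} ({fmt_num(steps[0])}, {len(arr)})"
    if np.all(steps > 0) or np.all(steps < 0):
        # Monotonic but unevenly spaced -> "start→end (length)"
        return f"{fmt_num(arr[0])}→{fmt_num(arr[-1])} ({len(arr)})"
    # Not monotonic -> "min~max (length)"
    return f"{fmt_num(arr.min())}~{fmt_num(arr.max())} ({len(arr)})"


def fmt_list(lst):
    # Group consecutive equal elements -> "[element]×count"
    return ", ".join(f"[{k}]×{len(list(g))}" for k, g in itertools.groupby(lst))
```

Running these on the example inputs reproduces the outputs shown above, e.g. fmt_array(np.array([0.1, 0.15, 0.2])) gives '0.1→0.2 (0.05, 3)'.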
- generate_summary(data_dir)[source]¶
Generate a dataframe summarizing the data in the given directory.
Takes a path to a directory and summarizes the data in the directory to a pandas DataFrame, much like a log file. This is useful for quickly inspecting the contents of a directory.
- Parameters:
- Returns:
Summary of the data in the directory.
- Return type:
- classmethod get_styler(df)[source]¶
Return a styled version of the given dataframe.
This method, along with formatter, determines the display formatting of the summary dataframe. Override this method to change the display style.
- Parameters:
df (pandas.DataFrame) – Summary dataframe as returned by generate_summary.
- Returns:
The styler to be displayed.
- Return type:
- identify(num, data_dir)[source]¶
Identify the files and coordinates for a given scan number.
This method takes a scan index and transforms it into a list of file paths and coordinates. For scans spread over multiple files, the coordinates must be a dictionary mapping scan axis names to scan coordinates. For single file scans, the list should contain only one file path and the coordinates must be an empty dictionary.
The keys of the coordinates must be transformed to new names prior to returning by using the mapping returned by the name_map_reversed property.
- Parameters:
num (int) – The index of the scan to identify.
data_dir (str | os.PathLike) – The directory containing the data.
- Returns:
files (list[str]) – A list of file paths.
coord_dict (dict[str, Iterable]) – A dictionary mapping scan axis names to scan coordinates. For scans spread over multiple files, the coordinates will be iterables corresponding to each file in the files list. For single file scans, an empty dictionary is returned.
- Return type:
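A minimal sketch of an identify implementation, assuming a made-up file naming scheme (scan001_S000.h5, scan001_S001.h5, …) and an invented scan axis name "beta"; a real loader must match its own setup's file naming and scan axes:

```python
import tempfile
from pathlib import Path


def identify(num, data_dir):
    # Hypothetical naming scheme: "scan{num:03d}_S{i:03d}.h5" for multi-file
    # scans. Collect all files belonging to the requested scan number.
    files = sorted(str(p) for p in Path(data_dir).glob(f"scan{num:03d}_S*.h5"))
    if len(files) <= 1:
        # Single file scan: one path and an empty coordinate dictionary.
        return files, {}
    # Multi-file scan: map a (renamed) scan axis to one coordinate per file.
    # The "beta" axis and its values are made up for illustration.
    return files, {"beta": [float(i) for i in range(len(files))]}


# Demonstrate with dummy files in a temporary directory.
tmp = tempfile.mkdtemp()
for i in range(3):
    Path(tmp, f"scan001_S{i:03d}.h5").touch()
files, coords = identify(1, tmp)
```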
- infer_index(name)[source]¶
Infer the index for the given file name.
This method takes a file name with the path and extension stripped, and tries to infer the scan index from it. If the index can be inferred, it is returned along with additional keyword arguments that should be passed to load. If the index is not found, None should be returned for the index, and an empty dictionary for additional keyword arguments.
- Parameters:
name (str) – The base name of the file without the path and extension.
- Returns:
index – The inferred index if found, otherwise None.
additional_kwargs – Additional keyword arguments to be passed to load when the index is found. This argument is useful when the index alone is not enough to load the data.
- Return type:
Note
This method is used to determine all files for a given scan. Hence, for loaders with always_single set to True, this method does not have to be implemented.
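As a sketch, assuming a hypothetical naming scheme where file names end in the scan index (e.g. sample_f0007), infer_index could be implemented as:

```python
import re


def infer_index(name):
    # Hypothetical scheme: the scan index is the digit run after a trailing
    # "f", as in "sample_f0007". Real loaders must match their own convention.
    match = re.search(r"f(\d+)$", name)
    if match is not None:
        return int(match.group(1)), {}
    # Index not found: return None and an empty kwargs dictionary.
    return None, {}
```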
- isummarize(df=None, **kwargs)[source]¶
Display an interactive summary.
This method provides an interactive summary of the data using ipywidgets and matplotlib.
- Parameters:
df (DataFrame | None) – A summary dataframe as returned by generate_summary. If None, a dataframe will be generated using summarize. Defaults to None.
**kwargs – Additional keyword arguments to be passed to summarize if df is None.
Note
This method requires ipywidgets to be installed. If not found, an ImportError will be raised.
- load(identifier, data_dir=None, **kwargs)[source]¶
Load ARPES data.
- Parameters:
identifier (str | int) – Value that identifies a scan uniquely. If a string or path-like object is given, it is assumed to be the path to the data file. If an integer is given, it is assumed to be a number that specifies the scan number, and is used to automatically determine the path to the data file(s).
data_dir (str | None) – Where to look for the data. If None, the default data directory will be used.
single – For some setups, data for a single scan is saved over multiple files. This argument is only used for such setups. When identifier is resolved to a single file within a multiple file scan, the default behavior when single is False is to return a single concatenated array that contains data from all files in the same scan. If single is set to True, only the data from the given file is returned. This argument is ignored when identifier is a number.
**kwargs – Additional keyword arguments are passed to identify.
- Returns:
The loaded data.
- Return type:
xarray.DataArray or xarray.Dataset or list of xarray.DataArray
- load_single(file_path)[source]¶
Load a single file and return it in applicable format.
Any scan-specific postprocessing should be implemented in this method. When the single file contains many regions, the method should return a single dataset whenever the data can be merged with xarray.merge without conflicts. Otherwise, a list of xarray.DataArray objects should be returned.
- Parameters:
file_path (str | os.PathLike) – Full path to the file to be loaded.
- Returns:
The loaded data.
- Return type:
xarray.DataArray or xarray.Dataset or list of xarray.DataArray
- static reverse_mapping(mapping)[source]¶
Reverse the given mapping dictionary to form a one-to-one mapping.
Example
>>> mapping = {"a": "1", "b": ["2", "3"]}
>>> reverse_mapping(mapping)
{'1': 'a', '2': 'b', '3': 'b'}
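The documented behavior can be sketched in a few lines; this mirrors the example above, not necessarily the library's exact implementation:

```python
def reverse_mapping(mapping):
    # Invert {new_name: old_name(s)} into {old_name: new_name}. A value may be
    # a single string or an iterable of strings; each alternative maps back to
    # the same key, turning a one-to-many map into a one-to-one lookup.
    out = {}
    for key, value in mapping.items():
        if isinstance(value, str):
            out[value] = key
        else:
            for v in value:
                out[v] = key
    return out
```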
- summarize(data_dir, usecache=True, *, cache=True, display=True, **kwargs)[source]¶
Summarize the data in the given directory.
Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.
The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.
- Parameters:
data_dir (str | os.PathLike) – Directory to summarize.
usecache (bool) – Whether to use the cached summary if available. If False, the summary will be regenerated. The cache will be updated if cache is True.
cache (bool) – Whether to cache the summary in a pickle file in the directory. If False, no cache will be created or updated. Note that existing cache files will not be deleted, and will be used if usecache is True.
display (bool) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.
**kwargs – Additional keyword arguments to be passed to generate_summary.
- Returns:
df – Summary of the data in the directory.
If display is False, the summary DataFrame is returned.
If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned. If ipywidgets is installed, an interactive widget will be returned instead of None.
If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.
- Return type:
- classmethod validate(data)[source]¶
Validate the input data to ensure it is in the correct format.
Checks for the presence of all required coordinates and attributes. If the data does not pass validation, a ValidationError is raised or a warning is issued, depending on the value of the strict_validation flag. Validation is skipped for loaders with attribute skip_validate set to True.
- Parameters:
data (xr.DataArray | xr.Dataset | list[xr.DataArray | xr.Dataset]) – The data to be validated.
- Raises:
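The raise-or-warn pattern described above can be sketched as follows. This is a simplified stand-in operating on a plain dictionary, whereas the real method inspects xarray coordinates and attributes:

```python
import warnings


class ValidationError(Exception):
    """Raised when required metadata is missing."""


def validate(attrs, required, strict=False):
    # Check each required key: raise on the first failure when strict,
    # otherwise issue a warning and keep checking.
    for key in required:
        if key not in attrs:
            msg = f"Missing required attribute: {key}"
            if strict:
                raise ValidationError(msg)
            warnings.warn(msg, UserWarning, stacklevel=2)
```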
- additional_attrs: ClassVar[dict[str, str | int | float]] = {}¶
Additional attributes to be added to the data after loading.
- additional_coords: ClassVar[dict[str, str | int | float]] = {}¶
Additional non-dimension coordinates to be added to the data after loading.
- always_single: bool = True¶
If True, this indicates that all individual scans always lead to a single data file. No concatenation of data from multiple files will be performed.
- coordinate_attrs: tuple[str, ...] = ()¶
Names of attributes (after renaming) that should be treated as coordinates.
Note
Although the data loader tries to preserve the original attributes, the attributes given here, both before and after renaming, will be removed from attrs for consistency.
- name: str¶
Name of the loader. A unique and descriptive name is recommended; for easy access, use a name that passes str.isidentifier().
- name_map: ClassVar[dict[str, str | Iterable[str]]] = {}¶
Dictionary that maps new coordinate or attribute names to original coordinate or attribute names. If there are multiple possible names for a single attribute, the value can be passed as an iterable.
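For example, a loader might declare a name_map like the following (the names here are illustrative, not taken from any real loader), which name_map_reversed then inverts into a one-to-one lookup from original names to new names:

```python
# Illustrative name_map: keys are the standardized names, values are the raw
# name (or several possible raw names) found in the data files.
name_map = {
    "eV": "Kinetic Energy",
    "beta": ["Polar", "Polar Compens"],
}

# Inverting it (as name_map_reversed does) yields a one-to-one mapping from
# raw names to standardized names, used when renaming loaded metadata.
reversed_map = {}
for new, old in name_map.items():
    for o in ([old] if isinstance(old, str) else old):
        reversed_map[o] = new
```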
- property name_map_reversed: dict[str, str]¶
A reversed version of the name_map dictionary.
This property is useful for mapping original names to new names.
- strict_validation: bool = False¶
If True, validation checks will raise a ValidationError on the first failure instead of warning. Useful for debugging data loaders.
- class erlab.io.dataloader.LoaderRegistry[source]¶
Bases: RegistryBase
- load(identifier, data_dir=None, **kwargs)[source]¶
Load ARPES data.
- Parameters:
identifier (str | os.PathLike | int | None) – Value that identifies a scan uniquely. If a string or path-like object is given, it is assumed to be the path to the data file. If an integer is given, it is assumed to be a number that specifies the scan number, and is used to automatically determine the path to the data file(s).
data_dir (str | os.PathLike | None) – Where to look for the data. If None, the default data directory will be used.
single – For some setups, data for a single scan is saved over multiple files. This argument is only used for such setups. When identifier is resolved to a single file within a multiple file scan, the default behavior when single is False is to return a single concatenated array that contains data from all files in the same scan. If single is set to True, only the data from the given file is returned. This argument is ignored when identifier is a number.
**kwargs – Additional keyword arguments are passed to identify.
- Returns:
The loaded data.
- Return type:
xarray.DataArray or xarray.Dataset or list of xarray.DataArray
- loader_context(data_dir=None)[source]¶
Context manager for the current data loader and data directory.
- Parameters:
loader (str, optional) – The name or alias of the loader to use in the context.
data_dir (str or os.PathLike, optional) – The data directory to use in the context.
Examples
Load data within a context manager:
>>> with erlab.io.loader_context("merlin"):
...     dat_merlin = erlab.io.load(...)
Load data with different loaders and directories:
>>> erlab.io.set_loader("ssrl52", data_dir="/path/to/dir1")
>>> dat_ssrl_1 = erlab.io.load(...)
>>> with erlab.io.loader_context("merlin", data_dir="/path/to/dir2"):
...     dat_merlin = erlab.io.load(...)
>>> dat_ssrl_2 = erlab.io.load(...)
- set_data_dir(data_dir)[source]¶
Set the default data directory for the data loader.
All subsequent calls to load will use the data_dir set here unless specified.
Note
This will only affect load. If the loader’s load method is called directly, it will not use the default data directory.
- set_loader(loader)[source]¶
Set the current data loader.
All subsequent calls to load will use the loader set here.
- Parameters:
loader (str | LoaderBase | None) – The loader to set. It can be either a string representing the name or alias of the loader, or a valid loader class.
Example
>>> erlab.io.set_loader("merlin")
>>> dat_merlin_1 = erlab.io.load(...)
>>> dat_merlin_2 = erlab.io.load(...)
- summarize(data_dir=None, usecache=True, *, cache=True, display=True, **kwargs)[source]¶
Summarize the data in the given directory.
Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.
The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.
- Parameters:
data_dir (str | os.PathLike | None) – Directory to summarize.
usecache (bool) – Whether to use the cached summary if available. If False, the summary will be regenerated. The cache will be updated if cache is True.
cache (bool) – Whether to cache the summary in a pickle file in the directory. If False, no cache will be created or updated. Note that existing cache files will not be deleted, and will be used if usecache is True.
display (bool) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.
**kwargs – Additional keyword arguments to be passed to generate_summary.
- Returns:
df – Summary of the data in the directory.
If display is False, the summary DataFrame is returned.
If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned. If ipywidgets is installed, an interactive widget will be returned instead of None.
If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.
- Return type:
- current_loader: LoaderBase | None¶
Current loader
- loaders: ClassVar[dict[str, LoaderBase | type[LoaderBase]]]¶
Registered loaders