erlab.io.dataloader
Base functionality for implementing data loaders.
This module provides a base class LoaderBase for implementing data loaders.
Data loaders are plugins used to load data from various file formats.
Each data loader is a subclass of LoaderBase that must implement several
methods and attributes.
A detailed guide on how to implement a data loader can be found in the User Guide.
Classes
LoaderBase – Base class for loader plugins.
LoaderRegistry – Registry of loader plugins.
Exceptions
LoaderNotFoundError – Raised when a loader is not found in the registry.
UnsupportedFileError – Raised when the loader does not support the given file extension.
ValidationError – Raised when the loaded data fails validation checks.
Issued when the loaded data fails validation checks.
- class erlab.io.dataloader.LoaderBase
Bases: object
Base class for loader plugins.
- name: str
Name of the loader. Using a unique and descriptive name is recommended. For easy access, use a name that passes str.isidentifier().
Notes
Changing the name of a loader is not recommended as it may break existing code. Pick a simple, descriptive name that is unlikely to change.
Loaders whose name starts with an underscore are not registered.
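As a quick check of the naming advice, the standard-library method str.isidentifier() reports whether a candidate name could be used for attribute-style access. The candidate names in this sketch are made up for illustration and do not refer to actual loaders.

```python
# Candidate loader names (hypothetical, for illustration only).
candidates = ["merlin", "my_loader", "als-bl4", "7id", "_private"]

# A name that passes str.isidentifier() can be used as a Python attribute name.
attribute_friendly = {name: name.isidentifier() for name in candidates}

# Names starting with an underscore are valid identifiers, but the notes above
# state that loaders with such names are not registered.
skipped = [name for name in candidates if name.startswith("_")]

for name in candidates:
    print(f"{name!r}: passes isidentifier() = {attribute_friendly[name]}")
```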
- aliases: Iterable[str] | None = None
Alternative names for the loader.
Deprecated since version 3.3.0: Accessing loaders with aliases is deprecated and will be removed in a future version. Use the loader name instead.
- extensions: ClassVar[set[str] | None] = None
File extensions supported by the loader, in lowercase and with the leading dot.
An UnsupportedFileError is raised if a file with an unsupported extension is passed to the loader. If None, the loader will attempt to load any file passed to it.
If the loader supports directories, the extension should be an empty string.
Added in version 3.5.1.
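The extension rule can be pictured with a small stand-alone check. This is a minimal sketch, not the library's implementation: the helper check_extension and the exception class defined here are stand-ins, and lowercasing the suffix before comparison is an assumption based on the extensions being stored in lowercase.

```python
from pathlib import Path


class UnsupportedFileError(Exception):
    """Stand-in for the exception named above (defined here for the sketch)."""


def check_extension(path, extensions):
    """Raise if the suffix of ``path`` is not in ``extensions``.

    ``extensions`` is a set of lowercase suffixes with the leading dot, or
    None to accept any file.
    """
    if extensions is None:
        return  # no restriction: the loader attempts to load anything
    suffix = Path(path).suffix.lower()  # "" for names without a suffix
    if suffix not in extensions:
        raise UnsupportedFileError(f"{suffix!r} is not one of {sorted(extensions)}")


check_extension("scan_001.PXT", {".pxt", ".ibw"})  # accepted after lowercasing
check_extension("scan_dir", {""})  # directory-style path with no suffix
check_extension("anything.xyz", None)  # no restriction
```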
- name_map: ClassVar[dict[str, str | Iterable[str]]] = {}
Dictionary that maps new coordinate or attribute names to original coordinate or attribute names. If there are multiple possible names for a single attribute, the value can be passed as an iterable.
Note
Non-dimension coordinates in the resulting data will try to follow the order of the keys in this mapping.
Original coordinate names included in this mapping will be replaced by the new names. However, original attribute names will be duplicated with the new names so that both the original and new names are present in the data after loading. This is to keep track of the original names for reference.
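The difference between renaming coordinates and duplicating attributes can be sketched on plain dictionaries. This is only an illustration of the rule stated in the note: the helper rename_with_name_map and the example names are hypothetical, and the real loader operates on xarray objects rather than dicts.

```python
def rename_with_name_map(coords, attrs, name_map):
    """Apply a name_map-style mapping to plain dicts of coords and attrs."""
    new_coords = dict(coords)
    new_attrs = dict(attrs)
    for new_name, originals in name_map.items():
        # A single original name may be given as a bare string.
        if isinstance(originals, str):
            originals = [originals]
        for original in originals:
            if original in new_coords:
                # Coordinates are renamed: the original name disappears.
                new_coords[new_name] = new_coords.pop(original)
            if original in new_attrs:
                # Attributes are duplicated: both names remain.
                new_attrs[new_name] = new_attrs[original]
    return new_coords, new_attrs


# Hypothetical mapping from new names to one or more original names.
name_map = {"eV": ["Kinetic Energy", "Binding Energy"], "hv": "Photon Energy"}
coords, attrs = rename_with_name_map(
    {"Kinetic Energy": [16.0, 16.1]}, {"Photon Energy": 21.2}, name_map
)
print(coords)
print(attrs)
```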
- coordinate_attrs: tuple[str, ...] = ()
Attribute names (after renaming) that should be treated as coordinates.
Put any attributes that should be propagated when concatenating data here.
Notes
If a listed attribute is not found, it is silently skipped.
The attributes given here, both before and after renaming, are removed from the attributes to avoid conflicting values.
If a coordinate with the same name is already present, the existing coordinate takes precedence and the attribute is silently dropped.
- average_attrs: tuple[str, ...] = ()
Names of attributes or coordinates (after renaming) that should be averaged over.
This is useful for attributes that may vary slightly between scans.
Notes
If a listed attribute is not found, it is silently skipped.
Attributes listed here are first treated as coordinates in process_keys, and then averaged in post_process.
- additional_attrs: ClassVar[dict[str, str | float | datetime | Callable[[DataArray], str | float | datetime]]] = {}
Additional attributes to be added to the data after loading.
If a callable is provided, it will be called with the data as the only argument.
Notes
The attributes are added after renaming with process_keys, so keys will appear in the data as provided.
If an attribute with the same name is already present in the data, it is skipped unless the key is listed in overridden_attrs.
- overridden_attrs: tuple[str, ...] = ()
Keys in additional_attrs that should override existing attributes.
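The interplay between additional_attrs and overridden_attrs can be sketched with plain dictionaries. The helper add_attrs and the attribute names used here are hypothetical; the sketch only mirrors the skip-unless-overridden rule and the callable handling described above, and the same logic would apply to additional_coords and overridden_coords.

```python
def add_attrs(data, attrs, additional_attrs, overridden_attrs=()):
    """Return a copy of ``attrs`` with ``additional_attrs`` applied."""
    out = dict(attrs)
    for key, value in additional_attrs.items():
        if key in out and key not in overridden_attrs:
            continue  # an existing attribute wins unless an override is requested
        # Callables receive the data as their only argument.
        out[key] = value(data) if callable(value) else value
    return out


data = [1.0, 2.0, 3.0]  # stands in for the loaded data
existing = {"sample": "Bi2Se3", "configuration": 3}
additional = {"sample": "unknown", "configuration": 1, "n_points": len}

result = add_attrs(data, existing, additional, overridden_attrs=("configuration",))
print(result)
```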
- additional_coords: ClassVar[dict[str, str | float | datetime | Callable[[DataArray], str | float | datetime]]] = {}
Additional coordinates to be added to the data after loading.
If a callable is provided, it will be called with the data as the only argument.
Notes
The coordinates are added after renaming with process_keys, so keys will appear in the data as provided.
If a coordinate with the same name is already present in the data, it is skipped unless the key is listed in overridden_coords.
- overridden_coords: tuple[str, ...] = ()
Keys in additional_coords that should override existing coordinates.
- always_single: bool = True
Setting this to True disables implicit loading of multiple files for a single scan. This is useful for setups where each scan is always stored in a single file.
- parallel_threshold: int = 30
Minimum number of files in a scan to use parallel loading. If the number of files is less than this threshold, files are loaded sequentially.
Only used when always_single is False.
- skip_validate: bool = False
If True, validation checks will be skipped. If False, data will be checked with validate.
- strict_validation: bool = False
If True, validation checks will raise a ValidationError on the first failure instead of warning. Useful for debugging data loaders. This has no effect if skip_validate is True.
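How the two validation flags interact can be sketched as follows. This is an illustration under stated assumptions, not the library's validate method: the exception and warning classes are local stand-ins, the helper run_checks is hypothetical, and the required attribute names are invented for the example.

```python
import warnings


class ValidationError(Exception):
    """Stand-in for the exception raised in strict mode."""


class ValidationWarning(UserWarning):
    """Stand-in warning category used in non-strict mode."""


def run_checks(attrs, required, *, skip_validate=False, strict_validation=False):
    """Warn about, or raise on, the first missing required attribute."""
    if skip_validate:
        return  # strict_validation has no effect when validation is skipped
    for key in required:
        if key not in attrs:
            message = f"missing required attribute {key!r}"
            if strict_validation:
                raise ValidationError(message)  # stop at the first failure
            warnings.warn(message, ValidationWarning, stacklevel=2)


REQUIRED = ("hv", "sample_temp")  # hypothetical required attributes
run_checks({"hv": 21.2, "sample_temp": 20.0}, REQUIRED)  # passes silently
```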
- formatters: ClassVar[dict[str, Callable]] = {}
Optional mapping from attr or coord names (after renaming) to custom formatters.
The formatters are callables that take the attribute value and return a value that can be converted to a string via value_to_string. The resulting string representations are used for human-readable display in the summary table and the information accessor.
The values returned by the formatters will be further formatted by value_to_string before being displayed.
If the key is a coordinate, the function will automatically be vectorized over every value.
Note
The formatters are only used for display purposes and do not affect the stored data.
See also
get_formatted_attr_or_coord(): The method that uses this mapping to provide human-readable values.
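The two-step display pipeline (custom formatter first, generic string conversion second) can be sketched with plain functions. Everything here is a stand-in: value_to_string is reduced to str, the formatter keys and the polarization codes are invented, and the empty string for a missing key follows the behavior documented for get_formatted_attr_or_coord.

```python
def value_to_string(val):
    """Stand-in for the final formatting step (here simply ``str``)."""
    return str(val)


# Hypothetical formatters keyed by attribute name (after renaming).
formatters = {
    "sample_temp": lambda kelvin: round(kelvin, 1),
    "polarization": lambda code: {0: "LH", 1: "LV"}.get(code, code),
}


def formatted(attrs, key):
    """Return the display string for ``attrs[key]``, or "" if it is missing."""
    if key not in attrs:
        return ""
    value = attrs[key]
    if key in formatters:
        value = formatters[key](value)  # custom formatter runs first
    return value_to_string(value)  # then the generic conversion


attrs = {"sample_temp": 19.9642, "polarization": 1}
display = {key: formatted(attrs, key) for key in ("sample_temp", "polarization", "hv")}
print(display)
```

The stored attrs dictionary is left untouched, which matches the note that formatters affect display only.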
- summary_sort: str | None = None
Optional default column to sort the summary table by.
If None, the summary table is sorted in the order of the files returned by files_for_summary.
- property summary_attrs: dict[str, str | Callable[[DataArray], Any]]
Mapping from summary column names to attr or coord names (after renaming).
If the value is a callable, it will be called with the data as the only argument. This can be used to extract values from the data that are not stored as attributes or are spread across multiple attributes.
If not overridden, returns a basic mapping based on name_map.
It is highly recommended to override this property to provide a more detailed and informative summary. See existing loaders for examples.
- property file_dialog_methods: dict[str, tuple[Callable, dict[str, Any]]]
Map from file dialog names to the loader method and its arguments.
Override this property in the subclass to provide support for loading data from the load menu of the ImageTool GUI.
- Returns:
loader_mapping (dictionary of str to tuple of (callable, dict)) – A dictionary mapping the file dialog names to a tuple of length 2 containing the data loading function and arguments.
The keys should be the names of the file dialog options passed to setNameFilter.
The first item of the value tuple should be a callable whose first positional argument is a path to a file, usually self.load.
The second item should be a dictionary containing keyword arguments to be passed to the method.
Multiple key-value pairs can be returned to provide multiple options.
Example
For instance, the loader for ALS BL4 implements the following mapping, which enables loading .pxt and .ibw files within ImageTool using self.load with no keyword arguments:
    @property
    def file_dialog_methods(self):
        return {"ALS BL4.0.3 Raw Data (*.pxt, *.ibw)": (self.load, {})}
- classmethod value_to_string(val)
Format the given value based on its type.
The default behavior formats the given value with erlab.utils.formatting.format_value(). Override this classmethod to change the printed format of summaries and information accessors. This method is applied after the formatters in formatters.
- classmethod get_styler(df)
Return a styled version of the given dataframe.
This method, along with value_to_string, determines the display formatting of the summary dataframe. Override this classmethod to change the display style.
- Parameters:
df – The summary dataframe.
- Returns:
pandas.io.formats.style.Styler – The styler to be displayed.
- load(identifier, data_dir=None, *, chunks=None, single=False, combine=True, parallel=None, progress=True, load_kwargs=None, loader_extensions=None, **kwargs)
Load ARPES data.
This method is the main entry point for loading ARPES data.
Note
This method is not meant to be overridden in subclasses.
- Parameters:
identifier (str | PathLike | int) – Value that identifies a scan uniquely.
If a string or path-like object is given, it is assumed to be the path to the data file relative to data_dir. If data_dir is not specified, identifier is assumed to be the full path to the data file.
If an integer is given, it is assumed to be the scan number, and is used to automatically determine the path to the data file(s). In this case, the data_dir argument must be specified.
data_dir (str | PathLike | None, default: None) – Where to look for the data. Must be a path to a valid directory. This argument is required when identifier is an integer.
When called as erlab.io.load(), this argument defaults to the value set by erlab.io.set_data_dir() or erlab.io.loader_context().
chunks (int | dict | Literal['auto'] | tuple[int, ...] | None, default: None) – Chunking strategy for loading data with dask for supported loaders.
single (bool, default: False) – This argument is only used when always_single is False, and identifier is given as a string or path-like object.
If identifier points to a file that is included in a multiple file scan, the default behavior when single is False is to return data from all files in the same scan. How the data is combined is determined by the combine argument. If True, only the data from the file given is returned.
combine (bool, default: True) – Whether to attempt to combine multiple files into a single data object. If False, a list of data is returned. If True, the loader tries to combine the data into a single data object and return it. Depending on the type of each data object, the returned object can be an xarray.DataArray, an xarray.Dataset, or an xarray.DataTree.
This argument is only used when single is False.
parallel (bool | None, default: None) – Whether to load multiple files in parallel using dask. For possible values, see load_multiple_parallel.
This argument is only used when single is False.
progress (bool, default: True) – Whether to show a progress bar when loading multiple files.
This argument is only used when single is False.
load_kwargs (dict[str, Any] | None, default: None) – Additional keyword arguments to be passed to load_single. You can also pass additional keyword arguments directly to load, and they will be dispatched to either identify or load_single based on their signatures. See the **kwargs argument for details.
loader_extensions (Mapping[str, Any] | None, default: None) – Temporary extensions to loader attributes, with the same keys accepted by extend_loader.
**kwargs – Additional keyword arguments are passed to identify and load_single based on their signatures. If a keyword argument is accepted by both methods, it is passed to identify. Use the load_kwargs argument to pass an ambiguous keyword argument to load_single.
- Returns:
xarray.DataArray or xarray.Dataset or xarray.DataTree – The loaded data. If combine is False, a list of such objects is returned instead.
- Return type:
DataArray | Dataset | DataTree | list[DataArray] | list[Dataset] | list[DataTree]
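The dispatch rule for **kwargs can be sketched with inspect.signature. This is a simplified model under stated assumptions rather than the library's code: the toy identify and load_single functions and their extra parameters (prefix, region) are invented, and raising TypeError for a keyword that neither function accepts is an assumption of the sketch.

```python
import inspect


def split_kwargs(identify, load_single, kwargs, load_kwargs=None):
    """Split ``kwargs`` between two callables based on their signatures."""
    identify_params = inspect.signature(identify).parameters
    single_params = inspect.signature(load_single).parameters
    identify_kwargs = {}
    single_kwargs = dict(load_kwargs or {})  # explicit load_kwargs always go here
    for key, value in kwargs.items():
        if key in identify_params:
            identify_kwargs[key] = value  # accepted by both: identify wins
        elif key in single_params:
            single_kwargs[key] = value
        else:
            raise TypeError(f"unexpected keyword argument {key!r}")
    return identify_kwargs, single_kwargs


# Toy stand-ins with hypothetical extra parameters.
def identify(num, data_dir, prefix=None, region=None):
    return None


def load_single(file_path, region=None, without_values=False):
    return None


to_identify, to_single = split_kwargs(
    identify,
    load_single,
    {"prefix": "f", "region": 1, "without_values": True},
    load_kwargs={"region": 2},  # the way to reach load_single with an ambiguous name
)
print(to_identify, to_single)
```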
Notes
The data_dir set by erlab.io.set_data_dir() or erlab.io.loader_context() is only used when called as erlab.io.load(). When called directly on a loader instance, the data_dir argument must be specified.
For convenience, the data_dir set by erlab.io.set_data_dir() or erlab.io.loader_context() is silently ignored when all of the following are satisfied:
- identifier is an absolute path to an existing file.
- data_dir is not explicitly provided.
- The path created by joining data_dir and identifier does not point to an existing file.
This way, absolute file paths can be passed directly to the loader without changing the default data directory. For instance, consider the following directory structure.
    cwd/
    ├── data/
    └── example.txt
The following code will load ./example.txt instead of raising an error that ./data/example.txt is missing:
    import erlab

    erlab.io.set_data_dir("data")
    erlab.io.load("example.txt")
However, if ./data/example.txt also exists, the same code will load that one instead while warning about the ambiguity. This behavior may lead to unexpected results when the directory structure is not organized. Keep this in mind and try to keep all data files at the same level.
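The fallback described in these notes can be modeled with pathlib. This is a sketch under stated assumptions, not the library's logic: the function resolve and its explicit cwd parameter are hypothetical, it follows the directory example above (an identifier that exists outside the default data directory), and it covers only the case where data_dir is not passed explicitly.

```python
import tempfile
import warnings
from pathlib import Path


def resolve(identifier, default_data_dir, cwd):
    """Return the file that would be loaded for ``identifier``."""
    direct = Path(cwd) / identifier  # identifier taken on its own
    joined = Path(cwd) / default_data_dir / identifier  # inside the default dir
    if joined.is_file():
        if direct.is_file():
            # Both candidates exist: prefer the default directory, but warn.
            warnings.warn(f"both {direct} and {joined} exist", stacklevel=2)
        return joined
    if direct.is_file():
        return direct  # the default data directory is ignored
    raise FileNotFoundError(joined)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "example.txt").write_text("outside the data directory")
    chosen = resolve("example.txt", "data", root).relative_to(root)
    print(chosen)
```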
- extend_loader(*, name_map=None, coordinate_attrs=None, average_attrs=None, additional_attrs=None, overridden_attrs=None, additional_coords=None, overridden_coords=None)
Context manager that temporarily extends various loader attributes.
This context manager can be used to temporarily customize the behavior of the data loader. This is particularly useful when loading data across multiple files, where the coordinate_attrs can be extended so that the attributes in the data are promoted to coordinates and propagated when combining data across files.
For one-off loads, the same arguments can be passed to load or erlab.io.load() with the loader_extensions keyword. This keeps the extension settings attached to the load call, which is useful for generated loading code and ImageTool manager reload metadata.
- Parameters:
name_map (dict[str, str | Iterable[str]] | None, default: None) – Extends name_map.
coordinate_attrs (tuple[str, ...] | None, default: None) – Extends coordinate_attrs.
average_attrs (tuple[str, ...] | None, default: None) – Extends average_attrs.
additional_attrs (dict[str, str | float | Callable[[DataArray], str | float]] | None, default: None) – Extends additional_attrs.
overridden_attrs (tuple[str, ...] | None, default: None) – Extends overridden_attrs.
additional_coords (dict[str, str | float | Callable[[DataArray], str | float]] | None, default: None) – Extends additional_coords.
overridden_coords (tuple[str, ...] | None, default: None) – Extends overridden_coords.
Example
    import erlab

    erlab.io.set_loader("loader_name")
    with erlab.io.extend_loader(coordinate_attrs=("scan_number",)):
        data = erlab.io.load("file_name")

    data = erlab.io.load(
        "file_name",
        loader_extensions={"coordinate_attrs": ("scan_number",)},
    )
See also
load: Load data with optional loader_extensions.
coordinate_attrs: The attribute that is temporarily extended.
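The "temporarily extends" behavior can be sketched with contextlib on a toy class. ToyLoader is hypothetical and handles only two of the attributes; the sketch shows one plausible way to extend tuples and dictionaries inside a with block and restore them afterwards, even when an exception is raised, and is not the library's implementation.

```python
from contextlib import contextmanager


class ToyLoader:
    """Toy stand-in with two of the extendable attributes."""

    coordinate_attrs = ("hv",)
    additional_attrs = {"configuration": 1}

    @contextmanager
    def extend_loader(self, *, coordinate_attrs=None, additional_attrs=None):
        saved = (self.coordinate_attrs, self.additional_attrs)
        try:
            if coordinate_attrs is not None:
                # Tuples are extended by concatenation.
                self.coordinate_attrs = self.coordinate_attrs + tuple(coordinate_attrs)
            if additional_attrs is not None:
                # Dictionaries are extended by merging.
                self.additional_attrs = {**self.additional_attrs, **additional_attrs}
            yield self
        finally:
            # Restore the previous values even if the body raised.
            self.coordinate_attrs, self.additional_attrs = saved


loader = ToyLoader()
with loader.extend_loader(coordinate_attrs=("scan_number",)):
    inside = loader.coordinate_attrs
outside = loader.coordinate_attrs
print(inside, outside)
```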
- summarize(data_dir, exclude=None, *, cache=True, display=True, rc=None)
Summarize the data in the given directory.
Note
This method is not meant to be overridden in subclasses.
Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.
The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.
- Parameters:
data_dir – Directory to summarize.
exclude (default: None) – A string or sequence of strings specifying glob patterns for files to be excluded from the summary. If provided, caching will be disabled.
cache (default: True) – Whether to use caching for the summary.
display (default: True) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.
rc (default: None) – Optional dictionary of matplotlib rcParams to override the default for the plot in the interactive summary. Plot options such as the figure size and colormap can be changed using this argument.
- Returns:
pandas.DataFrame or pandas.io.formats.style.Styler or None – Summary of the data in the directory.
If display is False, the summary DataFrame is returned.
If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned. If ipywidgets is installed, an interactive widget will be returned instead of None.
If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.
- get_formatted_attr_or_coord(data, attr_or_coord_name)
Return the formatted value of the given attribute or coordinate.
The value is formatted using the function specified in formatters.
Notes
NumPy datetime64 scalars are converted to pandas timestamps before formatting.
If the attribute or coordinate is not found, an empty string is returned.
- load_single(file_path, **kwargs)
Load a single file and return it as an xarray data structure.
All scan-specific postprocessing should be implemented in this method.
This method must be implemented to return the smallest possible data structure that represents the data in a single file. For instance, if a single file contains a single scan region, the method should return a single xarray.DataArray. If it contains multiple regions, the method should return an xarray.Dataset or xarray.DataTree depending on whether the regions can be merged without conflicts (i.e., all mutual coordinates of the regions are the same).
Subclasses may add additional keyword arguments to this method as needed, which can be passed through load using the load_kwargs argument.
If the loader supports dask-based lazy loading, it should add a chunks keyword argument to this method, which should be passed to the underlying data loading function (e.g., xarray.open_dataset(), xarray.open_datatree()).
- Parameters:
file_path – Full path to the file to be loaded.
without_values – Used when creating a summary table. With this option set to True, only the coordinates and attributes of the output data are accessed so that the values can be replaced with placeholder numbers, speeding up the summary generation for lazy loading enabled file formats like HDF5 or NeXus.
Notes
For loaders with always_single set to False, the return type of this method must be consistent across all associated files, i.e., for all files that can be returned together from identify, so that they can be combined without conflicts. This should not be a problem in most cases since the data structure of associated files acquired during the same scan will be identical.
For xarray.DataTree objects, returned trees must be named with a unique identifier to avoid conflicts when combining.
- identify(num, data_dir, **kwargs)
Identify the files and coordinates for a given scan number.
This method takes a scan index and transforms it into a list of file paths and coordinates. See below for the expected behavior.
If no files are found for the given parameters, an empty list and an empty dictionary should be returned. Alternatively, return a single None to indicate a failure to identify the scan.
- Parameters:
num – The index of the scan to identify.
data_dir – The directory containing the data.
- Returns:
files (list of str or path-like) – A list of file paths.
For scans spread over multiple files, the list must contain all files corresponding to the given scan index.
For single-file scans, behavior depends on always_single. If True, all files matching the scan index should be returned, but only the first file will be loaded and a warning will be shown. If False, there is no way to tell whether returned files are part of a valid multiple-file scan. The loader must then ensure that only a single file is returned and issue appropriate warnings if multiple files are detected for a single-file scan. See erlab.io.plugins.merlin.MERLINLoader.identify() for an example.
coord_dict (dict of str to sequence) – A dictionary mapping scan axes names to scan coordinates.
The keys must match the coordinate name conventions used by the data returned by load_single.
For scans spread over multiple files, the coordinates will be sequences, with each element corresponding to each file in files.
For single file scans or multiple file scans that have no well-defined scan axes (such as multi-region scans), an empty dictionary should be returned.
- infer_index(name)
Infer the index for the given file name.
This method takes a file name with the path and extension stripped, and tries to infer the scan index from it. If the index can be inferred, it is returned along with additional keyword arguments that should be passed to load. If the index is not found, None should be returned for the index, and an empty dictionary for additional keyword arguments.
- Parameters:
name (str) – The base name of the file without the path and extension.
- Returns:
index – The inferred index if found, otherwise None.
additional_kwargs – Additional keyword arguments to be passed to load when the index is found. This argument is useful when the index alone is not enough to load the data.
Note
For loaders with always_single set to True, this method is unused.
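A possible shape for infer_index is sketched here as a stand-alone function. The naming scheme (a letter prefix, an underscore, a zero-padded scan number, and an optional _S<slice> suffix) and the prefix keyword argument are hypothetical; only the return convention follows the description above.

```python
import re

# Hypothetical scheme: "<prefix>_<index>" with an optional "_S<slice>" suffix.
NAME_PATTERN = re.compile(r"(?P<prefix>[A-Za-z]+)_(?P<index>\d+)(?:_S\d+)?")


def infer_index(name):
    """Return (index, extra_kwargs), or (None, {}) if no index is found."""
    m = NAME_PATTERN.fullmatch(name)
    if m is None:
        return None, {}
    # The prefix is returned as an extra keyword argument because, in this
    # made-up scheme, the index alone does not identify the files.
    return int(m["index"]), {"prefix": m["prefix"]}


print(infer_index("f_0042_S003"))
print(infer_index("notes"))
```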
- files_for_summary(data_dir)
Return a list of files that can be loaded by the loader.
This method is used to select files that can be loaded by the loader when generating a summary.
- combine_attrs(variable_attrs, context=None)
Combine the attributes of multiple objects into a single attribute dictionary.
This method is used as the combine_attrs argument in xarray.concat() and xarray.merge() when combining data from multiple files into a single object. By default, it has the same behavior as specifying combine_attrs='override' by taking the first set of attributes.
The method can be overridden to provide fine-grained control over how the attributes are combined, e.g., by merging dictionaries or taking the average of some attributes.
- Returns:
dict[str, typing.Any] – The combined attributes.
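The default behavior and one possible override can be sketched as plain functions with the same (variable_attrs, context) signature. The averaging variant and the attribute name sample_temp are illustrative assumptions; they show the kind of customization mentioned above, not an existing loader's code.

```python
def combine_attrs_override(variable_attrs, context=None):
    """Mimic the 'override' behavior: keep the first set of attributes."""
    return dict(variable_attrs[0]) if variable_attrs else {}


def combine_attrs_average(variable_attrs, context=None, average_keys=("sample_temp",)):
    """Keep the first set of attributes, but average selected numeric keys."""
    combined = combine_attrs_override(variable_attrs, context)
    for key in average_keys:
        values = [attrs[key] for attrs in variable_attrs if key in attrs]
        if values:
            combined[key] = sum(values) / len(values)
    return combined


per_file_attrs = [
    {"hv": 21.2, "sample_temp": 20.0},
    {"hv": 21.2, "sample_temp": 22.0},
]
first_only = combine_attrs_override(per_file_attrs)
averaged = combine_attrs_average(per_file_attrs)
print(first_only, averaged)
```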
- pre_combine_multiple(data_list, coord_dict)
Pre-process data before combining multiple files.
This method is called only for loaders that support combining multiple files into a single object, i.e., loaders with always_single set to False. The default implementation returns the input data and coordinate dictionary unchanged.
Override this function to perform any necessary concatenation-specific pre-processing steps. The primary use case is to correct small inconsistencies in the loaded data that result in broken concatenation/combination.
For instance, ALS BL4.0.3 Merlin often produces data with the energy axis start and step values shifted by a small amount (typically on the order of μeV). This results in different energy values for the same scan in different files, leading to the data not being combined correctly. See the implementation of MERLINLoader.
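The kind of correction described here can be sketched on plain lists of axis values. This is not the MERLINLoader implementation: the helper align_axes, the tolerance value, and the choice of snapping every axis onto the first file's axis are assumptions made for illustration.

```python
import math


def align_axes(axes, abs_tol=1e-5):
    """Replace axes that match the first one within ``abs_tol`` by that axis."""
    reference = axes[0]
    aligned = []
    for axis in axes:
        close = len(axis) == len(reference) and all(
            math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
            for a, b in zip(axis, reference)
        )
        # Snap near-identical axes onto the reference; leave others untouched.
        aligned.append(list(reference) if close else list(axis))
    return aligned


energy_axes = [
    [16.0, 16.1],
    [16.000001, 16.100001],  # shifted by about 1 μeV
    [17.0, 17.1],  # a different energy window
]
aligned = align_axes(energy_axes)
print(aligned)
```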
- process_keys(data, key_mapping=None)
Rename coordinates and attributes based on the given mapping.
This method is called by post_process. Extend or override this method to customize the renaming behavior.
- post_process(darr)
Post-process the given DataArray.
This method takes a single DataArray and applies post-processing steps such as renaming coordinates and attributes.
This method is called by post_process_general.
- Parameters:
darr (DataArray) – The DataArray to be post-processed.
- Returns:
DataArray – The post-processed DataArray.
Note
When introducing a custom post-processing step in a loader, make sure to call the parent method in the subclass implementation.
- post_process_general(data)
Post-process any data structure.
This method extends post_process to handle any data structure.
This method is called by load as the final step in the data loading process.
- Parameters:
data (DataArray or Dataset or DataTree) – The data to be post-processed.
If a DataArray, the data is post-processed using post_process.
If a Dataset, a new Dataset containing each data variable post-processed using post_process is returned. The attributes of the original Dataset are preserved.
If an xarray.DataTree, the post-processing is applied to each leaf node Dataset.
- Returns:
DataArray or Dataset or DataTree – The post-processed data with the same type as the input.
- classmethod validate(data)
Validate the input data to ensure it is in the correct format.
Checks for the presence of all coordinates and attributes required for common analysis procedures like momentum conversion. If the data does not pass validation, a ValidationError is raised or a warning is issued, depending on the strict_validation flag. Validation is skipped for loaders with skip_validate set to True.
- Parameters:
data (DataArray or Dataset or DataTree) – The data to be validated. If an xarray.Dataset or xarray.DataTree is passed, validation is performed on each data variable recursively.
- load_multiple_parallel(file_paths, *, parallel=None, progress=True, post_process=False, **kwargs)
Load from multiple files in parallel.
- Parameters:
parallel (bool | None, default: None) – Whether to load data in parallel using dask.
If None, parallel loading is enabled only if the number of files is greater than the loader’s parallel_threshold.
If True, data loading will always be performed in parallel.
If False, data will be loaded sequentially.
progress (bool, default: True) – Whether to show a progress bar.
post_process (bool, default: False) – Whether to post-process each data object after loading.
**kwargs – Additional keyword arguments to be passed to load_single.
- Returns:
A list of the loaded data.
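The three-way meaning of the parallel argument can be written as a small decision function. The helper name use_parallel is hypothetical. The sketch follows the wording of this method (strictly greater than the threshold); the parallel_threshold attribute description is phrased slightly differently, so the behavior at exactly the threshold should be checked against the source.

```python
def use_parallel(n_files, parallel=None, parallel_threshold=30):
    """Decide whether to load ``n_files`` files in parallel."""
    if parallel is None:
        # Automatic mode: only large scans are loaded in parallel.
        return n_files > parallel_threshold
    return bool(parallel)  # an explicit True or False always wins


for n_files, parallel in [(10, None), (31, None), (2, True), (100, False)]:
    print(n_files, parallel, use_parallel(n_files, parallel))
```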
- exception erlab.io.dataloader.LoaderNotFoundError(key)
Bases: Exception
Raised when a loader is not found in the registry.
- class erlab.io.dataloader.LoaderRegistry(state=None)
Bases: object
Registry of loader plugins.
Stores and manages data loaders. The loaders can be accessed by name in a dictionary-like manner or as an attribute.
Most public methods of this class instance can be accessed through the erlab.io namespace.
Examples
>>> import erlab
>>> "merlin" in erlab.io.loaders  # Check if MERLIN loader is registered
True
>>> list(erlab.io.loaders.keys())  # List registered loader names
['da30', 'erpes', ...]
Notes
Public methods are thread-safe.
Per-context state (current_loader and data_dir) uses contextvars so that concurrent threads/tasks do not step on each other.
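The isolation that contextvars provides can be seen in a few lines of standard-library code. The variable current_loader here is a local ContextVar created for the sketch, not the registry's internal state; the point is only that a value set in one thread is not visible in another thread or in the main thread.

```python
import contextvars
import threading

# A context variable standing in for per-context registry state.
current_loader = contextvars.ContextVar("current_loader", default=None)

seen = {}


def worker(name):
    current_loader.set(name)  # visible only in this thread's context
    seen[name] = current_loader.get()


threads = [threading.Thread(target=worker, args=(name,)) for name in ("merlin", "da30")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# The main thread never set the variable, so it still sees the default.
main_value = current_loader.get()
print(seen, main_value)
```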
- property current_loader: LoaderBase | None
Current loader.
- property default_data_dir: PathLike | None
Deprecated alias for current_data_dir.
Deprecated since version 3.0.0: Use current_data_dir instead.
- set_loader(loader)
Set the current data loader for the current context.
All subsequent calls to load will use the provided loader.
- Parameters:
loader (str | LoaderBase | None) – The loader to set. It can be either a string representing the name or alias of the loader, or a valid loader class.
Example
>>> erlab.io.set_loader("merlin")
>>> dat_merlin_1 = erlab.io.load(...)
>>> dat_merlin_2 = erlab.io.load(...)
- set_data_dir(data_dir)
Set the default data directory for the current context.
All subsequent calls to erlab.io.load() will use the provided data_dir unless a different data_dir is passed explicitly.
Note
This will only affect erlab.io.load(). If the loader’s load method is called directly, it will not use the default data directory.
- loader_context(loader=None, data_dir=None)
Context manager that temporarily sets the current loader and data directory.
- Parameters:
loader (str, optional) – The name or alias of the loader to use in the context.
data_dir (str or os.PathLike, optional) – The data directory to use in the context.
Examples
Load data within a context manager:
>>> with erlab.io.loader_context("merlin"):
...     dat_merlin = erlab.io.load(...)
Load data with different loaders and directories:
>>> erlab.io.set_loader("ssrl52")
>>> erlab.io.set_data_dir("/path/to/dir1")
>>> dat_ssrl_1 = erlab.io.load(...)
>>> with erlab.io.loader_context("merlin", data_dir="/path/to/dir2"):
...     dat_merlin = erlab.io.load(...)
>>> dat_ssrl_2 = erlab.io.load(...)
- load(identifier, data_dir=None, *, chunks=None, single=False, combine=True, parallel=False, progress=True, load_kwargs=None, loader_extensions=None, **kwargs)
Load ARPES data.
This method is the main entry point for loading ARPES data.
Note
This method is not meant to be overridden in subclasses.
- Parameters:
identifier – Value that identifies a scan uniquely.
If a string or path-like object is given, it is assumed to be the path to the data file relative to data_dir. If data_dir is not specified, identifier is assumed to be the full path to the data file.
If an integer is given, it is assumed to be the scan number, and is used to automatically determine the path to the data file(s). In this case, the data_dir argument must be specified.
data_dir (str | PathLike | None, default: None) – Where to look for the data. Must be a path to a valid directory. This argument is required when identifier is an integer.
When called as erlab.io.load(), this argument defaults to the value set by erlab.io.set_data_dir() or erlab.io.loader_context().
chunks – Chunking strategy for loading data with dask for supported loaders.
single (bool, default: False) – This argument is only used when always_single is False, and identifier is given as a string or path-like object.
If identifier points to a file that is included in a multiple file scan, the default behavior when single is False is to return data from all files in the same scan. How the data is combined is determined by the combine argument. If True, only the data from the file given is returned.
combine (bool, default: True) – Whether to attempt to combine multiple files into a single data object. If False, a list of data is returned. If True, the loader tries to combine the data into a single data object and return it. Depending on the type of each data object, the returned object can be an xarray.DataArray, an xarray.Dataset, or an xarray.DataTree.
This argument is only used when single is False.
parallel (bool, default: False) – Whether to load multiple files in parallel using dask. For possible values, see load_multiple_parallel.
This argument is only used when single is False.
progress (bool, default: True) – Whether to show a progress bar when loading multiple files.
This argument is only used when single is False.
load_kwargs (dict[str, Any] | None, default: None) – Additional keyword arguments to be passed to load_single. You can also pass additional keyword arguments directly to load, and they will be dispatched to either identify or load_single based on their signatures. See the **kwargs argument for details.
loader_extensions (Mapping[str, Any] | None, default: None) – Temporary extensions to loader attributes, with the same keys accepted by extend_loader.
**kwargs – Additional keyword arguments are passed to identify and load_single based on their signatures. If a keyword argument is accepted by both methods, it is passed to identify. Use the load_kwargs argument to pass an ambiguous keyword argument to load_single.
- Returns:
xarray.DataArray or xarray.Dataset or xarray.DataTree – The loaded data. If combine is False, a list of such objects is returned instead.
- Return type:
DataArray | Dataset | DataTree | list[DataArray] | list[Dataset] | list[DataTree]
Notes
The
data_dirset byerlab.io.set_data_dir()orerlab.io.loader_context()is only used when called aserlab.io.load(). When called directly on a loader instance, thedata_dirargument must be specified.For convenience, the
data_dirset byerlab.io.set_data_dir()orerlab.io.loader_context()is silently ignored when all of the following are satisfied:identifieris an absolute path to an existing file.data_diris not explicitly provided.The path created by joining
data_dirandidentifierdoes not point to an existing file.
This way, absolute file paths can be passed directly to the loader without changing the default data directory. For instance, consider the following directory structure.
cwd/
├── data/
└── example.txt
The following code will load ./example.txt instead of raising an error that ./data/example.txt is missing:
    import erlab
    erlab.io.set_data_dir("data")
    erlab.io.load("example.txt")
However, if ./data/example.txt also exists, the same code will load that one instead while warning about the ambiguity. This behavior may lead to unexpected results when the directory structure is not organized. Keep this in mind and try to keep all data files at the same level.
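The fallback rules in these notes can be sketched as a small path-resolution function. This is a rough illustration, not the loader's code: resolve_identifier and its parameter names are hypothetical, and it assumes that "an existing file" means any path that resolves to a file from the current working directory, which is what the example above relies on.

```python
import warnings
from pathlib import Path


def resolve_identifier(identifier, default_dir=None, data_dir=None):
    """Pick the file to load from an identifier and the data directories.

    Hypothetical helper. ``default_dir`` plays the role of the directory
    configured globally; ``data_dir`` is the explicitly passed argument.
    """
    direct = Path(identifier)
    if data_dir is not None:
        # An explicitly provided data_dir is always honored.
        return Path(data_dir) / direct
    if default_dir is None:
        return direct

    joined = Path(default_dir) / direct
    if direct.is_file():
        if not joined.is_file():
            # Only the direct path exists: ignore the default directory.
            return direct
        if joined.resolve() != direct.resolve():
            # Both exist: prefer the default directory, but warn.
            warnings.warn(
                f"Both {direct} and {joined} exist; loading {joined}.",
                stacklevel=2,
            )
    return joined
```

With the directory layout shown above, resolve_identifier("example.txt", default_dir="data") returns the path to ./example.txt. Once ./data/example.txt is created, the same call returns that file and emits a warning.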
- extend_loader(average_attrs=None, additional_attrs=None, overridden_attrs=None, additional_coords=None, overridden_coords=None)[source]¶
Context manager that temporarily extends various loader attributes.
This context manager can be used to temporarily customize the behavior of the data loader. This is particularly useful when loading data across multiple files, where the coordinate_attrs can be extended so that the attributes in the data are promoted to coordinates and propagated when combining data across files.
For one-off loads, the same arguments can be passed to load or erlab.io.load() with the loader_extensions keyword. This keeps the extension settings attached to the load call, which is useful for generated loading code and ImageTool manager reload metadata.
- Parameters:
name_map – Extends name_map.
coordinate_attrs (tuple[str, ...] | None, default: None) – Extends coordinate_attrs.
average_attrs (tuple[str, ...] | None, default: None) – Extends average_attrs.
additional_attrs (dict[str, str | float | Callable[[DataArray], str | float]] | None, default: None) – Extends additional_attrs.
overridden_attrs (tuple[str, ...] | None, default: None) – Extends overridden_attrs.
additional_coords (dict[str, str | float | Callable[[DataArray], str | float]] | None, default: None) – Extends additional_coords.
overridden_coords (tuple[str, ...] | None, default: None) – Extends overridden_coords.
Example
    import erlab
    erlab.io.set_loader("loader_name")
    with erlab.io.extend_loader(coordinate_attrs=("scan_number",)):
        data = erlab.io.load("file_name")
    data = erlab.io.load(
        "file_name",
        loader_extensions={"coordinate_attrs": ("scan_number",)},
    )
See also
load
    Load data with optional loader_extensions.
coordinate_attrs
    The attribute that is temporarily extended.
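The "temporarily extends" behavior of extend_loader follows a common context-manager pattern: save the current value, extend it, and restore it on exit even if loading fails. The sketch below shows that pattern on a stand-in class. DemoLoader, its extend method, and the "temperature" attribute are hypothetical and do not reflect how LoaderBase is implemented.

```python
from contextlib import contextmanager


class DemoLoader:
    """Stand-in for a loader; not the real LoaderBase."""

    coordinate_attrs: tuple[str, ...] = ("temperature",)

    @contextmanager
    def extend(self, coordinate_attrs: tuple[str, ...] = ()):
        original = self.coordinate_attrs
        # The extended tuple is only visible while the with block runs.
        self.coordinate_attrs = original + tuple(coordinate_attrs)
        try:
            yield self
        finally:
            # Restore the saved value, even if the body raised.
            self.coordinate_attrs = original


loader = DemoLoader()
with loader.extend(coordinate_attrs=("scan_number",)):
    inside = loader.coordinate_attrs
outside = loader.coordinate_attrs
```

The try/finally around the yield is the part that matters: without it, an exception raised during loading would leave the loader in its extended state.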
- summarize(*, cache=True, display=True, rc=None)[source]¶
Summarize the data in the given directory.
Note
This method is not meant to be overridden in subclasses.
Takes a path to a directory and summarizes the data in the directory to a table, much like a log file. This is useful for quickly inspecting the contents of a directory.
The dataframe is formatted using the style from get_styler and displayed in the IPython shell. Results are cached in a pickle file in the directory.
- Parameters:
data_dir – Directory to summarize.
exclude (default: None) – A string or sequence of strings specifying glob patterns for files to be excluded from the summary. If provided, caching will be disabled.
cache (default: True) – Whether to use caching for the summary.
display (default: True) – Whether to display the formatted dataframe using the IPython shell. If False, the dataframe will be returned without formatting. If True but the IPython shell is not detected, the dataframe styler will be returned.
rc (default: None) – Optional dictionary of matplotlib rcParams to override the default for the plot in the interactive summary. Plot options such as the figure size and colormap can be changed using this argument.
- Returns:
pandas.DataFrame or pandas.io.formats.style.Styler or None – Summary of the data in the directory.
- If display is False, the summary DataFrame is returned.
- If display is True and the IPython shell is detected, the summary will be displayed, and None will be returned.
  - If ipywidgets is installed, an interactive widget will be returned instead of None.
- If display is True but the IPython shell is not detected, the styler for the summary DataFrame will be returned.
- Return type:
- exception erlab.io.dataloader.UnsupportedFileError(loader, file_path)[source]¶
Bases: Exception
Raised when the loader does not support the given file extension.
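The extension rules documented for the extensions attribute (lowercase with a leading dot, None to accept any file, an empty string for directories) suggest a check along the following lines. This is a sketch under those stated rules only: check_extension and UnsupportedFileSketch are hypothetical names, and the real exception is constructed from a loader and a file path rather than a message.

```python
from pathlib import Path


class UnsupportedFileSketch(Exception):
    """Stand-in for UnsupportedFileError; not the library's class."""


def check_extension(file_path, extensions):
    """Raise if the suffix of file_path is not in extensions.

    Hypothetical helper. ``extensions`` is a set of lowercase suffixes
    with the leading dot, or None to accept any file.
    """
    if extensions is None:
        return  # No restriction: the loader tries any file.
    # A path without a suffix, such as a directory name, yields "".
    suffix = Path(file_path).suffix.lower()
    if suffix not in extensions:
        raise UnsupportedFileSketch(
            f"{file_path!r} has unsupported extension {suffix!r}"
        )


check_extension("scans/f_003.PXT", {".pxt"})  # passes after lowercasing
check_extension("scans/f_003", {""})  # suffix-less path, e.g. a directory
```

Lowercasing the suffix before the lookup is what lets a set of lowercase extensions match files saved with uppercase suffixes.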
- exception erlab.io.dataloader.ValidationError[source]¶
Bases: Exception
Raised when the loaded data fails validation checks.
- exception erlab.io.dataloader.ValidationWarning[source]¶
Bases: UserWarning
Issued when the loaded data fails validation checks.