ASDF Data Set

Python implementation of the Adaptable Seismic Data Format (ASDF).

copyright:

Lion Krischer (lion.krischer@gmail.com), 2013-2021

license:

BSD 3-Clause (“BSD New” or “BSD Simplified”)

class pyasdf.asdf_data_set.ASDFDataSet(filename, compression='gzip-3', shuffle=True, debug=False, mpi=None, mode='a', single_item_read_limit_in_mb=4096.0, format_version=None)[source]

Object dealing with Adaptable Seismic Data Format (ASDF).

Central object of this Python package.

Parameters:
  • filename (str) – The filename of the HDF5 file (to be).

  • compression (str) –

    The compression to use. Defaults to "gzip-3" which yielded good results in the past. Will only be applied to newly created data sets. Existing ones are not touched. Using parallel I/O will also disable compression as it is not possible to use both at the same time.

    Available compressions choices (all of them are lossless):

    • None: No compression

    • "gzip-0" - "gzip-9": Gzip compression level 0 (worst but fast) to 9 (best but slow)

    • "lzf": LZF compression

    • "szip-ec-8": szip compression

    • "szip-ec-10": szip compression

    • "szip-nn-8": szip compression

    • "szip-nn-10": szip compression

  • shuffle (bool) – Turn the shuffle filter on or off. Turning it on oftentimes increases the compression ratio at the expense of some speed.

  • debug (bool) – If True, print debug messages. Potentially very verbose.

  • mpi (bool) – Force MPI on/off. Don’t touch this unless you have a reason.

  • mode (str) – The mode the file is opened in. Passed to the underlying h5py.File constructor. pyasdf expects to be able to write to files for many operations so it might result in strange errors if a read-only mode is used. Nonetheless this is quite useful for some use cases as long as one is aware of the potential repercussions.

  • single_item_read_limit_in_mb (float) – This limits the amount of waveform data that can be read with a simple attribute or dictionary access. Some ASDF files can get very big and this raises an error if one tries to access more than the specified value. This is mainly to guard against accidentally filling one's memory on the interactive command line when just exploring an ASDF data set. There are other ways to still access data and even this setting can be overwritten. This test will only be triggered when the file is bigger than this limit in the first place as the test is fairly expensive.
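
A minimal usage sketch; the filename "example.h5" below is purely illustrative:

>>> import pyasdf
>>> # Creates the file if it does not yet exist, otherwise opens it (default mode "a").
>>> ds = pyasdf.ASDFDataSet("example.h5", compression="gzip-3", shuffle=True)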

add_auxiliary_data(data, data_type, path, parameters, provenance_id=None)[source]

Adds auxiliary data to the file.

Parameters:
  • data – The actual data as a n-dimensional numpy array.

  • data_type – The type of data, think of it like a subfolder.

  • path – The path of the data. Must be unique per data_type. Can be a path separated by forward slashes at which point it will be stored in a nested structure.

  • parameters – Any additional options, as a Python dictionary.

  • provenance_id – The id of the provenance of this data. The provenance information itself must be added separately. Must be given as a qualified name, e.g. '{namespace_uri}id'.

The data type is the category of the data and the path the name of that particular piece of data within that category.

>>> ds.add_auxiliary_data(numpy.random.random(10),
...                       data_type="RandomArrays",
...                       path="test_data",
...                       parameters={"a": 1, "b": 2})
>>> ds.auxiliary_data.RandomArrays.test_data
Auxiliary Data of Type 'RandomArrays'
Path: 'test_data'
Data shape: '(10, )', dtype: 'float64'
Parameters:
    a: 1
    b: 2

The path can contain forward slashes to create a nested hierarchy of auxiliary data.

>>> ds.add_auxiliary_data(numpy.random.random(10),
...                       data_type="RandomArrays",
...                       path="some/nested/path/test_data",
...                       parameters={"a": 1, "b": 2})
>>> ds.auxiliary_data.RandomArrays.some.nested.path.test_data
Auxiliary Data of Type 'RandomArrays'
Path: 'some/nested/path/test_data'
Data shape: '(10, )', dtype: 'float64'
Parameters:
    a: 1
    b: 2

add_auxiliary_data_file(filename_or_object, path, parameters=None, provenance_id=None)[source]

Special function for adding a file or file-like object as an auxiliary data object denoting a file. This is very useful to store arbitrary files in ASDF.

Parameters:
  • filename_or_object – Filename, open file, or file-like object. The file should be opened in binary mode, but this is not checked.

  • path – The path of the file. Has the same limitations as normal tags.

  • parameters – Any additional options, as a Python dictionary.

  • provenance_id – The id of the provenance of this data. The provenance information itself must be added separately. Must be given as a qualified name, e.g. '{namespace_uri}id'.
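
A short sketch of adding an arbitrary binary payload; the path and parameters below are hypothetical and only illustrate the call:

>>> import io
>>> # Any file opened in binary mode or binary file-like object works.
>>> file_object = io.BytesIO(b"arbitrary binary payload")
>>> ds.add_auxiliary_data_file(file_object, path="example_config",
...                            parameters={"description": "example file"})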

add_provenance_document(document, name=None)[source]

Add a provenance document to the current ASDF file.

Parameters:
  • document (Filename, file-like objects or prov document.) – The document to add.

  • name (str) – The name of the document within ASDF. Must be lowercase alphanumeric with optional underscores. If not given, one will be autogenerated. If given and a document with that name already exists, it will be overwritten.
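
A minimal sketch; the filename and document name are placeholders:

>>> ds.add_provenance_document("/path/to/provenance.xml",
...                            name="example_processing")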

add_quakeml(event)[source]

Adds a QuakeML file or an existing ObsPy event to the data set.

An exception will be raised if you attempt to add an event that already exists within the data set. Duplicates are detected based on the public ids of the events.

Parameters:

event (Event or Catalog) – Filename, or an existing ObsPy Event or Catalog object.

Raises:

ValueError

Example

For now we will create a new ASDF file but one can also use an existing one.

>>> import pyasdf
>>> import obspy
>>> ds = pyasdf.ASDFDataSet("new_file.h5")

One can add an event either by passing a filename …

>>> ds.add_quakeml("/path/to/quake.xml")

… or by passing an existing event or catalog object.

>>> cat = obspy.read_events("/path/to/quake.xml")
>>> ds.add_quakeml(cat)

add_stationxml(stationxml)[source]

Adds the StationXML to the data set object.

This does some fairly exhaustive processing and will happily split the StationXML file and merge it with existing ones.

If you care to have a more or less unchanged StationXML file in the data set object, be sure to add it one station at a time.

Parameters:

stationxml (str or Inventory) – Filename of StationXML file or an ObsPy inventory object containing the same.
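
Both invocation styles in a quick sketch (the paths are placeholders):

>>> import obspy
>>> # Pass a filename directly ...
>>> ds.add_stationxml("/path/to/stations.xml")
>>> # ... or an existing ObsPy Inventory object.
>>> inv = obspy.read_inventory("/path/to/stations.xml")
>>> ds.add_stationxml(inv)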

add_waveforms(waveform, tag, event_id=None, origin_id=None, magnitude_id=None, focal_mechanism_id=None, provenance_id=None, labels=None)[source]

Adds one or more waveforms to the current ASDF file.

Parameters:
  • waveform (obspy.core.stream.Stream, obspy.core.trace.Trace, str, …) – The waveform to add. Can either be an ObsPy Stream or Trace object or something ObsPy can read.

  • tag (str) – The tag that will be given to all waveform files. It is mandatory for all traces and facilitates identification of the data within one ASDF volume. The "raw_recording" tag is, by convention, reserved for raw, recorded, unprocessed data.

  • event_id (obspy.core.event.Event, obspy.core.event.ResourceIdentifier, str, or list) – The event or id which the waveform is associated with. This is useful for recorded data if a clear association is given, but also for synthetic data. Can also be a list of items.

  • origin_id (obspy.core.event.Origin, obspy.core.event.ResourceIdentifier, str, or list) – The particular origin this waveform is associated with. This is mainly useful for synthetic data where the origin is precisely known. Can also be a list of items.

  • magnitude_id (obspy.core.event.Magnitude, obspy.core.event.ResourceIdentifier, str, or list) – The particular magnitude this waveform is associated with. This is mainly useful for synthetic data where the magnitude is precisely known. Can also be a list of items.

  • focal_mechanism_id (obspy.core.event.FocalMechanism, obspy.core.event.ResourceIdentifier, str, or list) – The particular focal mechanism this waveform is associated with. This is mainly useful for synthetic data where the mechanism is precisely known. Can also be a list of items.

  • provenance_id – The id of the provenance of this data. The provenance information itself must be added separately. Must be given as a qualified name, e.g. '{namespace_uri}id'.

  • labels (list of str) – A list of labels to associate with all the added traces. Must not contain a comma as that is used as a separator.

Examples

We first setup an example ASDF file with a single event.

>>> from pyasdf import ASDFDataSet
>>> ds = ASDFDataSet("event_tests.h5")
>>> ds.add_quakeml("quake.xml")
>>> event = ds.events[0]

Now assume we have a MiniSEED file called recording.mseed that is an unprocessed observation of that earthquake straight from a data center. We will now add it to the file, give it the "raw_recording" tag (which is reserved for raw, recorded, and unprocessed data), and associate it with the event. Keep in mind that this association is optional. It can also be associated with multiple events - in that case just pass a list of objects.

>>> ds.add_waveforms("recording.mseed", tag="raw_recording",
...                  event_id=event)

It is also possible to directly add obspy.core.stream.Stream objects containing an arbitrary number of obspy.core.trace.Trace objects.

>>> import obspy
>>> st = obspy.read()  # Reads an example file without argument.
>>> print(st)
3 Trace(s) in Stream:
BW.RJOB..EHZ | 2009-08-24T00:20:03.00Z - ... | 100.0 Hz, 3000 samples
BW.RJOB..EHN | 2009-08-24T00:20:03.00Z - ... | 100.0 Hz, 3000 samples
BW.RJOB..EHE | 2009-08-24T00:20:03.00Z - ... | 100.0 Hz, 3000 samples
>>> ds.add_waveforms(st, tag="obspy_example")

Just to demonstrate that all waveforms can also be retrieved again.

>>> print(ds.waveforms.BW_RJOB.obspy_example)
3 Trace(s) in Stream:
BW.RJOB..EHZ | 2009-08-24T00:20:03.00Z - ... | 100.0 Hz, 3000 samples
BW.RJOB..EHN | 2009-08-24T00:20:03.00Z - ... | 100.0 Hz, 3000 samples
BW.RJOB..EHE | 2009-08-24T00:20:03.00Z - ... | 100.0 Hz, 3000 samples

For the last example let's assume we have the result of a simulation stored in the synthetics.sac file; we will store it under the (arbitrarily chosen) "synthetic" tag. In this case we know the precise source parameters (we specified them before running the simulation) so it is a good idea to add that association to the waveform. Please again keep in mind that they are all optional and depending on your use case they might or might not be useful/meaningful. You can again pass lists of all of these objects, in which case multiple associations will be stored in the file.

>>> origin = event.preferred_origin()
>>> magnitude = event.preferred_magnitude()
>>> focal_mechanism = event.preferred_focal_mechanism()
>>> ds.add_waveforms("synthetics.sac", tag="synthetic", event_id=event,
...                  origin_id=origin, magnitude_id=magnitude,
...                  focal_mechanism_id=focal_mechanism)

append_waveforms(waveform, tag)[source]

Append waveforms to an existing data array if possible.

Note

The add_waveforms() method is the better choice in most cases. This function here is only useful for special cases!

The main purpose of this function is to enable the construction of ASDF files with large single arrays. The classical example is all recordings of a station for a year or longer. If the add_waveforms() method is used ASDF will internally store each file in one or more data sets. This function here will attempt to enlarge existing data arrays and append to them creating larger ones that are a bit more efficient to read. It is slower to write this way but it can potentially be faster to read.

This is only possible if the waveform traces to be appended fit seamlessly after an existing trace, within a tolerance of half a sampling interval.

Please note that a single array in ASDF cannot have any gaps and/or overlaps so even if this function is used it might still result in several data sets in the HDF5 file.

This additionally requires that the data sets being appended to are chunked, as non-chunked data cannot be resized. Consequently, MPI is not allowed for this function either.

Warning

Any potentially set event_id, origin_id, magnitude_id, focal_mechanism_id, provenance_id, or labels will carry over from the trace that is being appended to, so please only use this method if you know what you are doing.

Example

Assuming three traces in three different files, A, B, and C, that could seamlessly be merged into one big trace without producing a gap or overlap, using add_waveforms() will create three separate data arrays in the ASDF file, e.g.

ds.add_waveforms(A, tag="random")
ds.add_waveforms(B, tag="random")
ds.add_waveforms(C, tag="random")

results in:

/Waveforms
    /XX.YYYY
        A
        B
        C

Using this method here on the other hand will (if possible) create a single large array which might be a bit faster to read or to iterate over.

ds.append_waveforms(A, tag="random")
ds.append_waveforms(B, tag="random")
ds.append_waveforms(C, tag="random")

This results in:

/Waveforms
    /XX.YYYY
        A + B + C

Parameters:
  • waveform (obspy.core.stream.Stream, obspy.core.trace.Trace, str, …) – The waveform to add. Can either be an ObsPy Stream or Trace object or something ObsPy can read.

  • tag (str) – The tag that will be given to all waveform files. It is mandatory for all traces and facilitates identification of the data within one ASDF volume. The "raw_recording" tag is, by convention, reserved for raw, recorded, unprocessed data.

property asdf_format_version_in_file

Returns the ASDF format version specified in the file.

property events

Get all events stored in the data set.

Return type:

An ObsPy Catalog object.

property filename

Get the path of the underlying file on the filesystem. Works in most circumstances.

property filesize

Return the current size of the file.

flush()[source]

Flush the underlying HDF5 file.

get_all_coordinates()[source]

Get a dictionary of the coordinates of all stations.
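
The dictionary is keyed by "NETWORK.STATION" codes; the station and values shown here are only illustrative:

>>> ds.get_all_coordinates()
{'BW.RJOB': {'latitude': 47.74,
             'longitude': 12.80,
             'elevation_in_m': 860.0}}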

get_data_for_tag(station_name, tag)[source]

Returns the waveform and station data for the requested station and path.

Parameters:
  • station_name (str) – A string with network id and station id, e.g. "IU.ANMO"

  • tag (str) – The path of the waveform.

Returns:

tuple of the waveform and the inventory.

Return type:

(Stream, Inventory)
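
A quick sketch, assuming waveforms for station BW.RJOB were added under the "obspy_example" tag as in the add_waveforms() examples above:

>>> st, inv = ds.get_data_for_tag("BW.RJOB", "obspy_example")
>>> # st is an obspy Stream, inv the corresponding Inventory (if station data exists).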

get_provenance_document(document_name)[source]

Retrieve a provenance document with a certain name.

Parameters:

document_name – The name of the provenance document to retrieve.
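
For example, retrieving a document previously added under a (hypothetical) name:

>>> doc = ds.get_provenance_document("example_processing")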

get_waveforms(network, station, location, channel, starttime, endtime, tag, automerge=True)[source]

Directly access waveforms.

This enables a more exact specification of what one wants to retrieve from an ASDF file. Most importantly it honors the start and end time and only reads those samples that are actually desired - this is especially important for large, continuous data sets.

Parameters:
  • network (str) – The network code. Can contain wildcards.

  • station (str) – The station code. Can contain wildcards.

  • location (str) – The location code. Can contain wildcards.

  • channel (str) – The channel code. Can contain wildcards.

  • starttime (obspy.core.utcdatetime.UTCDateTime.) – The time of the first sample.

  • endtime (obspy.core.utcdatetime.UTCDateTime.) – The time of the last sample.

  • tag (str) – The tag to extract.

  • automerge (bool) – Automatically merge adjacent traces if they are exactly adjacent (e.g. last sample from previous trace + first sample of next trace are one delta apart).
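
A sketch of a direct query; the codes, tag, and time window are illustrative:

>>> import obspy
>>> st = ds.get_waveforms(network="BW", station="RJOB", location="*",
...                       channel="EH?", tag="obspy_example",
...                       starttime=obspy.UTCDateTime(2009, 8, 24, 0, 20, 3),
...                       endtime=obspy.UTCDateTime(2009, 8, 24, 0, 20, 13))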

ifilter(*query_objects)[source]

Return an iterator containing only the filtered information. Usage is fairly complex; a separate documentation page on Querying a Data Set is available. Here is just a quick example:

>>> for station in ds.ifilter(ds.q.network == "B?",
...                           ds.q.channel == "*Z",
...                           ds.q.starttime >= "2015-01-01"):
...     ...

property mpi

Returns a named tuple with comm, rank, size, and MPI if run with MPI, and False otherwise.

property pretty_filesize

Return a pretty string representation of the size of the file.

process(process_function, output_filename, tag_map, traceback_limit=3, **kwargs)[source]

Process the contents of an ASDF file in parallel.

Applies a function to the contents of the current data set and writes the output to a new ASDF file. Can be run with and without MPI. Please see the Parallel Processing document for more details.

Parameters:
  • process_function (function) – A function with two arguments: an obspy.core.stream.Stream object and an obspy.core.inventory.inventory.Inventory object. It should return an obspy.core.stream.Stream object which will then be written to the new file.

  • output_filename (str) – The output filename. Must not yet exist.

  • tag_map (dict) – A dictionary mapping the input tags to output tags.

  • traceback_limit (int) – The length of the traceback printed if an error occurs in one of the workers.
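
A sketch of a simple processing run; the processing steps, output filename, and tag names are all illustrative choices:

>>> def process_function(stream, inventory):
...     # Hypothetical processing chain: remove the mean and taper each trace.
...     stream.detrend("demean")
...     stream.taper(max_percentage=0.05)
...     return stream
>>> ds.process(process_function,
...            output_filename="processed.h5",
...            tag_map={"raw_recording": "processed"})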

process_two_files_without_parallel_output(other_ds, process_function, traceback_limit=3)[source]

Process data in two data sets.

This is mostly useful for comparing data in two data sets in any number of scenarios. It again takes a function and will apply it to each station that is common to both data sets. Please see the Parallel Processing document for more details.

Can only be run with MPI.

Parameters:
  • other_ds (ASDFDataSet) – The data set to compare to.

  • process_function – The processing function takes two parameters: The station group from this data set and the matching station group from the other data set.

  • traceback_limit (int) – The length of the traceback printed if an error occurs in one of the workers.

Returns:

A dictionary for each station with gathered values. Will only be available on rank 0.
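
A sketch with a trivial comparison function; the second filename and the per-station return value are illustrative, and it is assumed the station groups expose get_waveform_tags():

>>> other_ds = pyasdf.ASDFDataSet("other_file.h5")
>>> def compare(station_group, other_station_group):
...     # Hypothetical comparison: report which tags this data set has for the station.
...     return station_group.get_waveform_tags()
>>> results = ds.process_two_files_without_parallel_output(other_ds, compare)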

validate()[source]

Validate an ASDF file. It currently checks that each waveform file has a corresponding station file.

This does not (by far) replace the actual ASDF format validator.

property waveform_tags

Returns a set with all tags in the dataset.