Tutorial ======== .. contents:: :local: :depth: 3 .. admonition:: Example Data Set For the time being we will attempt to curate an up-to-date example data set which you can use to get started. `Download it here (220 MB). `_ Learning Python and ObsPy ------------------------- ``pyasdf`` is written in `Python `_ and utilizes the data structures of `ObsPy `_ to allow the construction of modern and efficient workflows. Python is an easy to learn and powerful interactive programming language with an exhaustive scientific ecosystem. The following resources are useful if you are starting out with Python and ObsPy: * `Good, general Python tutorial `_ * `IPython Notebook in Nature `_ * `Introduction to the scientific Python ecosystem `_ * `The ObsPy Documentation `_ * `The ObsPy Tutorial `_ Using pyasdf ------------ This section will teach you the basics of creating and working with ASDF files with `pyasdf`. For more detailed information, please see the :doc:`api` documentation or have a look at the :doc:`examples` section that demonstrates more complex workflows. .. admonition:: Notes on using MPI ``pyasdf`` will behave slightly different depending on whether it is being called with MPI or not. If called with MPI, the underlying HDF5 files will be opened with MPI I/O and fully parallel I/O will be utilized for the processing functions. Keep the scripts reasonably short if using MPI or place appropriate barriers. The module will detect if it has been called with MPI if ``mpi4py`` can be imported and the size of the communicator is greater than one. Thus calling it with ``mpirun/mpiexec`` with only one core will not be recognized as being called with MPI. Due to a limitation of Python you should always delete all references to the :class:`~pyasdf.asdf_data_set.ASDFDataSet` objects at the end. Otherwise the program will potentially not finish when called with MPI. An even better alternative is described in :doc:`usage_as_context_manager`. Creating an ASDF data set ^^^^^^^^^^^^^^^^^^^^^^^^^ Initialization ************** The first step is always to import ``pyasdf`` and create a :class:`~pyasdf.asdf_data_set.ASDFDataSet` object. If the file does not exists, it will be created, otherwise the existing one will be opened. .. code-block:: python >>> import pyasdf >>> ds = pyasdf.ASDFDataSet("new_file.h5", compression="gzip-3") Compression will help to reduce the size of files without affecting the quality as all offered compression options are lossless. Only new data will be compressed and compression will be disable for files created with parallel I/O as these two features are not compatible. See the documentation of the :class:`~pyasdf.asdf_data_set.ASDFDataSet` object for more details. At any point you can get an overview of the contents by printing the object .. code-block:: python >>> print(ds) ASDF file [format version: 1.0.0]: 'new_file.h5' (7.7 KB) Contains 0 event(s) Contains waveform data from 0 station(s). Adding Information about an Earthquake ************************************** ASDF can optionally store events and associate waveforms with a given event. To add an event with ``pyasdf`` use the :meth:`~pyasdf.asdf_data_set.ASDFDataSet.add_quakeml` method. Be aware that all operations will directly write to the file without an explicit *save/write* step. This enables ``pyasdf`` to deal with arbitrarily big data sets. .. code-block:: python >>> ds.add_quakeml("/path/to/quake.xml") >>> print(ds) ASDF file [format version: 1.0.0]: 'new_file.h5' (14.7 KB) Contains 1 event(s) Contains waveform data from 0 station(s). The event can later be accessed again by using the ``event`` attribute. Please be careful that this might contain multiple events and not only one. .. code-block:: python >>> event = ds.events[0] Adding Waveforms **************** The next step is to add waveform data. In this example we will add all SAC files in one folder with the help of the :meth:`~pyasdf.asdf_data_set.ASDFDataSet.add_waveforms` method. Printing the progress is unnecessary but useful when dealing with large amounts of data. There are a couple of subtleties to keep in mind here: * The data will be compressed with the compression settings given during the initialization of the data set object. * It is possible to optionally associate waveforms with a specific event. * You must give a tag. The tag is an additional identifier of that particular waveform. The ``"raw_recording"`` tag is by convention only given to raw, unprocessed data. * The actual type of the data will not change. This example uses SAC which means data is in single precision floating point, MiniSEED will typically be in single precision integer if coming from a data center. If you desire a different data type to be stored for whatever reason you have to manually convert it and pass the ObsPy :class:`~obspy.core.stream.Stream` object. .. code-block:: python >>> import glob >>> files = glob.glob("/path/to/data/*.mseed") >>> for _i, filename in enumerate(files): ... print("Adding file %i of %i ..." % (_i + 1, len(files))) ... ds.add_waveforms(filename, tag="raw_recording", event_id=event) Adding file 1 of 588 ... ... >>> print(ds) ASDF file [format version: 1.0.0]: 'new_file.h5' (169.7 MB) Contains 1 event(s) Contains waveform data from 196 station(s). Adding Station Information ************************** The last step to create a very basic ASDF file is to add station information which is fairly straightforward. .. note:: Please keep in mind that this will potentially rearrange and split the StationXML files to fit within the structure imposed by the ASDF format. StationXML can store any number and combination of networks, stations, and channels whereas ASDF requires a **separate StationXML file per station**. ``pyasdf`` will thus split up files if necessary and also merge them with any possibly already existing StationXML files. .. code-block:: python >>> files = glob.glob("/path/to/stations/*.xml") >>> for _i, filename in enumerate(files): ... print("Adding file %i of %i ..." % (_i + 1, len(files))) ... ds.add_stationxml(filename) Adding file 1 of 196 ... ... >>> print(ds) ASDF file [format version: 1.0.0]: 'new_file.h5' (188.3 MB) Contains 1 event(s) Contains waveform data from 196 station(s). Adding Auxiliary Data ********************* The ASDF format has the capability to store generic non-waveform data called *auxiliary data*. Auxiliary data are data arrays with associcated attributes that can represent anything; each sub-community is supposed to come up with conventions of how to use this. .. code-block:: python >>> import numpy as np >>> data = np.random.random(100) # The type always should be camel case. >>> data_type = "RandomArrays" # Name to identify the particular piece of data. >>> path = "example_array" # Any additional parameters as a Python dictionary which will end up as # attributes of the array. >>> parameters = { ... "a": 1, ... "b": 2.0} >>> ds.add_auxiliary_data(data=data, data_type=data_type, path=path, ... parameters=parameters) >>> print(ds) ASDF file [format version: '1.0.0']: 'new_file.h5' (188.3 MB) Contains 1 event(s) Contains waveform data from 196 station(s). Contains 1 type(s) of auxiliary data: RandomArrays Reading an existing ASDF data set ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once you have acquired an ASDF file by whatever means it is time to read the data. To a large extent reading works by attribute access (meaning tab completion in interactive environments is available). At access time the data is read from an ASDF file an parsed to an ObsPy object. As always, the first step is to open a file. Note that this does not yet read any data. .. code-block:: python >>> import pyasdf >>> ds = pyasdf.ASDFDataSet("example.h5") Reading Events ************** To read event data, simply access the ``event`` attribute. At access time the events will be parsed to an :class:`obspy.core.event.Catalog` object which can used for further analysis. .. code-block:: python >>> type(ds.events) obspy.core.event.Catalog >>> print(ds.events) 4 Event(s) in Catalog: 1998-09-01T10:29:54.500000Z | -58.500, -26.100 | 5.5 Mwc 2012-04-04T14:21:42.300000Z | +41.818, +79.689 | 4.4 mb | manual 2012-04-04T14:18:37.000000Z | +39.342, +41.044 | 4.3 ML | manual 2012-04-04T14:08:46.000000Z | +38.017, +37.736 | 3.0 ML | manual Reading Waveforms and StationXML ******************************** Waveforms and station meta information can be accessed at a per-station granularity under the ``waveforms`` attribute. **Note that tab completion works everywhere so please play around in the IPython shell which is the best way to learn how to use pyasdf.** In the following we will access the data for the ``IU.ANMO`` station. Note that the dot is replaced by an underscore to work around syntax limitations imposed by Python. Once again, at attribute access everything is read from the ASDF file and parsed to an ObsPy object. .. code-block:: python >>> print(ds.waveforms.IU_ANMO) Contents of the data set for station IU.ANMO: - Has a StationXML file - 2 Waveform Tag(s): observed_processed synthetic) >>> type(ds.waveform.IU_ANMO.StationXML)) obspy.core.inventory.inventory.Inventory >>> print(ds.waveform.IU_ANMO.StationXML) Inventory created at 2014-12-10T17:04:09.000000Z Created by: IRIS WEB SERVICE: fdsnws-station | version: 1.1.9 http://service.iris.edu/fdsnws/station/1/query?network=IU&level=res... Sending institution: IRIS-DMC (IRIS-DMC) Contains: Networks (1): IU Stations (1): IU.ANMO (Albuquerque, New Mexico, USA) Channels (6): IU.ANMO..BH1, IU.ANMO..BH2, IU.ANMO..BHU, IU.ANMO..BHV, IU.ANMO..BHW, IU.ANMO..BHZ >>> type(ds.waveforms.IU_ANMO.synthetic) obspy.core.stream.Stream >>> print(ds.waveforms.IU_ANMO.synthetic) 3 Trace(s) in Stream: IU.ANMO.S3.MXE | 1998-09-01T10:29:52.250000Z - 1998-09-01T12:10:19.857558Z | 7.0 Hz, 42300 samples IU.ANMO.S3.MXN | 1998-09-01T10:29:52.250000Z - 1998-09-01T12:10:19.857558Z | 7.0 Hz, 42300 samples IU.ANMO.S3.MXZ | 1998-09-01T10:29:52.250000Z - 1998-09-01T12:10:19.857558Z | 7.0 Hz, 42300 samples Now attribute access is convenient for interactive use, but not that much for programmatic access. A number of better ways are available for that case: .. code-block:: python >>> ds.waveforms.list() ['IU.ADK', 'IU.AFI', 'IU.ANMO', 'IU.ANTO', 'IU.BBSR'] >>> print(ds.waveforms["IU.ANMO"]) Contents of the data set for station IU.ANMO: - Has a StationXML file - 2 Waveform Tag(s): observed_processed synthetic) >>> for station in ds.waveforms: ... print(station) ... break Contents of the data set for station IU.ADK: - Has a StationXML file - 2 Waveform Tag(s): observed_processed synthetic) The object returned for each station is just a helper object easing the access to the waveform and station data. .. code-block:: python >>> sta = ds.waveforms["IU.ANMO"] >>> type(sta) pyasdf.utils.WaveformAccessor ASDF distinguishes waveforms by tags. All waveforms belonging to a certain tag can be accessed with either attribute or key access. .. code-block:: python >>> print(sta.synthetic) 3 Trace(s) in Stream: IU.ANMO.S3.MXE | 1998-09-01T10:29:52.250000Z - 1998-09-01T12:10:19.857558Z | 7.0 Hz, 42300 samples IU.ANMO.S3.MXN | 1998-09-01T10:29:52.250000Z - 1998-09-01T12:10:19.857558Z | 7.0 Hz, 42300 samples IU.ANMO.S3.MXZ | 1998-09-01T10:29:52.250000Z - 1998-09-01T12:10:19.857558Z | 7.0 Hz, 42300 samples >>> sta.synthetic == sta["synthetic"] True Get a list of all tags for a certain station: .. code-block:: python >>> sta.get_waveform_tags() ['observed_processed', 'synthetic'] Sometimes more granular access is required. To that end one can also access waveforms at the individual trace level. .. code-block:: python >>> sta.list() ['IU.ANMO..BHZ__1998-09-01T10:24:49__1998-09-01T12:09:48__observed_processed', 'IU.ANMO.S3.MXE__1998-09-01T10:29:52__1998-09-01T12:10:19__synthetic', 'IU.ANMO.S3.MXN__1998-09-01T10:29:52__1998-09-01T12:10:19__synthetic', 'IU.ANMO.S3.MXZ__1998-09-01T10:29:52__1998-09-01T12:10:19__synthetic', 'StationXML'] >>> print(sta["IU.ANMO.S3.MXE__1998-09-01T10:29:52__1998-09-01T12:10:19__synthetic"]) 1 Trace(s) in Stream: IU.ANMO.S3.MXE | 1998-09-01T10:29:52.250000Z - 1998-09-01T12:10:19.857558Z | 7.0 Hz, 42300 samples The advantage of the ASDF format is that it is also able to store relations so it can for example tell what event a certain waveform is associated with: .. code-block:: python >>> cat = ds.events # The events have to be in memory for the reference to work. >>> print(sta.synthetic[0].stats.asdf.event_ids[0].get_referred_object()) Event: 1998-09-01T10:29:54.500000Z | -58.500, -26.100 | 5.5 Mwc resource_id: ResourceIdentifier(id="smi:service.iris.edu/fdsnws/event/1/query?eventid=656970") event_type: 'earthquake' --------- event_descriptions: 1 Elements focal_mechanisms: 1 Elements origins: 1 Elements magnitudes: 1 Elements Additionally it can also store the provenance associated with a certain waveform trace: .. code-block:: python >>> prov_id = sta.synthetic[0].stats.asdf.provenance_id >>> ds.provenance.get_provenance_document_for_id(prov_id) {'document': , 'name': '373da5fe_d424_4f44_9bca_4334d77ed10b'} >>> ds.provenance.get_provenance_document_for_id(prov_id)["document"].plot() .. graphviz:: _static/example_waveform_simulation.dot Reading Auxiliary Data ********************** ASDF can additionally also store non-waveform data. Access is via the ``auxiliary_data`` attribute. Once again, dictionary access is possible. .. code-block:: python >>> print(ds.auxiliary_data) Data set contains the following auxiliary data types: AdjointSource (12 item(s)) File (1 item(s)) >>> print(ds.auxiliary_data.AdjointSource) print(ds.auxiliary_data.AdjointSource) 12 auxiliary data item(s) of type 'AdjointSource' available: BW_ALFO_EHE BW_ALFO_EHN BW_ALFO_EHZ BW_BLA_EHE BW_BLA_EHN BW_BLA_EHZ IU_ANMO_EHE IU_ANMO_EHN IU_ANMO_EHZ TA_A023_BHE TA_A023_BHN TA_A023_BHZ >>> ds.auxiliary_data.list() ['AdjointSource', 'File'] >>> ds.auxiliary_data.AdjointSource == ds.auxiliary_data["AdjointSource"] True Accessing a single item once again by attribute or key access. .. code-block:: python >>> ds.auxiliary_data.AdjointSource.BW_ALFO_EHE Auxiliary Data of Type 'AdjointSource' Path 'BW_ALFO_EHE' Provenance ID: '{http://seisprov.org/seis_prov/0.1/#}sp012_as_cd84e87' Data shape: '(20000,)', dtype: 'float64' Parameters: adjoint_source_type: multitaper elevation_in_m: 473.232036071 latitude: 57.9589770294 local_depth_in_m: 0.0 longitude: 170.352381909 misfit_value: 1.45e-05 orientation: E sampling_rate_in_hz: 10.0 station_id: BW.ALFO..EHE units: m >>> ds.auxiliary_data.AdjointSource.list() ['BW_ALFO_EHE', 'BW_ALFO_EHN', 'BW_ALFO_EHZ', 'BW_BLA_EHE', 'BW_BLA_EHN', 'BW_BLA_EHZ', 'IU_ANMO_EHE', 'IU_ANMO_EHN', 'IU_ANMO_EHZ', 'TA_A023_BHE', 'TA_A023_BHN', 'TA_A023_BHZ'] >>> ds.auxiliary_data.AdjointSource.BW_ALFO_EHE == ds.auxiliary_data.AdjointSource["BW_ALFO_EHE"] True >>> adj_src = ds.auxiliary_data.AdjointSource.BW_ALFO_EHE >>> adj_src.data >>> adj_src.parameters {'adjoint_source_type': 'multitaper', 'elevation_in_m': 473.23203607130199, 'latitude': 57.958977029361542, 'local_depth_in_m': 0.0, 'longitude': 170.35238190943915, 'misfit_value': 1.45e-05, 'orientation': 'E', 'sampling_rate_in_hz': 10.0, 'station_id': 'BW.ALFO..EHE', 'units': 'm'}