Introduction to SEIS-PROV

W3C PROV

SEIS-PROV is a domain-specific extension of W3C PROV for seismological data processing and generation. W3C PROV describes a generic data model for provenance, on which SEIS-PROV is based. Please see the W3C PROV website for more details. SEIS-PROV defines a new namespace with entities and activities specific to seismology.

W3C PROV offers a number of different serialization formats, all of which are equivalent in their information content. The seismological community is already used to XML through formats like QuakeML and StationXML, so it makes sense to use the PROV-XML serialization to ease adoption. Nonetheless, you are free to use any serialization format you desire.

This section aims to give a short introduction to SEIS-PROV and W3C PROV. Later sections will detail the available records and their text and graphical representations. We will use examples familiar to seismologists where appropriate. The W3C PROV representations are fairly verbose, so tool support will be vital for the success of SEIS-PROV.

SEIS-PROV Namespace

The namespace of the SEIS-PROV specific types and attributes will most likely change at a certain point and should be considered temporary. Always use the prefix seis_prov to refer to it.

Note

  • prefix: seis_prov
  • namespace: http://seisprov.org/seis_prov/0.1/#

The current version is 0.1 and is not stable!
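
As a minimal sketch of how the prefix and namespace above might be bound in practice, the following Python snippet uses only the standard library's xml.etree.ElementTree. The PROV namespace URI is the one published by the W3C. The record identifier sp001_wf_example and the station attribute are hypothetical and are only there so that both prefixes show up in the output.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.w3.org/ns/prov#"
SEIS_PROV_NS = "http://seisprov.org/seis_prov/0.1/#"  # temporary, version 0.1

# Bind the prefixes so the serializer writes prov:... and seis_prov:...
# instead of auto-generated ones such as ns0.
ET.register_namespace("prov", PROV_NS)
ET.register_namespace("seis_prov", SEIS_PROV_NS)


def qualified(namespace, local_name):
    """Return the '{namespace}local_name' form ElementTree uses for names."""
    return "{%s}%s" % (namespace, local_name)


document = ET.Element(qualified(PROV_NS, "document"))
entity = ET.SubElement(document, qualified(PROV_NS, "entity"))
# Hypothetical identifier, for illustration only.
entity.set(qualified(PROV_NS, "id"), "seis_prov:sp001_wf_example")
# Hypothetical attribute, stored as a child element in the seis_prov namespace.
station = ET.SubElement(entity, qualified(SEIS_PROV_NS, "station"))
station.text = "EXMPL"

xml_text = ET.tostring(document, encoding="unicode")
print(xml_text)
```

Keeping the namespace URI in a single constant means that a later change of the namespace only has to be made in one place.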

Approach to the Extension of W3C PROV

W3C PROV in theory offers ways to properly extend it with new entity types and relations. The downside of that approach is that most tools are not able to deal with it. Since we strive towards a usable and practical provenance description, tool support is vital and should be facilitated by any means possible.

SEIS-PROV extends W3C PROV in a fairly non-intrusive fashion, mainly by adding new attributes to records under the seis_prov namespace. This can be seen as a set of new constraints on top of W3C PROV. It has the big advantage of working with existing tools for W3C PROV. The downside is that standard tools like XML schemas cannot be used to fully validate SEIS-PROV files. Other ways to validate SEIS-PROV files are therefore needed; these are detailed in the Validation section.

Provenance Records

W3C PROV in essence describes a graph consisting of different types of nodes, which are connected by different types of edges. There are three types of nodes in W3C PROV which depict different things. The edges describe different relations between the nodes.

We will first introduce the three different types, each with a short description and a plot.

Entities

An entity is an actual thing with some fixed aspects. In a seismological context an entity is usually some piece of waveform or other data for which provenance is described. In a time series analysis workflow, for example, the data after each step in the processing chain will be described by an entity.

All SEIS-PROV entities are normal prov:entity records with a special prov:type attribute.

The most used entity in SEIS-PROV is the seis_prov:waveform_trace entity, describing a single continuous piece of waveform data. SEIS-PROV furthermore defines seis_prov:cross_correlation, seis_prov:adjoint_source, and other entities. More entities will be added as the need arises.

Each type of entity has a set of (optional) attributes. The seis_prov:waveform_trace entity, for example, has attributes denoting the network, station, location, and channel SEED identifiers, the start time, the sampling rate, the number of samples, and other things.
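
To make this more concrete, here is a hedged sketch of how a seis_prov:waveform_trace entity could be written as PROV-XML with Python's standard library. The identifier, the attribute names (network, station, location, channel, start_time, sampling_rate, number_of_samples), and all values are assumptions made for illustration; the authoritative names are the ones given in the record definitions.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.w3.org/ns/prov#"
SEIS_PROV_NS = "http://seisprov.org/seis_prov/0.1/#"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XSD_NS = "http://www.w3.org/2001/XMLSchema"

for prefix, uri in (("prov", PROV_NS), ("seis_prov", SEIS_PROV_NS), ("xsi", XSI_NS)):
    ET.register_namespace(prefix, uri)


def q(namespace, local_name):
    """Build the '{namespace}local_name' form ElementTree expects."""
    return "{%s}%s" % (namespace, local_name)


def waveform_trace_entity(identifier, attributes):
    """Create a normal prov:entity whose prov:type marks it as a waveform trace."""
    entity = ET.Element(q(PROV_NS, "entity"), {q(PROV_NS, "id"): identifier})
    # The xsd prefix only occurs inside an attribute value, so ElementTree
    # would not declare it on its own; declare it by hand.
    entity.set("xmlns:xsd", XSD_NS)
    prov_type = ET.SubElement(
        entity, q(PROV_NS, "type"), {q(XSI_NS, "type"): "xsd:QName"}
    )
    prov_type.text = "seis_prov:waveform_trace"
    # Each attribute becomes a child element in the seis_prov namespace, which
    # also makes the serializer declare the seis_prov prefix.
    for name, value in attributes.items():
        ET.SubElement(entity, q(SEIS_PROV_NS, name)).text = str(value)
    return entity


trace = waveform_trace_entity(
    "seis_prov:sp001_wf_example",  # hypothetical identifier
    {
        # Hypothetical attribute names and values.
        "network": "XX",
        "station": "ABC",
        "location": "00",
        "channel": "BHZ",
        "start_time": "2013-01-02T12:10:11Z",
        "sampling_rate": 20.0,
        "number_of_samples": 1200,
    },
)
xml_text = ET.tostring(trace, encoding="unicode")
print(xml_text)
```

Because the record stays a plain prov:entity, a generic W3C PROV tool can still read it and simply treats the seis_prov elements as additional attributes.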

Activities

Activities are actions that can change or generate entities. In seismological data processing, each processing step can be seen as an activity that uses the data and generates a new version of it.

A further example of an activity would be a simulation run that generates synthetic waveforms. An event relocation could also be considered an activity, but that information can be stored in the QuakeML file directly, so an identifier stating which event was actually used should be enough. Model generation can be considered an activity, as can adjoint backwards simulations to generate gradients.

Activities can use existing entities, generate new ones, or both. The SEIS-PROV standard defines a number of activities from common processing packages like SAC and ObsPy. Further activities should be added with time. While it is not required, we strongly recommend associating each activity with a software agent; otherwise reproducibility is severely hurt.
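
The idea that every processing step uses one version of the data and generates the next can be sketched with a small in-memory model. This is not a PROV serialization, just plain Python dictionaries. The identifiers and the activity type names seis_prov:detrend and seis_prov:lowpass_filter are assumptions chosen for illustration, as is the helper apply_step.

```python
from itertools import count


def new_document():
    """Empty in-memory provenance document (plain dict, illustration only)."""
    return {
        "entity": {},
        "activity": {},
        "used": [],               # (activity, entity) pairs
        "wasGeneratedBy": [],     # (entity, activity) pairs
        "wasAssociatedWith": [],  # (activity, agent) pairs
    }


def apply_step(document, counter, input_entity, activity_type, software_agent=None):
    """Record one processing step and return the id of the entity it generates."""
    number = next(counter)
    activity_id = "seis_prov:example_activity_%03d" % number
    output_id = "seis_prov:example_trace_%03d" % number
    document["activity"][activity_id] = {"prov:type": activity_type}
    document["entity"][output_id] = {"prov:type": "seis_prov:waveform_trace"}
    document["used"].append((activity_id, input_entity))
    document["wasGeneratedBy"].append((output_id, activity_id))
    # Optional, but strongly recommended for reproducibility.
    if software_agent is not None:
        document["wasAssociatedWith"].append((activity_id, software_agent))
    return output_id


doc = new_document()
ids = count(1)
raw = "seis_prov:example_trace_000"  # hypothetical id of the raw data
doc["entity"][raw] = {"prov:type": "seis_prov:waveform_trace"}

software = "seis_prov:example_software"  # hypothetical software agent id
detrended = apply_step(doc, ids, raw, "seis_prov:detrend", software)
filtered = apply_step(doc, ids, detrended, "seis_prov:lowpass_filter", software)
print(filtered)
```

Two processing steps leave three entities in the document: the raw trace and one new trace per activity.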

Agents

Agents are persons, organizations, or software programs responsible for some activity, entity, or another agent. One can define different relations between the nodes. A classic example of an agent is the software that performed the processing, or the person who steered that software. An agent could also be a group of people or an institution.

SEIS-PROV does not define any new agent types; the ones defined in W3C PROV are sufficient. SEIS-PROV requires each software agent to have seis_prov:software_name, seis_prov:software_version, and seis_prov:website attributes. A human-readable prov:label is recommended. Agents can furthermore have a seis_prov:doi attribute.
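
The requirements on software agents lend themselves to a simple programmatic check. The sketch below assumes that an agent's attributes are available as a plain dictionary and that the version attribute is spelled seis_prov:software_version. The function name check_software_agent and the example values, including the version number, are hypothetical.

```python
REQUIRED_SOFTWARE_AGENT_ATTRIBUTES = (
    "seis_prov:software_name",
    "seis_prov:software_version",
    "seis_prov:website",
)


def check_software_agent(attributes):
    """Return (missing, hints) for a software agent's attribute dictionary.

    missing lists required attributes that are absent or empty; hints lists
    recommended attributes that are absent.
    """
    missing = [
        name for name in REQUIRED_SOFTWARE_AGENT_ATTRIBUTES
        if not attributes.get(name)
    ]
    hints = []
    if not attributes.get("prov:label"):
        hints.append("prov:label")
    return missing, hints


complete_agent = {
    "prov:label": "ObsPy",
    "seis_prov:software_name": "ObsPy",
    "seis_prov:software_version": "1.0.0",  # illustrative version number
    "seis_prov:website": "https://www.obspy.org",
}
incomplete_agent = {"seis_prov:software_name": "ObsPy"}

print(check_software_agent(complete_agent))
print(check_software_agent(incomplete_agent))
```

The optional seis_prov:doi attribute is deliberately not checked, since its absence is neither an error nor worth a hint.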

Relations and the Rest of W3C PROV

W3C PROV has a lot more to offer. All of it can be used in SEIS-PROV but will not be described here; please refer to the W3C PROV specification for more information.

The different types of records described in the previous sections are tied together using relations. There are a number of relations in the W3C PROV data model; the important ones for SEIS-PROV are:

  • Usage (used): Activities make use of entities, thus this is mostly used to note what entities or data went into an activity.
  • Generation (wasGeneratedBy): Entities are generated by activities, thus this is mostly used to show the output of an activity.
  • Association (wasAssociatedWith): Mostly used to show which agent is responsible for a certain activity, e.g. which software performed the filtering operation.
  • Delegation (actedOnBehalfOf): Mostly used to show what person was responsible for steering a piece of software.
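
The four relations in the list above can be pictured as directed edges in a graph. The following sketch stores them as plain (relation, subject, object) tuples and follows them backwards from a processed trace to everything it depends on. All identifiers use a hypothetical ex prefix, and trace_back is an illustrative helper, not part of any PROV library.

```python
# Each tuple reads "subject relation object", following the argument order of
# the W3C PROV relations.
relations = [
    ("used", "ex:lowpass", "ex:raw_trace"),
    ("wasGeneratedBy", "ex:filtered_trace", "ex:lowpass"),
    ("wasAssociatedWith", "ex:lowpass", "ex:software"),
    ("actedOnBehalfOf", "ex:software", "ex:person"),
]


def trace_back(start, relations):
    """Collect every edge reachable from start by following relations."""
    outgoing = {}
    for relation, subject, obj in relations:
        outgoing.setdefault(subject, []).append((relation, obj))

    seen = {start}
    stack = [start]
    edges = []
    while stack:
        node = stack.pop()
        for relation, target in outgoing.get(node, []):
            edges.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return edges


for subject, relation, obj in trace_back("ex:filtered_trace", relations):
    print(subject, relation, obj)
```

Starting from the filtered trace, the walk reaches the activity that generated it, the raw trace that activity used, the software associated with the activity, and the person on whose behalf the software acted.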

If that is confusing it should become clearer in the following sections.