Introduction to SEIS-PROV
SEIS-PROV is a domain-specific extension for using
W3C PROV in the context of
seismological data processing and generation.
W3C PROV describes a generic data model for provenance, upon which SEIS-PROV
is based. Please see its website for more details.
SEIS-PROV defines a
new namespace with entities and activities specific to seismology.
W3C PROV offers a number of different serialization formats, all of which
are equivalent in their information content. The seismological community is
already used to XML with formats like QuakeML and StationXML, so it makes sense
to use the PROV-XML serialization to ease
adoption. Nonetheless, you are free to use any serialization format you desire.
This section aims to give a short introduction to
PROV. Later sections will detail the available records and text and graphical
representations. We will use examples familiar to seismologists where
appropriate. The W3C PROV representations are fairly verbose, so tool support
will be vital for its success.
The namespace of the SEIS-PROV specific types and attributes will most
likely change at some point and should be considered temporary. Always use
the seis_prov prefix to refer to it.
The current version is 0.1 and is not stable!
Approach to the Extension of W3C PROV
W3C PROV in theory offers ways to properly extend it with new entity types
and relations. The downside of that approach is that most tools are not able to
deal with it. Since we strive towards a usable and practical provenance
description, tool support is vital and should be facilitated by any means
possible. SEIS-PROV therefore extends W3C PROV in a fairly non-intrusive
fashion, mainly by adding new attributes to records under the seis_prov
namespace. This can be seen as a set of new constraints on top of W3C PROV.
It has the big advantage of working with existing tools for W3C PROV. The
downside is that no standard tools like XML schemas can be used to fully
validate SEIS-PROV files. It follows that other ways to validate SEIS-PROV
files are needed, which are detailed in the Validation section.
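To illustrate the idea, the following minimal sketch builds a small PROV-XML fragment with Python's standard library xml.etree.ElementTree. The seis_prov namespace URI, the record identifier, and the sampling_rate attribute name are placeholders chosen for illustration only and are not taken from the SEIS-PROV definition.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.w3.org/ns/prov#"
# Placeholder URI for illustration, NOT the official SEIS-PROV namespace.
SEIS_PROV_NS = "http://example.org/seis_prov/0.1/#"

ET.register_namespace("prov", PROV_NS)
ET.register_namespace("seis_prov", SEIS_PROV_NS)


def q(namespace, tag):
    """Return the namespace-qualified name in the form ElementTree expects."""
    return f"{{{namespace}}}{tag}"


def build_document():
    doc = ET.Element(q(PROV_NS, "document"))
    # A plain prov:entity record (the identifier is hypothetical) ...
    entity = ET.SubElement(
        doc, q(PROV_NS, "entity"),
        {q(PROV_NS, "id"): "seis_prov:example_waveform"},
    )
    # ... extended only by an attribute in the seis_prov namespace
    # (the attribute name is hypothetical).
    rate = ET.SubElement(entity, q(SEIS_PROV_NS, "sampling_rate"))
    rate.text = "20.0"
    return doc


xml_text = ET.tostring(build_document(), encoding="unicode")

# A generic PROV consumer still finds an ordinary prov:entity record.
entities = ET.fromstring(xml_text).findall(q(PROV_NS, "entity"))
```

Because the record remains a regular prov:entity, a tool that knows nothing about SEIS-PROV can still read it and simply carries the extra attribute along.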
W3C PROV in essence describes a graph consisting of different types of
nodes, which are connected by different types of edges. There are three types
of nodes in
W3C PROV which depict different things. The edges describe
different relations between the nodes.
We will first introduce the three different types, each with a short description and a plot.
An entity is an actual thing with some fixed aspects. In a seismological context, an entity is usually some piece of waveform or other data for which provenance is described. In a time series analysis workflow, for example, the data after each step in the processing chain will be described by an entity.
SEIS-PROV entities are normal prov:entity records with additional attributes
under the seis_prov namespace. The most used entity in SEIS-PROV is the
seis_prov:waveform_trace entity, describing a single continuous piece of
waveform data. There are also seis_prov:adjoint_source and other entities.
More entities will be added as the need arises.
Each type of entity has a set of (optional) attributes. The
seis_prov:waveform_trace entity, for example, has attributes denoting the
network, station, location, and channel SEED identifiers, the start time,
sampling rate, the number of samples, and other things.
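As a rough sketch of what such an attribute set could look like, the following Python function collects the attributes of a waveform trace entity in a dictionary. The function name and all attribute names are assumptions made for illustration, as is the use of prov:type to mark the entity type. The authoritative names are those in the SEIS-PROV record definitions.

```python
def waveform_trace_attributes(seed_id, start_time, sampling_rate,
                              number_of_samples):
    """Collect attributes for a waveform trace entity.

    All attribute names are illustrative assumptions, not the official ones.
    """
    # A SEED identifier has the form NETWORK.STATION.LOCATION.CHANNEL.
    network, station, location, channel = seed_id.split(".")
    attributes = {
        "prov:type": "seis_prov:waveform_trace",
        "seis_prov:network": network,
        "seis_prov:station": station,
        "seis_prov:location": location,
        "seis_prov:channel": channel,
        "seis_prov:start_time": start_time,
        "seis_prov:sampling_rate": sampling_rate,
        "seis_prov:number_of_samples": number_of_samples,
    }
    # The attributes are optional, so drop those without a value.
    return {key: value for key, value in attributes.items()
            if value is not None and value != ""}


# The location code is empty here and is therefore left out.
attrs = waveform_trace_attributes(
    "BW.FURT..EHZ", "2013-01-02T03:04:05Z", 200.0, 1000)
```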
Activities are actions that can change or generate entities. In seismological data processing, each processing step can be seen as an activity that uses the data and generates a new version of it.
A further example of an activity would be a simulation run which generates some synthetic waveforms. An event relocation could also be considered an activity, but that information can be stored in the QuakeML file directly, so an identifier of the event that was actually used should be enough. Model generation can be considered an activity, as can adjoint backwards simulations to generate gradients.
Activities can use existing entities and generate new ones. The
SEIS-PROV standard defines a number of activities from common processing
packages like SAC and ObsPy. Further activities should be added over time.
While it is not required, we strongly recommend associating each activity
with a software agent; otherwise reproducibility is severely hurt.
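This recommendation is easy to check automatically. The following sketch lists all activities that have no associated software agent. The in-memory representation (plain identifiers, a dictionary of agent types, and a list of association pairs) and all identifiers are assumptions made for this example and are not part of SEIS-PROV.

```python
def activities_without_software_agent(activities, agents, associations):
    """Return the ids of activities lacking an associated software agent.

    activities:   iterable of activity ids
    agents:       dict mapping agent id -> agent type
    associations: iterable of (activity_id, agent_id) pairs
    """
    covered = {
        activity for activity, agent in associations
        if agents.get(agent) == "prov:SoftwareAgent"
    }
    return [activity for activity in activities if activity not in covered]


# Hypothetical example: the low-pass filter is only linked to a person.
activities = ["detrend", "lowpass"]
agents = {"obspy": "prov:SoftwareAgent", "alice": "prov:Person"}
associations = [("detrend", "obspy"), ("lowpass", "alice")]

missing = activities_without_software_agent(activities, agents, associations)
```

Here missing contains only the lowpass activity, because its sole association points to a person and not to a software agent.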
Agents are persons, organizations, or software programs responsible for some activity, entity, or another agent. One can define different relations between the nodes. A classic example of agents is the software that performed the processing and the person who steered that software. An agent could also be a group of people or an institution.
SEIS-PROV does not define any new agent types - the ones defined in
PROV are sufficient.
SEIS-PROV requires each software agent to have a set of attributes, among
them seis_prov:website. A human readable label is recommended. Agents can
furthermore have an additional optional attribute.
Relations and the Rest of W3C PROV
W3C PROV has a lot more to offer. Everything can be used in SEIS-PROV
but will not be described here - please refer to the
W3C PROV specification
for more information.
The different types of records described in the previous sections are tied
together using relations. There are a number of relations in the W3C PROV
data model; the important ones for SEIS-PROV are:
Usage (used): Activities make use of entities, thus this is mostly used to note what entities or data went into an activity.
Generation (wasGeneratedBy): Entities are generated by activities, thus this is mostly used to show the output of an activity.
Association (wasAssociatedWith): Mostly used to show which agent is responsible for a certain activity, e.g. which software performed the filtering operation.
Delegation (actedOnBehalfOf): Mostly used to show what person was responsible for steering a piece of software.
If that is confusing it should become clearer in the following sections.
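As a small illustration of how the four relations tie records together, the following sketch stores them as plain tuples and follows them through a toy provenance graph. The tuple layout, the helper functions, and all identifiers are hypothetical. Only the relation names and their argument order follow W3C PROV.

```python
# Each relation is (name, subject, object), following the PROV argument
# order, e.g. used(activity, entity) and wasGeneratedBy(entity, activity).
RELATIONS = [
    ("used", "lowpass_filter", "raw_trace"),
    ("wasGeneratedBy", "filtered_trace", "lowpass_filter"),
    ("wasAssociatedWith", "lowpass_filter", "obspy"),
    ("actedOnBehalfOf", "obspy", "alice"),
]


def targets(relations, name, subject):
    """Return the objects of all relations `name` starting at `subject`."""
    return [obj for rel, subj, obj in relations
            if rel == name and subj == subject]


def inputs_of(relations, entity):
    """Entities used by the activities that generated `entity`."""
    inputs = []
    for activity in targets(relations, "wasGeneratedBy", entity):
        inputs.extend(targets(relations, "used", activity))
    return inputs


def responsible_for(relations, entity):
    """Follow generation, association, and delegation from `entity`."""
    responsible = []
    for activity in targets(relations, "wasGeneratedBy", entity):
        for agent in targets(relations, "wasAssociatedWith", activity):
            responsible.extend(targets(relations, "actedOnBehalfOf", agent))
    return responsible


source_data = inputs_of(RELATIONS, "filtered_trace")
steered_by = responsible_for(RELATIONS, "filtered_trace")
```

Starting from the filtered trace, the graph answers two typical provenance questions: which data went into it (raw_trace) and who steered the software that produced it (alice).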