Introduction to SEIS-PROV
=========================
W3C PROV
--------
``SEIS-PROV`` is a domain-specific extension for using
`W3C PROV `_ in the context of
seismological data processing and generation.
``W3C PROV`` describes a generic data model for provenance which ``SEIS-PROV``
is based upon. Please see its website for more details. ``SEIS-PROV`` defines a
new namespace with entities and activities specific to seismology.
``W3C PROV`` offers a number of different serialization formats, all of which
are equivalent in their information content. The seismological community is
already used to XML with formats like QuakeML and StationXML, so it makes sense
to use the `PROV-XML `_ serialization to ease
adoption. Nonetheless, you are free to use any serialization format you desire.
This section aims to give a short introduction to ``SEIS-PROV`` and ``W3C
PROV``. Later sections will detail the available records and text and graphical
representations. We will use examples familiar to seismologists where
appropriate. ``W3C PROV`` representations are fairly verbose, so tool support
will be vital for the success of ``SEIS-PROV``.
SEIS-PROV Namespace
-------------------
The namespace of the ``SEIS-PROV`` specific types and attributes will most
likely change at a certain point and should be considered temporary. Always use
the prefix |NS_PREFIX| to refer to it.
.. note::

    * **prefix:** |NS_PREFIX|
    * **namespace:** |NS_URL|

    The current version is |BOLDVERSION| and is not stable!
Approach to the Extension of W3C PROV
-------------------------------------
``W3C PROV`` in theory offers ways to properly extend it with new entity types
and relations. The downside of that approach is that most tools are not able to
deal with it. Since we strive towards a usable and practical provenance
description, tool support is vital and should be facilitated by any means
possible.
``SEIS-PROV`` extends ``W3C PROV`` in a fairly non-intrusive fashion mainly by
adding new attributes to records under the |NS_PREFIX| namespace. This can be
seen as a set of new constraints on top of ``W3C PROV``. It has the big
advantage of working with existing tools for ``W3C PROV``. The downside is that
no standard tools like XML schemas can be used to fully validate ``SEIS-PROV``
files. Other ways to validate ``SEIS-PROV`` files are therefore needed; they
are detailed in the :doc:`validation` section.
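
The following minimal sketch illustrates why this approach stays compatible
with generic tools. It assumes the prefix is ``seis_prov``, as in the
identifiers used throughout this document, and models a record as a plain
Python dictionary. The record, its attribute names, and the helper
``split_by_prefix`` are hypothetical and only serve as an illustration.

```python
# Hypothetical record: attribute names and values are for illustration only.
record_attributes = {
    "prov:type": "seis_prov:waveform_trace",
    "prov:label": "Waveform Trace",
    "seis_prov:sampling_rate": 20.0,
}


def split_by_prefix(attributes, prefix="seis_prov"):
    """Separate attributes whose name lies in ``prefix`` from all others."""
    marker = prefix + ":"
    extension = {k: v for k, v in attributes.items() if k.startswith(marker)}
    generic = {k: v for k, v in attributes.items() if not k.startswith(marker)}
    return generic, extension


generic, extension = split_by_prefix(record_attributes)
print("Generic W3C PROV attributes:", generic)
print("SEIS-PROV specific attributes:", extension)
```

A generic tool only needs to understand the first group. The second group
consists of ordinary additional attributes that it can carry along unchanged.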
Provenance Records
------------------
``W3C PROV`` in essence describes a graph consisting of different types of
nodes, which are connected by different types of edges. There are three types
of nodes in ``W3C PROV``: entities, activities, and agents. The edges describe
different relations between the nodes.
We will first introduce the three node types, each with a short description
and a plot.
Entities
^^^^^^^^
.. sidebar:: Entity Plot

    .. graphviz:: _generated/dot/entities/waveform_trace_max.dot

    Entities are depicted as yellow ellipses. Attributes are listed in a white
    rectangle. This example shows a waveform trace at a certain point in a
    processing chain.

An entity is an actual thing with some fixed aspects. In a seismological
context an entity is usually some piece of waveform or other data for which
provenance is described. In a time series analysis workflow, for example, the
data after each step in the processing chain will be described by an entity.
All ``SEIS-PROV`` entities are normal ``prov:entity`` records with a special
``prov:type`` attribute.
The most used entity in ``SEIS-PROV`` is the ``seis_prov:waveform_trace``
entity, describing a single continuous piece of waveform data. ``SEIS-PROV``
furthermore defines ``seis_prov:cross_correlation``,
``seis_prov:adjoint_source``, and other entities. More entities will be added
as the need arises.
Each type of entity has a set of (optional) attributes. The
``seis_prov:waveform_trace`` entity, for example, has attributes denoting the
network, station, location, and channel SEED identifiers, the start time,
sampling rate, the number of samples, and other things.
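
As a rough sketch of what such an entity might look like in a PROV-XML style
serialization, the snippet below builds one with Python's standard library.
The ``prov`` namespace URL is the one defined by W3C. The ``seis_prov``
namespace URL is a placeholder, and the record identifier and the attribute
names ``sampling_rate`` and ``number_of_samples`` are assumptions made for
this illustration; the authoritative names are given in the record
definitions.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.w3.org/ns/prov#"
# Placeholder URL for illustration, NOT the real SEIS-PROV namespace.
SEIS_PROV_NS = "http://example.org/seis_prov#"

ET.register_namespace("prov", PROV_NS)
ET.register_namespace("seis_prov", SEIS_PROV_NS)


def qname(namespace, local_name):
    """Return a name in ElementTree's ``{namespace}local_name`` notation."""
    return "{%s}%s" % (namespace, local_name)


document = ET.Element(qname(PROV_NS, "document"))

# A normal prov:entity record; the identifier is hypothetical.
entity = ET.SubElement(
    document,
    qname(PROV_NS, "entity"),
    {qname(PROV_NS, "id"): "seis_prov:example_trace"},
)

# The special prov:type attribute marks it as a waveform trace.
ET.SubElement(entity, qname(PROV_NS, "type")).text = "seis_prov:waveform_trace"

# Domain specific attributes (names assumed for this sketch).
ET.SubElement(entity, qname(SEIS_PROV_NS, "sampling_rate")).text = "20.0"
ET.SubElement(entity, qname(SEIS_PROV_NS, "number_of_samples")).text = "1200"

xml_text = ET.tostring(document, encoding="unicode")
print(xml_text)
```

The sketch omits details such as XSD type annotations on the values, which a
complete serialization would normally carry.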
Activities
^^^^^^^^^^
.. sidebar:: Activity Plot

    .. graphviz:: _generated/dot/activities/lowpass_filter_max.dot

    Activities are shown as blue rectangles. The example shows a simple
    Butterworth lowpass filter.

Activities are actions that can change or generate entities. In seismological
data processing, each processing step can be seen as an activity that uses the
data and generates a new version of it.
A further example of an activity would be a simulation run which generates
some synthetic waveforms. An event relocation could also be considered an
activity, but that information can be stored directly in the QuakeML file, so
an identifier stating which event was actually used should be enough. Model
generation can be considered an activity, as can adjoint backwards simulations
to generate gradients.
Activities can use existing entities and generate new ones. The
``SEIS-PROV`` standard defines a number of activities from common processing
packages like SAC and ObsPy. Further activities should be added over time.
Although it is not required, we **strongly recommend** associating each
activity with a software agent; otherwise reproducibility suffers severely.
Agents
^^^^^^
.. sidebar:: Agent Plot

    .. graphviz:: _generated/dot/examples/simple_agent.dot

    Agents are orange houses. The example shows a certain version of ObsPy.

Agents are persons, organizations, or software programs responsible for some
activity, entity, or another agent. One can define different relations between
the nodes. A classic example is the software that performed the processing
together with the person who steered that software. An agent could also be a
group of people or an institution.
``SEIS-PROV`` does not define any new agent types; the ones defined in ``W3C
PROV`` are sufficient. ``SEIS-PROV`` requires each software agent to have
``seis_prov:software_name``, ``seis_prov:software_version``, and
``seis_prov:website`` attributes. A human-readable ``prov:label`` is
recommended. Agents can furthermore have a ``seis_prov:doi`` attribute.
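
The requirement on software agents can be sketched as a simple check. The
snippet below assumes an agent is represented as a plain Python dictionary of
attributes; the helper ``missing_software_agent_attributes`` and the version
string are hypothetical and not part of any existing tool.

```python
# Attributes required for every software agent, as listed above.
REQUIRED_SOFTWARE_AGENT_ATTRIBUTES = (
    "seis_prov:software_name",
    "seis_prov:software_version",
    "seis_prov:website",
)


def missing_software_agent_attributes(attributes):
    """Return the required attribute names that are absent, in a fixed order."""
    return [
        name
        for name in REQUIRED_SOFTWARE_AGENT_ATTRIBUTES
        if name not in attributes
    ]


# Illustrative agent; the version number is an arbitrary example value.
complete_agent = {
    "prov:type": "prov:SoftwareAgent",
    "prov:label": "ObsPy",
    "seis_prov:software_name": "ObsPy",
    "seis_prov:software_version": "0.10.1",
    "seis_prov:website": "http://www.obspy.org",
}

# The same agent without its website attribute.
incomplete_agent = {
    key: value
    for key, value in complete_agent.items()
    if key != "seis_prov:website"
}

print(missing_software_agent_attributes(complete_agent))
print(missing_software_agent_attributes(incomplete_agent))
```

An empty list means that all required attributes are present. The optional
``prov:label`` and ``seis_prov:doi`` attributes are deliberately not checked.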
Relations and the Rest of W3C PROV
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``W3C PROV`` has a lot more to offer. All of it can be used in ``SEIS-PROV``
but is not described here; please refer to the ``W3C PROV`` specification
for more information.
The different types of records described in the previous sections are tied
together using relations. There are a number of relations in the ``W3C PROV``
data model; the important ones for ``SEIS-PROV`` are:
* ``Usage (used)``: Activities make use of entities, thus this is mostly used
  to note what entities or data went into an activity.
* ``Generation (wasGeneratedBy)``: Entities are generated by activities, thus
  this is mostly used to show the output of an activity.
* ``Association (wasAssociatedWith)``: Mostly used to show which agent is
  responsible for a certain activity, e.g. which software performed the
  filtering operation.
* ``Delegation (actedOnBehalfOf)``: Mostly used to show what person was
  responsible for steering a piece of software.
If that is confusing, it should become clearer in the following sections.
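
To make the four relations a bit more concrete, the following sketch stores
them as plain tuples and walks the resulting graph. All node identifiers
(``raw_trace``, ``lowpass``, ``obspy``, ``analyst``) and both helper functions
are hypothetical; a real document would use the records and tools described
in the later sections.

```python
# Each relation is (relation name, subject, object), following the argument
# order of W3C PROV: used(activity, entity), wasGeneratedBy(entity, activity),
# wasAssociatedWith(activity, agent), actedOnBehalfOf(delegate, responsible).
RELATIONS = [
    ("used", "lowpass", "raw_trace"),
    ("wasGeneratedBy", "filtered_trace", "lowpass"),
    ("wasAssociatedWith", "lowpass", "obspy"),
    ("actedOnBehalfOf", "obspy", "analyst"),
]


def objects_of(relation, subject, relations=RELATIONS):
    """Return all objects linked to ``subject`` by the named relation."""
    return [obj for name, subj, obj in relations
            if name == relation and subj == subject]


def describe(entity, relations=RELATIONS):
    """Collect inputs and responsible agents for a generated entity."""
    summary = {"activities": [], "inputs": [], "software": [], "people": []}
    for activity in objects_of("wasGeneratedBy", entity, relations):
        summary["activities"].append(activity)
        summary["inputs"] += objects_of("used", activity, relations)
        for agent in objects_of("wasAssociatedWith", activity, relations):
            summary["software"].append(agent)
            summary["people"] += objects_of("actedOnBehalfOf", agent, relations)
    return summary


summary = describe("filtered_trace")
print(summary)
```

Starting from the filtered trace, the sketch follows the generation edge to
the filter activity, then the usage edge to the input data, and finally the
association and delegation edges to the software and the person steering it.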