v0.0.1-alpha

Open Science Archive

A standard for the deposition, validation, curation, publication, and discovery of scientific data.

The OSA specification defines a protocol for scientific data archives, enabling interoperability, reproducibility, and trust through structured semantic validation.

Part 1 · Overview

Introduction

Purpose

OSA is a protocol for scientific data archives. It standardizes how data flows through deposition, validation, curation, and publication: the same workflow used by successful platforms like the Protein Data Bank, UniProt, and EMBL-EBI services.

The protocol separates infrastructure from domain logic. It defines the universal mechanics (how submissions move through stages, how validators report metrics, how nodes federate for discovery) while letting each scientific community plug in their own schemas and validation rules. A materials science archive and a genomics repository run the same protocol; only the domain-specific validators differ.

This means any scientific community can deploy professional-grade data infrastructure without rebuilding the data management logic from scratch.

Figure: OSA architecture diagram showing the data flow through validation, curation, storage, and indexing, with pluggable components at each stage.

Motivation

Scientific data infrastructure follows a common pattern. Successful platforms like the Protein Data Bank (PDB), UniProt, Gene Expression Omnibus (GEO), and services at EMBL-EBI all implement the same core workflow: structured deposition, automated validation, expert curation, and programmatic access.

Despite this shared pattern, each new scientific domain rebuilds this infrastructure from scratch. This fragmentation results in:

  • Duplicated effort: Generic pipeline logic is reimplemented rather than reused
  • Inconsistent quality: Ad-hoc validation rules vary wildly across repositories
  • Poor interoperability: Custom APIs prevent unified tooling and federated access
  • High barriers: Emerging fields lack resources to build “PDB-quality” infrastructure

The OSA protocol addresses this by separating infrastructure from domain logic. The protocol defines how data flows through deposition, validation, curation, and publication (the universal “shape” of scientific data management). Domain-specific rules (what makes a protein structure valid vs. a materials dataset) are injected as pluggable components, not hard-coded into the platform.

This separation enables:

  • Reusable implementations: A reference implementation serves as “PDB-in-a-box” for any domain
  • Shared tooling: A dataset browser built for biology works immediately in physics
  • Quality transparency: Machine-readable Traits let consumers filter by verified properties, not just file types
  • Institutional flexibility: Existing platforms can expose data via OSA adapters without migration

By standardizing the infrastructure layer, OSA allows scientific domains to focus resources on what matters: their specific validation rules, curation workflows, and discovery interfaces, not rebuilding basic data pipelines.

Scope

This specification is implementation-agnostic. It defines what must be observable over the network, not how systems are organized internally.

In Scope

  • Protocol resources: Structure and semantics of Depositions, Records, Vocabularies, and Tools
  • State machines: Valid state transitions and invariants
  • API contracts: HTTP endpoints and request/response formats
  • Execution contracts: OCI container interfaces for Validators and Curation Tools
  • Discovery protocols: Node discovery and federation

Out of Scope

  • Internal architecture: How implementations organize code, services, or databases
  • Storage mechanisms: Whether data is stored on S3, local disk, or elsewhere
  • Authentication: Authentication mechanisms will be specified in a future OEP
  • Performance characteristics: Caching strategies, indexing approaches, etc.
  • Domain-specific validators: The protocol defines how to package validators, not what they check

Conformance Language

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Audience

This specification is intended for:

  • Implementers building OSA-compliant ArchiveNodes or Index Nodes
  • Tool developers creating Validators or Curation Tools
  • Client developers building applications that interact with OSA nodes
  • Governance bodies establishing policies for OSA ecosystems

Architecture

Actors

The OSA protocol defines five types of actors:

ArchiveNode : A service that accepts data submissions, orchestrates validation and curation, and publishes immutable Records. The primary write-path actor.

Index Node : A service that computes derived attributes about datasets, stores the results with provenance, and exposes a queryable, federated API. Index Nodes can process data from ArchiveNodes or external sources (e.g., GEO, SRA). They do not store raw data, only computed attribute values.

Validator : An OCI container that computes vocabulary attributes from datasets. Validators emit metric values (e.g., mapped-reads-percent: 78.4), not pass/fail verdicts.

Curation Tool : An interactive OCI container that provides web-based interfaces for human reviewers to inspect and modify datasets.

Client : Any application (web app, CLI, library) that interacts with ArchiveNodes or Index Nodes to submit or retrieve data.

High-Level Flow

A typical dataset journey through OSA involves:

  1. Submission: A Client creates a Deposition on an ArchiveNode, uploads files, and submits for review.
  2. Validation: The ArchiveNode executes Validators to compute quality attributes for the data.
  3. Curation: A human curator uses a Curation Tool (proxied by the ArchiveNode) to review, annotate, or fix issues.
  4. Publication: Once approved, the ArchiveNode creates an immutable Record with a permanent identifier.
  5. Indexing: Index Nodes compute additional attributes and enable federated search across multiple sources.
┌────────┐
│ Client │
└────┬───┘
     │ (1) Create Deposition

┌─────────────┐
│ ArchiveNode │◄──(2) Run Validators
└─────────────┘

     │ (3) Proxy Curation Tool

┌──────────────┐
│   Curator    │
└──────────────┘

     │ (4) Approve → Create Record

┌────────────┐     (5) Index
│ Index Node │◄─────────────┐
└────────────┘              │
     ▲                      │
     │                      │
     └──────────────────────┘
        (6) Search/Discover

Key Concepts

Separation of Primary and Derived Data : ArchiveNodes hold primary data and manage the submission lifecycle. Index Nodes compute and index derived attributes without storing raw data. This separation enables specialisation: institutions run archives, quality-focused groups run indexes.

Metrics, Not Verdicts : Validators emit numeric measurements (e.g., mapped-reads-percent: 78.4), not pass/fail judgments. Users filter datasets at query time with their own thresholds. This enables flexible, use-case-specific quality filtering.

Pluggable Domain Logic : Instead of hard-coding validation rules or curation interfaces, OSA uses OCI containers that can be developed, versioned, and shared independently.

Federation via Gossip : Index Nodes discover each other and share computed attributes through a heartbeat gossip protocol. No central coordination required.

Provenance by Default : Every computed attribute is tagged with which validator produced it, which node ran it, and when. Users can verify claims or filter by trusted sources.

Terminology

Resources

Deposition : A mutable, in-progress dataset submission. The primary resource on the write path. Transitions through states (DRAFT, SUBMITTED, UNDER_REVIEW, APPROVED) before becoming a Record.

Record : An immutable, versioned, published dataset. The primary resource on the read path. Created from an approved Deposition.

Vocabulary : A collection of named attributes with defined types and semantics, inspired by OpenTelemetry’s semantic conventions. Vocabularies define attributes; they do not imply who computes them. Attributes are referenced using the vocab-srn#attribute-name syntax (e.g., urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent).

Attributed Value : A measurement paired with its attribute reference and provenance. Validators emit attributed values. Each value records which validator computed it, which node ran it, and when.

Trait : A named, saved query over vocabulary attributes. Traits are convenience abstractions for common filter patterns (e.g., “high-quality-rnaseq” = mapped-reads-percent >= 70 AND duplicate-percent <= 30). Traits are evaluated at query time, not computed by validators.

Executables

Validator : An OCI container that computes vocabulary attributes from datasets. Validators declare which attributes they emit and produce attributed values with typed measurements. They do not emit pass/fail verdicts.

Curation Tool : An OCI container that exposes a web interface for human inspection and modification of Depositions.

CurationToolRef : A resource describing a Curation Tool: its OCI image, exposed port, and capabilities (read-only vs read-write).

Infrastructure

Structured Resource Name (SRN) : A URN-based, globally unique, location-independent identifier for any OSA resource (e.g., urn:osa:data.example.org:rec:abc123@v1). The domain component enables DNS-based discovery.

Node Document : A JSON file at /.well-known/osa-node.json that advertises a node’s capabilities and API endpoint. Enables decentralized discovery.

Gossip : The protocol by which Index Nodes discover each other and share information about which nodes compute which attributes. Operates via periodic heartbeats to known peers.

Part 2 · Core Protocol

Resources

This section specifies what resources exist in the OSA protocol and what properties they MUST have. Implementations MAY add additional properties but MUST include all required fields.

Deposition

A Deposition represents a dataset in progress. It is mutable until submitted for validation.

Required Properties

| Property | Type | Description |
|----------|------|-------------|
| srn | SRN | Unique identifier (type: dep) |
| status | string | Current state: DRAFT, SUBMITTED, UNDER_REVIEW, or APPROVED |
| metadata | object | User-provided descriptive metadata |
| files | array | List of file objects (see below) |
| created_at | ISO 8601 datetime | Timestamp of creation |
| updated_at | ISO 8601 datetime | Timestamp of last modification |

File Object Structure

Each entry in files MUST include:

| Property | Type | Description |
|----------|------|-------------|
| name | string | Filename |
| size | integer | Size in bytes |
| checksum | string | SHA-256 hash (hex-encoded) |
| uploaded_at | ISO 8601 datetime | Upload timestamp |
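
As an illustration of the file object above, the following sketch builds one entry from a file on disk. The function name `build_file_object` is hypothetical, and the sketch assumes the upload timestamp is taken at the moment the entry is built; the specification does not say how a node derives these values.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def build_file_object(path: Path) -> dict:
    """Build a file entry carrying the four required properties."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large data files are never fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "size": path.stat().st_size,
        "checksum": digest.hexdigest(),  # hex-encoded SHA-256
        # Assumption: "uploaded_at" is the time this entry is created (UTC).
        "uploaded_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

A client can compute the same digest locally before upload and compare it with the value the node reports.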

Optional Properties

Implementations MAY include:

  • validation_runs: Array of ValidationRun objects
  • curator_id: User ID of assigned curator (when in UNDER_REVIEW state)
  • submitted_at: Timestamp when status changed to SUBMITTED

Record

A Record represents an immutable, published dataset. It is created from an approved Deposition.

Required Properties

| Property | Type | Description |
|----------|------|-------------|
| srn | SRN | Unique identifier (type: rec) with version (e.g., @v1) |
| status | string | Current state: PUBLIC, EMBARGOED, or WITHDRAWN |
| metadata | object | Final descriptive metadata |
| files | array | List of published file objects (same structure as Deposition files) |
| provenance | object | Origin information (see below) |
| published_at | ISO 8601 datetime | Timestamp of publication |

Provenance Object Structure

| Property | Type | Description |
|----------|------|-------------|
| source_deposition | SRN | The Deposition this Record was created from |
| approved_by | string | User ID of the approving curator |
| approved_at | ISO 8601 datetime | Approval timestamp |
| attributes | array | List of attributed values computed at time of approval |

Attributed Value Object

Each entry in the attributes array represents a computed measurement:

| Property | Type | Description |
|----------|------|-------------|
| attribute | string | Full attribute reference (vocab-srn#attribute-name) |
| value | any | The computed value (type matches vocabulary definition) |
| validator | SRN | The Validator that computed this value |
| computed_at | ISO 8601 datetime | When the value was computed |

The attributes field enables filtering and discovery based on computed quality metrics. Users can query for datasets matching their own thresholds (e.g., “mapped-reads-percent > 70”) rather than relying on fixed pass/fail verdicts.

Record Versioning

Records are immutable. If changes are needed, a new version MUST be created with an incremented version number (e.g., @v1 → @v2). The new version MUST reference the previous version in its provenance.

Vocabulary

A Vocabulary defines a set of named attributes with standardized semantics, inspired by OpenTelemetry Semantic Conventions. Vocabularies define attributes; they do not imply who computes them. Any node can compute values for any vocabulary’s attributes.

Attribute Reference Syntax

Attributes are referenced using the vocab-srn#attribute-name syntax:

urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent
└─────────────────────────────┘ └──────────────────┘
       vocabulary SRN              attribute name

This syntax:

  • Is globally unique (vocabulary SRN ensures uniqueness)
  • Supports federation (domain in SRN enables DNS-based discovery)
  • Keeps attributes scoped to their vocabulary (the vocabulary is the trust/curation boundary)
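
A minimal sketch of how a consumer might split an attribute reference into its two parts. The helper name `split_attribute_ref` is hypothetical; it relies only on the `#` separator described above.

```python
def split_attribute_ref(ref: str) -> tuple[str, str]:
    """Split 'vocab-srn#attribute-name' into (vocabulary SRN, attribute name)."""
    vocab_srn, separator, name = ref.partition("#")
    # Reject references that lack the separator or have an empty side.
    if not separator or not vocab_srn or not name:
        raise ValueError(f"not an attribute reference: {ref!r}")
    return vocab_srn, name
```

For example, `split_attribute_ref("urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent")` yields the vocabulary SRN `urn:osa:osa.org:vocab:rnaseq@1` and the attribute name `mapped-reads-percent`.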

Required Properties

| Property | Type | Description |
|----------|------|-------------|
| srn | SRN | Unique identifier (type: vocab) |
| title | string | Human-readable name (e.g., “RNA-seq Quality Metrics”) |
| description | string | Purpose and scope of this vocabulary |
| attributes | array | List of attribute definitions (see below) |

Attribute Definition Object

| Property | Type | Description |
|----------|------|-------------|
| name | string | Attribute name within this vocabulary (e.g., mapped-reads-percent) |
| type | string | Data type: string, int, float, boolean, datetime, enum |
| description | string | What this attribute represents |
| unit | string | Unit of measurement (optional, e.g., percent, kelvin, bytes) |
| range | array | Valid range for numeric types (optional, e.g., [0, 100]) |

Example Vocabulary

{
  "srn": "urn:osa:osa.org:vocab:rnaseq@1",
  "title": "RNA-seq Quality Metrics",
  "description": "Quality metrics for RNA sequencing data",
  "attributes": [
    {
      "name": "mapped-reads-percent",
      "type": "float",
      "unit": "percent",
      "range": [0, 100],
      "description": "Percentage of reads aligned to reference genome"
    },
    {
      "name": "duplicate-percent",
      "type": "float",
      "unit": "percent",
      "range": [0, 100],
      "description": "Percentage of PCR/optical duplicate reads"
    },
    {
      "name": "rrna-contamination-percent",
      "type": "float",
      "unit": "percent",
      "range": [0, 100],
      "description": "Percentage of reads mapping to ribosomal RNA"
    },
    {
      "name": "read-count",
      "type": "int",
      "description": "Total number of reads in the dataset"
    }
  ]
}

Attributes from this vocabulary would be referenced as:

  • urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent
  • urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent
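
The sketch below shows one way a node might check an emitted value against an attribute definition such as those in the example vocabulary. It is illustrative only: `check_value` is a hypothetical helper, the range bounds are assumed to be inclusive, and the `datetime` and `enum` types are left out because this section does not define their encoding.

```python
def check_value(definition: dict, value) -> bool:
    """Return True if value fits the definition's type and optional range."""
    kind = definition["type"]
    if kind == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif kind == "float":
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    elif kind == "boolean":
        ok = isinstance(value, bool)
    elif kind == "string":
        ok = isinstance(value, str)
    else:
        # datetime and enum need rules that are not spelled out here.
        raise NotImplementedError(kind)
    if ok and "range" in definition and kind in ("int", "float"):
        low, high = definition["range"]
        ok = low <= value <= high  # assumption: both bounds are inclusive
    return ok
```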

Trait

A Trait is a named, saved query over vocabulary attributes. Traits provide a convenient shorthand for common filter patterns.

Required Properties

| Property | Type | Description |
|----------|------|-------------|
| srn | SRN | Unique identifier (type: trait) |
| title | string | Human-readable name (e.g., “High Quality RNA-seq”) |
| description | string | What this trait represents |
| query | object | Attribute conditions (see below) |

Query Object

The query object maps attribute references to conditions:

{
  "srn": "urn:osa:geo-index.org:trait:high-quality-rnaseq@1",
  "title": "High Quality RNA-seq",
  "description": "RNA-seq datasets with good mapping and low duplication",
  "query": {
    "urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent": { "gte": 70 },
    "urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent": { "lte": 30 }
  }
}

Supported operators: eq, neq, gt, gte, lt, lte, in, exists.
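
To make the query semantics concrete, here is a hedged sketch of evaluating a Trait's query object against one dataset's attribute values. It assumes that all conditions are combined with AND (as in the “high-quality-rnaseq” example), that `exists` takes a boolean operand, and that a missing attribute fails every comparison. None of these details are fixed by the text above, and `matches` is a hypothetical name.

```python
import operator

# One predicate per comparison operator; "exists" is handled separately.
OPERATORS = {
    "eq": operator.eq, "neq": operator.ne,
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
    "in": lambda value, options: value in options,
}


def matches(query: dict, values: dict) -> bool:
    """Return True if values (attribute ref -> value) meet every condition."""
    for attribute, conditions in query.items():
        present = attribute in values
        for op, operand in conditions.items():
            if op == "exists":
                if present != bool(operand):
                    return False
            elif not present or not OPERATORS[op](values[attribute], operand):
                return False
    return True
```

With the example Trait above, a dataset with mapped-reads-percent 78.4 and duplicate-percent 12.3 matches, while one with duplicate-percent 45.0 does not.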

Traits vs Direct Queries

Users can always query attribute values directly without using Traits. Traits are optional UX conveniences that communities may publish as shared standards.

CurationToolRef

A CurationToolRef describes an interactive review tool.

Required Properties

| Property | Type | Description |
|----------|------|-------------|
| srn | SRN | Unique identifier |
| title | string | Human-readable name (e.g., “3D Molecule Viewer”) |
| image | string | OCI image reference (e.g., docker.io/osa/ngl-viewer:1.2.0) |
| default_port | integer | Port the container’s web server listens on |
| capabilities | array | List of strings: read-only and/or read-write |

Identifiers

All resources in the OSA protocol MUST be addressable by a Structured Resource Name (SRN).

SRN Grammar

SRNs MUST follow this URN-based grammar:

urn:osa:{node-id}:{type}:{local-id}[@{version}]

Components

urn:osa : The fixed scheme prefix. All OSA identifiers begin with this.

{node-id} : The globally unique identifier of the originating node, typically a domain name (e.g., osa.org, imperial-mat-sci.ac.uk). Node IDs MUST be DNS-safe.

{type} : A short string identifying the resource type:

  • dep: Deposition
  • rec: Record
  • vocab: Vocabulary
  • schema: Schema definition
  • trait: Trait
  • val: Validator
  • tool: Curation Tool

{local-id} : A node-unique, opaque identifier. Implementations MAY use UUIDs, sequential IDs, or other schemes. Local IDs MUST be URL-safe.

@{version} (optional) : A version identifier. REQUIRED for Records, OPTIONAL for other resources. Versions SHOULD follow Semantic Versioning 2.0 (e.g., @v1.0.0, @v2.3.1).

Examples

urn:osa:osa.org:vocab:materials-core@v1.0.0
urn:osa:osa.org:trait:high-quality-rnaseq@v1.0.0
urn:osa:imperial-mat-sci.ac.uk:dep:xyz789
urn:osa:imperial-mat-sci.ac.uk:rec:xyz789@v1
urn:osa:osa.org:trait:iso8601-dates
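
The grammar can be sketched as a regular expression. The pattern below is an assumption about what “DNS-safe” and “URL-safe” permit (it simply excludes `:`, `@`, and `#` from each component), and `SRN` and `parse_srn` are hypothetical names. The only rule it enforces beyond the shape is that Record SRNs carry a version.

```python
import re
from typing import NamedTuple, Optional

# urn:osa:{node-id}:{type}:{local-id}[@{version}]
SRN_PATTERN = re.compile(
    r"^urn:osa:(?P<node_id>[^:@#]+):(?P<type>[a-z]+):(?P<local_id>[^:@#]+)"
    r"(?:@(?P<version>[^:@#]+))?$"
)


class SRN(NamedTuple):
    node_id: str
    type: str
    local_id: str
    version: Optional[str]


def parse_srn(text: str) -> SRN:
    """Parse an SRN string into its components, or raise ValueError."""
    match = SRN_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an SRN: {text!r}")
    srn = SRN(**match.groupdict())
    if srn.type == "rec" and srn.version is None:
        raise ValueError("Record SRNs require a version component")
    return srn
```

Parsing `urn:osa:imperial-mat-sci.ac.uk:rec:xyz789@v1` gives node ID `imperial-mat-sci.ac.uk`, type `rec`, local ID `xyz789`, and version `v1`.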

Versioning Semantics

Records MUST include versions: Every Record SRN MUST include a version component (e.g., @v1). This enables immutable references.

Other resources MAY include versions: Schemas, Vocabularies, and Tools MAY use versions to track evolution while maintaining backwards compatibility.

Version resolution: When an SRN without a version is dereferenced, the node SHOULD return the latest version.

Lifecycles

Deposition Lifecycle

A Deposition progresses through the following states:

┌─────────┐    submit    ┌───────────┐              ┌──────────────┐
│  DRAFT  │─────────────▶│ SUBMITTED │─────────────▶│ UNDER_REVIEW │
└─────────┘              └───────────┘   (curator   └──────────────┘
     ▲                                     claims)            │
     │                                                        │
     │                                                        ▼
     │                                                  ┌──────────┐
     │                                                  │ APPROVED │
     │                                                  └──────────┘
     │                                                        │
     │                                                        ▼
     └────────────────────────────────────────────    [Record Created]
              (request changes)

State: DRAFT

Entry conditions: Automatically entered when a Deposition is created.

Permitted operations:

  • Metadata MAY be modified
  • Files MAY be uploaded or deleted
  • Validators MAY be run (for pre-submission checks)

Transition to SUBMITTED: A Client MAY request transition to SUBMITTED. The ArchiveNode MUST validate that required metadata fields are present before allowing the transition.

State: SUBMITTED

Entry conditions: Depositor has indicated the submission is complete.

Observable requirements:

  • The Deposition MUST be immutable to the original depositor
  • Configured Validators MUST be executed
  • ValidationRuns MUST be created and linked to the Deposition

Transition to UNDER_REVIEW: The ArchiveNode MAY transition to UNDER_REVIEW when validation is complete.

Transition to DRAFT: If validation fails, the ArchiveNode MAY allow a curator to request changes, returning the Deposition to DRAFT with feedback.

State: UNDER_REVIEW

Entry conditions: The Deposition is awaiting or undergoing human review.

Permitted operations:

  • A curator MAY instantiate Curation Tools to inspect the data
  • Curators MAY modify metadata or files (via Curation Tools) to fix issues
  • The curator MAY add annotations or comments

Transition to APPROVED: The curator MAY approve the Deposition.

Transition to DRAFT: The curator MAY request changes from the depositor.

State: APPROVED

Entry conditions: A curator has approved the Deposition.

Observable requirement: The ArchiveNode MUST create a Record from the approved Deposition. This is a terminal state for the Deposition.
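
The transitions described for the four states can be summarized as a lookup table. This is a sketch of the state machine only; the names `ALLOWED_TRANSITIONS` and `transition` are hypothetical, and authorization (who may trigger each transition) is deliberately left out.

```python
# Target states reachable from each Deposition state, as described above.
ALLOWED_TRANSITIONS = {
    "DRAFT": {"SUBMITTED"},
    "SUBMITTED": {"UNDER_REVIEW", "DRAFT"},
    "UNDER_REVIEW": {"APPROVED", "DRAFT"},
    "APPROVED": set(),  # terminal: the ArchiveNode creates a Record
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not permitted."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```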

Validation Gate

Implementations MAY cache ValidationRuns and reuse them if the Deposition data has not changed. Implementations MUST re-run Validators if files or metadata have been modified since the last run.
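
One possible way to decide whether cached ValidationRuns are still usable is to fingerprint the Deposition's metadata and file checksums. This is a sketch under assumptions: the specification does not prescribe a caching mechanism, and `deposition_fingerprint` and `can_reuse` are hypothetical helpers.

```python
import hashlib
import json


def deposition_fingerprint(deposition: dict) -> str:
    """Hash the parts of a Deposition whose change must trigger a re-run."""
    material = {
        "metadata": deposition["metadata"],
        # Sort so that upload order does not affect the fingerprint.
        "files": sorted((f["name"], f["checksum"]) for f in deposition["files"]),
    }
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_reuse(cached_fingerprint: str, deposition: dict) -> bool:
    """True if nothing relevant changed since the cached run was produced."""
    return cached_fingerprint == deposition_fingerprint(deposition)
```

A node would store the fingerprint alongside each ValidationRun and re-run Validators whenever `can_reuse` returns False.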

Record Lifecycle

Records are immutable after creation. They have a simpler lifecycle:

State: PUBLIC

The Record is openly accessible. This is the default state for newly created Records.

State: WITHDRAWN

The Record has been retracted. The metadata remains visible (with a “WITHDRAWN” marker), but files are no longer accessible. Withdrawals MUST include a reason in the metadata.

Part 3 · Protocol Bindings

ArchiveNode HTTP API

This section defines the HTTP API that conforming ArchiveNodes MUST implement.

General Conventions

Base URL: All endpoints are relative to the ArchiveNode’s base URL (e.g., https://archive.example.org/api/v1).

Authentication: Requests MUST include a Bearer token in the Authorization header:

Authorization: Bearer <token>

Content Type: Request and response bodies MUST use application/json unless otherwise specified.

Error Responses: Errors MUST return appropriate HTTP status codes and a JSON body:

{
  "error": "error_code",
  "message": "Human-readable description"
}

Deposition Endpoints

Create Deposition

POST /depositions

Creates a new Deposition in DRAFT state.

Request Body:

{
  "metadata": {
    "title": "My Dataset"
  }
}

Response (201 Created):

{
  "srn": "urn:osa:example.org:dep:abc123",
  "status": "DRAFT",
  "metadata": {
    "title": "My Dataset"
  },
  "files": [],
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z"
}
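
A client-side sketch of this call using only the Python standard library. It builds the request without sending it; the function name, base URL, and token are placeholders.

```python
import json
import urllib.request


def create_deposition_request(base_url: str, token: str, metadata: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /depositions request."""
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/depositions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; a conforming node is expected to answer with 201 Created and the new Deposition.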

Get Deposition

GET /depositions/{id}

Retrieves a Deposition by its local ID.

Response (200 OK): Full Deposition object (as above).

Update Deposition Metadata

PATCH /depositions/{id}

Updates metadata fields. Only valid in DRAFT state (or UNDER_REVIEW if the requester is a curator).

Request Body:

{
  "metadata": {
    "title": "Crystal Structure of Protein X",
    "authors": ["Alice", "Bob"]
  }
}

Response (200 OK): Updated Deposition object.

Upload File

POST /depositions/{id}/files

Uploads a file to the Deposition.

Request: multipart/form-data with a file field.

Response (201 Created):

{
  "name": "data.cif",
  "size": 1048576,
  "checksum": "sha256:abcdef123456...",
  "uploaded_at": "2024-01-15T10:35:00Z"
}

Delete File

DELETE /depositions/{id}/files/{filename}

Removes a file from the Deposition. Only valid in DRAFT state.

Response (204 No Content)

Submit for Review

POST /depositions/{id}/actions/submit

Transitions the Deposition to SUBMITTED state, triggering validation.

Response (200 OK):

{
  "status": "SUBMITTED",
  "message": "Validation in progress"
}

List Validation Runs

GET /depositions/{id}/validations

Returns all ValidationRuns for this Deposition.

Response (200 OK):

{
  "validations": [
    {
      "validator": "urn:osa:osa.org:val:rnaseq-qc@1.0",
      "executed_at": "2024-01-15T10:36:00Z",
      "attributes": [
        {
          "attribute": "urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent",
          "value": 78.4
        }
      ]
    }
  ]
}

Record Endpoints

List Records

GET /records

Returns a paginated list of public Records.

Query Parameters:

  • page: Page number (default: 1)
  • per_page: Results per page (default: 20, max: 100)

Response (200 OK):

{
  "records": [
    {
      "srn": "urn:osa:example-archive:rec:xyz789@v1",
      "status": "PUBLIC",
      "metadata": { "title": "..." },
      "published_at": "2024-01-15T12:00:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "per_page": 20,
    "total": 150
  }
}
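
The pagination fields allow a client to walk the full listing. The sketch below assumes a caller-supplied `fetch_page(page, per_page)` function (hypothetical) that performs the HTTP request and returns the decoded JSON body shown above.

```python
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int, int], dict], per_page: int = 20) -> Iterator[dict]:
    """Yield every Record summary, requesting pages until total is reached."""
    page, seen = 1, 0
    while True:
        body = fetch_page(page, per_page)
        records = body["records"]
        yield from records
        seen += len(records)
        # Stop on an empty page too, in case total changes between requests.
        if not records or seen >= body["pagination"]["total"]:
            return
        page += 1
```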

Get Record

GET /records/{id}

Retrieves a specific Record by local ID. If no version is specified, returns the latest version.

GET /records/{id}@{version}

Retrieves a specific version of a Record.

Response (200 OK): Full Record object.

Download Record File

GET /records/{id}/files/{filename}

Downloads a file from a Record.

Response (200 OK): File contents (with appropriate Content-Type and Content-Disposition headers).

Curation Endpoints

List Available Tools

GET /depositions/{id}/tools

Returns Curation Tools available for this Deposition.

Response (200 OK):

{
  "tools": [
    {
      "srn": "urn:osa:osa.org:tool:ngl-viewer@1.2.0",
      "title": "3D Molecule Viewer",
      "capabilities": ["read-only"]
    }
  ]
}

Start Curation Session

POST /depositions/{id}/sessions

Starts a Curation Tool session (launches OCI container and returns proxy endpoint).

Request Body:

{
  "tool_srn": "urn:osa:osa.org:tool:ngl-viewer@1.2.0"
}

Response (201 Created):

{
  "session_id": "sess_abc123",
  "status": "provisioning",
  "proxy_endpoint": "/curation/sessions/sess_abc123",
  "expires_at": "2024-01-15T14:00:00Z"
}

Access Curation Tool

GET /curation/sessions/{session_id}/{path}

Proxies requests to the running Curation Tool container.

The ArchiveNode MUST:

  1. Verify the requester owns the session
  2. Forward requests to the container’s web server
  3. Rewrite paths according to OSAP_BASE_PATH

Stop Curation Session

DELETE /curation/sessions/{session_id}

Terminates the Curation Tool container.

Response (204 No Content)

Index Node Protocol

An Index Node computes derived attributes about datasets, stores the results with provenance, and exposes a queryable, federated API. Index Nodes do not store raw data, only computed attribute values.

Responsibilities

A conforming Index Node MUST:

  1. Run validators against datasets to compute vocabulary attributes
  2. Store attributed values with full provenance (which validator, which node, when)
  3. Expose a Search API for querying by attribute conditions
  4. Participate in gossip to enable federated discovery

Index Nodes MAY:

  • Index data from ArchiveNodes (OSA-native records)
  • Index data from external sources (GEO, SRA, ArrayExpress) via source adapters
  • Aggregate results from multiple data sources

Data Model

Index Nodes store attributed values, not raw data:

dataset_id: GSE12345
source: geo
attributes:
  - attribute: urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent
    value: 78.4
    provenance:
      validator: urn:osa:salmon.org:val:salmon@2.1
      node: urn:osa:geo-index.org:node:main
      computed_at: 2025-12-01T12:00:00Z
  - attribute: urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent
    value: 12.3
    provenance:
      validator: urn:osa:picard.org:val:markdup@3.0
      node: urn:osa:geo-index.org:node:main
      computed_at: 2025-12-01T12:05:00Z

Search API

Search by Attribute Conditions

GET /search

Searches datasets by attribute value conditions.

Query Parameters:

  • q: Attribute condition(s) in format attribute:op:value (repeatable)
  • source: Filter by data source (e.g., geo, osa)
  • page: Page number (default: 1)
  • per_page: Results per page (default: 20, max: 100)

Example:

GET /search
  ?q=urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent:gte:70
  &q=urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent:lte:30
  &source=geo
  &per_page=20
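
The example request is shown unencoded for readability. In a real URL the `#` inside an attribute reference would start a fragment, so clients need to percent-encode each condition. A minimal sketch, with `search_url` as a hypothetical helper:

```python
from urllib.parse import urlencode


def search_url(base_url, conditions, source=None, page=1, per_page=20):
    """Build a /search URL from (attribute, op, value) condition triples."""
    # Repeat the q parameter once per condition, in attribute:op:value form.
    params = [("q", f"{attr}:{op}:{value}") for attr, op, value in conditions]
    if source is not None:
        params.append(("source", source))
    params += [("page", page), ("per_page", per_page)]
    # urlencode percent-encodes ':', '@' and '#' inside each value.
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"
```

Because attribute references themselves contain colons, a server parsing `attribute:op:value` cannot simply split on the first colon; how it separates the three parts is not specified here.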

Response (200 OK):

{
  "results": [
    {
      "dataset_id": "GSE12345",
      "source": "geo",
      "attributes": {
        "urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent": {
          "value": 78.4,
          "provenance": {
            "validator": "urn:osa:salmon.org:val:salmon@2.1",
            "node": "urn:osa:geo-index.org:node:main",
            "computed_at": "2025-12-01T12:00:00Z"
          }
        },
        "urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent": {
          "value": 12.3,
          "provenance": { ... }
        }
      }
    }
  ],
  "pagination": {
    "page": 1,
    "per_page": 20,
    "total": 42
  },
  "federated_from": ["urn:osa:geo-index.org:node:main"]
}

Search by Trait

GET /search?trait={trait-srn}

Convenience endpoint that expands a Trait to its underlying attribute conditions.

Example:

GET /search?trait=urn:osa:geo-index.org:trait:high-quality-rnaseq@1

Equivalent to querying with the Trait’s query conditions directly.

Gossip Protocol

Index Nodes discover each other and share information about which nodes compute which attributes through a gossip protocol.

Heartbeat

Each Index Node periodically sends heartbeat messages to its known peers:

POST /gossip/heartbeat

{
  "node_id": "urn:osa:inst-a.edu:node:main",
  "computes": [
    "urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent",
    "urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent"
  ],
  "peers": [
    "urn:osa:inst-b.org:node:metrics",
    "urn:osa:geo-index.org:node:main"
  ],
  "timestamp": "2025-12-07T10:00:00Z"
}

The computes field lists which vocab#attr combinations this node can compute. The peers field shares known peers for network discovery.

Peer Discovery

When a node starts computing attributes from a new vocabulary, it SHOULD:

  1. Resolve the vocabulary author’s domain via DNS
  2. Fetch their Node Document at /.well-known/osa-node.json
  3. Add them as a gossip peer

This ensures vocabulary authors naturally become hubs for their vocabulary’s ecosystem.
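
The first two steps can be sketched as deriving the Node Document URL from a vocabulary SRN. This assumes the node ID is a resolvable domain name and that the document is served over HTTPS; neither is stated explicitly above, and `node_document_url` is a hypothetical helper.

```python
def node_document_url(vocab_srn: str) -> str:
    """Derive the Node Document URL for the node that authored an SRN."""
    parts = vocab_srn.split(":")
    # Expect at least urn:osa:{node-id}:{type}:{local-id}.
    if len(parts) < 5 or parts[:2] != ["urn", "osa"]:
        raise ValueError(f"not an SRN: {vocab_srn!r}")
    node_id = parts[2]
    # Assumption: HTTPS on the node's own domain.
    return f"https://{node_id}/.well-known/osa-node.json"
```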

Gossip Catalog

Each node maintains a local catalog of which nodes compute which attributes:

catalog:
  urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent:
    - node: urn:osa:inst-a.edu:node:main
      last_seen: 2025-12-07T10:00:00Z
    - node: urn:osa:geo-index.org:node:main
      last_seen: 2025-12-06T15:00:00Z

Catalog entries expire after a configurable TTL without a heartbeat refresh.
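
A minimal in-memory sketch of such a catalog, fed by the heartbeat message shown earlier. The class and method names are hypothetical, and expiry is applied lazily at lookup time, which is one of several reasonable strategies.

```python
from datetime import datetime, timedelta


class GossipCatalog:
    """Tracks which nodes compute which attributes, with TTL-based expiry."""

    def __init__(self, ttl: timedelta):
        self.ttl = ttl
        self._last_seen = {}  # attribute ref -> {node ID -> last heartbeat time}

    def record_heartbeat(self, heartbeat: dict) -> None:
        # Accept the trailing "Z" used in the examples as UTC.
        seen = datetime.fromisoformat(heartbeat["timestamp"].replace("Z", "+00:00"))
        for attribute in heartbeat["computes"]:
            self._last_seen.setdefault(attribute, {})[heartbeat["node_id"]] = seen

    def nodes_for(self, attribute: str, now: datetime) -> list:
        """Return nodes whose last heartbeat is still within the TTL."""
        entries = self._last_seen.get(attribute, {})
        return sorted(node for node, seen in entries.items() if now - seen <= self.ttl)
```

The `now` argument must be timezone-aware (for example `datetime.now(timezone.utc)`) so that it can be compared with the parsed heartbeat timestamps.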

Federated Queries

When a node receives a query for an attribute it doesn’t compute locally:

  1. Check gossip catalog for nodes that compute the attribute
  2. Fan out the query to those nodes
  3. Merge results, preserving provenance
  4. Return unified response with federated_from field

POST /federation/query

{
  "attributes": [
    {
      "attribute": "urn:osa:inst-b.org:vocab:qc@1#rrna-percent",
      "op": "lt",
      "value": 5
    }
  ],
  "dataset_ids": ["GSE12345", "GSE67890"],
  "source": "geo"
}
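
The four steps can be sketched as a fan-out and merge. Here `catalog` is a plain mapping from attribute reference to node IDs, and `query_node` is a hypothetical callable that sends the federation query to one node and returns result objects shaped like the Search API results. When two nodes report the same attribute for the same dataset, this simplified version keeps the later one; a fuller implementation would need a policy for that case.

```python
def federated_search(attribute, catalog, query_node):
    """Fan out to nodes computing attribute and merge results per dataset."""
    merged, answered = {}, []
    for node in catalog.get(attribute, []):
        results = query_node(node)  # hypothetical transport call
        answered.append(node)
        for result in results:
            key = (result["source"], result["dataset_id"])
            entry = merged.setdefault(
                key,
                {"dataset_id": result["dataset_id"], "source": result["source"], "attributes": {}},
            )
            # Each attribute value keeps the provenance object it arrived with.
            entry["attributes"].update(result["attributes"])
    return {"results": list(merged.values()), "federated_from": answered}
```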

Trust Model

Every attributed value includes full provenance. Users can:

  • Filter by trusted nodes: “only values from nodes I trust”
  • Filter by validator: “only values computed by validator X”
  • Verify independently: re-run the validator themselves

Nodes MAY implement trust policies for accepting gossiped information (accept all, allowlist, require signatures).

Validator Contract

This section defines the execution contract for Validator OCI containers.

Purpose

Validators are headless, automated programs that compute vocabulary attributes from datasets. They run in sandboxed OCI containers and communicate via files. Validators emit metric values, not pass/fail verdicts; filtering is done at query time.

Validator Manifest

Each validator MUST include a manifest file at /osa/manifest.json within the container:

{
  "srn": "urn:osa:salmon.org:val:salmon@2.1.0",
  "name": "Salmon Quantification",
  "description": "Computes RNA-seq alignment and quantification metrics",
  "emits": [
    "urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent",
    "urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent",
    "urn:osa:osa.org:vocab:sequencing@1#read-count"
  ]
}

The emits field declares which vocab#attr combinations this validator produces, enabling discovery and gossip.

Container Requirements

Packaging

Validators MUST be packaged as OCI-compliant container images (compatible with Docker, Podman, etc.).

Entrypoint

The container’s entrypoint MUST:

  1. Read input data from $OSAP_IN
  2. Compute vocabulary attributes
  3. Write results to $OSAP_OUT/result.json
  4. Exit with code 0 if computation completed (regardless of data quality)

Environment Variables

The host MUST inject:

Variable    Description
OSAP_IN     Path to input directory containing data files
OSAP_OUT    Path to output directory (writable)

Input Format

The $OSAP_IN directory contains:

  • files/: Data files to process
  • metadata.json: Dataset metadata (optional)
  • config.json: Per-run configuration (optional)

Output Format

Validators MUST write $OSAP_OUT/result.json with computed attribute values:

{
  "attributes": [
    {
      "attribute": "urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent",
      "value": 78.4
    },
    {
      "attribute": "urn:osa:osa.org:vocab:rnaseq@1#duplicate-percent",
      "value": 12.3
    },
    {
      "attribute": "urn:osa:osa.org:vocab:sequencing@1#read-count",
      "value": 45000000
    }
  ],
  "logs": [
    "Processed 45M reads in 12m34s",
    "Reference: GRCh38"
  ]
}

Required fields:

  • attributes: Array of computed attribute values

Optional fields:

  • logs: Array of human-readable diagnostic messages
  • errors: Array of error messages (computation still completed)
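
To make the contract concrete, here is a minimal validator body, written as a sketch rather than a reference implementation. It counts the files under files/ and reports the count under a hypothetical attribute reference (urn:osa:example.org:vocab:demo@1#file-count is invented for this example). The function takes the environment as a parameter so it can be exercised outside a container.

```python
import json
import os
from pathlib import Path


def main(environ=None):
    """Minimal validator body: count input files and report the count."""
    env = os.environ if environ is None else environ
    in_dir, out_dir = Path(env["OSAP_IN"]), Path(env["OSAP_OUT"])

    files_dir = in_dir / "files"
    data_files = [p for p in files_dir.iterdir() if p.is_file()] if files_dir.is_dir() else []

    result = {
        "attributes": [
            # Hypothetical attribute reference, used only for this example
            {"attribute": "urn:osa:example.org:vocab:demo@1#file-count",
             "value": len(data_files)}
        ],
        "logs": [f"Found {len(data_files)} input file(s)"],
    }
    out_path = out_dir / "result.json"
    out_path.write_text(json.dumps(result, indent=2))
    return out_path  # returning normally lets the process exit with code 0
```

The image's entrypoint script would simply call main(); an uncaught exception would then produce a non-zero exit code, which the host treats as an error.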

Example Validation Run

# Host prepares input
$ ls $OSAP_IN/files/
sample_R1.fastq.gz  sample_R2.fastq.gz

# Host runs container
$ docker run --rm \
  --network none \
  -v /path/to/input:/input:ro \
  -v /path/to/output:/output \
  -e OSAP_IN=/input \
  -e OSAP_OUT=/output \
  salmon.org/osa/salmon:2.1.0

# Validator writes result
$ cat $OSAP_OUT/result.json
{
  "attributes": [
    {"attribute": "urn:osa:osa.org:vocab:rnaseq@1#mapped-reads-percent", "value": 78.4}
  ]
}

Sandboxing

Nodes MUST run Validators with:

  • No network access (no outbound connections)
  • Read-only input ($OSAP_IN mounted read-only)
  • Limited resources (CPU/RAM limits enforced)
  • Timeout (execution killed after limit, e.g., 30 minutes)
  • Isolated filesystem (no access to host beyond input/output)
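
One way a host might translate these requirements into a container invocation is sketched below, assuming Docker as the runtime. The function names, the default CPU and memory limits, and the /input and /output mount points are choices made for this example; the 1800-second timeout mirrors the 30-minute figure given above.

```python
import subprocess


def build_sandbox_command(image, in_dir, out_dir, cpus="2", memory="4g"):
    """Assemble a `docker run` command line that applies the sandboxing rules."""
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no network access
        "--read-only",                # container root filesystem is read-only
        "--cpus", cpus,               # CPU limit (default is an assumption)
        "--memory", memory,           # RAM limit (default is an assumption)
        "-v", f"{in_dir}:/input:ro",  # read-only input
        "-v", f"{out_dir}:/output",   # writable output
        "-e", "OSAP_IN=/input",
        "-e", "OSAP_OUT=/output",
        image,
    ]


def run_sandboxed(image, in_dir, out_dir, timeout_seconds=1800):
    """Run the validator; raises subprocess.TimeoutExpired past the limit."""
    cmd = build_sandbox_command(image, in_dir, out_dir)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_seconds)
```

Note that the subprocess timeout only stops the docker client process; a real host would also need to stop the container itself when the limit is hit. The --read-only flag is stricter than the list above requires and may need relaxing for validators that use scratch space.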

Error Handling

Scenario                  Behavior
Exit code non-zero        Mark as error, log stderr, no attributes stored
No result.json written    Mark as error, “No result produced”
Timeout exceeded          Kill container, mark as error, “Timeout exceeded”
result.json malformed     Mark as error, “Invalid output format”

Errors are recorded with the dataset. Users can see “validator X failed on this dataset” without it blocking other validators.
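
The error scenarios above can be expressed as a small host-side classifier. This is a sketch: the function name, the (status, message, attributes) return shape, and the wording of the non-zero exit message are assumptions, while the other three messages are taken from the scenarios listed.

```python
import json


def classify_run(exit_code, result_text, timed_out=False):
    """Return (status, message, attributes) for one validator run.

    result_text is the content of result.json, or None if the file is absent.
    """
    if timed_out:
        return ("error", "Timeout exceeded", [])
    if exit_code != 0:
        return ("error", f"Exit code {exit_code}", [])  # message wording is assumed
    if result_text is None:
        return ("error", "No result produced", [])
    try:
        attributes = json.loads(result_text)["attributes"]
        ok = isinstance(attributes, list) and all(
            isinstance(a, dict) and "attribute" in a and "value" in a
            for a in attributes
        )
    except (ValueError, KeyError, TypeError):
        ok = False
    if not ok:
        return ("error", "Invalid output format", [])
    return ("ok", "", attributes)
```

Because each run is classified independently, a failure in one validator does not prevent the results of other validators from being stored.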

Curation Tool Contract

This section defines the execution contract for Curation Tool OCI containers.

Purpose

Curation Tools are interactive web applications that allow human curators to inspect, annotate, and modify Depositions. Unlike Validators, they are long-running and expose HTTP interfaces.

Container Requirements

Packaging

Curation Tools MUST be packaged as OCI-compliant container images.

Web Server

The container MUST run a web server listening on the port specified in its CurationToolRef.default_port (e.g., 8080).

Environment Variables

The ArchiveNode MUST inject:

Variable              Description
OSAP_DEPOSITION_ID    The SRN of the Deposition being curated
OSAP_API_URL          The internal URL of the ArchiveNode’s API (e.g., http://localhost:5000/api/v1)
OSAP_API_TOKEN        A temporary Bearer token with write access to the Deposition
OSAP_BASE_PATH        The public path prefix (e.g., /curation/sessions/sess_abc123)

Base Path Handling

All assets (HTML, CSS, JS, images) MUST be served relative to $OSAP_BASE_PATH.

Example: If the tool serves a stylesheet at /style.css, and OSAP_BASE_PATH=/curation/sessions/sess_abc123, the client must access it at:

https://archive.example.org/curation/sessions/sess_abc123/style.css

Many web frameworks support base path configuration (e.g., Flask’s APPLICATION_ROOT, Express’s app.use(basePath, ...)).
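
For tools that build links by hand rather than through a framework setting, a helper along these lines could prefix every asset path. The function name asset_url is hypothetical; only the OSAP_BASE_PATH variable comes from the contract.

```python
import os
import posixpath


def asset_url(path, base_path=None):
    """Prefix a tool-local path such as /style.css with the session's base path."""
    if base_path is None:
        base_path = os.environ.get("OSAP_BASE_PATH", "")
    # Strip stray slashes so the pieces join with exactly one separator
    return posixpath.join("/", base_path.strip("/"), path.lstrip("/"))
```

With OSAP_BASE_PATH=/curation/sessions/sess_abc123, asset_url("/style.css") yields /curation/sessions/sess_abc123/style.css, which matches the public URL path in the example above.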

Data Access

Reading Data

Curation Tools MAY access Deposition data via:

  1. API calls to $OSAP_API_URL using $OSAP_API_TOKEN
  2. Mounted files (if the ArchiveNode mounts data at /data read-only)

Modifying Data

To modify the Deposition (e.g., fix metadata, delete files), the tool MUST:

  1. Make API calls to $OSAP_API_URL (e.g., PATCH /depositions/{id})
  2. Include Authorization: Bearer $OSAP_API_TOKEN

Tools MUST NOT attempt to write state to the container’s local filesystem. Any data written locally will be lost when the session ends.
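
A tool-side write could be prepared as in the sketch below, which builds the request using only the Python standard library and leaves sending to the caller. The helper name is hypothetical, and whether the {id} path segment is a local ID or a full SRN is not settled in this section, so the example treats it as an opaque string.

```python
import json
import urllib.request


def build_patch_request(api_url, token, deposition_id, changes):
    """Build (but do not send) a PATCH request against the ArchiveNode API."""
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/depositions/{deposition_id}",
        data=json.dumps(changes).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # value of $OSAP_API_TOKEN
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

Inside the container, the tool would read api_url and token from OSAP_API_URL and OSAP_API_TOKEN and pass the resulting object to urllib.request.urlopen.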

Security

Token Scope

The provided OSAP_API_TOKEN:

  • MUST grant read access to the Deposition being curated
  • MUST grant write access ONLY if the tool’s capabilities include read-write
  • MUST expire when the curation session ends
  • SHOULD be limited to the specific Deposition (not all Depositions)

Isolation

Curation Tool containers MUST be isolated:

  • Network restrictions: Outbound internet access SHOULD be blocked unless explicitly required
  • Ephemeral storage: Any data written to the container’s filesystem MUST be discarded when the session ends
  • Resource limits: CPU/RAM limits SHOULD be enforced

Example Curation Session

# ArchiveNode starts container
$ docker run --rm -d \
  -p 8080:8080 \
  -e OSAP_DEPOSITION_ID=urn:osa:example:dep:abc123 \
  -e OSAP_API_URL=http://host.docker.internal:5000/api/v1 \
  -e OSAP_API_TOKEN=eyJhbGc... \
  -e OSAP_BASE_PATH=/curation/sessions/sess_xyz \
  myregistry.io/osa/molecule-viewer:1.2.0

# ArchiveNode proxies requests
# User visits: https://archive.example.org/curation/sessions/sess_xyz
# ArchiveNode forwards to: http://localhost:8080/ (with path rewriting)
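
The path rewriting mentioned in the last comment can be sketched as a prefix strip. This assumes the proxy removes the session's base path before forwarding, which is one plausible reading of the example; the function name is hypothetical.

```python
def rewrite_path(public_path, base_path):
    """Map a public request path to the path the tool container sees.

    Returns None when the request does not belong to this session.
    """
    base = "/" + base_path.strip("/")
    if public_path == base:
        return "/"
    if public_path.startswith(base + "/"):
        return public_path[len(base):]
    return None
```

Matching on base + "/" rather than on the bare prefix prevents a request for sess_xyz2 from being routed to the container that serves sess_xyz.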

Client Requirements

A Client is any application that interacts with ArchiveNodes or Index Nodes. This includes web applications, command-line tools, libraries, and scripts.

Required Behaviors

Conforming Clients MUST:

  1. Use SRNs for resource references: When referencing Depositions or Records, use full SRNs (not local IDs alone).

  2. Authenticate via Bearer tokens: Include Authorization: Bearer <token> in all requests to ArchiveNodes.

  3. Handle standard HTTP status codes:

    • 401 Unauthorized → Prompt for authentication
    • 403 Forbidden → Insufficient permissions
    • 404 Not Found → Resource does not exist
    • 422 Unprocessable Entity → Validation errors (check response body for details)

  4. Respect rate limits: If an ArchiveNode returns 429 Too Many Requests, back off exponentially.
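
Exponential backoff for 429 responses might be implemented as in this sketch. The helper name, the attempt limit, the base delay, and the 10% jitter are all assumptions; the specification only asks that clients back off exponentially.

```python
import random
import time


def send_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out.

    send: zero-argument callable returning an object with a .status_code attribute
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Delay doubles on each attempt (base, 2*base, 4*base, ...) plus jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    return response  # still 429 after the final attempt; the caller decides what to do
```

With the requests library, a call would look like send_with_backoff(lambda: requests.get(url, headers=headers)).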

Optional Behaviors

Clients MAY:

  • Cache Record metadata locally (but SHOULD check ETag or Last-Modified headers)
  • Support multiple ArchiveNodes simultaneously
  • Implement retry logic for transient failures (5xx errors)
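
Cache revalidation with ETag or Last-Modified could follow the pattern below. The cache entry layout (etag, last_modified, and body keys) and both function names are assumptions for illustration; the If-None-Match and If-Modified-Since headers and the 304 status are standard HTTP.

```python
def conditional_headers(cache_entry):
    """Build revalidation headers from a cached response's validators."""
    headers = {}
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers


def resolve_body(cache_entry, status_code, fresh_body):
    """Pick the body to use: the cached copy on 304, the new one on 200."""
    if status_code == 304:
        return cache_entry["body"]
    if status_code == 200:
        return fresh_body
    raise ValueError(f"unexpected status during revalidation: {status_code}")
```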

Example: Submitting a Dataset

import requests

API_BASE = "https://archive.example.org/api/v1"
TOKEN = "your-bearer-token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}  # Sent with every request

# 1. Create Deposition
resp = requests.post(
    f"{API_BASE}/depositions",
    headers=HEADERS,
    json={"metadata": {"title": "My Crystal Structure"}},
)
resp.raise_for_status()  # Stop on 401/403/422 rather than continuing with a bad response
deposition = resp.json()
dep_id = deposition["srn"].split(":")[-1]  # Extract local ID

# 2. Upload file
with open("data.cif", "rb") as f:
    resp = requests.post(
        f"{API_BASE}/depositions/{dep_id}/files",
        headers=HEADERS,
        files={"file": f},
    )
resp.raise_for_status()

# 3. Update metadata
resp = requests.patch(
    f"{API_BASE}/depositions/{dep_id}",
    headers=HEADERS,
    json={"metadata": {"title": "My Crystal Structure"}},
)
resp.raise_for_status()

# 4. Submit
resp = requests.post(
    f"{API_BASE}/depositions/{dep_id}/actions/submit",
    headers=HEADERS,
)
resp.raise_for_status()

Part 4 · Governance

Security & Privacy

Authentication

ArchiveNodes MUST support Bearer token authentication. Tokens SHOULD be obtained via an external OIDC-compatible identity provider.

ArchiveNodes MAY support additional authentication methods (e.g., API keys, mTLS) but MUST support Bearer tokens for interoperability.

Authorization

ArchiveNodes MUST enforce:

  • Depositor isolation: Users can only read/modify Depositions they created (unless they have curator privileges)
  • Curator privileges: Curators can view and modify any Deposition in UNDER_REVIEW state
  • Public read access: Records in PUBLIC state SHOULD be readable without authentication

Node Identity

Each ArchiveNode MUST have a globally unique NodeID and MUST publish a Node Document at:

https://{domain}/.well-known/osa-node.json

Example:

{
  "node_id": "urn:osa:imperial-mat-sci.ac.uk:node:main",
  "version": "1.0.0",
  "api_base": "https://imperial-mat-sci.ac.uk/osa/api/v1",
  "capabilities": ["archive", "index"],
  "peers": [
    "urn:osa:geo-index.org:node:main"
  ]
}
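
A peer resolving another node's identity might build the well-known URL and check the fetched document as sketched here. Treating node_id, version, api_base, and capabilities as required, treating peers as optional, and expecting node_id to start with urn:osa: are assumptions drawn from the example, not stated rules.

```python
import json

# Assumption: these fields are mandatory; "peers" is treated as optional
REQUIRED_FIELDS = ("node_id", "version", "api_base", "capabilities")


def node_document_url(domain):
    """Return the well-known location of a node's Node Document."""
    return f"https://{domain}/.well-known/osa-node.json"


def parse_node_document(text):
    """Parse a fetched Node Document, raising ValueError if it looks incomplete."""
    doc = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in doc]
    if missing:
        raise ValueError(f"node document missing fields: {missing}")
    if not str(doc["node_id"]).startswith("urn:osa:"):
        raise ValueError("node_id does not look like an OSA identifier")
    doc.setdefault("peers", [])
    return doc
```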

Curation Tool Security

Proxy Authentication

The ArchiveNode’s proxy endpoint (/curation/sessions/{session_id}) MUST verify that the requesting user is the owner of the session.

Tool Isolation

Curation Tools MUST be isolated:

  • Network restrictions: Outbound internet access SHOULD be blocked
  • Ephemeral storage: Any data written to the container’s filesystem MUST be discarded when the session ends
  • No persistent state: Tools MUST NOT maintain state between sessions

CSRF Protection

Because Curation Tools are proxied on the ArchiveNode’s domain, they share the same origin as the main application. ArchiveNodes MUST:

  • Enforce SameSite=Strict on session cookies
  • Validate CSRF tokens on all state-changing operations
  • Use separate session tokens for curation (not the main user session)
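
The cookie and token requirements could be met with standard-library pieces, as in this sketch. The cookie name osa_curation_session and the function names are hypothetical, and scoping the cookie to the session's base path is a design choice made here rather than a stated requirement.

```python
import hmac
import secrets
from http.cookies import SimpleCookie


def curation_session_cookie(session_token, base_path):
    """Build a Set-Cookie header value for a curation-only session token."""
    cookie = SimpleCookie()
    cookie["osa_curation_session"] = session_token  # hypothetical cookie name
    morsel = cookie["osa_curation_session"]
    morsel["path"] = base_path       # limit the cookie to this curation session
    morsel["samesite"] = "Strict"    # required by the list above
    morsel["secure"] = True
    morsel["httponly"] = True
    return morsel.OutputString()


def new_csrf_token():
    """Generate an unguessable per-session CSRF token."""
    return secrets.token_urlsafe(32)


def csrf_token_matches(expected, submitted):
    """Compare the stored and submitted CSRF tokens in constant time."""
    return hmac.compare_digest(expected.encode(), (submitted or "").encode())
```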

Data Privacy

ArchiveNodes SHOULD:

  • Encrypt data at rest and in transit (TLS 1.3 or later for transport)
  • Provide mechanisms for embargoing sensitive data
  • Support metadata redaction for withdrawn Records
  • Log all access to private Depositions for audit purposes

Conformance

Conformance Classes

ArchiveNode

A conforming ArchiveNode MUST:

Index Node

A conforming Index Node MUST:

  • Implement the Search API (Index Node Protocol)
  • Index Records from at least one data source (ArchiveNode or external archive via Source Adapter)
  • Store attributed values with full provenance (validator, node, timestamp)
  • Support filtering by vocabulary attributes via query parameters
  • Resolve SRNs (Identifiers)
  • Return results in the specified JSON format
  • Publish a Node Document at /.well-known/osa-node.json
  • Participate in gossip protocol with at least one peer

Validator

A conforming Validator MUST:

  • Be packaged as an OCI container
  • Read input from $OSAP_IN
  • Write result.json to $OSAP_OUT containing an attributes array of attributed values
  • Include attribute (vocab#attr reference) and value in each attributed value
  • Exit with code 0 when computation completes (regardless of data quality)
  • Operate without network access

Curation Tool

A conforming Curation Tool MUST:

  • Be packaged as an OCI container
  • Run a web server on the specified port
  • Serve assets relative to $OSAP_BASE_PATH
  • Use $OSAP_API_TOKEN for all state-changing operations
  • Not write persistent state to local disk

Client

A conforming Client MUST:

  • Use SRNs for resource references
  • Authenticate via Bearer tokens
  • Handle standard HTTP status codes

Version Compatibility

This specification is version 0.0.1-alpha.

Breaking changes (as defined by Semantic Versioning 2.0.0) will increment the major version. Implementations SHOULD advertise their supported spec versions in the Node Document.

Extensibility

OSA Enhancement Proposals (OEPs)

Changes to this specification or standard contracts follow the OEP process. OEPs are published at https://oeps.opensciencearchive.org/.

Namespaced Extensions

Implementations MAY add custom fields to protocol resources using namespaced keys:

{
  "metadata": {
    "title": "My Dataset",
    "x-institution-grant-id": "GRANT-12345",
    "x-institution-internal-id": "abc-xyz-789"
  }
}

Extension keys MUST:

  • Start with x- followed by a unique namespace identifier
  • Use lowercase and hyphens (e.g., x-my-org-field-name)
  • Not conflict with standardized keys

Extensions MUST NOT:

  • Change the semantics of required protocol fields
  • Break interoperability (other implementations MUST be able to ignore unknown fields)
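
A receiving implementation might check extension keys and separate them from standardized fields as sketched below. The regular expression allows digits in addition to lowercase letters and hyphens, which is an assumption; the set of standardized keys is supplied by the caller because this section does not enumerate it.

```python
import re

# Assumption: digits are allowed alongside lowercase letters and hyphens
EXTENSION_KEY = re.compile(r"^x-[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_valid_extension_key(key, standard_keys=frozenset()):
    """True if key follows the x- naming rules and does not shadow a standard key."""
    return bool(EXTENSION_KEY.match(key)) and key not in standard_keys


def split_metadata(metadata, standard_keys):
    """Separate standardized fields from extensions.

    Keys that are neither standardized nor valid extensions are ignored, not rejected.
    """
    known = {k: v for k, v in metadata.items() if k in standard_keys}
    extensions = {k: v for k, v in metadata.items()
                  if is_valid_extension_key(k, standard_keys)}
    return known, extensions
```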