v0.0.1-draft

Open Science Archive

A standard for the deposition, validation, curation, publication, and discovery of scientific data.

The OSA specification defines a protocol for scientific data archives, enabling interoperability, reproducibility, and trust through structured semantic validation.

Part 1: Overview

Introduction

Purpose

This specification defines the Open Science Archive (OSA) protocol—an open, interoperable standard for the deposition, validation, curation, publication, and discovery of scientific data.

OSA is designed to enable multiple implementations while ensuring compatibility through:

  • Standardized wire formats for data exchange
  • Well-defined behavioral contracts for actors in the system
  • Pluggable validation and curation via OCI containers
  • Federated registries for sharing reusable components

Motivation

Scientific data infrastructure follows a common pattern. Successful platforms like the Protein Data Bank (PDB), UniProt, Gene Expression Omnibus (GEO), and services at EMBL-EBI all implement the same core workflow: structured deposition, automated validation, expert curation, and programmatic access.

Despite this shared pattern, each new scientific domain rebuilds this infrastructure from scratch. This fragmentation results in:

  • Duplicated effort: Generic pipeline logic is reimplemented rather than reused
  • Inconsistent quality: Ad-hoc validation rules vary wildly across repositories
  • Poor interoperability: Custom APIs prevent unified tooling and federated access
  • High barriers: Emerging fields lack resources to build “PDB-quality” infrastructure

The OSA protocol addresses this by separating infrastructure from domain logic. The protocol defines how data flows through deposition, validation, curation, and publication—the universal “shape” of scientific data management. Domain-specific rules (what makes a protein structure valid vs. a materials dataset) are injected as pluggable components, not hard-coded into the platform.

This separation enables:

  • Reusable implementations: A reference implementation serves as “PDB-in-a-box” for any domain
  • Shared tooling: A dataset browser built for biology works immediately in physics
  • Quality transparency: Machine-readable SemanticGuarantees let consumers filter by verified properties, not just file types
  • Institutional flexibility: Existing platforms can expose data via OSA adapters without migration

By standardizing the infrastructure layer, OSA allows scientific domains to focus resources on what matters: their specific validation rules, curation workflows, and discovery interfaces—not rebuilding basic data pipelines.

Scope

This specification is implementation-agnostic. It defines what must be observable over the network, not how systems are organized internally.

In Scope

  • Protocol resources: Structure and semantics of Depositions, Records, Profiles, and Tools
  • State machines: Valid state transitions and invariants
  • API contracts: HTTP endpoints and request/response formats
  • Execution contracts: OCI container interfaces for Validators and Curation Tools
  • Registry protocols: Discovery and versioning of shared components

Out of Scope

  • Internal architecture: How implementations organize code, services, or databases
  • Storage mechanisms: Whether data is stored on S3, local disk, or elsewhere
  • Identity providers: OSA assumes external OIDC-compatible authentication
  • Performance characteristics: Caching strategies, indexing approaches, etc.
  • Domain-specific validators: The protocol defines how to package validators, not what they check

Conformance Language

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Audience

This specification is intended for:

  • Implementers building OSA-compliant ArchiveNodes or ViewNodes
  • Tool developers creating Validators or Curation Tools
  • Client developers building applications that interact with OSA nodes
  • Governance bodies establishing policies for OSA ecosystems

Architecture

Actors

The OSA protocol defines five types of actors:

ArchiveNode : A service that accepts data submissions, orchestrates validation and curation, and publishes immutable Records. The primary write-path actor.

ViewNode : A read-only service that indexes and presents Records from one or more ArchiveNodes, enabling search and discovery.

Validator : An OCI container that performs automated quality checks on datasets, testing them against SemanticGuarantees.

Curation Tool : An interactive OCI container that provides web-based interfaces for human reviewers to inspect and modify datasets.

Client : Any application (web app, CLI, library) that interacts with ArchiveNodes or ViewNodes to submit or retrieve data.

High-Level Flow

A typical dataset journey through OSA involves:

  1. Submission: A Client creates a Deposition on an ArchiveNode, uploads files, and submits for review.
  2. Validation: The ArchiveNode executes Validators (defined by the SubmissionProfile) to test the data against SemanticGuarantees.
  3. Curation: A human curator uses a Curation Tool (proxied by the ArchiveNode) to review, annotate, or fix issues.
  4. Publication: Once approved, the ArchiveNode creates an immutable Record with a permanent identifier.
  5. Discovery: ViewNodes index Records from multiple ArchiveNodes, enabling federated search.
┌────────┐
│ Client │
└────┬───┘
     │ (1) Create Deposition
     ▼
┌─────────────┐
│ ArchiveNode │◄──(2) Run Validators
└─────────────┘
     │
     │ (3) Proxy Curation Tool
     ▼
┌──────────────┐
│   Curator    │
└──────────────┘
     │
     │ (4) Approve → Create Record
     ▼
┌──────────┐     (5) Index
│ ViewNode │◄─────────────┐
└──────────┘              │
     ▲                    │
     │                    │
     └────────────────────┘
        (6) Search/Discover

Key Concepts

Separation of Write and Read Paths : ArchiveNodes handle mutable, workflow-heavy submissions. ViewNodes handle immutable, query-optimized discovery. This separation enables specialized implementations for each concern.

Pluggable Domain Logic : Instead of hard-coding validation rules or curation interfaces, OSA uses OCI containers that can be developed, versioned, and shared independently.

Registry-Based Discovery : Schemas, Validators, and Curation Tools are registered with SRNs, enabling reuse across archives and federated governance.

Provenance by Default : Every Record links back to its source Deposition, the Validators that checked it, and the curator who approved it.

Terminology

Resources

Deposition : A mutable, in-progress dataset submission. The primary resource on the write path. Transitions through states (DRAFT, SUBMITTED, UNDER_REVIEW, APPROVED) before becoming a Record.

Record : An immutable, versioned, published dataset. The primary resource on the read path. Created from an approved Deposition.

SubmissionProfile : A template that defines requirements for a type of submission, including the Schema, required SemanticGuarantees, and available Curation Tools.

SemanticGuarantee : A verifiable assertion about data quality or correctness (e.g., “All dates are ISO 8601”, “No missing required fields”). Enforced by a Validator.

ValidationRun : The result of executing a Validator against a Deposition. Contains pass/fail status and diagnostic messages.

Executables

Validator : An OCI container that takes a dataset as input and produces a pass/fail result, testing conformance to a SemanticGuarantee.

Curation Tool : An OCI container that exposes a web interface for human inspection and modification of Depositions.

CurationToolRef : A registry entry describing a Curation Tool: its OCI image, exposed port, and capabilities (read-only vs read-write).

Infrastructure

Structured Resource Name (SRN) : A URN-based, globally unique, location-independent identifier for any OSA resource (e.g., urn:osa:example-archive:rec:abc123@v1).

Registry : A versioned, append-only collection of Schemas, SemanticGuarantees, CurationToolRefs, and SubmissionProfiles. Enables discovery and reuse.

NodeID : A globally unique identifier for an ArchiveNode, used as the authority component in SRNs.

Part 2: Core Protocol

Resources

This section specifies what resources exist in the OSA protocol and what properties they MUST have. Implementations MAY add additional properties but MUST include all required fields.

Deposition

A Deposition represents a dataset in progress. It is mutable until submitted for validation.

Required Properties

Property     Type               Description
srn          SRN                Unique identifier (type: dep)
status       string             Current state: DRAFT, SUBMITTED, UNDER_REVIEW, or APPROVED
profile      SRN                The SubmissionProfile this Deposition targets
metadata     object             User-provided descriptive metadata (structure defined by Profile’s Schema)
files        array              List of file objects (see below)
created_at   ISO 8601 datetime  Timestamp of creation
updated_at   ISO 8601 datetime  Timestamp of last modification

File Object Structure

Each entry in files MUST include:

Property     Type               Description
name         string             Filename
size         integer            Size in bytes
checksum     string             SHA-256 hash (hex-encoded)
uploaded_at  ISO 8601 datetime  Upload timestamp
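
As an illustration of the file object above, the following is a minimal sketch of how an implementation might assemble one entry from a file on disk. The helper name build_file_object is hypothetical, and the timestamp format (UTC with a trailing Z) is an assumption taken from the example responses elsewhere in this document.

```python
import hashlib
import os
from datetime import datetime, timezone


def build_file_object(path: str) -> dict:
    """Build a file entry with the four required properties (hypothetical helper)."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large data files are not read into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    return {
        "name": os.path.basename(path),
        "size": os.path.getsize(path),
        "checksum": sha256.hexdigest(),  # hex-encoded SHA-256
        # Assumption: UTC timestamps with a trailing "Z", as in the API examples.
        "uploaded_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```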

Optional Properties

Implementations MAY include:

  • validation_runs: Array of ValidationRun objects
  • curator_id: User ID of assigned curator (when in UNDER_REVIEW state)
  • submitted_at: Timestamp when status changed to SUBMITTED

Record

A Record represents an immutable, published dataset. It is created from an approved Deposition.

Required Properties

Property      Type               Description
srn           SRN                Unique identifier (type: rec) with version (e.g., @v1)
status        string             Current state: PUBLIC, EMBARGOED, or WITHDRAWN
profile       SRN                The SubmissionProfile (copied from source Deposition)
metadata      object             Final descriptive metadata
files         array              List of published file objects (same structure as Deposition files)
provenance    object             Origin information (see below)
published_at  ISO 8601 datetime  Timestamp of publication

Provenance Object Structure

Property           Type               Description
source_deposition  SRN                The Deposition this Record was created from
approved_by        string             User ID of the approving curator
approved_at        ISO 8601 datetime  Approval timestamp
guarantees         array              List of SemanticGuarantee SRNs with passing ValidationRuns at time of approval

The guarantees field enables filtering and discovery based on verified data quality properties. This allows consumers to programmatically select datasets that meet specific semantic requirements (e.g., “all timestamps are ISO 8601 compliant”) without inspecting individual files.

Record Versioning

Records are immutable. If changes are needed, a new version MUST be created with an incremented version number (e.g., @v1 → @v2). The new version MUST reference the previous version in its provenance.

SubmissionProfile

A SubmissionProfile bundles requirements for a submission type.

Required Properties

Property        Type    Description
srn             SRN     Unique identifier
title           string  Human-readable name (e.g., “Crystallography Dataset”)
schema          SRN     Reference to a Schema definition
guarantees      array   List of requirement objects (see below)
curation_tools  array   List of CurationToolRef SRNs available for this profile

Guarantee Requirement Object

Property       Type     Description
guarantee_srn  SRN      Reference to a SemanticGuarantee
required       boolean  If true, this guarantee MUST pass before approval

SemanticGuarantee

A SemanticGuarantee defines a testable assertion about data quality.

Required Properties

Property     Type    Description
srn          SRN     Unique identifier
title        string  Human-readable name
description  string  What this guarantee asserts
validator    SRN     Reference to the Validator OCI image that tests this guarantee

CurationToolRef

A CurationToolRef describes an interactive review tool.

Required Properties

Property      Type     Description
srn           SRN      Unique identifier
title         string   Human-readable name (e.g., “3D Molecule Viewer”)
image         string   OCI image reference (e.g., docker.io/osa/ngl-viewer:1.2.0)
default_port  integer  Port the container’s web server listens on
capabilities  array    List of strings: read-only and/or read-write

Identifiers

All resources in the OSA protocol MUST be addressable by a Structured Resource Name (SRN).

SRN Grammar

SRNs MUST follow this URN-based grammar:

urn:osa:{node-id}:{type}:{local-id}[@{version}]

Components

urn:osa : The fixed scheme prefix. All OSA identifiers begin with this.

{node-id} : The globally unique identifier of the originating ArchiveNode (e.g., osa-registry, imperial-mat-sci). Node IDs MUST be DNS-safe (alphanumeric and hyphens only).

{type} : A short string identifying the resource type:

  • dep — Deposition
  • rec — Record
  • schema — Schema definition
  • tool — Curation Tool
  • val — Validator
  • guarantee — SemanticGuarantee
  • profile — SubmissionProfile

{local-id} : A node-unique, opaque identifier. Implementations MAY use UUIDs, sequential IDs, or other schemes. Local IDs MUST be URL-safe.

@{version} (optional) : A version identifier. REQUIRED for Records, OPTIONAL for other resources. Versions SHOULD follow Semantic Versioning 2.0 (e.g., @v1.0.0, @v2.3.1).

Examples

urn:osa:osa-registry:profile:crystallography@v1.0.0
urn:osa:imperial-mat-sci:dep:xyz789
urn:osa:imperial-mat-sci:rec:xyz789@v1
urn:osa:osa-registry:guarantee:iso8601-dates

Versioning Semantics

Records MUST include versions: Every Record SRN MUST include a version component (e.g., @v1). This enables immutable references.

Other resources MAY include versions: Schemas, Profiles, and Tools MAY use versions to track evolution while maintaining backwards compatibility.

Version resolution: When an SRN without a version is dereferenced (e.g., in a Profile reference), the registry MUST return the latest version.
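
A registry could implement the “latest version” rule roughly as follows. This is a sketch under stated assumptions: versions are plain MAJOR.MINOR.PATCH numbers with an optional leading v (no pre-release tags), and the function names are hypothetical.

```python
def version_key(version: str) -> tuple:
    """Turn 'v1.10.0' or '2.3.1' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def resolve_latest(unversioned_srn: str, published: list) -> str:
    """Return the highest-versioned SRN in `published` that matches the unversioned SRN."""
    candidates = [s for s in published if s.rpartition("@")[0] == unversioned_srn]
    if not candidates:
        raise LookupError(f"no published version of {unversioned_srn}")
    return max(candidates, key=lambda s: version_key(s.rpartition("@")[2]))
```

Comparing numeric tuples rather than strings matters here: as text, "v1.10.0" sorts before "v1.2.0", which would return the wrong entry.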

Lifecycles

Deposition Lifecycle

A Deposition progresses through the following states:

┌─────────┐    submit    ┌───────────┐              ┌──────────────┐
│  DRAFT  │─────────────▶│ SUBMITTED │─────────────▶│ UNDER_REVIEW │
└─────────┘              └───────────┘   (curator   └──────────────┘
     ▲                                     claims)            │
     │                                                        │
     │                                                        ▼
     │                                                  ┌──────────┐
     │                                                  │ APPROVED │
     │                                                  └──────────┘
     │                                                        │
     │                                                        ▼
     └────────────────────────────────────────────    [Record Created]
              (request changes)

State: DRAFT

Entry conditions: Automatically entered when a Deposition is created.

Permitted operations:

  • Metadata MAY be modified
  • Files MAY be uploaded or deleted
  • Validators MAY be run (for pre-submission checks)

Transition to SUBMITTED: A Client MAY request transition to SUBMITTED. The ArchiveNode MUST validate that required metadata fields are present before allowing the transition.

State: SUBMITTED

Entry conditions: Depositor has indicated the submission is complete.

Observable requirements:

  • The Deposition MUST be immutable to the original depositor
  • All required Validators (as defined by the SubmissionProfile) MUST be executed
  • ValidationRuns MUST be created and linked to the Deposition

Transition to UNDER_REVIEW: The ArchiveNode MAY transition to UNDER_REVIEW when:

  • All required SemanticGuarantees have passing ValidationRuns, OR
  • The SubmissionProfile requires manual curation regardless of validation status

Transition to DRAFT: If validation fails and the SubmissionProfile allows resubmission, the ArchiveNode MAY allow a curator to request changes, returning the Deposition to DRAFT with feedback.

State: UNDER_REVIEW

Entry conditions: The Deposition is awaiting or undergoing human review.

Permitted operations:

  • A curator MAY instantiate Curation Tools to inspect the data
  • Curators MAY modify metadata or files (via Curation Tools) to fix issues
  • The curator MAY add annotations or comments

Transition to APPROVED: The curator MAY approve the Deposition. Before approval, the ArchiveNode MUST verify that all required SemanticGuarantees are satisfied.

Transition to DRAFT: The curator MAY request changes from the depositor.

State: APPROVED

Entry conditions: A curator has approved the Deposition.

Observable requirement: The ArchiveNode MUST create a Record from the approved Deposition. This is a terminal state for the Deposition.
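
The state descriptions above can be summarized as a transition table. The sketch below is one way to encode it; TRANSITIONS, InvalidTransition, and transition are hypothetical names, and the checks that guard each transition (required metadata, the validation gate) are deliberately left out.

```python
# Allowed Deposition transitions, read off the state descriptions above.
TRANSITIONS = {
    "DRAFT": {"SUBMITTED"},
    "SUBMITTED": {"UNDER_REVIEW", "DRAFT"},
    "UNDER_REVIEW": {"APPROVED", "DRAFT"},
    "APPROVED": set(),  # terminal: the ArchiveNode creates a Record
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the table."""


def transition(current: str, target: str) -> str:
    """Return the new state if the move is permitted, otherwise raise."""
    if target not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not permitted")
    return target
```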

Validation Gate

Before a Deposition can transition to APPROVED, the following invariant MUST hold:

For every SemanticGuarantee in the SubmissionProfile where required: true, there MUST exist a ValidationRun with status: "pass".

Implementations MAY cache ValidationRuns and reuse them if the Deposition data has not changed. Implementations MUST re-run Validators if files or metadata have been modified since the last run.
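
The invariant can be checked with a few lines of code. The sketch below assumes the requirement objects use the guarantee_srn and required fields from the SubmissionProfile definition, and that each ValidationRun carries guarantee and status fields as in the API example; the function name is hypothetical, and cache invalidation after edits is not modeled.

```python
def validation_gate_holds(profile_guarantees: list, validation_runs: list) -> bool:
    """True if every required guarantee has at least one passing ValidationRun."""
    passed = {run["guarantee"] for run in validation_runs if run["status"] == "pass"}
    return all(
        req["guarantee_srn"] in passed
        for req in profile_guarantees
        if req["required"]
    )
```

Optional guarantees (required: false) never block approval in this sketch, whether they pass, fail, or were never run.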

Record Lifecycle

Records are immutable after creation. They have a simpler lifecycle:

State: PUBLIC

The Record is openly accessible. This is the default state for newly created Records.

State: EMBARGOED

The Record exists but is not publicly discoverable until a specified date. Embargoes are optional and implementation-specific.

State: WITHDRAWN

The Record has been retracted. The metadata remains visible (with a “WITHDRAWN” marker), but files are no longer accessible. Withdrawals MUST include a reason in the metadata.

Registries

OSA Registries provide authoritative, versioned collections of reusable protocol resources.

Registry Responsibilities

A Registry is a service that:

  1. Stores Schemas, SemanticGuarantees, CurationToolRefs, and SubmissionProfiles
  2. Resolves SRNs to their resource definitions
  3. Enforces versioning using Semantic Versioning 2.0
  4. Maintains immutability (entries are append-only; versions cannot be modified after publication)

Registry Discovery

Every ArchiveNode MUST publish a Node Document at /.well-known/osa-node.json that includes:

{
  "node_id": "imperial-mat-sci",
  "registries": [
    "https://registry.osa.org",
    "https://imperial-mat-sci.ac.uk/osa-registry"
  ]
}

When resolving an SRN, clients SHOULD:

  1. Check if the SRN’s node-id matches a known registry
  2. Query that registry’s resolution endpoint
  3. Fall back to the Global Registry (urn:osa:osa-registry)
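
The resolution order above might be sketched as follows. The function name and the known_registries mapping (node-id to registry URL) are hypothetical, and the Global Registry URL is passed in by the caller rather than assumed.

```python
def registries_to_try(srn: str, known_registries: dict, global_registry_url: str) -> list:
    """Return registry URLs in the order a client should query them for this SRN."""
    node_id = srn.split(":")[2]  # urn:osa:{node-id}:...
    ordered = []
    if node_id in known_registries:
        ordered.append(known_registries[node_id])
    if global_registry_url not in ordered:
        ordered.append(global_registry_url)  # fall back to the Global Registry
    return ordered
```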

Registry Types

Global Registry : The canonical registry at urn:osa:osa-registry, maintained by the OSA governance body. Contains broadly applicable Schemas, Guarantees, and Tools.

Local Registries : Individual ArchiveNodes or institutions MAY maintain their own registries for domain-specific resources. Local registry SRNs use the node’s own node-id as authority.

Registry Entries

All registry entries MUST include:

  • srn: The resource’s identifier
  • version: Semantic version (e.g., 1.0.0)
  • published_at: ISO 8601 timestamp
  • schema: Resource-specific schema (e.g., SubmissionProfile structure)

Governance

Changes to Global Registry entries (especially breaking changes to core Schemas or Profiles) MUST follow the OSA Enhancement Proposal (OEP) process (see §Extensibility).

Part 3: Protocol Bindings

ArchiveNode HTTP API

This section defines the HTTP API that conforming ArchiveNodes MUST implement.

General Conventions

Base URL: All endpoints are relative to the ArchiveNode’s base URL (e.g., https://archive.example.org/api/v1).

Authentication: Requests MUST include a Bearer token in the Authorization header:

Authorization: Bearer <token>

Content Type: Request and response bodies MUST use application/json unless otherwise specified.

Error Responses: Errors MUST return appropriate HTTP status codes and a JSON body:

{
  "error": "error_code",
  "message": "Human-readable description"
}

Deposition Endpoints

Create Deposition

POST /depositions

Creates a new Deposition in DRAFT state.

Request Body:

{
  "profile": "urn:osa:osa-registry:profile:crystallography@v1.0.0"
}

Response (201 Created):

{
  "srn": "urn:osa:example-archive:dep:abc123",
  "status": "DRAFT",
  "profile": "urn:osa:osa-registry:profile:crystallography@v1.0.0",
  "metadata": {},
  "files": [],
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z"
}

Get Deposition

GET /depositions/{id}

Retrieves a Deposition by its local ID.

Response (200 OK): Full Deposition object (as above).

Update Deposition Metadata

PATCH /depositions/{id}

Updates metadata fields. Only valid in DRAFT state (or UNDER_REVIEW if the requester is a curator).

Request Body:

{
  "metadata": {
    "title": "Crystal Structure of Protein X",
    "authors": ["Alice", "Bob"]
  }
}

Response (200 OK): Updated Deposition object.

Upload File

POST /depositions/{id}/files

Uploads a file to the Deposition.

Request: multipart/form-data with a file field.

Response (201 Created):

{
  "name": "data.cif",
  "size": 1048576,
  "checksum": "sha256:abcdef123456...",
  "uploaded_at": "2024-01-15T10:35:00Z"
}

Delete File

DELETE /depositions/{id}/files/{filename}

Removes a file from the Deposition. Only valid in DRAFT state.

Response (204 No Content)

Submit for Review

POST /depositions/{id}/actions/submit

Transitions the Deposition to SUBMITTED state, triggering validation.

Response (200 OK):

{
  "status": "SUBMITTED",
  "message": "Validation in progress"
}

List Validation Runs

GET /depositions/{id}/validations

Returns all ValidationRuns for this Deposition.

Response (200 OK):

{
  "validations": [
    {
      "guarantee": "urn:osa:osa-registry:guarantee:iso8601-dates",
      "status": "pass",
      "executed_at": "2024-01-15T10:36:00Z",
      "messages": ["All date fields are valid ISO 8601"]
    }
  ]
}

Record Endpoints

List Records

GET /records

Returns a paginated list of public Records.

Query Parameters:

  • page: Page number (default: 1)
  • per_page: Results per page (default: 20, max: 100)

Response (200 OK):

{
  "records": [
    {
      "srn": "urn:osa:example-archive:rec:xyz789@v1",
      "status": "PUBLIC",
      "metadata": { "title": "..." },
      "published_at": "2024-01-15T12:00:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "per_page": 20,
    "total": 150
  }
}

Get Record

GET /records/{id}

Retrieves a specific Record by local ID. If no version is specified, returns the latest version.

GET /records/{id}@{version}

Retrieves a specific version of a Record.

Response (200 OK): Full Record object.

Download Record File

GET /records/{id}/files/{filename}

Downloads a file from a Record.

Response (200 OK): File contents (with appropriate Content-Type and Content-Disposition headers).

Curation Endpoints

List Available Tools

GET /depositions/{id}/tools

Returns Curation Tools available for this Deposition (as defined by its SubmissionProfile).

Response (200 OK):

{
  "tools": [
    {
      "srn": "urn:osa:osa-registry:tool:ngl-viewer@1.2.0",
      "title": "3D Molecule Viewer",
      "capabilities": ["read-only"]
    }
  ]
}

Start Curation Session

POST /depositions/{id}/sessions

Starts a Curation Tool session (launches OCI container and returns proxy endpoint).

Request Body:

{
  "tool_srn": "urn:osa:osa-registry:tool:ngl-viewer@1.2.0"
}

Response (201 Created):

{
  "session_id": "sess_abc123",
  "status": "provisioning",
  "proxy_endpoint": "/curation/sessions/sess_abc123",
  "expires_at": "2024-01-15T14:00:00Z"
}

Access Curation Tool

GET /curation/sessions/{session_id}/{path}

Proxies requests to the running Curation Tool container.

The ArchiveNode MUST:

  1. Verify the requester owns the session
  2. Forward requests to the container’s web server
  3. Rewrite paths according to OSAP_BASE_PATH

Stop Curation Session

DELETE /curation/sessions/{session_id}

Terminates the Curation Tool container.

Response (204 No Content)

ViewNode Protocol

A ViewNode provides read-only, indexed access to Records from one or more ArchiveNodes.

Responsibilities

A conforming ViewNode MUST:

  1. Fetch Records from ArchiveNodes via their HTTP API
  2. Index metadata to enable search
  3. Expose a Search API (defined below)
  4. Update its index when new Records are published

ViewNodes MAY:

  • Maintain local copies of Record files
  • Compute derived metadata (e.g., thumbnails, text extracts)
  • Aggregate Records from multiple ArchiveNodes

Search API

Search Records

GET /search

Searches indexed Records.

Query Parameters:

  • q: Query string (implementation-specific syntax)
  • filters: JSON-encoded filter object (optional)
  • guarantees: Comma-separated list of SemanticGuarantee SRNs (returns only Records satisfying all listed guarantees)
  • page: Page number (default: 1)
  • per_page: Results per page (default: 20, max: 100)

Response (200 OK):

{
  "results": [
    {
      "srn": "urn:osa:example-archive:rec:abc123@v1",
      "title": "Crystal Structure of Protein X",
      "published_at": "2024-01-15T12:00:00Z",
      "archive_node": "https://example-archive.org",
      "guarantees": [
        "urn:osa:osa-registry:guarantee:iso8601-dates",
        "urn:osa:osa-registry:guarantee:valid-metadata"
      ]
    }
  ],
  "pagination": {
    "page": 1,
    "per_page": 20,
    "total": 42
  }
}

Example with guarantee filtering:

GET /search?q=protein+structures&guarantees=urn:osa:osa-registry:guarantee:iso8601-dates,urn:osa:osa-registry:guarantee:valid-cif

Returns only protein structure Records that have validated against both the ISO 8601 date guarantee and the CIF format guarantee.
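
A client might build such a query as shown below. This is a sketch: build_search_url is a hypothetical helper and the ViewNode base URL is a placeholder. Note that urlencode percent-encodes the colons and commas in the SRN list, which a server decodes back to the form shown above.

```python
from urllib.parse import urlencode


def build_search_url(view_node_base: str, query: str, guarantees=(), page: int = 1, per_page: int = 20) -> str:
    """Assemble a GET /search URL with an optional guarantee filter."""
    params = {"q": query, "page": page, "per_page": per_page}
    if guarantees:
        # Comma-separated list: only Records satisfying all of them are returned.
        params["guarantees"] = ",".join(guarantees)
    return f"{view_node_base}/search?{urlencode(params)}"


url = build_search_url(
    "https://view.example.org",  # placeholder ViewNode base URL
    "protein structures",
    guarantees=[
        "urn:osa:osa-registry:guarantee:iso8601-dates",
        "urn:osa:osa-registry:guarantee:valid-cif",
    ],
)
```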

Get Indexed Record

GET /records/{srn}

Retrieves a Record by its full SRN (URL-encoded).

Example: GET /records/urn%3Aosa%3Aexample-archive%3Arec%3Aabc123%40v1

Response (200 OK): Full Record object, with additional field:

{
  "srn": "urn:osa:example-archive:rec:abc123@v1",
  ...,
  "source_archive": "https://example-archive.org/api/v1"
}
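
The URL-encoded form of an SRN can be produced with the standard library. In this sketch, record_path is a hypothetical helper; passing safe="" ensures the colons and the @ are percent-encoded as in the example request.

```python
from urllib.parse import quote


def record_path(srn: str) -> str:
    """Return the ViewNode path for a Record, with the SRN fully percent-encoded."""
    return "/records/" + quote(srn, safe="")


path = record_path("urn:osa:example-archive:rec:abc123@v1")
# path == "/records/urn%3Aosa%3Aexample-archive%3Arec%3Aabc123%40v1"
```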

Synchronization

ViewNodes MUST implement one of the following strategies:

Pull-based: Periodically poll ArchiveNodes for new Records via GET /records.

Push-based: ArchiveNodes MAY notify ViewNodes via webhooks (implementation-specific).

Federation protocol (optional): A future version of this spec may define a standardized sync protocol.
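
For the pull-based strategy, a ViewNode has to walk every page of GET /records. The sketch below shows the pagination loop only; iter_records is a hypothetical name, and fetch_page stands in for whatever HTTP call the implementation makes (for example, a GET to the ArchiveNode’s /records endpoint with page and per_page parameters) and is expected to return the decoded JSON body.

```python
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every Record summary by following the pagination object page by page."""
    page = 1
    while True:
        body = fetch_page(page)
        yield from body["records"]
        pagination = body["pagination"]
        seen_all = page * pagination["per_page"] >= pagination["total"]
        if seen_all or not body["records"]:
            return
        page += 1
```

Injecting fetch_page keeps the loop testable without a network, and leaves authentication, retries, and incremental polling (which this sketch does not address) to the caller.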

Validator Contract

This section defines the execution contract for Validator OCI containers.

Purpose

Validators are headless, automated programs that test datasets against SemanticGuarantees. They run in sandboxed OCI containers and communicate via files.

Container Requirements

Packaging

Validators MUST be packaged as OCI-compliant container images (compatible with Docker, Podman, etc.).

Entrypoint

The container’s entrypoint MUST:

  1. Read input data from $OSAP_IN
  2. Perform validation checks
  3. Write results to $OSAP_OUT/result.json
  4. Exit with code 0 (regardless of pass/fail; the result determines pass/fail)

Environment Variables

The ArchiveNode MUST inject:

Variable  Description
OSAP_IN   Path to input directory containing metadata.json and data files
OSAP_OUT  Path to output directory (writable)

Input Format

The $OSAP_IN directory contains:

  • metadata.json: The Deposition’s metadata object
  • One or more data files (as uploaded by the depositor)

Output Format

Validators MUST write $OSAP_OUT/result.json:

{
  "status": "pass",
  "messages": [
    "Checked 500 rows, all valid.",
    "No missing required fields."
  ]
}

Required fields:

  • status: Either "pass" or "fail"
  • messages: Array of human-readable strings (diagnostic info)

Optional fields:

  • errors: Array of specific error objects (structure is validator-defined)
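
To make the contract concrete, here is a minimal sketch of a Validator entrypoint written in Python. The check itself (a non-empty title in the metadata) is purely hypothetical, as are the function names; only the environment variables, file names, and result fields come from this section.

```python
import json
import os
from pathlib import Path


def run_validator(in_dir: Path, out_dir: Path) -> dict:
    """Check the input directory and write result.json to the output directory."""
    metadata = json.loads((in_dir / "metadata.json").read_text())
    # Hypothetical check, for illustration only: metadata needs a non-empty title.
    if metadata.get("title"):
        result = {"status": "pass", "messages": ["Title is present."]}
    else:
        result = {"status": "fail", "messages": ["Missing required field: title"]}
    (out_dir / "result.json").write_text(json.dumps(result, indent=2))
    return result


def main() -> int:
    """Container entrypoint: read paths from the injected environment variables."""
    run_validator(Path(os.environ["OSAP_IN"]), Path(os.environ["OSAP_OUT"]))
    return 0  # exit code 0 for both pass and fail; result.json carries the verdict
```

In a container image, the entrypoint script would call main() and pass its return value to sys.exit.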

Example Validation Run

# ArchiveNode prepares input
$ ls $OSAP_IN
metadata.json  data.csv

# ArchiveNode runs container
$ docker run --rm \
  -v /path/to/input:/input:ro \
  -v /path/to/output:/output \
  -e OSAP_IN=/input \
  -e OSAP_OUT=/output \
  myregistry.io/osa/csv-validator:1.0.0

# Validator writes result
$ cat $OSAP_OUT/result.json
{
  "status": "pass",
  "messages": ["All 1000 rows validated successfully"]
}

Sandboxing

ArchiveNodes MUST run Validators with:

  • No network access (no outbound connections)
  • Read-only input ($OSAP_IN mounted read-only)
  • Limited resources (CPU/RAM limits)
  • Isolated filesystem (no access to host filesystem beyond input/output)

ArchiveNodes SHOULD set execution timeouts (e.g., 10 minutes) to prevent runaway validators.

Error Handling

If the Validator container:

  • Exits with non-zero code → Treat as status: "fail" with message “Validator crashed”
  • Fails to write result.json → Treat as status: "fail" with message “No result produced”
  • Times out → Treat as status: "fail" with message “Validation timeout exceeded”
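
On the ArchiveNode side, these rules could be applied as in the sketch below. The function names are hypothetical, and treating an unparseable result.json the same as a missing one is an assumption, since that case is not covered above.

```python
import json
from pathlib import Path
from typing import Optional


def _fail(message: str) -> dict:
    return {"status": "fail", "messages": [message]}


def interpret_outcome(exit_code: Optional[int], result_path: Path, timed_out: bool = False) -> dict:
    """Turn a finished (or killed) Validator run into a ValidationRun-style result."""
    if timed_out:
        return _fail("Validation timeout exceeded")
    if exit_code != 0:
        return _fail("Validator crashed")
    try:
        return json.loads(result_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # Assumption: malformed JSON is handled like a missing result file.
        return _fail("No result produced")
```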

Curation Tool Contract

This section defines the execution contract for Curation Tool OCI containers.

Purpose

Curation Tools are interactive web applications that allow human curators to inspect, annotate, and modify Depositions. Unlike Validators, they are long-running and expose HTTP interfaces.

Container Requirements

Packaging

Curation Tools MUST be packaged as OCI-compliant container images.

Web Server

The container MUST run a web server listening on the port specified in its CurationToolRef.default_port (e.g., 8080).

Environment Variables

The ArchiveNode MUST inject:

Variable            Description
OSAP_DEPOSITION_ID  The SRN of the Deposition being curated
OSAP_API_URL        The internal URL of the ArchiveNode’s API (e.g., http://localhost:5000/api/v1)
OSAP_API_TOKEN      A temporary Bearer token with write access to the Deposition
OSAP_BASE_PATH      The public path prefix (e.g., /curation/sessions/sess_abc123)

Base Path Handling

All assets (HTML, CSS, JS, images) MUST be served relative to $OSAP_BASE_PATH.

Example: If the tool serves a stylesheet at /style.css, and OSAP_BASE_PATH=/curation/sessions/sess_abc123, the client must access it at:

https://archive.example.org/curation/sessions/sess_abc123/style.css

Many web frameworks support base path configuration (e.g., Flask’s APPLICATION_ROOT, Express’s app.use(basePath, ...)).

Data Access

Reading Data

Curation Tools MAY access Deposition data via:

  1. API calls to $OSAP_API_URL using $OSAP_API_TOKEN
  2. Mounted files (if the ArchiveNode mounts data at /data read-only)

Modifying Data

To modify the Deposition (e.g., fix metadata, delete files), the tool MUST:

  1. Make API calls to $OSAP_API_URL (e.g., PATCH /depositions/{id})
  2. Include Authorization: Bearer $OSAP_API_TOKEN

Tools MUST NOT attempt to write state to the container’s local filesystem. Any data written locally will be lost when the session ends.

Security

Token Scope

The provided OSAP_API_TOKEN:

  • MUST grant read access to the Deposition being curated
  • MUST grant write access ONLY if the tool’s capabilities include read-write
  • MUST expire when the curation session ends
  • SHOULD be limited to the specific Deposition (not all Depositions)

Isolation

Curation Tool containers MUST be isolated:

  • Network restrictions: Outbound internet access SHOULD be blocked unless explicitly required
  • Ephemeral storage: Any data written to the container’s filesystem MUST be discarded when the session ends
  • Resource limits: CPU/RAM limits SHOULD be enforced

Example Curation Session

# ArchiveNode starts container
$ docker run --rm -d \
  -p 8080:8080 \
  -e OSAP_DEPOSITION_ID=urn:osa:example:dep:abc123 \
  -e OSAP_API_URL=http://host.docker.internal:5000/api/v1 \
  -e OSAP_API_TOKEN=eyJhbGc... \
  -e OSAP_BASE_PATH=/curation/sessions/sess_xyz \
  myregistry.io/osa/molecule-viewer:1.2.0

# ArchiveNode proxies requests
# User visits: https://archive.example.org/curation/sessions/sess_xyz
# ArchiveNode forwards to: http://localhost:8080/ (with path rewriting)

Client Requirements

A Client is any application that interacts with ArchiveNodes or ViewNodes. This includes web applications, command-line tools, libraries, and scripts.

Required Behaviors

Conforming Clients MUST:

  1. Use SRNs for resource references: When referencing Depositions, Records, or Profiles, use full SRNs (not local IDs alone).

  2. Authenticate via Bearer tokens: Include Authorization: Bearer <token> in all requests to ArchiveNodes.

  3. Handle standard HTTP status codes:

    • 401 Unauthorized → Prompt for authentication
    • 403 Forbidden → Insufficient permissions
    • 404 Not Found → Resource does not exist
    • 422 Unprocessable Entity → Validation errors (check response body for details)

  4. Respect rate limits: If an ArchiveNode returns 429 Too Many Requests, back off exponentially.
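
One way a Client might implement the rate-limit behavior is sketched below. The retry count, base delay, and cap are illustrative assumptions (the specification does not fix them), and send stands for any callable that performs the request and returns a (status, body) pair.

```python
import time


def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield capped exponential delays: base, 2*base, 4*base, ..."""
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(send, max_retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call send() again after each 429 until it succeeds or retries run out."""
    status, body = send()
    for delay in backoff_delays(max_retries, base, cap):
        if status != 429:
            break
        sleep(delay)  # Wait before retrying the rate-limited request
        status, body = send()
    return status, body
```

If every retry is rate-limited, the last 429 response is returned so the caller can surface it.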

Optional Behaviors

Clients MAY:

  • Cache Record metadata locally (but SHOULD check ETag or Last-Modified headers)
  • Support multiple ArchiveNodes simultaneously
  • Implement retry logic for transient failures (5xx errors)
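
The cache revalidation suggested above can be sketched with standard HTTP conditional requests. The cache entry layout (a dict with etag, last_modified, and body keys) is an assumption made for illustration only.

```python
def conditional_headers(entry):
    """Build revalidation headers from a cached entry (or none if uncached)."""
    headers = {}
    if entry is None:
        return headers
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    elif entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def update_cache(entry, status, headers, body):
    """Return the cache entry to keep after a response."""
    if status == 304 and entry is not None:
        return entry  # Not Modified: the cached copy is still current
    if status == 200:
        return {
            "etag": headers.get("ETag"),
            "last_modified": headers.get("Last-Modified"),
            "body": body,
        }
    return entry  # Other statuses leave the cache untouched
```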

Example: Submitting a Dataset

import requests

API_BASE = "https://archive.example.org/api/v1"
TOKEN = "your-bearer-token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Create Deposition
resp = requests.post(
    f"{API_BASE}/depositions",
    headers=HEADERS,
    json={"profile": "urn:osa:osa-registry:profile:crystallography@v1.0.0"},
)
resp.raise_for_status()  # Stop on 4xx/5xx instead of continuing with a bad response
deposition = resp.json()
dep_id = deposition["srn"].split(":")[-1]  # Extract local ID from the SRN

# 2. Upload file
with open("data.cif", "rb") as f:
    resp = requests.post(
        f"{API_BASE}/depositions/{dep_id}/files",
        headers=HEADERS,
        files={"file": f},
    )
resp.raise_for_status()

# 3. Update metadata
resp = requests.patch(
    f"{API_BASE}/depositions/{dep_id}",
    headers=HEADERS,
    json={"metadata": {"title": "My Crystal Structure"}},
)
resp.raise_for_status()

# 4. Submit
resp = requests.post(
    f"{API_BASE}/depositions/{dep_id}/actions/submit",
    headers=HEADERS,
)
resp.raise_for_status()

Part 4: Governance

Security & Privacy

Authentication

ArchiveNodes MUST support Bearer token authentication. Tokens SHOULD be obtained via an external OIDC-compatible identity provider.

ArchiveNodes MAY support additional authentication methods (e.g., API keys, mTLS) but MUST support Bearer tokens for interoperability.

Authorization

ArchiveNodes MUST enforce:

  • Depositor isolation: Users can only read/modify Depositions they created (unless they have curator privileges)
  • Curator privileges: Curators can view and modify any Deposition in UNDER_REVIEW state
  • Public read access: Records in PUBLIC state SHOULD be readable without authentication
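
The first two rules can be read as a single access check, sketched below. The User and Deposition types are hypothetical stand-ins for whatever an implementation stores, and the sketch assumes owners keep access in every state, which this section does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class User:
    user_id: str
    is_curator: bool = False


@dataclass
class Deposition:
    owner_id: str
    state: str


def can_access_deposition(user: User, dep: Deposition) -> bool:
    """Return True if the user may read or modify the Deposition."""
    if dep.owner_id == user.user_id:
        return True  # Depositor isolation: owners reach their own Depositions
    # Curator privileges apply only while the Deposition is under review
    return user.is_curator and dep.state == "UNDER_REVIEW"
```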

Node Identity

Each ArchiveNode MUST have a globally unique NodeID and MUST publish a Node Document at:

https://{domain}/.well-known/osa-node.json

Example:

{
  "node_id": "imperial-mat-sci",
  "version": "1.0.0",
  "api_base": "https://imperial-mat-sci.ac.uk/osa/api/v1",
  "registries": [
    "https://registry.osa.org",
    "https://imperial-mat-sci.ac.uk/osa-registry"
  ]
}
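
A Client could locate and sanity-check a Node Document roughly as follows. Treating all four keys from the example as required is an assumption of this sketch; the helper names are hypothetical.

```python
import json

# Assumption: the keys shown in the example above are all expected
EXPECTED_KEYS = ("node_id", "version", "api_base", "registries")


def node_document_url(domain: str) -> str:
    """Return the well-known Node Document URL for a domain."""
    return f"https://{domain}/.well-known/osa-node.json"


def parse_node_document(text: str) -> dict:
    """Parse a Node Document and reject it if expected keys are absent."""
    doc = json.loads(text)
    missing = [key for key in EXPECTED_KEYS if key not in doc]
    if missing:
        raise ValueError(f"Node Document missing keys: {missing}")
    return doc


print(node_document_url("imperial-mat-sci.ac.uk"))
```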

Curation Tool Security

Proxy Authentication

The ArchiveNode’s proxy endpoint (/curation/sessions/{session_id}) MUST verify that the requesting user is the owner of the session.

Tool Isolation

Curation Tools MUST be isolated:

  • Network restrictions: Outbound internet access SHOULD be blocked
  • Ephemeral storage: Any data written to the container’s filesystem MUST be discarded when the session ends
  • No persistent state: Tools MUST NOT maintain state between sessions

CSRF Protection

Because Curation Tools are proxied on the ArchiveNode’s domain, they share the same origin as the main application. ArchiveNodes MUST:

  • Enforce SameSite=Strict on session cookies
  • Validate CSRF tokens on all state-changing operations
  • Use separate session tokens for curation (not the main user session)
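
The sketch below shows how these three requirements might fit together using Python's standard library. The cookie name osa_curation_session and the scoping of the cookie to the session path are assumptions for illustration, not values defined by this specification.

```python
import hmac
import secrets
from http.cookies import SimpleCookie


def curation_session_cookie(session_id: str, session_token: str) -> str:
    """Return a Set-Cookie value for a curation-only session token."""
    cookie = SimpleCookie()
    cookie["osa_curation_session"] = session_token  # hypothetical cookie name
    morsel = cookie["osa_curation_session"]
    morsel["path"] = f"/curation/sessions/{session_id}"
    morsel["secure"] = True
    morsel["httponly"] = True
    morsel["samesite"] = "Strict"
    return morsel.OutputString()


def new_csrf_token() -> str:
    """Generate an unpredictable CSRF token for a session."""
    return secrets.token_urlsafe(32)


def csrf_token_valid(expected: str, submitted) -> bool:
    """Compare tokens in constant time; a missing token never matches."""
    return hmac.compare_digest(expected.encode(), (submitted or "").encode())
```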

Data Privacy

ArchiveNodes SHOULD:

  • Encrypt data at rest and in transit (using TLS 1.3 or later for data in transit)
  • Provide mechanisms for embargoing sensitive data
  • Support metadata redaction for withdrawn Records
  • Log all access to private Depositions for audit purposes

Conformance

Conformance Classes

ArchiveNode

A conforming ArchiveNode MUST:

  • Implement all endpoints in §ArchiveNode HTTP API
  • Enforce the Deposition lifecycle (§Lifecycles)
  • Execute Validators according to §Validator Contract
  • Proxy Curation Tools according to §Curation Tool Contract
  • Support SRN resolution (§Identifiers)
  • Publish a Node Document at /.well-known/osa-node.json

ViewNode

A conforming ViewNode MUST:

  • Implement the Search API (§ViewNode Protocol)
  • Index Records from at least one ArchiveNode
  • Index the guarantees field from Record provenance to enable quality-based filtering
  • Support filtering by SemanticGuarantees via the guarantees query parameter
  • Resolve SRNs (§Identifiers)
  • Return results in the specified JSON format

Validator

A conforming Validator MUST:

  • Be packaged as an OCI container
  • Read input from $OSAP_IN
  • Write result.json to $OSAP_OUT with required fields
  • Exit with code 0
  • Operate without network access
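
A skeleton for such a Validator is sketched below. It assumes $OSAP_IN and $OSAP_OUT name directories, and it deliberately leaves the contents of result.json to a caller-supplied function: the required fields are defined in §Validator Contract, and the files_seen payload here is only a placeholder, not a conforming result.

```python
import json
import os
from pathlib import Path


def run_validator(check, env=os.environ) -> int:
    """Run check on the input directory and write result.json.

    Assumes OSAP_IN and OSAP_OUT are directory paths.
    """
    in_dir = Path(env["OSAP_IN"])
    out_dir = Path(env["OSAP_OUT"])
    result = check(in_dir)
    (out_dir / "result.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    return 0  # Exit code expected by the conformance list above


def list_inputs(in_dir: Path) -> dict:
    """Placeholder check: report the input file names (hypothetical field)."""
    return {"files_seen": sorted(p.name for p in in_dir.iterdir())}
```

A container entrypoint would end with sys.exit(run_validator(my_check)), where my_check returns the fields the contract requires.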

Curation Tool

A conforming Curation Tool MUST:

  • Be packaged as an OCI container
  • Run a web server on the specified port
  • Serve assets relative to $OSAP_BASE_PATH
  • Use $OSAP_API_TOKEN for all state-changing operations
  • Not write persistent state to local disk

Client

A conforming Client MUST:

  • Use SRNs for resource references
  • Authenticate via Bearer tokens
  • Handle standard HTTP status codes

Compliance Testing

The OSA project maintains a test suite at https://github.com/open-science-archive/compliance-tests.

Implementations MUST pass all tests in the relevant conformance class to claim OSA compliance.

Version Compatibility

This specification is version 0.0.1 (draft).

Breaking changes (as defined by Semantic Versioning 2.0) will increment the major version. Implementations SHOULD advertise their supported spec versions in the Node Document.

Extensibility

OSA Enhancement Proposals (OEPs)

Changes to this specification, the Global Registry, or standard contracts MUST follow the OEP process:

  1. Draft: Author submits a proposal to the OSA governance repository
  2. Community Review: 14-day public comment period
  3. Revision: Author addresses feedback
  4. Last Call: 7-day final review
  5. Accepted: Governance committee votes (requires 2/3 majority)

Accepted OEPs are versioned and published at https://oeps.osa.org/.

Namespaced Extensions

Implementations MAY add custom fields to protocol resources using namespaced keys:

{
  "metadata": {
    "title": "My Dataset",
    "x-institution-grant-id": "GRANT-12345",
    "x-institution-internal-id": "abc-xyz-789"
  }
}

Extension keys MUST:

  • Start with x- followed by a unique namespace identifier
  • Use lowercase and hyphens (e.g., x-my-org-field-name)
  • Not conflict with standardized keys

Extensions MUST NOT:

  • Change the semantics of required protocol fields
  • Break interoperability (other implementations MUST be able to ignore unknown fields)
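
The key rules above can be checked mechanically. The following sketch assumes digits are permitted alongside lowercase letters (the rules mention only lowercase and hyphens), and STANDARD_KEYS is a hypothetical stand-in for the set of standardized keys.

```python
import re

# Assumption: digits are allowed in addition to lowercase letters and hyphens
EXTENSION_KEY = re.compile(r"^x-[a-z0-9]+(?:-[a-z0-9]+)*$")

# Hypothetical stand-in for the standardized metadata keys
STANDARD_KEYS = {"title"}


def is_extension_key(key: str) -> bool:
    """Return True if key follows the x-<namespace>-<name> convention."""
    return key not in STANDARD_KEYS and EXTENSION_KEY.match(key) is not None


def split_metadata(metadata: dict):
    """Separate standard fields from extensions so unknown keys can be ignored."""
    standard = {k: v for k, v in metadata.items() if k in STANDARD_KEYS}
    extensions = {k: v for k, v in metadata.items() if is_extension_key(k)}
    return standard, extensions
```

An implementation that does not understand a given extension can process only the standard part and pass the rest through unchanged.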

Future Directions

Topics under consideration for future versions:

  • Federation protocol: Standardized push-based synchronization between ArchiveNodes and ViewNodes
  • Access control policies: Fine-grained permissions (e.g., group-based access, time-limited embargoes)
  • Provenance chains: Tracking derived datasets and lineage across Records
  • Binary attachment format: Standardized packaging for Record exports (e.g., BagIt integration)