ReplicateScience

A programmable protocol library. Open-access papers in, reproducible protocols out. Built by ConductScience.

© 2026 ConductScience. All rights reserved.

Methodology & Accuracy

ReplicateScience publishes a page only when the source paper, extracted evidence, rendered protocol, and public page checks agree. Pages that do not meet that bar stay out of the public index.

Current Public Dataset

  • Live pages checked: 323
  • Passed public QA: 100%
  • Live blockers: 0
  • Browser-check titles: 0

Last verified May 31, 2026. The live QA crawler checked every indexed experiment page, confirmed the RDL renderer, source identifiers, and required page structure, and found no public blockers.

What We Mean by Accurate

A published RDL page is source-backed, not simply AI-generated. The page must carry the source paper identity, parsed source evidence, structured methods coverage, materials or equipment coverage, outputs, analysis evidence, and a public QA pass. Unsupported details are blocked or labeled instead of being presented as source facts.

This does not mean every remaining internal record is accurate. It means the public set has passed the current source-fidelity gate. Internal records that lack source access, parsed supplements, or enough structured method evidence remain blocked until the evidence improves.

Internal Data Coverage

  • Published full-RDL jobs: 322
  • Held back from publication: 1,482
  • Currently publishable: 18%

Published confidence

  • 322 publishable schema 2.0.0 RDL packages
  • 323 live pages passed source-page QA
  • 0 remaining unpublished candidates pass the current rank gate
  • 216,603 source evidence units stored across the corpus

Known limits

  • 1,363 jobs still need source access or source repair
  • 116 jobs need stronger structured extraction
  • 3 jobs need manual review
  • 467 fetched source files are stored but not yet parsed
  • 1,012 source documents are blocked by fetch, browser challenge, or OA package availability
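
The counts in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that the three per-reason job counts under "Known limits" make up the held-back total, and that the publishable percentage is published jobs divided by published plus held-back jobs; the page does not state either formula, so both are assumptions.

```python
# Counts copied from this page. The grouping and the percentage formula
# are assumptions, not a documented definition.
published = 322
held_back_reasons = {
    "needs source access or source repair": 1363,
    "needs stronger structured extraction": 116,
    "needs manual review": 3,
}

held_back = sum(held_back_reasons.values())
total_jobs = published + held_back
publishable_share = published / total_jobs

print(held_back)                   # 1482
print(total_jobs)                  # 1804
print(f"{publishable_share:.1%}")  # 17.8%
```

Under these assumptions the per-reason counts sum to the 1,482 held-back jobs, and 17.8% rounds to the 18% shown.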

Source Data for These Numbers

Live QA report

Artifact: artifacts/source-page-qa/full-completion-post-deploy-live-2026-05-31/report.json

SHA-256: 7E15B5E1EF3F90007A09DE437D1B74ED96E51E337A365C8D1ECF9B86E0189FF3

Command: npm run qa:source-pages -- --out artifacts/source-page-qa/full-completion-post-deploy-live-2026-05-31 --concurrency 4

Final rank gate

Artifact: artifacts/full-rdl-rank/full-completion-final-2026-05-31/summary.json

SHA-256: B459C64CD40054726F8747E63F7194183908D5252571143AFE7240D4B6AEF562

Command: npm run rdl:rank-full -- --status blocked_needs_structured_extraction,blocked_needs_manual_review --limit 500 --top 300 --out artifacts/full-rdl-rank/full-completion-final-2026-05-31

The public JSON snapshot is the shareable source of truth for this page. It contains the database aggregate counts, QA summary, rank-gate summary, source-document coverage, artifact hashes, and the interpretation limits used here.
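
A reader with a local copy of one of the artifacts can compare it against the SHA-256 value listed on this page. The following is a minimal sketch using only the Python standard library; the helper names are illustrative and are not part of any ReplicateScience tooling.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the uppercase SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_published_hash(path: Path, published_hex: str) -> bool:
    """Compare a local file against a published digest, ignoring case."""
    return sha256_hex(path) == published_hex.strip().upper()


# Example call, assuming the artifact has been downloaded to this path:
# matches_published_hash(
#     Path("artifacts/full-rdl-rank/full-completion-final-2026-05-31/summary.json"),
#     "B459C64CD40054726F8747E63F7194183908D5252571143AFE7240D4B6AEF562",
# )
```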

Publication Gate

1. Source identity

DOI, PMCID, title, and source URL must resolve to the same paper. Pages with browser-check or placeholder titles are blocked.

2. Evidence coverage

The page must have source-backed subjects, groups, procedure steps, materials or equipment, outputs, and analysis evidence. Non-executable reviews, guidelines, proceedings, tutorials, and reporting standards are rejected.

3. Rendered-page QA

The public crawler checks indexed pages after publication. Any page with incomplete protocol data, source identity failure, or wrong template is removed from public eligibility and returned to the rebuild queue.
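
The three gate stages can be pictured as a single function that returns the reasons a candidate would be blocked. This is a sketch only: the page does not publish its schema, so every field name, type label, and placeholder-title marker below is hypothetical, and `rendered_qa_passed` stands in for the result of the post-publication crawler.

```python
from dataclasses import dataclass, field

# All names below are hypothetical illustrations of the rules in the text.
IDENTITY_KEYS = ("doi", "pmcid", "title", "source_url")
REQUIRED_EVIDENCE = ("subjects", "groups", "procedure_steps",
                     "materials_or_equipment", "outputs", "analysis")
NON_EXECUTABLE_TYPES = {"review", "guideline", "proceedings",
                        "tutorial", "reporting_standard"}
PLACEHOLDER_TITLE_MARKERS = ("just a moment", "untitled")  # assumed examples


@dataclass
class Candidate:
    title: str
    document_type: str
    # identifier kind -> the paper that identifier resolved to
    resolved_paper: dict = field(default_factory=dict)
    # evidence kind -> list of source-backed items
    evidence: dict = field(default_factory=dict)
    rendered_qa_passed: bool = False


def gate_failures(c: Candidate) -> list:
    """Return the reasons a candidate fails the gate; empty means it passes."""
    failures = []
    resolved = {c.resolved_paper.get(key) for key in IDENTITY_KEYS}
    if None in resolved or len(resolved) != 1:
        failures.append("source identity")
    if any(m in c.title.lower() for m in PLACEHOLDER_TITLE_MARKERS):
        failures.append("placeholder title")
    if c.document_type in NON_EXECUTABLE_TYPES:
        failures.append("non-executable document type")
    missing = [k for k in REQUIRED_EVIDENCE if not c.evidence.get(k)]
    if missing:
        failures.append("missing evidence: " + ", ".join(missing))
    if not c.rendered_qa_passed:
        failures.append("rendered-page QA")
    return failures


# Made-up candidate used only to exercise the function.
ok = Candidate(
    title="Open field test in mice",
    document_type="research_article",
    resolved_paper={key: "paper-1" for key in IDENTITY_KEYS},
    evidence={key: ["item"] for key in REQUIRED_EVIDENCE},
    rendered_qa_passed=True,
)
print(gate_failures(ok))  # []
```

Returning a list of reasons rather than a single boolean mirrors the way the page reports blocked records by cause.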

Historical Extraction Benchmark

The older benchmark below tracks an internal extractor/scorer series from March 2026. It is useful for engineering history, but it is not the publication accuracy score for today's RDL pages.

  • Overall score: 67
  • Matched pairs: 406
  • Experiments: 1,529
  • Weakest dimension: equipmentCompleteness (39)

[Chart: Overall Score Over Time]

[Chart: Dimension Breakdown]

AutoResearch Iterations

#  Mutator      Component                Before  After  Delta  Decision
1  few-shot     fewShotExamples          75      76     +1     reject
2  user-prompt  userPromptTemplate       79      79     0      tied
3  schema       outputSchemaDescription  77      77     0      tied
4  manual-pass  passes                   75      78     +3     accept
5  manual-pass  passes+step-expansion    78      83     +5     accept

Current Bottlenecks & Next Steps

stepCoverage (31/100) — Vocabulary Mismatch

Papers describe procedures in natural language ("locomotor activity was recorded for 10 min") while protocols.io has software-specific steps ("Open Biobserve Viewer", "Press F1"). The scorer needs expanded synonym groups and TF-IDF weighting to bridge this gap.
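
The mismatch is easy to see with a plain token-set Dice coefficient applied to the two example phrasings. This is a sketch under assumptions: the scorer's actual tokenization and matching rules are not given on this page, so lowercase alphanumeric tokens are used here.

```python
import re


def tokens(text: str) -> set:
    """Lowercase alphanumeric tokens (an assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(a: str, b: str) -> float:
    """Dice coefficient on token sets: 2 * |A & B| / (|A| + |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


paper_step = "locomotor activity was recorded for 10 min"
protocol_step = "Open Biobserve Viewer"
print(dice(paper_step, protocol_step))  # 0.0
```

The two steps share no tokens, so a purely lexical score cannot pair them even when they describe the same stage of the procedure.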

parameterAccuracy (51/100) — Missing Details

Papers often omit specific parameters (exact temperatures, concentrations) that protocols.io specifies. Multi-pass extraction with step expansion may recover more details.

Planned Improvements

  • Expand synonym groups for software-action mappings
  • Add TF-IDF weighting to reduce stopword dominance in Dice coefficient
  • Multi-pass extraction for finer-grained step decomposition
  • Targeted few-shot examples filtered by experiment type
  • Consider embedding similarity (all-MiniLM-L6-v2) if scorer improvements plateau
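
One way to read the TF-IDF item is as an IDF-weighted Dice coefficient, in which tokens that appear in many steps count for less than rare ones. The sketch below is an illustration of that idea, not the project's scorer: the smoothed IDF formula, the tokenization, and the toy corpus are all assumptions.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> set:
    """Lowercase alphanumeric tokens (an assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def plain_dice(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    total = len(ta) + len(tb)
    return 2 * len(ta & tb) / total if total else 0.0


def idf_weights(corpus: list) -> dict:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in tokens(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0
            for tok, count in df.items()}


def weighted_dice(a: str, b: str, idf: dict, default: float = 1.0) -> float:
    """Dice coefficient where each token contributes its IDF weight."""
    ta, tb = tokens(a), tokens(b)

    def weight(toks: set) -> float:
        return sum(idf.get(t, default) for t in toks)

    denom = weight(ta) + weight(tb)
    return 2 * weight(ta & tb) / denom if denom else 0.0


# Toy corpus of made-up steps; "the" and "was" occur in every one.
corpus = [
    "the mouse was placed in the arena",
    "the timer was started",
    "the arena was cleaned",
    "the session was recorded",
]
idf = idf_weights(corpus)
a, b = corpus[0], corpus[1]
print(plain_dice(a, b))          # overlap is only "the" and "was"
print(weighted_dice(a, b, idf))  # lower, since those tokens carry little weight
```

In this toy case the two steps overlap only on stopwords, and the weighted score falls below the plain one, which is the effect the planned change is aiming for.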

Release Notes

v1.9.0 (2026-03-26)
67/100

Scorer methodology change: not comparable to v1.8.0. equipmentCompleteness: category-based comparison replaces the binary has-equipment check (76→39, a more honest score). parameterAccuracy: word-boundary regex plus semantic temperature equivalents replace substring matching (51→46, false positives removed). The overall score of 67 reflects true extraction quality.

v1.8.0 (2026-03-25)
74/100

OpenAI embedding similarity (text-embedding-3-small) replaces the Dice coefficient for step matching. stepCoverage 33→61 (+85%), overall 67→74.

v1.7.0 (2026-03-25)
67/100

Scorer improvements (stopwords, 40+ synonyms), multi-pass extraction, targeted few-shot. All 6 loop iterations were rejected: prompt mutations converged, and the scorer is the bottleneck.

v1.6.0 (2026-03-25)
67/100

AutoResearch v2: mutator architecture, 22 synonym groups.

v1.5.0 (2026-03-18)
63/100

Expanded PIO to 134 GTs, config-driven extraction.

v1.4.0 (2026-03-10)
58/100

PIO preference over ConductScience GT, more experiment types.

v1.3.0 (2026-03-01)
55/100

Multi-strategy step matching, synonym expansion.

v1.2.0 (2026-02-22)
50/100

Expanded protocols.io to 57 GTs, improved fuzzy matching.

v1.1.0 (2026-02-18)
45/100

Added protocols.io ground truth ingestion.

v1.0.0 (2026-02-15)
42/100

Initial release: ConductScience GT only.