Methodology & Accuracy

ReplicateScience publishes a page only when the source paper, extracted evidence, rendered protocol, and public page checks agree. Pages that do not meet that bar stay out of the public index.

Current Public Dataset

323 live pages checked
100% passed public QA

Live blockers

Browser-check titles

Last verified May 31, 2026. The live QA crawler checked every indexed experiment page, confirmed the RDL renderer, source identifiers, and required page structure, and found no public blockers.


What We Mean by Accurate

A published RDL page is source-backed, not simply AI-generated. The page must carry the source paper identity, parsed source evidence, structured methods coverage, materials or equipment coverage, outputs, analysis evidence, and a public QA pass. Unsupported details are blocked or labeled instead of being presented as source facts.

This does not mean every remaining internal record is accurate. It means the public set has passed the current source-fidelity gate. Internal records that lack source access, parsed supplements, or enough structured method evidence remain blocked until the evidence improves.

Internal Data Coverage

322 published full-RDL jobs
1,482 held back from publication
18% currently publishable

Published confidence

322 publishable schema 2.0.0 RDL packages
323 live pages passed source-page QA
0 remaining unpublished candidates pass the current rank gate
216,603 source evidence units stored across the corpus

Known limits

1,363 jobs still need source access or source repair
116 jobs need stronger structured extraction
3 jobs need manual review
467 fetched source files are stored but not yet parsed
1,012 source documents are blocked by fetch, browser challenge, or OA package availability
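
The job counts above are internally consistent, and a short check makes that explicit. The sketch below only re-adds figures quoted on this page. The variable names are illustrative, and it assumes the three job-level categories under "Known limits" are mutually exclusive.

```python
# Figures copied from this page; variable names are illustrative only.
published = 322
held_back = 1_482

# Job-level blockers listed under "Known limits".
needs_source = 1_363
needs_extraction = 116
needs_manual_review = 3

total_jobs = published + held_back
publishable_share = published / total_jobs

# The three blocker categories account for every held-back job.
assert needs_source + needs_extraction + needs_manual_review == held_back

print(total_jobs)                  # 1804
print(f"{publishable_share:.1%}")  # 17.8%, displayed on the page as 18%
```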

Source Data for These Numbers

Live QA report

Artifact: artifacts/source-page-qa/full-completion-post-deploy-live-2026-05-31/report.json

SHA-256: 7E15B5E1EF3F90007A09DE437D1B74ED96E51E337A365C8D1ECF9B86E0189FF3

Command: npm run qa:source-pages -- --out artifacts/source-page-qa/full-completion-post-deploy-live-2026-05-31 --concurrency 4

Final rank gate

Artifact: artifacts/full-rdl-rank/full-completion-final-2026-05-31/summary.json

SHA-256: B459C64CD40054726F8747E63F7194183908D5252571143AFE7240D4B6AEF562

Command: npm run rdl:rank-full -- --status blocked_needs_structured_extraction,blocked_needs_manual_review --limit 500 --top 300 --out artifacts/full-rdl-rank/full-completion-final-2026-05-31

The public JSON snapshot is the shareable source of truth for this page. It contains the database aggregate counts, QA summary, rank-gate summary, source-document coverage, artifact hashes, and the interpretation limits used here.
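
Because the artifact hashes are published, a reader with a local copy of an artifact can confirm it matches. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the artifact file has already been obtained and saved locally.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the uppercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_published_hash(path: Path, published: str) -> bool:
    """Compare a local file against a hash string copied from this page."""
    return sha256_hex(path) == published.strip().upper()
```

For example, `matches_published_hash(Path("report.json"), "7E15B5E1EF3F90007A09DE437D1B74ED96E51E337A365C8D1ECF9B86E0189FF3")` would return `True` only if the local `report.json` is byte-identical to the live QA report hashed above.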

Publication Gate

1. Source identity

DOI, PMCID, title, and source URL must resolve to the same paper. Pages with browser-check or placeholder titles are blocked.

2. Evidence coverage

The page must have source-backed subjects, groups, procedure steps, materials or equipment, outputs, and analysis evidence. Non-executable reviews, guidelines, proceedings, tutorials, and reporting standards are rejected.

3. Rendered-page QA

The public crawler checks indexed pages after publication. Any page with incomplete protocol data, source identity failure, or wrong template is removed from public eligibility and returned to the rebuild queue.
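
The three checks above can be sketched as a single function that returns the reasons a candidate is blocked. This is an illustration only, not the production gate: the `Candidate` record, its field names, and the blocker labels are all assumptions, and the rendered-page check (which the real pipeline runs after publication) is modeled here as a simple flag.

```python
from dataclasses import dataclass, field

# Evidence sections named in gate step 2 (key names are hypothetical).
REQUIRED_EVIDENCE = (
    "subjects", "groups", "procedure_steps",
    "materials_or_equipment", "outputs", "analysis",
)
# Source types the gate rejects as non-executable (labels are hypothetical).
REJECTED_TYPES = {"review", "guideline", "proceedings", "tutorial", "reporting_standard"}


@dataclass
class Candidate:
    """Hypothetical candidate record; every field name is an assumption."""
    identity_consistent: bool   # DOI, PMCID, title, and URL resolve to one paper
    placeholder_title: bool     # browser-check or placeholder title detected
    source_type: str
    evidence: dict[str, int] = field(default_factory=dict)  # section -> unit count
    rendered_qa_passed: bool = False


def gate_blockers(c: Candidate) -> list[str]:
    """Return the reasons a candidate fails the gate; an empty list means it passes."""
    blockers = []
    if not c.identity_consistent or c.placeholder_title:
        blockers.append("source_identity")
    if c.source_type in REJECTED_TYPES:
        blockers.append("non_executable_source")
    missing = [s for s in REQUIRED_EVIDENCE if c.evidence.get(s, 0) == 0]
    if missing:
        blockers.append("evidence_coverage:" + ",".join(missing))
    if not c.rendered_qa_passed:
        blockers.append("rendered_page_qa")
    return blockers
```

Returning a list of reasons rather than a single boolean mirrors how this page reports blocked jobs by category.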

Historical Extraction Benchmark

The older benchmark below tracks an internal extractor/scorer series from March 2026. It is useful for engineering history, but it is not the publication accuracy score for today's RDL pages.

Overall Score

406 matched pairs
1,529 experiments
Weakest dimension: equipmentCompleteness

[Chart: Overall Score Over Time]

[Chart: Dimension Breakdown]

AutoResearch Iterations

#	Mutator	Component	Before	After	Delta	Decision
1	few-shot	fewShotExamples	75	76	+1	reject
2	user-prompt	userPromptTemplate	79	79	0	tied
3	schema	outputSchemaDescription	77	77	0	tied
4	manual-pass	passes	75	78	+3	accept
5	manual-pass	passes+step-expansion	78	83	+5	accept

Current Bottlenecks & Next Steps

stepCoverage (31/100) — Vocabulary Mismatch

Papers describe procedures in natural language ("locomotor activity was recorded for 10 min"), while protocols.io protocols list software-specific steps ("Open Biobserve Viewer", "Press F1"). The scorer needs expanded synonym groups and TF-IDF weighting to bridge this gap.
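
The mismatch is easy to see with a token-set Dice coefficient, 2|A∩B| / (|A| + |B|). The sketch below assumes a simple lowercase word tokenizer. The actual scorer's tokenization and matching strategy are not described on this page, so treat the functions as illustrative.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(a: str, b: str) -> float:
    """Dice coefficient on token sets: 2|A∩B| / (|A| + |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


paper_step = "locomotor activity was recorded for 10 min"
protocol_step = "Open Biobserve Viewer"

# The two descriptions share no tokens, so the score is zero.
print(dice(paper_step, protocol_step))  # 0.0

# A paraphrase in the same vocabulary scores 10/13, about 0.77.
print(dice(paper_step, "record locomotor activity for 10 min"))
```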

parameterAccuracy (51/100) — Missing Details

Papers often omit specific parameters (exact temperatures, concentrations) that protocols.io specifies. Multi-pass extraction with step expansion may recover more details.

Planned Improvements

Expand synonym groups for software-action mappings
Add TF-IDF weighting to reduce stopword dominance in Dice coefficient
Multi-pass extraction for finer-grained step decomposition
Targeted few-shot examples filtered by experiment type
Consider embedding similarity (all-MiniLM-L6-v2) if scorer improvements plateau
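
One way the TF-IDF item above could work is to weight each token by its inverse document frequency inside the Dice coefficient, so that shared stopwords contribute less than shared rare terms. This is a sketch of that idea, not the planned implementation: the smoothed IDF formula, the tokenizer, and the tiny corpus are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def idf_table(corpus: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in tokens(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}


def weighted_dice(a: str, b: str, idf: dict[str, float]) -> float:
    """Dice coefficient where each token counts by its IDF weight."""
    ta, tb = tokens(a), tokens(b)
    default = max(idf.values(), default=1.0)  # unseen tokens are treated as rare

    def weight(tok: str) -> float:
        return idf.get(tok, default)

    total = sum(map(weight, ta)) + sum(map(weight, tb))
    if total == 0:
        return 0.0
    return 2 * sum(map(weight, ta & tb)) / total


# Illustrative corpus; the sentences are made up for this example.
corpus = [
    "the mouse was placed in the arena for 10 min",
    "the tissue was fixed for 24 h",
    "the sample was centrifuged for 5 min",
    "locomotor activity was recorded for 10 min",
]
a, b = corpus[1], corpus[2]  # share only "the", "was", "for"

plain = weighted_dice(a, b, {})                 # empty table: every weight is 1.0
weighted = weighted_dice(a, b, idf_table(corpus))
print(plain > weighted)  # True: common words now count for less
```

Passing an empty IDF table reduces the function to the unweighted Dice coefficient, which gives a convenient baseline for comparison.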

Release Notes

v1.9.0 2026-03-26

67/100

SCORER METHODOLOGY CHANGE: scores are not comparable to v1.8.0. equipmentCompleteness: a category-based comparison replaces the binary has-equipment check (76→39, a more honest score). parameterAccuracy: word-boundary regex matching plus semantic temperature equivalents replaces substring matching (51→46, false positives removed). The overall score of 67 reflects true extraction quality.
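
To illustrate why substring matching produced false positives for parameters, the sketch below contrasts it with a word-boundary regex. The function names and example sentences are hypothetical, and the semantic temperature equivalents mentioned in the note are not covered here.

```python
import re


def substring_match(value: str, text: str) -> bool:
    """Old-style check: true if the value appears anywhere in the text."""
    return value in text


def word_boundary_match(value: str, text: str) -> bool:
    """Match the value only when it is not embedded in a longer word or number."""
    pattern = r"\b" + re.escape(value) + r"\b"
    return re.search(pattern, text) is not None


text = "samples were kept at 24 °C"

print(substring_match("4 °C", text))      # True: "4 °C" is found inside "24 °C"
print(word_boundary_match("4 °C", text))  # False: the "4" is part of "24"
print(word_boundary_match("4 °C", "stored at 4 °C overnight"))  # True
```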

v1.8.0 2026-03-25

74/100

OpenAI embedding similarity (text-embedding-3-small) replaces Dice coefficient for step matching. stepCoverage 33→61 (+85%), overall 67→74

v1.7.0 2026-03-25

67/100

Scorer improvements (stopwords, 40+ synonyms), multi-pass extraction, targeted few-shot. 6 loop iterations all rejected — prompt mutations converged, scorer is the bottleneck

v1.6.0 2026-03-25

67/100

AutoResearch v2 — mutator architecture, 22 synonym groups

v1.5.0 2026-03-18

63/100

Expanded PIO to 134 GTs, config-driven extraction

v1.4.0 2026-03-10

58/100

PIO preference over ConductScience GT, more experiment types

v1.3.0 2026-03-01

55/100

Multi-strategy step matching, synonym expansion

v1.2.0 2026-02-22

50/100

Expanded protocols.io to 57 GTs, improved fuzzy matching

v1.1.0 2026-02-18

45/100

Added protocols.io ground truth ingestion

v1.0.0 2026-02-15

42/100

Initial release — ConductScience GT only
