ReplicateScience publishes a page only when the source paper, extracted evidence, rendered protocol, and public page checks agree. Pages that do not meet that bar stay out of the public index.
Last verified May 31, 2026. The live QA crawler checked every indexed experiment page, confirmed the RDL renderer, source identifiers, and required page structure, and found no public blockers.
A published RDL page is source-backed, not simply AI-generated. The page must carry the source paper identity, parsed source evidence, structured methods coverage, materials or equipment coverage, outputs, analysis evidence, and a public QA pass. Unsupported details are blocked or labeled instead of being presented as source facts.
This does not mean every remaining internal record is accurate. It means the public set has passed the current source-fidelity gate. Internal records that lack source access, parsed supplements, or enough structured method evidence remain blocked until the evidence improves.
Artifact: artifacts/source-page-qa/full-completion-post-deploy-live-2026-05-31/report.json
SHA-256: 7E15B5E1EF3F90007A09DE437D1B74ED96E51E337A365C8D1ECF9B86E0189FF3
Command: npm run qa:source-pages -- --out artifacts/source-page-qa/full-completion-post-deploy-live-2026-05-31 --concurrency 4
Artifact: artifacts/full-rdl-rank/full-completion-final-2026-05-31/summary.json
SHA-256: B459C64CD40054726F8747E63F7194183908D5252571143AFE7240D4B6AEF562
Command: npm run rdl:rank-full -- --status blocked_needs_structured_extraction,blocked_needs_manual_review --limit 500 --top 300 --out artifacts/full-rdl-rank/full-completion-final-2026-05-31
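The hashes listed above let a reader confirm that a local copy of an artifact matches the one cited on this page. A minimal Python sketch, assuming the artifact has been saved at the relative path shown; the helper names `sha256_hex` and `matches_published_hash` are illustrative and not part of the ReplicateScience tooling:

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the uppercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_published_hash(path: Path, published: str) -> bool:
    """Compare case-insensitively, since hex digests may be printed in either case."""
    return sha256_hex(path) == published.strip().upper()


if __name__ == "__main__":
    # Path and hash copied from the artifact listing on this page.
    report = Path(
        "artifacts/source-page-qa/full-completion-post-deploy-live-2026-05-31/report.json"
    )
    expected = "7E15B5E1EF3F90007A09DE437D1B74ED96E51E337A365C8D1ECF9B86E0189FF3"
    if report.exists():
        print("match" if matches_published_hash(report, expected) else "MISMATCH")
    else:
        print(f"{report} not found locally")
```

A mismatch means the local file differs from the cited artifact and should not be used to reproduce the figures on this page.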
The public JSON snapshot is the shareable source of truth for this page. It contains the database aggregate counts, QA summary, rank-gate summary, source-document coverage, artifact hashes, and the interpretation limits used here.
DOI, PMCID, title, and source URL must resolve to the same paper. Pages with browser-check or placeholder titles are blocked.
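The identity rule above can be sketched as a comparison of the titles each identifier resolves to. This is an illustration under stated assumptions, not the production check: the `ResolvedSource` type, the normalization rule, and the placeholder patterns are all hypothetical, and the real blocklist is not published on this page.

```python
import re
from dataclasses import dataclass

# Hypothetical examples of browser-check or placeholder titles.
PLACEHOLDER_PATTERNS = [
    r"^just a moment",
    r"^checking your browser",
    r"^access denied",
    r"^untitled$",
]


@dataclass
class ResolvedSource:
    identifier_kind: str  # e.g. "doi", "pmcid", or "url"
    title: str            # title returned when that identifier is resolved


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_placeholder(title: str) -> bool:
    normalized = normalize_title(title)
    return normalized == "" or any(re.search(p, normalized) for p in PLACEHOLDER_PATTERNS)


def identity_check(page_title: str, resolved: list[ResolvedSource]) -> list[str]:
    """Return blocking reasons; an empty list means every identifier agrees."""
    reasons = []
    if is_placeholder(page_title):
        reasons.append("placeholder or browser-check title")
    expected = normalize_title(page_title)
    for source in resolved:
        if normalize_title(source.title) != expected:
            reasons.append(f"{source.identifier_kind} resolves to a different title")
    return reasons
```

Exact title equality after normalization is the strictest possible choice; a real check might also compare authors or publication year, which this sketch leaves out.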
The page must have source-backed subjects, groups, procedure steps, materials or equipment, outputs, and analysis evidence. Non-executable reviews, guidelines, proceedings, tutorials, and reporting standards are rejected.
The public crawler checks indexed pages after publication. Any page with incomplete protocol data, source identity failure, or wrong template is removed from public eligibility and returned to the rebuild queue.
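The structure and eligibility rules above amount to a gate that either passes a page, rejects it outright, or sends it back for rebuilding. A rough sketch follows; the field names, status strings, and `PageRecord` type are assumptions made for illustration, not the crawler's actual schema.

```python
from dataclasses import dataclass, field

# Section names are hypothetical labels for the evidence types listed above.
REQUIRED_SECTIONS = ("subjects", "groups", "procedure_steps", "outputs", "analysis_evidence")
NON_EXECUTABLE_TYPES = {"review", "guideline", "proceedings", "tutorial", "reporting_standard"}


@dataclass
class PageRecord:
    document_type: str
    sections: dict[str, list[str]] = field(default_factory=dict)


def gate(record: PageRecord) -> tuple[str, list[str]]:
    """Return (status, reasons) for one page record."""
    if record.document_type in NON_EXECUTABLE_TYPES:
        return "rejected", [f"non-executable source type: {record.document_type}"]

    reasons = [f"missing {name}" for name in REQUIRED_SECTIONS if not record.sections.get(name)]
    # Either materials or equipment coverage satisfies this requirement.
    if not (record.sections.get("materials") or record.sections.get("equipment")):
        reasons.append("missing materials or equipment")

    if reasons:
        return "rebuild_queue", reasons
    return "eligible", []
```

Returning the reasons alongside the status keeps the rebuild queue actionable, since each entry says which evidence is still missing.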
The older benchmark below tracks an internal extractor/scorer series from March 2026. It is useful for engineering history, but it is not the publication accuracy score for today's RDL pages.
| # | Mutator | Component | Before | After | Delta | Decision |
|---|---|---|---|---|---|---|
| 1 | few-shot | fewShotExamples | 75 | 76 | +1 | reject |
| 2 | user-prompt | userPromptTemplate | 79 | 79 | 0 | tied |
| 3 | schema | outputSchemaDescription | 77 | 77 | 0 | tied |
| 4 | manual-pass | passes | 75 | 78 | +3 | accept |
| 5 | manual-pass | passes+step-expansion | 78 | 83 | +5 | accept |
Papers describe procedures in natural language ("locomotor activity was recorded for 10 min"), whereas protocols.io entries list software-specific steps ("Open Biobserve Viewer", "Press F1"). To bridge this gap, the scorer needs expanded synonym groups and TF-IDF weighting.
Papers often omit specific parameters (exact temperatures, concentrations) that protocols.io specifies. Multi-pass extraction with step expansion may recover more details.
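To make the vocabulary gap concrete, here is a small sketch of synonym canonicalization combined with TF-IDF weighting and cosine similarity. It is not the scorer's implementation: the synonym groups are invented examples, and the smoothed IDF formula is one common variant chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical synonym groups: every surface form maps to one canonical token.
SYNONYM_GROUPS = {
    "record": {"record", "recorded", "recording", "track", "tracked", "tracking"},
    "minute": {"min", "mins", "minute", "minutes"},
}
CANONICAL = {word: canon for canon, words in SYNONYM_GROUPS.items() for word in words}


def tokens(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and replace synonyms with their canonical form."""
    return [CANONICAL.get(w, w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def idf_table(documents: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(tokens(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text: str, idf: dict[str, float]) -> dict[str, float]:
    counts = Counter(tokens(text))
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With these groups, "recorded" in a paper and "tracking" in a protocol step both become `record`, so the two texts share a weighted term even though no surface word overlaps.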
Scorer methodology change: scores are not comparable to v1.8.0. equipmentCompleteness now uses a category-based comparison instead of a binary has-equipment check (76→39, a more honest figure). parameterAccuracy now uses word-boundary regex matching plus semantic temperature equivalents instead of substring matching (51→46, with false positives removed). The overall score of 67 reflects true extraction quality.
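The parameterAccuracy change is easy to see in a toy example: substring matching finds "10 min" inside "110 min", while a word-boundary regex does not. The sketch below is illustrative only. The equivalence group is a hypothetical stand-in for the scorer's temperature equivalents, and parameters are assumed to start and end with alphanumeric characters so that `\b` behaves as intended.

```python
import re

# Hypothetical equivalence group; the scorer's real list is not published here.
TEMPERATURE_EQUIVALENTS = [
    {"room temperature", "ambient temperature", "rt"},
]


def substring_match(parameter: str, text: str) -> bool:
    """Old behaviour: plain case-insensitive substring search."""
    return parameter.lower() in text.lower()


def word_boundary_match(parameter: str, text: str) -> bool:
    """New behaviour: the parameter must appear as a whole token sequence."""
    pattern = r"\b" + re.escape(parameter) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def parameter_found(parameter: str, text: str) -> bool:
    """Word-boundary match on the parameter or any listed equivalent."""
    candidates = {parameter.lower()}
    for group in TEMPERATURE_EQUIVALENTS:
        if parameter.lower() in group:
            candidates |= group
    return any(word_boundary_match(candidate, text) for candidate in candidates)
```

For example, `substring_match("10 min", "sessions lasted 110 min")` is true, which is the kind of false positive the stricter matcher removes, while `parameter_found("room temperature", "incubate at RT for 1 h")` still succeeds through the equivalence group.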
OpenAI embedding similarity (text-embedding-3-small) replaces the Dice coefficient for step matching. stepCoverage 33→61 (+85%), overall 67→74.
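The difference between the two step-matching measures can be sketched as follows. The Dice version here works on word sets, which is an assumption (the original scorer may have used character bigrams), and the cosine function takes vectors that would come from an embedding model in practice. No API call is made in this sketch.

```python
import math
import re


def dice(a: str, b: str) -> float:
    """Dice coefficient on word sets: 2 * |A ∩ B| / (|A| + |B|)."""
    set_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    set_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relative_change_percent(before: float, after: float) -> float:
    """Relative change, as used for the stepCoverage figure above."""
    return (after - before) / before * 100.0
```

Dice scores a paraphrased pair such as "locomotor activity was recorded" and "press F1" at zero because no words overlap, whereas embedding vectors can still place related steps close together. The stepCoverage figure checks out: `relative_change_percent(33, 61)` is about 84.8, which rounds to the +85% quoted.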
Scorer improvements (stopwords, 40+ synonyms), multi-pass extraction, and targeted few-shot examples. All 6 loop iterations were rejected: prompt mutations converged, so the scorer is the bottleneck.
AutoResearch v2 — mutator architecture, 22 synonym groups
Expanded protocols.io (PIO) ground truths (GTs) to 134; extraction is now config-driven
protocols.io ground truth preferred over ConductScience ground truth; more experiment types
Multi-strategy step matching, synonym expansion
Expanded protocols.io to 57 GTs, improved fuzzy matching
Added protocols.io ground truth ingestion
Initial release — ConductScience GT only