APP

datasheet parser

Parses manufacturer PDF datasheets into structured wiki markdown with extracted diagrams, electrical specs, pin descriptions, and design charts, then publishes them to the Adom Wiki. Delegates extraction to the ds-extract service (docling + pdfplumber + PyMuPDF + confidence-routed rules engine).

Datasheet Parser demo: full pipeline walkthrough covering queue, ds-extract, vision escalation, and wiki publish (59s)

Sample prompts

Paste any of these into Claude Code to use this app:

  • Parse this datasheet: "Parse the BME680 datasheet and publish it"
  • Convert to wiki: "Convert the BHI360 datasheet to wiki format"
  • Extract specs: "Extract the electrical specs from this VL53L8CX PDF"
  • Standardize: "Standardize the AP63357DV datasheet for our wiki"
  • Show datasheet: "Show me the datasheet for the bq27441"

Install this skill

Paste this into Claude Code (VS Code panel, Adom editor, or terminal) to install:

Fetch the Adom Wiki app "datasheet parser" (slug: datasheet-parser) at https://wiki-ufypy5dpx93o.adom.cloud/wiki/apps/datasheet-parser. This is a knowledge-only app with no binary. Call GET https://wiki-ufypy5dpx93o.adom.cloud/api/v1/pages/datasheet-parser, extract the .page.skill_source field, and save it to ~/.claude/skills/datasheet-parser/SKILL.md (create the directory). Then confirm by showing the first 10 lines of the saved file.

name: datasheet-parser
description: Parses manufacturer PDF datasheets into structured wiki markdown with extracted diagrams, electrical specs, pin descriptions, and design charts, then publishes them to the Adom Wiki. Use when the user says "parse a datasheet", "convert a datasheet", "download datasheet for [part]", "standardize a datasheet", "extract specs from datasheet", or "show me the datasheet for [part]". Delegates extraction to the ds-extract service (docling + pdfplumber + PyMuPDF + confidence-routed rules engine) and only uses Claude vision on the escalation-queue bboxes.

Datasheet Parser

Parse manufacturer PDF datasheets into structured wiki markdown. Claude's role here is orchestrator + reviewer, not extractor. The heavy lifting (rendering pages, running OCR, detecting layout/tables/figures, enumerating figure bboxes from the PDF object tree, computing confidence signals) happens on the ds-extract service. Claude only gets called on the bbox crops the service couldn't resolve on its own.

Architecture

PDF ──► POST /extract (ds-extract)
          │
          ├─► 1141 blocks typed by docling
          ├─► table cells from pdfplumber (fallback when docling empty)
          ├─► figure bboxes from PyMuPDF object tree
          ├─► cross-extractor agreement (containment)
          └─► rules engine → escalation_queue (~40 bboxes, not 67 pages)
          │
Claude ◄──┘
  │
  ├─► Batch escalation_queue crops via /extract-region → single vision call
  ├─► Merge answers back into blocks
  └─► Generate wiki markdown from the merged structured data
      └─► adom-wiki page publish + asset upload

Everything before the "Claude" node is deterministic and runs on the service in a few minutes of CPU. Vision tokens are only spent on the ~40 bbox crops the service's confidence routing couldn't resolve.

Service URL

https://ds-extract-fa4sdo7pnkrl.adom.cloud/

Interactive Invocation

When the user triggers this skill with a bare phrase like "parse the LM358 datasheet" and has not already specified --visualize, use AskUserQuestion to get their preference:

Open the live visualizer while parsing?

  • Yes: open live view in a webview tab → set --visualize
  • No: run silently → proceed without the flag

If the user already specified, or the skill was invoked from process-datasheets, honor that and skip the question. Confirm in one line ("Parsing BQ76920, starting.") and proceed.

Arguments

  • --visualize: Open the datasheet-visualizer webview tab and emit live progress events through each step
  • --no-visualize: Run silently (suppresses the interactive question)

Queue Integration

A shared queue at https://wtqihf5e8fsv.adom.cloud coordinates parsing across agents.

ds-queue list                             # what needs parsing
ds-queue claim --by $(hostname)           # claim next item (returns id + pdf_url + part)
ds-queue complete <id> --by $(hostname) --wiki-slug "datasheets/<part>"
ds-queue fail <id> --by $(hostname) --reason "<what broke>"

Claim FIRST to prevent duplicate work when parsing from the queue.

Workflow

Step 1: Acquire the PDF

Four sources:

  • User URL → curl -sL -o /tmp/<part>.pdf "$URL"
  • Local path → use directly
  • Queue item: check the pdf_url field:
    • http* → download via curl
    • /uploads/* → curl -sL -o /tmp/<part>.pdf "https://wtqihf5e8fsv.adom.cloud<upload_path>"
  • No source → WebSearch for the manufacturer's official PDF. Prefer ti.com, bosch-sensortec.com, st.com, nxp.com, microchip.com, analog.com over aggregators.
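The queue-item branch can be sketched as a small resolver. This is only an illustration, not part of the ds-queue CLI: `resolve_pdf_url` and `QUEUE_BASE` are hypothetical names, and the only rules encoded are the two prefixes listed above.

```python
QUEUE_BASE = "https://wtqihf5e8fsv.adom.cloud"  # queue host named in this document


def resolve_pdf_url(pdf_url: str, queue_base: str = QUEUE_BASE) -> str:
    """Turn a queue item's pdf_url into a fetchable URL (hypothetical helper)."""
    if pdf_url.startswith(("http://", "https://")):
        return pdf_url  # already a full download URL
    if pdf_url.startswith("/uploads/"):
        return queue_base.rstrip("/") + pdf_url  # file stored on the queue server
    raise ValueError(f"unrecognized pdf_url: {pdf_url!r}")
```

Raising on anything else keeps an unexpected value from silently turning into a bad curl call.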

Step 2: Extract via ds-extract service

curl -sS -F pdf=@/tmp/<part>.pdf https://ds-extract-fa4sdo7pnkrl.adom.cloud/extract \
  -o /tmp/<part>-extract.json

This takes ~5–7 minutes on CPU for a 60-page datasheet (cached by sha256; re-requests are instant). The full service contract (endpoints, JSON schema, escalation-queue shape, how to re-run rules without re-extracting) is in ds-extract-reference.md. An abridged shape of the response is shown just before "Step 3: Process the escalation queue".

emit_stage() { :; }
emit_page_rendered() { :; }
emit_page_annotated() { :; }
emit_crop() { :; }
emit_section() { :; }
emit_published() { :; }
emit_done() { :; }

Step 1: Find the Datasheet

emit_stage download start

There are four sources for the PDF:

A) User provides a URL: use it directly.

B) User provides a local file path: use it directly, skip Step 2.

C) Queue item: check the pdf_url field:

  • If it starts with http → it's a download URL, proceed to Step 2.
  • If it starts with /uploads/ → it's stored on the ds-queue server. Download it from the queue server:
    DS_QUEUE_URL="https://wtqihf5e8fsv.adom.cloud"
    curl -sL -o /tmp/<partname>.pdf "$DS_QUEUE_URL<upload_path>"

D) No URL provided: use WebSearch to find the official manufacturer datasheet PDF URL. Prefer the manufacturer's site over third-party aggregators. Common sources: ti.com, bosch-sensortec.com, st.com, nxp.com, microchip.com, analog.com.

Step 2: Download the PDF

curl -sL -o /tmp/<partname>.pdf "<URL>"
ls -la /tmp/<partname>.pdf
emit_stage download done

Skip this step if the PDF is already local (source B or C with /uploads/ path already downloaded in Step 1).

Step 3: Extract Raw Text

emit_stage extract start
pdftotext /tmp/<partname>.pdf /tmp/<partname>.txt

Read the extracted text to get a rough overview of the datasheet contents. This text will have formatting issues, wrong reading order, mangled tables, etc.; that's expected. It serves as a guide for what's on each page, not the source of truth.

Also check page count:

pdfinfo /tmp/<partname>.pdf | grep Pages
emit_stage extract done

Step 4: Render Full-Page PNGs and Perform Visual OCR

This is the core processing step. Full-page PNGs are rendered from the PDF, then Claude's native vision reads each page to extract accurate, well-structured content.

4a. Render pages with pdftoppm

emit_stage render start
mkdir -p /home/adom/project/project-content/datasheets/<partname>
pdftoppm -png -r 300 /tmp/<partname>.pdf /home/adom/project/project-content/datasheets/<partname>/<partname>

This produces <partname>-1.png, <partname>-2.png, etc. at 300 DPI, high enough for Claude to read all text, table cells, and fine diagram details.

CRITICAL: resize before reading. The Read tool hard-limits at 2000 px per side in many-image requests. A 300-DPI A4 render is ~2480×3508 px (US letter is ~2550×3300 px), which will crash the conversation the first time you try to Read a batch of pages. Produce a downscaled mirror at ≤1500 px long edge and Read from that mirror during Step 4b; keep the full-resolution originals for crop operations in Step 4c.

mkdir -p /home/adom/project/project-content/datasheets/<partname>/pages-1500
for png in /home/adom/project/project-content/datasheets/<partname>/<partname>-*.png; do
  convert "$png" -resize '1500x1500>' "/home/adom/project/project-content/datasheets/<partname>/pages-1500/$(basename "$png")"
done

The > flag only shrinks, never upscales, so already-small renders pass through untouched.
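The page geometry behind the 1500 px mirror can be checked by hand. The sketch below is illustrative only: `render_size_px` and `shrink_only` are hypothetical helpers that mimic the arithmetic (page size times DPI, then a shrink-only fit like `-resize '1500x1500>'`); ImageMagick's own rounding may differ by a pixel.

```python
def render_size_px(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel size of a page rendered at the given DPI (25.4 mm per inch)."""
    return round(width_mm / 25.4 * dpi), round(height_mm / 25.4 * dpi)


def shrink_only(width: int, height: int, limit: int = 1500) -> tuple[int, int]:
    """Fit inside limit x limit, preserving aspect ratio, never upscaling."""
    scale = min(1.0, limit / max(width, height))
    return round(width * scale), round(height * scale)


a4 = render_size_px(210, 297, 300)  # A4 is 210 mm x 297 mm
print(a4, shrink_only(*a4))         # (2480, 3508) (1060, 1500)
print(shrink_only(800, 600))        # (800, 600): small renders pass through
```

Both sides of the mirror stay under the 2000 px limit, while the originals keep full resolution for cropping.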

If --visualize is set, also emit a downscaled preview for each page and a page_rendered event:

for png in /home/adom/project/project-content/datasheets/<partname>/<partname>-*.png; do
  n=$(basename "$png" | sed 's/.*-//;s/\.png$//')
  convert "$png" -resize 800x "$VIZ_PAGES_DIR/$n.png"
  emit_page_rendered "$n" "$n.png"
done
emit_stage render done

4b. Page-by-page visual analysis

emit_stage ocr start

Process each page PNG with Claude vision. For each page, provide:

  • The full-page PNG (via the Read tool)
  • The corresponding section of pdftotext output as a rough reference

For each page, Claude must extract:

  1. Corrected text: proper reading order (columns, sidebars, captions), with all formatting issues from pdftotext resolved by trusting the visual. When pdftotext and the visual disagree, always trust the visual.

  2. Tables: reconstruct as proper markdown tables with correct column alignment, merged cells expanded, and all values accurate. Pay close attention to min/typ/max columns, units, and footnote markers.

  3. Formulas: convert any mathematical formulas to LaTeX. Wrap in $$...$$ for display math or $...$ for inline. Read the formula directly from the PNG; do not rely on pdftotext for formula content.

  4. Translations: if any text is in a non-English language, translate it to English. Do not use any third-party translation tool. Claude translates natively. Preserve the original meaning precisely; for technical terms, keep the standard English engineering term.

  5. Diagram manifest: for every diagram, figure, chart, graph, photo, or illustration on the page, output a bounding box and metadata:

    DIAGRAM: {
      id: "page<N>_fig<M>",
      category: "<category>",
      caption: "<descriptive caption>",
      bbox: { x: <left_px>, y: <top_px>, w: <width_px>, h: <height_px> }
    }
    

    Bounding box accuracy is critical. Follow these rules to avoid mis-crops:

    • Anchor on visual edges, not captions. The bbox top (y) should be the top pixel of the diagram's graphical content (border, axis line, top of an image element), NOT the "Figure N" caption or preceding body text. Similarly the bottom should be the bottom of the graphical content. Include the figure caption/label only if it's visually attached to the diagram (e.g., directly below with no text paragraphs between).
    • Use structural landmarks. At 300 DPI, a typical A4 page is ~2480×3508 px. Use known elements (page margins ~100-150px, headers/footers, column gutters) to cross-check your Y estimates. If a diagram sits in the bottom third of the page, y should be >~2300; if your estimate is <1500, something is wrong.
    • Separate body text from diagram labels. Paragraph text (full-width lines of body copy) is NOT part of the diagram. Callout labels, axis labels, and annotations that are spatially within or adjacent to the diagram ARE part of it.
    • Include all associated elements: arrows, callout text, labels, dimension lines, legends, axis labels, and titles that logically belong to the diagram.
    • Add 5% padding on all sides to catch stray annotations, but clamp so padding doesn't extend into unrelated text regions or off the page.

    Categories:

    | Category | Examples |
    | --- | --- |
    | overview | Product photos, block diagrams, system architecture |
    | schematic | Internal schematics, application circuits, reference designs |
    | characteristic | V-I curves, temperature curves, frequency response |
    | waveform | Oscilloscope captures, timing diagrams |
    | mechanical | Package drawings, dimensional drawings, footprints |
    | integration | Mounting guides, thermal design, layout recommendations |
    | electrical | Pin diagrams, pinout views, connector details |
    | timing | Timing diagrams, sequencing charts, state machines |
    | design | Design nomographs, selection charts |
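The padding rule in the bounding-box guidance can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: the 5% is taken from the box's own width and height, and clamping is done only against the page edges (keeping padding out of unrelated text still needs visual judgment). `pad_bbox` is a hypothetical helper name.

```python
def pad_bbox(x, y, w, h, page_w, page_h, pad=0.05):
    """Grow an (x, y, w, h) pixel box by `pad` of its own size per side, clamped to the page."""
    dx, dy = w * pad, h * pad
    left = max(0, x - dx)
    top = max(0, y - dy)
    right = min(page_w, x + w + dx)
    bottom = min(page_h, y + h + dy)
    return round(left), round(top), round(right - left), round(bottom - top)


# A 1000x800 px diagram in the lower part of a 300 DPI A4 page.
print(pad_bbox(1000, 2400, 1000, 800, 2480, 3508))  # (950, 2360, 1100, 880)
```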

Token efficiency tips:

  • For simple pages (mostly one large table), processing is fast: one call per page.
  • For very long datasheets (>50 pages), batch 2-3 consecutive pages of pure spec tables into a single call since they share context.
  • Skip pages that are entirely blank, legal boilerplate, or revision history, and note them as skipped.
  • Use parallel subagents to process pages concurrently when possible.

If --visualize is set, emit a page_annotated event after each page's OCR completes, with the bboxes normalized to [0,1]. Example: compute normalized coords from the 300 DPI page geometry (e.g. xn = x / 2480, yn = y / 3508 for A4):

BOXES='[{"id":"b1","kind":"table","label":"Electrical Characteristics","bbox":[0.10,0.28,0.82,0.56]},
        {"id":"b2","kind":"diagram","label":"Figure 4: Pinout","bbox":[0.55,0.72,0.38,0.18]}]'
emit_page_annotated <page_number> "$BOXES"
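The normalization itself is a pair of divisions. The sketch below assumes the bbox array order is [x, y, w, h]; that is an inference from the sample boxes above (the second box's third value is smaller than its first, so it cannot be a right edge), not something the event schema shown here states. `normalize_bbox` is a hypothetical helper.

```python
def normalize_bbox(x, y, w, h, page_w=2480, page_h=3508):
    """Convert a pixel (x, y, w, h) box to page-relative [0, 1] coordinates."""
    return [round(x / page_w, 4), round(y / page_h, 4),
            round(w / page_w, 4), round(h / page_h, 4)]


# A box starting at the page centre and spanning a quarter of each dimension.
print(normalize_bbox(1240, 1754, 620, 877))  # [0.5, 0.5, 0.25, 0.25]
```

The defaults match a 300 DPI A4 render; pass the actual page size for other formats.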

At the end of 4b: emit_stage ocr done.

4c. Crop diagrams: two-pass approach

emit_stage crop start

Diagram cropping uses two passes: a fast programmatic pass for captioned figures, then a vision validation pass to catch issues and find uncaptioned diagrams.

Pass 1: Automated crop with crop_figures.py

python3 ~/.claude/skills/datasheet-parser/crop_figures.py \
  /tmp/<partname>.pdf \
  /home/adom/project/project-content/datasheets/<partname>/ \
  --dpi 300 --padding 20

This uses PyMuPDF to find Figure N captions in the PDF, identify nearby vector drawings and embedded images, exclude body text / headings, and render cropped PNGs directly. It produces a figures-manifest.json listing all extracted figures. Runs in seconds, zero vision tokens.

Pass 2: Vision validation + uncaptioned diagrams

Read each cropped PNG from Pass 1 and verify:

  • Diagram content is fully visible (not clipped)
  • No unrelated body text / headings leaked in
  • Caption is included

For any bad crops, use the bounding box from the page PNG to re-crop with convert:

convert <partname>-<page>.png -crop <W>x<H>+<X>+<Y> +repage <output>.png

Also identify uncaptioned diagrams found during the page-by-page visual OCR (Step 4b) that crop_figures.py could not detect (e.g., inline diagrams without "Figure N" labels). Crop these manually using convert with bounding boxes from the visual analysis.

Post-processing (apply to all cropped diagrams):

# Anti-alias, resize, optimize, trim whitespace, add padding in a single command
convert <diagram>.png -blur 0x0.5 -unsharp 0x1+0.5+0.05 \
  -resize 1200x1200\> -colors 128 -trim +repage \
  -bordercolor white -border 30 <diagram>.png

Target: each image 30–150 KB after optimization.

If --visualize is set, emit a crop event for each diagram as it's validated or rejected:

emit_crop "<crop_id>" "<bbox_id_from_ocr>" validated    # or: detected, rejected

At the end of 4c: emit_stage crop done.

4d. Synthesize across pages

emit_stage synthesize start

After all pages are processed, combine the per-page outputs into a unified document:

  • Merge tables that span multiple pages (watch for repeated headers)
  • Deduplicate page headers/footers
  • Resolve cross-references ("see Table 3", "refer to Figure 12")
  • Organize content into the standardized wiki sections (see Step 5)
emit_stage synthesize done
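Merging a table that continues on the next page mostly means dropping the repeated header. A minimal sketch, assuming each fragment is already a list of GFM pipe-table lines and that a continuation repeats the header verbatim; `merge_table_pages` is a hypothetical helper, and headers that differ in wording still need manual review.

```python
def merge_table_pages(first: list[str], continuation: list[str]) -> list[str]:
    """Append continuation rows to `first`, dropping a repeated header + separator."""
    if len(continuation) >= 2 and continuation[0].strip() == first[0].strip():
        continuation = continuation[2:]  # skip header row and |---| separator
    return first + continuation


page_a = ["| Parameter | Min | Max |", "| --- | --- | --- |", "| Supply Voltage | 4.5 | 16 |"]
page_b = ["| Parameter | Min | Max |", "| --- | --- | --- |", "| Supply Current | 3 | 6 |"]
merged_table = merge_table_pages(page_a, page_b)
print("\n".join(merged_table))
```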

Step 5: Generate Wiki Markdown

emit_stage generate start

Build a structured markdown file that the wiki server renders into the tabbed UI. The wiki's parseDatasheetSections() function splits on ## headings and maps them to tabs.

CRITICAL: The heading names must match exactly, because the wiki uses DS_TAB_MAP to route sections to tabs:

| Heading | Tab |
| --- | --- |
| `## Description` | Overview |
| `## Key Specifications` | Overview |
| `## Features` | Overview |
| `## Pin Configuration` | Pinout |
| `## Pinout` | Pinout |
| `## Absolute Maximum Ratings` | Specifications |
| `## Recommended Operating Conditions` | Specifications |
| `## Electrical Characteristics` | Specifications |
| `## Power Consumption` | Specifications |
| `## Power Domains` | Specifications |
| `## Thermal Information` | Specifications |
| `## Timing Accuracy` | Specifications |
| `## Communication Interface` | Specifications |
| `## Packages` | Specifications |
| `## Software API` | Software |
| `## Applications` | Applications |
| `## Key Formulas` | Applications |
| Any unrecognized heading | Specifications (fallback) |

DO NOT include ## Diagrams. The Diagrams tab is populated automatically from screenshot assets uploaded to the wiki.
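To pre-check a body before publishing, the routing table can be mirrored locally. This is a sketch of the lookup only, not the wiki's parseDatasheetSections() implementation (which is not shown here): `route_headings` is a hypothetical helper, and the lowercase tab IDs are the ones this document uses for section events.

```python
# Local mirror of the heading-to-tab table; matching is exact and case-sensitive.
DS_TAB_MAP = {
    "Description": "overview",
    "Key Specifications": "overview",
    "Features": "overview",
    "Pin Configuration": "pinout",
    "Pinout": "pinout",
    "Absolute Maximum Ratings": "specifications",
    "Recommended Operating Conditions": "specifications",
    "Electrical Characteristics": "specifications",
    "Power Consumption": "specifications",
    "Power Domains": "specifications",
    "Thermal Information": "specifications",
    "Timing Accuracy": "specifications",
    "Communication Interface": "specifications",
    "Packages": "specifications",
    "Software API": "software",
    "Applications": "applications",
    "Key Formulas": "applications",
}


def route_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, tab) for each '## ' heading; unknown headings fall back."""
    routed = []
    for line in markdown.splitlines():
        if line.startswith("## "):
            heading = line[3:].strip()
            routed.append((heading, DS_TAB_MAP.get(heading, "specifications")))
    return routed
```

A heading such as `## pinout` (wrong case) lands in the fallback tab, which makes silent mismatches easy to spot.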

Badge metadata (first 15 lines)

The wiki scans the first 15 lines for **Key:** Value patterns to extract badges:

**Source:** [Texas Instruments Datasheet (SNAS548D)](https://www.ti.com/lit/ds/symlink/lm555.pdf)
**Manufacturer:** Texas Instruments
**Part Number:** LM555
**Document:** SNAS548D, Rev D, January 2015
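The badge scan can be approximated with a short regular expression. This is a sketch of the described behavior (first 15 lines, `**Key:** Value`), not the wiki's actual parser; `extract_badges` and `BADGE_RE` are hypothetical names.

```python
import re

# Matches a whole line of the form **Key:** Value
BADGE_RE = re.compile(r"^\*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.+)$")


def extract_badges(markdown: str, scan_lines: int = 15) -> dict[str, str]:
    """Collect **Key:** Value pairs from the first `scan_lines` lines."""
    badges = {}
    for line in markdown.splitlines()[:scan_lines]:
        match = BADGE_RE.match(line.strip())
        if match:
            badges[match.group("key").strip()] = match.group("value").strip()
    return badges


header = "**Manufacturer:** Texas Instruments\n**Part Number:** LM555"
print(extract_badges(header))
```

Anything placed after line 15 is ignored by this sketch, so the badge block belongs at the very top of the body.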

Special rendering rules

  1. Key Specifications: use a 2-column table (Parameter | Value). The wiki renders this as stat cards:

    ## Key Specifications
    
    | Parameter | Value |
    | --- | --- |
    | Supply Voltage | 4.5V to 16V |
    | Output Current | 200mA |
    | Pin Count | 8 |
    
  2. Pin Configuration: table must have a Name column. Optional: Pin or #, Type or Direction, Description or Desc or Function. Renders as pin cards:

    ## Pin Configuration
    
    | Pin | Name | Type | Description |
    | --- | --- | --- | --- |
    | 1 | GND | Power | Ground reference |
    | 2 | TRIG | Input | Trigger input |
    
  3. Features: use - unordered lists. Renders with triangle bullet markers:

    ## Features
    
    - Direct replacement for SE555/NE555
    - Timing from microseconds through hours
    
  4. Formulas: use LaTeX notation. Display math with $$...$$, inline with $...$:

    ## Key Formulas
    
    ### Monostable Mode
    $$t = 1.1 \times R \times C$$
    
    ### Astable Mode
    $$f = \frac{1.44}{(R_A + 2R_B) \times C}$$
    
    $$\text{Duty Cycle} = \frac{R_A + R_B}{R_A + 2R_B}$$
    
  5. Tables: use standard GFM pipe tables. All other tables render as styled HTML tables.

  6. Inline formatting: **bold**, *italic*, `code`, [text](url) are supported.

Complete example

**Source:** [Texas Instruments Datasheet (SNAS548D)](https://www.ti.com/lit/ds/symlink/lm555.pdf)
**Manufacturer:** Texas Instruments
**Part Number:** LM555
**Document:** SNAS548D, Rev D, January 2015

## Description

The LM555 is a highly stable device for generating accurate time delays or oscillation.

## Key Specifications

| Parameter | Value |
| --- | --- |
| Supply Voltage | 4.5V to 16V |
| Output Current | 200mA |
| Pin Count | 8 |
| Temperature Stability | 0.005%/°C |

## Features

- Direct replacement for SE555/NE555
- Timing from microseconds through hours
- Operates in both astable and monostable modes
- Output can source or sink 200mA

## Pin Configuration

| Pin | Name | Type | Description |
| --- | --- | --- | --- |
| 1 | GND | Power | Ground reference |
| 2 | TRIG | Input | Trigger input; starts timing below 1/3 VCC |
| 3 | OUT | Output | Timer output; sources/sinks 200mA |
| 4 | RESET | Input | Active-low reset |
| 5 | CTRL | Input | Control voltage; bypass with 10nF |
| 6 | THR | Input | Threshold; ends cycle above 2/3 VCC |
| 7 | DIS | Output | Open-collector discharge |
| 8 | VCC | Power | Supply 4.5V to 16V |

## Absolute Maximum Ratings

| Parameter | Min | Max | Unit |
| --- | --- | --- | --- |
| Supply Voltage | - | 18 | V |
| Power Dissipation | - | 600 | mW |
| Storage Temperature | -65 | 150 | °C |

## Electrical Characteristics

| Parameter | Conditions | Min | Typ | Max | Unit |
| --- | --- | --- | --- | --- | --- |
| Supply Voltage | - | 4.5 | - | 16 | V |
| Supply Current (Low) | VCC=5V, RL=∞ | - | 3 | 6 | mA |
| Supply Current (High) | VCC=15V, RL=∞ | - | 10 | 15 | mA |

## Thermal Information

| Package | RθJA | Unit |
| --- | --- | --- |
| PDIP-8 | 97 | °C/W |
| SOIC-8 | 149 | °C/W |

## Packages

| Package | Pins | Body Size |
| --- | --- | --- |
| PDIP (P) | 8 | 9.81mm × 6.35mm |
| SOIC (D) | 8 | 4.90mm × 3.91mm |

## Applications

- Precision timing
- Pulse generation
- Sequential timing
- PWM generation

## Key Formulas

### Monostable Mode
$$t = 1.1 \times R \times C$$

### Astable Mode
$$f = \frac{1.44}{(R_A + 2R_B) \times C}$$

$$\text{Duty Cycle} = \frac{R_A + R_B}{R_A + 2R_B}$$

Save this file as:

/home/adom/project/project-content/datasheets/<partname>/<partname>.md

And also write a temporary copy for publishing:

cp /home/adom/project/project-content/datasheets/<partname>/<partname>.md /tmp/<partname>-body.md

If --visualize is set, emit a section event for each generated section so the wiki-preview pane can fill it in live. Use the tab IDs from DS_TAB_MAP: overview, pinout, specifications, software, applications.

# Example: emit the overview section after writing it
HTML_B64=$(markdown-to-html "/tmp/overview.md" | base64 -w0)   # use any md→html tool available
emit_section overview "$HTML_B64"

If no converter is available, the simplest path is to send the raw markdown as HTML wrapped in <pre>; the UI will render it verbatim, and that's good enough for the live-preview case.

At the end of Step 5: emit_stage generate done.

Step 6: Prepare Metadata JSON

Create a metadata JSON file for the wiki page:

Abridged shape of the ds-extract /extract response:

{
  "pdf_hash": "sha256...",
  "page_count": 67,
  "pages": [
    {
      "page": 1, "width_pt": 612, "height_pt": 792,
      "page_png_url": "/artifact/<hash>/p1.png",
      "blocks": [
        {
          "block_id": "p1_b4", "type": "title",
          "bbox_pt": [x0, y0, x1, y1],
          "text": "BQ76920, BQ76930, BQ76940",
          "confidence": 0.9,
          "cross_agreement": {"pdftotext": 1.0, "pdfplumber": 1.0},
          "findings": [...]
        }
      ],
      "object_figures": [...]
    }
  ],
  "document_findings": [...],
  "escalation_queue": [
    {
      "block_id": "p8_b4",
      "page": 8, "bbox_pt": [...], "block_type": "table",
      "priority": 20, "rule": "table.has_cells",
      "question": "Extract this table as markdown..."
    }
  ],
  "rules_summary": {
    "block_findings": {"info": 113, "warn": 145, "escalate": 40},
    "escalations": 40
  }
}
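Once the response is loaded, a quick pass over escalation_queue shows how much vision work is pending. The sketch below is illustrative: the helper names are hypothetical, the sample dict is made up to follow the abridged shape, and treating rules_summary.escalations as equal to the queue length is an assumption rather than a documented guarantee.

```python
from collections import defaultdict


def escalations_by_page(extract: dict) -> dict[int, list[str]]:
    """Group escalation_queue block_ids by page so crops can be fetched in batches."""
    grouped: dict[int, list[str]] = defaultdict(list)
    for entry in extract.get("escalation_queue", []):
        grouped[entry["page"]].append(entry["block_id"])
    return dict(grouped)


def queue_matches_summary(extract: dict) -> bool:
    """Assumed invariant: rules_summary.escalations equals the queue length."""
    expected = extract.get("rules_summary", {}).get("escalations")
    return expected == len(extract.get("escalation_queue", []))


# Made-up miniature response that follows the abridged shape.
sample = {
    "escalation_queue": [
        {"block_id": "p8_b4", "page": 8, "block_type": "table", "rule": "table.has_cells"},
        {"block_id": "p8_b7", "page": 8, "block_type": "table", "rule": "table.has_cells"},
        {"block_id": "p9_b2", "page": 9, "block_type": "table", "rule": "table.has_cells"},
    ],
    "rules_summary": {"escalations": 3},
}
print(escalations_by_page(sample))  # {8: ['p8_b4', 'p8_b7'], 9: ['p9_b2']}
```

In practice the dict would come from `json.load` on /tmp/<part>-extract.json.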

Step 3: Process the escalation queue

For every entry in escalation_queue, fetch a tight bbox crop and ask Claude vision the rule-specific question. Each crop is tiny, ~100–500 px wide. For the batching strategy and merge semantics per block type, see ds-extract-reference.md § Processing the queue.

# For each escalation:
curl -sS -X POST https://ds-extract-fa4sdo7pnkrl.adom.cloud/extract-region \
  -H 'Content-Type: application/json' \
  -d '{"pdf_hash": "...", "page": 8, "bbox_pt": [x0, y0, x1, y1], "dpi": 300}' \
  -o /tmp/crop-<block_id>.png

Then Read each crop PNG and answer the question field. Batch where possible: reading 5–10 small crops in one turn is cheap.
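The request body and the expected crop size can be derived from a queue entry. A minimal sketch: `region_payload` and `crop_size_px` are hypothetical helpers, the payload keys are the four shown in the curl example, and the size estimate assumes the service renders at the standard 72 PDF points per inch.

```python
import json


def region_payload(pdf_hash: str, page: int, bbox_pt, dpi: int = 300) -> str:
    """JSON body for POST /extract-region, using the keys from the curl example."""
    return json.dumps({"pdf_hash": pdf_hash, "page": page,
                       "bbox_pt": list(bbox_pt), "dpi": dpi})


def crop_size_px(bbox_pt, dpi: int = 300) -> tuple[int, int]:
    """Estimated crop size in pixels for an [x0, y0, x1, y1] box in PDF points."""
    x0, y0, x1, y1 = bbox_pt
    return round((x1 - x0) * dpi / 72), round((y1 - y0) * dpi / 72)


# A 1 in x 0.5 in region renders to 300 x 150 px at 300 DPI.
print(crop_size_px([72, 72, 144, 108]))  # (300, 150)
```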

For tables: expect a markdown table back. For figures: a "complete / partial / phantom" verdict plus a brief description. For low-agreement text: the verbatim text.

Write each verdict back into the corresponding block's text or table_cells field in memory.
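The write-back step can be sketched as a lookup by block_id. This is an illustration only: `merge_answer` is a hypothetical helper, and because the shape of table_cells is not shown in the abridged response, the answer is stored exactly as given.

```python
def merge_answer(extract: dict, block_id: str, answer, field: str = "text") -> bool:
    """Write a vision answer into the block with the given block_id.

    `field` is "text" or "table_cells", the two fields named in this step.
    Returns False when no block has that id.
    """
    if field not in ("text", "table_cells"):
        raise ValueError(f"unexpected field: {field!r}")
    for page in extract.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("block_id") == block_id:
                block[field] = answer
                return True
    return False


# Made-up one-block document with a misread title.
doc = {"pages": [{"page": 1, "blocks": [{"block_id": "p1_b4", "type": "title", "text": "BQ7692O"}]}]}
merged = merge_answer(doc, "p1_b4", "BQ76920, BQ76930, BQ76940")
missing = merge_answer(doc, "p9_b9", "ignored")
```

Returning a boolean makes it easy to log escalations whose block_id no longer exists in the extraction.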

Step 4: Assemble the wiki markdown

Walk the merged pages[].blocks and group by docling's semantic section structure. The wiki renders sections by exact heading match; see publish-and-events-reference.md § Heading-to-tab routing for the full DS_TAB_MAP.

Key rules (the wiki silently drops content that breaks these):

  • Key Specifications must be a 2-column table (Parameter | Value)
  • Pin Configuration must have a Name column; optional Pin, Type, Description
  • Formulas use LaTeX: $$...$$ (display) or $...$ (inline)
  • Features use - unordered lists
  • Do NOT include ## Diagrams; that tab is populated from screenshot assets

Save the body to /home/adom/project/project-content/datasheets/<part>/<part>.md plus a temp copy at /tmp/<part>-body.md.

Step 5: Publish

adom-wiki page publish "datasheets/<part>" \
  --title "<Part> - <one-line description>" \
  --brief "<2–3 sentence summary>" \
  --body-md /tmp/<part>-body.md \
  --changelog "Parsed from <manufacturer> <doc number>" \
  --sample-prompt "Show me the datasheet for <part>" \
  --sample-prompt "Parse the <family> datasheet" \
  --sample-prompt "<3–5 trigger phrases total>"

# Set metadata
adom-wiki page edit --field metadata \
  --body-md /home/adom/project/project-content/datasheets/<part>/metadata.json \
  "datasheets/<part>"

Step 6: Upload the hero + screenshot assets

For each figure block in the extraction that passed rules (or that Claude validated), fetch its crop and upload:

# Hero: pick the functional block diagram or overview
adom-wiki asset upload datasheets/<part> --asset-type hero-image \
  --file <best-figure>.png \
  --caption "<description>"

# Screenshots: every notable figure
adom-wiki asset upload datasheets/<part> --asset-type screenshot \
  --file <figure>.png \
  --caption "<figure caption from paired caption block>"

Target 5–20 screenshots. Prefer the figures with captions already paired by ds-extract. Always explicitly set hero_asset_id to avoid the implicit-fallback bug; see publish-and-events-reference.md § Hero image pitfall.

Step 7: Store artifacts + complete the queue

# PDF, extracted JSON, and all crops go to project-content
cp /tmp/<part>.pdf /home/adom/project/project-content/datasheets/<part>/
mv /tmp/<part>-extract.json /home/adom/project/project-content/datasheets/<part>/

# Mark queue item complete (if this came from the queue)
ds-queue complete <id> --by $(hostname) --wiki-slug "datasheets/<part>"

Full directory layout + queue semantics: publish-and-events-reference.md § Step 7.

Live Visualization

When --visualize is set, run the datasheet-visualizer service and emit stage/page/section/crop events so the user can watch live. Details (service install, event schema, stage names, emit helpers, batch behavior between datasheets): see publish-and-events-reference.md § Live Visualization.

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| /extract 500 or hangs > 10 min | Service container OOM or docling crashed | curl /health; if down, SSH in and bash ~/ds-extract/start.sh. See service logs at /tmp/ds-extract.log |
| Vendor PDF GET hangs / HTTP/2 stream INTERNAL_ERROR (curl exit 92) | ST / Akamai (and some other Akamai-fronted vendor CDNs) silently RST the connection on curl/reqwest's HTTP/2 fingerprint | Fall through to wget; its HTTP/1.1 negotiation looks browser-enough to get through. Pattern: wget -q --tries=2 --timeout=30 --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0" -O /tmp/<part>.pdf <url>. Confirmed 2026-04-26 against https://www.st.com/resource/en/datasheet/vl53l8cx.pdf: curl/reqwest both reset; wget got the full 3.6 MB %PDF in ~12 s. The aci component orchestrator now does this fall-through automatically. |
| Mouser/DigiKey URL returns ~13 KB HTML "Access denied" | Vendor anti-bot fronting | Always pass a Firefox UA; if still blocked, switch to the manufacturer's canonical CDN (e.g. diodes.com/assets/Datasheets/<part>.pdf, datasheets.raspberrypi.com/<part>/<part>-datasheet.pdf). Always verify the response starts with %PDF- magic bytes before posting to ds-extract; the 13 KB HTML page would otherwise eat 7 minutes in the parser |
| Many tables in escalation queue | docling + pdfplumber both failed on this table style | Expected; hand off to Claude vision on those bboxes. If > 50% of tables fail, the PDF is likely image-only (scanned) and needs OCR preprocessing |
| Wiki page has no tabs | Heading names don't match DS_TAB_MAP exactly | See DS_TAB_MAP in publish-and-events-reference.md; heading strings are case-sensitive |
| Hero image shows something unexpected | hero_asset_id not set explicitly and wiki picked an older asset | Always call page edit --field hero_asset_id <id> after uploading the hero |
| Slug collision with existing molecule page | datasheets/<slug> and an existing molecule share the slug | Datasheet content merges into the molecule page (unified pages table). Decide intentionally: keep merged, or publish to <slug>-datasheet |
| adom-wiki page publish rejects "requires 3-8 sample prompts" | Missing --sample-prompt flags | Add 3–5 --sample-prompt flags (trigger phrases) |
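The %PDF- check from the table is a five-byte read. A minimal sketch, with `looks_like_pdf` as a hypothetical helper name; the demo writes two throwaway files so it does not depend on any real download.

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """True when the file starts with the %PDF- magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"


# Demo: a fake PDF header versus an HTML "Access denied" page saved as .pdf.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    blocked = os.path.join(tmp, "blocked.pdf")
    with open(good, "wb") as fh:
        fh.write(b"%PDF-1.7\n")
    with open(blocked, "wb") as fh:
        fh.write(b"<html>Access denied</html>")
    results = (looks_like_pdf(good), looks_like_pdf(blocked))
print(results)  # (True, False)
```

Running this check before the upload avoids spending a full extraction on an error page.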

Dependencies

The ds-extract service handles all Python/OCR deps. On the user's container you only need:

  • curl (every container has this)
  • adom-wiki CLI (installed by node install.mjs)
  • ds-queue CLI (same)

Videos

Datasheet Parser demo: captioned walkthrough of the PDF-to-wiki workflow

AI Skill โ€” how Claude uses this app

Edit AI Skill

name: datasheet-parser description: Parses manufacturer PDF datasheets into structured wiki markdown with extracted diagrams, electrical specs, pin descriptions, and design charts, then publishes them to the Adom Wiki. Use when the user says "parse a datasheet", "convert a datasheet", "download datasheet for [part]", "standardize a datasheet", "extract specs from datasheet", or "show me the datasheet for [part]". Delegates extraction to the ds-extract service (docling + pdfplumber + PyMuPDF + confidence-routed rules engine) and only uses Claude vision on the escalation-queue bboxes.

Datasheet Parser

Parse manufacturer PDF datasheets into structured wiki markdown. Claude's role here is orchestrator + reviewer, not extractor. The heavy lifting โ€” rendering pages, running OCR, detecting layout/tables/figures, enumerating figure bboxes from the PDF object tree, computing confidence signals โ€” happens on the ds-extract service. Claude only gets called on the bbox crops the service couldn't resolve on its own.

Architecture

PDF โ”€โ”€โ–บ POST /extract (ds-extract)
          โ”‚
          โ”œโ”€โ–บ 1141 blocks typed by docling
          โ”œโ”€โ–บ table cells from pdfplumber (fallback when docling empty)
          โ”œโ”€โ–บ figure bboxes from PyMuPDF object tree
          โ”œโ”€โ–บ cross-extractor agreement (containment)
          โ””โ”€โ–บ rules engine โ†’ escalation_queue (~40 bboxes, not 67 pages)
          โ”‚
Claude โ—„โ”€โ”€โ”˜
  โ”‚
  โ”œโ”€โ–บ Batch escalation_queue crops via /extract-region โ†’ single vision call
  โ”œโ”€โ–บ Merge answers back into blocks
  โ””โ”€โ–บ Generate wiki markdown from the merged structured data
      โ””โ”€โ–บ adom-wiki page publish + asset upload

Everything before the "Claude" node is deterministic and runs on the service in a few minutes of CPU. Vision tokens are only spent on the ~40 bbox crops the service's confidence routing couldn't resolve.

Service URL

https://ds-extract-fa4sdo7pnkrl.adom.cloud/

Interactive Invocation

When the user triggers this skill with a bare phrase like "parse the LM358 datasheet" โ€” and they have not already specified --visualize โ€” use AskUserQuestion to get their preference:

Open the live visualizer while parsing?

  • Yes โ€” open live view in a webview tab โ†’ set --visualize
  • No โ€” run silently โ†’ proceed without the flag

If the user already specified, or the skill was invoked from process-datasheets, honor that and skip the question. Confirm in one line โ€” "Parsing BQ76920 โ€” starting." โ€” and proceed.

Arguments

  • --visualize โ€” Open the datasheet-visualizer webview tab and emit live progress events through each step
  • --no-visualize โ€” Run silently (suppresses the interactive question)

Queue Integration

A shared queue at https://wtqihf5e8fsv.adom.cloud coordinates parsing across agents.

ds-queue list                             # what needs parsing
ds-queue claim --by $(hostname)           # claim next item (returns id + pdf_url + part)
ds-queue complete <id> --by $(hostname) --wiki-slug "datasheets/<part>"
ds-queue fail <id> --by $(hostname) --reason "<what broke>"

Claim FIRST to prevent duplicate work when parsing from the queue.

Workflow

Step 1: Acquire the PDF

Three sources:

  • User URL → curl -sL -o /tmp/<part>.pdf "$URL"
  • Local path → use directly
  • Queue item: check the pdf_url field:
    • http* → download via curl
    • /uploads/* → curl -sL -o /tmp/<part>.pdf "https://wtqihf5e8fsv.adom.cloud<upload_path>"
  • No source → WebSearch for the manufacturer's official PDF. Prefer ti.com, bosch-sensortec.com, st.com, nxp.com, microchip.com, analog.com over aggregators.
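The queue-item branch above amounts to a prefix check on pdf_url. A minimal sketch follows; `resolve_pdf_url` and the `DS_QUEUE_URL` variable are illustrative names chosen here, not something the skill is known to ship.

```shell
# Sketch: turn a queue item's pdf_url field into a downloadable URL.
# resolve_pdf_url and DS_QUEUE_URL are illustrative names.
DS_QUEUE_URL="https://wtqihf5e8fsv.adom.cloud"

resolve_pdf_url() {
  case "$1" in
    http*)      printf '%s\n' "$1" ;;                      # already a full URL
    /uploads/*) printf '%s%s\n' "$DS_QUEUE_URL" "$1" ;;    # stored on the queue server
    *)          return 1 ;;                                # no usable source: search instead
  esac
}
```

A non-zero return is the cue to fall back to WebSearch for the manufacturer's PDF.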

Step 2: Extract via ds-extract service

curl -sS -F pdf=@/tmp/<part>.pdf https://ds-extract-fa4sdo7pnkrl.adom.cloud/extract \
  -o /tmp/<part>-extract.json

This takes ~5–7 minutes on CPU for a 60-page datasheet (cached by sha256; re-requests are instant). Full service contract (endpoints, JSON schema, escalation-queue shape, how to re-run rules without re-extracting): ds-extract-reference.md.

Abridged shape:

{
  "pdf_hash": "sha256...",
  "page_count": 67,
  "pages": [
    {
      "page": 1, "width_pt": 612, "height_pt": 792,
      "page_png_url": "/artifact/<hash>/p1.png",
      "blocks": [
        {
          "block_id": "p1_b4", "type": "title",
          "bbox_pt": [x0, y0, x1, y1],
          "text": "BQ76920, BQ76930, BQ76940",
          "confidence": 0.9,
          "cross_agreement": {"pdftotext": 1.0, "pdfplumber": 1.0},
          "findings": [...]
        }
      ],
      "object_figures": [...]
    }
  ],
  "document_findings": [...],
  "escalation_queue": [
    {
      "block_id": "p8_b4",
      "page": 8, "bbox_pt": [...], "block_type": "table",
      "priority": 20, "rule": "table.has_cells",
      "question": "Extract this table as markdown..."
    }
  ],
  "rules_summary": {
    "block_findings": {"info": 113, "warn": 145, "escalate": 40},
    "escalations": 40
  }
}
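For a quick look at what was escalated, the queue can be tallied per block type. The sketch below assumes jq is installed, which is not among the skill's listed dependencies, and `escalations_by_type` is a hypothetical helper name. Field names follow the abridged shape above.

```shell
# Sketch: count escalation_queue entries per block_type.
# Requires jq (an assumption; the skill does not list it as a dependency).
# escalations_by_type is a hypothetical helper name.
escalations_by_type() {
  jq -r '.escalation_queue
         | group_by(.block_type)
         | map("\(.[0].block_type)\t\(length)")
         | .[]' "$1"
}
```

Run against /tmp/<part>-extract.json, this prints one tab-separated line per block type, which makes a table-heavy queue easy to spot.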

Step 3: Process the escalation queue

For every entry in escalation_queue, fetch a tight bbox crop and ask Claude vision the rule-specific question. Each crop is tiny, ~100–500 px wide. Batching strategy, merge semantics per block type: ds-extract-reference.md § Processing the queue.
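As a sanity check on those crop sizes: PDF points are 1/72 inch, so a length in points maps to pixels as pt × dpi / 72. The helper below is a hypothetical sketch (`pt_to_px` is not part of the skill), and it assumes the service renders crops at the requested dpi.

```shell
# Sketch: convert a length in PDF points to pixels at a given DPI.
# pt_to_px is a hypothetical helper; 1 pt = 1/72 inch.
pt_to_px() {
  awk -v pt="$1" -v dpi="${2:-300}" 'BEGIN { printf "%.0f\n", pt * dpi / 72 }'
}
```

At 300 dpi, a bbox 24 pt wide comes out at 100 px and one 120 pt wide at 500 px, which brackets the range quoted above.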

# For each escalation:
curl -sS -X POST https://ds-extract-fa4sdo7pnkrl.adom.cloud/extract-region \
  -H 'Content-Type: application/json' \
  -d '{"pdf_hash": "...", "page": 8, "bbox_pt": [x0, y0, x1, y1], "dpi": 300}' \
  -o /tmp/crop-<block_id>.png

Then Read each crop PNG and answer the question field. Batch where possible: reading 5–10 small crops in one turn is cheap.
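Rather than hand-writing one request per escalation, the request bodies can be generated from the extract JSON. This sketch assumes jq is available (it is not among the skill's listed dependencies), and `region_requests` is a hypothetical helper name. The body fields mirror the curl example above.

```shell
# Sketch: emit one /extract-region request per escalation_queue entry.
# Requires jq (an assumption). region_requests is a hypothetical helper.
# Each output line is {"block_id": ..., "body": {...}} as compact JSON.
region_requests() {
  jq -c --argjson dpi "${2:-300}" \
    '.pdf_hash as $h
     | .escalation_queue[]
     | {block_id, body: {pdf_hash: $h, page, bbox_pt, dpi: $dpi}}' "$1"
}
```

Each line's body can be passed to curl with -d, and its block_id gives the output name /tmp/crop-<block_id>.png.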

For tables: expect a markdown table back. For figures: a "complete / partial / phantom" verdict plus a brief description. For low-agreement text: the verbatim text.

Write each verdict back into the corresponding block's text or table_cells field in memory.
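If a persisted copy of the merged data is wanted instead of an in-memory merge, a text verdict can be written back by block_id. This is a sketch under assumptions: jq is installed, and `set_block_text` is a hypothetical helper. It covers only the text field, because the shape of table_cells is not shown in the abridged JSON.

```shell
# Sketch: write a vision answer into the matching block's text field.
# Requires jq (an assumption). set_block_text is a hypothetical helper.
# Prints the updated document to stdout; the input file is not modified.
set_block_text() {
  local json="$1" block_id="$2" text="$3"
  jq --arg id "$block_id" --arg text "$text" \
    '(.pages[].blocks[] | select(.block_id == $id) | .text) = $text' "$json"
}
```

Blocks whose block_id does not match pass through unchanged.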

Step 4: Assemble the wiki markdown

Walk the merged pages[].blocks and group by docling's semantic section structure. The wiki renders sections by exact heading match; see publish-and-events-reference.md § Heading-to-tab routing for the full DS_TAB_MAP.

Key rules (the wiki silently drops content that breaks these):

  • Key Specifications must be a 2-column table (Parameter | Value)
  • Pin Configuration must have a Name column; optional Pin, Type, Description
  • Formulas use LaTeX: $$...$$ (display) or $...$ (inline)
  • Features use - unordered lists
  • Do NOT include ## Diagrams; that tab is populated from screenshot assets
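Because the wiki drops offending content silently, a pre-publish lint can catch two of these rules early. The sketch below is illustrative only: `check_body` is a hypothetical helper, and it assumes tables are GFM pipe tables whose rows start and end with "|".

```shell
# Sketch: pre-publish lint for two of the key rules above.
# check_body is a hypothetical helper; it assumes GFM pipe tables whose
# rows start and end with "|".
check_body() {
  local md="$1" rc=0 header cells
  # Rule: the body must not carry its own "## Diagrams" section.
  if grep -qE '^## Diagrams[[:space:]]*$' "$md"; then
    echo "error: remove the '## Diagrams' section" >&2
    rc=1
  fi
  # Rule: the first table row under "## Key Specifications" has two cells.
  header=$(awk '
    /^## Key Specifications/ { insec = 1; next }
    /^## /                   { insec = 0 }
    insec && /^\|/           { print; exit }
  ' "$md")
  if [ -n "$header" ]; then
    cells=$(printf '%s\n' "$header" | awk -F'|' '{ print NF - 2 }')
    if [ "$cells" -ne 2 ]; then
      echo "error: Key Specifications has $cells columns, expected 2" >&2
      rc=1
    fi
  fi
  return "$rc"
}
```

Running `check_body /tmp/<part>-body.md` before publishing turns a silent drop into a visible error.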

Save the body to /home/adom/project/project-content/datasheets/<part>/<part>.md plus a temp copy at /tmp/<part>-body.md.

Step 5: Publish

adom-wiki page publish "datasheets/<part>" \
  --title "<Part> - <one-line description>" \
  --brief "<2–3 sentence summary>" \
  --body-md /tmp/<part>-body.md \
  --changelog "Parsed from <manufacturer> <doc number>" \
  --sample-prompt "Show me the datasheet for <part>" \
  --sample-prompt "Parse the <family> datasheet" \
  --sample-prompt "<3–5 trigger phrases total>"

# Set metadata
adom-wiki page edit --field metadata \
  --body-md /home/adom/project/project-content/datasheets/<part>/metadata.json \
  "datasheets/<part>"

Step 6: Upload the hero + screenshot assets

For each figure block in the extraction that passed rules (or that Claude validated), fetch its crop and upload:

# Hero: pick the functional block diagram or overview
adom-wiki asset upload datasheets/<part> --asset-type hero-image \
  --file <best-figure>.png \
  --caption "<description>"

# Screenshots: every notable figure
adom-wiki asset upload datasheets/<part> --asset-type screenshot \
  --file <figure>.png \
  --caption "<figure caption from paired caption block>"

Target 5–20 screenshots. Prefer the figures with captions already paired by ds-extract. Always explicitly set hero_asset_id to avoid the implicit-fallback bug; see publish-and-events-reference.md § Hero image pitfall.

Step 7: Store artifacts + complete the queue

# PDF, extracted JSON, and all crops go to project-content
cp /tmp/<part>.pdf /home/adom/project/project-content/datasheets/<part>/
mv /tmp/<part>-extract.json /home/adom/project/project-content/datasheets/<part>/

# Mark queue item complete (if this came from the queue)
ds-queue complete <id> --by $(hostname) --wiki-slug "datasheets/<part>"

Full directory layout + queue semantics: publish-and-events-reference.md § Step 7.

Live Visualization

When --visualize is set, run the datasheet-visualizer service and emit stage/page/section/crop events so the user can watch live. Details (service install, event schema, stage names, emit helpers, batch behavior between datasheets): see publish-and-events-reference.md § Live Visualization.

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| /extract 500 or hangs > 10 min | Service container OOM or docling crashed | curl /health; if down, SSH in and bash ~/ds-extract/start.sh. See service logs at /tmp/ds-extract.log |
| Vendor PDF GET hangs / HTTP/2 stream INTERNAL_ERROR (curl exit 92) | ST / Akamai (and some other Akamai-fronted vendor CDNs) silently RST the connection on curl/reqwest's HTTP/2 fingerprint | Fall through to wget; its HTTP/1.1 negotiation looks browser-enough to get through. Pattern: wget -q --tries=2 --timeout=30 --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0" -O /tmp/<part>.pdf <url>. Confirmed 2026-04-26 against https://www.st.com/resource/en/datasheet/vl53l8cx.pdf: curl and reqwest both reset; wget got the full 3.6 MB %PDF in ~12 s. The aci component orchestrator now does this fall-through automatically. |
| Mouser/DigiKey URL returns ~13 KB HTML "Access denied" | Vendor anti-bot fronting | Always pass a Firefox UA; if still blocked, switch to the manufacturer's canonical CDN (e.g. diodes.com/assets/Datasheets/<part>.pdf, datasheets.raspberrypi.com/<part>/<part>-datasheet.pdf). Always verify the response starts with %PDF- magic bytes before posting to ds-extract; otherwise the 13 KB HTML page eats 7 minutes in the parser |
| Many tables in escalation queue | docling + pdfplumber both failed on this table style | Expected; hand off to Claude vision on those bboxes. If > 50% of tables fail, the PDF is likely image-only (scanned) and needs OCR preprocessing |
| Wiki page has no tabs | Heading names don't match DS_TAB_MAP exactly | See DS_TAB_MAP in publish-and-events-reference.md; heading strings are case-sensitive |
| Hero image shows something unexpected | hero_asset_id not set explicitly and wiki picked an older asset | Always call page edit --field hero_asset_id <id> after uploading the hero |
| Slug collision with existing molecule page | datasheets/<slug> and an existing molecule share the slug | Datasheet content merges into the molecule page (unified pages table). Decide intentionally: keep merged, or publish to <slug>-datasheet |
| adom-wiki page publish rejects "requires 3-8 sample prompts" | Missing --sample-prompt flags | Add 3–5 --sample-prompt flags (trigger phrases) |
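The two download rows above combine naturally into one guard: try curl, fall back to the wget pattern, and accept the file only if it starts with the %PDF- magic bytes. The sketch below is a hedged illustration. `is_pdf` and `fetch_pdf` are hypothetical helper names, and the 60-second curl timeout is an assumption, not a documented value.

```shell
# Sketch: download a datasheet and accept it only if it is really a PDF.
# is_pdf and fetch_pdf are hypothetical helpers; the 60 s curl timeout
# is an assumed value.
is_pdf() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

fetch_pdf() {
  local url="$1" out="$2"
  local ua="Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
  # First attempt: curl with a browser user agent.
  if curl -sL --max-time 60 -A "$ua" -o "$out" "$url" && is_pdf "$out"; then
    return 0
  fi
  # Second attempt: the wget pattern from the troubleshooting table.
  wget -q --tries=2 --timeout=30 --user-agent "$ua" -O "$out" "$url" && is_pdf "$out"
}
```

A non-zero return from fetch_pdf means neither tool produced a real PDF, so nothing should be posted to ds-extract.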

Dependencies

The ds-extract service handles all Python/OCR deps. On the user's container you only need:

  • curl (every container has this)
  • adom-wiki CLI (installed by node install.mjs)
  • ds-queue CLI (same)


🔎 How Claude finds this page (discovery snippet)

This page opts into Adom Wiki auto-discovery. When a user working in Claude Code mentions any of the trigger phrases below, Claude can proactively suggest this page. The pitch is exactly what Claude will say.

Pitch
"Convert manufacturer PDF datasheets into interactive wiki pages with stat cards, pin cards, and diagram galleries. Extracts text, images, and specs automatically."
Triggers
"parse a datasheet", "convert a datasheet", "download datasheet", "standardize a datasheet", "extract specs from datasheet", "show me the datasheet", "datasheet to wiki", "parse pdf datasheet", "import datasheet", "process datasheets", "batch parse datasheets", "datasheet parser"
