URL scraping

When you paste a product URL into the add-item dialog, GiftWrapt scrapes it for a title, description, photo, price, and currency. This page covers how that pipeline works, what knobs the admin has, and how to plug in your own scraper.

add-item dialog          /admin/scraping          /admin/ai
       │                        │                     │
       │ paste URL              │ tune knobs          │ flip AI toggles
       ▼                        ▼                     ▼
┌──────────────────────────────────────────────────────────┐
│ GET /api/scrape/stream (SSE, src/routes/api/scrape)      │
│   - auth + URL validate                                  │
│   - build provider chain via loadConfiguredProviders()   │
└──────────────────────────────────────────────────────────┘
                             │ orchestrate()
┌──────────────────────────────────────────────────────────┐
│ src/lib/scrapers/orchestrator.ts                         │
│ sequential chain (fall through on low score) ─┐          │
│ parallel racers (fire alongside)              ├──► score │
│ cache lookup / dedup                          │   then   │
│ per-attempt persistence to itemScrapes        │   pick   │
│ post-passes (clean-title, ...)               ─┘   winner │
└──────────────────────────────────────────────────────────┘
          │ ScrapeProvider interface (one per backend)
┌─────────┴────────┬─────────────────────────────────────────────┐
│ fetch-provider   │ scrapeProviders[] (admin-configurable)      │
│ (always on)      │ browserless, flaresolverr, browserbase,     │
│                  │ scrapfly, giftwrapt-scraper, custom-http,   │
│                  │ ai - in admin-controlled chain order        │
└──────────────────┴─────────────────────────────────────────────┘

Each configured entry has a tier (1-5). The orchestrator runs tier 1’s entries in parallel, merges their results, and advances to tier 2 only if the merged score falls below scrapeQualityThreshold; likewise for tier 2 → tier 3, and so on.

The always-on fetch-provider is implicit tier 0; it runs first, alone. The ai-provider is a parallel racer that fires alongside the tier loop and competes via final scoring.
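
In sketch form (helper names assumed; the real loop in src/lib/scrapers/orchestrator.ts also interleaves caching, persistence, and post-passes):

```ts
// Simplified tier loop. mergeResults / score / the interfaces are stand-ins
// for the real orchestrator internals, not its actual exports.
interface ScrapeResult { title?: string; imageUrls?: string[] }
interface ScrapeProvider { tier: number; fetch(url: string): Promise<ScrapeResult> }
declare function mergeResults(results: ScrapeResult[]): ScrapeResult;
declare function score(result: ScrapeResult): number;

async function runTiers(providers: ScrapeProvider[], url: string, threshold: number) {
  const tiers = [...new Set(providers.map((p) => p.tier))].sort((a, b) => a - b);
  let best: ScrapeResult | null = null;
  for (const tier of tiers) {
    // Every entry in the tier fires in parallel; failures simply drop out.
    const settled = await Promise.allSettled(
      providers.filter((p) => p.tier === tier).map((p) => p.fetch(url)),
    );
    const results = settled
      .filter((s): s is PromiseFulfilledResult<ScrapeResult> => s.status === "fulfilled")
      .map((s) => s.value);
    if (results.length === 0) continue;
    const merged = mergeResults(results); // fill-the-gaps merge, described below
    if (score(merged) >= threshold) return merged; // later tiers never fire
    if (best === null || score(merged) > score(best)) best = merged;
  }
  return best; // fell through every tier; parallel racers still compete on final score
}
```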

Why tiers: cost control. Put cheap stuff in tier 1 (browserless, custom HTTP) and paid hosted services in tier 2 / 3. Pages where tier 1 already gets a usable result never spend money on Browserbase or ScrapFly.

Default tier on new entries:

  • browserless, flaresolverr, custom-http, giftwrapt-scraper → tier 1
  • browserbase-fetch, scrapfly → tier 2
  • browserbase-stagehand → tier 3

The admin tunes tiers from /admin/scraping via the Tier dropdown on each card.

When a tier has multiple entries that all succeed, the orchestrator does fill-the-gaps merging before scoring:

  • Sort contributions by per-provider score (descending). The highest-scoring result is the “base.”
  • For each scalar field (title, description, price, currency, siteName, finalUrl) that the base left empty, fill from runners-up in score order. First non-empty wins.
  • imageUrls always concatenates and dedupes across all contributors.
  • Re-score the merged result. If it now clears scrapeQualityThreshold, the tier wins and later tiers don’t fire.

The merged result is persisted with scraperId: 'merged:a,b,c' so the admin can trace which providers contributed. /admin/scrapes renders that as Provider A + Provider B (merged).

Implementation: src/lib/scrapers/merge.ts. Mirrors the same priority-fill pattern the local extractor uses across its OG / JSON-LD / Microdata / Heuristics layers.
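
In sketch form (field types assumed; price, for instance, may well be a number in the real schema):

```ts
// Fill-the-gaps merge over the documented scalar fields.
const SCALAR_FIELDS = ["title", "description", "price", "currency", "siteName", "finalUrl"] as const;

interface ScrapeResult {
  title?: string; description?: string; price?: string; currency?: string;
  siteName?: string; finalUrl?: string; imageUrls?: string[];
}
declare function score(result: ScrapeResult): number; // per-provider quality score

function mergeResults(results: ScrapeResult[]): ScrapeResult {
  // Sort contributions by per-provider score, descending; the best is the base.
  const sorted = [...results].sort((a, b) => score(b) - score(a));
  const merged: ScrapeResult = { ...sorted[0] };
  for (const field of SCALAR_FIELDS) {
    if (merged[field]) continue;
    // For each gap the base left, the first non-empty runner-up value wins.
    merged[field] = sorted.slice(1).map((r) => r[field]).find((v) => v);
  }
  // imageUrls always concatenates and dedupes across all contributors.
  merged.imageUrls = [...new Set(sorted.flatMap((r) => r.imageUrls ?? []))];
  return merged;
}
```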

Configured in /admin/scraping (DB-backed appSettings.scrapeProviders, with secret fields encrypted at rest). The admin can add multiple entries of the same type (e.g. two Browserless instances) and drag to reorder within a tier; the Tier dropdown moves an entry between tiers.

type                   mode        secret fields   best for
browserless            sequential  token           self-hosted JS rendering
flaresolverr           sequential  -               self-hosted Cloudflare bypass
browserbase-fetch      sequential  apiKey          hosted JS rendering, no LLM
browserbase-stagehand  parallel    apiKey          hosted structured extraction with LLM
scrapfly               sequential  apiKey          hosted scraping with optional anti-bot / JS
giftwrapt-scraper      sequential  token           self-hosted facade chaining several backends
custom-http            sequential  customHeaders   bring-your-own scraper service
ai-provider            parallel    -               extracts directly from raw HTML (admin AI config)

Plus the always-on built-in:

id              mode        gates on
fetch-provider  sequential  -

Default chain order: fetch → [scrapeProviders in admin-controlled order] → ai.

When to add what:

  • Vanilla page, OG tags, no JS gating → fetch-provider is enough.
  • Empty body unless JS runs (Shopify-on-React, custom SPAs) → add a browserless entry (self-host) or browserbase-fetch (hosted).
  • Cloudflare challenge / Turnstile in front of the page → add a flaresolverr entry (self-host).
  • Bot-detection-heavy sites (Amazon, retailers with aggressive WAFs) → scrapfly with asp=true, or giftwrapt-scraper if you self-host the facade.
  • You already operate a scraper somewhere → custom-http in /admin/scraping. JSON mode validates against ScrapeResult; HTML mode goes through the local extractor. (Example JSON response after this list.)
  • The page is consistently weird, the chain keeps falling through, and even the cheap ai-provider (which extracts from raw HTML) misses → add a browserbase-stagehand entry; it spins up a real browser session and uses Stagehand’s extract() to pull a structured ScrapeResult. Slow and LLM-billable; runs in parallel.
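
For JSON mode, a response might look like this (values illustrative; the authoritative ScrapeResult schema lives in the codebase):

```ts
// Illustrative custom-http JSON-mode response, validated against ScrapeResult.
const exampleResponse = {
  title: "Walnut Chess Set",
  description: "Hand-finished walnut and maple tournament board.",
  price: "129.00",
  currency: "EUR",
  siteName: "example-shop.test",
  finalUrl: "https://example-shop.test/products/chess-set",
  imageUrls: ["https://example-shop.test/img/chess-1.jpg"],
};
```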
The scraping-related appSettings keys:

key                         default  what
scrapeProviderTimeoutMs     10000    per-provider HTTP budget
scrapeOverallTimeoutMs      20000    overall scrape budget incl. parallel racers
scrapeQualityThreshold      3        score needed to short-circuit the chain
scrapeCacheTtlHours         24       URL-based dedup cache TTL (0 = disable)
scrapeProviders             []       discriminated array of typed entries
scrapeAiProviderEnabled     false    flips the parallel AI scraper on
scrapeAiCleanTitlesEnabled  false    post-pass: AI normalises the winning title
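
As stored, a scrapeProviders entry might look roughly like this (only type and tier are confirmed field names here; the rest are illustrative):

```ts
// Two illustrative entries; secret values are encrypted at rest (see below).
const scrapeProviders = [
  { type: "browserless", tier: 1, url: "http://browserless:3000", token: "<encrypted>" },
  { type: "scrapfly", tier: 2, apiKey: "<encrypted>", asp: true },
];
```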

The AI client config (provider type / model / api key) lives next to the scraping AI toggles under /admin/ai; both AI features read from it.

On first server start after upgrading, if no entry of type browserless exists yet AND BROWSERLESS_URL is set, the bootstrap inserts a corresponding entry. Same for FLARESOLVERR_URL. After that, the admin owns the configuration and the env vars are unused. New deploys should configure everything via /admin/scraping.

# Optional first-boot seed values; do NOT add new vars here for new deploys.
BROWSERLESS_URL=...
BROWSER_TOKEN=...
FLARESOLVERR_URL=...

Authenticated users supply the URL, so the scraper has to refuse to fetch addresses inside our own infrastructure (classic SSRF territory). Both the always-on fetch-provider and the AI provider go through safeFetch (sketched after this list), which:

  • rejects non-http(s) schemes,
  • DNS-resolves the hostname (dns.lookup with all: true) and rejects any address in a private, loopback, link-local, CGNAT, multicast, or reserved range (full list lives in the module),
  • walks redirects manually (redirect: 'manual') and re-runs the same check on every hop, so a public host that 30x’s to 127.0.0.1 is rejected at the redirect rather than followed,
  • caps the redirect chain (default 5),
  • sets credentials: 'omit'.
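
A condensed sketch of that flow (the range check here is abbreviated; the real module keeps the full list):

```ts
import { lookup } from "node:dns/promises";
import { isIPv4 } from "node:net";

// Abbreviated private-range check; the real list also covers multicast,
// reserved ranges, and more IPv6 cases.
function isPrivate(ip: string): boolean {
  if (!isIPv4(ip)) return ip === "::1" || ip.startsWith("fe80:") || ip.startsWith("fd") || ip.startsWith("fc");
  return (
    /^(10|127)\./.test(ip) ||                            // RFC 1918 + loopback
    /^192\.168\./.test(ip) ||
    /^172\.(1[6-9]|2\d|3[01])\./.test(ip) ||
    /^169\.254\./.test(ip) ||                            // link-local
    /^100\.(6[4-9]|[7-9]\d|1[01]\d|12[0-7])\./.test(ip)  // CGNAT 100.64.0.0/10
  );
}

async function assertPublic(url: URL): Promise<void> {
  if (url.protocol !== "http:" && url.protocol !== "https:") throw new Error("scheme rejected");
  for (const { address } of await lookup(url.hostname, { all: true })) {
    if (isPrivate(address)) throw new Error(`blocked address: ${address}`);
  }
}

export async function safeFetch(input: string, maxRedirects = 5): Promise<Response> {
  let url = new URL(input);
  for (let hop = 0; hop <= maxRedirects; hop++) {
    await assertPublic(url); // re-run on every redirect hop
    const res = await fetch(url, { redirect: "manual", credentials: "omit" });
    if (res.status < 300 || res.status >= 400) return res;
    const location = res.headers.get("location");
    if (!location) return res;
    url = new URL(location, url); // resolve relative redirects against the current hop
  }
  throw new Error("redirect chain too long");
}
```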

Hosted providers (Browserbase, ScrapFly, browserless, flaresolverr, giftwrapt-scraper, custom-http) make their own outbound calls from outside our infra, so they don’t need the same hostname check on the user-supplied URL - their upstream services apply their own rules. They DO go through admin-configured endpoints, which the orchestrator does not validate against this list.

Secret fields on scrapeProviders entries are AES-256-GCM-encrypted in app_settings JSONB using the same envelope helpers (encryptAppSecret / decryptAppSecret) that already protect the AI and Resend API keys. The master key is derived via scrypt from BETTER_AUTH_SECRET. New writes encrypt; reads accept either an envelope or legacy plaintext, so an upgrade is a no-op until the admin next saves an entry.
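
Roughly (envelope details like the prefix and salt are assumptions; the real helpers are encryptAppSecret / decryptAppSecret):

```ts
// AES-256-GCM envelope with a scrypt-derived master key - illustrative shape.
import { scryptSync, randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

const key = scryptSync(process.env.BETTER_AUTH_SECRET!, "app-secret", 32); // salt assumed

function encryptAppSecret(plain: string): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  // Envelope: prefix + iv + auth tag + ciphertext, base64-encoded.
  return `enc:v1:${Buffer.concat([iv, cipher.getAuthTag(), body]).toString("base64")}`;
}

function decryptAppSecret(stored: string): string {
  if (!stored.startsWith("enc:v1:")) return stored; // legacy plaintext passes through
  const raw = Buffer.from(stored.slice(7), "base64");
  const decipher = createDecipheriv("aes-256-gcm", key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString("utf8");
}
```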

@browserbasehq/stagehand is declared under optionalDependencies in package.json. Default pnpm install pulls it in; deploys that don’t need Stagehand can pnpm install --no-optional to skip it (and skip the transitive playwright-core it pulls in). The provider also dynamically imports the SDK only when its fetch() method runs, so configurations where no Stagehand entry is enabled never load the heavy module.
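
Inside the provider, that looks roughly like:

```ts
// The SDK (and its transitive playwright-core) is only loaded when a
// Stagehand entry is enabled and its fetch() actually runs.
async function fetchWithStagehand(url: string) {
  const { Stagehand } = await import("@browserbasehq/stagehand");
  // ...spin up a browser session and use extract() to pull a structured ScrapeResult
}
```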

Deployment scenarios:

scenario                                    working chain                                  notes
Vercel, no extras                           fetch-provider only                            Best-effort. Hard sites return a clean failure so the form prompts for manual entry.
Vercel + Browserbase Fetch API              fetch → browserbase-fetch                      Hosted JS rendering; paste the API key in /admin/scraping. Cheapest hosted upgrade.
Vercel + ScrapFly                           fetch → scrapfly                               Hosted with anti-bot bypass via asp=true. Pay per credit.
Vercel + Browserbase Fetch + Stagehand      fetch → browserbase-fetch → ai | stagehand     Stagehand races in parallel; wins when the cheap chain returns thin data.
Vercel + custom HTTP scraper                fetch → custom-http                            Cheapest way to get JS rendering on Vercel without paying a hosted vendor.
Self-hosted (containers)                    full chain incl. browserless + flaresolverr    Run them as sidecar containers behind the same reverse proxy.
AI extraction on top of any of the above    adds ai-provider as a parallel racer           Off by default. Costs money.
To add a new provider type:

  1. Add a discriminated variant to scrapeProviderEntrySchema in src/lib/settings.ts. Mark any secret fields with appSecretField().
  2. If a new secret field type is added, also include it in encryptScrapeProviderSecrets so the write path encrypts it.
  3. Create src/lib/scrapers/providers/<id>.ts exporting a create<Id>Provider(entry) factory (skeleton after this list). kind: 'html' plus the shared extractor is the cheapest path; only go kind: 'structured' when the upstream service genuinely returns structured fields you trust.
  4. Choose mode: 'sequential' unless the provider is independent enough to race (ai-provider and browserbase-stagehand are parallel).
  5. isAvailable() should be a cheap entry-level check - the chain filters before fetching.
  6. Wire the new factory into loadConfiguredProviders()’s switch in src/lib/scrapers/providers/load-configured.ts.
  7. Add a card component in src/components/admin/scraper-providers-form-view.tsx for the type, plus a default-entry shape in makeDefaultEntry().
  8. Add a fixture test under __tests__/ covering the success path plus each error code your provider can throw.
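
A skeleton for step 3 (the ScrapeProvider fields and the ScrapeProviderError import path/signature are inferred from this page, not copied from the codebase):

```ts
// src/lib/scrapers/providers/my-service.ts - illustrative only.
import { ScrapeProviderError } from "../errors"; // path and constructor shape assumed

export function createMyServiceProvider(entry: { token: string }) {
  return {
    id: "my-service",
    mode: "sequential" as const,
    kind: "html" as const, // raw HTML goes through the shared local extractor
    // Cheap entry-level check; the chain filters on this before fetching.
    isAvailable: () => Boolean(entry.token),
    async fetch(url: string, signal?: AbortSignal) {
      const res = await fetch("https://scraper.internal.example/render", {
        method: "POST",
        headers: { "content-type": "application/json", authorization: `Bearer ${entry.token}` },
        body: JSON.stringify({ url }),
        signal,
      });
      // Classify failures into the documented error enum.
      if (res.status === 403) throw new ScrapeProviderError("bot_block");
      if (!res.ok) throw new ScrapeProviderError(res.status < 500 ? "http_4xx" : "http_5xx");
      return { html: await res.text() };
    },
  };
}
```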
Other things to know:

  • Errors classify into a small enum (bot_block, http_4xx, http_5xx, network_error, timeout, invalid_response, config_missing, unknown). Use ScrapeProviderError so the orchestrator surfaces the right wire code in the streaming UX.
  • The cache short-circuits before any provider runs. force: true on scrapeUrl({...}) (or ?force=true on the SSE route) bypasses it.
  • Don’t clobber user input. The add-item form tracks per-field “user touched” refs and skips prefill when the user has typed in a field.
  • Per-provider timeouts (10s default) are independent of the overall budget (20s default). Both are tunable in /admin/scraping.
  • The orchestrator emits a single SSE event stream; the client useScrapeUrl reduces it into a 5-state machine. If you add a new event type, add it to StreamEvent in types.ts and to the reducer in use-scrape-url.ts.
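
For orientation, the shape is roughly this (event and state names here are guesses; the real ones live in types.ts and use-scrape-url.ts):

```ts
// Illustrative only - not the actual union or state names.
type StreamEvent =
  | { type: "attempt"; providerId: string }
  | { type: "result"; providerId: string; score: number }
  | { type: "error"; providerId: string; code: string }
  | { type: "done"; winnerId?: string };

type ScrapeState = "idle" | "scraping" | "partial" | "success" | "failed"; // the 5-state machine
```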