The ingest side of the content engine: a curated creator list in Airtable, no-cookie Apify scrapers pulling every proven post, reel and video into one corpus — and a benchmark teardown that turns that corpus into the quality bar your own drafts must clear before they ship.
Every drafting system is only as good as its reference set. This engine keeps a living corpus of proven content from ~44 creators across YouTube, LinkedIn and Instagram — engagement numbers attached — so the content engine (cookbook 05) can judge every draft against what demonstrably travels, not against vibes.
~35 Make.com scenarios plus an n8n workflow did the scraping. The LinkedIn path needed a session cookie hard-coded into one scenario — it expired in December 2024 and the LinkedIn side silently died. Nobody noticed for months.
One local Node script with a per-channel map, three no-cookie Apify actors, a webhook fired by a single Airtable checkbox. First run backfills 6 months, every later run is incremental and costs ≈ $0. Verified end-to-end: posts upserted, profile refreshed, checkbox auto-cleared.
Base __AIRTABLE_BASE__ holds everything. The same three tables the ingest script writes are the ones the content engine mines — no export step, no second copy.
| Table | What lives there | Key fields |
|---|---|---|
| Users | The curated creator list — one row per creator per channel. Your own profiles sit in the same list, so your content lands in the same corpus. | Username (handle/slug), Channel, Run Now, Nickname, Signature, Followers, Friends, Platform ID, ProfileLink, BioLink, Profile Pic, Total Videos, Total Likes, Run ID, Last Modified |
| Videos | YouTube videos + Instagram reels, transcript attached | ID, User (link), Channel, Caption, Type, Views, Likes, Comments, Shares, Saves, Duration, Height/Width, Cover, SRT, Video Created Time, Link + workflow checkboxes (Generate Script, Repurpose, Used, Generate SRT, Delist) |
| Text Posts | LinkedIn posts | ID, User (link), Channel, Caption, Likes, Comments, Shares, Cover, Media, Post Created Time, Link, Delisted |
Every record carries a stable ID — the YouTube video id, the LinkedIn activity URN, the Instagram shortcode. All writes go through Airtable's performUpsert (merge on ID, typecast on, chunks of 10), so re-pulling an overlap window is free: existing records update in place, nothing duplicates.
Engagement-ratio formulas (Like+Share, LikeView, LikeComment) live in the base. The consuming skill sorts videos by Like+Share and text posts by raw Likes. No saved Airtable views are required — the quality floor and top-10%-per-creator filters are applied in the consumer code on every pull.
Every run refreshes the creator row: follower counts, bio, profile link, display name — and unchecks Run Now when done. The Run ID field stores the last Apify run for traceability.
The token lives in a gitignored file (chmod 600). It needs the write scope (data.records:write) on the base — a read-only token works for the drafting side and then silently fails the moment ingest tries to upsert. Scope it right on day one.
The cutoff is computed per creator in the main loop, so every channel behaves the same — only how each actor reaches the cutoff differs.
Nothing stored yet for this creator? Pull the full lookback window — default 6 months (--months N to change it).
Read the newest stored post for that creator (one cheap Airtable query, scoped by Channel), fetch only newer items with a 2-day overlap. The overlap re-pulls a few records — the ID-based upsert absorbs them. Incremental cost: ≈ $0.
Whatever the actor returns gets filtered against the cutoff again locally (lenient: undated items are kept). Server filters where the actor supports them, local filter always.
| Channel | Apify actor (public marketplace) | Cookie? | Lands in |
|---|---|---|---|
| YouTube | streamers/youtube-scraper | no | Videos |
apimaestro/linkedin-profile-posts + apimaestro/linkedin-profile-detail (enrich) | no — cookie-free | Text Posts | |
apify/instagram-reel-scraper + apify/instagram-profile-scraper (enrich) | no | Videos |
The actor takes a server-side oldestPostDate, so the cutoff passes straight through. Sort newest-first, download subtitles as SRT (English), keep only watch?v= items.
Store the thumbnail only, never the video file — Airtable bloats fast and the transcript already lives in the SRT field (capped at 100k chars). Channel stats from the first item refresh the profile: subscribers, description, totals.
The old actor needed a li_at session cookie. The replacement takes a profile slug and returns posts — no session, nothing to expire. Reactions → Likes, reposts → Shares; LinkedIn exposes no view counts (stored 0).
No server-side date filter exists (only total_posts), so a cheap 12-post probe measures posting cadence, then the real pull is sized to reach the cutoff plus 30% headroom, capped at 400. Follower/connection counts come from one extra profile-detail call — run on every ingest so counts stay current.
The reel scraper returns includeTranscript (→ the SRT field) and includeSharesCount natively, and onlyPostsNewerThan is a server-side date floor — same incremental model as YouTube. Reels map into the Videos table: caption, link, first frame as cover, likes/comments/shares/views, duration, dimensions.
Instagram doesn't expose saves (stored 0). Profile pic + followers + bio come from the separate profile scraper in the enrich step.
One entry in the CHANNELS map: { table, dateField, getTs, fetch, map, profile }. The backfill/incremental cutoff is handled centrally — a new channel only describes how to fetch and how to map fields.
Actors already available on the marketplace for the next ones: clockworks/tiktok-profile-scraper, apidojo/tweet-scraper.
Apify Starter plan: $29/mo in credits — the actors are pay-per-result; a 5-post test run cost ≈ $0.00.
Adding a creator or refreshing one is a single checkbox in Airtable. Everything downstream is automatic.
An Airtable automation (trigger: Run Now is checked) runs a one-line script step that fetches your endpoint with the triggering record id: https://__YOUR_MACHINE__.ts.net/trigger?token=__WEBHOOK_SECRET__&recordId=... The endpoint is the local listener exposed via Tailscale Funnel — no cloud server, your own machine answers.
A tiny HTTP listener on localhost validates the token, responds 200 immediately (so Airtable's webhook never times out on a slow scrape), then spawns the ingest. Single-flight with coalescing: webhooks arriving mid-run queue exactly one more drain — no overlap, no missed checkbox. A launchd agent keeps it alive across reboots.
The listener passes --record <id>, so ingest fetches that one row directly instead of querying {Run Now}=1 — dodging Airtable's formula re-index lag that would otherwise miss a just-checked row. It runs the actor, upserts, refreshes the profile, unchecks the box.
| Flag | What it does |
|---|---|
node ingest.mjs | Drain every creator with Run Now checked (manual run) |
--record <id> | Process one exact record — the webhook path |
--handle NAME | Only that username |
--months N / --since YYYY-MM-DD | Change the backfill window / set an explicit cutoff |
--backfill | Force the full window, ignore stored data |
--limit N / --dry-run | Hard-cap items per creator / fetch + map but write nothing |
--all | Ignore the checkbox, run every creator — full re-syncs (use with care, cost). Skips creators refreshed <12h ago so an interrupted sweep resumes instead of re-probing |
A webhook that fires while the machine sleeps is lost — Airtable doesn't retry. But the design is idempotent: the box stays checked, so a timed poller (launchd, atomic lock dir against overlapping runs, logs only when something happened) drains any missed rows later. Idle cost ≈ 0: one cheap Airtable GET and exit.
Tailscale Funnel persists across reboots once armed: tailscale funnel --bg 8787. Requires Funnel + HTTPS certificates enabled once in the Tailscale admin console. Health check: curl /health on the public URL.
Scraping is the cheap half. The value is the teardown: rank every creator's content by engagement, deconstruct why the winners won into a frameworks file, and make that file the gate your own drafts must clear. The current teardown sits on a full re-sync of ~44 creators: 273 + 288 + 178 LinkedIn posts from the three benchmark creators and 1,855 YouTube videos.
Calm authority, mindset reframes, heavy line breaks, no hype. Three repeatable shapes: Aspiration → Tension → Reframe → Resolution, the Identity Gap ("you feel like a mess, they think you're a machine"), and the Data-Drop Listicle (specificity buys the right to teach).
Adapt, don't copy: he funnels every post into his newsletter — the engine softens that to an occasional CTA, never every time.
The richer 288-post pull shows five shapes, not two: the spicy leadership reframe with a 3-tier consequence breakdown, the tactical System with a before/after block ("Email time: 2+ hrs → 15 min"), the short inspirational punch, the vulnerable origin story, and the contrarian-credential take.
Hard-avoid: his "-DM" signature, stacked P.S. lead magnets, and one-word "agree?" bait — engagement-farming, off-register.
Worth stealing: the anti-trends proof story (specific outcome → the contrarian moves that produced it), hyper-specific never-rounded numbers ("$80m", "31,085,072 views"), and ICP-named targeting (he names the reader).
The trap: his highest-engagement posts are comment-gated lead-gen ("comment X and I'll DM you"). That's a distribution hack, not a content style — hard reject.
Rank by Like+Share with a ≥2,000-view floor — raw ratio alone surfaces tiny-view noise. Nine observed hook types; the single biggest validated winner (494K views / 40K likes) is the time-boxed transformation ("Give me 51 seconds and…").
A passing script: hook lands in ≤5 seconds, carries a specific number, opens in the first sentence — never an "in this video…" preamble.
| Creator move | Verdict for your own drafts |
|---|---|
| Tension → Reframe spine (Welsh) | Adopt — same register |
| Tactical System + before/after (Martell) | Adopt — the core lane |
| Anti-trends proof story (Latka) | Adopt |
| Extreme, never-rounded specificity (Latka) | Already doing — keep maxing |
| Newsletter funnel on every post (Welsh) | Soften — don't funnel every time |
| Comment-gating / "like + comment + reshare" / DM-bait | Reject — engagement-farming, salesy |
| "-DM" signature + stacked P.S. lead magnets | Reject |
(1) opens with a ≤12-word hook in one of the proven patterns, (2) uses either a Tension→Reframe spine OR a concrete step-System with a before/after, (3) carries at least one exact number, (4) keeps the line-break rhythm, (5) closes with a standalone punch or ONE relevant CTA — never comment-bait. If a draft is weaker than the median top-post of the three creators for the same pillar, it gets rewritten. The median is the floor, not the ceiling.
Your own channels sit in the same Users table, so your videos land in the same Videos table and rank under the same formulas. At draft time, the content engine's QA agent (cookbook 05) scores every new draft against this frameworks file plus benchmark anchor posts — rework until it clears, never ship below the bar.
Long-term: once enough of your own posts clear the engagement threshold, your winners replace the inspiration creators as the top anchors. Their corpus seeds the bar; your real outcomes take it over.
Ranking method: LinkedIn by engagement (Likes + 3×Shares + Comments), YouTube by Like+Share over the ≥2,000-view floor. Refresh the whole corpus with node ingest.mjs --all, then re-run the teardown whenever the creator set changes materially — the last refresh made the dataset 2-3× richer and surfaced two new Martell shapes the first pass missed.
Wall: the old LinkedIn actor needed a li_at session cookie hard-coded into the automation. It expired — and the LinkedIn ingest was silently dead for over a year.
Fix: switch to a no-cookie actor that takes a handle and returns posts. No session, nothing to expire, nothing to babysit.
Wall: the existing Airtable token only ever read (the drafting side never needed more), so the first ingest run failed on write.
Fix: mint a token with data.records:write on the base. Read alone is not enough for an upserting pipeline.
Wall: handing Airtable a profile-image URL silently drops the attachment — the CDN blocks datacenter IPs.
Fix: download the image locally (residential IP works) and upload the bytes via Airtable's content API. Clear the existing pic first so re-runs don't append duplicates.
Wall: querying filterByFormula={Run Now}=1 right after the webhook misses the just-checked row — Airtable's formula index lags.
Fix: pass the triggering record id through the webhook and fetch that exact record directly.
Wall: the actor only accepts total_posts — pagination parameters don't advance, so you can't ask for "posts since March".
Fix: a cheap 12-post probe measures the creator's posting cadence, then the real pull is sized to reach the cutoff + headroom (capped at 400). A near-daily poster's first 6-month backfill genuinely needs ~200 posts; incrementals after that are tiny.
Wall: Instagram's includeTranscript is compute-heavy — ~$0.05–0.33 per reel, scaling with video length. Backfilling a reel-heavy creator adds up.
Fix: backfill consciously, then lean on the server-side date floor for cheap incrementals.
Wall: a webhook that fires while the machine is asleep is lost — Airtable doesn't retry.
Fix: idempotent design. The checkbox stays checked, so a timed safety sweep (lock-dir guarded) drains any missed rows on the next tick.
Wall: YouTube and Instagram share the Videos table — a same-named creator on both channels would inherit the wrong "newest stored post" cutoff.
Fix: scope the latest-post query by Channel as well as by creator.
Copy the bootstrap promptThe button below puts it on your clipboard.
Paste it into Claude CodeWith an Airtable account and an Apify account ready (Starter credits cover it).
Answer its questionsIt builds the base, then the ingest script, then runs the benchmark teardown on your first corpus.
Incremental runs cost ≈ $0 — the spend is the first backfill.