An automated PR outreach scraper can find hundreds of relevant podcasts, radio shows, bloggers, and reviewers — complete with contact emails — in minutes instead of days. We built exactly this for a metal and rock music client: a four-phase Python pipeline that searches across 10+ European languages, crawls the results with a headless browser, and extracts scored, deduplicated contacts ready for outreach.

Independent artists and labels burn enormous time on manual PR discovery. Google one query, scan results, copy emails, repeat. Multiply that across podcasts, blogs, radio stations, and a dozen countries — it becomes a full-time job. This pipeline replaces that grind with a single script run.

Why manual PR contact discovery doesn't scale for music promotion

A typical metal or rock artist targeting European press needs to find contacts across the UK, Germany, Scandinavia, France, Italy, Spain, Poland, and more. Each country has local-language blogs, radio shows, and podcasts that never appear in English Google results.

Manually searching "metal podcast Sweden" then "metal podcast Norway" then "heavy metal blog Deutschland" is tedious and incomplete. You miss results because you don't know the right local keywords, and you waste time visiting sites that haven't posted since 2019.

The pipeline we built solves both problems: it searches in the right languages, across the right country indexes, and scores results by relevance and recency automatically.

How the multi-language search phase works

The first phase fires 20 carefully crafted search queries through the Serper.dev Google Search API, each targeting a specific country and language combination. Queries cover English plus Swedish, Norwegian, German, Danish, Dutch, French, Finnish, Italian, Spanish, and Polish — using country-specific Google indexes via the gl= parameter.

All 20 queries run concurrently using Python's ThreadPoolExecutor with 10 workers, fetching 2 pages of results per query. The entire search phase completes in seconds rather than the minutes it would take sequentially.

# Concurrent search across multiple country/language combos
from concurrent.futures import ThreadPoolExecutor, as_completed

results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {
        executor.submit(search_serper, query, country_code, lang): query
        for query, country_code, lang in query_matrix
    }
    for future in as_completed(futures):
        results.extend(future.result())

This approach surfaces results that a single English-language Google search would never return — local Finnish metal podcasts, German rock review blogs, Scandinavian radio shows with submission forms.
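A minimal `search_serper` along these lines is sketched below. The endpoint, `X-API-KEY` header, and `gl`/`hl` payload fields follow Serper.dev's documented API; the query matrix entries shown are illustrative, not the client's actual 20-query list:

```python
import requests

SERPER_URL = "https://google.serper.dev/search"

def build_payload(query, country_code, lang, page=1):
    """gl= selects the country-specific Google index, hl= the UI language."""
    return {"q": query, "gl": country_code, "hl": lang, "page": page}

def search_serper(query, country_code, lang, api_key="YOUR_KEY", page=1):
    """POST one query to Serper.dev and return (url, title) pairs."""
    resp = requests.post(
        SERPER_URL,
        json=build_payload(query, country_code, lang, page),
        headers={"X-API-KEY": api_key},
        timeout=15,
    )
    resp.raise_for_status()
    # "organic" holds the standard blue-link results
    return [(r["link"], r.get("title", "")) for r in resp.json().get("organic", [])]

# Illustrative slice of the query matrix: (query, country code, language)
query_matrix = [
    ("metal podcast intervju", "se", "sv"),
    ("heavy metal blog rezension", "de", "de"),
    ("rock radio émission", "fr", "fr"),
]
```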

How filtering and deduplication clean the raw results

Raw search results include a lot of noise: Spotify pages, YouTube channels, Instagram profiles, Reddit threads. The filter phase strips all social and streaming platform URLs using a domain blocklist, keeping only actual websites where contact information might exist.

Because the same popular metal blog (like a well-known angry metal review site) appears in results for multiple country queries, deduplication by root domain is essential. Without it, the crawler would waste time hitting the same site five times from five different query results.

After filtering and dedup, a typical run reduces 400+ raw URLs down to 80-150 unique, relevant domains worth crawling.
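A sketch of the filter-and-dedup step, assuming a simple domain blocklist (the blocked domains listed are examples, not the full production list, and the root-domain extraction is deliberately crude):

```python
from urllib.parse import urlparse

BLOCKED = {"spotify.com", "youtube.com", "instagram.com",
           "reddit.com", "facebook.com", "tiktok.com"}

def root_domain(url):
    """Crude root-domain extraction: last two labels of the hostname.
    (Imperfect for ccTLDs like .co.uk, but good enough for dedup.)"""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    return ".".join(host.split(".")[-2:])

def filter_and_dedup(urls):
    """Drop blocklisted platforms and keep one URL per root domain."""
    seen, keep = set(), []
    for url in urls:
        dom = root_domain(url)
        if dom in BLOCKED or dom in seen:
            continue
        seen.add(dom)
        keep.append(url)
    return keep
```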

Why the async crawl orchestration was the hardest engineering problem

The crawl phase submits filtered URLs to an internal headless crawler service for full JavaScript rendering. The crawler is asynchronous — you submit a URL and get back a job ID, not immediate content. This meant the pipeline needed a batch-submit and poll loop to orchestrate the work.

URLs are batched in groups of 5 and submitted to the crawler. After each batch submission, the pipeline polls every 10 seconds, checking job statuses until all tasks complete or a 300-second deadline passes. Failed or timed-out tasks degrade gracefully — the pipeline moves on with whatever succeeded.

# Batch submit + poll pattern (synchronous, no async/await)
import time

for batch in chunked(urls, batch_size=5):
    job_ids = [submit_crawl(url) for url in batch]
    deadline = time.time() + 300  # 300-second hard deadline per batch

    while time.time() < deadline:
        statuses = [check_status(jid) for jid in job_ids]
        if all(s in ("complete", "failed") for s in statuses):
            break
        time.sleep(10)  # poll every 10 seconds

    completed = collect_results(job_ids)  # keeps whatever succeeded

Building this synchronously with requests and time.sleep polling — rather than async/await — was a deliberate choice. The script needed to be simple, runnable anywhere, and easy to debug. No event loop complexity, no aiohttp dependency chains.
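The `chunked` helper used in the poll loop isn't stdlib (Python 3.12's `itertools.batched` is the built-in equivalent); a minimal version:

```python
def chunked(items, batch_size):
    """Yield consecutive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```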

How email extraction and relevance scoring produce usable contacts

Once crawled HTML is collected, the extraction phase runs regex-based email discovery with junk filtering. Common false positives — addresses like wix.com, wordpress.com, or example@email.com — are stripped automatically.
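A sketch of the regex extraction with junk filtering. The pattern and the blocked domains are representative assumptions, not the exact production list:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
JUNK_DOMAINS = {"wix.com", "wordpress.com", "example.com", "sentry.io"}

def extract_emails(html):
    """Pull email-shaped strings from page text, dropping known junk."""
    found = set(EMAIL_RE.findall(html))
    return sorted(
        e for e in found
        if e.split("@")[1].lower() not in JUNK_DOMAINS
        and not e.lower().startswith("example@")
    )
```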

Each result gets a relevance score from 0 to 10 based on keyword density analysis. The scoring function checks crawled page text for genre-relevant terms (metal, rock, glam, sleaze, heavy, thrash, etc.) and weights them against the total content length.
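A hedged sketch of the keyword-density scoring. The term list comes from the genres above, but the density-to-score scale (50 hits per 1,000 words maps to 10) is an illustrative assumption:

```python
GENRE_TERMS = ("metal", "rock", "glam", "sleaze", "heavy", "thrash")

def relevance_score(text, max_score=10):
    """Score 0-10 by genre-keyword hits per 1,000 words of page text."""
    words = text.lower().split()
    if not words:
        return 0
    hits = sum(1 for w in words if any(t in w for t in GENRE_TERMS))
    density = hits / len(words) * 1000  # hits per 1,000 words
    return min(max_score, round(density / 5))  # assumed scale
```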

An active signal detector flags whether the page mentions 2024 or 2025 dates, indicating the site is still actively publishing. A metal blog that last posted in 2021 scores lower than one reviewing albums from last month.

Content type detection classifies each result as podcast, radio, blog, or review site based on page text patterns — so the final CSV can be filtered by outreach channel.
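The recency flag and channel classifier can be sketched as simple pattern checks. The keyword-to-type mapping below is an assumption about what the text patterns look like, not the production rules:

```python
import re

TYPE_PATTERNS = [
    ("podcast", re.compile(r"\b(podcast|episode|listen on)\b", re.I)),
    ("radio",   re.compile(r"\b(radio|airplay|on air)\b", re.I)),
    ("review",  re.compile(r"\b(review|rating)\b", re.I)),
]

def is_active(text):
    """Flag pages that mention recent years as still publishing."""
    return bool(re.search(r"\b(2024|2025)\b", text))

def content_type(text):
    """Classify by first matching pattern; default to blog."""
    for label, pattern in TYPE_PATTERNS:
        if pattern.search(text):
            return label
    return "blog"
```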

What the output looks like in practice

The pipeline produces a clean CSV with columns for: website URL, page title, detected content type, country hint, extracted emails, relevance score (0-10), active signal (yes/no), and source query. A typical run surfaces 60-100 scored contacts.
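Writing the output is plain `csv.DictWriter` work; the column names below mirror the list above (the exact header strings are assumptions), with rows pre-sorted by score so the best contacts sit at the top of the file:

```python
import csv

COLUMNS = ["url", "title", "content_type", "country", "emails",
           "score", "active", "source_query"]

def write_contacts(path, rows):
    """rows: list of dicts keyed by COLUMNS; multiple emails joined with ';'."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for row in sorted(rows, key=lambda r: r["score"], reverse=True):
            writer.writerow(row)
```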

Top-scoring results in our test runs included well-known metal review sites, underground podcast hosts, and European radio show submission pages — all with real, working contact emails. The relevance scoring consistently pushed the most useful contacts to the top.

The artist or label can sort by score, filter by content type or country, and start outreach immediately — no manual research required.

Why this approach works better than buying PR contact lists

Purchased media lists go stale fast. Bloggers stop writing, podcasts go dormant, emails change. A scraper pipeline that runs fresh queries against live search results and validates activity signals produces more accurate, more current data than any static database.

The multi-language, multi-region search strategy is the real differentiator. Most PR contact databases focus on English-language media. For a metal artist targeting European markets, the Scandinavian podcast host or the German rock radio DJ is often more valuable than another US-based music blog — and those contacts simply don't exist in generic databases.

The full pipeline pattern at a glance

An automated PR outreach scraper for the music industry follows four phases: concurrent multi-language search, aggressive filtering and deduplication, batch-orchestrated headless crawling with poll-based completion tracking, and scored email extraction with recency validation. The result is a ranked, ready-to-use contact list built from live data — not a stale spreadsheet. For independent artists and labels doing their own press outreach, this kind of pipeline turns days of manual research into a single script run measured in minutes.