A production LinkedIn profile scraper in Python uses LinkedIn's internal Voyager API (/voyager/api/identity/dash/profiles) authenticated via JSESSIONID cookie to pull name, headline, company, about section, and profile photo in a single HTTP request — returning structured data in under one second. DOM scraping no longer works on LinkedIn: class names are obfuscated, layouts are condensed, and there are no stable semantic selectors. The Voyager API is the only reliable extraction path.
We built this scraper to power an AI cover image generator. Users paste a LinkedIn username, the system imports their profile data in real time, then feeds it into an image generation pipeline. The scraper needed to be fast (sub-second on warm requests), resilient (automatic session refresh), and safe (every request routed through residential proxies so the server IP never touches LinkedIn).
Why DOM scraping fails on LinkedIn in 2025
LinkedIn renders profiles inside a heavily obfuscated React app. Class names change between deploys: many are auto-generated, and even semantic-looking ones like .pv-top-card--list disappear without notice. The mobile-responsive layout compresses everything into a 593px container, collapsing sections that exist on desktop. There are no stable data- attributes or ARIA labels worth targeting.
Every CSS-selector-based scraper we tested broke within days. The only stable interface is LinkedIn's own internal REST API — the same one their frontend calls to render the page.
How the Voyager API works for profile data
LinkedIn's frontend fetches profile data from a REST endpoint called the Voyager API. The key endpoint is:
GET https://www.linkedin.com/voyager/api/identity/dash/profiles
?q=memberIdentity
&memberIdentity={slug}
&decorationId=com.linkedin.voyager.dash.deco.identity.profile.FullProfileWithEntities-93
Authentication requires three things: a valid JSESSIONID cookie (from a logged-in LinkedIn session), a csrf-token header set to the JSESSIONID value with surrounding quotes stripped, and the x-restli-protocol-version: 2.0.0 header. The response is a normalized JSON structure containing the full profile.
import requests  # SOCKS proxy support requires requests[socks]

headers = {
    "csrf-token": csrf_token,  # JSESSIONID value with surrounding quotes stripped
    "x-li-lang": "en_US",
    "x-restli-protocol-version": "2.0.0",
    "accept": "application/vnd.linkedin.normalized+json+2.1",
}
resp = requests.get(url, headers=headers, cookies=cookies,
                    proxies=proxies, timeout=20)
The CSRF token derivation is simple but easy to miss: take the JSESSIONID cookie value and strip the surrounding double quotes. That string becomes your csrf-token header.
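As a minimal sketch, the derivation is a one-liner (the function name is ours, not from the scraper):

```python
def csrf_from_jsessionid(jsessionid: str) -> str:
    # LinkedIn stores JSESSIONID as a quoted string, e.g. '"ajax:1234567890"'.
    # The csrf-token header must carry the same value without the quotes.
    return jsessionid.strip('"')
```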
Why the Voyager response requires careful parsing
The Voyager API does not return a clean nested profile object. Instead, it returns a flat included array containing every entity referenced by the profile — positions, education, skills, and the profile itself — all at the same level.
To find the primary profile, scan for an item where entityUrn contains fsd_profile: and firstName exists. Name, headline, and summary live directly on this object.
profile_obj = None
for item in data.get("included", []):
    if item.get("firstName") and "fsd_profile:" in item.get("entityUrn", ""):
        profile_obj = item
        break

name = (profile_obj.get("firstName", "") + " " + profile_obj.get("lastName", "")).strip()
title = profile_obj.get("headline", "")
about = profile_obj.get("summary", "")
The current company is buried in the same flat array. You need to find items where $type ends with Position, the item has a companyName, and there is no dateRange.end — meaning the position is current, not historical.
for item in included:
    if (item.get("$type", "").endswith("Position")
            and item.get("companyName")
            and not item.get("dateRange", {}).get("end")):
        company = item["companyName"]
        break
How warm-path and cold-path scraping work together
The scraper uses a two-path architecture to balance speed and reliability. The warm path makes a direct requests.get() call to the Voyager API using stored session cookies over a SOCKS5 proxy. This completes in under one second with no browser overhead.
When the warm path returns a 401 or 403 (session cookies have expired), the system falls through to the cold path. This launches a full headless browser session via pyppeteer, navigates to the LinkedIn profile page, captures fresh cookies from the live browser, then calls the Voyager API from within the page context. The new cookies are persisted to the database so the next scrape uses the warm path again.
The cold path costs 10-15 seconds but only runs when sessions expire. In production, most scrapes hit the warm path because cookies remain valid for days.
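The fallback logic reduces to a small control-flow sketch; warm_fetch and cold_fetch are hypothetical stand-ins for the HTTP and browser paths:

```python
def scrape_profile(slug, warm_fetch, cold_fetch):
    # Warm path: direct Voyager call with stored cookies (sub-second).
    status, data = warm_fetch(slug)
    if status in (401, 403):
        # Cookies expired: the cold path launches a browser, refreshes
        # the session, and re-queries Voyager (10-15 s, but rare).
        data = cold_fetch(slug)
    return data
```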
How account pool rotation protects LinkedIn accounts
Running all scrapes through a single LinkedIn session is a fast path to getting that account banned. The system maintains a pool of browser profiles, each with its own LinkedIn login and dedicated residential SOCKS5 proxy.
On startup (and before every scrape), the system syncs profiles tagged "Avatar" from the GoLogin cloud API into a PostgreSQL linkedin_scraper_accounts table. Each row stores the profile ID, proxy DSN, session cookies, CSRF token, usage count, and status (active, needs_refresh, unconfigured, or proxy_dead).
Account selection uses least-usage-first ordering with random tiebreaking:
SELECT * FROM linkedin_scraper_accounts
WHERE status IN ('active', 'needs_refresh', 'unconfigured')
ORDER BY
    CASE status
        WHEN 'active' THEN 0
        WHEN 'needs_refresh' THEN 1
        WHEN 'unconfigured' THEN 2
    END,
    usage_count ASC,
    RANDOM()
LIMIT 1;
This distributes load evenly across the pool. If one account fails, the system retries with a different account automatically.
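The retry loop around account selection can be sketched as follows; pick_account and scrape are stand-ins for the query above and the actual scrape call:

```python
def scrape_with_rotation(slug, pick_account, scrape, max_attempts=3):
    # Try up to max_attempts accounts, least-used first; a failure on one
    # account (dead proxy, expired session) moves on to the next.
    last_exc = None
    for _ in range(max_attempts):
        account = pick_account()
        if account is None:
            break  # pool exhausted
        try:
            return scrape(account, slug)
        except Exception as exc:
            last_exc = exc
    raise RuntimeError(f"all accounts failed for {slug!r}") from last_exc
```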
Why socks5h matters and socks5 will break your scraper
This single character — the h in socks5h:// — cost us hours of debugging. The difference: socks5:// resolves DNS locally on the server, then sends the resolved IP through the proxy. socks5h:// sends the hostname to the proxy server, which resolves DNS on its end.
With socks5://, local DNS resolution often returns IPv6 addresses that residential SOCKS proxies cannot route. The result is a cryptic "General SOCKS server failure" with no indication that DNS is the problem. Switching to socks5h:// fixed it immediately — the proxy resolves the hostname to an IPv4 address it can actually reach.
# Wrong — DNS resolves locally, may get IPv6 the proxy can't route
proxy_dsn = f"socks5://{username}:{password}@{host}:{port}"
# Correct — hostname sent to proxy for resolution
proxy_dsn = f"socks5h://{username}:{password}@{host}:{port}"
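With requests (installed as requests[socks] for SOCKS support), the same DSN goes into the proxies mapping for both schemes; build_proxies is an illustrative helper of ours, not part of the scraper:

```python
def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    # socks5h:// delegates DNS resolution to the proxy server itself,
    # so the hostname never resolves (possibly to IPv6) on our machine.
    dsn = f"socks5h://{username}:{password}@{host}:{port}"
    return {"http": dsn, "https": dsn}
```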
How proxy enforcement prevents IP exposure
The scraper defines a custom _ProxyError exception that acts as a hard circuit breaker. If the proxy is unreachable — whether during a warm-path HTTP request or a cold-path browser session — the scraper raises _ProxyError instead of falling back to an unproxied request.
This is a deliberate safety mechanism. A scrape that succeeds without a proxy exposes the server's real IP to LinkedIn, which can lead to IP-level bans that affect all accounts. The code enforces this at both entry points:
if not proxy_dsn:
    raise _ProxyError(
        f"Refusing browser session for profile_id={profile_id}: no proxy configured"
    )

# On connection failure:
except requests.exceptions.ConnectionError as exc:
    raise _ProxyError(
        f"Proxy unreachable for profile_id={account.get('profile_id')}: {exc}"
    ) from exc
When a proxy dies, the account is marked proxy_dead in the database and excluded from future selection until the proxy is replaced.
Why window.fetch fails inside LinkedIn pages
During development of the cold path, we initially tried fetching profile photos using page.evaluate() with the browser's native fetch() API. The calls failed silently — no errors, no data.
The reason: LinkedIn overrides window.fetch with their own fetchWrapper that intercepts and modifies requests. Any fetch() call made inside page.evaluate() goes through LinkedIn's wrapper, which can block, redirect, or silently drop requests to external CDN URLs.
The fix was straightforward: fetch profile photos directly from Python using requests.get() with the account's proxy. LinkedIn CDN URLs for profile photos are publicly accessible — they do not require authentication. The browser is only needed to capture session cookies and call the authenticated Voyager endpoint.
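A sketch of the Python-side photo fetch; the injectable get parameter is our addition for testability, defaulting to requests.get:

```python
import requests

def fetch_profile_photo(photo_url, proxies, timeout=20, get=requests.get):
    # CDN photo URLs need no cookies or CSRF header, but the request still
    # goes through the account's proxy so the server IP stays hidden.
    resp = get(photo_url, proxies=proxies, timeout=timeout)
    resp.raise_for_status()
    return resp.content
```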
How navigation timeouts prevent hung scraper processes
Pyppeteer's connect() method will hang indefinitely if the target WebSocket endpoint is dead. In production, this meant scrape requests could block forever when a GoLogin session had expired but its WebSocket URL was still stored in the database.
The fix wraps every async operation in asyncio.wait_for() with explicit timeouts: 8 seconds for WebSocket reconnection, 40 seconds for page navigation. When a timeout fires, the account is marked proxy_dead and the system moves to the next account in the pool.
# Reconnect to existing session — 8s timeout
browser = await asyncio.wait_for(
    pyppeteer.connect({"browserWSEndpoint": existing_ws_url}),
    timeout=8,
)

# Full navigation — 40s timeout
page = await asyncio.wait_for(_do_page_work(), timeout=40)
Without these timeouts, a single dead proxy could stall the entire scrape queue. With them, the system fails fast and retries on a healthy account within seconds.
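The fail-fast pattern can be sketched as a small helper (the name is ours); callers treat a None result as "mark the account proxy_dead and rotate":

```python
import asyncio

async def await_with_deadline(factory, timeout):
    # Wrap any coroutine in an explicit deadline so a dead WebSocket or a
    # stalled page navigation cannot hang the scrape queue forever.
    try:
        return await asyncio.wait_for(factory(), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # caller marks the account proxy_dead and retries
```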
How session persistence eliminates redundant browser launches
Every cold-path scrape captures the browser's cookies and persists them to PostgreSQL along with the derived CSRF token and a timestamp. The next time that account is selected, the warm path finds valid cookies in the database and skips the browser entirely.
Sessions typically remain valid for 3-7 days. During that window, every scrape through that account completes in under one second via pure HTTP. When cookies finally expire (the Voyager API returns 401), the account status flips to needs_refresh, triggering a single cold-path browser session to capture fresh cookies. Then it is back to sub-second warm-path scrapes.
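The warm-path eligibility check reduces to comparing the stored capture timestamp against a validity window; the field names here are illustrative, not the actual table schema:

```python
import time

def warm_path_eligible(row, now=None, max_age_days=7):
    # Cookies captured within the validity window can skip the browser
    # launch entirely and go straight to the Voyager HTTP call.
    now = time.time() if now is None else now
    return (
        row.get("cookies") is not None
        and now - row["captured_at"] < max_age_days * 86400
    )
```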
This session persistence layer is what makes the system production-viable. Without it, every scrape would require a 10-15 second browser launch — far too slow for a real-time user-facing feature.
The complete architecture at a glance
A production LinkedIn profile scraper in Python combines the Voyager API for data extraction, a pool of proxy-protected browser profiles for session management, and a warm/cold path architecture for speed. The warm path handles 90%+ of requests in under one second via direct HTTP. The cold path launches a headless browser only when session cookies expire. Account rotation distributes load across the pool, proxy enforcement prevents IP exposure, and socks5h:// ensures DNS resolution happens on the proxy side. The result is a scraper that returns structured LinkedIn profile data — name, title, company, about, and photo — reliably and at scale, without DOM parsing or brittle CSS selectors.