A production LinkedIn profile scraper in Python uses LinkedIn's internal Voyager API (/voyager/api/identity/dash/profiles) authenticated via JSESSIONID cookie to pull name, headline, company, about section, and profile photo in a single HTTP request — returning structured data in under one second. DOM scraping no longer works on LinkedIn: class names are obfuscated, layouts are condensed, and there are no stable semantic selectors. The Voyager API is the only reliable extraction path.
We built this scraper to power an AI cover image generator. Users paste a LinkedIn username, the system imports their profile data in real time, then feeds it into an image generation pipeline. The scraper needed to be fast (sub-second on warm requests), resilient (automatic session refresh), and safe (every request routed through residential proxies so the server IP never touches LinkedIn).
Why DOM scraping fails on LinkedIn in 2025
LinkedIn renders profiles inside a heavily obfuscated React app. Class names change between deploys: many are auto-generated, and even semantic-looking ones like .pv-top-card--list disappear without notice. The mobile-responsive layout compresses everything into a 593px container, collapsing sections that exist on desktop. There are no stable data- attributes or ARIA labels worth targeting.
Every CSS-selector-based scraper we tested broke within days. The only stable interface is LinkedIn's own internal REST API — the same one their frontend calls to render the page.
How the Voyager API works for profile data
LinkedIn's frontend fetches profile data from a REST endpoint called the Voyager API. The key endpoint is:
GET https://www.linkedin.com/voyager/api/identity/dash/profiles
?q=memberIdentity
&memberIdentity={slug}
&decorationId=com.linkedin.voyager.dash.deco.identity.profile.FullProfileWithEntities-93
Authentication requires three things: a valid JSESSIONID cookie (from a logged-in LinkedIn session), a csrf-token header set to the JSESSIONID value with surrounding quotes stripped, and the x-restli-protocol-version: 2.0.0 header. The response is a normalized JSON structure containing the full profile.
import requests  # SOCKS proxy support requires requests[socks]

headers = {
    "csrf-token": csrf_token,  # JSESSIONID value with surrounding quotes stripped
    "x-li-lang": "en_US",
    "x-restli-protocol-version": "2.0.0",
    "accept": "application/vnd.linkedin.normalized+json+2.1",
}
resp = requests.get(url, headers=headers, cookies=cookies,
                    proxies=proxies, timeout=20)
The CSRF token derivation is simple but easy to miss: take the JSESSIONID cookie value and strip the surrounding double quotes. That string becomes your csrf-token header.
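As a minimal sketch, the derivation is a one-liner (the function name is ours, not from the scraper):

```python
def csrf_from_jsessionid(jsessionid: str) -> str:
    # LinkedIn stores JSESSIONID as a quoted string, e.g. '"ajax:1234567890"'.
    # The csrf-token header must carry the same value without the quotes.
    return jsessionid.strip('"')
```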
Why the Voyager response requires careful parsing
The Voyager API does not return a clean nested profile object. Instead, it returns a flat included array containing every entity referenced by the profile — positions, education, skills, and the profile itself — all at the same level.
To find the primary profile, scan for an item where entityUrn contains fsd_profile: and firstName exists. Name, headline, and summary live directly on this object.
profile_obj = None
for item in data.get("included", []):
    if item.get("firstName") and "fsd_profile:" in item.get("entityUrn", ""):
        profile_obj = item
        break

name = (profile_obj.get("firstName", "") + " " + profile_obj.get("lastName", "")).strip()
title = profile_obj.get("headline", "")
about = profile_obj.get("summary", "")
The current company is buried in the same flat array. You need to find items where $type ends with Position, the item has a companyName, and there is no dateRange.end — meaning the position is current, not historical.
for item in included:
    if (item.get("$type", "").endswith("Position")
            and item.get("companyName")
            and not item.get("dateRange", {}).get("end")):
        company = item["companyName"]
        break
How warm-path and cold-path scraping work together
The scraper uses a two-path architecture to balance speed and reliability. The warm path makes a direct requests.get() call to the Voyager API using stored session cookies over a SOCKS5 proxy. This completes in under one second with no browser overhead.
When the warm path returns a 401 or 403 (session cookies have expired), the system falls through to the cold path. This launches a full headless browser session via pyppeteer, navigates to the LinkedIn profile page, captures fresh cookies from the live browser, then calls the Voyager API from within the page context. The new cookies are persisted to the database so the next scrape uses the warm path again.
The cold path costs 10-15 seconds but only runs when sessions expire. In production, most scrapes hit the warm path because cookies remain valid for days.
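The fallback logic reduces to a small control-flow sketch; warm_fetch and cold_fetch are hypothetical stand-ins for the HTTP and browser paths:

```python
def scrape_profile(slug, warm_fetch, cold_fetch):
    # Warm path: direct Voyager call with stored cookies (sub-second).
    status, data = warm_fetch(slug)
    if status in (401, 403):
        # Cookies expired: the cold path launches a browser, refreshes
        # the session, and re-queries Voyager (10-15 s, but rare).
        data = cold_fetch(slug)
    return data
```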
How account pool rotation protects LinkedIn accounts
Running all scrapes through a single LinkedIn session is a fast path to getting that account banned. The system maintains a pool of browser profiles, each with its own LinkedIn login and dedicated residential SOCKS5 proxy.
On startup (and before every scrape), the system syncs profiles tagged "Avatar" from the GoLogin cloud API into a PostgreSQL linkedin_scraper_accounts table. Each row stores the profile ID, proxy DSN, session cookies, CSRF token, usage count, and status (active, needs_refresh, unconfigured, or proxy_dead).
Account selection uses least-usage-first ordering with random tiebreaking:
SELECT * FROM linkedin_scraper_accounts
WHERE status IN ('active', 'needs_refresh', 'unconfigured')
ORDER BY
    CASE status
        WHEN 'active' THEN 0
        WHEN 'needs_refresh' THEN 1
        WHEN 'unconfigured' THEN 2
    END,
    usage_count ASC,
    RANDOM()
LIMIT 1;
This distributes load evenly across the pool. If one account fails, the system retries with a different account automatically.
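The retry loop around account selection can be sketched as follows; pick_account and scrape are stand-ins for the query above and the actual scrape call:

```python
def scrape_with_rotation(slug, pick_account, scrape, max_attempts=3):
    # Try up to max_attempts accounts, least-used first; a failure on one
    # account (dead proxy, expired session) moves on to the next.
    last_exc = None
    for _ in range(max_attempts):
        account = pick_account()
        if account is None:
            break  # pool exhausted
        try:
            return scrape(account, slug)
        except Exception as exc:
            last_exc = exc
    raise RuntimeError(f"all accounts failed for {slug!r}") from last_exc
```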
Why socks5h matters and socks5 will break your scraper
This single character — the h in socks5h:// — cost us hours of debugging. The difference: socks5:// resolves DNS locally on the server, then sends the resolved IP through the proxy. socks5h:// sends the hostname to the proxy server, which resolves DNS on its end.
With socks5://, local DNS resolution often returns IPv6 addresses that residential SOCKS proxies cannot route. The result is a cryptic "General SOCKS server failure" with no indication that DNS is the problem. Switching to socks5h:// fixed it immediately — the proxy resolves the hostname to an IPv4 address it can actually reach.
# Wrong — DNS resolves locally, may get IPv6 the proxy can't route
proxy_dsn = f"socks5://{username}:{password}@{host}:{port}"
# Correct — hostname sent to proxy for resolution
proxy_dsn = f"socks5h://{username}:{password}@{host}:{port}"
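With requests (installed as requests[socks] for SOCKS support), the same DSN goes into the proxies mapping for both schemes; build_proxies is an illustrative helper of ours, not part of the scraper:

```python
def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    # socks5h:// delegates DNS resolution to the proxy server itself,
    # so the hostname never resolves (possibly to IPv6) on our machine.
    dsn = f"socks5h://{username}:{password}@{host}:{port}"
    return {"http": dsn, "https": dsn}
```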
How proxy enforcement prevents IP exposure
The scraper defines a custom _ProxyError exception that acts as a hard circuit breaker. If the proxy is unreachable — whether during a warm-path HTTP request or a cold-path browser session — the scraper raises _ProxyError instead of falling back to an unproxied request.
This is a deliberate safety mechanism. A scrape that succeeds without a proxy exposes the server's real IP to LinkedIn, which can lead to IP-level bans that affect all accounts. The code enforces this at both entry points:
if not proxy_dsn:
    raise _ProxyError(
        f"Refusing browser session for profile_id={profile_id}: no proxy configured"
    )

# On connection failure:
except requests.exceptions.ConnectionError as exc:
    raise _ProxyError(
        f"Proxy unreachable for profile_id={account.get('profile_id')}: {exc}"
    ) from exc
When a proxy dies, the account is marked proxy_dead in the database and excluded from future selection until the proxy is replaced.
Why window.fetch fails inside LinkedIn pages
During development of the cold path, we initially tried fetching profile photos using page.evaluate() with the browser's native fetch() API. The calls failed silently — no errors, no data.
The reason: LinkedIn overrides window.fetch with their own fetchWrapper that intercepts and modifies requests. Any fetch() call made inside page.evaluate() goes through LinkedIn's wrapper, which can block, redirect, or silently drop requests to external CDN URLs.
The fix was straightforward: fetch profile photos directly from Python using requests.get() with the account's proxy. LinkedIn CDN URLs for profile photos are publicly accessible — they do not require authentication. The browser is only needed to capture session cookies and call the authenticated Voyager endpoint.
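A sketch of the Python-side photo fetch; the injectable get parameter is our addition for testability, defaulting to requests.get:

```python
import requests

def fetch_profile_photo(photo_url, proxies, timeout=20, get=requests.get):
    # CDN photo URLs need no cookies or CSRF header, but the request still
    # goes through the account's proxy so the server IP stays hidden.
    resp = get(photo_url, proxies=proxies, timeout=timeout)
    resp.raise_for_status()
    return resp.content
```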
How navigation timeouts prevent hung scraper processes
Pyppeteer's connect() method will hang indefinitely if the target WebSocket endpoint is dead. In production, this meant scrape requests could block forever when a GoLogin session had expired but its WebSocket URL was still stored in the database.
The fix wraps every async operation in asyncio.wait_for() with explicit timeouts: 8 seconds for WebSocket reconnection, 40 seconds for page navigation. When a timeout fires, the account is marked proxy_dead and the system moves to the next account in the pool.
# Reconnect to existing session — 8s timeout
browser = await asyncio.wait_for(
    pyppeteer.connect({"browserWSEndpoint": existing_ws_url}),
    timeout=8,
)

# Full navigation — 40s timeout
page = await asyncio.wait_for(_do_page_work(), timeout=40)
Without these timeouts, a single dead proxy could stall the entire scrape queue. With them, the system fails fast and retries on a healthy account within seconds.
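The fail-fast pattern can be sketched as a small helper (the name is ours); callers treat a None result as "mark the account proxy_dead and rotate":

```python
import asyncio

async def await_with_deadline(factory, timeout):
    # Wrap any coroutine in an explicit deadline so a dead WebSocket or a
    # stalled page navigation cannot hang the scrape queue forever.
    try:
        return await asyncio.wait_for(factory(), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # caller marks the account proxy_dead and retries
```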
How session persistence eliminates redundant browser launches
Every cold-path scrape captures the browser's cookies and persists them to PostgreSQL along with the derived CSRF token and a timestamp. The next time that account is selected, the warm path finds valid cookies in the database and skips the browser entirely.
Sessions typically remain valid for 3-7 days. During that window, every scrape through that account completes in under one second via pure HTTP. When cookies finally expire (the Voyager API returns 401), the account status flips to needs_refresh, triggering a single cold-path browser session to capture fresh cookies. Then it is back to sub-second warm-path scrapes.
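The warm-path eligibility check reduces to comparing the stored capture timestamp against a validity window; the field names here are illustrative, not the actual table schema:

```python
import time

def warm_path_eligible(row, now=None, max_age_days=7):
    # Cookies captured within the validity window can skip the browser
    # launch entirely and go straight to the Voyager HTTP call.
    now = time.time() if now is None else now
    return (
        row.get("cookies") is not None
        and now - row["captured_at"] < max_age_days * 86400
    )
```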
This session persistence layer is what makes the system production-viable. Without it, every scrape would require a 10-15 second browser launch — far too slow for a real-time user-facing feature.
The complete architecture at a glance
A production LinkedIn profile scraper in Python combines the Voyager API for data extraction, a pool of proxy-protected browser profiles for session management, and a warm/cold path architecture for speed. The warm path handles 90%+ of requests in under one second via direct HTTP. The cold path launches a headless browser only when session cookies expire. Account rotation distributes load across the pool, proxy enforcement prevents IP exposure, and socks5h:// ensures DNS resolution happens on the proxy side. The result is a scraper that returns structured LinkedIn profile data — name, title, company, about, and photo — reliably and at scale, without DOM parsing or brittle CSS selectors.