TL;DR
I needed to pull my saved-posts list from Xiaohongshu (China's Instagram-meets-Pinterest) for offline analysis. Their web client renders the data, but their API has a request-signature scheme that defeats casual scraping. After three dead ends — cookie-based login detection, guessed API endpoints, and replaying a sniffed request — the working solution turned out to be the inverse of "automate the API call": let the browser issue the request and listen for the response. 250 saved posts in 90 seconds. Here's what each approach taught me.
This is part 2 of a series. Part 1 covered shipping four iOS apps with one Claude Code agent; this part is one of the agent's harder side-quests.
The setup
I'm building an offline AI-prompt manager for iOS. I needed evidence that "AI-related content" was a real user interest signal, not just a guess. The cheapest evidence: my own saved-posts history on Bilibili and Xiaohongshu. Bilibili had a clean public API — 347 items pulled in 30 seconds. Xiaohongshu? Different story.
Stack used throughout:
- Python 3.12
- Playwright 1.50 with Chromium
- macOS / Windows compatible (`subprocess.run` + `pathlib`)
- Storage state persisted as JSON for re-entry without re-login
Attempt 1 — login + cookie polling
The standard Playwright pattern: launch a headed browser, let the user log in interactively, watch for a "logged-in" cookie to appear, save the storage state.
```python
# login_and_save.py — first version
def has_login_cookie(context, name: str) -> bool:
    cookies = context.cookies()
    return any(c.get("name") == name and c.get("value") for c in cookies)

# Polling loop
while time.time() < deadline:
    if has_login_cookie(context, "web_session"):  # XHS sets this on login
        context.storage_state(path=str(state_path))
        browser.close()
        return 0
    time.sleep(2)
```
I ran the script. It happily reported "登录成功! storage state saved" ("login successful!") within 5 seconds — even though I hadn't actually scanned the QR code yet. The cookie was set, but…
```python
# Probe: am I really logged in?
resp = page.request.get("https://edith.xiaohongshu.com/api/sns/web/v2/user/me")
data = resp.json()
print(data["data"]["guest"])  # → True
```
Lesson 1: Xiaohongshu sets the `web_session` cookie immediately on first page load — for guest users too. The cookie is not a login signal. It's a session ID that gets associated with a user only after authentication.
The fix: poll for `guest: false` from the actual `/me` endpoint, not for cookie presence.
```python
def _xhs_logged_in(page, context) -> bool:
    try:
        resp = page.request.get(
            "https://edith.xiaohongshu.com/api/sns/web/v2/user/me",
            headers={"Accept": "application/json"},
        )
        return resp.json().get("data", {}).get("guest", True) is False
    except Exception:
        return False
```
This works. After a real login, `guest` flips to `false`. Storage state saves cleanly.
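Wired into the polling loop, the script now waits for the real signal (a minimal sketch; `deadline` and `state_path` come from the surrounding script):

```python
# Corrected loop: poll the identity API, not the cookie jar
while time.time() < deadline:
    if _xhs_logged_in(page, context):
        context.storage_state(path=str(state_path))  # cookies + localStorage
        browser.close()
        return 0
    time.sleep(2)
```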
Attempt 2 — direct API fetch
Now logged in. Time to find the collection endpoint.
I started with educated guesses based on naming conventions seen on similar Chinese apps:
```python
candidates = [
    "https://edith.xiaohongshu.com/api/sns/web/v1/note/collect_page?num=20&cursor=",
    "https://edith.xiaohongshu.com/api/sns/web/v2/note/collect_page?num=20&cursor=",
    "https://edith.xiaohongshu.com/api/sns/web/v1/user_posted/collected?num=20",
    "https://edith.xiaohongshu.com/api/sns/web/v1/user/collected?num=20",
]
```
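A quick probe loop over the candidates (a sketch; the real script logged more detail):

```python
# Probe each guess through the logged-in browser context
for url in candidates:
    resp = page.request.get(url, headers={"Accept": "application/json"})
    print(resp.status, url)
```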
Every single one returned 404 page not found. Not 401, not 403 — a flat 404. Either the endpoints had been renamed, or my guesses were wrong on a fundamental dimension.
Lesson 2: When you guess endpoints and get 404 across all variants, stop guessing. Sniff what the browser actually does.
Attempt 3 — Playwright network sniff
I added `page.on("request")` and `page.on("response")` listeners and navigated to my profile in a headed browser. Then I clicked around manually and watched what the browser fetched.
```python
captured: list[dict] = []

def on_request(req):
    url = req.url
    if "xiaohongshu.com/api/" in url or "edith.xiaohongshu" in url:
        captured.append({
            "method": req.method,
            "url": url,
            "headers": dict(req.headers),
        })

page.on("request", on_request)
```
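Once the manual session is done, dump the unique API paths to surface the candidates (a sketch):

```python
from urllib.parse import urlsplit

# Collapse the captures down to unique API paths
for path in sorted({urlsplit(c["url"]).path for c in captured}):
    print(path)
```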
90 seconds of clicking and scrolling produced 41 captured API calls. The interesting one:
```
GET https://edith.xiaohongshu.com/api/sns/web/v2/note/collect/page
    ?num=30
    &cursor=
    &user_id=...
    &image_formats=jpg,webp,avif
    &xsec_token=
    &xsec_source=
```
So the path is `note/collect/page` (singular `note`, slash-separated `collect/page`) — not `note/collect_page`, not `user_posted/collected`. None of my guesses had been close.
I copied that URL into a direct script invocation:
```python
url = (
    "https://edith.xiaohongshu.com/api/sns/web/v2/note/collect/page"
    "?num=30&cursor=&user_id=...&image_formats=jpg,webp,avif&xsec_token=&xsec_source="
)
resp = page.request.get(url, headers={"Accept": "application/json", "Origin": "...", "Referer": "..."})
print(resp.json())
# → {"code": -1, "success": false}
```
Lesson 3: A request that the browser issues successfully will not necessarily work when you replay it directly — even with the same cookies. The server is checking something else.
What broke: x-s signature headers
Xiaohongshu uses a request-signature scheme. Every API call from the official web client includes three headers:
- `x-s` (a hash signature)
- `x-s-common` (a "platform fingerprint" payload)
- `x-t` (a timestamp)
These are computed by the front-end JavaScript before each fetch. The implementation lives in heavily-obfuscated bundle code that runs in the browser and signs each request based on the URL, body, and a per-session secret.
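You can see them in the Attempt 3 capture — a quick look at one captured call's headers (a sketch over the `captured` list from earlier):

```python
# Inspect the signature headers the browser attached to one captured call
sample = next(c for c in captured if "note/collect/page" in c["url"])
for name in ("x-s", "x-s-common", "x-t"):
    value = sample["headers"].get(name, "")
    print(f"{name}: {value[:40]}{'...' if len(value) > 40 else ''}")
```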
To reproduce these signatures from Python you'd need to either reverse-engineer the obfuscated signing code (hours; brittle to bundle updates) or run a JS engine alongside Python (overhead; jsdom doesn't quite work because the real bundle expects browser APIs).
Both options are bad-ROI for a one-shot offline analysis.
Lesson 4: When the platform makes API replay expensive, don't replay. Borrow the browser's signing for free.
The fix: let the browser issue the request, listen for the response
The signed-request problem disappears if you don't issue the request from your code at all. Instead:
- Have Playwright drive the page to load whatever data you want (clicking the right tab, scrolling)
- Let the front-end JS sign and issue the calls itself
- Listen to the HTTP responses with `page.on("response")` and capture the JSON bodies
```python
pages_data: list[dict] = []
last_response_time = [time.time()]  # mutable cell so the closure can update it

def on_response(resp):
    if "/api/sns/web/v2/note/collect/page" in resp.url and resp.status == 200:
        body = resp.json()
        pages_data.append(body)
        last_response_time[0] = time.time()

page.on("response", on_response)

# Navigate + click collection tab + scroll loop
page.goto(f"https://www.xiaohongshu.com/user/profile/{user_id}")
page.locator("[class*='reds-tab']:has-text('收藏')").first.click()  # 收藏 = the "Saved" tab
time.sleep(3)

for _ in range(60):  # cap
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1.2)
    if time.time() - last_response_time[0] > 8:  # idle = done
        break
```
Each scroll triggers another collect/page request from the front end. The front end signs it correctly. The server returns paginated JSON. My listener captures the body.
250 saved posts in 25 paginated responses, ~90 seconds end-to-end.
The full output is structured cleanly:
```json
{
  "data": {
    "notes": [
      {
        "note_id": "...",
        "type": "video",
        "title": "ai直接把我公寓租出去了 把我家具也卖了",
        "user": {"nickname": "L"},
        "interact_info": {"liked_count": "8650"}
      }
    ],
    "cursor": "...",
    "has_more": true
  }
}
```

(The sample title translates roughly as "AI went ahead and rented out my apartment — and sold my furniture too.")
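From here, flattening the paginated bodies into one list and persisting it is mechanical (a sketch assuming the shape above; the filename is mine):

```python
import json

# Flatten the paginated responses into a single list of notes
notes = [
    note
    for body in pages_data
    for note in body.get("data", {}).get("notes", [])
]

with open("xhs_collected.json", "w", encoding="utf-8") as f:
    json.dump(notes, f, ensure_ascii=False, indent=2)

print(len(notes), "notes saved")  # → 250
```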
What this enabled
With 250 collected posts in JSON, I ran a simple keyword-based classification — regex against ~15 tightened patterns covering LLM/Chat, Image-Gen, Video-Gen, Coding-Agent, Agent/Automation, Voice/Music, RAG/Embedding, and General-AI (a sketch of the classifier follows the results):
- 36 of 250 were AI-related (14.4%)
- Top signal: 21 posts about LLM/Chat (Claude Code being the dominant brand — 11 posts mentioned it directly)
- Second: Coding-Agent (Cursor / Aider / Continue) — 11 posts (notably 0 in my Bilibili sample, suggesting Xiaohongshu is where my newer interests show up)
These signals fed directly into product decisions for the iOS prompt-manager app I'm shipping. The "Claude Code starter prompts" section of the bundled prompt library got expanded from 0 to 7 items — driven by behavior data, not guesswork.
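For reference, the classifier was nothing fancy. A sketch over the `notes` list from earlier, with illustrative patterns (the real script used ~15 tighter ones):

```python
import re

# Illustrative patterns — not the exact ones from the real script
CATEGORIES = {
    "LLM/Chat": re.compile(r"claude|chatgpt|gpt-?4|llm|大模型", re.I),
    "Coding-Agent": re.compile(r"claude code|cursor|aider", re.I),
    "Image-Gen": re.compile(r"midjourney|stable diffusion|文生图", re.I),
}

def classify(title: str) -> list[str]:
    """Return every category whose pattern matches the title."""
    return [cat for cat, pat in CATEGORIES.items() if pat.search(title)]

ai_related = [n for n in notes if classify(n.get("title", ""))]
print(f"{len(ai_related)} / {len(notes)} AI-related")
```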
Generalized lessons for any "scrape from a hostile API" problem
- A cookie isn't necessarily login state. Test with an actual identity-API call.
- 404 across multiple endpoint guesses = stop guessing. Sniff the browser instead.
- API replay fails on signed requests. That's the platform's threat model talking.
- Don't fight the signature scheme. Outsource it to the browser. Drive the page, listen to responses.
- Idle detection > fixed scroll count. Loop until N seconds pass without new responses, not until N scrolls (see the sketch after this list).
- Keep the user out of the polling loop. A working "did the user log in" detector should poll an authoritative API, not a side-effect cookie.
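The idle-detection lesson generalizes into a small helper (a sketch; the names and defaults are mine):

```python
def scroll_until_idle(page, is_idle, max_scrolls=60, pause=1.2):
    """Scroll until the page stops producing new responses."""
    for _ in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        if is_idle():
            return

# Usage with the listener from earlier:
# scroll_until_idle(page, lambda: time.time() - last_response_time[0] > 8)
```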
Cost analysis
| Approach | Engineer hours | Reliability |
|---|---|---|
| Reverse-engineer x-s signatures | 4-8 hrs | Brittle to bundle updates |
| jsdom + bundle replay | 6-12 hrs | Memory-heavy, fragile |
| Playwright + response listener (this post) | 30 minutes | Works as long as the UI tab works |
The browser-as-signer approach is also more honest: you're using the same code path the site expects you to use. You just happen to keep the data around.
Code
The full set of scripts (login, sniff, collected, analyze) lives in the AutoApp toolkit repo. License is MIT.
If you adapt the response-listener pattern to other Chinese platforms (Douyin / Zhihu / Jike), I'd love to hear what works.
This article was originally published by DEV Community and written by 孫昊.