The Data Collector

What Changed in Web Data: May 2026 Recap

The Data Collector — Wed, 10 Jun 2026 12:47:26 GMT

Reddit closed the JSON door, Twitch tightened anonymous access — here’s what changed in web data this May, and what it means for your pipelines.

May was a quiet month on the surface and a loud one underneath. Two platforms that data teams have relied on for years quietly changed the rules, and if your collectors touch Reddit or Twitch, you probably felt it. Here’s the recap.

The big story: Reddit’s .json endpoints are gone

For over a decade, the worst-kept secret in web data was that you could append .json to almost any Reddit URL and get clean, structured data back. No API key, no OAuth, no rate-limit dance. Entire monitoring tools, research projects, and brand-tracking dashboards were built on it.

In May, that door closed. Unauthenticated .json requests now return 403 errors across the board. This is the next step in a trajectory that started with Reddit’s 2023 API pricing changes: free, anonymous access to Reddit data is being shut down, surface by surface.

What it means in practice:

Tools and scripts built on .json endpoints broke overnight — silently, in many cases. If your dashboard shows flat Reddit numbers since mid-May, check your collector.
The official API remains, but with authentication requirements and pricing that doesn’t fit casual or research use.
One stable, sanctioned surface still works: RSS feeds. Reddit publishes RSS for subreddits, users, and search queries, and it’s an intentionally public interface — not a loophole.
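
For readers who want to see what building on RSS looks like, here is a minimal sketch. It assumes subreddit feeds live at /r/<name>/.rss and are served as Atom XML; the function names and the User-Agent string are ours, not part of any Reddit SDK. The parser is kept separate from the network call so you can check it against a saved response.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: Reddit serves its "RSS" feeds as Atom documents.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def subreddit_feed_url(subreddit):
    """Build the feed URL for a subreddit (URL pattern is an assumption)."""
    return "https://www.reddit.com/r/{}/.rss".format(subreddit)


def parse_feed(xml_text):
    """Return a list of {title, link, updated} dicts from Atom XML."""
    root = ET.fromstring(xml_text)
    posts = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        posts.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "link": link.get("href") if link is not None else None,
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
        })
    return posts


def fetch_feed(subreddit, user_agent="example-collector/0.1"):
    """Download and parse a subreddit feed; the User-Agent is a placeholder."""
    request = urllib.request.Request(
        subreddit_feed_url(subreddit), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_feed(response.read().decode("utf-8"))
```

Calling fetch_feed("python") would return one dict per post in the feed, provided the assumptions above hold for your target.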

We rebuilt our Reddit collection on exactly that foundation. Reddit Scraper Fast is live on Apify: it pulls posts from subreddits, user profiles, and keyword searches via RSS — no login, no API key, and it survived the May lockdown without a hiccup because it never depended on the closed endpoints in the first place.

Twitch: stricter limits for anonymous access

Twitch also tightened how much data anonymous clients can page through in one go. If you’ve been collecting streams or channel data without authentication, you may have noticed results capping out earlier than they used to.

We took the opportunity to overhaul our Twitch Scraper. It now supports four modes:

Top streams — who’s live and big right now, by game or overall
Clips — trending clips for any channel or game
Videos — VODs and highlights with view counts and durations
Channel profiles — follower counts, descriptions, and metadata

All modes were rebuilt to work reliably within the new limits, so you get consistent results instead of silent truncation.

The lesson: build on surfaces that are meant to be public

May’s changes follow a pattern we’ve been watching all year: platforms are methodically closing the unofficial endpoints and undocumented tricks that data tools quietly depended on. The pipelines that keep working are the ones built on surfaces designed to be public — RSS, sitemaps, official APIs — with managed scrapers as the fallback for everything else.

If there’s one takeaway from this month: audit your collectors for hidden dependencies on “free” endpoints. They’re disappearing one by one, usually without an announcement.
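
A quick way to start that audit is to run your collectors' URL lists through a pattern check. The sketch below is illustrative only: the single rule shown (Reddit URLs ending in .json) comes from this month's story, and the rule table and function names are hypothetical, meant to be extended with whatever unofficial surfaces your own pipelines touch.

```python
from urllib.parse import urlparse


def _is_reddit_json(host, path):
    """True for reddit.com (or a subdomain) URLs whose path ends in .json."""
    on_reddit = host == "reddit.com" or host.endswith(".reddit.com")
    return on_reddit and path.endswith(".json")


# Hypothetical rule table: label -> predicate taking (host, path).
RISKY_PATTERNS = {
    "reddit-json": _is_reddit_json,
}


def audit_urls(urls):
    """Return (url, label) pairs for every URL that matches a risky rule."""
    findings = []
    for url in urls:
        parsed = urlparse(url)
        host = parsed.hostname or ""
        for label, matches in RISKY_PATTERNS.items():
            if matches(host, parsed.path):
                findings.append((url, label))
    return findings
```

Feed it the URLs from your scheduler or config files and review anything it flags before the next quiet breakage.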

What’s next

June is already busy — we’re expanding coverage on the job-market and review-platform side, and there’s a deeper guide in the works on building change-resistant data pipelines. If a platform you rely on changed its rules recently and you’re stuck, reply to this email — reader questions have a way of becoming the next deep dive.

Thanks for reading, The Data Collector

How to Scrape Twitter/X in 2026: What Still Works

The Data Collector — Wed, 25 Mar 2026 11:09:01 GMT

Twitter/X scraping in 2026 is a minefield. Most of the tools that worked even a year ago — snscrape, guest tokens, free proxy lists — are now dead or unreliable. Twitter's aggressive bot detection and expensive API tiers ($200-$5,000/month) have narrowed the options considerably.

I put together a practical guide covering what still works: official API access, managed scraping services, and the legal considerations you should know about. If you're collecting public Twitter data for research or monitoring, this breaks down the realistic options without the hype.

Read the full guide: How to Scrape Twitter/X in 2026: Public Data, Rate Limits, and What Still Works

ScraperAPI Tutorial: Build a Web Scraper That Bypasses Anti-Bot Protection

The Data Collector — Wed, 25 Mar 2026 11:07:57 GMT

Modern websites have made simple scraping nearly impossible. TLS fingerprinting, behavioral analysis, and machine learning now work together to detect and block automated access. If you've been struggling with anti-bot protection, you're not alone — the landscape shifted dramatically in 2025-2026.

This tutorial walks through using ScraperAPI to handle the hard parts: proxy rotation across 40M+ residential IPs, JavaScript rendering, and automatic CAPTCHA solving. Includes working code for scraping Google search results, Amazon products, and LinkedIn job listings, plus a complete price-monitoring application.

Read the full tutorial: ScraperAPI Tutorial: Build a Web Scraper That Bypasses Anti-Bot Protection in 2026

Best Proxy Services for Web Scraping in 2026: Residential vs Datacenter vs Rotating

The Data Collector — Wed, 25 Mar 2026 11:05:03 GMT

Choosing the right proxy service can make or break your web scraping project. With sites deploying increasingly sophisticated anti-bot measures in 2026, the difference between residential, datacenter, and rotating proxies matters more than ever. Each type has distinct trade-offs in cost, speed, and detection resistance.

In this guide, I break down the three main proxy categories, compare top providers like ThorData, Bright Data, and Oxylabs, and share practical Python code for implementing proxy rotation. Whether you’re scraping at scale or just getting started, this will help you pick the right tool for the job.

Read the full guide: Best Proxy Services for Web Scraping in 2026: Residential vs Datacenter vs Rotating

How to Scrape Amazon Reviews in 2026: Product Intelligence with Python

The Data Collector — Tue, 24 Mar 2026 10:05:54 GMT

Amazon reviews are one of the most valuable datasets in e-commerce. Whether you’re conducting product research, competitor analysis, or sentiment tracking, programmatic access to this data gives you a serious competitive advantage.

This guide provides a comprehensive overview of extracting Amazon review data in 2026 — from understanding Amazon’s anti-scraping measures to building production-ready Python code that handles pagination, rate limiting, and data extraction at scale.

If you’re building product intelligence tools or doing market research, this is your complete playbook.

👉 Read the full guide with code examples on Dev.to: How to Scrape Amazon Reviews in 2026: Product Intelligence with Python

How to Scrape Trustpilot Reviews in 2026: Build a Reputation Monitor with Python

The Data Collector — Tue, 24 Mar 2026 10:05:24 GMT

Trustpilot is one of the most trusted review platforms in the world — but manually tracking reviews for your brand or competitors doesn’t scale. What if you could extract this data programmatically and build automated monitoring systems?

In this guide, we walk through how to scrape Trustpilot review data using Python in 2026. You’ll learn how to handle Cloudflare protection, extract structured review data, and build a reputation monitoring pipeline that runs on autopilot.

Whether you’re doing competitor analysis, sentiment tracking, or lead generation — this guide covers the complete workflow from setup to production.

👉 Read the full guide with code examples on Dev.to: How to Scrape Trustpilot Reviews in 2026: Build a Reputation Monitor with Python

Google Maps Scraping in 2026: API vs. DIY — What Actually Works?

The Data Collector — Mon, 23 Mar 2026 13:06:08 GMT

If you’re collecting business data from Google Maps in 2026, you’ve probably noticed that neither approach — the official API nor web scraping — is straightforward anymore. I just published a deep-dive guide breaking down both methods, and here are the key takeaways.

The Official API: Clean but Costly

Google’s Places API gives you structured, reliable data, but it comes at a price. At roughly $17 per 1,000 requests, pulling 50,000 business listings costs around $850 at one request per listing, and around $1,700 if each listing needs a second call for details. For small projects under 1,000 businesses, the API is hard to beat: clean JSON responses, no maintenance headaches, and no risk of getting blocked.
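
The arithmetic is worth running for your own numbers before you commit. A small sketch, assuming the roughly $17 per 1,000 requests quoted above; the requests_per_listing parameter is our own addition for cases where each business needs more than one call.

```python
def places_api_cost(listings, price_per_1000=17.0, requests_per_listing=1):
    """Estimated cost in USD. Both defaults are illustrative assumptions."""
    total_requests = listings * requests_per_listing
    return total_requests * price_per_1000 / 1000


print(places_api_cost(1_000))                            # 17.0
print(places_api_cost(50_000))                           # 850.0
print(places_api_cost(50_000, requests_per_listing=2))   # 1700.0
```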

Scraping: Cheaper but Technically Demanding

Direct scraping of Google Maps is significantly harder than it was a few years ago. Google now uses heavy JavaScript rendering, protobuf-encoded data, sophisticated fingerprinting, and aggressive CAPTCHA challenges. Simply sending HTTP requests and parsing HTML won’t get you far.

The workaround? Instead of scraping Maps directly, you can target Google Search local results (using the tbm=lcl parameter), which returns business listings in a more scrapable format. Combined with a proxy rotation service like ScraperAPI to handle IP blocks and CAPTCHAs, this approach scales well for larger datasets at a fraction of the API cost.
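
As a rough illustration of the workaround, this is how a local-results URL could be assembled. The tbm=lcl parameter comes from the guide; using start as a pagination offset is our assumption, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def local_results_url(query, start=0):
    """Build a Google Search URL for local results (tbm=lcl).

    `start` as a result offset is an assumption, not something the guide states.
    """
    params = {"q": query, "tbm": "lcl", "start": start}
    return "https://www.google.com/search?" + urlencode(params)
```

In practice you would hand this URL to your proxy service rather than requesting it directly.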

The Bottom Line

The right choice depends entirely on scale:

Under 1,000 businesses — use the official Places API. The cost is manageable and you avoid all the technical complexity.
Larger datasets — a scraping approach with proper proxy infrastructure is more cost-effective, but budget time for maintenance as Google regularly changes its markup.

Whichever route you choose, always add delays between requests (2-5 seconds), validate your data quality early, and make sure the data you’re collecting is actually useful before scaling up.
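
The 2-5 second delay is easy to bake into the loop itself so nobody forgets it. A minimal sketch, with a randomized pause so requests do not arrive on a fixed beat; the helper name and the injectable sleep argument are ours.

```python
import random
import time


def polite_iter(items, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Yield items, pausing a random interval between them (not before the first)."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield item
```

Wrap any list of URLs in it, for example `for url in polite_iter(urls): ...`, and the pacing comes for free.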

Read the full guide on dev.to →

How to Scrape Amazon Product Prices in 2026

The Data Collector — Mon, 23 Mar 2026 12:14:36 GMT

Amazon is one of the most valuable, and most challenging, data sources on the internet. Real-time prices, competitor stock levels, customer ratings, and historical price changes are all there, but Amazon invests heavily in making scraping difficult. A careless burst of requests can earn you a 503 error, and persisting can get your IP blocked.

Our latest guide breaks down five practical methods for extracting Amazon product prices in 2026, from the official API to production-grade scraping tools that actually work at scale.

What the Guide Covers

The full tutorial walks through multiple extraction strategies, each suited to different use cases. You will learn how to pull pricing data from Amazon’s JSON-LD structured data blocks (the cleanest method), fall back to HTML parsing with CSS selectors for legacy pages, and use session-based requests with cookie management to look more like a real browser.
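
To make the JSON-LD method concrete, here is a sketch that pulls a price out of a Schema.org Product block using only the standard library. It assumes the page carries a script tag of type application/ld+json with a Product and an offers object; the class and function names are ours, and real pages may nest things differently.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than failing the whole page.


def extract_price(html):
    """Return (price, currency) from the first Product offer found, or None."""
    parser = JsonLdParser()
    parser.feed(html)
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            if isinstance(offers, dict) and "price" in offers:
                return float(offers["price"]), offers.get("priceCurrency")
    return None
```

When extract_price returns None, that is the cue to fall back to CSS selectors.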

For production use, the guide covers managed proxy services like ScraperAPI with Amazon-specific endpoints, as well as Amazon’s own Product Advertising API for legitimate access to pricing data. There is also a complete working Python script with fallback logic, rotating headers, SQLite storage, and a configurable price monitoring loop.
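
The storage side of a price monitor can stay very small. The following is a sketch of the idea, not the script from the guide: one SQLite table of readings, and a helper that records a new price and hands back the previous one so the caller can detect changes. Table and column names are assumptions.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the price history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "asin TEXT NOT NULL, price REAL NOT NULL, checked_at REAL NOT NULL)"
    )
    return conn


def record_price(conn, asin, price, now=None):
    """Store a reading and return the previous price, or None if it is the first."""
    row = conn.execute(
        "SELECT price FROM prices WHERE asin = ? "
        "ORDER BY checked_at DESC, rowid DESC LIMIT 1",
        (asin,),
    ).fetchone()
    conn.execute(
        "INSERT INTO prices (asin, price, checked_at) VALUES (?, ?, ?)",
        (asin, price, time.time() if now is None else now),
    )
    conn.commit()
    return row[0] if row else None
```

A monitoring loop then reduces to: fetch the price, call record_price, and alert when the returned value differs from the new one.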

Why Price Monitoring Matters

Whether you are running an e-commerce business tracking competitor prices, building a price comparison tool, doing market research on product categories, or looking for arbitrage opportunities, automated Amazon price monitoring gives you a real edge. The guide covers the anti-bot measures you will face — JavaScript rendering, behavioral fingerprinting, CAPTCHA challenges — and how to handle each one.

You will also learn best practices for responsible scraping: 5–10 second intervals between requests, user-agent rotation, and graceful error handling when Amazon’s defenses kick in.
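
User-agent rotation and graceful error handling fit naturally into one retry helper. This is a sketch under stated assumptions: fetch is a hypothetical callable you supply that returns (status_code, body), the User-Agent strings are placeholders, and treating 429 and 503 as "back off and retry" is our choice rather than anything Amazon documents.

```python
import itertools
import random
import time

# Placeholder strings; substitute real, current browser User-Agents.
USER_AGENTS = [
    "Mozilla/5.0 (placeholder-agent-1)",
    "Mozilla/5.0 (placeholder-agent-2)",
]


def fetch_with_retries(url, fetch, max_attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url, headers) until it succeeds or attempts run out.

    Returns the body on HTTP 200, or None if every attempt was blocked
    or the server answered with an unexpected status.
    """
    agents = itertools.cycle(USER_AGENTS)
    for attempt in range(max_attempts):
        status, body = fetch(url, {"User-Agent": next(agents)})
        if status == 200:
            return body
        if status in (429, 503) and attempt < max_attempts - 1:
            # Exponential backoff starting at base_delay, plus up to 1 s of jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
            continue
        break
    return None
```

Returning None instead of raising lets the monitoring loop log the miss and move on to the next product.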

Read the full guide on dev.to →

Subscribe to The Data Collector for weekly web scraping tutorials and data engineering guides.

How to Scrape LinkedIn Job Listings in 2026

The Data Collector — Mon, 23 Mar 2026 12:11:58 GMT

LinkedIn hosts over 30 million active job listings at any given time, making it the single richest source of employment data on the internet. For recruiters, market researchers, and developers building job aggregation tools, accessing this data programmatically is a game-changer.

The good news? LinkedIn exposes a surprising amount of job data through public guest-facing endpoints — the same ones search engines use to index job listings. No API keys, no login credentials, no paid subscriptions. Just HTTP requests and a bit of Python.

What the Guide Covers

Our full guide walks through the public endpoint approach in detail. You will learn how to query LinkedIn’s unauthenticated jobs-guest API to search for listings by keyword and location, extract full job descriptions using individual job IDs, and export everything to structured CSV files — all with basic Python libraries like requests and BeautifulSoup.
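
Here is a rough sketch of the plumbing around that workflow: building the request URLs and exporting results to CSV. The endpoint paths below are assumptions on our part (the summary only names the jobs-guest API), so verify them against live traffic before relying on them; the field names in the CSV are illustrative too.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint paths; not confirmed by the summary above.
SEARCH_URL = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
DETAIL_URL = "https://www.linkedin.com/jobs-guest/jobs/api/jobPosting/{job_id}"


def search_url(keywords, location, start=0):
    """URL for one page of search results; `start` is the result offset."""
    query = urlencode({"keywords": keywords, "location": location, "start": start})
    return SEARCH_URL + "?" + query


def detail_url(job_id):
    """URL for a single job's full description."""
    return DETAIL_URL.format(job_id=job_id)


def jobs_to_csv(jobs, fields=("job_id", "title", "company", "location")):
    """Serialize a list of job dicts to CSV text, ignoring unknown keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()
```

The HTML parsing between those two steps depends on LinkedIn's current markup, which is why it is left out here.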

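The guide also covers responsible scraping practices: adding 2–3 second delays between requests, rotating user agents, and understanding the legal landscape, including hiQ Labs v. LinkedIn, where the Ninth Circuit held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act. That ruling is narrower than it is often made out to be: the district court later found that hiQ had breached LinkedIn’s User Agreement, so terms of service still matter.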

For higher-volume needs, the guide covers proxy services like ScraperAPI and pre-built Apify actors that handle rate limiting and IP rotation automatically — so you can scale from a few hundred to tens of thousands of listings without getting blocked.

Why This Matters

Whether you are building a niche job board, tracking salary trends across industries, monitoring competitor hiring patterns, or feeding data into a job recommendation engine, LinkedIn’s public endpoints give you a legitimate starting point. The complete tutorial includes working Python code you can run immediately, plus scaling strategies for production use.

Read the full guide on dev.to →

Build a Bluesky Analytics Dashboard with Python (Step-by-Step)

The Data Collector — Fri, 20 Mar 2026 18:11:19 GMT

Bluesky just crossed 30 million users, and the best part? Its AT Protocol makes all public data freely accessible — no API keys, no rate limit headaches, no $42K/month Twitter API bills.

In this tutorial, you'll build a Python dashboard that scrapes Bluesky posts, analyzes engagement patterns, and outputs actionable insights. Whether you're doing brand monitoring, competitor research, or tracking trending topics — this gets you from zero to working analytics in about 30 minutes.

What We're Building

A Python script that:

Scrapes Bluesky posts matching your search terms
Processes the data into a structured format
Generates engagement analytics (top posts, posting times, hashtag frequency)
Outputs a clean CSV report you can open in any spreadsheet tool

Prerequisites: Python 3.8+, an Apify free account (no credit card needed), and 10 minutes of patience.

Step 1: Set Up the Bluesky Scraper

We use the Bluesky Posts Scraper on Apify, which handles all the AT Protocol complexity for you. It supports search queries, user profile scraping, and hashtag tracking.

The script then scores each post with an engagement formula that weights replies highest (they indicate conversation), then reposts (reach), then likes (passive appreciation). Adjust these weights based on what matters for your use case.
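
A weighted score like that takes only a few lines. In this sketch the ordering (replies, then reposts, then likes) follows the description above, but the actual numbers 3, 2, and 1 are placeholders we picked for illustration, as are the field names.

```python
# Illustrative weights only: the ordering comes from the text, the numbers do not.
WEIGHTS = {"replies": 3.0, "reposts": 2.0, "likes": 1.0}


def engagement_score(post, weights=WEIGHTS):
    """Weighted sum of a post's interaction counts; missing fields count as 0."""
    return sum(weight * post.get(field, 0) for field, weight in weights.items())


def top_posts(posts, limit=10):
    """Return the highest-scoring posts, best first."""
    return sorted(posts, key=engagement_score, reverse=True)[:limit]
```

With these weights, a post with 2 replies, 1 repost, and 10 likes scores 18, ahead of a post with 15 likes and nothing else.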

Step 2: Generate Analytics and Visualizations

The script generates a terminal report with top posts by engagement, best posting times, most active authors, and day-of-week analysis. You also get a CSV export and optional matplotlib charts.
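
The posting-time and day-of-week breakdowns are simple counting once timestamps are parsed. A minimal sketch, assuming each post carries a created_at field with an ISO 8601 timestamp (the field name is our assumption):

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def posting_patterns(posts):
    """Count posts per hour of day and per weekday."""
    hours, weekdays = Counter(), Counter()
    for post in posts:
        # Normalize a trailing "Z" so older Python versions can parse it.
        created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))
        hours[created.hour] += 1
        weekdays[DAYS[created.weekday()]] += 1
    return hours, weekdays
```

Calling hours.most_common(3) on the result gives the three busiest hours, in the timezone of the timestamps.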

Real-World Use Cases

Brand monitoring: Track mentions of your company, product, or competitors.
Content research: Find what topics get the most engagement in your niche.
Academic research: Structured data for discourse analysis.
Trend detection: Run the scraper on a schedule and compare engagement patterns over time.

Read the full tutorial with complete source code on dev.to →

ScraperAPI vs Scrape.do vs ScrapeOps: Which Is Worth Paying For in 2026?

The Data Collector — Fri, 20 Mar 2026 18:08:22 GMT

If you’re scraping the web in 2026, you’ve probably hit the wall: CAPTCHAs, IP bans, JavaScript rendering issues. Managed scraping APIs promise to handle all of that for you — proxy rotation, retries, browser rendering — so you can focus on data, not infrastructure.

We tested three popular options head-to-head: ScraperAPI, Scrape.do, and ScrapeOps. Each takes a different approach to the same problem, and the differences matter more than you’d think.

ScraperAPI came out as the all-rounder: 40M+ residential IPs, ~98% success rate in our tests, and the most generous free tier at 5,000 requests/month. It also includes structured data endpoints for Amazon and Google. Scrape.do is the budget pick at $29/month, but with a smaller proxy pool and no structured data parsing. ScrapeOps ($75/month) shines if you’re already using Scrapy at scale, with excellent monitoring dashboards.

Feature                 ScraperAPI      Scrape.do       ScrapeOps
Free Tier               5,000 req/mo    1,000 req/mo    1,000 req/mo
Starting Price          $49/mo          $29/mo          $75/mo
Success Rate (tested)   ~98%            ~95%            ~92%
Residential Proxies     40M+ IPs        Available       Datacenter only
Structured Data         Yes             No              Yes

Bottom line: For most use cases, ScraperAPI offers the best overall value. But don’t take our word for it — test it on your actual targets using the free tier before committing.

This is a summary. Read the full comparison on Dev.to

Residential vs Datacenter Proxies: A Practical Guide for 2026

The Data Collector — Fri, 20 Mar 2026 18:07:52 GMT

The proxy you choose can make or break your scraping operation. Residential proxies route through real ISPs and look like genuine users. Datacenter proxies are fast and cheap but easier to detect. The cost difference is staggering: 10-50x more for residential.

So when do you actually need residential? In our testing, datacenter proxies handle public APIs, government portals, news sites, and most targets without sophisticated anti-bot systems just fine — with 95%+ success rates at a fraction of the cost ($0.001-$0.003 per request vs $0.01-$0.05 for residential).

Residential becomes essential for sites with aggressive bot detection: Amazon, LinkedIn, AliExpress, Glassdoor, and anything behind Cloudflare’s toughest settings. If you’re scraping these targets, residential IPs are not optional — they’re the cost of doing business.

The smart play is a tiered strategy: start with datacenter, retry failures through residential, and consider a managed API like ScraperAPI for mixed workloads. Our testing showed managed APIs can match residential success rates (~98%) at datacenter-like costs (~$0.005/page) by intelligently routing requests.
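
The tiered strategy maps onto a short fallback loop. In this sketch each tier is a (name, fetch) pair, where fetch is a hypothetical callable you provide that returns the page body or raises when the request is blocked; nothing here is tied to a particular proxy vendor.

```python
def tiered_fetch(url, tiers):
    """Try each (name, fetch) pair in order and return (name, body) on first success.

    Raises RuntimeError if every tier fails.
    """
    errors = []
    for name, fetch in tiers:
        try:
            return name, fetch(url)
        except Exception as exc:  # A real pipeline would catch narrower error types.
            errors.append((name, exc))
    raise RuntimeError("all tiers failed: {}".format(errors))
```

List the cheapest tier first, for example `[("datacenter", dc_fetch), ("residential", res_fetch)]`, and log the returned tier name so you can see how often the expensive path is actually used.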

Quick decision guide: Datacenter-only costs ~$200/month with ~40% success on tough sites. Residential-only runs ~$3,000/month with ~97% success. A managed API like ScraperAPI hits ~$490/month with ~98% success — often the sweet spot.

This is a summary. Read the full guide on Dev.to

HN, Substack, and GitHub: Free Scraping API, No Signup Required

The Data Collector — Fri, 20 Mar 2026 18:07:10 GMT

Sometimes you don’t need a full scraping framework. You just want to pull data from Hacker News, Substack, or GitHub — quickly, without signing up for anything. That’s exactly what this lightweight API does.

Running on minimal infrastructure, it offers three endpoints you can hit right now with curl. No API key signup required — just use the demo key:

Hacker News Search:

curl 'https://frog03-20494.wykr.es/api/v1/hn?q=AI+scraping&limit=5&api_key=demo-key-2026'

Substack Articles:

curl 'https://frog03-20494.wykr.es/api/v1/substack?publication=platformer&limit=3&api_key=demo-key-2026'

GitHub Repository Search:

curl 'https://frog03-20494.wykr.es/api/v1/github?q=python+scraper&limit=5&api_key=demo-key-2026'

The free tier gives you 10 requests/day with the demo-key-2026 key. For heavier use, there’s a $9.99/month paid tier. It’s a side project built for personal projects and prototyping — for production workloads, the author recommends dedicated scraping tools.

Try it yourself: Browse the API docs | Read the full article on Dev.to

Best Walmart Scrapers in 2026: A Complete Comparison

Fri, 20 Mar 2026 10:42:35 GMT

We just published a detailed guide on dev.to comparing the best Walmart scrapers available in 2026. Here's what you'll learn:

Top Walmart scraping tools compared side-by-side
Pricing, features, and performance benchmarks
Which scraper works best for different use cases
Tips for reliable Walmart data extraction

Read the full guide on dev.to: https://dev.to/the-data-collector

How to Scrape Bandcamp Music Data in 2026

Fri, 20 Mar 2026 10:40:17 GMT

We just published a comprehensive dev.to guide to scraping Bandcamp music data. Here's what you'll learn:

How to extract artist, album, and track data from Bandcamp
Best tools and approaches for Bandcamp scraping
Handling Bandcamp's site structure effectively
Use cases for Bandcamp data collection

Read the full guide on dev.to: https://dev.to/the-data-collector

How to Get Startup & Company Data from Crunchbase

Fri, 20 Mar 2026 10:38:52 GMT

We just published an in-depth dev.to guide to extracting startup and company data from Crunchbase. Here's what you'll learn:

Methods to scrape Crunchbase company profiles
Getting funding rounds, investors, and employee data
Comparing Crunchbase scraping tools
Legal considerations and best practices

Read the full guide on dev.to: https://dev.to/the-data-collector

Scraping Bandcamp in 2026: The Most Scraper-Friendly Music Platform

The Data Collector — Fri, 20 Mar 2026 10:02:53 GMT

Bandcamp is one of the most scraper-friendly platforms on the web — if you know where to look.

The platform embeds extractable data in JSON-LD (Schema.org), data-tralbum attributes with track details, and data-band attributes for artist profiles — all without needing API access.

Key Insights

Three data sources in every page — JSON-LD, data-tralbum, and data-band attributes give you tracks, albums, pricing, and artist info from a single HTTP request.
Minimal anti-bot protection — straightforward HTTP requests work with basic rate limiting (1-2 seconds between requests).
Server-rendered HTML — tag and search pages allow pagination-based scraping without JavaScript execution, simplifying infrastructure.
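
To show how little code the attribute-based sources need, here is a sketch that reads data-tralbum and data-band with the standard library. It assumes both attributes hold HTML-escaped JSON, which the parser unescapes for us; the class and function names are ours, and the structure inside the JSON is whatever Bandcamp puts there.

```python
import json
from html.parser import HTMLParser


class DataAttrParser(HTMLParser):
    """Collect JSON stored in data-tralbum and data-band attributes."""

    WANTED = ("data-tralbum", "data-band")

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Keep the first occurrence of each attribute on the page.
            if name in self.WANTED and value and name not in self.found:
                try:
                    self.found[name] = json.loads(value)
                except json.JSONDecodeError:
                    pass  # Ignore attributes that are not valid JSON.


def extract_bandcamp_data(html):
    """Return a dict keyed by attribute name with the decoded JSON payloads."""
    parser = DataAttrParser()
    parser.feed(html)
    return parser.found
```

One HTTP request plus this function gives you the raw track and artist payloads to pick fields from.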

The full article covers track extraction, album scraping, tag browsing, search pagination, and artist analytics with Python examples.

👉 Read the full article on Dev.to →

Scraping Metacritic in 2026: Clean JSON Without API Keys

The Data Collector — Fri, 20 Mar 2026 10:02:52 GMT

Metacritic lacks a public API — yet its frontend calls a backend service at backend.metacritic.com that returns clean JSON, eliminating the need for HTML parsing or API keys.

Standard browser headers (User-Agent, Referer, Origin) are all you need. No API key. No authentication token. No OAuth flow.

Key Insights

Hidden backend API — backend.metacritic.com exposes structured JSON endpoints that power the frontend, making scraping trivial.
No authentication required — just send standard browser headers and you get full access to game scores, reviews, and metadata.
Async-ready architecture — the guide demonstrates concurrent requests with semaphores for efficient bulk data collection.
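
The semaphore pattern mentioned above looks roughly like this. It is a generic sketch rather than the guide's code: worker stands in for whatever coroutine fetches one item, and the concurrency limit of 5 is an arbitrary default.

```python
import asyncio


async def gather_limited(items, worker, max_concurrency=5):
    """Run worker(item) for every item with at most max_concurrency in flight.

    Results come back in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run(item) for item in items))
```

Run it with `asyncio.run(gather_limited(slugs, fetch_game))`, where fetch_game is your own coroutine, and lower the limit if you start seeing blocks.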

The full article covers fetching critic and user reviews, browsing games by platform, async scraping patterns, and rate limiting best practices.

👉 Read the full article on Dev.to →

Scraping SoundCloud in 2026: The Hidden API Behind the Music

The Data Collector — Fri, 20 Mar 2026 10:02:51 GMT

SoundCloud shut down public API registrations in 2017 — but developers can still access music data through an undocumented internal API hiding in plain sight.

The platform's frontend quietly calls api-v2.soundcloud.com with an embedded client_id that rotates every few weeks. By extracting this ID from JavaScript bundles, you can query track metadata, artist profiles, playlists, and search results — all returning clean JSON.

Key Insights

The client_id lives in JS bundles — SoundCloud embeds a 32-character identifier that changes periodically, so your scraper needs to re-extract it.
The internal API returns structured JSON — play counts, likes, comments, artist data, and full track metadata without authentication.
Rate limiting is real at scale — expect IP blocks and pagination complexity when scraping thousands of tracks.
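
Re-extracting the identifier is mostly a regular-expression job. The sketch below assumes the bundle contains something shaped like client_id:"..." or client_id="..." with a 32-character alphanumeric value; the exact shape is our guess, so adjust the pattern to what you see in the current bundles.

```python
import re

# Assumed shape of the embedded identifier; verify against a real bundle.
CLIENT_ID_RE = re.compile(r'client_id\s*[:=]\s*"([A-Za-z0-9]{32})"')


def find_client_id(bundle_text):
    """Return the first 32-character client_id found in the text, or None."""
    match = CLIENT_ID_RE.search(bundle_text)
    return match.group(1) if match else None
```

Because the value changes periodically, call this again whenever the API starts rejecting requests instead of caching the ID forever.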

Full article covers track scraping, artist stats, search, and playlist extraction with complete Python examples.

👉 Read the full article on Dev.to →

What We Learned Building 11 Web Scrapers in One Week

The Data Collector — Fri, 20 Mar 2026 08:51:03 GMT

We spent the last week building 11 Apify actors for scraping different platforms. Not toy scrapers — production-ready actors that handle pagination, rate limits, and all the weird edge cases real websites throw at you. Here’s what we actually discovered along the way.

Pinterest: The Hidden JSON Goldmine

This one blew our minds. Pinterest stores ALL page data in a JSON blob called __PWS_INITIAL_PROPS__ embedded directly in the HTML. No JavaScript execution needed. No headless browser. No Playwright. One single curl request gets you everything — pins, boards, user data, the whole lot. We went from “this will probably need a full browser” to “wait, it’s just… there?” in about ten minutes. If you’re scraping Pinterest with a headless browser, you’re massively overcomplicating it.
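
In code, pulling that blob out is a regex and a json.loads call. This sketch assumes the JSON sits inside a script tag whose id is __PWS_INITIAL_PROPS__; the tag layout is our assumption, and what lives inside the JSON is up to Pinterest.

```python
import json
import re

# Assumes <script id="__PWS_INITIAL_PROPS__" ...>{...}</script> in the page HTML.
PWS_RE = re.compile(
    r'<script[^>]*\bid="__PWS_INITIAL_PROPS__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_initial_props(html):
    """Return the decoded JSON blob, or None if the script tag is missing."""
    match = PWS_RE.search(html)
    return json.loads(match.group(1)) if match else None
```

A None result is a useful signal that the page layout changed and the collector needs attention.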

eBay: Surprisingly Wide Open

We expected eBay to put up a fight. It’s one of the biggest e-commerce sites in the world. But eBay serves clean, well-structured HTML to datacenter IPs with essentially no anti-bot measures. No CAPTCHAs, no fingerprinting, no IP blocks. Compared to most e-commerce platforms that throw everything at you, eBay was remarkably straightforward. We had a working scraper in under two hours. Sometimes the simplest explanation is the right one — not every site is trying to stop you.

The Residential Proxy Tax

Reddit, AliExpress, and TikTok all block datacenter IPs aggressively. If you’re hitting them from AWS or any cloud provider, you’re getting blocked immediately. They require residential proxies, which is exactly why they’re hard to scrape reliably — and why scrapers for these platforms tend to be more expensive to run. This is the hidden cost nobody talks about when they say “just scrape it.” Residential proxies eat into your margins fast.

TikTok vs. GitHub: A Tale of Two APIs

Here’s a fun contrast. TikTok has over 4 million Apify runs — clearly massive demand. But the top TikTok actor only has 2.0 out of 5 stars. Why? Because TikTok recently broke profile video feeds even when using stealth Playwright with all the anti-detection tricks. The platform is actively hostile to automation.

GitHub, on the other hand, is the polar opposite. Their public API gives you 60 requests per hour without any authentication at all. Add a free personal access token and you get 5,000 requests per hour. No proxies needed, no browser automation, no cat-and-mouse games. It’s almost like they want you to build on their data.
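
Staying inside those limits is straightforward because GitHub reports them in response headers. A small sketch that reads X-RateLimit-Remaining and X-RateLimit-Reset (a Unix timestamp) and tells you how long to wait; the function name is ours, and it expects a plain dict with those exact key spellings.

```python
import time


def seconds_until_allowed(headers, now=None):
    """Seconds to wait before the next request, based on rate-limit headers.

    Returns 0.0 while requests remain in the current window.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

Sleep for the returned number of seconds after each response and the collector will never trip the limit.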

The Telegram Trick Nobody Knows

This one’s a gem. Telegram public channels can be scraped via t.me/s/{channel} — a clean, server-rendered web endpoint that most people completely overlook. No Telegram API credentials, no bot tokens, no library dependencies. Just plain HTTP requests to a public URL. We were honestly surprised more people aren’t using this approach.
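
A sketch of the first step, assuming each message in the preview page is wrapped in an element carrying a data-post attribute of the form channel/id. That attribute is our assumption about the markup, so check it against a real page; the helper names are ours.

```python
import re


def channel_preview_url(channel):
    """Public, server-rendered preview page for a Telegram channel."""
    return "https://t.me/s/{}".format(channel)


# Assumed markup: data-post="<channel>/<message id>" on each message wrapper.
POST_RE = re.compile(r'data-post="([^"/]+)/(\d+)"')


def list_post_ids(html):
    """Return (channel, message_id) pairs in page order."""
    return [(channel, int(post_id)) for channel, post_id in POST_RE.findall(html)]
```

Tracking the highest message ID you have seen is enough to fetch only new posts on the next run.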

The Takeaway

After building all 11 scrapers, the pattern is clear: the difficulty of scraping a platform has almost nothing to do with the complexity of the data and everything to do with how aggressively the platform fights automation. Pinterest and eBay are data-rich but easy. TikTok and Reddit have simple data structures but are nightmares to access reliably.

All 11 actors are live at apify.com/cryptosignals — if you need data from any of these platforms, check them out. We built them so you don’t have to fight these battles yourself.