Here is everything I know about web scraping

1. Avoid browser automation

Browser automation is slow, brittle, and expensive, and it doesn't scale. Many scraping tutorials reach for Selenium or Puppeteer, but that's the last thing you want to do. Sometimes you can't avoid it. In those cases, I try to use it sparingly, usually to grab cookies that I'm unable to get otherwise.

2. Use the site's APIs

A site has to get its data one of two ways: either it's server rendered, in which case you can just fetch the page, or it's client rendered, in which case you can call the same APIs the site uses to get its data.

And surprisingly, APIs that are called from the client (i.e. the browser) really can't be protected very well.

Calling APIs is waaaay faster and more reliable than using a browser automation tool like Selenium or Puppeteer to scrape the website, which can easily get blocked with captchas.
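
One rough way to tell which of the two cases you're in is to fetch the raw HTML without a browser and check whether some text you can see on the page is already in it. This is only a sketch, assuming Node 18+ for the built-in fetch; the function name looksServerRendered, the URL, and the search text are made up for illustration.

```javascript
// Returns true when `needle` (text you can see on the rendered page)
// already appears in the raw HTML, which suggests server rendering.
// `fetchImpl` is injectable so the check can be exercised without a network.
async function looksServerRendered(url, needle, fetchImpl = fetch) {
  const res = await fetchImpl(url)
  const html = await res.text()
  return html.includes(needle)
}

// Usage (hypothetical URL and text):
// looksServerRendered('https://example.com/products', 'Blue Widget').then(console.log)
```

If it comes back false, the data is probably loaded by a client-side API call, which is the case the rest of this section covers.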

How to find the site's APIs:

Right Click > Inspect Element > Network Tab > Fetch/XHR

Refresh the page to see if any requests come in. If they do, awesome. Click on them to see their response and if it includes the data you’re looking for.

Many of the requests are loading images, so they won’t have any text in the responses.

You can search through all the responses with Cmd + F to make it easier to find the request you’re looking for.

If you don’t see any API calls, try to click on the next page of results, or navigate around the site to see if any calls are fired off.

Once you find the request that you’re looking for, right click on it > Copy > Copy as Node.js fetch (or Copy as cURL for you Python devs)

This gets us all the cookies and most of the headers in order to make a successful request.

Paste it into your code editor. (Remember to await the fetch call, and then await reading the response body too, e.g. response.json())

Run it.
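
The two awaits mentioned above look roughly like this. It's a minimal sketch, assuming Node 18+ for the built-in fetch; fetchJson is a helper name I made up, and the URL and headers in the usage comment are placeholders for whatever "Copy as Node.js fetch" gives you.

```javascript
// Replays a request copied from the Network tab and parses the JSON body.
// `fetchImpl` is injectable so the helper can be tried without a network.
async function fetchJson(url, options = {}, fetchImpl = fetch) {
  // First await: the request itself (resolves once the headers arrive).
  const res = await fetchImpl(url, options)
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`)
  }
  // Second await: reading and parsing the response body.
  return await res.json()
}

// Usage with placeholder values:
// fetchJson('https://example.com/api/items?page=1', {
//   headers: { accept: 'application/json', cookie: 'session=PLACEHOLDER' },
//   method: 'GET',
// }).then((data) => console.log(data))
```

Throwing on a non-2xx status makes a blocked request fail loudly, so you don't end up trying to parse an error page as JSON.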

If you run it locally or on a server and it doesn’t work, there are a bunch of things you can try to get it to work:

Make sure you add a “User-Agent” header on the request. For some reason that Copy as Node.js fetch command doesn’t capture the user agent.
Use axios or my favorite, got-scraping. This puppy is a lifesaver. It's a really awesome package that lets you scrape a lot of sites pretty easily, and it gets around that "Please enable JavaScript" Cloudflare crap.
Use an agent (https.Agent or HttpsProxyAgent) with rejectUnauthorized: false, which skips TLS certificate checks. This hasn’t worked for me a ton, but it's worth giving a shot.

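import https from 'https'
// The `agent` option is a node-fetch feature; Node's built-in fetch ignores it.
import fetch from 'node-fetch'

// rejectUnauthorized: false turns off TLS certificate verification for this request.
const res = await fetch(`https://sample-site.chromedriver`, {
  agent: new https.Agent({
    rejectUnauthorized: false,
  }),
})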

Sometimes you need to get fresh cookies. Make a fetch request to any part of the website before you make the call to the API, and grab the cookies from the Set-Cookie header in response.headers.
Use a residential IP. This is pretty much your trump card, and it works in most cases. For cheap residential IPs with unlimited bandwidth, Storm Proxies is a steal. If the site is well protected though, you'll need a premium residential IP, and for those I use Smartproxy.
Use the site's mobile app, if it has one, to find the APIs. This doesn’t work for super dee duper secure apps like IG/Facebook, but on less secure ones it works wonders. Use mitmproxy to find the endpoints. Here is a video on how to use it: https://www.youtube.com/watch?v=xQGC-8ojYbU&t=1s
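
The User-Agent and fresh-cookie tips from the list above can be combined into something like the following. Treat it as a sketch under assumptions: it uses the built-in fetch and headers.getSetCookie(), which I'm assuming is available (Node 20+ to be safe), and the helper names, the User-Agent string, and both URLs are placeholders rather than anything specific to a real site.

```javascript
// A desktop Chrome User-Agent string; any current browser UA should do (assumption).
const USER_AGENT =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'

// Turns raw Set-Cookie header values into a single Cookie request header,
// keeping only the name=value part and dropping attributes like Path or HttpOnly.
function buildCookieHeader(setCookieValues) {
  return setCookieValues
    .map((value) => value.split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ')
}

// Hits a normal page first to pick up cookies, then calls the API with them.
// `pageUrl` and `apiUrl` are placeholders for the site you are scraping.
async function fetchWithFreshCookies(pageUrl, apiUrl, fetchImpl = fetch) {
  const pageRes = await fetchImpl(pageUrl, {
    headers: { 'User-Agent': USER_AGENT },
  })
  // getSetCookie() returns every Set-Cookie header as an array of strings.
  const cookie = buildCookieHeader(pageRes.headers.getSetCookie())
  const apiRes = await fetchImpl(apiUrl, {
    headers: { 'User-Agent': USER_AGENT, cookie },
  })
  return apiRes.json()
}

// Usage (hypothetical URLs):
// fetchWithFreshCookies('https://example.com/', 'https://example.com/api/items')
//   .then((data) => console.log(data))
```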

3. Fetch the page

If the site is server rendered, you'll have to fetch the page HTML and then parse it. To find that request, go to Inspect Element > Network Tab > All; it’s usually the first request.

If you're lucky, and in many cases you are, the site will have a JSON representation of the page that you can use instead of parsing the HTML.

This is especially true of Next.js sites. On the window object, there is a __NEXT_DATA__ object that has all the data you need. So in your console just type window.__NEXT_DATA__ and you'll see everything.
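
Outside the browser you don't have a window object, but on Next.js sites that expose this data it is normally embedded in the page HTML as a script tag with the id __NEXT_DATA__, so you can pull it straight out of the fetched page. Here's a minimal sketch, assuming Node 18+ for the built-in fetch; extractNextData and fetchNextData are hypothetical helper names, and some newer Next.js sites may not include the tag at all.

```javascript
// Pulls the JSON out of <script id="__NEXT_DATA__" ...> in a page's raw HTML.
// Returns the parsed object, or null when the tag is missing.
function extractNextData(html) {
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/,
  )
  return match ? JSON.parse(match[1]) : null
}

// Fetches a page and returns its embedded Next.js data, if any.
// `fetchImpl` is injectable so this can be tried without a network.
async function fetchNextData(url, fetchImpl = fetch) {
  const res = await fetchImpl(url)
  return extractNextData(await res.text())
}

// Usage (hypothetical URL):
// fetchNextData('https://example.com/products').then((data) => console.log(data?.props?.pageProps))
```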

If that doesn't work, then you have the painstaking task of parsing the HTML. Cheerio is a great library for this, or BeautifulSoup if you're using Python.

4. If you’ve gone through all of this and STILL can’t get the data you’re looking for, THEN you have my permission to use Puppeteer/Playwright/Selenium.

5. For the beefiest of beefiest sites, use undetected-chromedriver

It's only available in Python, but it saved my bacon on one project.

6. Now that you have the code written, where do you deploy it?

I am a big fan of AWS Lambda functions. They are ridiculously cheap, and probably the best thing is that the IP rotates automatically, preventing you from being blocked. The only bummer is it's impossible to know exactly how much each function is costing you, which is super annoying. There are services out there that will help you with this. I haven't tried one yet, but I plan to.

For jobs that run every day, or minute or whatever, I use cron jobs hosted on render.com. They are super cheap, and VERY easy to use. Think Heroku, but WAY better.

7. How to get around captchas

You shouldn't be running into captchas. I actually have never run into captchas. (Knock on wood)

8. How to get around IP blocks

Use Lambda functions or IPs from Storm Proxies or Smartproxy. Fun fact about Lambda functions: if you update them, you get a new IP. So if you're getting blocked, just update the function and you'll get a new IP. You can do that with the AWS SDK, like this:

import {
  LambdaClient,
  UpdateFunctionConfigurationCommand,
} from '@aws-sdk/client-lambda'

// Credentials are read from environment variables; set your own region.
const lambdaClient = new LambdaClient({
  region: 'your-region',
  credentials: {
    accessKeyId: process.env.accessKeyId,
    secretAccessKey: process.env.secretAccessKey,
  },
})

// Sends a configuration update for the given function.
export const updateLambdaFunction = async (functionName) => {
  const params = {
    FunctionName: functionName,
    // Careful: Environment replaces the function's existing variables,
    // so an empty Variables object clears any you have set.
    Environment: {
      Variables: {},
    },
  }
  try {
    await lambdaClient.send(new UpdateFunctionConfigurationCommand(params))
    console.log(`Success, lambda function updated`, functionName)
  } catch (err) {
    console.log('Error', err)
  }
}
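
To tie the update into a scrape, you could retry whenever a response looks blocked. The sketch below is one way to do that, not a definitive recipe: scrapeWithRotation, the scrape callback (whatever runs your request, e.g. invoking the Lambda, and resolves to an object with a status), the rotateIp callback, and the choice of 403/429 as "blocked" are all assumptions for illustration.

```javascript
// Status codes that usually mean "blocked or rate limited" (assumption).
const BLOCKED_STATUSES = new Set([403, 429])

// Runs `scrape`; when the result looks blocked, calls `rotateIp` and tries again.
// `rotateIp` is a hypothetical callback, e.g. () => updateLambdaFunction('my-scraper').
async function scrapeWithRotation(scrape, rotateIp, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await scrape()
    if (!BLOCKED_STATUSES.has(res.status)) {
      return res
    }
    // Only rotate when another attempt is still coming.
    if (attempt < maxAttempts) {
      await rotateIp()
    }
  }
  throw new Error(`Still blocked after ${maxAttempts} attempts`)
}

// Usage (hypothetical names):
// const res = await scrapeWithRotation(
//   () => invokeMyScraper(),
//   () => updateLambdaFunction('my-scraper'),
// )
```

Capping the attempts keeps a hard block from turning into an endless loop of function updates.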

9. Subscribe to my YouTube channel!

I have a LOT of helpful videos, if I do say so myself 😁