Talles
05/29/2025, 9:57 PM
🅰ndrew ✪
05/22/2025, 4:25 PM
hunterleung.
05/22/2025, 4:20 PM
Hall
05/18/2025, 7:47 PM
Amal Chandran
05/16/2025, 1:50 AM
MrSquaare
05/10/2025, 9:18 PM
BageDevimo
05/04/2025, 11:02 PM
NeoNomade | Scraping hellhound
04/23/2025, 10:32 AM
typescript
preNavigationHooks: [
    // Crawlee passes the crawling context first and gotoOptions second.
    async (_crawlingContext, gotoOptions) => {
        gotoOptions.waitUntil = "load";
    },
    async ({ page }) => {
        await page.route("**/*", async (route) => {
            const url = route.request().url();
            const resourceType = route.request().resourceType();
            const trackingScriptRegex =
                /googletagmanager|facebook|sentry|ads|tracking|metrics|analytics|optimizely|segment/i;
            const extraBlocklistRegex =
                /tiktok|facebook|prismic-images|bing|ads|tracking|metrics|analytics|contentsquare|lytics|adtrafficquality|adsrvr|tmol|snapchat|ticketm\.net/i;
            const isBlockedResourceType = ["stylesheet", "font", "media"].includes(resourceType);
            const isBlockedScript = resourceType === "script" && trackingScriptRegex.test(url);
            const isBlockedByExtraPatterns = extraBlocklistRegex.test(url);
            const shouldBlock =
                !url.includes("recaptcha") &&
                (isBlockedResourceType || isBlockedScript || isBlockedByExtraPatterns);
            if (shouldBlock) {
                await route.abort();
                return;
            }
            await route.continue();
        });
    },
],
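For context, a minimal sketch of how a hook list like this is wired into a crawler; the start URL is a placeholder, and the hook bodies above would go in the same array:
typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        // First argument is the crawling context, second is gotoOptions.
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = 'load';
        },
    ],
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.url}`);
    },
});

await crawler.run(['https://example.com']); // placeholder URL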
je
04/19/2025, 1:39 PM
import { PlaywrightCrawler } from 'crawlee'

// const proxyConfiguration = new ProxyConfiguration({
//     proxyUrls: [
//         '...'
//     ],
// })

const crawler: PlaywrightCrawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: false,
            // channel: 'chrome',
            // viewport: null,
        },
    },
    // proxyConfiguration,
    maxRequestRetries: 0,
    maxRequestsPerCrawl: 5,
    sessionPoolOptions: {
        blockedStatusCodes: [],
    },
    async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}...`)
        await page.waitForTimeout(100000)
    },
    failedRequestHandler({ request, log }) {
        log.info(`Request ${request.url} failed too many times.`)
    },
    // browserPoolOptions: {
    //     useFingerprints: false,
    // },
})

await crawler.addRequests([
    'https://abrahamjuliot.github.io/creepjs/'
])
await crawler.run()
console.log('Crawler finished.')
je
04/18/2025, 10:27 AM
je
04/16/2025, 11:18 AM
import defaultLog, { Log } from '@apify/log';
...
const crawler = new BasicCrawler({
    requestHandler: router,
    log: defaultLog.child({ prefix: 'MyCrawler' })
})
but then I get the following type error:
Type 'import("scrapers/node_modules/@apify/log/esm/index", { with: { "resolution-mode": "import" } }).Log' is not assignable to type 'import("scrapers/node_modules/@apify/log/cjs/index").Log'.
Types have separate declarations of a private property 'options'.ts(2322)
Thanks!
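One hedged way around this kind of ESM/CJS dual-declaration clash, assuming the project already depends on crawlee: take the log instance from crawlee's own re-export, so the crawler option and the logger resolve to a single declaration of @apify/log. A sketch:
typescript
import { BasicCrawler, Router, log as defaultLog } from 'crawlee';

// Stand-in for the router defined elsewhere in the original project.
const router = Router.create();

const crawler = new BasicCrawler({
    requestHandler: router,
    // crawlee re-exports @apify/log, so this Log type and the one the
    // crawler expects come from the same declaration file.
    log: defaultLog.child({ prefix: 'MyCrawler' }),
});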
Vice
04/13/2025, 9:02 PM
Martin
04/11/2025, 12:14 PM
Apify developer
04/09/2025, 5:46 AM
Lukas Sirhal
04/06/2025, 4:20 PM
new_in_town
04/01/2025, 4:55 PM
iDora
03/28/2025, 2:57 PM
Martin
03/28/2025, 12:38 PM
Cannot detect CDP client for Puppeteer ${parsed.version}. You should report this to Crawlee, mentioning the puppeteer version you are using.
(code-frame residue from a bundler module-not-found error; see https://nextjs.org/docs/messages/module-not-found)
Zhasulyainou
03/20/2025, 12:38 PM
Fabien
03/20/2025, 9:53 AM
Nel549
03/15/2025, 7:02 PM
javascript
import { crawler } from './main.js' // Import the exported crawler from main file
import express from "express";

const app = express();
app.use(express.json());

const BASE_URL = "https.....";

app.post("/scrape", async (req, res) => {
    if (!req.body || !req.body.usernames) {
        return res.status(400).json({ error: "Invalid input" });
    }
    const { usernames } = req.body;
    const urls = usernames.map(username => `${BASE_URL}${username}`);
    try {
        await crawler.run(urls);
        const dataset = await crawler.getData();
        return res.status(200).json({ data: dataset });
    } catch (error) {
        console.error("Scraping error:", error);
        return res.status(500).json({ error: "Scraping failed" });
    }
});

const PORT = parseInt(process.env.PORT) || 3000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));
Here is how my crawler looks:
javascript
import { CheerioCrawler, Configuration, Dataset, ProxyConfiguration, log } from 'crawlee';

const proxies = [...] // my proxy list

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: proxies,
});

export const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, json, proxyInfo }) => {
        log.info(JSON.stringify(proxyInfo, null, 2))
        // Scraping logic
        await Dataset.pushData({
            // pushing data
        });
    },
}, new Configuration({
    persistStorage: false,
}));
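One caveat worth flagging with this setup, as a hedged observation rather than a diagnosis: a single crawler instance cannot serve overlapping HTTP requests, since run() will not start while a previous run is still in progress, and concurrent runs would share the default dataset anyway. A sketch of a per-request variant (names and shapes are illustrative):
typescript
import express from 'express';
import { CheerioCrawler, Configuration } from 'crawlee';

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
    const results: unknown[] = [];
    // A fresh crawler (with in-memory storage) per request, so overlapping
    // calls cannot collide on run() or on a shared dataset.
    const crawler = new CheerioCrawler({
        requestHandler: async ({ json }) => {
            results.push(json); // collect in memory instead of Dataset.pushData
        },
    }, new Configuration({ persistStorage: false }));
    await crawler.run(req.body.urls); // assumes the caller sends { urls: [...] }
    res.json({ data: results });
});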
royrusso
03/14/2025, 1:49 AM
NeoNomade | Scraping hellhound
03/13/2025, 12:02 PM
2025-03-13T11:58:38.513Z [Crawler] [INFO ℹ️] Finished! Total 0 requests: 0 succeeded, 0 failed.
{"terminal":true}
2025-03-13T11:58:38.513Z [Crawler] [ERROR ❌] BrowserLaunchError: Failed to launch browser. Please check the following:
- Check whether the provided executable path "/Users/dp420/.cache/camoufox/Camoufox.app/Contents/MacOS/camoufox" is correct.
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).
Of course, neither of those two suggestions helps: the Camoufox binary is already there, and playwright install --with-deps has already been run, because the project was previously running Firefox.
The entire error log is attached:
https://cdn.discordapp.com/attachments/1349714201794314240/1349714201999970374/message.txt?ex=67d41ace&is=67d2c94e&hm=94ccecebb822a84aa03fb30f46efea8dab1fd4579034e8dd715ac737926f310f&
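To help isolate the failure, one option is to point Crawlee at the Camoufox binary explicitly through launchContext, which at least confirms whether Playwright's Firefox launcher can start that executable at all. A minimal sketch, reusing the path from the error message:
typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Camoufox is a Firefox fork, so it launches through the firefox launcher.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            executablePath: '/Users/dp420/.cache/camoufox/Camoufox.app/Contents/MacOS/camoufox',
        },
    },
    async requestHandler({ page, log }) {
        log.info(await page.title());
    },
});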
nikus
03/12/2025, 6:36 PM
Mathias Berwig
03/11/2025, 7:48 PM
crawler.run(["https://website.com/1234"]);
works locally, while in the Apify cloud it breaks with the following error: Reclaiming failed request back to the list or queue. TypeError: Invalid URL
It appears that, while running in the cloud, the URL is split character by character and each character creates its own request in the queue, as can be seen in the screenshot.
The bug happens whether the URL is hardcoded in the code or added dynamically via input.
I'm using crawlee 3.13.0.
Complete error stack:
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. TypeError: Invalid URL
2025-03-11T19:21:27.987Z at new URL (node:internal/url:806:29)
2025-03-11T19:21:27.988Z at getCookieContext (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:75:20)
2025-03-11T19:21:27.989Z at CookieJar.getCookies (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:452:23)
2025-03-11T19:21:27.989Z at CookieJar.callSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:168:16)
2025-03-11T19:21:27.990Z at CookieJar.getCookiesSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:575:21)
2025-03-11T19:21:27.991Z at Session.getCookies (/home/myuser/node_modules/@crawlee/core/session_pool/session.js:264:40)
2025-03-11T19:21:27.992Z at PlaywrightCrawler._applyCookies (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:344:40)
2025-03-11T19:21:27.992Z at PlaywrightCrawler._handleNavigation (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:329:20)
2025-03-11T19:21:27.993Z at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:260:13)
2025-03-11T19:21:27.994Z at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/playwright/internals/playwright-crawler.js:114:9) {"id":"PznVw0jlt50G6EL","url":"D","retryCount":1}
https://cdn.discordapp.com/attachments/1349106681929273344/1349106682592104529/image.png?ex=67d1e502&is=67d09382&hm=e256bfab84c422dac13112a30ab70ab4aa3d9d84f2360e9a1a5fab78a5e1a358&
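For what it's worth, single-character requests (note url "D" in the last stack line) are exactly what you get when a bare string reaches code that iterates over a list of requests, since strings are iterable character by character. A small illustration of that mechanism, not a confirmed diagnosis of the Crawlee internals:
typescript
// A string is an iterable of characters, so spreading one where an
// array of URLs is expected yields one "request" per character.
const input: Iterable<string> = 'https://website.com/1234';
const requests = [...input]; // ['h', 't', 't', 'p', 's', ':', '/', '/', ...]
console.log(requests.length); // 24 one-character "URLs", each an Invalid URL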
Casper
03/07/2025, 1:15 PM
bash
ERROR PlaywrightCrawler: Request failed and reached maximum retries. ApifyApiError: Dataset was not found
2025-03-06T17:37:21.112Z clientMethod: DatasetClient.pushItems
2025-03-06T17:37:21.113Z statusCode: 404
2025-03-06T17:37:21.115Z type: record-not-found
2025-03-06T17:37:21.119Z httpMethod: post
2025-03-06T17:37:21.120Z path: /v2/datasets/<redacted>/items
2025-03-06T17:37:21.122Z stack:
2025-03-06T17:37:21.124Z at makeRequest (/home/myuser/node_modules/apify-client/dist/http_client.js:187:30)
2025-03-06T17:37:21.125Z at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2025-03-06T17:37:21.127Z at async DatasetClient.pushItems (/home/myuser/node_modules/apify-client/dist/resource_clients/dataset.js:104:9)
2025-03-06T17:37:21.129Z at async processSingleReviewDetails (file:///home/myuser/dist/helperfunctions.js:365:5)
2025-03-06T17:37:21.131Z at async Module.processReviews (file:///home/myuser/dist/helperfunctions.js:379:13)
2025-03-06T17:37:21.133Z at async getReviews (file:///home/myuser/dist/main.js:37:5)
2025-03-06T17:37:21.135Z at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/myuser/dist/main.js:98:13)
2025-03-06T17:37:21.137Z at async wrap (/home/myuser/node_modules/@apify/timeout/cjs/index.cjs:54:21)
2025-03-06T17:37:21.139Z data: undefined {"id":"<redacted>","url":"<redacted>?sort=recency&languages=all","method":"GET","uniqueKey":"https://www.trustpilot.com/review/<redacted>?languages=all&sort=recency"}
How can I ensure that the datasets are created ahead of time, before the scraper starts collecting data, so it doesn't fail because a dataset can't be created or doesn't exist?
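One way to guarantee a named dataset exists before any handler pushes to it, sketched here with a placeholder name: open it once at startup, since Dataset.open() creates the store when it is missing, and push to that handle instead of the implicit default. On the Apify platform, Actor.openDataset() is the equivalent call.
typescript
import { Dataset } from 'crawlee';

// Opening the dataset up front creates it if it doesn't exist yet,
// so later pushData calls can't hit a 404 for a missing store.
const reviews = await Dataset.open('trustpilot-reviews'); // placeholder name
await reviews.pushData({ placeholder: true });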
Scai
03/05/2025, 7:57 PM
… enqueueLinks of it. These case-study links in turn also have multiple pages. When the crawler adds the links with the new label attached, nothing happens. When using only the case-study page, it scrapes the data and works. I'm not sure what to do next or how to test it further. Does the queue system wait for all links to be added before it starts scraping? (A sketch of the label pattern follows.)
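For reference, a minimal sketch of the enqueueLinks-with-label pattern in question, with placeholder selectors and label names; new requests are picked up as soon as crawler capacity frees up, rather than the queue waiting for all links to be added first:
typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Placeholder selector and label, for illustration only.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.case-study', label: 'CASE_STUDY' });
});

router.addHandler('CASE_STUDY', async ({ request, enqueueLinks, log }) => {
    log.info(`Scraping ${request.url}`);
    // Pagination within a case study re-enqueues with the same label.
    await enqueueLinks({ selector: 'a.next-page', label: 'CASE_STUDY' });
});

const crawler = new CheerioCrawler({ requestHandler: router });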
Jeno
03/05/2025, 7:08 PM
proxyConfiguration: new ProxyConfiguration({
    newUrlFunction: () => {
        return 'socks5h://brd-customer-...-zone-...:...@brd.superproxy.io:22228';
    },
})
Any idea how it should work?
Edit: since I use Camoufox, I tried:
firefoxUserPrefs: {
    'network.proxy.socks_remote_dns': true, // Enable remote DNS resolution
},
But it still just hangs.
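One way to narrow this down, offered as a sketch rather than a fix: hand the SOCKS proxy straight to the browser's launch options, bypassing ProxyConfiguration, to see whether the hang lives in the proxy layer or in the browser's SOCKS handling. The credentials stay elided as in the snippet above, and plain socks5:// is used, since socks5h is a curl-style convention and the remote-DNS preference covers that part in Firefox:
typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Proxy handed directly to the browser; credentials elided.
            proxy: { server: 'socks5://brd.superproxy.io:22228' },
            firefoxUserPrefs: {
                'network.proxy.socks_remote_dns': true, // proxy-side DNS
            },
        },
    },
    async requestHandler({ page, log }) {
        log.info(await page.title());
    },
});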
BOBPG
03/01/2025, 11:08 PM
Nth
03/01/2025, 3:58 PM