  • how to implement in easypanel
    Talles
    05/29/2025, 9:57 PM
    Guys, how do I run Crawlee on Easypanel? Please help, I can't get it to run with Playwright.
  • we are looking for a scraping expert
    🅰ndrew ✪
    05/22/2025, 4:25 PM
    We are seeking a skilled web scraper developer to create an efficient and reliable web scraping tool. The ideal candidate will have experience extracting data from various websites and handling different data formats. You will be responsible for building a scraper that can navigate through sites, collect data, and store it in a structured format. If you are proficient in web scraping techniques and have a strong understanding of HTML and JavaScript, please DM me with examples of previous web scraping projects.
  • How to improve recaptcha v3 's score?
    hunterleung.
    05/22/2025, 4:20 PM
    Hi there, does anyone have successful experience with reCAPTCHA v3? The target is https://completedns.com/dns-history/. I want to get data from it using a Python script. I have tested many reCAPTCHA solvers but still failed. Could anyone help? Thanks in advance.
  • Throttle on 429 responses
    Hall
    05/18/2025, 7:47 PM
    Someone will reply to you shortly. In the meantime, this might help:
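    For the 429 throttling asked about in the title, a minimal sketch of slowing Crawlee down and rotating sessions on rate-limit responses, assuming a CheerioCrawler (the limits shown are illustrative):
    typescript
    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
        // Keep overall throughput low so the target is less likely to rate-limit.
        maxConcurrency: 5,
        maxRequestsPerMinute: 60,
        // Retry rate-limited requests a few times before giving up.
        maxRequestRetries: 5,
        sessionPoolOptions: {
            // 429 is treated as "blocked", so the session is retired and rotated.
            blockedStatusCodes: [401, 403, 429],
        },
        async requestHandler({ request, $ }) {
            // ... scraping logic ...
        },
    });

    await crawler.run(['https://example.com']);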
  • LinkedIn Session Timeout
    Amal Chandran
    05/16/2025, 1:50 AM
    I was trying to automate some of my regular activity on LinkedIn using Stagehand and Browserbase. I have enabled the proxy and mapped it to a nearby location as well, but the account gets logged out after I perform one or two actions. The workflow: grab the cookies, user agent, and location from an already logged-in browser; create a Browserbase session using those cookies and proxies; navigate to posts and grab the impressions, likes, and comments for tracking progress. How can I overcome this issue?
  • Timeout in Docker (with Camoufox image)
    MrSquaare
    05/10/2025, 9:18 PM
    Hello everyone, I'm trying to create a scraper with Crawlee + Camoufox that I'll run in a Docker container. To do this, I used the Apify image for Camoufox (https://github.com/apify/apify-actor-docker/tree/master/node-playwright-camoufox) and followed the same tutorial as this one: https://docs.apify.com/sdk/js/docs/guides/docker-images. But, for some unknown reason, every request times out (even a request to google.com). Do you have any idea why this is happening? I tried a simple fetch, which works, so it doesn't seem to be a network issue.
  • Wiping session between inputs
    BageDevimo
    05/04/2025, 11:02 PM
    Hello! I'm crawling/scraping a site which involves doing the following steps for each input: 1. entering some data, 2. clicking "Load more" a bunch of times, 3. collecting the output. The problem is that the site behaves differently on the first entry versus the second, and it would be nice to run these in parallel. So I was going to use a new "session" for each one so that no data is retained between inputs, but I can't see how to do that. I'm guessing the site uses session cookies, localStorage, or some combination, and I can't see how to get it clean. I almost want each request in a new incognito tab, haha. Any tips?
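    A minimal sketch of one way to get a clean browser state per request, assuming a PlaywrightCrawler: incognito pages keep cookies and localStorage isolated per page, and single-use sessions stop anything carrying over between inputs:
    typescript
    import { PlaywrightCrawler } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        launchContext: {
            // Each page gets its own incognito browser context,
            // so cookies and localStorage are not shared between requests.
            useIncognitoPages: true,
        },
        useSessionPool: true,
        sessionPoolOptions: {
            // Throw each session away after a single request.
            sessionOptions: { maxUsageCount: 1 },
        },
        // Don't copy cookies from the session back into the browser context.
        persistCookiesPerSession: false,
        async requestHandler({ request, page }) {
            // ... fill the form, click "Load more", collect the output ...
        },
    });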
  • preNavigationHooks not followed
    NeoNomade | Scraping hellhound
    04/23/2025, 10:32 AM
    Camoufox JS integration is used. If I log something before the await page.route it works; inside page.route it doesn't.
    typescript
    preNavigationHooks: [
        // Crawlee passes the crawling context first and gotoOptions second.
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = "load";
        },
        async ({ page }) => {
            await page.route("**/*", async (route) => {
                const url = route.request().url();
                const resourceType = route.request().resourceType();
                const trackingScriptRegex =
                    /googletagmanager|facebook|sentry|ads|tracking|metrics|analytics|optimizely|segment/i;
                const extraBlocklistRegex =
                    /tiktok|facebook|prismic-images|bing|ads|tracking|metrics|analytics|contentsquare|lytics|adtrafficquality|adsrvr|tmol|snapchat|ticketm\.net/i;

                const isBlockedResourceType = ["stylesheet", "font", "media"].includes(resourceType);
                const isBlockedScript = resourceType === "script" && trackingScriptRegex.test(url);
                const isBlockedByExtraPatterns = extraBlocklistRegex.test(url);

                const shouldBlock =
                    !url.includes("recaptcha") &&
                    (isBlockedResourceType || isBlockedScript || isBlockedByExtraPatterns);

                if (shouldBlock) {
                    await route.abort();
                    return;
                }

                await route.continue();
            });
        },
    ],
  • Proxy settings appear to be cached
    je
    04/19/2025, 1:39 PM
    Hi, I'm trying to use residential proxies with a Playwright crawler, but it appears that even when I comment out the proxyConfiguration there is still an attempt to use a proxy. I created a fresh project with a minimal test to debug and it worked fine, until I had a proxy failure, and then it happened again. The error is: WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session... goto: net::ERR_TUNNEL_CONNECTION_FAILED, so clearly it's trying to use a proxy. I have verified this by looking at the process arguments, which include --proxy-bypass-list= --proxy-server=http://127.0.0.1:63572. Any ideas? It's driving me insane. Code as follows:
    import { PlaywrightCrawler } from 'crawlee'
    
    // const proxyConfiguration = new ProxyConfiguration({
    //   proxyUrls: [
    //     '...'
    //   ],
    // })
    
    const crawler: PlaywrightCrawler = new PlaywrightCrawler({
      launchContext: {
        launchOptions: {
          headless: false,
          // channel: 'chrome',
          // viewport: null,
        },
      },
      // proxyConfiguration,
      maxRequestRetries: 0,
      maxRequestsPerCrawl: 5,
      sessionPoolOptions: {
        blockedStatusCodes: [],
      },
      async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}...`)
        await page.waitForTimeout(100000)
      },
      failedRequestHandler({ request, log }) {
        log.info(`Request ${request.url} failed too many times.`)
      },
      // browserPoolOptions: {
      //   useFingerprints: false,
      // },
    })
    
    await crawler.addRequests([
      'https://abrahamjuliot.github.io/creepjs/'
    ])
    
    await crawler.run()
    
    console.log('Crawler finished.')
  • Caching requests for development and testing
    je
    04/18/2025, 10:27 AM
    Hi, I'm wondering what people are doing (if anything) to record and replay requests while building scrapers. A lot of building scrapers is trial and error, making sure you have the right selectors, JSON paths, etc., so I end up running my code a fair few times. I'd ideally cache the initial request to each endpoint and replay it when it's requested again, just for development, so I'm not continually hitting the website (both for politeness and to reduce the chances of triggering any anti-bot provisions). Thinking back to my Ruby days, there was a package called VCR which would do this if you instantiated it before HTTP requests, with ways to invalidate the cache. In JS there's Netflix's Polly, which I'm going to try out shortly, but I'm interested to hear what other people are doing/using, if anything. I'm using a mix of crawlers (Basic, Cheerio, Playwright), so I'm looking for something flexible. Cheers!
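    Lacking a built-in recorder, one home-grown option is to cache raw responses in a key-value store keyed by URL; a minimal sketch assuming a BasicCrawler, with an arbitrary `dev-cache` store name:
    typescript
    import { BasicCrawler, KeyValueStore } from 'crawlee';

    const cache = await KeyValueStore.open('dev-cache');

    const crawler = new BasicCrawler({
        async requestHandler({ request, sendRequest, log }) {
            // Key-value store keys are restricted, so sanitise the URL first.
            const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');
            let body = await cache.getValue<string>(key);
            if (body === null) {
                log.info(`Cache miss, fetching ${request.url}`);
                const response = await sendRequest();
                body = response.body;
                await cache.setValue(key, body, { contentType: 'text/html' });
            } else {
                log.info(`Cache hit for ${request.url}`);
            }
            // ... parse `body` (e.g. with cheerio.load) ...
        },
    });

    await crawler.run(['https://example.com']);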
  • Customising logging
    je
    04/16/2025, 11:18 AM
    Is there a recommended way to customise logging? I want to be able to log which specific crawler and which handler a log line is coming from. I have tried to override the logger in the crawler using
    import defaultLog, { Log } from '@apify/log';
    ...
    const crawler = new BasicCrawler({
      requestHandler: router,
      log: defaultLog.child({ prefix: 'MyCrawler' })
    })
    but then I get the following type error:
    Type 'import("scrapers/node_modules/@apify/log/esm/index", { with: { "resolution-mode": "import" } }).Log' is not assignable to type 'import("scrapers/node_modules/@apify/log/cjs/index").Log'.
      Types have separate declarations of a private property 'options'.ts(2322)
    Thanks!
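    One way around the ESM/CJS duplicate-type error is to use the logger that crawlee itself re-exports instead of importing @apify/log directly; a minimal sketch:
    typescript
    import { BasicCrawler, createBasicRouter, log } from 'crawlee';

    const router = createBasicRouter();

    const crawler = new BasicCrawler({
        requestHandler: router,
        // crawlee re-exports the same Log instance it is typed against,
        // so child loggers created from it satisfy the crawler's `log` option.
        log: log.child({ prefix: 'MyCrawler' }),
    });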
  • How to clear cookies?
    Vice
    04/13/2025, 9:02 PM
    I need to clear the cookies for a website before requesting it with the CheerioCrawler. How do I do it? TIA
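    A minimal sketch of the two knobs usually combined for this, assuming a CheerioCrawler: stop persisting response cookies into the session, and optionally make every session single-use:
    typescript
    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
        // Don't store response cookies on the session between requests.
        persistCookiesPerSession: false,
        sessionPoolOptions: {
            // Optionally make every session single-use, so no cookies can carry over.
            sessionOptions: { maxUsageCount: 1 },
        },
        async requestHandler({ request, $ }) {
            // ... scraping logic ...
        },
    });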
  • Browserless + Crawlee
    Martin
    04/11/2025, 12:14 PM
    Hello, is there any way to run Crawlee on Browserless?
  • How to handle a 403 error response using Puppeteer and JS when clicking a button that hits an API
    Apify developer
    04/09/2025, 5:46 AM
    We are building a scraper for a site that uses client-side pagination; when we click on the next page it calls an API, but the API returns 403 because it detects the request is coming from a bot. How can we bypass that when opening the browser or while scraping? Any suggestion will be helpful.
  • Request works in Postman but doesn't work in the crawler, even with a full browser
    Lukas Sirhal
    04/06/2025, 4:20 PM
    Hello, I'm trying to handle an AJAX call via got-scraping. I prepared the call in Postman, where it works fine, but if I try it in an Actor I get a 403 every time. Even if I try it via Puppeteer or Playwright and click the button that triggers the request, I get a response with a geo.captcha-delivery.com/captcha URL to solve. Can anybody give me any advice on how to handle this issue?
  • about RESIDENTIAL proxies
    new_in_town
    04/01/2025, 4:55 PM
    Hi all, what is your experience with RESIDENTIAL proxies? Let us share: provider URL, price per GB of residential traffic, and their advantages/disadvantages. My experience: iproyal.com, "royal-residential-proxies", $5.51 per GB with the "Pay As You Go" option; I paid $66.15 for 12 GB. These are good proxies, everything works, but they are expensive. Recently I've been seeing that the gigabytes I bought are running out too fast.
  • served with unsupported charset/encoding: ISO-88509-1
    iDora
    03/28/2025, 2:57 PM
    Reclaiming failed request back to the list or queue. Resource http://www.etmoc.com/look/Looklist?Id=47463 served with unsupported charset/encoding: ISO-88509-1
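    The server is most likely declaring a misspelled charset (ISO-88509-1 instead of ISO-8859-1). A minimal sketch of overriding it, assuming a CheerioCrawler; forceResponseEncoding ignores whatever the server declares, while suggestResponseEncoding would only act as a fallback for missing or invalid declarations:
    typescript
    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
        // Decode every response as ISO-8859-1 regardless of the declared charset.
        forceResponseEncoding: 'ISO-8859-1',
        async requestHandler({ request, $ }) {
            // ... scraping logic ...
        },
    });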
  • Cannot detect CDP client for Puppeteer
    Martin
    03/28/2025, 12:38 PM
    Hi, how do I fix this?
    Failed to compile
    ./node_modules/.pnpm/@crawlee+puppeteer@3.13.0_playwright@1.50.1/node_modules/@crawlee/puppeteer/internals/utils/puppeteer_utils.js:224:22
    Module not found: Can't resolve 'puppeteer/package.json'
      222 |     return client.send(command, ...args);
      223 | }
    > 224 | const jsonPath = require.resolve('puppeteer/package.json');
          |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      225 | const parsed = JSON.parse(await (0, promises_1.readFile)(jsonPath, 'utf-8'));
      226 | throw new Error(
          |     `Cannot detect CDP client for Puppeteer ${parsed.version}. You should report this to Crawlee, mentioning the puppeteer version you are using.`
          | );
      227 | }
    https://nextjs.org/docs/messages/module-not-found
  • error in loader module
    Zhasulyainou
    03/20/2025, 12:38 PM
    Hi! Error with Lodash in Crawlee, please help. I ran the Actor and got this error. I tried changing to different versions of Crawlee, but the error still persists.
    node:internal/modules/cjs/loader:1140
    const err = new Error(message);
    ^
    Error: Cannot find module './_baseGet'
    Require stack:
    - C:\wedat\dat-spain\apps\actor\node_modules\lodash\get.js
    - C:\wedat\dat-spain\apps\actor\node_modules\@sapphire\shapeshift\dist\cjs\index.cjs
    - C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\memory-storage.js
    - C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\memory-storage\index.js
    - C:\wedat\dat-spain\apps\actor\node_modules\@crawlee\core\configuration.js
  • Saving the working configurations & Sessions for each sites
    Fabien
    03/20/2025, 9:53 AM
    Hi! I'm new to Crawlee and super excited to migrate my scraping architecture to it, but I can't find how to achieve this. My use case: I'm scraping 100 websites multiple times a day. I'd like to save the working configuration (cookies, headers, proxy) for each site. From what I understand, Sessions are made for this. However, I'd like to have the working Sessions in my database: that way working sessions persist even if the script shuts down. Saving the working configurations in a database would also be useful when scaling Crawlee to multiple server instances. My ideal scenario would be to save the full configuration for each site (including the type of crawler used (cheerio, got, playwright), CSS selectors, proxy needs, headers, cookies...). Thanks a lot for your help!
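    Crawlee can persist the session pool (cookies, usage counts, error scores) to a key-value store between runs; syncing that state to an external database would have to be built on top of it. A minimal sketch of the built-in persistence knobs, assuming a CheerioCrawler (the store and key names are arbitrary):
    typescript
    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
        useSessionPool: true,
        // Store response cookies on the session so they are reused (and persisted).
        persistCookiesPerSession: true,
        sessionPoolOptions: {
            maxPoolSize: 20,
            // Session pool state is periodically written to this key-value store
            // and restored on the next run.
            persistStateKeyValueStoreId: 'session-state',
            persistStateKey: 'SESSION_POOL_STATE_example_com',
        },
        async requestHandler({ request, $, session }) {
            // session.cookieJar holds the cookies currently working for this site.
        },
    });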
  • Request queue with id: [id] does not exist
    Nel549
    03/15/2025, 7:02 PM
    I created an API with Express that runs Crawlee when an endpoint is called. Weirdly, it works completely fine on the first request I make to the API but fails on the subsequent ones with the error: Request queue with id: [id] does not exist. I think I'm making some JavaScript mistake, to be honest; I don't have much experience with it. Here is how I'm building the API:
    javascript
    import { crawler } from './main.js'  // Import the exported crawler from main file
    import express from "express";
    
    const app = express();
    app.use(express.json());
    
    const BASE_URL = "https.....";
    
    app.post("/scrape", async (req, res) => {
        if (!req.body || !req.body.usernames) {
            return res.status(400).json({ error: "Invalid input" });
        }
    
        const { usernames } = req.body;
        const urls = usernames.map(username => `${BASE_URL}${username}`);
    
        try {
            await crawler.run(urls);
            const dataset = await crawler.getData();
    
    
            return res.status(200).json({ data: dataset });
        } catch (error) {
            console.error("Scraping error:", error);
            return res.status(500).json({ error: "Scraping failed" });
        }
    });
    
    
    const PORT = parseInt(process.env.PORT) || 3000;
    app.listen(PORT, () => console.log(`Server running on port ${PORT}`));
    Here is what my crawler looks like:
    javascript
    import { CheerioCrawler, Configuration, Dataset, ProxyConfiguration, log } from 'crawlee';

    const proxies = [...] // my proxy list

    const proxyConfiguration = new ProxyConfiguration({
        proxyUrls: proxies,
    });

    export const crawler = new CheerioCrawler({
        proxyConfiguration,

        requestHandler: async ({ request, json, proxyInfo }) => {
            log.info(JSON.stringify(proxyInfo, null, 2));

            // Scraping logic

            await Dataset.pushData({
                // pushing data
            });
        },
    // The options object has to be closed before the Configuration is passed
    // as the second constructor argument.
    }, new Configuration({
        persistStorage: false,
    }));
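    One pattern often suggested for this "works once, then fails" behaviour is to build a fresh crawler (and therefore a fresh request queue) for every API call instead of sharing a single exported instance; a minimal sketch under that assumption:
    typescript
    import { CheerioCrawler, Configuration, ProxyConfiguration } from 'crawlee';

    const proxyConfiguration = new ProxyConfiguration({ proxyUrls: [/* ... */] });

    // Export a factory instead of one shared crawler instance.
    export function createCrawler() {
        return new CheerioCrawler({
            proxyConfiguration,
            requestHandler: async ({ request, json, pushData }) => {
                // ... scraping logic ...
                await pushData({ url: request.url });
            },
        }, new Configuration({ persistStorage: false }));
    }

    // In the Express handler:
    //   const crawler = createCrawler();
    //   await crawler.run(urls);
    //   const { items } = await crawler.getData();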
  • Only-once storage
    royrusso
    03/14/2025, 1:49 AM
    Hello all, I'm looking to understand how Crawlee uses storage a little better and have a question: Crawlee truncates the storage of all indexed pages every time I run. Is there a way to not have it do that? Almost like using it as an append-only log for new items found. Worst case, I can keep an in-memory record of all pages and simply not write to disk when I see a duplicate. Curious what best practices are here.
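    The truncation described here is Crawlee's purge-on-start behaviour, which can be switched off so storages accumulate across runs; a minimal sketch, assuming a CheerioCrawler:
    typescript
    import { CheerioCrawler, Configuration } from 'crawlee';

    // Disable purging of the default storages at startup
    // (the CRAWLEE_PURGE_ON_START environment variable controls the same thing).
    const config = new Configuration({ purgeOnStart: false });

    const crawler = new CheerioCrawler({
        async requestHandler({ request, pushData }) {
            // New items are appended to the existing default dataset.
            await pushData({ url: request.url });
        },
    }, config);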
  • Camoufox failing
    NeoNomade | Scraping hellhound
    03/13/2025, 12:02 PM
    I have a project that uses PlaywrightCrawler from Crawlee. If I create the Camoufox template it runs perfectly; when I take the same commands from the template's package.json and basically follow the same example in my project, I get the following error:
    2025-03-13T11:58:38.513Z [Crawler] [INFO ℹ️] Finished! Total 0 requests: 0 succeeded, 0 failed.
    {"terminal":true}
    2025-03-13T11:58:38.513Z [Crawler] [ERROR ❌] BrowserLaunchError: Failed to launch browser. Please check the following:
    - Check whether the provided executable path "/Users/dp420/.cache/camoufox/Camoufox.app/Contents/MacOS/camoufox" is correct.
    - Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).
    Of course, neither of those two suggestions helps: the Camoufox binary is already there, and playwright install --with-deps has already been run because the project was previously running Firefox. The entire error log is attached: https://cdn.discordapp.com/attachments/1349714201794314240/1349714201999970374/message.txt?ex=67d41ace&is=67d2c94e&hm=94ccecebb822a84aa03fb30f46efea8dab1fd4579034e8dd715ac737926f310f&
  • Redirect Control
    nikus
    03/12/2025, 6:36 PM
    I'm trying to make a simple crawler; how do I properly control redirects? Some bad proxies sometimes redirect to an auth page. In this case I want to mark the request as failed if the redirect target URL contains something like /auth/login. What's the best way to handle this scenario and abort the request early?
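    A minimal sketch of one way to catch this after navigation, assuming a CheerioCrawler; request.loadedUrl holds the final URL after any redirects, and noRetry stops Crawlee from retrying the request:
    typescript
    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
        async requestHandler({ request, session }) {
            // loadedUrl is the URL actually reached after redirects.
            if (request.loadedUrl?.includes('/auth/login')) {
                request.noRetry = true;   // mark as failed without further retries
                session?.markBad();       // penalise the session/proxy that produced it
                throw new Error(`Redirected to login page: ${request.loadedUrl}`);
            }
            // ... normal scraping logic ...
        },
    });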
  • TypeError: Invalid URL
    Mathias Berwig
    03/11/2025, 7:48 PM
    Adding requests with
    crawler.run(["https://website.com/1234"]);
    works locally, while in the Apify cloud it breaks with the following error:
    Reclaiming failed request back to the list or queue. TypeError: Invalid URL
    It appears that while running in the cloud, the URL is split character by character and each character creates a request in the queue, as can be seen in the screenshot. The bug happens no matter whether the URL is hardcoded or added dynamically via input. I'm using crawlee 3.13.0. Complete error stack:
    WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. TypeError: Invalid URL
    2025-03-11T19:21:27.987Z     at new URL (node:internal/url:806:29)
    2025-03-11T19:21:27.988Z     at getCookieContext (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:75:20)
    2025-03-11T19:21:27.989Z     at CookieJar.getCookies (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:452:23)
    2025-03-11T19:21:27.989Z     at CookieJar.callSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:168:16)
    2025-03-11T19:21:27.990Z     at CookieJar.getCookiesSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:575:21)
    2025-03-11T19:21:27.991Z     at Session.getCookies (/home/myuser/node_modules/@crawlee/core/session_pool/session.js:264:40)
    2025-03-11T19:21:27.992Z     at PlaywrightCrawler._applyCookies (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:344:40)
    2025-03-11T19:21:27.992Z     at PlaywrightCrawler._handleNavigation (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:329:20)
    2025-03-11T19:21:27.993Z     at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:260:13)
    2025-03-11T19:21:27.994Z     at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/playwright/internals/playwright-crawler.js:114:9) {"id":"PznVw0jlt50G6EL","url":"D","retryCount":1}
    https://cdn.discordapp.com/attachments/1349106681929273344/1349106682592104529/image.png?ex=67d1e502&is=67d09382&hm=e256bfab84c422dac13112a30ab70ab4aa3d9d84f2360e9a1a5fab78a5e1a358&
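    The "url":"D" at the end of the stack suggests a plain string ended up where an array of URLs was expected (iterating or spreading a string yields its characters). A small defensive sketch, assuming the URLs arrive via Actor input under a hypothetical startUrls field and that `crawler` is the instance from the snippet above:
    typescript
    import { Actor } from 'apify';

    await Actor.init();

    interface Input { startUrls?: string | string[] }
    const { startUrls = [] } = (await Actor.getInput<Input>()) ?? {};

    // Wrap a single string so it can never be iterated character by character.
    const urls = Array.isArray(startUrls) ? startUrls : [startUrls];

    await crawler.run(urls);
    await Actor.exit();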
  • How to ensure dataset is created before pushing data to it?
    Casper
    03/07/2025, 1:15 PM
    I have a public Actor and some of my users find that the default and/or named datasets don't seem to exist and somehow won't be created when data is pushed to them. This is the error message I can see affecting a handful of user runs:
    ERROR PlaywrightCrawler: Request failed and reached maximum retries. ApifyApiError: Dataset was not found
    2025-03-06T17:37:21.112Z   clientMethod: DatasetClient.pushItems
    2025-03-06T17:37:21.113Z   statusCode: 404
    2025-03-06T17:37:21.115Z   type: record-not-found
    2025-03-06T17:37:21.119Z   httpMethod: post
    2025-03-06T17:37:21.120Z   path: /v2/datasets/<redacted>/items
    2025-03-06T17:37:21.122Z   stack:
    2025-03-06T17:37:21.124Z     at makeRequest (/home/myuser/node_modules/apify-client/dist/http_client.js:187:30)
    2025-03-06T17:37:21.125Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    2025-03-06T17:37:21.127Z     at async DatasetClient.pushItems (/home/myuser/node_modules/apify-client/dist/resource_clients/dataset.js:104:9)
    2025-03-06T17:37:21.129Z     at async processSingleReviewDetails (file:///home/myuser/dist/helperfunctions.js:365:5)
    2025-03-06T17:37:21.131Z     at async Module.processReviews (file:///home/myuser/dist/helperfunctions.js:379:13)
    2025-03-06T17:37:21.133Z     at async getReviews (file:///home/myuser/dist/main.js:37:5)
    2025-03-06T17:37:21.135Z     at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/myuser/dist/main.js:98:13)
    2025-03-06T17:37:21.137Z     at async wrap (/home/myuser/node_modules/@apify/timeout/cjs/index.cjs:54:21)
    2025-03-06T17:37:21.139Z   data: undefined {"id":"<redacted>","url":"<redacted>?sort=recency&languages=all","method":"GET","uniqueKey":"https://www.trustpilot.com/review/<redacted>?languages=all&sort=recency"}
    
    How can I ensure that the datasets are created ahead of time, before the scraper starts collecting data, so runs don't fail because the dataset can't be created or does not exist?
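    A minimal sketch of opening (and thereby creating) the datasets up front instead of relying on lazy creation at push time, assuming the Apify SDK; the dataset name is illustrative:
    typescript
    import { Actor } from 'apify';

    await Actor.init();

    // openDataset creates the dataset if it does not exist yet, so doing this
    // once at startup avoids pushing into a dataset that was never created.
    const reviewsDataset = await Actor.openDataset('reviews');

    // Later, inside the crawler's request handler:
    await reviewsDataset.pushData({ url: 'https://example.com', rating: 5 });

    await Actor.exit();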
  • Routing issue
    Scai
    03/05/2025, 7:57 PM
    I have a listing website as INPUT and enqueueLinks for it. These links (case studies) in turn have multiple pages. When the crawler adds the links with the new label attached, nothing happens. When using only the case study page, it scrapes the data and works. Not sure what to do next or how to test it more. Does the queue system wait for all links to be added before it starts scraping?
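    A minimal sketch of the label-based routing pattern described here, assuming a PlaywrightCrawler and a hypothetical DETAIL label and selector; links are enqueued as they are found and processed as soon as a crawler slot is free (the queue does not wait for all links to be added):
    typescript
    import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

    const router = createPlaywrightRouter();

    // Default handler: the listing page given as INPUT.
    router.addDefaultHandler(async ({ enqueueLinks }) => {
        await enqueueLinks({
            selector: 'a.case-study',   // hypothetical selector
            label: 'DETAIL',
        });
    });

    // Handler for the enqueued case-study pages.
    router.addHandler('DETAIL', async ({ request, page, pushData }) => {
        await pushData({
            url: request.url,
            title: await page.title(),
        });
    });

    const crawler = new PlaywrightCrawler({ requestHandler: router });
    await crawler.run(['https://example.com/case-studies']);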
  • Using BrightData's socks5h proxies
    Jeno
    03/05/2025, 7:08 PM
    BrightData's datacenter proxies can be used with SOCKS5 but only with remote DNS resolution, so the protocol should be given as socks5h://... Testing it with curl works, but using it in Crawlee it doesn't work; it just keeps hanging.
    proxyConfiguration: new ProxyConfiguration({
          newUrlFunction: () => {
            return 'socks5h://brd-customer-...-zone-...:...@brd.superproxy.io:22228';
          },
        })
    Any idea how it should work? Edit: since I use Camoufox, I tried:
    firefoxUserPrefs: {
              'network.proxy.socks_remote_dns': true,  // Enable remote DNS resolution
            },
    But it still just hangs.
  • Loadtime
    BOBPG
    03/01/2025, 11:08 PM
    Hello, is there a way to get the load time of a site from Crawlee in headless mode? I'm using PlaywrightCrawler. Thanks!
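    A minimal sketch of reading load timings from the browser itself, assuming a PlaywrightCrawler and the standard Navigation Timing API:
    typescript
    import { PlaywrightCrawler } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, page, log }) {
            // Read the Navigation Timing entry for the page that was just loaded.
            const timing = await page.evaluate(() => {
                const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
                return {
                    domContentLoaded: nav.domContentLoadedEventEnd,
                    loadEvent: nav.loadEventEnd,
                    duration: nav.duration,
                };
            });
            log.info(`Load time for ${request.url}: ${timing.duration.toFixed(0)} ms`);
        },
    });

    await crawler.run(['https://example.com']);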
  • How to stop following delayed javascript redirects?
    Nth
    03/01/2025, 3:58 PM
    I'm using the AdaptivePlaywrightCrawler with the same-domain strategy in enqueueLinks. The page I'm trying to crawl has delayed JavaScript redirects to other pages, such as Instagram. Sometimes the crawler mistakenly thinks it's still on the same domain after a redirect and starts adding Instagram paths to the main domain, like example.com/account/... and example.com/member/..., which don't actually exist. How can I stop following these delayed JavaScript redirects?
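    A defensive sketch for filtering the bogus URLs out at enqueue time, assuming the /account/ and /member/ paths from the question and an example.com main domain; transformRequestFunction can drop a request by returning false:
    typescript
    // Inside the request handler of the crawler described above:
    await enqueueLinks({
        strategy: 'same-domain',
        // Drop the paths that only appear because a delayed JS redirect
        // changed the page's base URL (hypothetical patterns from the question).
        exclude: ['**/account/**', '**/member/**'],
        transformRequestFunction: (req) => {
            // Extra safety net: skip anything that no longer points at the original host.
            if (!new URL(req.url).hostname.endsWith('example.com')) return false;
            return req;
        },
    });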