Talles
05/29/2025, 9:57 PM
🅰ndrew ✪
05/22/2025, 4:25 PM
hunterleung.
05/22/2025, 4:20 PM
Hall
05/18/2025, 7:47 PM
Amal Chandran
05/16/2025, 1:50 AM
MrSquaare
05/10/2025, 9:18 PM
BageDevimo
05/04/2025, 11:02 PM
NeoNomade | Scraping hellhound
04/23/2025, 10:32 AM
typescript
preNavigationHooks: [
    // Crawlee passes the crawling context first and gotoOptions second.
    async (_crawlingContext, gotoOptions) => {
        gotoOptions.waitUntil = "load";
    },
    async ({ page }) => {
        await page.route("**/*", async (route) => {
            const url = route.request().url();
            const resourceType = route.request().resourceType();
            const trackingScriptRegex =
                /googletagmanager|facebook|sentry|ads|tracking|metrics|analytics|optimizely|segment/i;
            const extraBlocklistRegex =
                /tiktok|facebook|prismic-images|bing|ads|tracking|metrics|analytics|contentsquare|lytics|adtrafficquality|adsrvr|tmol|snapchat|ticketm\.net/i;
            const isBlockedResourceType = ["stylesheet", "font", "media"].includes(resourceType);
            const isBlockedScript = resourceType === "script" && trackingScriptRegex.test(url);
            const isBlockedByExtraPatterns = extraBlocklistRegex.test(url);
            const shouldBlock =
                !url.includes("recaptcha") &&
                (isBlockedResourceType || isBlockedScript || isBlockedByExtraPatterns);
            if (shouldBlock) {
                await route.abort();
                return;
            }
            await route.continue();
        });
    },
],
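For context, a minimal sketch of how a hook list like this is wired into a crawler; the start URL is a placeholder, and the hook bodies above would go in the same array:
typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        // First argument is the crawling context, second is gotoOptions.
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = 'load';
        },
    ],
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.url}`);
    },
});

await crawler.run(['https://example.com']); // placeholder URL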
je
04/19/2025, 1:39 PM
import { PlaywrightCrawler } from 'crawlee'

// const proxyConfiguration = new ProxyConfiguration({
//     proxyUrls: [
//         '...'
//     ],
// })

const crawler: PlaywrightCrawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: false,
            // channel: 'chrome',
            // viewport: null,
        },
    },
    // proxyConfiguration,
    maxRequestRetries: 0,
    maxRequestsPerCrawl: 5,
    sessionPoolOptions: {
        blockedStatusCodes: [],
    },
    async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}...`)
        await page.waitForTimeout(100000)
    },
    failedRequestHandler({ request, log }) {
        log.info(`Request ${request.url} failed too many times.`)
    },
    // browserPoolOptions: {
    //     useFingerprints: false,
    // },
})

await crawler.addRequests([
    'https://abrahamjuliot.github.io/creepjs/'
])
await crawler.run()
console.log('Crawler finished.')
je
04/18/2025, 10:27 AM
je
04/16/2025, 11:18 AM
import defaultLog, { Log } from '@apify/log';
...
const crawler = new BasicCrawler({
    requestHandler: router,
    log: defaultLog.child({ prefix: 'MyCrawler' })
})
but then I get the following type error:
Type 'import("scrapers/node_modules/@apify/log/esm/index", { with: { "resolution-mode": "import" } }).Log' is not assignable to type 'import("scrapers/node_modules/@apify/log/cjs/index").Log'.
Types have separate declarations of a private property 'options'.ts(2322)
Thanks!
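One hedged way around this kind of ESM/CJS dual-declaration clash, assuming the project already depends on crawlee: take the log instance from crawlee's own re-export, so the crawler option and the logger resolve to a single declaration of @apify/log. A sketch:
typescript
import { BasicCrawler, Router, log as defaultLog } from 'crawlee';

// Stand-in for the router defined elsewhere in the original project.
const router = Router.create();

const crawler = new BasicCrawler({
    requestHandler: router,
    // crawlee re-exports @apify/log, so this Log type and the one the
    // crawler expects come from the same declaration file.
    log: defaultLog.child({ prefix: 'MyCrawler' }),
});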
Vice
04/13/2025, 9:02 PM
Martin
04/11/2025, 12:14 PM
Apify developer
04/09/2025, 5:46 AM
Lukas Sirhal
04/06/2025, 4:20 PM
new_in_town
04/01/2025, 4:55 PM
iDora
03/28/2025, 2:57 PM
Martin
03/28/2025, 12:38 PM
Cannot detect CDP client for Puppeteer ${parsed.version}. You should report this to Crawlee, mentioning the puppeteer version you are using.
(code-frame residue from a bundler module-not-found error; see https://nextjs.org/docs/messages/module-not-found)
Zhasulyainou
03/20/2025, 12:38 PM
Fabien
03/20/2025, 9:53 AM
Nel549
03/15/2025, 7:02 PM
javascript
import { crawler } from './main.js' // Import the exported crawler from main file
import express from "express";

const app = express();
app.use(express.json());

const BASE_URL = "https.....";

app.post("/scrape", async (req, res) => {
    if (!req.body || !req.body.usernames) {
        return res.status(400).json({ error: "Invalid input" });
    }
    const { usernames } = req.body;
    const urls = usernames.map(username => `${BASE_URL}${username}`);
    try {
        await crawler.run(urls);
        const dataset = await crawler.getData();
        return res.status(200).json({ data: dataset });
    } catch (error) {
        console.error("Scraping error:", error);
        return res.status(500).json({ error: "Scraping failed" });
    }
});

const PORT = parseInt(process.env.PORT) || 3000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));
Here is how my crawler looks:
javascript
import { CheerioCrawler, Configuration, Dataset, ProxyConfiguration, log } from 'crawlee';

const proxies = [...] // my proxy list

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: proxies,
});

export const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, json, proxyInfo }) => {
        log.info(JSON.stringify(proxyInfo, null, 2))
        // Scraping logic
        await Dataset.pushData({
            // pushing data
        });
    },
}, new Configuration({
    persistStorage: false,
}));
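One caveat worth flagging with this setup, as a hedged observation rather than a diagnosis: a single crawler instance cannot serve overlapping HTTP requests, since run() will not start while a previous run is still in progress, and concurrent runs would share the default dataset anyway. A sketch of a per-request variant (names and shapes are illustrative):
typescript
import express from 'express';
import { CheerioCrawler, Configuration } from 'crawlee';

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
    const results: unknown[] = [];
    // A fresh crawler (with in-memory storage) per request, so overlapping
    // calls cannot collide on run() or on a shared dataset.
    const crawler = new CheerioCrawler({
        requestHandler: async ({ json }) => {
            results.push(json); // collect in memory instead of Dataset.pushData
        },
    }, new Configuration({ persistStorage: false }));
    await crawler.run(req.body.urls); // assumes the caller sends { urls: [...] }
    res.json({ data: results });
});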
royrusso
03/14/2025, 1:49 AM
NeoNomade | Scraping hellhound
03/13/2025, 12:02 PM
2025-03-13T11:58:38.513Z [Crawler] [INFO ℹ️] Finished! Total 0 requests: 0 succeeded, 0 failed.
{"terminal":true}
2025-03-13T11:58:38.513Z [Crawler] [ERROR ❌] BrowserLaunchError: Failed to launch browser. Please check the following:
- Check whether the provided executable path "/Users/dp420/.cache/camoufox/Camoufox.app/Contents/MacOS/camoufox" is correct.
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).
Of course, neither of those two suggestions helps: the Camoufox binary is already there, and playwright install --with-deps has already been run, because the project was previously running Firefox.
The entire error log is attached:
https://cdn.discordapp.com/attachments/1349714201794314240/1349714201999970374/message.txt?ex=67d41ace&is=67d2c94e&hm=94ccecebb822a84aa03fb30f46efea8dab1fd4579034e8dd715ac737926f310f&
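To help isolate the failure, one option is to point Crawlee at the Camoufox binary explicitly through launchContext, which at least confirms whether Playwright's Firefox launcher can start that executable at all. A minimal sketch, reusing the path from the error message:
typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Camoufox is a Firefox fork, so it launches through the firefox launcher.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            executablePath: '/Users/dp420/.cache/camoufox/Camoufox.app/Contents/MacOS/camoufox',
        },
    },
    async requestHandler({ page, log }) {
        log.info(await page.title());
    },
});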
nikus
03/12/2025, 6:36 PM
Mathias Berwig
03/11/2025, 7:48 PM
crawler.run(["https://website.com/1234"]);
works locally, while in the Apify cloud it breaks with the following error: Reclaiming failed request back to the list or queue. TypeError: Invalid URL
It appears that, while running in the cloud, the URL is split character by character and each character creates its own request in the queue, as can be seen in the screenshot.
The bug happens whether the URL is hardcoded in the code or added dynamically via input.
I'm using crawlee 3.13.0.
Complete error stack:
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. TypeError: Invalid URL
2025-03-11T19:21:27.987Z at new URL (node:internal/url:806:29)
2025-03-11T19:21:27.988Z at getCookieContext (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:75:20)
2025-03-11T19:21:27.989Z at CookieJar.getCookies (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:452:23)
2025-03-11T19:21:27.989Z at CookieJar.callSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:168:16)
2025-03-11T19:21:27.990Z at CookieJar.getCookiesSync (/home/myuser/node_modules/tough-cookie/dist/cookie/cookieJar.js:575:21)
2025-03-11T19:21:27.991Z at Session.getCookies (/home/myuser/node_modules/@crawlee/core/session_pool/session.js:264:40)
2025-03-11T19:21:27.992Z at PlaywrightCrawler._applyCookies (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:344:40)
2025-03-11T19:21:27.992Z at PlaywrightCrawler._handleNavigation (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:329:20)
2025-03-11T19:21:27.993Z at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:260:13)
2025-03-11T19:21:27.994Z at async PlaywrightCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/playwright/internals/playwright-crawler.js:114:9) {"id":"PznVw0jlt50G6EL","url":"D","retryCount":1}
https://cdn.discordapp.com/attachments/1349106681929273344/1349106682592104529/image.png?ex=67d1e502&is=67d09382&hm=e256bfab84c422dac13112a30ab70ab4aa3d9d84f2360e9a1a5fab78a5e1a358&
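For what it's worth, single-character requests (note url "D" in the last stack line) are exactly what you get when a bare string reaches code that iterates over a list of requests, since strings are iterable character by character. A small illustration of that mechanism, not a confirmed diagnosis of the Crawlee internals:
typescript
// A string is an iterable of characters, so spreading one where an
// array of URLs is expected yields one "request" per character.
const input: Iterable<string> = 'https://website.com/1234';
const requests = [...input]; // ['h', 't', 't', 'p', 's', ':', '/', '/', ...]
console.log(requests.length); // 24 one-character "URLs", each an Invalid URL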
Casper
03/07/2025, 1:15 PM
bash
ERROR PlaywrightCrawler: Request failed and reached maximum retries. ApifyApiError: Dataset was not found
2025-03-06T17:37:21.112Z clientMethod: DatasetClient.pushItems
2025-03-06T17:37:21.113Z statusCode: 404
2025-03-06T17:37:21.115Z type: record-not-found
2025-03-06T17:37:21.119Z httpMethod: post
2025-03-06T17:37:21.120Z path: /v2/datasets/<redacted>/items
2025-03-06T17:37:21.122Z stack:
2025-03-06T17:37:21.124Z at makeRequest (/home/myuser/node_modules/apify-client/dist/http_client.js:187:30)
2025-03-06T17:37:21.125Z at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2025-03-06T17:37:21.127Z at async DatasetClient.pushItems (/home/myuser/node_modules/apify-client/dist/resource_clients/dataset.js:104:9)
2025-03-06T17:37:21.129Z at async processSingleReviewDetails (file:///home/myuser/dist/helperfunctions.js:365:5)
2025-03-06T17:37:21.131Z at async Module.processReviews (file:///home/myuser/dist/helperfunctions.js:379:13)
2025-03-06T17:37:21.133Z at async getReviews (file:///home/myuser/dist/main.js:37:5)
2025-03-06T17:37:21.135Z at async PlaywrightCrawler.requestHandler [as userProvidedRequestHandler] (file:///home/myuser/dist/main.js:98:13)
2025-03-06T17:37:21.137Z at async wrap (/home/myuser/node_modules/@apify/timeout/cjs/index.cjs:54:21)
2025-03-06T17:37:21.139Z data: undefined {"id":"<redacted>","url":"<redacted>?sort=recency&languages=all","method":"GET","uniqueKey":"https://www.trustpilot.com/review/<redacted>?languages=all&sort=recency"}
How can I ensure that the datasets are created ahead of time, before the scraper starts collecting data, so it doesn't fail because a dataset can't be created or doesn't exist?
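One way to guarantee a named dataset exists before any handler pushes to it, sketched here with a placeholder name: open it once at startup, since Dataset.open() creates the store when it is missing, and push to that handle instead of the implicit default. On the Apify platform, Actor.openDataset() is the equivalent call.
typescript
import { Dataset } from 'crawlee';

// Opening the dataset up front creates it if it doesn't exist yet,
// so later pushData calls can't hit a 404 for a missing store.
const reviews = await Dataset.open('trustpilot-reviews'); // placeholder name
await reviews.pushData({ placeholder: true });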
Scai
03/05/2025, 7:57 PM
… enqueueLinks of it. These case-study links in turn also have multiple pages. When the crawler adds the links with the new label attached, nothing happens. When using only the case-study page, it scrapes the data and works. I'm not sure what to do next or how to test it further. Does the queue system wait for all links to be added before it starts scraping? (A sketch of the label pattern follows.)
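For reference, a minimal sketch of the enqueueLinks-with-label pattern in question, with placeholder selectors and label names; new requests are picked up as soon as crawler capacity frees up, rather than the queue waiting for all links to be added first:
typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Placeholder selector and label, for illustration only.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.case-study', label: 'CASE_STUDY' });
});

router.addHandler('CASE_STUDY', async ({ request, enqueueLinks, log }) => {
    log.info(`Scraping ${request.url}`);
    // Pagination within a case study re-enqueues with the same label.
    await enqueueLinks({ selector: 'a.next-page', label: 'CASE_STUDY' });
});

const crawler = new CheerioCrawler({ requestHandler: router });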
Jeno
03/05/2025, 7:08 PM
proxyConfiguration: new ProxyConfiguration({
    newUrlFunction: () => {
        return 'socks5h://brd-customer-...-zone-...:...@brd.superproxy.io:22228';
    },
})
Any idea how it should work?
Edit: since I use Camoufox, I tried:
firefoxUserPrefs: {
    'network.proxy.socks_remote_dns': true, // Enable remote DNS resolution
},
But it still just hangs.
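One way to narrow this down, offered as a sketch rather than a fix: hand the SOCKS proxy straight to the browser's launch options, bypassing ProxyConfiguration, to see whether the hang lives in the proxy layer or in the browser's SOCKS handling. The credentials stay elided as in the snippet above, and plain socks5:// is used, since socks5h is a curl-style convention and the remote-DNS preference covers that part in Firefox:
typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Proxy handed directly to the browser; credentials elided.
            proxy: { server: 'socks5://brd.superproxy.io:22228' },
            firefoxUserPrefs: {
                'network.proxy.socks_remote_dns': true, // proxy-side DNS
            },
        },
    },
    async requestHandler({ page, log }) {
        log.info(await page.title());
    },
});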
BOBPG
03/01/2025, 11:08 PM
Nth
03/01/2025, 3:58 PM