liamk-ultra
07/26/2025, 4:51 PM
psutil.Process.memory_full_info(), which requires elevated permissions on macOS and fails with
psutil/_psosx.py", line 356, in wrapper
raise AccessDenied(self.pid, self._name)
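One possible shape for the workaround mentioned just below (a sketch and an assumption on my part, not the poster's actual patch): fall back to memory_info() when memory_full_info() is denied, which only helps if the caller can live without the extra fields such as uss.
# Hedged sketch: fall back to memory_info() when memory_full_info() raises AccessDenied on macOS.
# Assumption: the caller tolerates the smaller pmem result (no uss/swap fields).
import psutil

_original_memory_full_info = psutil.Process.memory_full_info

def _memory_full_info_or_fallback(self):
    try:
        return _original_memory_full_info(self)
    except psutil.AccessDenied:
        # memory_info() needs no elevated permissions.
        return self.memory_info()

psutil.Process.memory_full_info = _memory_full_info_or_fallback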
I came up with a monkey patch workaround, so this is not a pressing issue. Just thought I'd mention it in case anyone else encountered it, or a fix could be incorporated in the code.
liamk-ultra
07/26/2025, 4:38 PM
statistics = Statistics.with_default_state(
    log_message=f"{target.venue} Web Scraper Stats",
    log_interval=timedelta(minutes=30),
    periodic_message_logger=None,
    statistics_log_format="table"
)
liamk-ultra
07/22/2025, 7:22 PM
Error crawling target dummy2: Service StorageClient is already in use. Existing value: <crawlee.storage_clients._memory._memory_storage_client.MemoryStorageClient object at 0x335e5ef30>, attempted new value: <crawlee.storage_clients._memory._memory_storage_client.MemoryStorageClient object at 0x335e7c710>.
Upon investigation I discovered the docs saying:
> The FileSystemStorageClient is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time.
So, even though it appears to work in some basic tests, I'm not confident this approach will hold up. I actually don't want concurrent access; I want the storage separated on a per-crawler basis (or otherwise segmented within the Memory or File storage). I'm not opposed to pointing it at '/tmp' in production, but the warning makes me doubtful that it would work correctly.
I did try creating multiple memory clients by setting unique queue_id, store_id and dataset_id, but that resulted in the same error.
Is this a limitation, or is there some other way to do what I'm trying to do?
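One pattern that may give the per-crawler separation described above is to give each crawler its own named request queue and dataset instead of a second storage client (a sketch under that assumption; the names below are illustrative):
# Sketch: per-crawler separation via named storages rather than extra storage clients.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storages import Dataset, RequestQueue

async def build_crawler(name: str) -> PlaywrightCrawler:
    # Each crawler gets its own queue instead of sharing the default one.
    queue = await RequestQueue.open(name=f'{name}-queue')
    crawler = PlaywrightCrawler(request_manager=queue)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # Results go into a dataset dedicated to this crawler.
        dataset = await Dataset.open(name=f'{name}-results')
        await dataset.push_data({'url': context.request.url})

    return crawler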
Thanks for your help!
Razz
07/09/2025, 7:48 AM
Adeyemi Bamidele
06/29/2025, 5:45 PM
Felix
06/27/2025, 9:29 AM
Glitchy_mess
06/26/2025, 7:23 AM
# cookieDictionary is just a dictionary of cookie parameters formatted per Playwright's preferred format
browserPlugin = PlaywrightBrowserPlugin(
    browser_new_context_options={"storage_state": {"cookie": cookieDictionary}}
)
browserPoolVar = BrowserPool(plugins=[browserPlugin])
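For reference, Playwright's storage_state normally uses a plural "cookies" key holding a list of cookie dicts; a sketch of that shape follows (the import path and placeholder values are assumptions, not taken from this post):
# Sketch of Playwright's storage_state shape; cookie values here are placeholders.
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

storage_state = {
    "cookies": [
        {"name": "session", "value": "abc123", "domain": "example.com", "path": "/"},
    ],
    "origins": [],
}

plugin = PlaywrightBrowserPlugin(
    browser_new_context_options={"storage_state": storage_state},
)
pool = BrowserPool(plugins=[plugin])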
https://cdn.discordapp.com/attachments/1387694806897000529/1387694807174086667/image.png?ex=685e4700&is=685cf580&hm=4e4c893a63f9d2cd430af9a6eadef9537a6672eb6f55aed831f33d4dcf8a1358&
https://cdn.discordapp.com/attachments/1387694806897000529/1387694807694049300/image.png?ex=685e4700&is=685cf580&hm=3a036b370f6739c79e98679c31655f8938932448b178776fe63cd1556441ba82&
Answer Overflow
06/22/2025, 9:04 PM
Miro
06/14/2025, 11:59 AM
Mikmokki
06/03/2025, 1:00 PM
Rashi Agarwal
05/23/2025, 3:46 PM
phughesion-h3
05/22/2025, 11:58 PM
user_data to each request, so I extract and form each request manually.
new_requests = await context.extract_links(strategy='same-origin')
for new_request in new_requests:
    print(f"[LINK] Extracted link: {new_request.url}")
I start my crawl pointed at http://localhost, yet the crawler ends up crawling YouTube since there is a link to YouTube on my site.
Andrew
05/12/2025, 7:10 AM
Kike
05/02/2025, 5:08 PM
Rykari
04/29/2025, 1:10 PM
ROYOSTI
04/23/2025, 1:15 PM
CRAWLEE_MEMORY_MBYTES is set to 61440.
My docker config:
docker run --rm --init -t $docker_args \
-v /mnt/storage:/app/storage \
--user appuser \
--security-opt seccomp=/var/lib/jenkins/helpers/docker/seccomp_profile.json \
-e MONGO_HOST=${MONGO_HOST} \
-e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-e SPAWN=${SPAWN} \
-e MONGO_CACHE=${MONGO_CACHE} \
-e CRAWLEE_MEMORY_MBYTES=${CRAWLEE_MEMORY_MBYTES} \
${IMAGE_NAME} $prog_args
After crawling for around 15-20 minutes I receive the warning "Memory is critically overloaded". But when running free -h I notice I still have a lot of free memory, as you can see in the screenshots.
The Playwright crawler is configured like this:
crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        min_concurrency=25,
        max_concurrency=125,
    ),
    request_manager=request_queue,
    request_handler=router,
    retry_on_blocked=True,
    headless=True,
)
I'm a bit clueless about why I have this issue. In the past I had issues with zombie processes, but that was solved by adding --init to my docker run command.
Do you have any idea how I can fix this or debug it further?
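One thing that may help narrow this down: free -h on the host does not necessarily reflect the limit the process inside the container sees, so a quick check from inside the container can be useful (a diagnostic sketch; the cgroup path assumes cgroup v2):
# Diagnostic sketch: compare psutil's view with the cgroup memory limit inside the container.
from pathlib import Path

import psutil

def report_memory() -> None:
    vm = psutil.virtual_memory()
    print(f'total: {vm.total / 1024 ** 2:.0f} MiB, available: {vm.available / 1024 ** 2:.0f} MiB')
    cgroup_max = Path('/sys/fs/cgroup/memory.max')  # cgroup v2; v1 uses memory/memory.limit_in_bytes
    if cgroup_max.exists():
        print(f'cgroup memory.max: {cgroup_max.read_text().strip()}')  # "max" means no limit

report_memory()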
https://cdn.discordapp.com/attachments/1364590389650133002/1364590390216228985/Screenshot_2025-04-23_at_15.06.09.png?ex=680a3955&is=6808e7d5&hm=22ebe132be35e12e67124bf986bdd2577902832770aa286f1e272593afe0647d&
https://cdn.discordapp.com/attachments/1364590389650133002/1364590390744973403/Screenshot_2025-04-23_at_15.05.40.png?ex=680a3956&is=6808e7d6&hm=f77a13c3c8e18aa4204e490f19a386806699c1ab2685383e7ec0705dd9eab270&
Matheus Rossi
04/17/2025, 8:43 PM
import asyncio
from crawlee.crawlers import AdaptivePlaywrightCrawler
from crawlee import service_locator
from routes import router
async def main() -> None:
    configuration = service_locator.get_configuration()
    configuration.persist_storage = False
    configuration.write_metadata = False
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        request_handler=router,
        max_requests_per_crawl=5,
    )
    await crawler.run(['https://investor.agenusbio.com/news/default.aspx'])

if __name__ == '__main__':
    asyncio.run(main())
and here are my routes:
from __future__ import annotations
from crawlee.crawlers import AdaptivePlaywrightCrawlingContext
from crawlee.router import Router
from crawlee import RequestOptions, RequestTransformAction
router = Router[AdaptivePlaywrightCrawlingContext]()
def transform_request(request_options: RequestOptions) -> RequestOptions | RequestTransformAction:
    url = request_options.get('url', '')
    if url.endswith('.pdf'):
        print(f"Request options: {request_options} before")
        request_options['label'] = 'pdf_handler'
        print(f"Request options: {request_options} after")
    return request_options

@router.default_handler
async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    await context.enqueue_links(
        transform_request_function=transform_request,
    )

@router.handler(label='pdf_handler')
async def pdf_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    context.log.info('Processing PDF: %s', context.request.url)
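If the eventual goal is to download the routed PDFs, one possible body for pdf_handler is sketched below, reusing the router and imports from the snippet above (httpx, the store name, and the key derivation are assumptions on my part, not from the original post):
# Sketch: fetch the routed PDF with httpx and keep the bytes in a key-value store.
import httpx

from crawlee.storages import KeyValueStore

@router.handler(label='pdf_handler')
async def pdf_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
    context.log.info('Processing PDF: %s', context.request.url)
    async with httpx.AsyncClient() as client:
        response = await client.get(context.request.url)
    store = await KeyValueStore.open(name='pdfs')
    key = context.request.url.rsplit('/', 1)[-1] or 'document.pdf'
    await store.set_value(key, response.content)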
Rykari
04/15/2025, 1:43 PM
rast42
04/09/2025, 9:45 AM
Doigus
04/06/2025, 8:48 AM
BlackCoder
04/04/2025, 5:47 AM
huey louie dewey
04/01/2025, 9:36 PM
# Standard libraries
import asyncio
import logging
import json

# Installed libraries
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.proxy import ProxyType, Proxy
from selenium.webdriver.common.by import By
from selenium import webdriver
from apify import Actor

async def main() -> None:
    async with Actor:
        Actor.log.setLevel(logging.DEBUG)
        proxy_config = await Actor.create_proxy_configuration(groups=['RESIDENTIAL'])
        url = "https://api.ipify.org?format=json"
        for _ in range(10):
            proxy = await proxy_config.new_url()
            Actor.log.info(f'Using proxy: {proxy}')
            chrome_options = ChromeOptions()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.proxy = Proxy({'proxyType': ProxyType.MANUAL, 'httpProxy': proxy})
            try:
                with webdriver.Chrome(options=chrome_options) as driver:
                    driver.set_page_load_timeout(20)
                    driver.get(url)
                    content = driver.find_element(By.TAG_NAME, 'pre').text
                    ip = json.loads(content).get("ip")
                    Actor.log.info(f"IP = {ip}")
            except (TimeoutException, WebDriverException, json.JSONDecodeError):
                Actor.log.exception("An error occurred")
Due to the Discord message size limitation, I attach the log output of the above code in a new message below...
jransom33
03/22/2025, 10:44 PM
non_sam
03/16/2025, 8:20 AM
kashyab
03/13/2025, 9:51 PM
crawl4ai but switched since crawlee seems much better at anti-blocking.
The main issue I am facing is that I want to filter the URLs to crawl for a given page using LLMs. Is there a clean way to do this? So far I have implemented a transformer for enqueue_links which saves the links to a dict, and then I process those dicts at a later point in time using another crawler object. Any other suggestions for solving this problem? I don't want to make the LLM call in the transform function, since that would be one LLM call per URL found, which is quite expensive.
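One way to avoid a per-URL LLM call is to batch all links found on a page into a single call before enqueueing (a sketch; filter_urls_with_llm is a hypothetical stand-in for the real LLM call, and an existing crawler object is assumed):
# Sketch: one LLM call per page instead of per URL.
from crawlee import Request

async def filter_urls_with_llm(urls: list[str]) -> list[str]:
    # Hypothetical stand-in: send the whole batch to the LLM and return the URLs worth crawling.
    return urls  # placeholder until the real LLM call is wired in

@crawler.router.default_handler  # assumes a crawler object already exists
async def handler(context) -> None:
    extracted = await context.extract_links()
    candidates = [request.url for request in extracted]
    keep = await filter_urls_with_llm(candidates)  # one call per page, not per URL
    await context.add_requests([Request.from_url(url) for url in keep])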
Also, when I run this on my EC2 instance with 8 GB of RAM, it constantly runs into memory overload and just gets stuck, i.e. it doesn't even continue scraping pages. Any idea how I can resolve this? This is my code currently.
KrundleDugins
03/07/2025, 4:49 PM
ROYOSTI
03/05/2025, 11:57 AM
headless=True
The package that I use is: crawlee[playwright]==0.6.1
When running the crawler I noticed that, while waiting for remaining tasks to finish, it sometimes receives an error like the one you can see in the screenshot. Is this something that can be resolved easily?
I think this error is also related to another issue I have.
In my code I have my own batching system in place, but I noticed that my memory usage slowly increases with each batch.
After some investigation I saw that ps -fC headless_shell gave me a lot of headless_shell processes marked <defunct> (zombie processes). So I assume this is related to the cleanup that is failing on each crawl.
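A small diagnostic that may help confirm the zombie build-up between batches (a sketch; it only counts defunct processes, it does not fix the cleanup):
# Diagnostic sketch: count defunct headless_shell processes, e.g. once per batch.
import psutil

def count_zombie_headless_shell() -> int:
    return sum(
        1
        for proc in psutil.process_iter(['name', 'status'])
        if proc.info['name'] == 'headless_shell' and proc.info['status'] == psutil.STATUS_ZOMBIE
    )

print(f'zombie headless_shell processes: {count_zombie_headless_shell()}')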
Below you can see my code for the batching system:
# Create key-value stores for batches
scheduled_batches = await prepare_requests_from_mongo(crawler_name)
processed_batches = await KeyValueStore.open(
    name=f'{crawler_name}-processed_batches'
)

# Create crawler
crawler = await create_playwright_crawler(crawler_name)

# Iterate over the batches
async for key_info in scheduled_batches.iterate_keys():
    urls: List[str] = await scheduled_batches.get_value(key_info.key)
    requests = [
        Request.from_url(
            url,
            user_data={
                'page_tags': [PageTag.HOME.value],
                'chosen_page_tag': PageTag.HOME.value,
                'label': PageTag.HOME.value,
            },
        )
        for url in urls
    ]
    LOGGER.info(f'Processing batch {key_info.key}')
    await crawler.run(requests)
    await scheduled_batches.set_value(key_info.key, None)
    await processed_batches.set_value(key_info.key, urls)
KrundleDugins
03/03/2025, 2:57 PM
Hall
03/02/2025, 5:35 PM
Hall
03/01/2025, 6:15 PM