  • Memory access problem on MacOS

    liamk-ultra

    07/26/2025, 4:51 PM
    I'm deploying on Linux, but my development system is macOS. I discovered an issue that appears with concurrency > 2. I have plenty of memory, but that's not actually the problem; in any case it should probably fail gracefully. The Snapshotter tries to access child process memory via
    psutil.Process.memory_full_info()
    , which requires elevated permissions on macOS and fails with
    Copy code
    psutil/_psosx.py", line 356, in wrapper
    
        raise AccessDenied(self.pid, self._name)
    I came up with a monkey patch workaround, so this is not a pressing issue. Just thought I'd mention it in case anyone else encountered it, or a fix could be incorporated in the code.
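    The workaround boils down to catching the AccessDenied and falling back to plain memory_info(). A rough sketch only (the exact field the Snapshotter reads may differ, so treat this as illustrative):
    Copy code
    import psutil

    def process_memory_bytes(pid: int) -> int:
        """Best-effort memory usage for a process, degrading gracefully on macOS."""
        proc = psutil.Process(pid)
        try:
            # memory_full_info() needs elevated permissions on macOS for other processes
            return proc.memory_full_info().uss
        except psutil.AccessDenied:
            # rss from memory_info() is always readable and is a reasonable approximation
            return proc.memory_info().rss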
  • Unexpected behavior with Statistics logging

    liamk-ultra

    07/26/2025, 4:38 PM
    I wanted to turn off the periodic Statistics logging; my crawls are relatively short, and I'm only interested in the final statistics. I could set the log_interval to something really long. I thought that if I set periodic_message_logger to None it would prevent logging, but that doesn't actually work: the code tests for it being None and falls back to the default logger in Crawlee.
    Copy code
    statistics = Statistics.with_default_state(
        log_message=f"{target.venue} Web Scraper Stats",
        log_interval=timedelta(minutes=30),
        periodic_message_logger=None,
        statistics_log_format="table",
    )
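    A workaround that seems to get past the None check (a sketch, untested against the latest release, and reusing the names from the snippet above) is to pass a real logger object that simply discards everything:
    Copy code
    import logging
    from datetime import timedelta

    from crawlee.statistics import Statistics

    # A logger that swallows the periodic messages, instead of passing None
    silent_logger = logging.getLogger("crawlee.statistics.silent")
    silent_logger.addHandler(logging.NullHandler())
    silent_logger.propagate = False
    silent_logger.setLevel(logging.CRITICAL + 1)

    statistics = Statistics.with_default_state(
        log_message=f"{target.venue} Web Scraper Stats",
        log_interval=timedelta(minutes=30),
        periodic_message_logger=silent_logger,  # not None, so Crawlee won't fall back to its default
        statistics_log_format="table",
    )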
  • StorageClients w/ Multiple Crawlers

    liamk-ultra

    07/22/2025, 7:22 PM
    Hi! This is my first time using Crawlee, and ... so far, so good. It's working. However, I noticed it was using the default FileSystemStorage and creating files locally on my development machine. That's less than ideal in production. Changing to MemoryStorageClient revealed some other problems. I'm running multiple PlaywrightCrawlers asynchronously. The reason for that is that I want to process the scraped documents in a batch, i.e. per site. Also, it's easier to keep things isolated that way. (Each target has its own set of starting URLs, link patterns to enqueue, and selectors to select.) However, this fails with MemoryStorageClient because the first crawler gets the memory, and subsequent ones generate an error:
    Copy code
    Error crawling target dummy2: Service StorageClient is already in use. Existing value: <crawlee.storage_clients._memory._memory_storage_client.MemoryStorageClient object at 0x335e5ef30>, attempted new value: <crawlee.storage_clients._memory._memory_storage_client.MemoryStorageClient object at 0x335e7c710>.
    Upon investigation I discovered the docs saying: > The FileSystemStorageClient is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time. So, even though it appears to be working with some basic tests, I'm not confident this approach will work. I actually don't want concurrent access, I want the storage to be separated, on a per-crawler basis. (Or otherwise, segmented within the Memory or File storage.) I'm not opposed to pointing it at '/tmp' in production, but the warning makes me doubtful that it would work correctly. I did try creating multiple memory clients by setting unique queue_id, store_id and dataset_id, but that resulted in the same error. Is this a limitation, or perhaps is there some way of doing what I'm trying to do in some other way? Thanks for your help!
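    In case it's useful to anyone hitting the same wall, the direction I'm experimenting with is per-target named storages rather than swapping the global StorageClient. A sketch only (constructor arguments may differ between Crawlee versions):
    Copy code
    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
    from crawlee.storages import Dataset, RequestQueue

    async def build_crawler(target_name: str) -> tuple[PlaywrightCrawler, Dataset]:
        # Named storages are isolated from each other and from the defaults
        request_queue = await RequestQueue.open(name=f'{target_name}-queue')
        dataset = await Dataset.open(name=f'{target_name}-results')

        crawler = PlaywrightCrawler(request_manager=request_queue)

        @crawler.router.default_handler
        async def handler(context: PlaywrightCrawlingContext) -> None:
            # Push to the per-target dataset instead of the shared default one
            await dataset.push_data({'url': context.request.url})

        return crawler, dataset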
  • Issue with Instagram Reels Playcount Scraper – Restricted Page Errors

    Razz

    07/09/2025, 7:48 AM
    Hi team, I’m using the Instagram Reels Playcount Scraper actor to extract play counts from a list of public Instagram reel URLs. Many of the reels are returning an error: "restricted_page", even though they are accessible in a regular browser without login. Examples of such URLs: - https://www.instagram.com/reel/DLjZ8YdtlFg/ - https://www.instagram.com/reel/DLjY37oNmP6/ - https://www.instagram.com/reel/DLjdQLys0bd/ These reels are publicly visible, and do not have any location, age, or login restrictions. Could you please check if there’s a bug or some proxy/user-agent issue causing this? Or let me know how to avoid this error. Also, please confirm if there's any recent change in Instagram behavior that might affect scraping. Thanks,
  • Hey friends! 👋I'm now offering powerful, automated web scraping services using Apify — the same p

    Adeyemi Bamidele

    06/29/2025, 5:45 PM
    Hello 👋👋 @everyone
  • Get metadata in response of /run-sync via API

    Felix

    06/27/2025, 9:29 AM
    Hi, I built an actor that runs smoothly, but now I am having trouble accessing all relevant data via API calls. The goal is to start the run via /run-sync and later access the files that the run stored in the keyValueStore via the API. My problem is that when I start the run via /run-sync, the run starts and returns the result of the run, but there is no ID or any other information that would tell me which keyValueStore the files were stored in, so I can't access the files. Ideally the file URLs would already be included in my response, but it would also be okay to just get an ID of the store and then make another request to get the files. So my question is: how can I get from a new run, started via the API, to accessing the files in the storage? Here is my current response, which does not include any such information as far as I can tell:
    Copy code
    [
      {
        "url": "http://handelsregister.de",
        "title": "Registerportal | Homepage",
        "resultsArray": [
          {
            "12345": {
              "court": "District court Mainz HRB 12345",
              "firma": "Example UG (haftungsbeschränkt)",
              "sitz": "Mainz",
              "downloadedFiles": [
                "Registerauszug",
                "StrukturierterDatensatz",
                "Gesellschaftsvertrag1"
              ]
            }
          }
        ],
        "#error": false,
        "#debug": {
          "requestId": "XXXXXXXXXXXXX",
          "url": "http://handelsregister.de",
          "loadedUrl": "https://www.handelsregister.de/rp_web/welcome.xhtml",
          "method": "GET",
          "retryCount": 0,
          "errorMessages": [],
          "statusCode": 200
        }
      }
    ]
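    In case it helps, a sketch of the route I'd try with the apify-client Python package (just to illustrate; the same storage IDs appear on the run object returned by the asynchronous "runs" REST endpoint too). The actor name, input, and token below are placeholders: .call() waits for the run to finish and returns the run object, which, unlike the /run-sync response, includes the storage IDs.
    Copy code
    from apify_client import ApifyClient

    client = ApifyClient(token='YOUR_APIFY_TOKEN')  # placeholder token

    # Returns the run object, including defaultKeyValueStoreId / defaultDatasetId
    run = client.actor('username/actor-name').call(run_input={'some': 'input'})

    store = client.key_value_store(run['defaultKeyValueStoreId'])
    for item in store.list_keys()['items']:
        record = store.get_record(item['key'])
        print(item['key'], record['value'])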
  • Using browser_new_context_options with PlaywrightBrowserPlugin

    Glitchy_mess

    06/26/2025, 7:23 AM
    Hello, I'm very confused about how to use browser_new_context_options, because the error I get implies that storage_state is not a parameter that Playwright's new_context call supports. However, in the Playwright docs storage_state seems to support uploading cookies as a dictionary, and when I run Playwright by itself and upload cookies through this method, everything works fine. A lot of the cookie-related questions in this forum seem to predate the most recent build, so I was wondering what the syntax should be to properly load cookies through the PlaywrightBrowserPlugin class, as everything else seems to work just fine once that's sorted. TL;DR: what's the syntax for using browser_new_context_options so that it can load a cookie that's set up as a dictionary?
    Copy code
    # cookieDictionary is just a dictionary of cookie parameters that is
    # formatted per Playwright's preferred format
    browserPlugin = PlaywrightBrowserPlugin(
        browser_new_context_options={"storage_state": {"cookie": cookieDictionary}}
    )
    browserPoolVar = BrowserPool(plugins=[browserPlugin])
    https://cdn.discordapp.com/attachments/1387694806897000529/1387694807174086667/image.png?ex=685e4700&is=685cf580&hm=4e4c893a63f9d2cd430af9a6eadef9537a6672eb6f55aed831f33d4dcf8a1358& https://cdn.discordapp.com/attachments/1387694806897000529/1387694807694049300/image.png?ex=685e4700&is=685cf580&hm=3a036b370f6739c79e98679c31655f8938932448b178776fe63cd1556441ba82&
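    For what it's worth, Playwright's storage_state expects a "cookies" key holding a list of cookie dicts rather than a single "cookie" entry, so my current guess at the right shape is below (a sketch; whether PlaywrightBrowserPlugin forwards it to new_context unchanged is exactly the part I'm unsure about, and the cookie values are placeholders):
    Copy code
    from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

    # List of cookie dicts in Playwright's own format
    cookie_list = [
        {"name": "session", "value": "abc123", "domain": ".example.com", "path": "/"},
    ]

    browser_plugin = PlaywrightBrowserPlugin(
        browser_new_context_options={
            "storage_state": {"cookies": cookie_list, "origins": []},
        }
    )
    browser_pool = BrowserPool(plugins=[browser_plugin])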
  • enqueue_links does not find any links

    Miro

    06/14/2025, 11:59 AM
    Hello, I encountered a weird issue where enqueue_links does not find any links on a webpage, specifically https://nanlab.tech. It does not find any links no matter what strategy I choose. I also tried extract_links, which managed to find all links with strategy all, but with strategies same-origin and same-hostname no link is extracted, and with strategy same-domain there is an error. I am using the latest version of Crawlee for Python (0.6.10) and for scraping I am using Playwright. Any idea what might be the issue? Here is the handler:
    Copy code
    @self.crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:  # type: ignore
        text = await context.page.content()
        self._data[context.request.url.strip()] = {
            "html": text,
            "timestamp": (
                datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            ),
        }
        await asyncio.sleep(self._sleep_between_requests)
        links = await context.extract_links()
        print("---------------------------------------------------", len(links), links)
        await context.enqueue_links(exclude=[self._blocked_extensions])
    I am also setting max_requests to 100 and max_crawl_depth to 2 when creating the crawler.
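    In case it helps with reproducing this, a small debug addition I'd drop into the handler (a sketch) to compare what Playwright itself sees against what extract_links returns:
    Copy code
    # Inside the request handler, right after extract_links():
    raw_hrefs = await context.page.eval_on_selector_all(
        'a[href]', 'els => els.map(el => el.href)'
    )
    context.log.info(
        f'Playwright sees {len(raw_hrefs)} anchors; extract_links returned {len(links)}'
    )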
  • Crawler always stops after exactly 300 seconds

    Mikmokki

    06/03/2025, 1:00 PM
    I use Crawlee for Python in Docker and it always stops after exactly 300 seconds. I checked that it gets an asyncio.CancelledError in the AutoscaledPool.run() method, but I don't know what sends it. If I try a simple Python example, keep_alive works, but in my dockerized system it just always emits the final statistics after 300 seconds and stops. I checked that this happens with multiple different crawler types.
  • Searching for Developer of Apollo scraper 50k leads-code_crafter

    Rashi Agarwal

    05/23/2025, 3:46 PM
    Hi! I'm using an actor developed by Code Pioneer (code_crafter) and have a quick question. Does anyone know if they're on here, or how best to reach them?
  • PlaywrightCrawler extract_links does not respect strategy

    phughesion-h3

    05/22/2025, 11:58 PM
    I was crawling a test site that I have hosted locally (localhost). My PlaywrightCrawler subclass must add additional
    user_data
    to each request, so I extract and form each request manually.
    Copy code
    new_requests = await context.extract_links(strategy='same-origin')
    for new_request in new_requests:
        print(f"[LINK] Extracted link: {new_request.url}")
    I start my crawl pointed at
    http://localhost
    , yet the crawler ends up crawling YouTube since there is a link to YouTube on my site.
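    The stopgap I'm using in the meantime is to filter by origin myself before adding the requests. A sketch only; the user_data key is just an example:
    Copy code
    from urllib.parse import urlparse

    START_ORIGIN = ('http', 'localhost')  # scheme and host of the start URL

    new_requests = await context.extract_links(strategy='same-origin')
    for new_request in new_requests:
        parsed = urlparse(new_request.url)
        if (parsed.scheme, parsed.hostname) != START_ORIGIN:
            continue  # belt-and-braces filter while the strategy issue is open
        new_request.user_data['source'] = context.request.url  # example user_data
        await context.add_requests([new_request])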
  • Scraped tweets are all mock tweets

    Andrew

    05/12/2025, 7:10 AM
    logger.info(f"Starting Twitter Scraper actor for users: {info.x_user}") run = client.actor("kaitoeasyapi/twitter-x-data-tweet-scraper-pay-per-result-cheapest").call( run_input=run_input) annos = client.dataset(run["defaultDatasetId"]).iterate_items() for anno in reversed(list(annos)): print('anno', anno) text = anno.get("text") url = anno.get("url") created_at = anno.get("createdAt") 14:49:04.972 | INFO | Task run 'get_x_announcements-db2' - Starting Twitter Scraper actor for users: H_O_L_O_ 14:49:13.756 | INFO | Task run 'get_x_announcements-db2' - anno {'type': 'mock_tweet', 'id': -1, 'text': "From KaitoEasyAPI, a reminder:Our API pricing is based on the volume of data returned. However, to ensure we can cover our costs on the Apify platform, we have a minimum charge of $X per API call,even if the response contains no results.Thus, we returned N pieces of mock data. We will monitor and adjust the size of N based on the infrastructure costs incurred by Apify.This helps offset the infrastructure costs we incur for each API request, regardless of the data returned. We want to be upfront about this policy so you understand the pricing structure when using our service.Despite adding mock data when no results are found, we're still barely breaking even. Our prices are already very reasonable, but we're committed to maintaining this service to support the growth of your project.Please let us know if you have any other questions!"}
  • How to send an URL with a label to main file?

    Kike

    05/02/2025, 5:08 PM
    I am trying to send a URL with a label and user data from the main file, in order to run this URL directly through a specific handler in the routes file. Is that possible? I am using Playwright.
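    If it helps, my understanding is that you can build the request in the main file with Request.from_url and the router will dispatch it to the matching labeled handler in routes.py. A sketch only; the URL, label, and user_data values are placeholders:
    Copy code
    from crawlee import Request

    start_request = Request.from_url(
        'https://example.com/item/123',     # placeholder URL
        label='DETAIL',                     # matches @router.handler(label='DETAIL') in routes.py
        user_data={'category': 'books'},    # arbitrary extra data, readable via context.request.user_data
    )

    await crawler.run([start_request])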
  • structlog support?

    Rykari

    04/29/2025, 1:10 PM
    Could I see an example of how structlog would be implemented officially?
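    Not an official example, but the shape I had in mind (a sketch) is to route Crawlee's standard-library log records through structlog's ProcessorFormatter so everything ends up in one format:
    Copy code
    import logging
    import structlog

    # Render stdlib log records (including Crawlee's) with structlog
    formatter = structlog.stdlib.ProcessorFormatter(
        processor=structlog.dev.ConsoleRenderer(),
    )
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)

    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)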
  • Memory is critically overloaded

    ROYOSTI

    04/23/2025, 1:15 PM
    I have an AWS EC2 instance with 64GB memory. My crawler is running in a docker container. The
    CRAWLEE_MEMORY_MBYTES
    is set to
    61440
    My docker config
    Copy code
    docker run --rm --init -t $docker_args \
        -v /mnt/storage:/app/storage \
        --user appuser \
        --security-opt seccomp=/var/lib/jenkins/helpers/docker/seccomp_profile.json \
        -e MONGO_HOST=${MONGO_HOST} \
        -e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
        -e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
        -e SPAWN=${SPAWN} \
        -e MONGO_CACHE=${MONGO_CACHE} \
        -e CRAWLEE_MEMORY_MBYTES=${CRAWLEE_MEMORY_MBYTES} \
        ${IMAGE_NAME} $prog_args
    After crawling around 15-20 minutes I receive the warning
    Memory is critically overloaded
    . But when doing the command
    free -h
    I notice I still have a lot of free memory as you can see in the screenshots. The playwright crawler is configured like this
    Copy code
    crawler = PlaywrightCrawler(
        concurrency_settings=ConcurrencySettings(
            min_concurrency=25,
            max_concurrency=125,
        ),
        request_manager=request_queue,
        request_handler=router,
        retry_on_blocked=True,
        headless=True,
    )
    I'm a bit clueless why I have this issue. In the past I had issues with zombie processes but this was solved by adding the
    --init
    to my docker run command. Do you have any idea on how I can fix this or further debug this? https://cdn.discordapp.com/attachments/1364590389650133002/1364590390216228985/Screenshot_2025-04-23_at_15.06.09.png?ex=680a3955&is=6808e7d5&hm=22ebe132be35e12e67124bf986bdd2577902832770aa286f1e272593afe0647d& https://cdn.discordapp.com/attachments/1364590389650133002/1364590390744973403/Screenshot_2025-04-23_at_15.05.40.png?ex=680a3956&is=6808e7d6&hm=f77a13c3c8e18aa4204e490f19a386806699c1ab2685383e7ec0705dd9eab270&
  • Routers not working as expected

    Matheus Rossi

    04/17/2025, 8:43 PM
    Hello everyone! First of all, thanks for this project — it looks really good and promising! I'm considering using Crawlee as an alternative to Scrapy. I'm trying to use a router to run different processes based on the URL, but the request is never captured by the handler. I'd appreciate any insights — am I missing something here? Here's my crawl.py:
    Copy code
    import asyncio
    from crawlee.crawlers import AdaptivePlaywrightCrawler
    from crawlee import service_locator
    from routes import router      
    
    async def main() -> None:
        configuration = service_locator.get_configuration()
        configuration.persist_storage = False
        configuration.write_metadata = False
    
        crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
            request_handler=router,
            max_requests_per_crawl=5,
        )
    
        await crawler.run(['https://investor.agenusbio.com/news/default.aspx'])
    
    if __name__ == '__main__':
        asyncio.run(main())
    and here my routes:
    Copy code
    from __future__ import annotations
    
    from crawlee.crawlers import AdaptivePlaywrightCrawlingContext
    from crawlee.router import Router
    from crawlee import RequestOptions, RequestTransformAction
    
    router = Router[AdaptivePlaywrightCrawlingContext]()
    
    def transform_request(request_options: RequestOptions) -> RequestOptions | RequestTransformAction:
        url = request_options.get('url', '')
    
        if url.endswith('.pdf'):
            print(f"Request options: {request_options} before")
            request_options['label'] = 'pdf_handler'
            print(f"Request options: {request_options} after")
            return request_options
    
        return request_options
    
    @router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        await context.enqueue_links(
            transform_request_function=transform_request,
        )
    
    @router.handler(label='pdf_handler')
    async def pdf_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info('Processing PDF: %s', context.request.url)
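    While debugging this, one thing worth confirming (a sketch of a temporary tweak to the default handler, not a second registration) is whether the transformed requests ever come back through the crawler, and with which label:
    Copy code
    @router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Temporary debug output: which label (if any) did this request carry?
        context.log.info(f'default_handler: {context.request.url} label={context.request.label!r}')
        await context.enqueue_links(
            transform_request_function=transform_request,
        )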
  • Dynamically change dataset id based on root_domain

    Rykari

    04/15/2025, 1:43 PM
    Hey folks. I've attached an example of my code as a snippet. Is it possible to dynamically change the dataset ID so that each link has its own dataset? https://cdn.discordapp.com/attachments/1361698553839354096/1361698553965318205/message.txt?ex=67ffb41a&is=67fe629a&hm=870a3efe67fabe8ab82272d7239e57115139ee54771b448fd0cfe62cb5f1fb44&
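    If it helps, the closest thing I've found to this is opening a named dataset per root domain inside the handler. A sketch only (crawler refers to your existing crawler object, and dataset naming rules may force you to sanitize the domain):
    Copy code
    from urllib.parse import urlparse

    from crawlee.storages import Dataset

    @crawler.router.default_handler
    async def handler(context) -> None:
        root_domain = urlparse(context.request.url).hostname or 'unknown'
        # One named dataset per root domain; open() returns the same dataset
        # for repeated calls with the same name.
        dataset = await Dataset.open(name=root_domain.replace('.', '-'))
        await dataset.push_data({'url': context.request.url})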
  • Handling of 4xx and 5xx in default handler (Python)

    rast42

    04/09/2025, 9:45 AM
    I built a crawler for crawling websites and am now trying to add functionality to also handle error pages/links like 4xx and 5xx. I was not able to find any documentation regarding that. So the question is: is this supported, and if yes, in what direction should I look?
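    For what it's worth, the two hooks I've been experimenting with in the meantime (a sketch; I'm not sure this is the officially intended way) are checking the response status inside the handler, and registering a failed-request handler for requests that exhaust their retries:
    Copy code
    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext) -> None:
        status = context.response.status if context.response else None
        if status is not None and status >= 400:
            context.log.warning(f'{context.request.url} returned HTTP {status}')
            return
        # ... normal scraping for 2xx/3xx pages ...

    @crawler.failed_request_handler
    async def failed_handler(context: PlaywrightCrawlingContext, error: Exception) -> None:
        # Called once retries are exhausted (e.g. repeated 5xx or network errors)
        context.log.error(f'Giving up on {context.request.url}: {error}')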
  • Camoufox and adaptive playwright

    Doigus

    04/06/2025, 8:48 AM
    Hello great friends of Crawlee, I was wondering if there is any way to use Camoufox with the adaptive Playwright crawler? It seems to throw an error when I try to add the browser pool.
  • Why do I get scraped content from the first URL when I passed another URL?

    BlackCoder

    04/04/2025, 5:47 AM
    I implemented a Playwright crawler to parse a URL. I made a single request to the crawler with the first URL, and while that request was still being processed I passed another URL to the crawler and hit the request again. Both times the crawler processed content from the first URL instead of the second. Can you please help?
    Copy code
    async def run_crawler(url, domain_name, save_path=None):
        print("doc url inside crawler file====================================>", url)
        crawler = PlaywrightCrawler(
            max_requests_per_crawl=10,
            browser_type='firefox',
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            context.log.info(f'Processing {url} ...')
            links = await context.page.evaluate(f'''() => {{
                return Array.from(document.querySelectorAll('a[href*="{domain_name}"]'))
                    .map(a => a.href);
            }}''')
            await context.enqueue_links(urls=links)
            elements = await context.page.evaluate(PW_SCRAPING_CODE)
            data = {
                'url': url,
                'title': await context.page.title(),
                'content': elements
            }
            print("data =================>", data)
            await context.push_data(data)

        await crawler.run([url])
    I am calling the crawler using
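    One thing that stands out in the handler (just an observation): it logs and pushes the url variable captured from run_crawler's arguments rather than context.request.url, so every page handled in a run gets recorded under the starting URL. Worth ruling out before digging deeper; a sketch of the change:
    Copy code
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Use the URL of the request being handled, not the closure variable
        current_url = context.request.url
        context.log.info(f'Processing {current_url} ...')
        data = {
            'url': current_url,
            'title': await context.page.title(),
        }
        await context.push_data(data)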
  • proxy_config.new_url() does not return new proxy

    huey louie dewey

    04/01/2025, 9:36 PM
    Here is my selenium python script, where i try to rotate proxies using the proxy_config.new_url():
    Copy code
    # Standard libraries
    import asyncio
    import logging
    import json
    
    # Installed libraries
    from selenium.common.exceptions import TimeoutException, WebDriverException
    from selenium.webdriver.chrome.options import Options as ChromeOptions
    from selenium.webdriver.common.proxy import ProxyType, Proxy
    from selenium.webdriver.common.by import By
    from selenium import webdriver
    from apify import Actor
    
    async def main() -> None:
        async with Actor:
            Actor.log.setLevel(logging.DEBUG)
            proxy_config = await Actor.create_proxy_configuration(groups=['RESIDENTIAL'])
            url = "https://api.ipify.org?format=json"
            for _ in range(10):
                proxy = await proxy_config.new_url()
                Actor.log.info(f'Using proxy: {proxy}')
                chrome_options = ChromeOptions()
                chrome_options.add_argument("--headless")
                chrome_options.add_argument("--no-sandbox")
                chrome_options.add_argument("--disable-dev-shm-usage")
                chrome_options.proxy = Proxy({'proxyType': ProxyType.MANUAL, 'httpProxy': proxy})
                try:
                    with webdriver.Chrome(options=chrome_options) as driver:
                        driver.set_page_load_timeout(20)
                        driver.get(url)
                        content = driver.find_element(By.TAG_NAME, 'pre').text
                        ip = json.loads(content).get("ip")
                        Actor.log.info(f"IP = {ip}")
                except (TimeoutException, WebDriverException, json.JSONDecodeError):
                    Actor.log.exception("An error occured")
    Due to the Discord message size limitation, I attach the log output of the above code in a new message below...
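    For context, the variation I'd try next (a sketch; the session naming is arbitrary) is to request a distinct proxy session on each iteration, in case new_url() without arguments keeps resolving to the same sticky session:
    Copy code
    # Inside the for-loop above, instead of proxy_config.new_url():
    for i in range(10):
        proxy = await proxy_config.new_url(session_id=f'session_{i}')
        Actor.log.info(f'Using proxy: {proxy}')
        # ... rest of the loop unchanged ...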
  • Proxy example with PlaywrightCrawler

    jransom33

    03/22/2025, 10:44 PM
    This is probably a simple fix, but I cannot find an example of Crawlee using a simple proxy link with Playwright. If anyone has a working example or knows what is wrong in the code, I would really appreciate your help. Here is the code I have been working with (I wish I could copy and paste the code here, but the post would go over the character limit). I get the following error from the code:
    Copy code
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/jr/Desktop/Pasos_webscraping/.venv/lib/python3.12/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
        raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
    playwright._impl._errors.Error: Page.goto: net::ERR_CERT_AUTHORITY_INVALID at https://www.instagram.com/p/DGWPnK1S0K2/
    Call log:
      - navigating to "https://www.instagram.com/p/DGWPnK1S0K2/", waiting until "load"
    Any help on how to proceed would be greatly appreciated! https://cdn.discordapp.com/attachments/1353137143576268870/1353137144151150693/Screenshot_2025-03-22_at_6.34.52_PM.png?ex=67e08eab&is=67df3d2b&hm=95999e7bd1bde6bb24da36a2d24c42c1d1d3534b4abef993cda6ffbb5a554c61&
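    In case a minimal reference helps, this is roughly the shape I'd expect the setup to take. A sketch only: the proxy URL is a placeholder, and ignore_https_errors is there purely to test whether the proxy re-signing TLS traffic is what triggers ERR_CERT_AUTHORITY_INVALID.
    Copy code
    from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
    from crawlee.crawlers import PlaywrightCrawler
    from crawlee.proxy_configuration import ProxyConfiguration

    proxy_configuration = ProxyConfiguration(
        proxy_urls=['http://user:pass@proxy.example.com:8000'],  # placeholder proxy URL
    )

    # Many HTTP(S) proxies re-sign TLS traffic with their own CA, which Chromium
    # rejects; ignoring HTTPS errors in the browser context is one way to confirm that.
    browser_pool = BrowserPool(plugins=[
        PlaywrightBrowserPlugin(browser_new_context_options={'ignore_https_errors': True}),
    ])

    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        browser_pool=browser_pool,
    )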
  • Input schema is not valid (Field schema.properties.files.enum is required)

    non_sam

    03/16/2025, 8:20 AM
    My input_schema.json:
    Copy code
    {
      "title": "Base64 Image Processor",
      "type": "object",
      "schemaVersion": 1,
      "properties": {
        "files": {
          "type": "array",
          "description": "Array of file objects to process",
          "items": {
            "type": "object",
            "properties": {
              "file": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "type": {"type": "string"},
                  "size": {"type": "integer"},
                  "content": {"type": "string"},
                  "description": {"type": "string"}
                },
                "required": ["name", "type", "size", "content", "description"]
              }
            },
            "required": ["file"]
          }
        }
      },
      "required": ["files"]
    }
    Both run and start show the error:
    2025-03-16T08:19:28.275Z ACTOR: ERROR: Input schema is not valid (Field schema.properties.files.enum is required)
    Need help.
  • Issues Creating an Intelligent Crawler & Constant Memory Overload

    kashyab

    03/13/2025, 9:51 PM
    Hey there! I am creating an intelligent crawler using crawlee. Was previously using
    crawl4ai
    but switched since
    crawlee
    seems much better at anti-blocking. The main issue I am facing is that I want to filter the URLs to crawl for a given page using LLMs. Is there a clean way to do this? So far I implemented a transformer for enqueue_links which saves the links to a dict, and then I process those dicts at a later point in time using another crawler object. Any other suggestions to solve this problem? I don't want to make the LLM call in the transform function, since that would be one LLM call per URL found, which is quite expensive. Also, when I run this on my EC2 instance with 8GB of RAM it constantly runs into memory overload and just gets stuck, i.e. it doesn't even continue scraping pages. Any idea how I can resolve this? This is my code currently
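    For reference, the batching approach described above (record candidate links in the transform, skip them, then filter the whole set with a single LLM call) might look roughly like this sketch; llm_filter and the second-pass crawler are hypothetical helpers:
    Copy code
    from crawlee import RequestOptions, RequestTransformAction

    candidate_urls: list[str] = []

    def collect_for_llm(request_options: RequestOptions) -> RequestOptions | RequestTransformAction:
        # Don't enqueue anything yet; just remember the URL so one batched
        # LLM call can rank/filter the whole set later.
        candidate_urls.append(request_options['url'])
        return 'skip'

    # In the handler: await context.enqueue_links(transform_request_function=collect_for_llm)
    # After the first crawl finishes:
    #     filtered = llm_filter(candidate_urls)   # hypothetical single batched LLM call
    #     await second_crawler.run(filtered)      # second pass over the approved URLs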
  • Selenium + Chrome Instagram Scraper cannot find the Search button when I run it in Apfiy..

    KrundleDugins

    03/07/2025, 4:49 PM
    Hey everyone, I have built an Instagram scraper using Selenium and Chrome that works perfectly until I deploy it as an actor here on Apify. It signs in fine but fails every time, no matter what I do or try, when it gets to the Search button. I have iterated through:
    Copy code
    # 1)
    search_icon = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "svg[aria-label='Search']"))
    )
    search_icon.click()

    # 2)
    search_icon = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.XPATH, "//span[contains(., 'Search')]"))
    )
    search_icon.click()

    # 3)
    search_icon = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.XPATH, "//svg[@aria-label='Search']"))
    )
    search_icon.click()

    # 4)
    try:
        search_button = WebDriverWait(driver, 30).until(
            EC.element_to_be_clickable((
                By.XPATH,
                "//a[.//svg[@aria-label='Search'] and .//span[normalize-space()='Search']]"
            ))
        )
        # Scroll the element into view just in case
        driver.execute_script("arguments[0].scrollIntoView(true);", search_button)
        search_button.click()
    except TimeoutException:
        print("Search button not clickable.")

    # 5)
    search_button = WebDriverWait(driver, 30).until(
        EC.element_to_be_clickable((
            By.XPATH,
            "//a[.//svg[@aria-label='Search'] and .//span[normalize-space()='Search']]"
        ))
    )
    driver.execute_script("arguments[0].scrollIntoView(true);", search_button)
    search_button.click()
    And I have tried all of these with residential proxies, data center proxies, and at different timeout lengths. NOTHING works, and there is nothing that I can find in the documentation to help with this issue. Does anyone have any insight into this? I'd understand if this was failing to even sign in, but it is failing at the Search button. Is the page rendered differently for Apify than it is if you're running this from your computer, maybe?
  • Error on cleanup PlaywrightCrawler

    ROYOSTI

    03/05/2025, 11:57 AM
    I use PlaywrightCrawler with
    headless=True
    The package that I use is:
    crawlee[playwright]==0.6.1
    When running the crawler I noticed that, while waiting for remaining tasks to finish, it sometimes receives an error like the one you can see in the screenshot. Is this something that can be resolved easily? I ask because I think this error is also related to another issue I have. In my code I have my own batching system in place, but I noticed that my memory slowly started to increase on each batch. After some investigation I saw that
    ps -fC headless_shell
    gave me a lot of headless_shell processes with
    <defunct>
    (zombie processes). So I assume this is related to the cleanup that is failing on each crawl. Below you can see my code for the batching system:
    Copy code
    # Create key values stores for batches
    scheduled_batches = await prepare_requests_from_mongo(crawler_name)
    processed_batches = await KeyValueStore.open(
        name=f'{crawler_name}-processed_batches'
    )

    # Create crawler
    crawler = await create_playwright_crawler(crawler_name)

    # Iterate over the batches
    async for key_info in scheduled_batches.iterate_keys():
        urls: List[str] = await scheduled_batches.get_value(key_info.key)
        requests = [
            Request.from_url(
                url,
                user_data={
                    'page_tags': [PageTag.HOME.value],
                    'chosen_page_tag': PageTag.HOME.value,
                    'label': PageTag.HOME.value,
                },
            )
            for url in urls
        ]
        LOGGER.info(f'Processing batch {key_info.key}')
        await crawler.run(requests)
        await scheduled_batches.set_value(key_info.key, None)
        await processed_batches.set_value(key_info.key, urls)
  • Google Gemini Applet - Google Module Not Found (even though it is there)

    KrundleDugins

    03/03/2025, 2:57 PM
    Hey all, I have a question about whether I can actually use Apify to access Google Gemini for video analysis. I've built my own Python version of the Gemini Video Analyzer applet that analyzes social media videos for content style, structure, and aesthetic qualities, and it works. I have installed all the Google dependencies required, but when I try to run it as an actor using "apify run --purge", no matter what I do it says no module named google found. Is this a bug with Apify? There is no explicit "google" folder in Lib\site-packages, but when I check the file path it is there:
    Copy code
    PS C:\Users\Ken\Apify\run> pip show google-generativeai
    Name: google-generativeai
    Version: 0.5.2
    Summary: Google Generative AI High level API client library and tools.
    Home-page: https://github.com/google/generative-ai-python
    Author: Google LLC
    Author-email: googleapis-packages@google.com
    License: Apache 2.0
    Location: C:\Users\Ken\AppData\Local\Programs\Python\Python313\Lib\site-packages
    Requires: google-ai-generativelanguage, google-api-core, google-api-python-client, google-auth, protobuf, pydantic, tqdm, typing-extensions
    Required-by:
    PS C:\Users\Ken\Apify\run>
    Has anyone else run into this issue? I was excited to try and recreate my agent team on Apify, but I keep running into all these problems I haven't had anywhere else, and I'm starting to wonder if it's worth putting in the time to continue using Apify. Don't get me wrong, I think Apify is great for launching simple things like a YouTube scraper, etc., but for things like deploying a 30-agent team as an app I am starting to wonder if the learning curve for using Apify to do this is worth the time.
  • "apify run" no longer able to detect python

    Hall

    03/02/2025, 5:35 PM
    Someone will reply to you shortly. In the meantime, this might help:
  • ImportError: cannot import name 'service_container

    Hall

    03/01/2025, 6:15 PM
    Someone will reply to you shortly. In the meantime, this might help: