# questions

    Yury Fedotov

    05/14/2025, 5:46 AM
Hi team, see this example of a pipeline: a single node, no inputs, one output. Why does it perform the last logged operation, loading the dataset back, after the pipeline has completed successfully? P.S. That doesn't cause any issues on my end, I'm just curious.
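(The example itself isn't captured in this export; as a hypothetical reconstruction, a single-node, no-input, one-output pipeline would look roughly like this, with made-up names:)
```python
from kedro.pipeline import Pipeline, node

def make_data():
    """Hypothetical node: no inputs, returns one output."""
    return {"a": 1}

pipeline = Pipeline([node(make_data, inputs=None, outputs="my_dataset")])
```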

    Mattis

    05/15/2025, 8:11 AM
Hi everybody, what could be the issue when I try to run a Kedro pipeline in AzureML here? I also tried PickleDatasets, but got the same result. Are there any known issues so far with Kedro pipeline conversion/transferability to AzureML? Or am I missing something else? Thanks in advance! Additional info: I create the container like this:
```
docker build --progress=plain --build-arg BASE_IMAGE=python:3.10.16-slim -t ABCDE.azurecr.io/kedro:latest .
```
    And submit the job like this:
```
kedro azureml run -p de -s FGHZUI --aml-env kedro_env
```

    Adrien Paul

    05/15/2025, 10:09 AM
Hi everyone, I'm encountering an issue with the kedro-azureml plugin: every time I run the CLI, it takes around 1 minute and 10 seconds to start up. Has anyone else experienced this slow startup behavior? Thanks in advance!

    Zubin Roy

    05/15/2025, 1:05 PM
Hey everybody, I have a general/best-practice question. Loving using Kedro, and I have created a neat pipeline that trains my models the way I want. I want to scale this up across x number of markets, as my model predicts results at the market level. So I thought I'd use a modular pipeline in order to reuse my pipeline code. I like this approach as it means I don't have to rewrite any code, just copy across my parameter and catalog files. The only problem is that my catalog and params files are now getting quite large, and I was wondering whether there is a way to dynamically populate catalog or parameter entries, or whether there is a better approach within Kedro (or generally) to avoid large catalog/parameter files. Or perhaps this is just the trade-off when using modular pipelines. Thanks!
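One pattern that may help here (a sketch, not an official recommendation; the market list and function names are hypothetical): build the per-market pipelines programmatically with namespaces, so the Python side stays one loop.
```python
from kedro.pipeline import Pipeline, node, pipeline

def train_model(model_input):
    """Hypothetical stand-in for the market-level training logic."""
    ...

base = pipeline([node(train_model, inputs="model_input", outputs="model")])

def create_pipeline(**kwargs) -> Pipeline:
    markets = ["us", "uk", "de"]  # hypothetical market list
    # One namespaced copy of the base pipeline per market.
    return sum(
        (pipeline(base, namespace=market) for market in markets),
        Pipeline([]),
    )
```
In Kedro versions that support dataset factories, a single templated catalog entry such as "{namespace}.model_input" can then serve every market, so the catalog no longer grows with the number of markets.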

    Jonathan Dekermanjian

    05/15/2025, 7:36 PM
Hello everyone, I have a question about memory management while using Kedro. I have a Kedro project that consists of two pipelines (data_processing_pipeline & ML_pipeline). My data processing is done using Spark, which gets initialized with Kedro hooks. At the end of my data_processing pipeline the results are written to disk as a SparkDataset. My issue is that when kedro run is done with the data_processing pipeline and is executing the ML pipeline, the Spark session is still holding on to the memory it utilized during processing. I know this because 20 minutes into the ML portion I can kill the Spark worker with the Spark UI, and this releases a significant amount of memory. My question is: how do I tell Kedro to release objects that are no longer needed (the dataset is not used beyond the data_processing step) from memory?
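One avenue worth exploring (a sketch, not a confirmed fix; the dataset name is hypothetical): a hook that clears Spark's cache once the final Spark output has been saved. Note this only releases cached DataFrames/RDDs; memory held by the JVM executors themselves may still require spark.stop() or dynamic allocation settings.
```python
from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession

class SparkReleaseHook:
    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data):
        # "processed_data" is a hypothetical name for the final Spark output.
        if dataset_name == "processed_data":
            spark = SparkSession.builder.getOrCreate()
            spark.catalog.clearCache()  # drop cached DataFrames/RDDs
```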

    Michał Gozdera

    05/19/2025, 10:48 AM
Hello, I have a question about the right solution for logging pipeline errors in log files. For example, in spaceflights-pandas we have info_file_handler defined, which logs into the info.log file, but when a DatasetError is raised (for example, the dataset CSV file is missing), it is not logged in info.log; the traceback and error are visible only in the console. How can I make it be logged in the log file as well? I can always define a hook like this:
```python
import logging
from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)

class ErrorCatchLoggingHook:
    @hook_impl
    def on_pipeline_error(self, error: Exception):
        logger.exception("Pipeline failed due to an error: %s", error)
```
    but then the error log in the console is duplicated.
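One way to avoid the duplicate console traceback (a sketch, building on the hook above): give the hook a dedicated logger that writes only to the file and does not propagate to the root logger's console handler.
```python
import logging

# File-only logger: propagate=False keeps the traceback out of the
# console handler attached to the root logger.
logger = logging.getLogger("pipeline_errors")
logger.propagate = False
logger.addHandler(logging.FileHandler("info.log"))
```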

    juanc

    05/19/2025, 1:32 PM
Hi everyone! I'm new to Kedro and I'm looking for a way to use pandas.read_html in the data catalog YAML, or any other input method I'm missing. Thank you all.
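As far as I know there is no built-in read_html dataset, but a thin custom dataset is one option. A minimal sketch (hypothetical class name; assumes a recent Kedro with kedro.io.AbstractDataset):
```python
import pandas as pd
from kedro.io import AbstractDataset

class HTMLTableDataset(AbstractDataset):
    """Loads the first HTML table found at `filepath` via pandas.read_html."""

    def __init__(self, filepath: str, load_args: dict | None = None):
        self._filepath = filepath
        self._load_args = load_args or {}

    def _load(self) -> pd.DataFrame:
        # pandas.read_html returns a list of DataFrames; take the first.
        return pd.read_html(self._filepath, **self._load_args)[0]

    def _save(self, data: pd.DataFrame) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> dict:
        return {"filepath": self._filepath}
```
It could then be referenced from the catalog YAML via a type such as <your_package>.datasets.HTMLTableDataset (path hypothetical) with filepath pointing at the page URL.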

    Jamal Sealiti

    05/20/2025, 11:54 AM
Hi, are there any plans to make Kedro natively built for streaming (Spark Streaming for reading, writing, deleting, and merging streaming data) without using custom nodes and hooks?

    Fazil Topal

    05/20/2025, 8:14 PM
Hi, is there a way to pass a dict variable in Kedro node inputs? Example:
```python
node(myfunc, inputs=dict(x="ds1", y=dict(subv="test1", subc="test2")), outputs="out")
```
Basically, I would expect Kedro to pass the resolved dict into the node as a dict. Right now that's not possible, as Kedro complains and wants the values as strings. Not really sure how to get around that.
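A common workaround (a sketch with hypothetical names): node inputs must be dataset or parameter names, so nested literals can live in parameters.yml and be referenced with the params: prefix.
```python
from kedro.pipeline import node

def myfunc(x, y: dict):
    """Stub for the user's function; receives the resolved params dict."""
    ...

# Assumes parameters.yml contains:
# y:
#   subv: test1
#   subc: test2
my_node = node(myfunc, inputs={"x": "ds1", "y": "params:y"}, outputs="result")
```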

    Armand Masseau

    05/21/2025, 9:32 AM
Hello everyone. I am new to Kedro and I want to execute a pipeline (that works well when run sequentially) in parallel using --runner=ParallelRunner. I am facing the following issue: every time I try it, the first node executes and saves its output as a SharedMemoryDataset, but just after that the other nodes "have not run" and a RecursionError is raised ("maximum recursion depth exceeded") coming from line 284 in set_project_logging in kedro/framework/project/__init__.py: if package_name not in self.data['loggers']. Does anyone know where this comes from?

    Armand Masseau

    05/21/2025, 2:04 PM
I have another question. Is it possible to link the catalog and params? Currently I am using a globals file from which both the parameters and the catalog source values, because one of the file names used in the pipeline contains a globals variable. I would like to get rid of the globals file and only use parameters, because parameters support runtime overrides.

    Adrien Paul

    05/21/2025, 2:16 PM
Hello, I have a bug with kedro_azureml.dataset.AzureMLAssetDataset in the kedro-azureml plugin. It seems to be related to AzureMachineLearningFileSystem and this issue: https://github.com/Azure/azure-sdk-for-python/issues/37089. Has anyone succeeded in using AzureMLAssetDataset in version 0.9.0?

    Jonghyun Yun

    05/22/2025, 2:30 PM
Hi Team, I have a daily job saving several versioned datasets. A downstream process (not written in Kedro) needs <version> (e.g. data/01_raw/company/cars.csv/<version>/cars.csv) so that it can pick up the correct datasets to process. Is there a way to know which <version> is being used by Kedro?
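One hedged option, relying on the behavior (in recent Kedro releases) that every versioned dataset in a run shares a single save version equal to the session timestamp: publish that value from a hook where the downstream job can read it.
```python
from pathlib import Path

from kedro.framework.hooks import hook_impl

class PublishVersionHook:
    @hook_impl
    def before_pipeline_run(self, run_params: dict):
        # Assumption: session_id doubles as the save version of all
        # versioned datasets in this run (true in recent Kedro releases).
        Path("last_version.txt").write_text(run_params["session_id"])
```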

    Richard Asselin

    05/22/2025, 2:42 PM
Hi there! Just have a quick question re: running kedro-viz from within a virtual environment. For some reason it seems to always pick the version of kedro-viz from my main Python and not the one in the virtual env (i.e., I have v11.0.0 in my main Python but v11.0.1 in my virtual env, and running kedro viz from within the virtual env picks up the 11.0.0 version). Is it just something I'm doing incorrectly? Is that the expected behaviour? Thanks!

    coder xu

    05/28/2025, 12:22 AM
Why does my parquet file in S3 produce errors like this?
```
DatasetError: Failed while loading data from dataset ParquetDataset(filepath=kedro/model_input_table.parquet, load_args={}, protocol=s3, save_args={}).
Expected checksum PqKP+A== did not match calculated checksum: eqRztQ==
```

    coder xu

    05/28/2025, 12:23 AM
This is my catalog:
```yaml
model_input_table:
  type: pandas.ParquetDataset
  filepath: s3://kedro/model_input_table.parquet
#  type: pandas.CSVDataset
#  filepath: s3://kedro/model_input_table.csv
```
The CSV version works fine.
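If the endpoint is an S3-compatible store (e.g. MinIO) rather than AWS S3 itself, the mismatch may come from botocore's newer default integrity checks. A hedged workaround, assuming botocore >= 1.36, is to relax them before the run:
```python
import os

# Assumption: newer botocore enables CRC integrity checks by default,
# which some S3-compatible backends do not implement correctly.
os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"
os.environ["AWS_RESPONSE_CHECKSUM_VALIDATION"] = "when_required"
```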

    Jamal Sealiti

    05/28/2025, 11:32 AM
Hi, how can I set up Kedro with Grafana for tracking node/pipeline data progress?

    Jamal Sealiti

    05/30/2025, 10:24 AM
How does Kedro handle merging two streaming datasets on some merge keys? And deleting?

    Jamal Sealiti

    05/30/2025, 12:19 PM
Is it possible to create a custom Delta table dataset with the change data capture option? And how can I create a table from my custom schema before writeStream?

    Yury Fedotov

    05/30/2025, 2:27 PM
Are small contributions to docs (like typo fixes) being accepted now? Asking because I see you're migrating to MkDocs, so maybe it's not the best time in terms of avoiding merge conflicts.

    Trọng Đạt Bùi

    06/02/2025, 10:06 AM
Hello everyone! Has anyone tried to manually create a pipeline (not using Kedro's auto-registered pipelines)?
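For reference, a minimal sketch (hypothetical names) of registering a pipeline by hand in pipeline_registry.py instead of relying on find_pipelines():
```python
from kedro.pipeline import Pipeline, node

def double(x):
    """Hypothetical node function."""
    return x * 2

def register_pipelines() -> dict[str, Pipeline]:
    manual = Pipeline([node(double, inputs="input_ds", outputs="output_ds")])
    return {"manual": manual, "__default__": manual}
```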

    Ankit K

    06/02/2025, 3:19 PM
Hi all, I'm working on a Kedro pipeline (using the kedro-vertexai plugin, version 0.10.0) where I need to track each pipeline run in a BigQuery table. We use a table_suffix (typically a date or unique run/session ID) to uniquely identify data and outputs for each pipeline run, ensuring that results from different runs do not overwrite each other and can be traced back to a specific execution.

The challenge is that the Kedro session_id or KEDRO_CONFIG_RUN_ID is not available at config load time, so early config logic (like setting a table_suffix) uses a date or placeholder value. This can cause inconsistencies, especially if nodes run on different days or the pipeline is resumed (currently the pipeline takes ~2.5 days to run). We tried generating the table_suffix from the current date at config load time, but this led to issues: if a node runs on a different day or the pipeline is resumed, a new table_suffix is generated, causing inconsistencies and making it hard to track a single pipeline run. We also experimented with different Kedro hooks (such as before_pipeline_run and before_node_run) to set or propagate the run/session ID, but still faced challenges ensuring the value is available everywhere, including during config loading.

What is the best practice in Kedro (with Vertex AI integration) for generating and propagating a unique run/session ID that is available everywhere (including config loading and all nodes), so that all tracking and table suffixes are consistent for a given run? Should this be set as an environment variable before Kedro starts, or is there a recommended hook or config loader pattern for this? Any advice or examples would be appreciated!
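One pattern worth trying, sketched under assumptions (OmegaConfigLoader; a hypothetical PIPELINE_RUN_ID env var that must be set once at submission time so every Vertex AI container sees the same value): expose the run ID to config via a custom resolver in settings.py.
```python
import os
import uuid

from kedro.config import OmegaConfigLoader

# Fallback for local runs only; on Vertex AI the variable should be set
# once at submission time, otherwise each container generates its own.
os.environ.setdefault("PIPELINE_RUN_ID", uuid.uuid4().hex[:8])

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {"run_id": lambda: os.environ["PIPELINE_RUN_ID"]},
}
# parameters.yml could then use: table_suffix: "${run_id:}"
```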

    Arnout Verboven

    06/03/2025, 11:00 AM
Hi! If I have 2 configuration environments (local and prod), is it possible to know during pipeline creation which environment is run? Or how should I do this using proper Kedro patterns? E.g. I want to do something like:
```python
def create_pipeline(env: str = "local") -> Pipeline:
    if env == "prod":
        return create_pipeline_prod()
    else:
        return create_pipeline_local()
```
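Pipeline creation has no direct access to the --env flag, so one workaround (a sketch; assumes the environment is exported before the run, e.g. KEDRO_ENV=prod kedro run, which Kedro also reads as the default --env):
```python
import os

from kedro.pipeline import Pipeline

def create_pipeline_local() -> Pipeline: ...  # as in the snippet above
def create_pipeline_prod() -> Pipeline: ...   # as in the snippet above

def create_pipeline(**kwargs) -> Pipeline:
    env = os.environ.get("KEDRO_ENV", "local")
    return create_pipeline_prod() if env == "prod" else create_pipeline_local()
```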

    Abhishek Bhatia

    06/10/2025, 5:38 AM
Hey team! Not a Kedro question per se. What is the go-to tooling for configuration management in data science projects outside of Kedro (with OmegaConf)? Is Hydra the most popular choice? I am looking at the following features:
1. Global configs
2. Clear patterns for config type
   a. Static vs dynamic
   b. Global vs granular
   c. Constant vs overridable
3. Param overriding with globals
4. Param overriding within a config file
5. Support for environment variables
6. Storing environment-wise configs: DEV / STG / UAT / PROD etc.
7. Interpolation with basic text concat
8. (Optional) Python functions as resolvers in config (OmegaConf)
9. Config compilation artifact (i.e. I want to see how my config looks after resolving)
10. Invoking Python scripts with arbitrary / alternate config paths
11. Invoking Python scripts with specific param values
Most of the above features are already there in Kedro, but I need this functionality outside Kedro. Eager to hear the community's recommendations here! 🙂
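Hydra and plain OmegaConf both cover most of this list. A small OmegaConf sketch (illustrative names) showing interpolation, environment variables with defaults, and a Python resolver (items 3, 5, 7, 8, 9):
```python
from omegaconf import OmegaConf

# A Python function exposed as a config resolver (item 8).
OmegaConf.register_new_resolver("upper", lambda s: s.upper())

cfg = OmegaConf.create(
    """
    env: ${oc.env:DEPLOY_ENV,dev}   # env var with default (item 5)
    project: demo
    table: ${project}_${env}        # interpolation + concat (item 7)
    shout: ${upper:${project}}      # resolver call (item 8)
    """
)
print(OmegaConf.to_yaml(cfg, resolve=True))  # "compiled" config (item 9)
```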

    Malek Bouzidi

    06/10/2025, 12:34 PM
Hi all. I've been trying Kedro for the past few weeks. Everything worked well except for kedro-viz: it doesn't display the previews of the datasets. I followed all the instructions in the docs but nothing worked. Can someone help me figure out why it doesn't work?

    Sharan Arora

    06/10/2025, 5:53 PM
Hi, I'm getting an error when doing kedro run. Would you be able to help? I have Java 17 installed and I'm unsure why the code gets stuck on that last line; I always have to abort.

    Sharan Arora

    06/11/2025, 1:35 AM
Just to follow up: I'm receiving a FileNotFoundError: [WinError 2] The system cannot find the file specified. I've double-checked my path in environment variables and can't find an issue.

    Jonghyun Yun

    06/11/2025, 9:46 PM
Hi Team, I'm using Kedro 0.18.6 and it seems to have a bug. When I create and run a part of a composite pipeline, it actually runs everything in it. For example, running pipe["a"] will trigger pipe["b"] and pipe["c"] too. I don't think this is expected behavior. I cannot upgrade Kedro above 0.18.xx. Was there a fix for this issue?

    Trọng Đạt Bùi

    06/12/2025, 6:41 AM
Has anyone tried to customize SparkDataset to read multiple folders in HDFS?
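A sketch of one approach (hypothetical class; relies on SparkDataset internals such as self._file_format, which may shift between kedro-datasets versions): subclass SparkDataset and pass a list of paths to spark.read.load, which accepts multiple paths natively.
```python
from kedro_datasets.spark import SparkDataset
from pyspark.sql import DataFrame, SparkSession

class MultiFolderSparkDataset(SparkDataset):
    """Hypothetical dataset: loads several HDFS folders into one DataFrame."""

    def __init__(self, filepaths: list[str], file_format: str = "parquet", **kwargs):
        self._filepaths = filepaths
        super().__init__(filepath=filepaths[0], file_format=file_format, **kwargs)

    def _load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        # spark.read.load accepts a list of paths natively.
        return spark.read.load(self._filepaths, format=self._file_format)
```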

    Mattis

    06/16/2025, 12:55 PM
I have configured a dynamic pipeline (catalog and nodes) with a hooks file. Locally it's running in a Docker container without problems, but when I push it to AzureML and run it there, even though I can see the whole pipeline (and all dynamically created node names), I receive "pipeline does not contain that .. node". How is this even possible? Does anyone have a clue?