# questions
  • e

    Elvira Salakhova

    04/07/2025, 8:29 AM
    Hello, everyone! How do you manage model versioning in MLflow inside a Kedro pipeline?
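    A minimal sketch of one common approach, assuming plain MLflow (no kedro-mlflow) and a scikit-learn model inside a node; the node name and the "candidate_model" registry name are made up for illustration:

    import mlflow
    import mlflow.sklearn
    from sklearn.base import BaseEstimator


    def train_and_register(model: BaseEstimator, X_train, y_train) -> str:
        """Kedro node: fit the model and register a new version in the MLflow Model Registry."""
        model.fit(X_train, y_train)
        with mlflow.start_run():
            # registered_model_name creates the registry entry on the first call
            # and bumps the version number on every subsequent call.
            info = mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                registered_model_name="candidate_model",  # hypothetical name
            )
        return info.model_uri

    The kedro-mlflow plugin can also do this declaratively from the catalog, which keeps tracking code out of the nodes.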
  • c

    Chee Ming Siow

    04/07/2025, 8:45 AM
    Hi, I need some clarification on OmegaConfigLoader(). I have duplicated keys across the base and local conf environments. How do I retrieve the configs without hitting
    ValueError: Duplicate keys found in ...
    ? In my code, a function runs before the actual Kedro pipeline, and I want it to retrieve the config while prioritizing the attributes defined in the local environment. Sample code:
    ###### main.py #####
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    from myfunc import conf_eda

    if __name__ == "__main__":
        # Bootstrap the project to make the config loader available
        project_path = Path.cwd()
        bootstrap_project(project_path)

        # Create a Kedro session
        with KedroSession.create(project_path=project_path) as session:
            # You can now access the catalog, pipeline, etc. from the session
            conf_eda()  # <------------- function that needs the config
            session.run()

    ##### myfunc.py #####
    from pathlib import Path

    from kedro.config import OmegaConfigLoader

    def conf_eda():
        project_path = Path.cwd()
        conf_path = str(project_path / "conf")
        conf_loader = OmegaConfigLoader(
            conf_source=conf_path,
        )
        parameters = conf_loader["parameters"]  # <----------- raises the ValueError

        print(parameters["model_options"])
    
    ##### conf/base/parameters_data_science.yml #####
    model_options:
      test_size: 100
      random_state: 3
    
    ##### conf/local/parameters_data_science.yml #####
    model_options:
      test_size: 300
      random_state: 3
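    A minimal sketch of what usually resolves this, assuming Kedro 0.19-style behaviour: when OmegaConfigLoader is instantiated directly, base_env and default_run_env are unset, so the loader scans conf/ as a single environment and treats the base and local copies of the same key as duplicates. Pointing it at the two environments explicitly makes local override base instead:

    from pathlib import Path

    from kedro.config import OmegaConfigLoader


    def conf_eda():
        conf_path = str(Path.cwd() / "conf")
        # Declare which environment is the base and which one overrides it;
        # duplicate keys are then merged (local wins) rather than raising ValueError.
        conf_loader = OmegaConfigLoader(
            conf_source=conf_path,
            base_env="base",
            default_run_env="local",
        )
        parameters = conf_loader["parameters"]
        print(parameters["model_options"])  # expected: test_size 300 from conf/local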
  • p

    Puneet Saini

    04/07/2025, 10:59 AM
    Hey team! Is there a way to download a pipeline's kedro viz visualisation programmatically?
    ➕ 1
    1️⃣ 1
  • r

    Robert Kwiatkowski

    04/08/2025, 10:53 AM
    What's the best way to profile a Kedro pipeline? I want to reduce the end-to-end run time.
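    Not an official profiler, but a minimal sketch of a node-timing hook that is often enough to find the slow steps; the class name is made up, and it is registered via HOOKS in settings.py:

    import time
    from collections import defaultdict

    from kedro.framework.hooks import hook_impl


    class NodeTimerHooks:
        """Print wall-clock time per node so the slowest steps stand out."""

        def __init__(self):
            self._starts = {}
            self.durations = defaultdict(float)

        @hook_impl
        def before_node_run(self, node):
            self._starts[node.name] = time.perf_counter()

        @hook_impl
        def after_node_run(self, node):
            elapsed = time.perf_counter() - self._starts.pop(node.name)
            self.durations[node.name] += elapsed
            print(f"{node.name}: {elapsed:.2f}s")

    # settings.py
    # HOOKS = (NodeTimerHooks(),)

    For finer-grained detail, wrapping kedro run in a generic profiler such as cProfile or pyinstrument works as well.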
  • w

    Winston Ong

    04/08/2025, 3:43 PM
    Hello everyone, I am facing this Forbidden error when trying to access my S3 bucket with my admin user credentials. I ran the command
    kedro run --pipeline data_processing --env=production
    from the spaceflights-pandas starter.
    DatasetError: Failed while loading data from dataset CSVDataset(filepath=bucket-name/companies.csv, load_args={}, protocol=s3,
    save_args={'index': False}).
    Forbidden
    conf/production/catalog.yml:
    companies:
      type: pandas.CSVDataset
      filepath: s3://bucket-name/companies.csv
      credentials: prod_s3

    reviews:
      type: pandas.CSVDataset
      filepath: s3://bucket-name/reviews.csv
      credentials: prod_s3

    shuttles:
      type: pandas.ExcelDataset
      filepath: s3://bucket-name/shuttles.xlsx
      load_args:
        engine: openpyxl
      credentials: prod_s3
    conf/production/credentials.yml:
    prod_s3:
      client_kwargs:
        aws_access_key_id: <<access_key>>
        aws_secret_access_key: <<secret_access_key>>
    I'm quite sure my credentials are correct and bucket access is okay, because I ran the following script and I am able to retrieve the file.
    import boto3
    
    s3 = boto3.client(
        's3',
        aws_access_key_id='<<access_key>>',
        aws_secret_access_key='<<secret_access_key>>'
    )
    
    response = s3.get_object(Bucket='bucket-name', Key='companies.csv')
    print(response['Body'].read().decode())
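    A boto3 success doesn't exercise the same path Kedro uses: pandas datasets go through fsspec/s3fs. A quick sanity check with the same credentials (the placeholders and bucket/key names are the ones from the catalog above) can help separate a credentials problem from an s3fs/catalog one:

    import s3fs

    # Same credentials the prod_s3 entry provides; s3fs also accepts them as
    # top-level key/secret arguments instead of client_kwargs.
    fs = s3fs.S3FileSystem(
        key="<<access_key>>",
        secret="<<secret_access_key>>",
    )
    print(fs.ls("bucket-name"))
    with fs.open("bucket-name/companies.csv") as f:
        print(f.readline())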
  • w

    Winston Ong

    04/09/2025, 12:03 AM
    The above issue has been resolved. I ran my code again after a while and it worked.
    👍 2
  • p

    Puneet Saini

    04/09/2025, 1:32 PM
    Hi team! I see that we might need to do
    if "parameters" in key
    in the code for omegaconf_config.py. Imagine a scenario where we are trying to load some common parameters using common patterns and country parameters using country patterns in settings.py. In that case we could actually name the keys country_parameters or common_parameters and it would still work as expected. Need your thoughts.
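    For context, a minimal sketch of the kind of settings.py setup being described here, with hypothetical pattern names; any key added to config_patterns becomes addressable through the config loader:

    # settings.py (sketch; pattern names are illustrative)
    from kedro.config import OmegaConfigLoader

    CONFIG_LOADER_CLASS = OmegaConfigLoader
    CONFIG_LOADER_ARGS = {
        "config_patterns": {
            # extra keys on top of the built-in "parameters"/"catalog"/... patterns
            "common_parameters": ["common*", "common*/**"],
            "country_parameters": ["country*", "country*/**"],
        }
    }

    # elsewhere, e.g. in a hook or notebook:
    # conf_loader["common_parameters"]
    # conf_loader["country_parameters"]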
  • r

    Ralf Kowatsch

    04/10/2025, 9:21 AM
    Hi, I'm looking for the most suitable option for our use case, which is a mixture of data engineering and data science. The possible solutions we are looking at are:
    • dbt (https://docs.getdbt.com/)
    • Kedro
    • SQLMesh
    Our main concerns are:
    • time to market
    • scalability
    • maintainability
    • etc.
    The question that bugs me the most is whether there is some SQL query push-down solution in Kedro. I saw https://ibis-project.org/, which looks like a dataframe solution for query push-down, but in certain cases I would like to push down my SQL model directly. Does anybody have any idea?
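    For reference, a minimal sketch of what SQL push-down looks like with Ibis inside a Kedro node, assuming a DuckDB backend and a hypothetical orders table; both the dataframe-style expression and the raw SQL string are compiled and executed in the backend rather than in pandas:

    import ibis

    con = ibis.duckdb.connect("warehouse.duckdb")  # hypothetical database file

    # Dataframe-style expression: compiled to SQL and executed by DuckDB.
    orders = con.table("orders")  # hypothetical table
    summary = orders.group_by("customer_id").aggregate(total=orders.amount.sum())

    # Raw SQL push-down: hand the backend a query string directly.
    raw = con.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY 1")

    print(ibis.to_sql(summary))  # inspect the generated SQL
    df = summary.execute()       # only now is data pulled into pandas

    kedro-datasets also ships ibis.TableDataset and ibis.FileDataset, so the connection and table can live in the catalog instead of the node.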
  • b

    Bibo Bobo

    04/11/2025, 11:43 AM
    Guys, what is the right way to create pipelines with dynamic inputs? I mean the following: if I have a pipeline that takes a dataset (defined in the data catalog) and some parameters, I would like to be able to switch the dataset from the configs somehow without touching the definition of the pipeline itself. For example, if I have these datasets defined in the catalog:
    "{namespace}.{layer}-{folder}#csv_all":
      type: "${globals:datasets.partitioned_dataset}"
      path: data/{layer}/{namespace}/{folder}
      dataset:
        type: "${globals:datasets.pandas_csv}"
    
    "{namespace}.{layer}-{filename}#single_csv":
      type: "${globals:datasets.pandas_csv}"
      filepath: data/{layer}/{namespace}/{filename}.csv
    And in the pipeline definitions I can have either something like this:
    pipeline(
            [
                node(
                    func=do_stuff,
                    inputs=[
                        # other params
                        "05_model_input-folder_name#csv_all",
                    ],
                    outputs="some_output",
                )
            ],
        namespace="some_namespace",
    )
    Or something like this, depending on whether I want to make a test run on a fraction of the data or on the full dataset:
    pipeline(
            [
                node(
                    func=do_stuff,
                    inputs=[
                        # other params
                        "05_model_input-filename#single_csv",
                    ],
                    outputs="some_output",
                )
            ],
        namespace="some_namespace",
    )
    And I want to have a configuration in YAML where I can easily change the type of dataset that is used in the pipeline. Ideally I would like a single config from which I can set all the parameters used in the pipeline, and end up with something like this:
    pipeline(
        [
            node(
                func=do_stuff,
                inputs=[
                    # other params
                    "dataset",
                ],
                outputs="some_output",
            )
        ],
        namespace="some_namespace",
    )
    I see that when you create pipelines using the Kedro CLI it creates a function with this signature:
    def create_pipeline(**kwargs) -> Pipeline:
    so I assume there is a way to provide params and have something like this:
    def create_pipeline(**kwargs) -> Pipeline:
        return pipeline(
            [
                node(
                    func=do_stuff,
                    inputs=[
                        # other params
                        kwargs.get("dataset"),
                    ],
                    outputs="some_output",
                )
            ],
            namespace="some_namespace",
        )
    But I am not sure how to do it the right way. I have several pipelines like this and want all of them to be dynamic. Should I change the default logic in
    pipeline_registry.py
    and pass those kwargs from there, or is there a simpler way to achieve this?
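    A minimal sketch of the kwargs route, assuming the selection lives in a plain config file read at registration time (find_pipelines does not forward kwargs, so the registry calls create_pipeline directly); the file name, keys and module paths are made up:

    # pipeline_registry.py (sketch)
    from pathlib import Path

    import yaml
    from kedro.pipeline import Pipeline

    from my_project.pipelines import my_pipeline  # hypothetical package / pipeline


    def register_pipelines() -> dict[str, Pipeline]:
        # e.g. conf/base/pipeline_options.yml:
        #   my_pipeline:
        #     dataset: "05_model_input-filename#single_csv"
        options = yaml.safe_load(Path("conf/base/pipeline_options.yml").read_text())

        pipelines = {
            "my_pipeline": my_pipeline.create_pipeline(**options["my_pipeline"]),
        }
        pipelines["__default__"] = sum(pipelines.values())
        return pipelines


    # pipelines/my_pipeline/pipeline.py (sketch)
    # def create_pipeline(dataset: str = "05_model_input-folder_name#csv_all", **kwargs) -> Pipeline:
    #     return pipeline(
    #         [node(func=do_stuff, inputs=[dataset], outputs="some_output")],
    #         namespace="some_namespace",
    #     )

    Another option is to keep create_pipeline generic and remap dataset names at composition time via the inputs mapping argument of the modular pipeline() wrapper.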
  • d

    Davi Sales Barreira

    04/11/2025, 5:42 PM
    Friends, I'm starting to use kedro with uv. If I start the project with PySpark, I get an error. Here are the steps to reproduce. Start by running:
    uvx kedro new
    When prompted, I choose the option to install all tools (this includes PySpark). The project is created. I go into the directory and run:
    uv run ipython
    Inside ipython, if I try %load_ext kedro.ipython, I get the error:
    The operation couldn't be completed. Unable to locate a Java Runtime.
    Please visit http://www.java.com for information on installing Java.

    /Users/davi/test/.venv/lib/python3.11/site-packages/pyspark/bin/spark-class: line 97: CMD: bad array subscript
    head: illegal line count -- -1
    Traceback (most recent call last):
      in <module>:1
      File "/Users/davi/test/.venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 2482, in run_line_magic
        result = fn(*args, **kwargs)
    ....
    PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
    Any idea on what might be happening? BTW, I'm on a Mac.
    👀 1
  • d

    Davi Sales Barreira

    04/12/2025, 1:57 PM
    Is there a way to read parquet files with polars? I see that polars.ParquetDataset does not exist.
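    A minimal sketch of the usual route, assuming a recent kedro-datasets: the generic polars datasets take a file_format argument, so parquet goes through EagerPolarsDataset (or LazyPolarsDataset for a LazyFrame); the filepath is illustrative:

    from kedro_datasets.polars import EagerPolarsDataset, LazyPolarsDataset

    # Eager: load() returns a polars.DataFrame
    eager = EagerPolarsDataset(
        filepath="data/01_raw/example.parquet",  # hypothetical path
        file_format="parquet",
    )
    df = eager.load()

    # Lazy: load() returns a polars.LazyFrame
    lazy = LazyPolarsDataset(
        filepath="data/01_raw/example.parquet",
        file_format="parquet",
    )
    lf = lazy.load()

    The catalog.yml equivalent is type: polars.EagerPolarsDataset (or polars.LazyPolarsDataset) with file_format: parquet.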
  • d

    Daniel Mesquita

    04/14/2025, 3:27 PM
    Hey team, quick question. Is there a simple way to collapse pipelines in kedro-viz? The effect I want is only in kedro-viz: as the project grows in complexity it is hard to keep track of inputs and outputs through the nodes directly. It would be nice to collapse pipelines and see the hierarchy, e.g. pipeline 1 is composed of sub_pipe2, sub_pipe2 is composed of sub_pipe3, and so on. I understand I could put namespaces on all my pipelines, but this has two problems: a) the structure might break because I would need to update the catalog, and b) if I define pipelines and sub-pipelines hierarchically through namespaces, I lose the possibility to run just the sub-pipeline through
    kedro run -p subpipe
    and would need to rely on tags to execute a part of it. Is there any feature like this?
  • m

    Mohamed El Guendouz

    04/14/2025, 4:00 PM
    Hello! 🙂 When I run a Kedro pipeline remotely on a Dataproc cluster, I get the following error:
    ValueError: Failed to find the pipeline named 'XXXXXX'. It needs to be generated and returned by the 'register_pipelines' function.
    However, when I run kedro run --pipeline <pipeline> locally on my machine, the pipeline is correctly detected and executed. Just to clarify, I do have an __init__.py file in the pipeline directory, and my register_pipelines() function uses find_pipelines() as shown below:
    from kedro.framework.project import find_pipelines
    from kedro.pipeline import Pipeline
    
    def register_pipelines() -> dict[str, Pipeline]:
        pipelines = find_pipelines()
        pipelines["__default__"] = sum(pipelines.values())
        return pipelines
    Do you have any idea what could be causing this issue on the cluster? Any insights or suggestions would be greatly appreciated. Thank you in advance!
  • s

    Sven-Arne Quist

    04/15/2025, 9:23 AM
    Hi, is anyone here using Kedro and (PyTorch) Lightning together? It seems to me that Kedro has a lot of advantages, like the DAG and the fact that you can keep all of your parameters and files in one YAML file, etc. At the same time, I worry that Kedro and Lightning will fight over control of the hardware. Lightning automatically detects GPUs and takes care of distributed training and so on. I don't want Kedro to interfere with this. Can I, for instance, tell Kedro: "let Lightning do the hardware management"?
    👍🏼 1
  • m

    Manoel Pereira de Queiroz

    04/15/2025, 6:32 PM
    Hi guys, is it possible to package a Kedro project so that the catalog can be loaded from another application? My use case is that I have not only several datasets (stored in GCP) but also Kedro parameters that I would like to embed and reuse in a front-end application. Is there a way for me to package my project (along with the conf directory) so I can easily access parameters and datasets with catalog.load or the ConfigLoader class, instead of configuring my connection to GCP and recreating the parameters from the ground up in the new application? Thanks in advance and keep up the good work, this project is awesome!
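    A minimal sketch of the usual pattern, assuming the project source (or its packaged wheel) plus the conf directory are available to the consuming app; the project path, dataset and parameter names are illustrative:

    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path("/path/to/my-kedro-project")  # hypothetical checkout with conf/
    bootstrap_project(project_path)

    with KedroSession.create(project_path=project_path) as session:
        context = session.load_context()
        params = context.params                  # merged parameters
        df = context.catalog.load("my_dataset")  # hypothetical catalog entry

    If only the built wheel is installed (no project tree on disk), configure_project(package_name) from kedro.framework.project plays the role of bootstrap_project, with the session's conf_source pointed at wherever the conf directory is shipped.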
  • ł

    Łukasz Janiec

    04/17/2025, 12:09 PM
    Hello, I have an unexpected problem running a Kedro project with a simple pipeline, with one node using OSMNx. It works without problems in the venv in a script:
    def get_shortest_path(
        G: nx.MultiDiGraph, origin: tuple[float, float], destination: tuple[float, float]
    ) -> list[tuple[int, int]]:
        """
        Get the shortest path between two points in the graph.
    
        :param G: The road network graph.
        :param origin: The (latitude, longitude) of the origin.
        :param destination: The (latitude, longitude) of the destination.
        :return: list of edges in the shortest path.
        """
        orig_node = ox.distance.nearest_nodes(G, origin[1], origin[0])
        dest_node = ox.distance.nearest_nodes(G, destination[1], destination[0])
        shortest_path = nx.shortest_path(G, orig_node, dest_node, weight="length")
        path_edges = list(zip(shortest_path[:-1], shortest_path[1:]))
        return path_edges
    But it becomes a problem when I try to use it with the CLI `kedro run`:
    UserWarning: An error occurred while importing the 'networking_route_optimizer.pipelines.data_ingestion' module. Nothing defined therein will be returned by 'find_pipelines'.

    Traceback (most recent call last):
      File "/home/ljaniec/.local/lib/python3.10/site-packages/kedro/framework/project/__init__.py", line 442, in find_pipelines
        pipeline_module = importlib.import_module(pipeline_module_name)
      File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name, package, level)
      File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
      File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
      File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 883, in exec_module
      File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      File "/home/ljaniec/workspace/networking-route-optimizer/src/networking_route_optimizer/pipelines/data_ingestion/__init__.py", line 1, in <module>
        from .pipeline import create_pipeline  # NOQA
      File "/home/ljaniec/workspace/networking-route-optimizer/src/networking_route_optimizer/pipelines/data_ingestion/pipeline.py", line 3, in <module>
        from networking_route_optimizer.pipelines.data_ingestion.nodes import (
      File "/home/ljaniec/workspace/networking-route-optimizer/src/networking_route_optimizer/pipelines/data_ingestion/nodes.py", line 3, in <module>
        import osmnx as ox
    ModuleNotFoundError: No module named 'osmnx'

      warnings.warn(
    What is the problem there? I know that the standard OSMNx installation uses conda and I installed it with pip, but I would expect that to be a problem both in the script and in the pipeline...
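    One thing the traceback hints at (hedged, since it depends on how things were installed): kedro is imported from /home/ljaniec/.local/... rather than from the project venv, so kedro run may be executing under a different interpreter than the script that works. A quick check, assuming the layout above:

    # Run this both in the environment where the script works and from within a
    # `kedro run` (e.g. a temporary node) to compare interpreters and package locations.
    import sys

    import kedro

    print("interpreter:", sys.executable)
    print("kedro from :", kedro.__file__)

    try:
        import osmnx
        print("osmnx from :", osmnx.__file__)
    except ModuleNotFoundError:
        print("osmnx is not installed in this environment")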
  • f

    Fazil Topal

    04/20/2025, 2:13 PM
    Hi everyone, I have a question about serving Kedro pipelines. If I have a pipeline x with some nodes, what's the best way to serve it? As of now the pipeline reads and writes files, but during serving you ideally want to send the received input to the pipeline (not read the data from somewhere else) and return the output directly (perhaps also without writing it to storage). If there are some pointers I can have a look at, that would be awesome.
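    A minimal sketch of one way to do this inside a request handler, assuming Kedro 0.19-style APIs: feed the request payload in as a MemoryDataset, run the pipeline with a runner, and read the free outputs from the returned dict. The module, dataset and output names are illustrative:

    from kedro.io import DataCatalog, MemoryDataset
    from kedro.runner import SequentialRunner

    from my_project.pipelines.x import create_pipeline  # hypothetical pipeline module


    def serve_request(payload: dict) -> dict:
        pipeline_x = create_pipeline()

        # The request body becomes the pipeline's input dataset. Intermediate
        # datasets that are not registered stay in memory, and outputs that are
        # not registered are returned by the runner instead of being written out.
        catalog = DataCatalog({"raw_input": MemoryDataset(payload)})  # hypothetical input name

        outputs = SequentialRunner().run(pipeline_x, catalog)
        return outputs["some_output"]  # hypothetical free output name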
  • h

    Hugo Acosta

    04/21/2025, 7:52 AM
    Good morning, I have a question regarding the contents of the video "The road to Kedro 1.0": what does "Major refactor to decouple Runners from the DataCatalog" mean? Could you elaborate? I have found that ParallelRunner fails when the intermediate datasets in my DataCatalog are stored in memory; is this related to the previous point? Thanks in advance!
  • s

    Sudip Bhandari

    04/22/2025, 2:52 PM
    Hi everyone, I have a question about integrating MLflow into my Kedro project. Currently, all outputs from my Kedro project are being stored in a designated folder within the project directory (e.g., mykedroproject/), as specified in my catalog.yml. However, I've noticed that when I implement MLflow, artifacts and metrics are logged in a different location (under the mlruns directory). This results in the same outputs being stored twice: once through Kedro and again via MLflow. Do you have any advice on how to address this issue so that I store results only once? Ideally, I would like to have specific artifacts displayed in the MLflow UI, sourced directly from the mykedroproject/ folder. Thanks in advance!!
    👍 1
  • p

    Puneet Saini

    04/28/2025, 8:20 AM
    Hi team! I am trying to load a Spark parquet as a polars.LazyPolarsDataset, for which I assume the filepath needs to be a glob pattern. But since kedro-datasets>=6.0.0, we are checking the availability of the file itself without expanding the glob pattern if one is passed in. Is this a bug or am I doing something wrong?
  • f

    Fazil Topal

    04/28/2025, 10:53 AM
    Hi everyone, is there a reason why globals.yml must be under base or another env folder? Simply putting it directly under conf does not work. Was there a specific reason for this? I wanted to define globals once for all the envs; I could hack around it, but I wanted to know whether it's a bug or intended this way.
  • j

    Juan Luis

    04/28/2025, 11:13 AM
    This week we don't have Coffee Chat ☕, and for the next one the full Kedro team at QB will be doing a co-location, so expect slower response times from us 🏢
  • m

    Mikołaj Tym

    04/28/2025, 1:44 PM
    Hi, I'm encountering an issue where Kedro adds extra empty lines between data rows when saving a CSV file using the pandas.CSVDataset type, which results in empty rows when opening the file as text. Have you encountered this issue, and do you know how I can prevent these extra lines from being added during the save? It looks similar to this issue: https://github.com/kedro-org/kedro/issues/492
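    This usually points at Windows line-ending handling in pandas.to_csv. A minimal sketch of the workaround most often suggested, hedged because the keyword depends on the pandas version (lineterminator since pandas 1.5, line_terminator before that); the same dict can simply go under save_args in catalog.yml instead:

    from kedro_datasets.pandas import CSVDataset

    dataset = CSVDataset(
        filepath="data/02_intermediate/output.csv",  # hypothetical path
        save_args={"lineterminator": "\n"},  # use "line_terminator" on pandas < 1.5
    )
    # dataset.save(df)  # df being the DataFrame produced by the node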
    👀 1
  • j

    Jordan Barlow

    04/28/2025, 4:51 PM
    Wondered if anyone else has come across this, or perhaps I'm doing something wrong. I'm reading from/writing to a hive partition of parquet files using Ibis with the DuckDB backend (ibis.FileDataset, kedro-datasets>=7.0.0). Kedro seems to make an assumption with the filepath catalog key of a dataset: that the dataset can be read from and written to that same path. However, Backend.write_parquet and Backend.to_parquet are different when load_args={'hive_partitioning': True}, as the corresponding DuckDB functions require a directory arg when writing but a nested glob when reading: https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html This is reflected at the Ibis level as well: https://github.com/ibis-project/ibis/issues/10939 Things still work if you have a catalog entry like this:
    my_hive:
      type: ibis.FileDataset
      filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
      table_name: my_hive
      file_format: parquet
      connection: ${_duckdb}
      load_args:
        hive_partitioning: true
      save_args:
        partition_by: ${tuple:first_col,second_col}
    But the write operation will treat the entire filepath like a directory path, and you end up with something like:
    my_hive
    └── first_col=*
        └── second_col=*
            └── *.parquet
                └── first_col=val_1
                    ├── second_col=cat_1
                    │   └── data_0.parquet
                    ├── second_col=cat_2
                        └── data_0.parquet
                └── ...
    This isn't really a Kedro design problem – perhaps the DuckDB API should be more symmetric. Has anyone else overcome this at the Kedro level? Thanks.
  • l

    Lino Fernandes

    04/29/2025, 6:02 AM
    Dear team, what is the best practice for passing an AWS role in Kedro? I need to read/write data in AWS and I'll have access to the role name/ARN. I'm thinking of a few options:
    1. Assume the role in a hook, e.g. `after_context_created`, `before_dataset_loaded` or `before_dataset_saved` (not sure whether this gets propagated), and from there save the temporary credentials as env variables.
    2. Create a custom dataset and assume the role there.
    PS - I'm using pandas.ParquetDataset and partitions.PartitionedDataset
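    A minimal sketch of option 1, assuming an after_context_created hook and the default credential chain picked up by s3fs/botocore once the environment variables are set; the role ARN and session name are placeholders:

    import os

    import boto3
    from kedro.framework.hooks import hook_impl


    class AssumeRoleHooks:
        """Assume an IAM role once per run and expose the temporary credentials
        via environment variables, which s3fs/botocore pick up automatically."""

        @hook_impl
        def after_context_created(self, context):
            sts = boto3.client("sts")
            creds = sts.assume_role(
                RoleArn="arn:aws:iam::123456789012:role/my-kedro-role",  # placeholder
                RoleSessionName="kedro-run",
            )["Credentials"]

            os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
            os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
            os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]

    # settings.py
    # HOOKS = (AssumeRoleHooks(),)

    Note that the temporary credentials expire (one hour by default), so very long runs may need to refresh them or fall back to the custom-dataset option.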
  • m

    Matthias Roels

    04/29/2025, 4:26 PM
    With MLflow, you have to create a custom PythonModel if you want to store a model combined with its preprocessing steps (which you always have to do, imo). How can you do that with Kedro (or kedro-mlflow)? The problem is that you probably fitted the preprocessors in earlier nodes and persisted the result. As far as I can tell from the docs, MLflow requires the artifacts of a custom model to be persisted on disk (which you can do with the catalog), but those path strings are not readily available in the Kedro nodes to be passed to the pyfunc constructor… Any tips or ideas welcome 😀
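    For reference, a minimal sketch of the pyfunc side, assuming the preprocessor and model were persisted by earlier nodes and that the node logging the model simply re-states those paths (e.g. hard-coded or passed in as parameters, since catalog filepaths are not injected into nodes):

    import joblib
    import mlflow
    import mlflow.pyfunc


    class PreprocessedModel(mlflow.pyfunc.PythonModel):
        def load_context(self, context):
            # Paths come from the artifacts dict passed to log_model below.
            self.preprocessor = joblib.load(context.artifacts["preprocessor"])
            self.model = joblib.load(context.artifacts["model"])

        def predict(self, context, model_input):
            return self.model.predict(self.preprocessor.transform(model_input))


    def log_pyfunc_model(preprocessor_path: str, model_path: str) -> None:
        """Kedro node: the paths are assumed to be passed in as parameters."""
        with mlflow.start_run(nested=True):  # nested in case an outer run is already active
            mlflow.pyfunc.log_model(
                artifact_path="combined_model",
                python_model=PreprocessedModel(),
                artifacts={"preprocessor": preprocessor_path, "model": model_path},
            )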
    🤩 1
  • p

    Pedro Sousa Silva

    04/30/2025, 9:34 AM
    Hey team, any way to resolve environment variables in globals.yaml? e.g.
    root: ${oc.env:AWS_S3_ROOT}
    My environment-variable-loading logic lives in the __init__.py file, and I believe globals.yaml is being resolved before __init__.py runs, so it doesn't know AWS_S3_ROOT. I'm guessing a hook would solve it? Apologies, perks of migrating old repos to newer versions of Kedro with OmegaConfigLoader as the default.
    🙏 1
  • n

    Nicolas Betancourt Cardona

    04/30/2025, 2:55 PM
    Hello, I was wondering if there is a safe way of renaming my Kedro project. I know that once you give your project a name it appears everywhere in different forms, so I'm afraid of renaming some folders and then damaging something.
  • p

    Pedro Sousa Silva

    05/05/2025, 1:29 PM
    Hey guys, is there any way we can pass a runtime argument into the catalog? The Kedro docs explicitly state that "`runtime_params` are not designed to override globals configuration", so I wonder if there's any workaround for my requirement: we have a project where a frontend action triggers my Kedro run in Databricks (via the Databricks Jobs REST API). Some parameters from the frontend override some of my default Kedro parameters (this works fine), but I also need to override a dataset definition based on one of these parameters. In particular, I want my dataset to be written to a specific location that depends on a runtime_param `simulation_id`. My globals.yaml:
    root: ${oc.env:AWS_S3_ROOT}
    simulation_id: ${uuid:""} # ideally something like ${runtime_params:simulation_id}, but i know it's not possible
    folders:
        m_frontend: "09_frontend_reporting/${..simulation_id}"
    my catalog.yaml:
    simulation_json:
      type: json.JSONDataset
      filepath: ${globals:root}/${globals:folders.m_frontend}/simulation_${globals:simulation_id}.json
    What are my options to achieve this?
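    A minimal sketch of one workaround, hedged: since the Databricks job already controls the process environment, it can export the simulation id as an environment variable, and a custom resolver (registered the same way as the existing uuid one) can read it from globals.yaml; the names are illustrative:

    # settings.py (sketch)
    import os
    import uuid

    from kedro.config import OmegaConfigLoader

    CONFIG_LOADER_CLASS = OmegaConfigLoader
    CONFIG_LOADER_ARGS = {
        "custom_resolvers": {
            "uuid": lambda *_: str(uuid.uuid4()),  # existing resolver used by globals.yaml
            # New: read the id exported by the Databricks job, fall back to a fresh uuid.
            "sim_id": lambda *_: os.getenv("SIMULATION_ID", str(uuid.uuid4())),
        }
    }

    # globals.yaml would then use:
    # simulation_id: ${sim_id:""}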
  • j

    Joseph McLeish

    05/07/2025, 1:00 PM
    Hi! I started Kedro just today, so apologies if this question has already been answered elsewhere or is super trivial. I've followed the tutorial to create a new project with an example pipeline, but when I run kedro run in the project directory (after having run uv pip install -r requirements.txt), I get the following error:
    PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
    It seems like I need to install Java for this to work, but there's no mention of Java anywhere in the docs, so this doesn't feel like the right option. I'm on Windows, running locally in VS Code, and didn't encounter any issues with the requirements installation. Is anyone able to help with this error? Thanks! 🙂