# questions
  • Jamal Sealiti
    05/28/2025, 11:32 AM
    Hi, how can I set up Kedro with Grafana for tracking node/pipeline progress?
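    There is no built-in Kedro-Grafana integration, so the usual route is a custom hook that exports node-level metrics to whatever backend Grafana reads from. A minimal sketch, assuming a Prometheus Pushgateway at localhost:9091 and the prometheus_client package (metric and job names are made up):

    ```python
    # src/<project>/hooks.py -- illustrative sketch, not an official Kedro integration.
    import time

    from kedro.framework.hooks import hook_impl
    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


    class NodeTimingHooks:
        def __init__(self):
            self._registry = CollectorRegistry()
            self._duration = Gauge(
                "kedro_node_duration_seconds",
                "Wall-clock time per Kedro node",
                ["node"],
                registry=self._registry,
            )
            self._starts = {}

        @hook_impl
        def before_node_run(self, node):
            self._starts[node.name] = time.perf_counter()

        @hook_impl
        def after_node_run(self, node):
            elapsed = time.perf_counter() - self._starts.pop(node.name)
            self._duration.labels(node=node.name).set(elapsed)
            # Push after every node so progress is visible while the pipeline is still running.
            push_to_gateway("localhost:9091", job="kedro_pipeline", registry=self._registry)
    ```

    The hook would be registered via HOOKS in settings.py; Grafana then dashboards the metric from Prometheus.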
  • Jamal Sealiti
    05/30/2025, 10:24 AM
    How does Kedro handle merging two streaming datasets on a set of merge keys? And deletes?
  • Jamal Sealiti
    05/30/2025, 12:19 PM
    Is it possible to create a custom Delta table dataset with the change data capture option? And how can I create a table from my custom schema before starting a writeStream?
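    For the Delta side of this, a sketch with the delta-spark builder API (table location and columns are made up) that creates the table from an explicit schema with the change data feed enabled before any stream writes to it:

    ```python
    # Illustrative sketch using delta-spark's DeltaTable builder; adapt path and schema.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType(
        [
            StructField("id", LongType(), nullable=False),
            StructField("payload", StringType(), nullable=True),
        ]
    )

    (
        DeltaTable.createIfNotExists(spark)
        .location("data/02_intermediate/events_delta")  # hypothetical target path
        .addColumns(schema)
        .property("delta.enableChangeDataFeed", "true")  # enables CDC on the table
        .execute()
    )
    ```

    A custom Kedro dataset could run this once in its save logic (or a hook could do it) before the streaming write starts.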
  • Yury Fedotov
    05/30/2025, 2:27 PM
    Are small contributions to docs (like typo fixes) being accepted now? Asking as I see you’re migrating to mkdocs, so maybe not the best time in terms of avoiding merge conflicts
  • Trọng Đạt Bùi
    06/02/2025, 10:06 AM
    Hello everyone! Has anyone tried to create and register a pipeline manually (rather than relying on Kedro's auto-registered pipelines)?
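    For reference, manual registration just means returning your own mapping from pipeline_registry.py instead of calling find_pipelines(). A minimal sketch (node and dataset names are invented):

    ```python
    # src/<project>/pipeline_registry.py -- manual registration sketch.
    from kedro.pipeline import Pipeline, node


    def _double(x):
        return x * 2


    def register_pipelines() -> dict[str, Pipeline]:
        data_processing = Pipeline(
            [node(_double, inputs="raw_numbers", outputs="doubled_numbers", name="double_node")]
        )
        return {
            "data_processing": data_processing,
            "__default__": data_processing,
        }
    ```

    `kedro run --pipeline data_processing` then picks it up like any auto-discovered pipeline.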
  • Ankit K
    06/02/2025, 3:19 PM
    Hi all, I'm working on a Kedro pipeline (using the kedro-vertexai plugin, version 0.10.0) where I need to track each pipeline run in a BigQuery table. We use a table_suffix (typically a date or unique run/session ID) to uniquely identify data and outputs for each pipeline run, ensuring that results from different runs do not overwrite each other and can be traced back to a specific execution.
    The challenge is that the Kedro session_id or KEDRO_CONFIG_RUN_ID is not available at config load time, so early config logic (like setting a table_suffix) uses a date or placeholder value. This can cause inconsistencies, especially if nodes run on different days or the pipeline is resumed (the pipeline currently takes ~2.5 days to run).
    We tried generating the table_suffix from the current date at config load time, but this led to issues: if a node runs on a different day or the pipeline is resumed, a new table_suffix is generated, causing inconsistencies and making it hard to track a single pipeline run. We also experimented with different Kedro hooks (such as before_pipeline_run and before_node_run) to set or propagate the run/session ID, but still faced challenges ensuring the value is available everywhere, including during config loading.
    What is the best practice in Kedro (with Vertex AI integration) for generating and propagating a unique run/session ID that is available everywhere (including config loading and all nodes), so that all tracking and table suffixes are consistent for a given run? Should this be set as an environment variable before Kedro starts, or is there a recommended hook or config loader pattern for this? Any advice or examples would be appreciated!
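    One pattern that fits the "set it before Kedro starts" idea is a custom OmegaConf resolver that reads (or lazily creates) a single run ID from an environment variable, so every config interpolation sees the same value. A sketch, assuming OmegaConfigLoader; the variable name KEDRO_RUN_ID is an arbitrary choice and would ideally be exported by the Vertex AI job before kedro run:

    ```python
    # src/<project>/settings.py -- illustrative sketch, not a kedro-vertexai feature.
    import os
    import uuid

    from kedro.config import OmegaConfigLoader


    def _run_id() -> str:
        # Reuse KEDRO_RUN_ID if the orchestrator already exported it; otherwise create
        # one value per process so all config interpolations agree.
        return os.environ.setdefault("KEDRO_RUN_ID", uuid.uuid4().hex[:8])


    CONFIG_LOADER_CLASS = OmegaConfigLoader
    CONFIG_LOADER_ARGS = {"custom_resolvers": {"run_id": _run_id}}
    ```

    parameters.yml or catalog.yml can then use table_suffix: ${run_id:}, and the value stays constant across multi-day runs and resumes as long as the same environment variable is passed to every step.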
  • Arnout Verboven
    06/03/2025, 11:00 AM
    Hi! If I have two configuration environments (local and prod), is it possible to know during pipeline creation which environment is being run? Or how should I do this using proper Kedro patterns? E.g. I want to do something like:
    from kedro.pipeline import Pipeline

    def create_pipeline(env: str = "local") -> Pipeline:
        if env == "prod":
            return create_pipeline_prod()
        else:
            return create_pipeline_local()
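    create_pipeline() does not receive the run environment from Kedro, so one common workaround (a sketch under that assumption, not an official parameter) is to branch on the KEDRO_ENV environment variable, which Kedro also uses as the default for --env. The two factory functions below are minimal stand-ins:

    ```python
    # Sketch: choose the pipeline variant from KEDRO_ENV (e.g. `export KEDRO_ENV=prod`).
    import os

    from kedro.pipeline import Pipeline, node


    def _identity(x):
        return x


    def create_pipeline_local() -> Pipeline:
        return Pipeline([node(_identity, "raw_sample", "model_input", name="use_sample_data")])


    def create_pipeline_prod() -> Pipeline:
        return Pipeline([node(_identity, "raw_full", "model_input", name="use_full_data")])


    def create_pipeline(**kwargs) -> Pipeline:
        env = os.environ.get("KEDRO_ENV", "local")
        return create_pipeline_prod() if env == "prod" else create_pipeline_local()
    ```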
  • Abhishek Bhatia
    06/10/2025, 5:38 AM
    Hey team! Not a Kedro question per se. What is the go-to tooling for configuration management in data science projects outside of Kedro (with OmegaConf)? Is Hydra the most popular choice? I am looking at the following features:
    1. Global configs
    2. Clear patterns for config type: static vs dynamic, global vs granular, constant vs overridable
    3. Param overriding with globals
    4. Param overriding within a config file
    5. Support for environment variables
    6. Storing environment-wise configs: DEV / STG / UAT / PROD etc.
    7. Interpolation with basic text concatenation
    8. (Optional) Python functions as resolvers in config (OmegaConf)
    9. Config compilation artifact (i.e. I want to see how my config looks after resolving)
    10. Invoking Python scripts with arbitrary / alternate config paths
    11. Invoking Python scripts with a specific param value
    Most of the above features are already there in Kedro, but I need this functionality outside Kedro. Eager to hear the community's recommendations! 🙂
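    For a sense of how far plain OmegaConf (which both Kedro and Hydra build on) gets you, a small sketch covering a few of the listed features: env-var resolution, interpolation, CLI overrides, and dumping the fully resolved config (all keys are made up):

    ```python
    # Plain OmegaConf sketch, independent of Kedro/Hydra.
    import os

    from omegaconf import OmegaConf

    os.environ.setdefault("ENV_NAME", "dev")  # normally exported by the shell / CI

    cfg = OmegaConf.create(
        {
            "env": "${oc.env:ENV_NAME}",  # environment variable resolution
            "paths": {"root": "/data", "raw": "${paths.root}/${env}/raw"},  # interpolation
        }
    )
    cfg = OmegaConf.merge(cfg, OmegaConf.from_cli())  # e.g. `python app.py paths.root=/mnt`
    OmegaConf.resolve(cfg)                            # materialise all interpolations
    print(OmegaConf.to_yaml(cfg))                     # the "compiled" config artifact
    ```

    Hydra adds the config-group / multirun layer on top of this; custom Python resolvers are registered with OmegaConf.register_new_resolver.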
  • Malek Bouzidi
    06/10/2025, 12:34 PM
    Hi all. I've been trying Kedro for the past few weeks. Everything worked well except for Kedro-Viz: it doesn't display the previews of the datasets. I followed all the instructions in the docs but nothing worked. Can someone help me figure out why it doesn't work?
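    Previews in Kedro-Viz are configured per catalog entry through a metadata block, so one thing worth double-checking is that the block is shaped and indented like the documented pattern below (dataset name and path are invented here):

    ```yaml
    # conf/base/catalog.yml -- illustrative entry; previews only work for dataset
    # types that implement preview (e.g. pandas.CSVDataset).
    companies:
      type: pandas.CSVDataset
      filepath: data/01_raw/companies.csv
      metadata:
        kedro-viz:
          layer: raw
          preview_args:
            nrows: 5   # number of rows shown in the Kedro-Viz preview pane
    ```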
  • Sharan Arora
    06/10/2025, 5:53 PM
    Hi, I'm getting an error when doing kedro run. Would you be able to help? I have Java 17 installed and I'm unsure why the code gets stuck on that last line; I always have to abort.
  • Sharan Arora
    06/11/2025, 1:35 AM
    Just to follow up: I'm receiving a FileNotFoundError: [WinError 2] The system cannot find the file specified. I've double-checked my path in the environment variables and can't find an issue.
  • Jonghyun Yun
    06/11/2025, 9:46 PM
    Hi team, I'm using Kedro 0.18.6 and it seems to have a bug. When I create and run a part of a composite pipeline, it actually runs everything in it. For example, running pipe["a"] will trigger pipe["b"] and pipe["c"] too. I don't think this is expected behavior. I cannot upgrade Kedro beyond 0.18.x. Was there a fix for this issue?
  • Trọng Đạt Bùi
    06/12/2025, 6:41 AM
    Has anyone tried to customize SparkDataset to read multiple folders from HDFS?
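    As background for a custom dataset: PySpark's own reader already accepts several paths (or glob patterns), so a custom load method mostly has to forward a list instead of a single filepath. A plain PySpark sketch outside Kedro (paths invented):

    ```python
    # DataFrameReader takes multiple paths directly.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet(
        "hdfs://namenode/data/events/2025-06-01/",
        "hdfs://namenode/data/events/2025-06-02/",
    )
    ```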
  • Mattis
    06/16/2025, 12:55 PM
    I have configured a dynamic pipeline (catalog and nodes) with a hooks file. Locally it's running in a Docker container without problems, but when I push it to AzureML and run it there, even though I can see the whole pipeline (and all dynamically created node names), I receive "pipeline does not contain that .. node". How is this even possible? Does anyone have a clue?
  • Wejdan Bagais
    06/17/2025, 4:52 PM
    Hi everyone! 👋 I'm currently exploring how to approach unit testing in Kedro, especially when working with large-scale data pipelines. I'd love to hear your thoughts on a few things:
    • Do you find unit tests valuable in the context of data pipelines?
    • How do you typically implement them in Kedro?
    • Given that data quality checks are often a key focus, how do you handle testing when the input datasets are huge? Creating dummy data for every scenario doesn't always seem practical.
    Any tips, examples, or lessons learned would be greatly appreciated! Thanks in advance 🙏
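    On the "how do you implement them" part, a common pattern is to test the node functions directly on tiny hand-made frames, independent of the catalog and of production data volumes. A pytest sketch (the node function and its module path are hypothetical):

    ```python
    # tests/pipelines/test_cleaning.py -- illustrative only.
    import pandas as pd

    from my_project.pipelines.cleaning.nodes import drop_null_rows  # hypothetical node


    def test_drop_null_rows_removes_incomplete_records():
        raw = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, None, 30.0]})

        cleaned = drop_null_rows(raw)

        assert len(cleaned) == 2
        assert cleaned["value"].notna().all()
    ```

    Data-quality checks on the real, full-size inputs are usually kept separate (e.g. validation nodes or hooks) rather than folded into unit tests.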
  • Sharan Arora
    06/18/2025, 7:53 PM
    Hello, I had a question. The pipeline I'm trying to build includes credentials for a PostgreSQL DB. The idea is to hand off a containerized pipeline and facilitate the necessary data cleaning, transformation and storage required for further analytics. In credentials.yml I have added the following:
    postgresql_connection:
      host: "${oc.env:POSTGRESQL_HOST}"
      username: "${oc.env:POSTGRESQL_USER}"
      password: "${oc.env:POSTGRESQL_PASSWORD}"
      port: "${oc.env:POSTGRESQL_PORT}"
    and each of these values is stored in a .env file in the same local folder. However, when I do kedro run, postgresql_connection isn't recognized and the actual values provided in the .env file are not picked up and passed on to credentials.yml (I want this to be dynamic and based on user input). Any idea how to resolve this? Additionally, what is the process for getting Kedro to read credentials.yml? It seems that on kedro run it only cares about catalog.yml. Is it just a matter of linking the credentials in the catalog? I tried that, but then it reads the dynamic string literally.
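    Two things usually have to line up here. First, ${oc.env:...} only resolves real environment variables, so if nothing in the setup exports the .env file before the config loader runs, the values will be missing; a sketch of doing that explicitly (assuming python-dotenv is installed) is below. Second, credentials.yml is only read when a catalog entry references the block by name via a credentials: postgresql_connection key; the interpolation then happens inside credentials.yml, not in the catalog.

    ```python
    # src/<project>/settings.py -- illustrative sketch: export .env values as real
    # environment variables early, so "${oc.env:...}" in credentials.yml can resolve.
    # Assumes the python-dotenv package; skip if your container already exports them.
    from dotenv import load_dotenv

    load_dotenv()  # reads ./.env into os.environ before the config loader runs
    ```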
  • Rachid Cherqaoui
    06/20/2025, 11:21 AM
    Hi everyone! 👋 I'm trying to load specific CSV files from an SFTP connection in Kedro, and I need to filter the files using a wildcard pattern. For example, I'd like to load only files that match something like:
    /doc_20250620*_delta.csv
    But I noticed that YAML interprets * as an anchor reference, and it doesn't seem to behave like a wildcard here. How can I configure a dataset in catalog.yml to use a wildcard when loading files from an SFTP path (e.g. to only fetch files starting with a certain prefix and ending with _delta.csv)? Is there native support for this kind of pattern in Kedro's SFTPDataSet or do I need to implement a custom dataset? Any guidance or examples would be super appreciated! 🙏
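    As far as I know, a single-file dataset's filepath is not glob-expanded, so one fallback is a PartitionedDataset over the folder, which at least covers the suffix part of the pattern; prefix filtering would still happen on the partition keys inside the node (or in a custom dataset). A sketch, assuming a recent kedro-datasets and that fsspec's SFTP backend (paramiko) is installed; names are invented and untested:

    ```yaml
    # conf/base/catalog.yml -- illustrative sketch: list a folder over SFTP and keep
    # only files ending in "_delta.csv".
    sftp_delta_files:
      type: partitions.PartitionedDataset
      path: sftp://my-sftp-server/outbox/
      dataset: pandas.CSVDataset
      filename_suffix: _delta.csv
      credentials: sftp_credentials
    ```

    The node then receives a dict of partition id to load function and can filter the keys on the doc_20250620 prefix before loading.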
  • Rachid Cherqaoui
    06/23/2025, 7:34 AM
    Hi everyone 👋 I'm currently working with Kedro and trying to load a CSV file hosted on an SFTP server using a CSVDataset. Here's the relevant entry from my `catalog.yml`:
    cool_dataset:
      type: pandas.CSVDataSet
      filepath: sftp://my-sftp-server/outbox/DW_Extracts/my_file.csv
      load_args: {}
      save_args:
        index: False
    When I run:
    df = catalog.load("cool_dataset")
    I get the following error: it seems like Kedro/pandas is trying to use urllib to open the SFTP URL, which doesn't support the sftp:// protocol natively. Has anyone successfully used Kedro to load files from SFTP? If so, could you share your config/setup?
  • Adrien Paul
    06/23/2025, 5:02 PM
    Hello, in the VS Code Kedro plugin, is it possible to run kedro viz with --include-hooks? Thanks guys 🙏
  • Nathan W.
    06/25/2025, 7:32 AM
    Hello guys, I couldn't find any way to store API keys in a .env or credentials.yml and then use them in my node parameters to make API requests. Are there any simple solutions I missed (without putting the key in parameters.yml and then risking pushing it into production...)? Thanks a lot in advance for your response, have a nice day!
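    parameters.yml is indeed the wrong vehicle for secrets, since it is typically committed to the repo. The simplest alternative (a sketch; the env var and endpoint names are invented) is to keep the key out of Kedro config entirely and read it from the environment inside the node, with the .env only used locally:

    ```python
    # Illustrative node: the API key never appears in parameters.yml or the catalog.
    import os

    import requests


    def fetch_report(endpoint: str) -> dict:
        api_key = os.environ["MY_SERVICE_API_KEY"]  # exported from .env or a deployment secret store
        response = requests.get(
            endpoint,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()
    ```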
  • Fazil Topal
    06/25/2025, 8:24 AM
    Hey everyone, I am building a system where I return the key/filepath of the final dataset in the Kedro pipeline. What's the ideal way of doing this, ideally in a way that also works for partitioned datasets, where I'd get a list of filepaths? I have a catalog instance, but somehow all the relevant methods are protected, so I'm wondering if I'm missing something obvious here. I was doing catalog._get_dataset(output)._filepath, which works only for non-partitioned datasets.
  • Jamal Sealiti
    06/26/2025, 10:14 AM
    Hi, placeholders for catalog.yml are not working. I have bootstrap_servers: "localhost:9092" in conf/base/parameters.yml and in my catalog.yml I'm trying to use a placeholder like ${bootstrap_servers}, but I get this error: InterpolationKeyError: Interpolation key ' bootstrap_servers' not found
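    For context, with the OmegaConfigLoader each config group resolves on its own, so catalog.yml cannot interpolate keys defined in parameters.yml. The supported place for shared values is conf/base/globals.yml, referenced through the globals resolver; a sketch (the catalog entry and its dataset type are invented):

    ```yaml
    # conf/base/globals.yml -- shared values live here, not in parameters.yml
    bootstrap_servers: "localhost:9092"

    # conf/base/catalog.yml -- referenced via the globals resolver, not plain ${...}
    # my_kafka_stream:
    #   type: my_project.datasets.KafkaStreamDataset   # hypothetical custom dataset
    #   bootstrap_servers: "${globals:bootstrap_servers}"
    ```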
  • Rachid Cherqaoui
    06/27/2025, 2:20 PM
    Hello, how can I pass a credentials argument as an input to a pipeline's function?
  • Pradeep Ramayanam
    06/27/2025, 5:34 PM
    Hi all, hope everyone is doing well! I have a weird file structure (as attached) and would love to hear if anyone has solved this before. I tried to solve it as attached, but I am getting the error below: DatasetError: No partitions found in '/data/01_raw/nces_ccd/*/Staff/DataFile' Any help would be much appreciated, thanks in advance!!
  • Rachid Cherqaoui
    06/30/2025, 9:11 AM
    Hi everyone, I have a versioned .txt file generated by a Kedro pipeline that I created, and I'd like to send it to a folder on a remote server via SFTP. After several attempts, I found it quite tricky to handle this cleanly within Kedro, especially while keeping things consistent with its data catalog and hooks system. Would anyone be able to help or share best practices on how to achieve this with Kedro? Thanks in advance for your support!
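    If a dedicated dataset feels heavy, one pragmatic pattern is an after_pipeline_run hook that pushes the produced file with paramiko, so the upload lives in the project but outside the node logic. A sketch; the host, credentials and both paths are placeholders:

    ```python
    # src/<project>/hooks.py -- illustrative sketch, not a built-in Kedro feature.
    import paramiko
    from kedro.framework.hooks import hook_impl


    class SftpUploadHooks:
        @hook_impl
        def after_pipeline_run(self):
            client = paramiko.SSHClient()
            client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            client.connect("my-sftp-server", username="user", password="secret")  # better: read from credentials
            try:
                sftp = client.open_sftp()
                sftp.put("data/08_reporting/report.txt", "/remote/inbox/report.txt")
                sftp.close()
            finally:
                client.close()
    ```

    The alternative that stays closest to the catalog is a custom dataset whose save method writes through fsspec's sftp:// filesystem, so the target location remains declared in catalog.yml.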
  • Jamal Sealiti
    06/30/2025, 11:29 AM
    Hi, I have a kafka -> bronze -> silver -> gold streaming pipeline and I want to see the data from each stage in Kedro-Viz. Is that possible?
  • olufemi george
    07/02/2025, 4:52 PM
    Hello. Newbie here. Please, what's the best practice for using Kedro with Airflow (Astro)? Should I:
    1. create 2 separate projects (Astro and Kedro) and then move the Kedro project files into the Airflow project (where exactly do I put them?), or
    2. create the Airflow project and develop the Kedro project within it?
  • minmin
    07/03/2025, 12:51 PM
    Hello, I am using kedro-mlflow and trying to namespace a pipeline at the same time, to do a bunch of runs together. When trying to save a metric, it works if I use the namespaced names explicitly in the catalog, i.e.:
    model_1.mae:
      type: kedro_mlflow.io.metrics.MlflowMetricDataset
    model_2.mae:
      type: kedro_mlflow.io.metrics.MlflowMetricDataset
    If, however, I try to template the name in the catalog, it fails:
    "{model_name}.mae":
      type: kedro_mlflow.io.metrics.MlflowMetricDataset
    I get the error message: DatasetError: Failed while saving data to dataset MlflowMetricDataset(run_id=...). Invalid value null for parameter 'name' supplied: Metric name cannot be None. A key name must be provided. Do I just have to avoid templating in the catalog when it comes to mlflow-related entries?
  • Adrien Paul
    07/04/2025, 8:42 AM
    Hello, is it possible to use transcoding with the kedro-azureml plugin? I feel like it's not possible... Thanks guys 🙏
  • julie tverfjell
    07/04/2025, 10:20 AM
    Hi! I am wondering if anyone has experience with joining dataframes in Kedro and handling updates to the underlying dataframes? I am doing a stream-batch join, and I want to ensure that any updates to the batch dataframe get propagated into my sink containing the joined data. The way I would want to solve this is to have a separate node that takes my batch data as input and merges it into my sink at set intervals. In Kedro it is not possible to have two nodes outputting to the same dataset. Is there a different way to handle this? I thought about creating two instances of the batch dataset in the data catalog, which might get around Kedro's restriction on several nodes outputting to the same dataset, but I don't know if it would be a good solution. To summarize:
    • I have a node that takes a streaming dataframe and a batch dataframe as input.
    • The result is written to a sink (format: Delta table).
    • I want my sink to reflect any updates to both data sources after the stream has started.
    • As of now, if there are any changes in the batch data, rows already existing in the sink will not be updated.
    • Also, I want to handle changes no matter when they arrive, so windowing is not an option.
    Any input will be appreciated 🙂
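    For the "separate merge node" idea, the usual Delta-side mechanics are a MERGE from the batch dataframe into the sink, which updates existing rows in place and so sidesteps having a second node own the same catalog output (the sink is updated as a side effect instead). A sketch with the delta-spark API; the sink path and the join key id are invented:

    ```python
    # Illustrative merge step; could live in a node that runs at set intervals.
    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame, SparkSession


    def upsert_batch_into_sink(batch_df: DataFrame, sink_path: str = "data/03_primary/joined_sink") -> None:
        spark = SparkSession.getActiveSession()
        sink = DeltaTable.forPath(spark, sink_path)
        (
            sink.alias("sink")
            .merge(batch_df.alias("batch"), "sink.id = batch.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    ```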