Elvira Salakhova
04/07/2025, 8:29 AM

Chee Ming Siow
04/07/2025, 8:45 AM
ValueError: Duplicate keys found in ...
?
In my code, I have a function that runs before the actual Kedro pipeline. I want to retrieve the config in that function and prioritize the config attributes defined in the local env.
Sample code:
###### main.py #####
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

from myfunc import conf_eda

if __name__ == "__main__":
    # Bootstrap the project to make the config loader available
    project_path = Path.cwd()
    bootstrap_project(project_path)
    # Create a Kedro session
    with KedroSession.create(project_path=project_path) as session:
        # You can now access the catalog, pipeline, etc. from the session
        # For example, to run the pipeline:
        conf_eda()  # <------------- function
        session.run()

##### myfunc.py #####
from pathlib import Path

from kedro.config import OmegaConfigLoader

def conf_eda():
    project_path = Path.cwd()
    conf_path = str(project_path / "conf")
    conf_loader = OmegaConfigLoader(
        conf_source=conf_path,
    )
    parameters = conf_loader["parameters"]  # <----------- error
    print(parameters["model_options"])
##### conf/base/parameters_data_science.yml #####
model_options:
  test_size: 100
  random_state: 3

##### conf/local/parameters_data_science.yml #####
model_options:
  test_size: 300
  random_state: 3
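A hedged note on the error above: the duplicate-keys ValueError typically appears when the loader is not told which environments to merge, so the conf/base and conf/local copies of model_options are read as a single environment. A minimal sketch of conf_eda, assuming a standard conf/ layout, where local values override base ones:

from pathlib import Path

from kedro.config import OmegaConfigLoader

def conf_eda():
    conf_path = str(Path.cwd() / "conf")
    conf_loader = OmegaConfigLoader(
        conf_source=conf_path,
        base_env="base",          # conf/base holds the defaults
        default_run_env="local",  # conf/local overrides them
    )
    parameters = conf_loader["parameters"]
    print(parameters["model_options"])  # expected: test_size taken from conf/local

Alternatively, inside the KedroSession block, session.load_context().config_loader should already be configured with the base and local environments.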
Puneet Saini
04/07/2025, 10:59 AM

Robert Kwiatkowski
04/08/2025, 10:53 AM

Winston Ong
04/08/2025, 3:43 PM
kedro run --pipeline data_processing --env=production
from the spaceflights-pandas starter.
DatasetError: Failed while loading data from dataset CSVDataset(filepath=bucket-name/companies.csv, load_args={}, protocol=s3,
save_args={'index': False}).
Forbidden
conf/production/catalog.yml:
companies:
  type: pandas.CSVDataset
  filepath: s3://bucket-name/companies.csv
  credentials: prod_s3

reviews:
  type: pandas.CSVDataset
  filepath: s3://bucket-name/reviews.csv
  credentials: prod_s3

shuttles:
  type: pandas.ExcelDataset
  filepath: s3://bucket-name/shuttles.xlsx
  load_args:
    engine: openpyxl
  credentials: prod_s3
conf/production/credentials.yml:
prod_s3:
  client_kwargs:
    aws_access_key_id: <<access_key>>
    aws_secret_access_key: <<secret_access_key>>
I'm quite sure my credentials are correct and bucket access is okay, because I ran the following script and I am able to retrieve the file.
import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='<<access_key>>',
    aws_secret_access_key='<<secret_access_key>>'
)
response = s3.get_object(Bucket='bucket-name', Key='companies.csv')
print(response['Body'].read().decode())
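A hedged sketch of one thing worth checking: Kedro's pandas.CSVDataset reads S3 through fsspec/s3fs rather than boto3 directly, and s3fs also accepts the keys at the top level of the credentials entry as key/secret. If the client_kwargs form is not being picked up in this setup, the equivalent entry below may behave differently; note too that fsspec can issue list/head calls that need more than s3:GetObject, which the boto3 test alone would not surface.

# conf/production/credentials.yml, alternative s3fs-style form (same placeholders as above)
prod_s3:
  key: <<access_key>>
  secret: <<secret_access_key>>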
Winston Ong
04/09/2025, 12:03 AM

Puneet Saini
04/09/2025, 1:32 PM
if "parameters" in key in the code for omegaconf_config.py. Imagine a scenario where we are trying to load some common parameters using common patterns and country parameters using country patterns in settings.py. In that case we can actually do country_parameters or common_parameters and it would still work as expected. Need your thoughts.
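For reference, a sketch of the scenario described, with hypothetical file patterns; both keys get parameter-style handling because they contain the word "parameters", which is what the check above produces:

# settings.py sketch (pattern values are made up for illustration)
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "common_parameters": ["parameters_common*"],
        "country_parameters": ["parameters_country*"],
    }
}

# elsewhere: config_loader["common_parameters"] or config_loader["country_parameters"]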
Ralf Kowatsch
04/10/2025, 9:21 AM

Bibo Bobo
04/11/2025, 11:43 AM
"{namespace}.{layer}-{folder}#csv_all":
  type: "${globals:datasets.partitioned_dataset}"
  path: data/{layer}/{namespace}/{folder}
  dataset:
    type: "${globals:datasets.pandas_csv}"

"{namespace}.{layer}-{filename}#single_csv":
  type: "${globals:datasets.pandas_csv}"
  filepath: data/{layer}/{namespace}/{filename}.csv
And in pipeline definitions I can have either something like this
pipeline(
    [
        node(
            func=do_stuff,
            inputs=[
                # other params
                "05_model_input-folder_name#csv_all",
            ],
            outputs="some_output",
        )
    ],
    namespace="some_namespace",
)
Or something like this, depending on whether I want to make a test run on a fraction of the data or on the full dataset:
pipeline(
    [
        node(
            func=do_stuff,
            inputs=[
                # other params
                "05_model_input-filename#single_csv",
            ],
            outputs="some_output",
        )
    ],
    namespace="some_namespace",
)
And I want to have a configuration in yaml where I can easily change the type of the dataset that is used in the pipeline.
Ideally I would like to have a single config from which I can set all the parameters that are used in the pipeline, and have something like this as a result:
pipeline(
    [
        node(
            func=do_stuff,
            inputs=[
                # other params
                "dataset",
            ],
            outputs="some_output",
        )
    ],
    namespace="some_namespace",
)
I see that when you create pipelines using the Kedro CLI it creates a function with the signature def create_pipeline(**kwargs) -> Pipeline, so I assume there is a way to provide params and have something like this:
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=do_stuff,
                inputs=[
                    # other params
                    kwargs.get("dataset"),
                ],
                outputs="some_output",
            )
        ],
        namespace="some_namespace",
    )
But I am not sure how to do it in the right way. I have several pipelines like this and I want all of them to be dynamic in this way. Should I change the default logic in pipeline_registry.py and pass those kwargs from there, or is there a simpler way to achieve something like this?
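A minimal sketch of the kwargs route, building only on the snippets above; the module, registry keys and dataset names are placeholders. Since register_pipelines() is an ordinary function, it can call create_pipeline() with whatever keyword arguments it likes, including values read from a config file:

# pipeline_registry.py sketch (names are hypothetical)
from kedro.pipeline import Pipeline

from my_project.pipelines import model_input  # hypothetical pipeline package

def register_pipelines() -> dict[str, Pipeline]:
    pipelines = {
        # full run reads the partitioned dataset
        "full": model_input.create_pipeline(dataset="05_model_input-folder_name#csv_all"),
        # test run reads a single csv with a fraction of the data
        "sample": model_input.create_pipeline(dataset="05_model_input-filename#single_csv"),
    }
    pipelines["__default__"] = pipelines["full"]
    return pipelines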
Davi Sales Barreira
04/11/2025, 5:42 PM
I'm having trouble using kedro with uv. If I start the package with PySpark, I get an error. Here are the steps to reproduce.
Start running:
uvx kedro new
When prompted, I choose the option to install all tools (this includes pyspark).
The project is created. I get into the directory and run:
uv run ipython
Inside ipython, if I try %load_ext kedro.ipython, then I get the error:
The operation couldn't be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.
/Users/davi/test/.venv/lib/python3.11/site-packages/pyspark/bin/spark-class: line 97: CMD: bad array subscript
head: illegal line count -- -1
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:1                                                                                     │
│                                                                                                   │
│ /Users/davi/test/.venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2482         │
│ in run_line_magic                                                                                 │
│                                                                                                   │
│   2479 │   │   │   if getattr(fn, "needs_local_scope", False):                                    │
│   2480 │   │   │   │   kwargs['local_ns'] = self.get_local_scope(stack_depth)                     │
│   2481 │   │   │   with self.builtin_trap:                                                        │
│ ❱ 2482 │   │   │   │   result = fn(*args, **kwargs)                                               │
....
PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
Any idea on what might be happening? BTW, I'm on a Mac.

Davi Sales Barreira
04/12/2025, 1:57 PM
polars.ParquetDataset does not exist.

Daniel Mesquita
04/14/2025, 3:27 PM
kedro run -p subpipe would need to rely on tags to execute a part of it. Is there any feature like this?
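A hedged sketch of the tag-based route mentioned above; tags are attached to nodes, and the run can then be filtered to them (function, dataset and tag names here are hypothetical):

from kedro.pipeline import node, pipeline

def clean(raw_data):
    return raw_data

def train(clean_data):
    return "model"

subpipe = pipeline(
    [
        node(func=clean, inputs="raw_data", outputs="clean_data", tags=["cleaning"]),
        node(func=train, inputs="clean_data", outputs="model"),
    ]
)

# then e.g.: kedro run --pipeline subpipe --tags cleaning  (the flag is --tag in older Kedro releases)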
Mohamed El Guendouz
04/14/2025, 4:00 PM
ValueError: Failed to find the pipeline named 'XXXXXX'. It needs to be generated and returned by the 'register_pipelines' function.
However, when I run kedro run --pipeline <pipeline> locally on my machine, the pipeline is correctly detected and executed.
Just to clarify, I do have an __init__.py file in the pipeline directory, and my register_pipelines() function uses find_pipelines() as shown below:
from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline

def register_pipelines() -> dict[str, Pipeline]:
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
Do you have any idea what could be causing this issue on the cluster? Any insights or suggestions would be greatly appreciated.
Thank you in advance!
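One hedged way to narrow this down: find_pipelines() skips (with a warning) any pipeline whose module fails to import, which on a cluster with a missing dependency can later surface as "Failed to find the pipeline named ...". Importing the module directly in the cluster environment exposes the real exception; the module path below is a placeholder:

# run this on the cluster to surface the underlying import error (module path is hypothetical)
import importlib

importlib.import_module("my_project.pipelines.my_pipeline")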
Sven-Arne Quist
04/15/2025, 9:23 AM

Manoel Pereira de Queiroz
04/15/2025, 6:32 PM
conf directory) so I can easily access parameters and datasets with catalog.load or the ConfigLoader class, instead of configuring my connection to GCP and recreating the parameters from the ground up in the new application?
Thanks in advance and keep up the good work, this project is awesome!

Łukasz Janiec
04/17/2025, 12:09 PM
import networkx as nx
import osmnx as ox

def get_shortest_path(
    G: nx.MultiDiGraph, origin: tuple[float, float], destination: tuple[float, float]
) -> list[tuple[int, int]]:
    """
    Get the shortest path between two points in the graph.

    :param G: The road network graph.
    :param origin: The (latitude, longitude) of the origin.
    :param destination: The (latitude, longitude) of the destination.
    :return: List of edges in the shortest path.
    """
    orig_node = ox.distance.nearest_nodes(G, origin[1], origin[0])
    dest_node = ox.distance.nearest_nodes(G, destination[1], destination[0])
    shortest_path = nx.shortest_path(G, orig_node, dest_node, weight="length")
    path_edges = list(zip(shortest_path[:-1], shortest_path[1:]))
    return path_edges
But it becomes a problem when I am trying to use it with CLI `kedro run`:
UserWarning: An error occurred while importing the
'networking_route_optimizer.pipelines.data_ingestion' module. Nothing defined therein will be
returned by 'find_pipelines'.
Traceback (most recent call last):
  File "/home/ljaniec/.local/lib/python3.10/site-packages/kedro/framework/project/__init__.py", line 442, in find_pipelines
    pipeline_module = importlib.import_module(pipeline_module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name, package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/ljaniec/workspace/networking-route-optimizer/src/networking_route_optimizer/pipelines/data_ingestion/__init__.py", line 1, in <module>
    from .pipeline import create_pipeline  # NOQA
  File "/home/ljaniec/workspace/networking-route-optimizer/src/networking_route_optimizer/pipelines/data_ingestion/pipeline.py", line 3, in <module>
    from networking_route_optimizer.pipelines.data_ingestion.nodes import (
  File "/home/ljaniec/workspace/networking-route-optimizer/src/networking_route_optimizer/pipelines/data_ingestion/nodes.py", line 3, in <module>
    import osmnx as ox
ModuleNotFoundError: No module named 'osmnx'
warnings.warn(
What is the problem there? I know that the standard OSMNx installation uses conda and I installed it with pip, but I would expect this to be a problem both in the script and in the pipeline...
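A hedged debugging sketch: the traceback shows kedro running from the user site-packages of the system Python 3.10, so one possibility is that osmnx was installed into a different interpreter or virtual environment than the one kedro run uses. Printing the interpreter and site-packages of the running process and comparing them with where pip put osmnx can confirm or rule that out:

# quick check of which interpreter and site-packages the `kedro run` process uses
import sys
import sysconfig

print(sys.executable)                    # interpreter running the pipeline
print(sysconfig.get_paths()["purelib"])  # its site-packages directory
# compare with the Location field of `pip show osmnx`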
Fazil Topal
04/20/2025, 2:13 PM

Hugo Acosta
04/21/2025, 7:52 AM

Sudip Bhandari
04/22/2025, 2:52 PM
mykedroproject/), as specified in my catalog.yml. However, I've noticed that when I implement MLflow, artifacts and metrics are logged in a different location (under the mlruns directory). This results in the same outputs being stored twice: once through Kedro and again via MLflow.
Do you have any advice on how to address this issue so that I store results only once? Ideally, I would like to have specific artifacts displayed in the MLflow UI, sourced directly from the mykedroproject/ folder.
Thanks in advance!!
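A hedged sketch with kedro-mlflow, if that plugin is in use: wrapping a catalog entry in MlflowArtifactDataset keeps the file at its usual Kedro filepath and logs that same file as an MLflow artifact, so only the entries you choose to wrap show up in the MLflow UI (MLflow still keeps its own copy in the artifact store, so it does not remove duplication entirely). Dataset name and path are placeholders:

# catalog.yml sketch, assuming the kedro-mlflow plugin (names and paths are hypothetical)
model_metrics_plot:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: matplotlib.MatplotlibWriter
    filepath: data/08_reporting/model_metrics_plot.png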
Puneet Saini
04/28/2025, 8:20 AM
polars.LazyPolarsDataset, for which I assume the filepath needs to be a glob pattern. But since kedro-datasets>=6.0.0, we are checking the availability of the file itself without expanding the glob pattern if it is passed in. Is this a bug or am I doing something wrong?

Fazil Topal
04/28/2025, 10:53 AM

Juan Luis
04/28/2025, 11:13 AM

Mikołaj Tym
04/28/2025, 1:44 PM

Jordan Barlow
04/28/2025, 4:51 PM
(ibis.FileDataset, kedro-datasets>=7.0.0).
Kedro seems to make an assumption with the filepath catalog key of a dataset: that the dataset can be read from and written to that same path.
However, Backend.read_parquet and Backend.to_parquet are different when load_args={'hive_partitioning': True}, as the corresponding DuckDB functions require a directory arg when writing, but a nested glob when reading:
https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html
This is reflected at the Ibis level as well:
https://github.com/ibis-project/ibis/issues/10939
Things still work if you have a catalog entry like this:
my_hive:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true
  save_args:
    partition_by: ${tuple:first_col,second_col}
But the write operation will treat the entire filepath like a directory path, and you end up with something like:
my_hive
└── first_col=*
    └── second_col=*
        └── *.parquet
            ├── first_col=val_1
            │   ├── second_col=cat_1
            │   │   └── data_0.parquet
            │   └── second_col=cat_2
            │       └── data_0.parquet
            └── ...
This isn't really a Kedro design problem – perhaps the DuckDB API should be more symmetric. Has anyone else overcome this at the Kedro level?
Thanks.
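One hedged workaround at the Kedro level, under the same catalog assumptions as above: register two entries over the same data, a plain directory path used only by the node that writes, and the nested glob used only by the nodes that read (entry names are placeholders):

my_hive_write:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  save_args:
    partition_by: ${tuple:first_col,second_col}

my_hive_read:
  type: ibis.FileDataset
  filepath: data/01_raw/my_hive/first_col=*/second_col=*/*.parquet
  table_name: my_hive
  file_format: parquet
  connection: ${_duckdb}
  load_args:
    hive_partitioning: true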
Lino Fernandes
04/29/2025, 6:02 AM
before_dataset_loaded / before_dataset_saved
pandas.ParquetDataset and partitions.PartitionedDataset

Matthias Roels
04/29/2025, 4:26 PM
PythonModel, in case you want to store a model combined with its preprocessing steps (which you always have to do imo). How can you do that with kedro (or kedro-mlflow)?
The problem is that you probably fitted the preprocessors in earlier nodes and persisted the result. As far as I can tell from the docs, MLflow requires the artifacts of a custom model to be persisted on disk (which you can do with the catalog), but these path strings are not readily available in the kedro nodes to be passed to the constructor of pyfunc…
Any tips, ideas welcome 😀
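A hedged sketch of one pattern (not an official kedro-mlflow API): let the final node receive the fitted preprocessor and model objects themselves, dump them to a temporary directory inside the node, and hand those paths to mlflow.pyfunc.log_model as artifacts, so no catalog path strings are needed. All names below are hypothetical:

import tempfile
from pathlib import Path

import joblib
import mlflow

class ModelWithPreprocessing(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # artifacts are restored by MLflow when the model is loaded
        self.preprocessor = joblib.load(context.artifacts["preprocessor"])
        self.model = joblib.load(context.artifacts["model"])

    def predict(self, context, model_input):
        return self.model.predict(self.preprocessor.transform(model_input))

def log_pyfunc_model(preprocessor, model):  # an ordinary Kedro node
    with tempfile.TemporaryDirectory() as tmp:
        pre_path = Path(tmp) / "preprocessor.pkl"
        model_path = Path(tmp) / "model.pkl"
        joblib.dump(preprocessor, pre_path)
        joblib.dump(model, model_path)
        mlflow.pyfunc.log_model(
            artifact_path="model",
            python_model=ModelWithPreprocessing(),
            artifacts={"preprocessor": str(pre_path), "model": str(model_path)},
        )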
Pedro Sousa Silva
04/30/2025, 9:34 AM
root: ${oc.env:AWS_S3_ROOT}
AWS_S3_ROOT

Nicolas Betancourt Cardona
04/30/2025, 2:55 PM

Pedro Sousa Silva
05/05/2025, 1:29 PM
globals configuration.", so I wonder if there's any workaround to my requirement:
We have a project where the frontend action will trigger my kedro run in Databricks (via the Databricks Jobs REST API). Some parameters from the frontend will override some of my default kedro parameters (this works fine), but I also need to override a dataset definition based on one of these parameters. In particular, I want my dataset to be written to a specific location that depends on a runtime_param simulation_id:
my globals.yaml:
root: ${oc.env:AWS_S3_ROOT}
simulation_id: ${uuid:""} # ideally something like ${runtime_params:simulation_id}, but I know it's not possible
folders:
  m_frontend: "09_frontend_reporting/${..simulation_id}"
my catalog.yaml:
simulation_json:
  type: json.JSONDataset
  filepath: ${globals:root}/${globals:folders.m_frontend}/simulation_${globals:simulation_id}.json
What are my options to achieve this?
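One hedged workaround, assuming the Databricks job that triggers the run can set environment variables: pass simulation_id as an environment variable alongside the runtime params and read it in globals with the built-in oc.env resolver, with a fallback for local runs (SIMULATION_ID is a hypothetical variable name):

# globals.yml sketch; assumes the job exports SIMULATION_ID before `kedro run`
root: ${oc.env:AWS_S3_ROOT}
simulation_id: ${oc.env:SIMULATION_ID,local-dev}
folders:
  m_frontend: "09_frontend_reporting/${..simulation_id}"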
Joseph McLeish
05/07/2025, 1:00 PM
When I run kedro run in the directory of the project (after having run uv pip install -r requirements.txt), I get the following error:
PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
It seems like I need to install Java for this to work, but there's no mention of Java anywhere in the docs, so this doesn't feel like the right option. (I'm on Windows, running locally in VS Code, and didn't encounter any issues with the requirements installation.) Is anyone able to help with this error? Thanks! 🙂