Gabriel Bandeira
03/21/2025, 1:48 PM
globals.yml?
Something like this:
{import os}
file_path_01_raw: /Volumes/{os.environ.get("environment")}/01_raw
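For what it's worth, a minimal sketch of one way to do this with the OmegaConfigLoader: register a custom resolver through CONFIG_LOADER_ARGS and reference it from globals.yml (the resolver name env_or and the variable name environment are illustrative, not Kedro built-ins):
##### settings.py (sketch) #####
import os

CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        # usable in conf files as ${env_or:VAR_NAME,default}
        "env_or": lambda var, default="": os.environ.get(var, default),
    },
}
##### conf/base/globals.yml (sketch) #####
file_path_01_raw: /Volumes/${env_or:environment,dev}/01_raw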
Pietro Peterlongo
03/21/2025, 5:33 PM
def after_catalog_created(self, catalog: DataCatalog) -> None:
It seems to me there is no way to make a change to the catalog here, correct? (I can load a parameter, but I do not see a way to make a change.) This might very well be on purpose (with hooks meant only for logging and profiling). My use case (just so I do not fall into the XY-problem trap, https://en.wikipedia.org/wiki/XY_problem) is that I have a parameter whose default value should be computed from the value of another parameter, and since that parameter is used in many nodes I would rather not repeat the computation every time.
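One possible workaround, sketched under the assumption that touching the catalog in a hook is acceptable: parameters are exposed in the catalog as params:... entries, so a hook can register a derived one (base_param, derived_param and compute_default are illustrative names):
##### hooks.py (sketch) #####
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, MemoryDataset

class DerivedParameterHook:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        # load the parameter the default should be derived from (illustrative name)
        base = catalog.load("params:base_param")
        derived = compute_default(base)  # hypothetical derivation function
        # register the derived value so nodes can simply declare "params:derived_param"
        catalog.add("params:derived_param", MemoryDataset(derived), replace=True)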
Richard Purvis
03/21/2025, 7:57 PM
...fsspec objects, and expects str, os.PathLike, or bytes.
Viktoriia
03/24/2025, 8:41 AM
...kedro run command when creating a pipeline, i.e. within the function def create_pipeline(**kwargs)? I'm most interested in conf-source and env.
Gauthier Pierard
03/25/2025, 7:52 AM
...joblib to run some nodes in parallel. So far it works fine, but it breaks my logging config, since each separate joblib process overwrites the log file defined in logging.yml:
version: 1
disable_existing_loggers: False
formatters:
  simple:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
handlers:
  console:
    class: logging.StreamHandler
    level: INFO
    formatter: simple
    stream: ext://sys.stdout
  file:
    class: logging.FileHandler
    level: INFO
    formatter: simple
    filename: "logs/kedro_run.log"
    mode: 'w'
I could change mode to 'a' (append), but first I'd need to define a single log file per kedro run (with a timestamp, for example) so that I don't end up with the results of multiple runs in one log file. How can I do this?
Another solution would be to clear the log file on every run, but then I'd need to retrieve the log file path dynamically, since the pipeline will run in several environments.
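One way to get a per-run log file, sketched as an assumption rather than a recommendation: point logging.yml at a small custom handler that builds a timestamped filename when it is constructed (the module path my_project.logging_utils is illustrative):
##### my_project/logging_utils.py (sketch) #####
import logging
from datetime import datetime
from pathlib import Path

class TimestampedFileHandler(logging.FileHandler):
    """FileHandler that writes to <directory>/<prefix>_<timestamp>.log."""
    def __init__(self, directory="logs", prefix="kedro_run", **kwargs):
        Path(directory).mkdir(parents=True, exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = Path(directory) / f"{prefix}_{stamp}.log"
        super().__init__(filename, **kwargs)
##### logging.yml (sketch, replacing the plain FileHandler) #####
#  file:
#    class: my_project.logging_utils.TimestampedFileHandler
#    level: INFO
#    formatter: simple
#    directory: "logs"
#    mode: "a"
Note that each joblib worker process that re-initialises logging would still create its own timestamped file, which avoids the overwriting but spreads one run across several files.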
Puneet Saini
03/25/2025, 5:29 PM
Lucas Fiorini
03/26/2025, 6:31 PM
...sys.exit(0), but the output still results in an ERROR. Is it possible to stop the execution without generating that kind of error output?
minmin
03/27/2025, 11:07 AM
kedro run --tags="tag1" and "tag2"  # (only run a node if it has both tag1 and tag2)
As far as I can tell, doing:
kedro run --tags=tag1,tag2
is akin to saying "run all nodes that have tag1 OR tag2".
This would be useful if you put the namespace of a pipeline in the pipeline-level tags and then tag a single node in the pipeline: you could then run one node for just one namespace.
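As far as I know there is no AND syntax on the CLI, but a sketch of one workaround is to register a pre-filtered pipeline, since chaining only_nodes_with_tags gives the intersection (the pipeline and tag names below are illustrative):
##### pipeline_registry.py (sketch) #####
from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline

def register_pipelines() -> dict[str, Pipeline]:
    pipelines = find_pipelines()
    default = sum(pipelines.values())
    # chaining the filter twice keeps only nodes carrying BOTH tags
    pipelines["tag1_and_tag2"] = default.only_nodes_with_tags("tag1").only_nodes_with_tags("tag2")
    pipelines["__default__"] = default
    return pipelines
# then: kedro run --pipeline=tag1_and_tag2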
Viktoriia
03/28/2025, 8:35 AM
...conf/base and then the respective parameters in each environment, like conf/dev, conf/local, etc. I imagine something like this:
# conf/base/catalog.yaml
companies:
  type: pandas.CSVDataset
  filepath: {_raw}/companies.csv
# conf/local/catalog_globals.yaml
_raw: data/raw
# conf/dev/catalog_globals.yaml
_raw: cloud-path/raw
The problem is that it only works if I have the catalog.yaml in each of the environments, which means a lot of duplication for me. Is there a better way I could do that?
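A hedged sketch of how this is usually expressed with the OmegaConfigLoader: keep a single catalog in conf/base that references a globals key, and override only globals.yml per environment (the key name paths.raw is illustrative):
##### conf/base/catalog.yml (sketch, only defined here) #####
companies:
  type: pandas.CSVDataset
  filepath: ${globals:paths.raw}/companies.csv
##### conf/base/globals.yml (sketch) #####
paths:
  raw: data/raw
##### conf/dev/globals.yml (sketch, overrides only the value) #####
paths:
  raw: cloud-path/raw
Note that the files need to match the globals config pattern (globals*), so catalog_globals.yaml would not be picked up by default.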
Gauthier Pierard
03/28/2025, 2:37 PM
...namespace.my_pipeline) from inside a node?
my current hook:
import logging
from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)

class NamespaceHook:
    namespace = None

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        NamespaceHook.namespace = run_params.get("namespace")
        logger.info(f"Running pipeline with namespace: {NamespaceHook.namespace}")

    @staticmethod
    def get_namespace():
        return NamespaceHook.namespace
Vinicius Albert
03/28/2025, 4:59 PM
Mohamed El Guendouz
03/31/2025, 2:03 PM
Puneet Saini
04/01/2025, 5:33 AM
Robert Kwiatkowski
04/01/2025, 7:39 AM
...pipeline_1 if condition A is True, and pipeline_2 if condition B is True?
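A hedged sketch of one way to express this kind of branching: decide at registration time which pipelines make up the default run, based on a flag (condition sources, module paths and names here are illustrative):
##### pipeline_registry.py (sketch) #####
import os
from kedro.pipeline import Pipeline
from my_project.pipelines import pipeline_1, pipeline_2  # illustrative imports

def register_pipelines() -> dict[str, Pipeline]:
    p1 = pipeline_1.create_pipeline()
    p2 = pipeline_2.create_pipeline()
    # the conditions could come from an env var, a config file, an orchestrator, etc.
    condition_a = os.environ.get("RUN_PIPELINE_1") == "1"
    condition_b = os.environ.get("RUN_PIPELINE_2") == "1"
    selected = (p1 if condition_a else Pipeline([])) + (p2 if condition_b else Pipeline([]))
    return {"__default__": selected, "pipeline_1": p1, "pipeline_2": p2}
Branching at registration time keeps each run's DAG static; per-node conditional logic is generally better handled outside Kedro (for example by an orchestrator).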
Gauthier Pierard
04/01/2025, 12:14 PM
Bibo Bobo
04/01/2025, 1:31 PM
...kedro_datasets_experimental, or similar to how partitioned datasets do).
So I need to pass some credentials to initialize the langchain instance of the model (OpenAI, for example), which I can do just fine. The problem is that I also want to have the model name inside the parameters, because I use the kedro-mlflow plugin, which automatically logs parameters to mlflow, and I want the model name and probably other params (e.g. temperature) to be logged too.
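A hedged sketch of a custom dataset for this, written from scratch rather than against the actual kedro_datasets_experimental API (class name, argument names and the credentials key are all illustrative):
##### my_project/datasets/chat_model_dataset.py (sketch) #####
from kedro.io import AbstractDataset
from langchain_openai import ChatOpenAI  # assumes langchain-openai is installed

class ChatModelDataset(AbstractDataset):
    def __init__(self, model_name: str, temperature: float = 0.0, credentials: dict = None):
        self._model_name = model_name
        self._temperature = temperature
        self._api_key = (credentials or {}).get("openai_api_key")

    def _load(self) -> ChatOpenAI:
        # build the configured client lazily, so credentials never live in parameters
        return ChatOpenAI(model=self._model_name, temperature=self._temperature, api_key=self._api_key)

    def _save(self, data) -> None:
        raise NotImplementedError("read-only dataset")

    def _describe(self) -> dict:
        return {"model_name": self._model_name, "temperature": self._temperature}
If the concern is duplicating model_name between the catalog entry and parameters.yml, one option (assuming globals interpolation is available for both in your Kedro version) is to define it once in globals.yml and reference it from both places.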
Gauthier Pierard
04/01/2025, 2:25 PM
...catalog.yml from a node? I'd like to add a dynamically defined dataset to it.
Rakib Sheikh
04/02/2025, 6:34 AM
Matthias Roels
04/02/2025, 8:40 AM
Nicolas Betancourt Cardona
04/02/2025, 2:24 PM
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
The node whose outputs correspond to this catalog entry yields several dictionaries with keys of the form "subfolder_name/file.wav", so that when the node is done the output main folder should look like this:
mainfolder:
  subfolder_1
  subfolder_2
  subfolder_3
  ....
  subfolder_n
and inside each subfolder_i there must be several .wav files. This works fine, but the problem is when I run the node a second time. I would like the possibility to overwrite instead of adding new files to each subfolder. I thought the overwrite parameter of partitioned datasets would help, but it does not quite work as desired when yielding. If I change the catalog entry to
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  overwrite: True
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
then the main folder looks like this:
mainfolder:
  subfolder_n
with only one single WAV file in subfolder_n, because each time the node yields, it deletes the previously yielded files and folders. Is there a way I can use the overwrite parameter of a partitioned dataset when yielding and obtain the desired folder structure?
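A hedged sketch of one possible workaround, assuming subclassing is acceptable: a small PartitionedDataset subclass that clears the target folder only on the first save of a run, so later yields from the same node append instead of wiping the folder again. The attribute names below follow the current kedro-datasets implementation and may change between versions:
##### my_kedro_project/datasets/overwrite_once.py (sketch) #####
from kedro_datasets.partitions import PartitionedDataset

class OverwriteOncePartitionedDataset(PartitionedDataset):
    """Deletes existing partitions only on the first _save call of a run."""

    def __init__(self, *args, **kwargs):
        # keep the parent's overwrite behaviour off; we handle clearing ourselves
        kwargs["overwrite"] = False
        super().__init__(*args, **kwargs)
        self._cleared = False

    def _save(self, data) -> None:
        if not self._cleared:
            # clear the folder once per run, mimicking overwrite=True for the whole run
            if self._filesystem.exists(self._normalized_path):
                self._filesystem.rm(self._normalized_path, recursive=True)
            self._cleared = True
        super()._save(data)
The catalog entry would then point type: at my_kedro_project.datasets.overwrite_once.OverwriteOncePartitionedDataset instead of partitions.PartitionedDataset.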
Gabriel Aguiar
04/02/2025, 3:51 PM
...src/peloptmize/pipelines/, where the folder name corresponds to the desired namespace.
• Example:
  ▪︎ src/peloptmize/pipelines/data_processing/pipeline.py -> namespace: data_processing
  ◦ src/peloptmize/pipelines/data_science/pipeline.py -> namespace: data_science
◦ *Goal:* I want Kedro to dynamically infer the namespace of each pipeline based on the project's folder structure, without explicitly defining namespaces in nodes or pipelines.
◦ Also, I want to measure the execution time of each pipeline.
Hook Code:
To measure execution time and infer namespaces, I've implemented the following hook:
from kedro.framework.context import KedroContext
from kedro.framework.hooks import hook_impl
from kedro.framework.project import pipelines
from kedro.io import DataCatalog
import os
import time
import pandas as pd
from collections import defaultdict
from kedro.pipeline import Pipeline
from pathlib import Path
class ProjectHooks:
    def __init__(self):
        self._pipeline_times = defaultdict(float)
        self._start_node_time = {}
        self._node_to_pipeline = {}
        self._printed = False

    @hook_impl
    def after_context_created(self, context: KedroContext) -> None:
        # ... (your databricks code) ...
        context.catalog

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog, conf_catalog) -> None:
        pipeline_registry.register_pipelines = pipeline_registry.register_dynamic_pipelines(catalog)
        pipelines.configure("peloptmize.pipeline_registry")

    @hook_impl
    def before_pipeline_run(self, pipeline: Pipeline, run_params, catalog):
        filepath = pipeline.filepath
        path = Path(filepath)
        parts = path.parts
        if "pipelines" in parts:
            namespace_index = parts.index("pipelines") + 1
            if namespace_index < len(parts) - 1:
                namespace = parts[namespace_index]
            else:
                namespace = "default"
        else:
            namespace = "default"
        for node in pipeline.nodes:
            node_name = node.name
            self._node_to_pipeline[node_name] = namespace
            print(f"Node: {node_name}, Namespace: {namespace}")  # Added logs

    @hook_impl
    def before_node_run(self, node, catalog, inputs):
        self._start_node_time[node.name] = time.time()

    @hook_impl
    def after_node_run(self, node, catalog, inputs, outputs):
        start_time = self._start_node_time.get(node.name)
        if start_time:
            duration = time.time() - start_time
            subpipeline_name = self._node_to_pipeline.get(node.name, "unknown")
            self._pipeline_times[subpipeline_name] += duration

    @hook_impl
    def after_pipeline_run(self, pipeline, run_params, catalog):
        if not self._printed:
            self._printed = True
            df = pd.DataFrame.from_dict(
                self._pipeline_times, orient="index", columns=["execution_time_seconds"]
            ).reset_index(names="subpipeline")
            df = df.sort_values("execution_time_seconds", ascending=False)
            print("\n" + "=" * 60)
            print("TEMPOS DE EXECUÇÃO POR SUBPIPELINE (dentro de __default__ ou All)")
            print("=" * 60)
            print(df.to_string(index=False, float_format="%.2f"))
            print("=" * 60 + "\n")
Problems:
◦ *Namespace Issue (Without Explicit Namespaces):* When I do not explicitly define namespaces in my pipelines or nodes, execution times are aggregated under the name "no_namespace", indicating that nodes are not being correctly associated with their inferred namespaces.
◦ *Catalog Issue (With Namespaces):* However, when I do use namespaces in my pipelines, I encounter a "dataset not found" error when executing kedro run, even though the dataset is listed in my catalog.yml.
ValueError: Pipeline input(s) {'generate_constraints.constraints_US8',...
### The generate_constraints in this case is the name of the namespace.
Questions:
• How can I resolve the "dataset not found" problem in the catalog.yml when using namespaces?
• Are there more robust approaches to handling dynamic namespaces and time measurement in different environments?
Any help or suggestions would be greatly appreciated!
kedro 0.19.5
kedro-datasets 3.0.1
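On the ValueError specifically, a hedged sketch of the usual cause: wrapping a pipeline with namespace=... prefixes its free inputs and outputs, so generate_constraints.constraints_US8 is looked up in the catalog instead of constraints_US8. Declaring the dataset in inputs keeps its original name (node and dataset names below are illustrative):
##### pipeline.py (sketch) #####
from kedro.pipeline import Pipeline, node, pipeline

def passthrough(df):
    return df

def create_pipeline(**kwargs) -> Pipeline:
    base = pipeline([node(passthrough, inputs="constraints_US8", outputs="constraints_out")])
    return pipeline(
        base,
        namespace="generate_constraints",
        # datasets listed here keep their un-prefixed catalog names
        inputs={"constraints_US8"},
    )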
Vinicius Albert
04/02/2025, 5:53 PM
test_save:
  type: databricks.ManagedTableDataset
  catalog: blabla
  database: blabla
  table: blabla
  dataframe_type: spark
  write_mode: "overwrite"
  schema:
    fields:
      - name: "column_name"
        type: "column_type"
        nullable: false
        comment: "the description of column_name"
  tags:
    first_tag: "first_tag value"
    second_tag: "second_tag value"
Matthias Roels
04/02/2025, 8:50 PM
Galen Seilis
04/02/2025, 9:32 PM
Bibo Bobo
04/03/2025, 12:17 PM
...CONFIG_LOADER_ARGS in settings.py, some default keys get overwritten, even if you don't explicitly override them?
For example, if you set CONFIG_LOADER_ARGS to an empty dict, or only update something (e.g. the config_patterns), the base_env becomes empty. So something like:
CONFIG_LOADER_ARGS = {}
# or
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "globals": ["globals*", "globals*/**", "**/globals*"],
    }
}
breaks the configuration loading because base_env ends up being None.
I’m asking because I expected CONFIG_LOADER_ARGS to act as an update to the default values, not a full replacement. From what I’ve seen with other keys, it seems like that is how it works: for example, other patterns remain intact even if you don’t include them in your custom CONFIG_LOADER_ARGS.
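A hedged workaround sketch, assuming the defaults really are replaced rather than merged in your version: set base_env and default_run_env explicitly alongside whatever else you override:
##### settings.py (sketch) #####
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "config_patterns": {
        "globals": ["globals*", "globals*/**", "**/globals*"],
    },
}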
Gauthier Pierard
04/04/2025, 3:26 AM
...kedro run runs everything fine, but kedro run --namespace xx only executes the namespaced pipelines and skips the initial one, relying on outdated outputs. How do I execute the initial one as well when specifying --namespace?
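A hedged sketch of one workaround, assuming the upstream (non-namespaced) pipeline is registered separately: expose a combined pipeline in pipeline_registry.py and run that instead of filtering by namespace ("preprocessing" and "xx" are illustrative names):
##### pipeline_registry.py (sketch) #####
from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline

def register_pipelines() -> dict[str, Pipeline]:
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    # combine the initial pipeline with the namespaced one
    pipelines["xx_full"] = pipelines["preprocessing"] + pipelines["xx"]
    return pipelines
# then: kedro run --pipeline=xx_full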
Galen Seilis
04/04/2025, 9:26 PM
Galen Seilis
04/05/2025, 2:13 PM
Is kedro run --load-versions a supported way to reproduce a previous run using Kedro (and none of the integrations with DVC/Iceberg/etc.)?
Elvira Salakhova
04/07/2025, 8:29 AM
Chee Ming Siow
04/07/2025, 8:45 AM
ValueError: Duplicate keys found in ...?
In my code, I have a function that runs before the actual Kedro pipeline. I wish to retrieve the config in that function and prioritize the config attributes defined in the local env.
sample code
###### main.py #####
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

from myfunc import conf_eda

if __name__ == "__main__":
    # Bootstrap the project to make the config loader available
    project_path = Path.cwd()
    bootstrap_project(project_path)
    # Create a Kedro session
    with KedroSession.create(project_path=project_path) as session:
        # You can now access the catalog, pipeline, etc. from the session
        # For example, to run the pipeline:
        conf_eda()  # <------------- function
        session.run()

##### myfunc.py #####
from pathlib import Path

from kedro.config import OmegaConfigLoader

def conf_eda():
    project_path = Path.cwd()
    conf_path = str(project_path / "conf")
    conf_loader = OmegaConfigLoader(
        conf_source=conf_path,
    )
    parameters = conf_loader["parameters"]  # <----------- error
    print(parameters["model_options"])

##### conf/base/parameters_data_science.yml #####
model_options:
  test_size: 100
  random_state: 3

##### conf/local/parameters_data_science.yml #####
model_options:
  test_size: 300
  random_state: 3
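A hedged guess at the cause, for comparison: when OmegaConfigLoader is built by hand without environment settings, base and local may be scanned as a single environment, so identical top-level keys collide instead of local overriding base. A sketch that passes the environments explicitly (assuming your Kedro version exposes these arguments):
##### myfunc.py (sketch) #####
from pathlib import Path
from kedro.config import OmegaConfigLoader

def conf_eda():
    conf_path = str(Path.cwd() / "conf")
    conf_loader = OmegaConfigLoader(
        conf_source=conf_path,
        base_env="base",          # values loaded first
        default_run_env="local",  # values here override base instead of colliding
    )
    print(conf_loader["parameters"]["model_options"])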