# questions

    NAYAN JAIN

    10/29/2025, 1:56 PM
Hello Team, I have been following this for resolving S3 credentials at runtime: https://docs.kedro.org/en/1.0.0/extend/hooks/common_use_cases/#use-hooks-to-load-external-credentials However, I need to connect to multiple S3 buckets (one per dataset), and I need a few parameters at runtime to assume an AWS role and get credentials: account_id, role_arn, etc. To do this with the above approach, my credential-resolver hook would need to resolve based on the name of the credential, which could follow a special format (account_id/role_arn), and I cannot hardcode the names in the code. I need something like lambda-function values. Is this possible? Or would it be better to use a config resolver instead, as follows:
    Copy code
    weather:
     type: polars.EagerPolarsDataset
 filepath: s3a://your_bucket/data/01_raw/weather*
     file_format: csv
     credentials: ${s3_creds:123456789012,arn:role}
where s3_creds is a config resolver that returns a dictionary with access keys and secrets. One potential issue I see with this approach is that the credentials could expire if they are evaluated only at the beginning of the pipeline and not on every load or save. Is there a better way to achieve what I want? • Dynamic credential resolution per dataset. • Credential refresh at load/save time.
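A custom resolver along those lines is possible with `OmegaConfigLoader`. A minimal sketch, registered in `settings.py` — the resolver name `s3_creds`, its argument format, and the STS session name are assumptions; only `CONFIG_LOADER_ARGS`/`custom_resolvers` is a real Kedro hook:

```python
# settings.py -- sketch of a custom resolver that assumes an AWS role
# when the catalog is resolved
def s3_creds(account_id: str, role_name: str) -> dict:
    import boto3  # imported lazily so settings.py loads without AWS deps

    arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    creds = boto3.client("sts").assume_role(
        RoleArn=arn, RoleSessionName="kedro"
    )["Credentials"]
    # fsspec/s3fs-style credential keys, as expected by `credentials:`
    return {
        "key": creds["AccessKeyId"],
        "secret": creds["SecretAccessKey"],
        "token": creds["SessionToken"],
    }


CONFIG_LOADER_ARGS = {"custom_resolvers": {"s3_creds": s3_creds}}
```

Note this still resolves once at config-load time; for refresh on every load/save you would likely need a custom dataset (or a hook) that re-assumes the role itself.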

    Raghav Singh

    10/29/2025, 6:51 PM
Hi all, I had a question about the following update for Polars datasets (https://github.com/kedro-org/kedro-plugins/issues/625). • Do we know when this implementation will happen? • In the meantime, how would you recommend solving this issue? ◦ I am trying to read Parquet files stored on S3 that were written by Spark, so I need to use glob matching for it to work. Should we create a custom dataset?

    Sejal Singh

    10/30/2025, 8:59 AM
Hi all, I am experiencing severe log corruption in GitLab CI where Kedro logs are truncated, garbled, and sentences are cut off randomly, making them completely unreadable. I've already tried: PYTHONUNBUFFERED=1 PYTHONIOENCODING=utf-8 Has anyone encountered this specific truncation/garbling issue in GitLab? Is this the known Rich library terminal-detection issue in CI? Is there any Kedro-specific solution to this problem?
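If the garbling does come from Rich's terminal detection, one commonly suggested workaround is to swap the Rich handler for a plain stream handler in the logging config Kedro picks up (via `conf/logging.yml` or `KEDRO_LOGGING_CONFIG`). A sketch, modelled loosely on Kedro's default logging config — adjust to whatever yours actually contains:

```yaml
# conf/logging.yml (sketch): plain, CI-friendly console logging
version: 1
disable_existing_loggers: False
formatters:
  simple:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
handlers:
  console:
    class: logging.StreamHandler
    formatter: simple
    stream: ext://sys.stdout
loggers:
  kedro:
    level: INFO
root:
  handlers: [console]
```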

    Chekeb Panschiri

    10/30/2025, 4:02 PM
Hi all, do you know of a solution to write free text in the catalog under a data input, e.g. where the data is coming from, when it was uploaded to the project, and by whom? Typically, I want the information to show in Kedro-Viz.
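One avenue worth exploring: catalog entries accept an arbitrary `metadata` key, and Kedro-Viz reads the `kedro-viz` section of it (e.g. `layer`). A sketch — the free-form `provenance` keys below are assumptions and are kept in the catalog but not necessarily rendered by Kedro-Viz:

```yaml
my_dataset:
  type: pandas.CSVDataset
  filepath: data/01_raw/my_data.csv
  metadata:
    kedro-viz:
      layer: raw            # shown in Kedro-Viz
    provenance:             # free-form, illustrative keys
      source: "vendor X export"
      uploaded_by: "jane.doe"
      uploaded_at: "2025-10-01"
```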

    Flavien

    10/31/2025, 8:54 AM
    Hi fellows, I am cleaning dependencies in our
    kedro
    code and, upon scrutiny, I am a bit confused by the dependencies for
    databricks.ManagedTableDataset
    . In
    pyproject.toml
, https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml, it states
    Copy code
    hdfs-base = ["hdfs>=2.5.8, <3.0"]
    s3fs-base = ["s3fs>=2021.4"]
    ...
    databricks-managedtabledataset = ["kedro-datasets[hdfs-base,s3fs-base]"]
    databricks = ["kedro-datasets[databricks-managedtabledataset]"]
But in the implementation I don't see any reference to either of those two packages, while the dataset requires
    pyspark
which, if I am not mistaken, is not stated as a dependency. Could you tell me if my interpretation is incorrect?

    Gauthier Pierard

    11/03/2025, 11:30 AM
hey, on Kedro 0.19.12, what's the best way to use a parameter to overwrite part of the catalog and catalog globals paths?

    Ayushi

    11/03/2025, 1:09 PM
Hi All, I am using namespaces. Since a namespace adds a prefix to params, we define parameters.yml as follows: N1: N2: PARAM1: PARAM_VALUE. Is there a way in Kedro to resolve the params using directory folder names, e.g. N1/N2/parameters.yml where I set just param1: param_value instead of the nested representation? Consider that the folder structure follows the layers/namespaces.

    Mark Einhorn

    11/06/2025, 12:30 PM
    hey Kedro Community, I was wondering if someone may be able to help with something. We have a pipeline which runs just fine in our
    dev
    env, but when deploying and running in
    test
    , we are getting the following error:
    Copy code
    DatasetError: Failed while loading data from dataset ParquetDataset(filepath=psi-test-data/***********/data/02_intermediate/formatted_transactions_df.parquet, load_args={}, protocol=s3, save_args={}).
    An error occurred (PreconditionFailed) when calling the GetObject operation: At least one of the pre-conditions you specified did not hold
What's weird is that the error is not consistent. Sometimes the node responsible runs through just fine; other times it errors out, without anything changing (at least that we can see). Any help would be massively appreciated! @Tom McHale

    Guillaume Tauzin

    11/10/2025, 9:18 AM
Hi 👋! I have been wondering for a long time whether it is possible to process partitioned datasets one at a time (or in chunks) in parallel (e.g. when running with ParallelRunner). Today, as I was wandering around, I found this: https://github.com/kedro-org/kedro/issues/1413 I understand the code that is provided, but it only demonstrates how to write a node whose input and output are both partitioned datasets. The issue description itself talks about creating multiple nodes that would be run in parallel, but I don't see much on this and most of the links are dead. Could someone shed some light on this? Is this currently possible? Thanks :)

    Biel Stela

    11/11/2025, 10:27 AM
Hello! I was wondering if any of you knows of, or has seen, external non-Python programs used inside a Kedro pipeline. For context, I'm dealing with large raster files, and Python workarounds at the dataset level using rioxarray or even rasterio are quite challenging to get right without being memory-hungry. On the other hand, there is the command-line program
    gdal
, a CLI for a C++ library (the one used under the hood by rasterio), which can handle the large files without problems because it does all the streaming and all sorts of nice things under the hood. So I want to integrate this processing into my existing pipeline. Is it a bad idea to have a custom dataset that calls an external program via
    subprocess
or something similar? Have you ever seen a pattern like this before? Will God kill a kitten if I go with this approach? Thank you!

    Shah

    11/11/2025, 3:33 PM
Hello, I have just started a fresh Kedro project (1.0). The nodes and pipeline are all set, and I can list the pipeline successfully. It works on a single .csv dataset, not using PySpark. However, when trying to run, it first threw this error:
    LinkageError occurred while loading main class org.apache.spark.launcher.Main java.lang.UnsupportedClassVersionError:
A little Google search told me it's not finding the Java installation. To resolve this, I installed the latest Java (JDK 25). Now the error has changed to:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.UnsupportedOperationException: getSubject is not supported
    I have checked the java path, and it's pointing to
    /usr/lib/jvm/java-11-openjdk-amd64/
    despite explicitly mentioning
    /usr/lib/jvm/jdk-25.0.1-oracle-x64/bin
in the environment. But the main issue seems to be with PySpark, which is not launching and keeps throwing the same error. Since I do not need PySpark in this project, is there a way to disable it for the time being, just to test my pipeline? Or how else could I fix this? Thanks!

    Ralf Kowatsch

    11/13/2025, 8:12 AM
I work with Snowpark and I'm writing a Snowpark DataSet. I don't like that I have to share the Snowpark session via the DataSet rather than per node. As far as I know, I have multiple possible situations: 1-n input datasets and 1-n output datasets. I'm now implementing it with a singleton, which forces me to use the same session across all nodes. I would prefer an individual session per node, which would have the advantages of: • isolated workflows • parallel processing • different configuration for each session

    Srinivas

    11/14/2025, 8:19 AM
Hello, I am trying to connect to ADLS using Databricks. I already have code that runs in an Azure VM; I took the code and tried to connect to one of the datasets using
    Copy code
    with KedroSession.create(project_path=project_path,package_name="package", env="end") as session:
    session.run(node_names=["ds1"])
    and the connection details are like this
    Copy code
    ds1:
      type: "${globals:datatypes.csv}"
  filepath: "abfss://<container>@<account_name>.dfs.core.windows.net/raw_data/ds1.csv.gz"
      fs_args:
        account_name: "accountName"
        sas_token: "sas_token"
      layer: raw_data
      load_args:
        sep: ";"
        escapechar: "\\"
        encoding: "utf-8"
        compression: gzip
        #lineterminator: "\n"
        usecols:
    The token is fine, but I am getting this exception DatasetError: Failed while loading data from data set CSVDataset(filepath=, load_args={}, protocol=abfss, save_args={'index': False}). Operation returned an invalid status 'Server failed to authenticate the request. Please refer to the information in the www-authenticate header.' ErrorCode:NoAuthenticationInformation

    Srinivas

    11/14/2025, 8:19 AM
    Can anyone please help me

    Ayushi

    11/14/2025, 12:29 PM
Hello Team, if I have 20 nodes and I want to conditionally execute nodes, like node_1 if true else node_2, is that possible in Kedro? I did go through conditionally executing pipelines but was not able to find relevant docs for nodes.
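For context, Kedro's DAG has no conditional edges at node level; the usual workaround is to put the branch inside a single node (or select whole pipelines in code before running). A toy sketch, where one node replaces the node_1/node_2 pair — names and logic are placeholders:

```python
def node_1_logic(data):
    return data * 2  # placeholder


def node_2_logic(data):
    return data + 1  # placeholder


def branch_node(data, condition: bool):
    # the branch lives inside one Kedro node; `condition` can be wired
    # from "params:..." so it is configurable per run
    return node_1_logic(data) if condition else node_2_logic(data)
```

Wired up as e.g. `node(branch_node, inputs=["input_data", "params:use_node_1"], outputs="result")`.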

    cyril verluise

    11/17/2025, 7:09 PM
Hey, there is something strange happening. I have an environment with kedro 1.0.0 and kedro-datasets installed, but when it runs (in CI) I get a DatasetError suggesting that kedro-datasets is not installed:
    Copy code
    DatasetError: An exception occurred when parsing config for dataset 'summary':
    No module named 'tracking'. Please install the missing dependencies for 
    tracking.MetricsDataset:
https://docs.kedro.org/en/stable/kedro_project_setup/dependencies.html#install-dependencies-related-to-the-data-catalog
    Hint: If you are trying to use a dataset from `kedro-datasets`, make sure that 
    the package is installed in your current environment. You can do so by running 
    `pip install kedro-datasets` or `pip install kedro-datasets[<dataset-group>]` to
    install `kedro-datasets` along with related dependencies for the specific 
    dataset group.
    Any idea of what is happening?

    Fabian P

    11/19/2025, 12:50 PM
Hello, I want to save multiple Keras models in separate partitions. I can save a single model without problems; however, when I try to switch to PartitionedDataset I constantly run into errors when trying to save. My dataset is defined as: model_partitioned_{name}: type: partitions.PartitionedDataset path: data/07_model_output/versioned/{name}/ filename_suffix: ".tf" dataset: type: tensorflow.TensorFlowModelDataset save_args: save_format: tf Trying to save the corresponding data leads to the following error: (<class 'kedro.io.core.DatasetError'>, DatasetError('Failed while saving data to dataset kedro_datasets.partitions.partitioned_dataset.PartitionedDataset(filepath=.../data_analysis/data/07_model_output/versioned/monte_carlo_models\', dataset="kedro_datasets.tensorflow.tensorflow_model_dataset.TensorFlowModelDataset(save_args={\'save_format\': \'tf\'}, load_args={\'errors\': \'ignore\'})").\nThe first argument to
    Layer.call
must always be passed.'), <traceback object at 0x0000025E444A4540>) When debugging, I can save each model individually via model.save(), so I assume the error message is not entirely accurate.
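One thing worth trying: PartitionedDataset also accepts a dict whose values are zero-argument callables, which it invokes one at a time while writing each partition. Returning lambdas that close over each model defers serialisation and sometimes sidesteps errors like this. A sketch (the model objects here are stand-ins):

```python
def partition_models(models: dict) -> dict:
    """Wrap each model in a zero-arg callable for lazy partition saving."""
    # `m=model` binds the current model at definition time, avoiding the
    # classic late-binding-lambda-in-a-loop bug
    return {name: (lambda m=model: m) for name, model in models.items()}
```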

    galenseilis

    11/19/2025, 10:30 PM
Why doesn't KedroSession internally bootstrap itself when called outside of Jupyter or kedro run? https://docs.kedro.org/en/stable/api/framework/kedro.framework.session/#kedro.framework.session.session.KedroSession

    Yufei Zheng

    11/20/2025, 5:35 PM
Hi team, I am very new to Kedro/PySpark. We have some UDF functions defined within a Kedro pipeline. I am wondering, do we have an example of building the Kedro dependencies using
    kedro package
    and pass these to
    spark executor
    , thanks! (Tried to run the package command but still hitting
    no module named xxx
    in spark executor)

    Ming Fang

    11/21/2025, 12:22 AM
    Hi. I'm starting to learn Kedro using the quickstart tutorial here https://docs.kedro.org/en/stable/getting-started/install/#installation-prerequisites I was able to run these commands
    Copy code
    uvx kedro new --starter spaceflights-pandas --name spaceflights
    cd spaceflights
    But the next command
    Copy code
    uv run kedro run --pipeline __default__
    resulted in these errors
    Copy code
    [11/21/25 00:21:49] INFO     Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly.          __init__.py:270
                        INFO     Kedro project spaceflights                                                                                                                             session.py:330
    [11/21/25 00:21:51] INFO     Kedro is sending anonymous usage data with the sole purpose of improving the product. No personal data or IP addresses are stored on our side. To opt   plugin.py:243
                                 out, set the `KEDRO_DISABLE_TELEMETRY` or `DO_NOT_TRACK` environment variables, or create a `.telemetry` file in the current working directory with the              
                                 contents `consent: false`. To hide this message, explicitly grant or deny consent. Read more at                                                                      
                                 <https://docs.kedro.org/en/stable/configuration/telemetry.html>                                                                                                        
                        WARNING  Workflow tracking is disabled during partial pipeline runs (executed using --from-nodes, --to-nodes, --tags, --pipeline, and more).                  run_hooks.py:135
                                 `.viz/kedro_pipeline_events.json` will be created only during a full kedro run. See issue <https://github.com/kedro-org/kedro-viz/issues/2443> for                     
                                 more details.                                                                                                                                                        
    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /home/coder/spaceflights/.venv/lib/python3.13/site-packages/kedro/io/core.py:187 in from_config  │
    │                                                                                                  │
    │    184 │   │                                                                                     │
    │    185 │   │   """                                                                               │
    │    186 │   │   try:                                                                              │
    │ ❱  187 │   │   │   class_obj, config = parse_dataset_definition(                                 │
    │    188 │   │   │   │   config, load_version, save_version                                        │
    │    189 │   │   │   )                                                                             │
    │    190 │   │   except Exception as exc:                                                          │
    │                                                                                                  │
    │ /home/coder/spaceflights/.venv/lib/python3.13/site-packages/kedro/io/core.py:578 in              │
    │ parse_dataset_definition                                                                         │
    │                                                                                                  │
    │    575 │   │   │   │   "related dependencies for the specific dataset group."                    │
    │    576 │   │   │   )                                                                             │
    │    577 │   │   │   default_error_msg = f"Class '{dataset_type}' not found, is this a typo?"      │
    │ ❱  578 │   │   │   raise DatasetError(f"{error_msg if error_msg else default_error_msg}{hint}")  │
    │    579 │                                                                                         │
    │    580 │   if not class_obj:                                                                     │
    │    581 │   │   class_obj = dataset_type                                                          │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    DatasetError: Dataset 'MatplotlibWriter' not found in 'matplotlib'. Make sure the dataset name is correct.
    Hint: If you are trying to use a dataset from `kedro-datasets`, make sure that the package is installed in your current environment. You can do so by running `pip install kedro-datasets` or `pip
    install kedro-datasets[<dataset-group>]` to install `kedro-datasets` along with related dependencies for the specific dataset group.

    Jan

    11/21/2025, 9:33 AM
Hello everyone, we used Kedro to build a pipeline that validates datasets. The pipeline has grown a lot, and sometimes we need validations to run in a certain order. I know that Kedro decides order based on node inputs and outputs, and in the past we created dummy datasets to create dependencies, but this makes the code harder to read and maintain. We also use Apache Airflow, where dependencies are defined explicitly between nodes by a custom operator: Node1 >> Node2 >> Node3. This sounds like a problem that must have been encountered before, so I wanted to ask around whether there already is a plugin or extension that provides this functionality?

    Prachee Choudhury

    11/22/2025, 3:44 AM
Hi Kedro Team, I recall there was a Python library/plugin for text animation/design, which may or may not have been a Kedro resource, that was posted about in the Kedro Slack. I cannot remember the name of the resource. Thank you

    Ahmed Etefy

    11/22/2025, 8:58 PM
Hey Kedro Team, I am wondering what the solution for the following use case is. I have some Iceberg tables that I run my pipelines with, and at a later point in time I'd like to run on different versions of those Iceberg tables, leveraging the Iceberg versions in the catalog so it's committed in code. Is there any recommendation for how to address this? Also, should I be adding multiple catalog entries with different versions to reflect the different run instances? Does Kedro support maintaining historical versions in some way?

    Basem Khalaf

    11/22/2025, 10:26 PM
    Hi Kedro Team 🙌 I’m currently running an older Kedro project (version 0.19.3), which is causing compatibility issues with the latest Kedro-Viz I have installed (12.2.0). Could you please advise where I can find the compatibility matrix between Kedro versions, Kedro-Viz versions, and the corresponding supported Python versions? Many thanks—I truly appreciate your help in advance. Basem

    Ahmed Etefy

    11/23/2025, 9:07 PM
Hey team, is there a way to have pipeline-specific parameters.yml and spark.yml? I'd ideally like to colocate pipeline config in the same folder as the pipelines for easier collaboration, and I would like pipeline runs to only load pipeline-specific configuration. I guess what I am looking for is "composable pipeline projects".

    Gauthier Pierard

    11/24/2025, 1:48 PM
    hey, i have an
    after_context_created
    hook called
    AzureSecretsHook
    that saves some credentials in
    context
    . Can I use these
    credentials
    as node inputs?
    Copy code
    context.config_loader["credentials"] = {
                **context.config_loader["credentials"],
                **adls_creds,
            }
    self.credentials = context.config_loader["credentials"]
so far I have only been able to use it by importing
    AzureSecretsHook
    and using
    AzureSecretsHook.get_creds()
    directly in the nodes
    Copy code
    @staticmethod
        def get_creds():
            return AzureSecretsHook.credentials

    Jonghyun Yun

    11/25/2025, 4:31 PM
Hi Team, I have written Kedro pipelines for data processing, model training, and scoring. To deploy a trained model for real-time inference, I want to see whether it's a good idea to reuse the data processing and scoring pipelines. To minimize latency, what's the best way to utilize nodes and pipelines written in Kedro?

    Gauthier Pierard

    11/26/2025, 10:03 AM
hey, just to confirm: there is currently no
    AbstractDataset
predefined for Polars to Delta table? Would something like this do the job?
    Copy code
import polars as pl

from kedro.io import AbstractDataset


class PolarsDeltaDataset(AbstractDataset):
    """Reads/writes a Delta table with Polars (needs the deltalake package)."""

    def __init__(self, filepath: str, mode: str = "append"):
        self.filepath = filepath
        self.mode = mode

    def _load(self) -> pl.DataFrame:
        return pl.read_delta(self.filepath)

    def _save(self, data: pl.DataFrame) -> None:
        # pl.DataFrame.write_delta wraps deltalake's write_deltalake
        data.write_delta(self.filepath, mode=self.mode)

    def _describe(self) -> dict:
        return dict(filepath=self.filepath, mode=self.mode)

    Martin van Hensbergen

    11/27/2025, 10:56 AM
Hello all. I am new here and am investigating whether our company should use Kedro. We work in a highly regulated industry where we need to train and deploy ML models in a sound, versioned, reproducible way. Kedro seems to tick a lot of boxes when it comes to clear directory structure, Node/Pipeline concepts, tagging for partial execution, the data catalog, etc. I have successfully built a Kedro package with 3 pipelines: 1) preprocessing, 2) model training and 3) inference. (1) and (2) work perfectly, but it seems I have issues at real-time inference. After having batch-trained ML models on large datasets, I need to use the trained models for per-point inference in a hosted service. So training is batch-wise but inference is point-wise. It seems that you somehow need to do this by defining the inference input as a MemoryDataset, somehow loading data into it, and executing the inference pipeline. However, I can't seem to find a way to do this properly via KedroSession. I wonder if my use case is actually something that is supported out of the box by Kedro? Any advice on how to do this with minimal overhead? For now I have defined the input as
    MemoryDataset
    as input for the inference pipeline but I get "`DatasetError: Data for MemoryDataset has not been saved`" error when running:
    Copy code
    with KedroSession.create() as session:
        context = session.load_context()
        context.catalog.get("input").save("mydata")
        session.run(pipeline_name="inference")
1. Is this the proper way to do it? 2. Is this a use case that is supported by Kedro, or should I only use it for the batch training and use the output of those models manually in my service?

    Zubin Roy

    11/28/2025, 12:04 PM
Hi all 👋 A quick question about Kedro versioning behaviour. Is it possible to do folder-level versioning rather than dataset-level versioning? My use case: I have a single node that outputs a dozen CSV files each week. We want to keep weekly snapshots, but downloading each versioned dataset individually is a bit painful; ideally we'd like all files stored under a single timestamped folder. To me that is also a much cleaner folder structure for storing the files and understanding the weekly snapshot. At the moment I've implemented this by generating a timestamp myself and returning a dictionary of partition keys, e.g.:
    Copy code
    timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
    
    return {
        f"{timestamp}/national_ftds_ftus_ratio_df": national_ftds_ftus_ratio_df,
        f"{timestamp}/future_ftds_predictions_by_month_df": future_ftds_predictions_by_month_df,
        ...
    }
    And my catalog entry is:
    Copy code
    forecast_outputs:
      type: partitions.PartitionedDataset
      dataset: pandas.CSVDataset
      path: s3://.../forecast/
      filename_suffix: ".csv"
    This works, but I’m not sure if I’m using
    PartitionedDataset
    in the most “Kedro-native” way or if there’s a better supported pattern for grouping multiple outputs under a single version. It’s a minor problem, but I’d love to hear any thoughts, best practices, or alternative approaches. Thanks!