# questions
  • b

    Biel Stela

    11/11/2025, 10:27 AM
    Hello! I was wondering if any of you knows of, or has seen, external non-Python programs being used inside a Kedro pipeline. For context, I'm dealing with large raster files, and Python workarounds at the dataset level using rioxarray or even rasterio are quite challenging to get right without being memory hungry. On the other hand there is the command line program
    gdal
    , a CLI for the C++ library used under the hood by rasterio, which handles the large files without problems because it does all the streaming and all sorts of nice things under the hood. So I want to integrate this processing into my existing pipeline. Is it a bad idea to have a custom dataset that calls an external program via
    subprocess
    or something similar? Have you ever seen a pattern like this before? Will God kill a kitten if I go with this approach? Thank you!
    🙀 1
    e
    • 2
    • 1
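    For the question above about shelling out to gdal: a minimal sketch (not an established Kedro pattern) of a custom dataset whose save step delegates to the gdalwarp CLI via subprocess. The class name and the warp_args option are hypothetical; error handling beyond check=True is omitted.
    import subprocess
    from pathlib import Path
    from typing import Any, Optional

    from kedro.io import AbstractDataset


    class GdalWarpedRaster(AbstractDataset[str, str]):
        """Expose the raster as a path on disk; gdal does the heavy I/O."""

        def __init__(self, filepath: str, warp_args: Optional[list[str]] = None):
            self._filepath = Path(filepath)
            self._warp_args = warp_args or []

        def _load(self) -> str:
            # Hand the path to the node; downstream tools can open it lazily/streamed.
            return str(self._filepath)

        def _save(self, source_path: str) -> None:
            # Let the gdalwarp CLI do the streaming instead of loading rasters into memory.
            subprocess.run(
                ["gdalwarp", *self._warp_args, source_path, str(self._filepath)],
                check=True,
            )

        def _describe(self) -> dict[str, Any]:
            return {"filepath": str(self._filepath), "warp_args": self._warp_args}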
  • s

    Shah

    11/11/2025, 3:33 PM
    Hello, I have just started a fresh Kedro project (1.0). The nodes and pipeline are all set, and I can list the pipeline successfully. It works on a single .csv dataset and does not use PySpark. However, when trying to run, it first threw this error:
    LinkageError occurred while loading main class org.apache.spark.launcher.Main java.lang.UnsupportedClassVersionError:
    A little Google search told me it's not finding the Java installation. To resolve this, I installed the latest Java (JDK 25). Now the error has changed to:
    Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.UnsupportedOperationException: getSubject is not supported
    I have checked the Java path, and it's pointing to
    /usr/lib/jvm/java-11-openjdk-amd64/
    despite explicitly setting
    /usr/lib/jvm/jdk-25.0.1-oracle-x64/bin
    in the environment. But the main issue seems to be with PySpark, which is not launching and keeps throwing the same error. Since I do not need PySpark in this project, is there a way to disable it for the time being, just to test my pipeline? Or how else could I fix this? Thanks!
    e
    • 2
    • 3
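    A hedged sketch for the question above: if the project came from a PySpark starter, the SparkSession is usually created by a hook registered in settings.py, so removing it from HOOKS (and any spark.SparkDataset catalog entries) lets the pipeline run without Java. The hook and package names below are illustrative.
    # src/<package_name>/settings.py
    # from <package_name>.hooks import SparkHooks  # disabled while Spark is not needed

    HOOKS = ()  # previously: HOOKS = (SparkHooks(),)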
  • r

    Ralf Kowatsch

    11/13/2025, 8:12 AM
    I work with Snowpark and I'm writing a Snowpark DataSet. I don't like that I have to share the Snowpark session via the DataSet rather than per node. As far as I know, I have multiple possible situations: 1-n input datasets and 1-n output datasets. I'm currently implementing it with a singleton, which forces me to use the same session across all nodes. I would prefer an individual session per node, which would have the advantage of: • isolated workflows • allowing parallel processing • a different configuration for each session
    e
    • 2
    • 1
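    A hedged sketch of the alternative described above: instead of a module-level singleton, each catalog entry builds its own Snowpark session from its own connection config, so different entries (and the nodes using them) can be isolated and configured independently. Class and argument names are hypothetical; session caching and clean-up are simplified.
    from typing import Any, Optional

    from kedro.io import AbstractDataset
    from snowflake.snowpark import Session


    class SnowparkTableDataset(AbstractDataset):
        def __init__(self, table_name: str, connection: dict[str, Any]):
            self._table_name = table_name
            self._connection = connection
            self._session: Optional[Session] = None

        def _get_session(self) -> Session:
            # One session per dataset instance, configured by its own catalog entry.
            if self._session is None:
                self._session = Session.builder.configs(self._connection).create()
            return self._session

        def _load(self):
            return self._get_session().table(self._table_name)

        def _save(self, data) -> None:
            data.write.save_as_table(self._table_name, mode="overwrite")

        def _describe(self) -> dict[str, Any]:
            return {"table_name": self._table_name}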
  • s

    Srinivas

    11/14/2025, 8:19 AM
    Hello, I am trying to connect to ADLS using Databricks. I already have code running in an Azure VM; I took the code and tried to connect to one of the datasets using
    Copy code
    with KedroSession.create(project_path=project_path, package_name="package", env="end") as session:
        session.run(node_names=["ds1"])
    and the connection details are like this
    Copy code
    ds1:
      type: "${globals:datatypes.csv}"
      filepath: "abfss://<container>@<acount_name>.<http://dfs.core.windows.net/raw_data/ds1.csv.gz|dfs.core.windows.net/raw_data/ds1.csv.gz>"
      fs_args:
        account_name: "accountName"
        sas_token: "sas_token"
      layer: raw_data
      load_args:
        sep: ";"
        escapechar: "\\"
        encoding: "utf-8"
        compression: gzip
        #lineterminator: "\n"
        usecols:
    The token is fine, but I am getting this exception DatasetError: Failed while loading data from data set CSVDataset(filepath=, load_args={}, protocol=abfss, save_args={'index': False}). Operation returned an invalid status 'Server failed to authenticate the request. Please refer to the information in the www-authenticate header.' ErrorCode:NoAuthenticationInformation
    e
    • 2
    • 5
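    A hedged sketch of an alternative wiring for the entry above: keep the account name and SAS token in credentials.yml and reference them from the catalog entry, so they are passed to adlfs as storage options. Key names follow adlfs; whether the SAS token needs a leading "?" depends on how it was generated, and the entry below is illustrative.
    # conf/local/credentials.yml
    azure_raw_storage:
      account_name: "<account_name>"
      sas_token: "<sas_token>"

    # conf/base/catalog.yml
    ds1:
      type: pandas.CSVDataset
      filepath: "abfss://<container>@<account_name>.dfs.core.windows.net/raw_data/ds1.csv.gz"
      credentials: azure_raw_storage
      load_args:
        sep: ";"
        escapechar: "\\"
        encoding: "utf-8"
        compression: gzip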
  • s

    Srinivas

    11/14/2025, 8:19 AM
    Can anyone please help me
  • a

    Ayushi

    11/14/2025, 12:29 PM
    Hello Team, if I have 20 nodes where I want to conditionally execute nodes, like node_1 if true else node_2, is that possible in Kedro? I did go through the docs on conditionally executing pipelines but was not able to find anything relevant for nodes.
    👀 1
    e
    • 2
    • 1
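    A hedged sketch of one common workaround (Kedro has no built-in conditional execution of individual nodes): branch inside a single node and let a parameter pick the implementation. transform_a and transform_b are hypothetical stand-ins for the node_1/node_2 logic.
    from kedro.pipeline import node, pipeline


    def transform_a(data):  # hypothetical "node_1" logic
        return data


    def transform_b(data):  # hypothetical "node_2" logic
        return data


    def run_branch(data, use_variant_a: bool):
        # The condition is an ordinary parameter, so it stays visible in config.
        return transform_a(data) if use_variant_a else transform_b(data)


    conditional_pipeline = pipeline(
        [
            node(
                run_branch,
                inputs=["input_data", "params:use_variant_a"],
                outputs="branch_output",
                name="conditional_branch",
            )
        ]
    )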
  • c

    cyril verluise

    11/17/2025, 7:09 PM
    Hey, there is something strange happening. I have an environment with kedro 1.0.0 and kedro-datasets installed, but when it runs (in CI) I get a DatasetError suggesting that kedro-datasets is not installed
    Copy code
    DatasetError: An exception occurred when parsing config for dataset 'summary':
    No module named 'tracking'. Please install the missing dependencies for 
    tracking.MetricsDataset:
    <https://docs.kedro.org/en/stable/kedro_project_setup/dependencies.html#install-dependencies-related-to-the-data-catalog>
    Hint: If you are trying to use a dataset from `kedro-datasets`, make sure that 
    the package is installed in your current environment. You can do so by running 
    `pip install kedro-datasets` or `pip install kedro-datasets[<dataset-group>]` to
    install `kedro-datasets` along with related dependencies for the specific 
    dataset group.
    Any idea of what is happening?
    j
    r
    • 3
    • 3
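    A hedged check (not a fix) for the error above: it usually means the environment that executes kedro run in CI is not the one where kedro-datasets was installed, so verifying the import inside the same CI step narrows it down.
    # run in the same CI step / virtualenv that runs `kedro run`
    import kedro_datasets
    from kedro_datasets.tracking import MetricsDataset

    print(kedro_datasets.__version__, MetricsDataset)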
  • f

    Fabian P

    11/19/2025, 12:50 PM
    Hello, I want to save multiple Keras models in separate partitions. I can save a single model without problems; however, when I switch to PartitionedDataset I constantly run into errors when trying to save. My dataset is defined as:
    model_partitioned_{name}:
      type: partitions.PartitionedDataset
      path: data/07_model_output/versioned/{name}/
      filename_suffix: ".tf"
      dataset:
        type: tensorflow.TensorFlowModelDataset
        save_args:
          save_format: tf
    Trying to save the corresponding data leads to the following error: (<class 'kedro.io.core.DatasetError'>, DatasetError('Failed while saving data to dataset kedro_datasets.partitions.partitioned_dataset.PartitionedDataset(filepath=.../data_analysis/data/07_model_output/versioned/monte_carlo_models\', dataset="kedro_datasets.tensorflow.tensorflow_model_dataset.TensorFlowModelDataset(save_args={\'save_format\': \'tf\'}, load_args={\'errors\': \'ignore\'})").\nThe first argument to
    Layer.call
    must always be passed.'), <traceback object at 0x0000025E444A4540>) When debugging, I can save each model individually with model.save(), so I assume the error message is not truly valid.
    j
    • 2
    • 1
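    A hedged explanation and sketch for the error above: PartitionedDataset treats callable partition values as lazy factories and invokes them with no arguments on save, and Keras models are themselves callable, so the dataset effectively calls model() and Keras raises the Layer.call error. Wrapping each model in a zero-argument callable sidesteps this.
    def package_models(models: dict) -> dict:
        """Return lazily evaluated partitions so PartitionedDataset calls the wrapper, not the model."""
        return {name: (lambda model=model: model) for name, model in models.items()}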
  • g

    galenseilis

    11/19/2025, 10:30 PM
    Why doesn't KedroSession internally bootstrap itself when it is created outside of Jupyter or kedro run? https://docs.kedro.org/en/stable/api/framework/kedro.framework.session/#kedro.framework.session.session.KedroSession
    j
    y
    • 3
    • 2
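    For context on the question above, the usual standalone pattern is to bootstrap the project explicitly before creating the session; a minimal sketch, assuming it is run from the project root:
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path.cwd()  # assumption: executed from the project root
    bootstrap_project(project_path)  # loads settings and registers the project's pipelines

    with KedroSession.create(project_path=project_path) as session:
        session.run(pipeline_name="__default__")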
  • y

    Yufei Zheng

    11/20/2025, 5:35 PM
    Hi team, I am very new to Kedro/PySpark. We have some UDF functions defined within the Kedro pipeline, and I am wondering whether there is an example of building the Kedro dependencies using
    kedro package
    and passing these to the
    spark executor
    , thanks! (I tried to run the package command but am still hitting
    no module named xxx
    in the Spark executor.)
    ➕ 1
    j
    • 2
    • 2
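    A hedged sketch, not verified against a specific Spark setup: kedro package builds a wheel under dist/, and one common way to make the project code importable on executors is to ship that archive via --py-files (or the spark.submit.pyFiles config). The wheel's third-party dependencies still need to exist on the executors, and the driver script name below is hypothetical.
    kedro package   # writes dist/<package>-<version>-py3-none-any.whl

    # ship the packaged project code to the executors
    spark-submit --py-files dist/<package>-<version>-py3-none-any.whl entrypoint.py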
  • m

    Ming Fang

    11/21/2025, 12:22 AM
    Hi. I'm starting to learn Kedro using the quickstart tutorial here https://docs.kedro.org/en/stable/getting-started/install/#installation-prerequisites I was able to run these commands
    Copy code
    uvx kedro new --starter spaceflights-pandas --name spaceflights
    cd spaceflights
    But the next command
    Copy code
    uv run kedro run --pipeline __default__
    resulted in these errors
    Copy code
    [11/21/25 00:21:49] INFO     Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly.          __init__.py:270
                        INFO     Kedro project spaceflights                                                                                                                             session.py:330
    [11/21/25 00:21:51] INFO     Kedro is sending anonymous usage data with the sole purpose of improving the product. No personal data or IP addresses are stored on our side. To opt   plugin.py:243
                                 out, set the `KEDRO_DISABLE_TELEMETRY` or `DO_NOT_TRACK` environment variables, or create a `.telemetry` file in the current working directory with the              
                                 contents `consent: false`. To hide this message, explicitly grant or deny consent. Read more at                                                                      
                                 <https://docs.kedro.org/en/stable/configuration/telemetry.html>                                                                                                        
                        WARNING  Workflow tracking is disabled during partial pipeline runs (executed using --from-nodes, --to-nodes, --tags, --pipeline, and more).                  run_hooks.py:135
                                 `.viz/kedro_pipeline_events.json` will be created only during a full kedro run. See issue <https://github.com/kedro-org/kedro-viz/issues/2443> for                     
                                 more details.                                                                                                                                                        
    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /home/coder/spaceflights/.venv/lib/python3.13/site-packages/kedro/io/core.py:187 in from_config  │
    │                                                                                                  │
    │    184 │   │                                                                                     │
    │    185 │   │   """                                                                               │
    │    186 │   │   try:                                                                              │
    │ ❱  187 │   │   │   class_obj, config = parse_dataset_definition(                                 │
    │    188 │   │   │   │   config, load_version, save_version                                        │
    │    189 │   │   │   )                                                                             │
    │    190 │   │   except Exception as exc:                                                          │
    │                                                                                                  │
    │ /home/coder/spaceflights/.venv/lib/python3.13/site-packages/kedro/io/core.py:578 in              │
    │ parse_dataset_definition                                                                         │
    │                                                                                                  │
    │    575 │   │   │   │   "related dependencies for the specific dataset group."                    │
    │    576 │   │   │   )                                                                             │
    │    577 │   │   │   default_error_msg = f"Class '{dataset_type}' not found, is this a typo?"      │
    │ ❱  578 │   │   │   raise DatasetError(f"{error_msg if error_msg else default_error_msg}{hint}")  │
    │    579 │                                                                                         │
    │    580 │   if not class_obj:                                                                     │
    │    581 │   │   class_obj = dataset_type                                                          │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    DatasetError: Dataset 'MatplotlibWriter' not found in 'matplotlib'. Make sure the dataset name is correct.
    Hint: If you are trying to use a dataset from `kedro-datasets`, make sure that the package is installed in your current environment. You can do so by running `pip install kedro-datasets` or `pip
    install kedro-datasets[<dataset-group>]` to install `kedro-datasets` along with related dependencies for the specific dataset group.
    d
    r
    • 3
    • 7
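    A hedged first check for the error above: the project's own requirements (which pin a kedro-datasets build that provides the matplotlib dataset) may not be installed in the environment that uv run uses, and recent kedro-datasets releases have also renamed some classes, so comparing the installed version with the starter's requirements.txt helps. Commands below are illustrative.
    cd spaceflights
    uv pip show kedro-datasets             # confirm what the uv-managed venv actually has
    uv pip install -r requirements.txt     # the starter pins compatible versions here
    # or, narrowly, per the error hint:
    uv pip install "kedro-datasets[matplotlib]"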
  • j

    Jan

    11/21/2025, 9:33 AM
    Hello everyone, we used Kedro to build a pipeline that validates datasets. The pipeline has grown a lot, and sometimes we need validations to run in a certain order. I know that Kedro decides the order based on node inputs and outputs, and in the past we created dummy datasets to create dependencies, but this makes the code harder to read and maintain. We also use Apache Airflow, and in Airflow dependencies are defined explicitly between nodes by a custom operator: Node1 >> Node2 >> Node3. This sounds like a problem that must have been encountered before, so I wanted to ask around whether there is already a plugin or extension that provides this functionality?
    ➕ 1
    j
    • 2
    • 2
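    For reference, a compact version of the dummy-dependency pattern described above (Kedro has no explicit "run after" operator; ordering is still expressed through data): each validation returns a small marker output that the next validation declares as an input. Function and dataset names are illustrative.
    from kedro.pipeline import node, pipeline


    def validate_schema(data):
        # ... raise on failure ...
        return True  # marker output, used only for ordering


    def validate_values(data, _schema_ok):
        # runs only after validate_schema has produced its marker
        return True


    ordered_validations = pipeline(
        [
            node(validate_schema, inputs="dataset_a", outputs="schema_ok", name="validate_schema"),
            node(validate_values, inputs=["dataset_a", "schema_ok"], outputs="values_ok", name="validate_values"),
        ]
    )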
  • p

    Prachee Choudhury

    11/22/2025, 3:44 AM
    Hi Kedro Team, I recall there was a Python library/plugin for text animation/design, which may or may not have been a Kedro resource, that was posted about in the Kedro Slack. I cannot remember the name of the resource. Thank you
    d
    • 2
    • 3
  • a

    Ahmed Etefy

    11/22/2025, 8:58 PM
    Hey Kedro Team, I am wondering what the solution for the following use case is. I have some Iceberg tables I run my pipelines with, and at a later point in time I'd like to run on different versions of those Iceberg tables, leveraging the Iceberg versions in the catalog so it's committed in code. Is there any recommendation for how to address this? Also, should I be adding multiple catalog entries with different versions to reflect the different run instances? Does Kedro support maintaining historical versions in some way?
  • b

    Basem Khalaf

    11/22/2025, 10:26 PM
    Hi Kedro Team 🙌 I'm currently running an older Kedro project (version 0.19.3), which is causing compatibility issues with the latest Kedro-Viz I have installed (12.2.0). Could you please advise where I can find the compatibility matrix between Kedro versions, Kedro-Viz versions, and the corresponding supported Python versions? Many thanks, I truly appreciate your help in advance. Basem
    d
    • 2
    • 1
  • a

    Ahmed Etefy

    11/23/2025, 9:07 PM
    Hey team, is there a way to have pipeline-specific parameters.yml and spark.yml files? I'd ideally like to colocate pipeline config in the same folder as the pipelines for easier collaboration, and I would like pipeline runs to only load pipeline-specific configuration. I guess what I am looking for is "composable pipeline projects".
    l
    • 2
    • 3
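    Not a full answer, but a hedged sketch of the closest built-in mechanism: per-pipeline parameter files (the parameters_<pipeline>.yml convention that kedro pipeline create sets up) plus config patterns registered in settings.py; truly colocating config inside the pipeline package is not supported out of the box. The patterns below are illustrative.
    # src/<package_name>/settings.py
    CONFIG_LOADER_ARGS = {
        "config_patterns": {
            # picks up conf/<env>/parameters_<pipeline_name>.yml files
            "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
            # exposes per-environment Spark config via context.config_loader["spark"]
            "spark": ["spark*", "spark*/**"],
        }
    }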
  • g

    Gauthier Pierard

    11/24/2025, 1:48 PM
    hey, I have an
    after_context_created
    hook called
    AzureSecretsHook
    that saves some credentials in
    context
    . Can I use these
    credentials
    as node inputs?
    Copy code
    context.config_loader["credentials"] = {
                **context.config_loader["credentials"],
                **adls_creds,
            }
    self.credentials = context.config_loader["credentials"]
    so far I've only been able to use it by importing
    AzureSecretsHook
    and using
    AzureSecretsHook.get_creds()
    directly in the nodes
    Copy code
    @staticmethod
    def get_creds():
        return AzureSecretsHook.credentials
    l
    n
    • 3
    • 3
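    A hedged sketch of one way to make those credentials available as a regular node input: keep fetching them in after_context_created, then register them in the catalog as a MemoryDataset in after_catalog_created, so nodes can simply list "adls_credentials" among their inputs. Dataset and attribute names are hypothetical; dict-style catalog assignment assumes a recent Kedro (older versions use catalog.add).
    from kedro.framework.hooks import hook_impl
    from kedro.io import MemoryDataset


    class AzureSecretsHook:
        def __init__(self):
            self._adls_creds: dict = {}

        @hook_impl
        def after_context_created(self, context) -> None:
            # fetch from Key Vault (omitted) and merge into the loaded credentials as before
            self._adls_creds = {"account_name": "...", "sas_token": "..."}

        @hook_impl
        def after_catalog_created(self, catalog) -> None:
            # expose the credentials to nodes as an ordinary dataset
            catalog["adls_credentials"] = MemoryDataset(self._adls_creds)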
  • j

    Jonghyun Yun

    11/25/2025, 4:31 PM
    Hi Team, I have written Kedro pipelines for data processing, model training, and scoring. To deploy a trained model for real-time inference, I want to see whether it's a good idea to reuse the data processing and scoring pipelines. To minimize the latency, what's the best way to utilize the nodes and pipelines written in Kedro?
    g
    m
    • 3
    • 5
  • g

    Gauthier Pierard

    11/26/2025, 10:03 AM
    hey, just to confirm: there is no
    AbstractDataset
    predefined currently for Polars to Delta tables? Would something like this do the job?
    Copy code
    import polars as pl
    from deltalake import write_deltalake
    from kedro.io import AbstractDataset


    class PolarsDeltaDataset(AbstractDataset):
        def __init__(self, filepath: str, mode: str = "append"):
            self.filepath = filepath
            self.mode = mode

        def _load(self) -> pl.DataFrame:
            return pl.read_delta(self.filepath)

        def _save(self, data: pl.DataFrame) -> None:
            write_deltalake(
                self.filepath,
                data,
                mode=self.mode,
            )

        def _describe(self):
            return dict(
                filepath=self.filepath,
                mode=self.mode,
            )
    l
    n
    • 3
    • 5
  • m

    Martin van Hensbergen

    11/27/2025, 10:56 AM
    Hello all. I am new here and am investigating whether our company should use Kedro. We work in a highly regulated industry where we need to train and deploy ML models in a sound, versioned, reproducible way. Kedro seems to tick a lot of boxes when it comes to a clear directory structure, the Node/Pipeline concepts, tagging for partial execution, the data catalog, etc. I have successfully built a Kedro package with 3 pipelines: 1) preprocessing, 2) model training and 3) inference. (1) and (2) work perfectly, but it seems I have issues at the real-time inference step. After having batch-trained ML models on large datasets, I need to use the trained models for per-point inference in a hosted service. So training is batch-wise but inference is point-wise. It seems that you somehow need to do this by defining the inference input as a MemoryDataset, then somehow load data into that and execute the inference pipeline. However, I can't seem to find a way to do this properly via KedroSession. I wonder if the use case that I have is actually something that is supported out of the box by Kedro? Any advice on how to do this with minimal overhead? For now I have defined the input as a
    MemoryDataset
    for the inference pipeline, but I get a "`DatasetError: Data for MemoryDataset has not been saved`" error when running:
    Copy code
    with KedroSession.create() as session:
        context = session.load_context()
        context.catalog.get("input").save("mydata")
        session.run(pipeline_name="inference")
    1. Is this the proper way to do it? 2. Is this a use case that is supported by Kedro, or should I only use it for the batch training and use the output of those models manually in my service?
    g
    m
    • 3
    • 3
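    A hedged sketch of the point-wise inference pattern: session.run() builds its own catalog, so data saved to context.catalog beforehand is typically not seen by the run. Driving a runner directly with a catalog you hold a reference to, and feeding the request payload in as a MemoryDataset, is one way around this; exact APIs vary slightly between Kedro versions.
    from pathlib import Path

    from kedro.framework.project import pipelines
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from kedro.io import MemoryDataset
    from kedro.runner import SequentialRunner

    project_path = Path.cwd()
    bootstrap_project(project_path)

    with KedroSession.create(project_path=project_path) as session:
        context = session.load_context()
        catalog = context.catalog  # keep one reference and reuse it below
        # dict-style assignment assumes a recent Kedro; older versions use catalog.add(...)
        catalog["input"] = MemoryDataset("mydata")  # the per-request payload

        outputs = SequentialRunner().run(pipelines["inference"], catalog)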
  • z

    Zubin Roy

    11/28/2025, 12:04 PM
    Hi all 👋 A quick question about Kedro versioning behaviour. Is it possible to do folder-level versioning rather than dataset-level versioning? My use case: I have a single node that outputs a dozen CSV files each week. We want to keep weekly snapshots, but downloading each versioned dataset individually is a bit painful; ideally we'd like all files stored under a single timestamped folder. To me that's also a much cleaner folder structure for storing the files and understanding the weekly snapshot. At the moment I've implemented this by generating a timestamp myself and returning a dictionary of partition keys, e.g.:
    Copy code
    timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
    
    return {
        f"{timestamp}/national_ftds_ftus_ratio_df": national_ftds_ftus_ratio_df,
        f"{timestamp}/future_ftds_predictions_by_month_df": future_ftds_predictions_by_month_df,
        ...
    }
    And my catalog entry is:
    Copy code
    forecast_outputs:
      type: partitions.PartitionedDataset
      dataset: pandas.CSVDataset
      path: s3://.../forecast/
      filename_suffix: ".csv"
    This works, but I'm not sure if I'm using
    PartitionedDataset
    in the most "Kedro-native" way or if there's a better supported pattern for grouping multiple outputs under a single version. It's a minor problem, but I'd love to hear any thoughts, best practices, or alternative approaches. Thanks!
    l
    d
    • 3
    • 4
  • l

    Lívia Pimentel

    12/02/2025, 12:08 AM
    Hi, everyone! I'm trying to parametrize a folder name via
    --params
    at runtime, but Kedro isn't picking it up. In my
    parameters.yml
    I have:
    Copy code
    data_ingestion:
      queries:
        queries_folder: "${runtime_params:folder}"
    Then, in the pipeline creation:
    Copy code
    conf_path = str(settings.CONF_SOURCE)
    conf_loader = OmegaConfigLoader(conf_source=conf_path)
    params = conf_loader["parameters"]
    
    queries_folder = params["data_ingestion"]["queries"]["queries_folder"]
    
    query_files = [f for f in os.listdir(queries_folder) if f.endswith(".sql")]
    When I run:
    kedro run -p data_ingestion_s3 --params=folder=custom_folder
    I get an error saying
    "folder " not found, and no default value provided.
    Has anyone used runtime parameters inside parameter files like this? Do you know if this is expected, or should I be loading params differently? I would appreciate any guidance you could give me! Thanks 🙂 Note: I am using kedro version 1.0.0
    r
    d
    • 3
    • 4
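    A hedged note and sketch: the runtime_params resolver is only fed with --params values when config is loaded through the session's own config loader, so a manually constructed OmegaConfigLoader at pipeline-creation time will not see the CLI value; giving the resolver a default at least lets the config parse when nothing is passed. The fallback name below is hypothetical.
    # conf/base/parameters.yml
    data_ingestion:
      queries:
        queries_folder: "${runtime_params:folder, queries_default}"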
  • j

    Jon Cohen

    12/02/2025, 1:35 AM
    Hi! I'm starting a new project up and my business partner wants to use kedro for it. I'm not super happy about being forced to use pip as opposed to something like pixi or poetry. Is there an official way to use a different package manager?
    d
    • 2
    • 6
  • n

    NAYAN JAIN

    12/02/2025, 3:09 PM
    Hello All! Is there any workaround for using
    kedro viz
    or
    kedro viz build
    when your catalog expects runtime parameters? I am not able to use these commands without manually deleting the catalog files. Are there any plans to support
    --conf-source
    or
    --params
    in the kedro viz command?
    👀 1
    m
    • 2
    • 3
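    One hedged workaround sketch: give every runtime_params reference in the catalog a default so the configuration can be parsed even when no runtime parameters are supplied (entry name and fallback below are illustrative).
    # conf/base/catalog.yml
    my_dataset:
      type: pandas.CSVDataset
      filepath: "data/01_raw/${runtime_params:folder, default_folder}/input.csv"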
  • m

    Matthias Roels

    12/02/2025, 4:39 PM
    What's the current view on the future of using OmegaConf? It seems the project is currently in "keeping the lights on" mode, cf. https://github.com/omry/omegaconf/issues/1200
    👍 1
    m
    • 2
    • 1
  • a

    Anna-Lea

    12/03/2025, 2:25 PM
    Hi Team 👋 In our workflow it happens quite often that we have a situation where a node takes a
    PartitionedDataset
    as input and
    PartitionedDataset
    as output. So something like this:
    Copy code
    def my_node(inputs: dict[str, Callable[[], Any]]) -> dict[str, Any]:
        results = {}
        for key, value in inputs.items():
            response = my_function(value())
            results[key] = response
        return results
    Ideally, I would want: • the internal
    for
    loop to run in parallel. I've noticed that @Guillaume Tauzin mentioned a similar situation
    m
    g
    • 3
    • 6
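    A hedged sketch of parallelising the inner loop with a thread pool, which suits an I/O-bound my_function (e.g. one that calls a service); CPU-bound work would need a process pool and picklable arguments. my_function below is the same placeholder as in the message above.
    from concurrent.futures import ThreadPoolExecutor
    from typing import Any, Callable


    def my_function(value: Any) -> Any:  # placeholder, as in the message above
        return value


    def my_node(inputs: dict[str, Callable[[], Any]], max_workers: int = 8) -> dict[str, Any]:
        def process(item: tuple[str, Callable[[], Any]]) -> tuple[str, Any]:
            key, load = item
            return key, my_function(load())  # load() materialises the partition lazily

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return dict(pool.map(process, inputs.items()))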
  • r

    Ralf Kowatsch

    12/08/2025, 4:07 PM
    Hi, sorry for the question, but it has happened multiple times that referenced links like docs.kedro.org/en/1.1.1/nodes_and_pipelines/run_a_pipeline.html#load-and-save-asynchr were dead. Is there some kind of issue I'm not aware of?
    m
    r
    • 3
    • 5
  • m

    marrrcin

    12/08/2025, 8:17 PM
    Hi guys, long time no see! 😊 What would be the currently recommended way to do fan-out/fan-in processing in Kedro? Let's assume I would like to use either ParallelRunner or ThreadRunner to achieve some level of concurrency in the fan-out part.
    👀 2
    r
    • 2
    • 2
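    A hedged sketch of one fan-out/fan-in shape that tends to play well with ThreadRunner/ParallelRunner: clone the processing node per shard under a namespace (so branches share no outputs and can be scheduled concurrently), then gather the namespaced outputs in a single fan-in node. Shard and dataset names are illustrative.
    from kedro.pipeline import node, pipeline


    def process(shard):
        return shard


    def combine(*shards):
        return list(shards)


    SHARDS = ["a", "b", "c"]

    fan_out = sum(
        (
            pipeline(
                [node(process, inputs="shard_input", outputs="shard_output", name="process")],
                namespace=shard,
                inputs={"shard_input": f"input_{shard}"},
            )
            for shard in SHARDS
        ),
        start=pipeline([]),
    )

    fan_in = pipeline(
        [
            node(
                combine,
                inputs=[f"{shard}.shard_output" for shard in SHARDS],
                outputs="combined",
                name="combine",
            )
        ]
    )

    fan_out_fan_in = fan_out + fan_in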
  • m

    Marcus Warnerfjord

    12/09/2025, 11:43 AM
    Hi guys! I'm a bit new to the data engineering that Kedro is building on, have limited experience in database management, and have a use case that I'm struggling to find any straightforward solution to. My pipeline is, in essence, creating datasets of DNA sequences, which I send to an API (AlphaGenome) for property predictions, and then perform data analysis on. All of these sequences and predictions also need to be stored in a global database for later use. The quirk of this workflow is that the API enforces rate limiting, which not only motivates the need to only request predictions for new sequences (as predictions take a lot of time to produce), but also to have a failsafe if the API terminates the request, so that I can at least store the predictions I've received before that point. Additionally, the number of sequences is substantial, which creates strict requirements on memory and speed efficiency. The consequence of this is that I always first need to check what's in the database before updating it, creating a circular dependency. I also need some sort of batch logic for updating my database to store progress. And because this is a big database, I can't really have any versioning logic that duplicates data. I know that one can create a workaround by having two different datasets share the same dataset path, but I have a feeling that there might be other design practices more aligned with Kedro's. Either way, I'm guessing that there are at least a lot of similar use cases that perform database updating and might have solutions to this. Thankful for any help, and if you have additional recommendations on good kedro-datasets for this, even better! :)
    👀 1
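    Not a full design, but a hedged sketch of the "fetch only what is missing, keep whatever came back" shape described above: the prediction client is passed in as a callable so the node stays testable, and batching bounds how much work is lost if the API cuts the run short. All names are hypothetical.
    from typing import Any, Callable


    def fetch_missing_predictions(
        sequences: dict[str, str],
        stored_keys: set[str],
        predict: Callable[[str], Any],  # hypothetical wrapper around the AlphaGenome client
        batch_size: int = 50,
    ) -> dict[str, Any]:
        """Request predictions only for unseen sequences; return partial results on failure."""
        todo = [(key, seq) for key, seq in sequences.items() if key not in stored_keys]
        results: dict[str, Any] = {}
        for start in range(0, len(todo), batch_size):
            try:
                for key, seq in todo[start : start + batch_size]:
                    results[key] = predict(seq)
            except Exception:
                break  # rate-limited or terminated: keep what we already have
        return results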