NAYAN JAIN
10/29/2025, 1:56 PM
weather:
  type: polars.EagerPolarsDataset
  filepath: s3a://your_bucket/data/01_raw/weather*
  file_format: csv
  credentials: ${s3_creds:123456789012,arn:role}
where s3_creds is a config resolver that returns a dictionary with access keys and secrets. One potential issue I see with this approach is that the credentials could expire if they are evaluated only at the beginning of the pipeline and not every time a load or save is performed.
Is there any better way to achieve what I want?
• Dynamic credential resolution per dataset.
• Credential refresh at load/save time.

Raghav Singh
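One pattern that might address both bullets is to resolve credentials lazily and cache them with a time-to-live, so every load/save re-checks freshness instead of trusting values captured once at catalog creation. A minimal sketch; the injected fetch callable (in real use, e.g. a boto3 STS assume-role call) and all names here are illustrative, not a Kedro API:

```python
import time
from typing import Callable


class RefreshingCredentials:
    """Cache credentials and re-fetch them when they are older than a TTL."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float = 3000):
        self._fetch = fetch          # e.g. an STS assume-role call in real use
        self._ttl = ttl_seconds
        self._cached = None
        self._fetched_at = 0.0

    def get(self) -> dict:
        # Re-fetch when the cache is empty or past the TTL.
        if self._cached is None or time.monotonic() - self._fetched_at > self._ttl:
            self._cached = self._fetch()
            self._fetched_at = time.monotonic()
        return self._cached
```

A custom dataset's `_load`/`_save` could then call `creds.get()` on every invocation instead of reading credentials only when the catalog is built.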
10/29/2025, 6:51 PM

Sejal Singh
10/30/2025, 8:59 AM

Chekeb Panschiri
10/30/2025, 4:02 PM

Flavien
10/31/2025, 8:54 AM
kedro code and, upon scrutiny, I am a bit confused by the dependencies for databricks.ManagedTableDataset. In pyproject.toml, https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml, it is stated:
hdfs-base = ["hdfs>=2.5.8, <3.0"]
s3fs-base = ["s3fs>=2021.4"]
...
databricks-managedtabledataset = ["kedro-datasets[hdfs-base,s3fs-base]"]
databricks = ["kedro-datasets[databricks-managedtabledataset]"]
But in the implementation I don't see any reference to those two packages, while the dataset requires pyspark, which is not stated as a dependency if I am not mistaken. Could you tell me if my interpretation is incorrect?

Gauthier Pierard
11/03/2025, 11:30 AM

Ayushi
11/03/2025, 1:09 PM

Mark Einhorn
11/06/2025, 12:30 PM
dev env, but when deploying and running in test, we are getting the following error:
DatasetError: Failed while loading data from dataset ParquetDataset(filepath=psi-test-data/***********/data/02_intermediate/formatted_transactions_df.parquet, load_args={}, protocol=s3, save_args={}).
An error occurred (PreconditionFailed) when calling the GetObject operation: At least one of the pre-conditions you specified did not hold
What's weird is that the error is not consistent. Sometimes the responsible node runs through just fine; other times it errors out, without anything (at least that we can see) changing. Any help would be massively appreciated! @Tom McHale

Guillaume Tauzin
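On the intermittent PreconditionFailed above: one known way this can happen is when the S3 object is rewritten while a reader holds a cached ETag and issues conditional GETs, so it may be worth checking whether another process writes the same parquet concurrently. As a stopgap for transient failures, the load can be retried with backoff; a generic sketch, not tied to Kedro's dataset API:

```python
import time


def retry(fn, attempts: int = 3, delay: float = 0.1, backoff: float = 2.0):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
            delay *= backoff
```

In practice this could wrap the dataset load in the node, or live in a thin custom dataset subclass whose `_load` delegates to the parent with retries.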
11/10/2025, 9:18 AM

Biel Stela
11/11/2025, 10:27 AM
gdal, which is a CLI for a C++ library (the one used under the hood by rasterio) that can handle the large files without problem, because it does all the streaming and all sorts of nice things under the hood. So I want to integrate this processing into my existing pipeline. Is it a bad idea to have a custom dataset that calls an external program via subprocess or something similar? Have you ever seen a pattern like this before? Will God kill a kitten if I go with this approach?
Thank you!

Shah
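On the subprocess question above: no kittens are at risk; shelling out from a custom dataset is a workable pattern as long as the return code is checked and stderr is surfaced into the pipeline logs. A self-contained sketch of the shape (in a real project this would subclass kedro.io.AbstractDataset and implement `_load`/`_save`/`_describe`; the class and the command template here are illustrative):

```python
import subprocess
from pathlib import Path


class CommandBackedDataset:
    """Dataset-style wrapper that delegates I/O to an external CLI.

    Sketch only: a real Kedro dataset would subclass kedro.io.AbstractDataset.
    """

    def __init__(self, filepath: str, command: list[str]):
        self._filepath = Path(filepath)
        # Command template; "{src}" / "{out}" are substituted at save time.
        self._command = command

    def save(self, src: str) -> None:
        cmd = [part.format(src=src, out=self._filepath) for part in self._command]
        # Capture output so a failure is debuggable, and raise on non-zero exit.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"{cmd[0]} failed: {result.stderr.strip()}")

    def load(self) -> Path:
        # Hand back the path so downstream steps can stream it themselves.
        return self._filepath
```

For gdal this might be configured with something like `command: ["gdalwarp", "{src}", "{out}"]` (illustrative; use whichever gdal subcommand fits your processing).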
11/11/2025, 3:33 PM
LinkageError occurred while loading main class org.apache.spark.launcher.Main java.lang.UnsupportedClassVersionError:
A little Google search told me it's not finding the Java installation. To resolve this, I installed the latest Java (JDK 25). Now the error has changed to:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.UnsupportedOperationException: getSubject is not supported
I have checked the java path, and it's pointing to /usr/lib/jvm/java-11-openjdk-amd64/ despite explicitly mentioning /usr/lib/jvm/jdk-25.0.1-oracle-x64/bin in the environment.
But the main issue, it seems, is with pyspark, which is not launching and throws the same error.
Since I do not need pyspark in this project, is there a way to disable it for the time being, just to test my pipeline? Or how else could I fix this?
Thanks!

Ralf Kowatsch
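On temporarily disabling pyspark: if the SparkSession is only created by a hook registered in settings.py (as the pyspark project templates do with a SparkHooks class; check what your settings.py actually registers), commenting that registration out should let the rest of the pipeline run without ever launching Spark. A sketch of settings.py under that assumption (module path hypothetical):

```python
# settings.py (sketch -- assumes Spark is initialised by a SparkHooks entry here)
# from my_project.hooks import SparkHooks   # hypothetical module path

# HOOKS = (SparkHooks(),)   # original registration
HOOKS = ()  # register no hooks, so no SparkSession is ever created
```

Any catalog entries of spark.* types would also need to be swapped or excluded while testing.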
11/13/2025, 8:12 AM

Srinivas
11/14/2025, 8:19 AM
with KedroSession.create(project_path=project_path, package_name="package", env="end") as session:
    session.run(node_names=["ds1"])
and the connection details are like this:
ds1:
  type: "${globals:datatypes.csv}"
  filepath: "abfss://<container>@<account_name>.dfs.core.windows.net/raw_data/ds1.csv.gz"
  fs_args:
    account_name: "accountName"
    sas_token: "sas_token"
  layer: raw_data
  load_args:
    sep: ";"
    escapechar: "\\"
    encoding: "utf-8"
    compression: gzip
    #lineterminator: "\n"
    usecols:
The token is fine, but I am getting this exception
DatasetError: Failed while loading data from data set CSVDataset(filepath=, load_args={}, protocol=abfss, save_args={'index': False}). Operation returned an invalid status 'Server failed to authenticate the request. Please refer to the information in the www-authenticate header.' ErrorCode:NoAuthenticationInformation

Srinivas
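On the NoAuthenticationInformation error above: note that the error shows an empty filepath and load_args={}, which suggests the catalog entry's settings are not reaching the dataset as expected. One thing worth trying (a sketch, with illustrative names) is passing account_name and sas_token through credentials rather than fs_args, since the pandas datasets forward credentials to the fsspec/adlfs filesystem as storage options:

```yaml
# catalog.yml
ds1:
  type: "${globals:datatypes.csv}"
  filepath: "abfss://<container>@<account_name>.dfs.core.windows.net/raw_data/ds1.csv.gz"
  credentials: azure_sas

# credentials.yml
azure_sas:
  account_name: "accountName"
  sas_token: "sas_token"
```

If that also fails, testing the same account_name/sas_token pair directly against adlfs outside Kedro would isolate whether the token or the wiring is at fault.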
11/14/2025, 8:19 AM

Ayushi
11/14/2025, 12:29 PM

cyril verluise
11/17/2025, 7:09 PM
DatasetError: An exception occurred when parsing config for dataset 'summary':
No module named 'tracking'. Please install the missing dependencies for
tracking.MetricsDataset:
https://docs.kedro.org/en/stable/kedro_project_setup/dependencies.html#install-dependencies-related-to-the-data-catalog
Hint: If you are trying to use a dataset from `kedro-datasets`, make sure that
the package is installed in your current environment. You can do so by running
`pip install kedro-datasets` or `pip install kedro-datasets[<dataset-group>]` to
install `kedro-datasets` along with related dependencies for the specific
dataset group.
Any idea of what is happening?

Fabian P
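For the "No module named 'tracking'" error above: that message usually means Kedro could not import kedro_datasets at all and fell back to treating `tracking` as a top-level module, so the first thing to check is that kedro-datasets is installed in the same environment that runs the pipeline (`pip install kedro-datasets`, per the hint in the error). For reference, a typical entry for this dataset looks like (filepath illustrative):

```yaml
summary:
  type: tracking.MetricsDataset
  filepath: data/09_tracking/summary.json
```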
11/19/2025, 12:50 PM
Layer.call must always be passed.'), <traceback object at 0x0000025E444A4540>)
When debugging, I can save each model individually via model.save(), so I assume the error message is not truly valid.

galenseilis
11/19/2025, 10:30 PM

Yufei Zheng
11/20/2025, 5:35 PM
kedro package and pass these to spark executor, thanks! (Tried to run the package command but still hitting "no module named xxx" in the spark executor.)

Ming Fang
11/21/2025, 12:22 AM
uvx kedro new --starter spaceflights-pandas --name spaceflights
cd spaceflights
But the next command
uv run kedro run --pipeline __default__
resulted in these errors
[11/21/25 00:21:49] INFO Using 'conf/logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly. __init__.py:270
INFO Kedro project spaceflights session.py:330
[11/21/25 00:21:51] INFO Kedro is sending anonymous usage data with the sole purpose of improving the product. No personal data or IP addresses are stored on our side. To opt plugin.py:243
out, set the `KEDRO_DISABLE_TELEMETRY` or `DO_NOT_TRACK` environment variables, or create a `.telemetry` file in the current working directory with the
contents `consent: false`. To hide this message, explicitly grant or deny consent. Read more at
<https://docs.kedro.org/en/stable/configuration/telemetry.html>
WARNING Workflow tracking is disabled during partial pipeline runs (executed using --from-nodes, --to-nodes, --tags, --pipeline, and more). run_hooks.py:135
`.viz/kedro_pipeline_events.json` will be created only during a full kedro run. See issue <https://github.com/kedro-org/kedro-viz/issues/2443> for
more details.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/coder/spaceflights/.venv/lib/python3.13/site-packages/kedro/io/core.py:187 in from_config │
│ │
│ 184 │ │ │
│ 185 │ │ """ │
│ 186 │ │ try: │
│ ❱ 187 │ │ │ class_obj, config = parse_dataset_definition( │
│ 188 │ │ │ │ config, load_version, save_version │
│ 189 │ │ │ ) │
│ 190 │ │ except Exception as exc: │
│ │
│ /home/coder/spaceflights/.venv/lib/python3.13/site-packages/kedro/io/core.py:578 in │
│ parse_dataset_definition │
│ │
│ 575 │ │ │ │ "related dependencies for the specific dataset group." │
│ 576 │ │ │ ) │
│ 577 │ │ │ default_error_msg = f"Class '{dataset_type}' not found, is this a typo?" │
│ ❱ 578 │ │ │ raise DatasetError(f"{error_msg if error_msg else default_error_msg}{hint}") │
│ 579 │ │
│ 580 │ if not class_obj: │
│ 581 │ │ class_obj = dataset_type │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
DatasetError: Dataset 'MatplotlibWriter' not found in 'matplotlib'. Make sure the dataset name is correct.
Hint: If you are trying to use a dataset from `kedro-datasets`, make sure that the package is installed in your current environment. You can do so by running `pip install kedro-datasets` or `pip
install kedro-datasets[<dataset-group>]` to install `kedro-datasets` along with related dependencies for the specific dataset group.

Jan
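On the traceback above: the module kedro_datasets.matplotlib evidently imports, but the class 'MatplotlibWriter' is not found in it, so one likely explanation is a version mismatch between the starter's catalog and the installed kedro-datasets, whose newer releases expose the class as MatplotlibDataset instead of MatplotlibWriter. Worth verifying against the installed version; a sketch of the corresponding catalog change (dataset name and filepath illustrative):

```yaml
my_plot:
  type: matplotlib.MatplotlibDataset   # newer kedro-datasets; older releases used matplotlib.MatplotlibWriter
  filepath: data/08_reporting/my_plot.png
```

Pinning kedro-datasets in requirements.txt to the version the starter was written against is the other way to reconcile the two.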
11/21/2025, 9:33 AM

Prachee Choudhury
11/22/2025, 3:44 AM

Ahmed Etefy
11/22/2025, 8:58 PM

Basem Khalaf
11/22/2025, 10:26 PM

Ahmed Etefy
11/23/2025, 9:07 PM

Gauthier Pierard
11/24/2025, 1:48 PM
after_context_created hook called AzureSecretsHook that saves some credentials in context. Can I use these credentials as node inputs?
context.config_loader["credentials"] = {
    **context.config_loader["credentials"],
    **adls_creds,
}
self.credentials = context.config_loader["credentials"]
So far I have only been able to use it by importing AzureSecretsHook and using AzureSecretsHook.get_creds() directly in the nodes:
@staticmethod
def get_creds():
    return AzureSecretsHook.credentials

Jonghyun Yun
11/25/2025, 4:31 PM

Gauthier Pierard
11/26/2025, 10:03 AM
AbstractDataset predefined currently for polars to delta table?
would something like this do the job?
import polars as pl
from deltalake import write_deltalake
from kedro.io import AbstractDataset

class PolarsDeltaDataset(AbstractDataset):
    def __init__(self, filepath: str, mode: str = "append"):
        self.filepath = filepath
        self.mode = mode

    def _load(self) -> pl.DataFrame:
        return pl.read_delta(self.filepath)

    def _save(self, data: pl.DataFrame) -> None:
        write_deltalake(self.filepath, data, mode=self.mode)

    def _describe(self):
        return dict(filepath=self.filepath, mode=self.mode)

Martin van Hensbergen
11/27/2025, 10:56 AM
MemoryDataset as input for the inference pipeline, but I get a "`DatasetError: Data for MemoryDataset has not been saved`" error when running:
with KedroSession.create() as session:
    context = session.load_context()
    context.catalog.get("input").save("mydata")
    session.run(pipeline_name="inference")
1. Is this the proper way to do it?
2. Is this a use case that is supported by Kedro, or should I only use it for the batch training and use the output of those models manually in my service?

Zubin Roy
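On questions 1 and 2 above: depending on the Kedro version, session.run rebuilds the catalog from config, so data saved on context.catalog beforehand can be discarded, which would explain the "has not been saved" error. For a long-running service, a commonly suggested pattern is to build the catalog yourself, preload the free inputs as in-memory datasets, and hand both to a runner (in a real project: kedro.io.DataCatalog, kedro.io.MemoryDataset, kedro.runner.SequentialRunner, and the pipeline from your registry). A runnable sketch of the idea with simplified stand-ins:

```python
class MemoryDataset:
    """Stand-in for kedro.io.MemoryDataset, to keep the sketch self-contained."""

    def __init__(self, data=None):
        self._data = data

    def load(self):
        return self._data

    def save(self, data):
        self._data = data


def run_pipeline(nodes, catalog):
    """Minimal sequential runner: each node maps input names to an output name."""
    for func, inputs, output in nodes:
        result = func(*[catalog[name].load() for name in inputs])
        catalog.setdefault(output, MemoryDataset()).save(result)
    return catalog


# Preload the free input before running -- the analogue of building a
# DataCatalog({"input": MemoryDataset(my_data)}) and passing it to a runner.
catalog = {"input": MemoryDataset([1.0, 2.0, 3.0])}
nodes = [(lambda xs: [x * 2 for x in xs], ["input"], "predictions")]
run_pipeline(nodes, catalog)
```

The design point: the service owns the catalog object, so it can inject fresh request data per run instead of fighting the session's config-driven catalog.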
11/28/2025, 12:04 PM
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
return {
    f"{timestamp}/national_ftds_ftus_ratio_df": national_ftds_ftus_ratio_df,
    f"{timestamp}/future_ftds_predictions_by_month_df": future_ftds_predictions_by_month_df,
    ...
}
And my catalog entry is:
forecast_outputs:
  type: partitions.PartitionedDataset
  dataset: pandas.CSVDataset
  path: s3://.../forecast/
  filename_suffix: ".csv"
This works, but I’m not sure if I’m using PartitionedDataset in the most “Kedro-native” way or if there’s a better supported pattern for grouping multiple outputs under a single version.
It’s a minor problem, but I’d love to hear any thoughts, best practices, or alternative approaches. Thanks!
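The timestamp-keyed PartitionedDataset above is a legitimate pattern for grouping several files under one run id. The built-in alternative worth knowing about is per-dataset versioning, where Kedro stamps each save with a version directory automatically; it works per dataset rather than per group, so it fits best when each output can be versioned on its own (filepath illustrative, reusing the elided bucket path):

```yaml
national_ftds_ftus_ratio_df:
  type: pandas.CSVDataset
  filepath: s3://.../forecast/national_ftds_ftus_ratio_df.csv
  versioned: true  # each save lands under .../national_ftds_ftus_ratio_df.csv/<version>/
```

If all outputs must share a single run id, the PartitionedDataset-with-prefix approach keeps that guarantee in one place, which is a reasonable trade-off.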