Pedro Sousa Silva
04/11/2024, 10:36 AM

Can kedro package consider additional folders/files to be packaged? In my case, this is the basic structure:
<my_package_name>
├── conf
├── pyproject.toml
├── requirements.txt
└── src
    └── my_package_name
        ├── __init__.py
        ├── __main__.py
        ├── ...
        ├── pipelines
        ├── utils
        │   ├── databricks.py
        │   └── sharepoint.py
        └── datasets
            └── custom_delta_table.py
The regular behavior will ignore not only the conf/ folder, but also utils/ and datasets/, therefore preventing me from doing things like from my_package_name.utils import sharepoint. Any workarounds?

Yaroslav Starukhin
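A likely cause, as a guess: kedro package builds the wheel with whatever build backend pyproject.toml declares (typically setuptools), and find-style package discovery only picks up directories that contain an __init__.py. A minimal, hedged check you can run from the project root (the folder names mirror the structure above; adjust to your layout):

```python
from pathlib import Path

# Sketch: list subfolders of the package that setuptools' automatic
# discovery would skip because they lack an __init__.py.
def missing_init_dirs(package_root: str) -> list[str]:
    root = Path(package_root)
    missing = []
    for sub in root.rglob("*"):
        # Only plain directories that hold .py files but no __init__.py
        if sub.is_dir() and any(sub.glob("*.py")) and not (sub / "__init__.py").exists():
            missing.append(str(sub))
    return sorted(missing)

# Usage sketch (path is illustrative):
# print(missing_init_dirs("src/my_package_name"))
```

If utils/ or datasets/ show up here, adding an empty __init__.py to each (or listing the subpackages explicitly under your build backend's package configuration) should get them into the wheel.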
04/12/2024, 1:19 AM

Vinay Agrawal
04/12/2024, 4:30 AM

Arthur Bernardo
04/12/2024, 6:34 PM

_base_mlflow_artifact: &base_mlflow_artifact
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet

_base_mlflow_metrics: &base_mlflow_metrics
  type: kedro_mlflow.io.metrics.MlflowMetricsDataSet

{{model}}.train_metrics:
  <<: *base_mlflow_metrics

{{model}}.test_metrics:
  <<: *base_mlflow_metrics
Running the pipeline locally on my computer, the error does not occur.

Yaroslav Starukhin
04/12/2024, 9:25 PM

Yury Fedotov
04/14/2024, 1:08 AM

"parameters references will not be namespaced, but params: references will."
Does "parameters" here refer to a keyword argument of the pipeline wrapper, or does it mean that in a node definition I can use e.g. "learning_rate": "parameters:learning_rate" instead of "learning_rate": "params:learning_rate" and it will not be namespaced?
P.S. If the answer is that it's about the pipeline wrapper, then the follow-up question is: is there any way to prohibit namespacing params at the node definition? The background for the question is the following. I'm using namespaced pipelines to process different datasets via the same multi-node logic, but the parameters should be the same. Without a way to prohibit namespacing them, each time I use the pipeline wrapper I need to map a crazy number of parameters, since I want them to be reused by all modular pipelines.

Abhishek Bhatia
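If it is the pipeline wrapper's parameters argument, one way to avoid typing the mapping out by hand (a sketch; SHARED_PARAMS and the commented pipeline() call are illustrative, not from the original message) is to generate the identity mapping programmatically:

```python
# Hypothetical list of parameter names shared by every namespaced pipeline.
SHARED_PARAMS = ["learning_rate", "n_estimators", "max_depth"]

# Mapping each "params:<name>" reference to itself tells kedro's pipeline()
# wrapper to leave it un-namespaced; the comprehension replaces the
# "crazy number" of manual entries.
non_namespaced = {f"params:{name}": f"params:{name}" for name in SHARED_PARAMS}

# Sketch of the call site:
# pipeline(base_pipeline, namespace="dataset_a", parameters=non_namespaced)
```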
04/15/2024, 1:06 PM

I'm on kedro>=0.19, but struggling with getting my catalog and parameters discovered by OmegaConfigLoader.
The folder structure for parameters is something like this (same for catalog):
conf/base/
└── parameters/
    ├── <usecase1>/
    │   └── usecase1.yml
    └── <usecase2>/
        ├── <usecase2a>/
        │   ├── usecase2a_1.yml
        │   └── usecase2a_2.yml
        └── <usecase2b>/
            └── usecase2b.yml
This used to work with kedro<0.19 and OmegaConfigLoader, but now I am really struggling with how to set the glob pattern for:
1. Discovering catalog entries at any folder depth, as long as they are under a folder named catalog
2. Discovering parameters at any folder depth, as long as they are under a folder named parameters
3. Discovering catalog globals (where to place them?)
4. Discovering parameter globals (where to place them?)
Thanks! 🙂

Iñigo Hidalgo
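The usual knob for this is CONFIG_LOADER_ARGS in the project's settings.py, which overrides OmegaConfigLoader's search patterns. A sketch, with illustrative pattern strings (not the verified defaults):

```python
# settings.py sketch: widen OmegaConfigLoader's search patterns so files at
# any depth under a catalog/ or parameters/ folder are discovered, and give
# globals their own pattern under conf/base/.
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "catalog": ["catalog*", "catalog*/**", "**/catalog*", "catalog*/**/*"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*", "parameters*/**/*"],
        "globals": ["globals*", "globals*/**", "**/globals*"],
    }
}
```

The keys of config_patterns decide which files feed the catalog, parameters, and globals; anything not matched by a pattern is simply never loaded, which matches the "not discovered" symptom above.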
04/16/2024, 10:55 AM

Giovanna Cavali
04/16/2024, 5:25 PM

Matthias Roels
04/16/2024, 6:56 PM

In 0.18, I found out that when you save intermediate data, kedro loads the data from file storage again instead of using the dataset already in memory. This wastes precious I/O operations in my pipeline run. Is there a specific reason why it was implemented this way?

Abhishek Bhatia
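One built-in mitigation worth a look (a sketch; the dataset name and filepath are illustrative): kedro's CachedDataset wraps another dataset and keeps the data in memory after the first save/load, so downstream nodes don't re-read it from storage:

```yaml
my_intermediate:
  type: CachedDataset
  dataset:
    type: pandas.ParquetDataset
    filepath: data/02_intermediate/my_intermediate.parquet
```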
04/17/2024, 5:13 AM

Sergey S
04/17/2024, 5:15 PM

...
├── 2024-04-17-test-run2
│   ├── 01_raw           <-- Raw immutable data
│   ├── 02_intermediate  <-- Typed data
│   ├── 03_primary       <-- Domain model data
│   ├── 04_feature       <-- Model features
│   ├── 05_model_input   <-- Often called 'master tables'
│   ├── 06_models        <-- Serialised models
│   ├── 07_model_output  <-- Data generated by model runs
│   └── 08_reporting     <-- Ad hoc descriptive cuts
├── 2023-03-01-test-run1
│   ├── 01_raw           <-- Raw immutable data
│   ├── 02_intermediate  <-- Typed data
│   ├── 03_primary       <-- Domain model data
│   ├── 04_feature       <-- Model features
│   ├── 05_model_input   <-- Often called 'master tables'
│   ├── 06_models        <-- Serialised models
│   ├── 07_model_output  <-- Data generated by model runs
│   └── 08_reporting     <-- Ad hoc descriptive cuts
...
Gabriel Aguiar
04/18/2024, 12:21 PM

• The metrics file is named us_6_gas_by_date_tracking_metrics.json and looks like this:
```json
{
    "rmse": 2681.4785,
    "mae": 2211.2415,
    "r2": -3568893.6356
}
```
• My catalog.yml entry for the tracking dataset is as follows:
"{name_us}_{base}_{split_type}_tracking_metrics":
  type: kedro.extras.datasets.tracking.MetricsDataSet
  filepath: data/09_tracking/{name_us}_{base}_{split_type}_tracking_metrics.json
• The relevant node in my Kedro pipeline is defined like this:
node(
    func=tracking_metrics,
    inputs=[f"{name_us}_{base}_{split_type}_test_predicted"],
    outputs=f"{name_us}_{base}_{split_type}_tracking_metrics",
    name=f"tracking_{name_us}_{base}_{split_type}_node",
)
• The tracking_metrics function used in the node is implemented as follows:
from math import sqrt

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


def tracking_metrics(test_data: pd.DataFrame) -> dict[str, float]:
    target_cols = [col for col in test_data.columns if col.endswith("_target")]
    if not target_cols:
        raise ValueError("No target column found in the test data.")
    target_col = target_cols[0]
    test_data = test_data.dropna(subset=[target_col, "Prediction"])
    rmse = np.round(sqrt(mean_squared_error(test_data[target_col], test_data["Prediction"])), 4)
    mae = np.round(mean_absolute_error(test_data[target_col], test_data["Prediction"]), 4)
    r2 = np.round(r2_score(test_data[target_col], test_data["Prediction"]), 4)
    return {"rmse": rmse, "mae": mae, "r2": r2}
The file is stored at C:\Dev\kedro_pelopt\sentinela-palletizing\peloptmize\data\09_tracking\
, and I've confirmed that the metrics are being correctly saved. However, when I go to Kedro Viz's experiment tracking section, it shows "No data to display."
I'm currently using Kedro version 0.19.3
, Kedro datasets 2.1.0
and Kedro Viz version 8.0.1
. Here's what I've tried so far:
• Made sure the Kedro Viz server is pointing to the correct Kedro project.
• Restarted the Kedro Viz server after changes.
• Cleared the browser cache and tried accessing in an incognito window.
• Checked the catalog.yml
for correct file paths.
• Looked for any necessary configurations in kedro_viz.yaml
.
• Opened the developer console in the browser for any errors but didn't find any clues.
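One more thing worth checking, offered as a guess rather than a confirmed fix: kedro 0.19 removed the kedro.extras.datasets import path in favour of the kedro_datasets package, so with kedro 0.19.3 and kedro-datasets 2.1.0 the tracking entry would normally be spelled with the kedro_datasets type name, e.g.:

```yaml
"{name_us}_{base}_{split_type}_tracking_metrics":
  type: tracking.MetricsDataset   # resolved from kedro_datasets on kedro>=0.19
  filepath: data/09_tracking/{name_us}_{base}_{split_type}_tracking_metrics.json
```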
Does anyone have suggestions on what else I can check or try to resolve this? Thank you in advance for your help!

Richard Purvis
04/18/2024, 7:42 PM

DatasetError as the tool tries to load images from the catalog that don't exist. We don't have local access to the data. I also would like to deploy to GitHub Pages but anticipate the same issue.

Xavier Coubez
04/20/2024, 9:38 PM

"TypeError: "delimiter" must be a 1-character string"
...
DataSetError: Failed while saving data to data set CSVDataSet(filepath=…, load_args={'sep': \t}, protocol=file, save_args={'sep': \t}).
"delimiter" must be a 1-character string
The entry in the catalog is the following:
list_xxxx:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/list_xxxx.txt
  save_args:
    sep: "\t"    # also tried typing a literal tab, and the "small arrow" tab character
    header:
    index:
  layer: primary
Is there a way to actually create a tab separated file as output to a node? I tried to find some answers both online and in this slack but no luck so far. Will keep searching. 🙂
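A note on the likely cause, as a sketch: in YAML, the double-quoted scalar "\t" parses to a real one-character tab, while a single-quoted '\t' stays a literal backslash followed by t, a two-character string that pandas (via the csv machinery) rejects with exactly this "must be a 1-character string" error. A minimal stdlib check:

```python
import csv
import io

# Double-quoted YAML "\t" yields a real tab: one character.
tab = "\t"
assert len(tab) == 1

# Single-quoted YAML '\t' would yield backslash + t: two characters,
# which csv (and pandas) reject as a delimiter.
not_a_tab = "\\t"
assert len(not_a_tab) == 2

# Writing a tab-separated row with the real tab works fine:
buf = io.StringIO()
csv.writer(buf, delimiter=tab).writerow(["a", "b"])
```

So the catalog entry above should produce a tab-separated file as long as the sep value is double-quoted; the error message showing load_args={'sep': \t} suggests the two-character form reached pandas.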
Thanks a lot for your help!

Anh Van
04/21/2024, 6:51 PMfrom kedro_datasets.pandas import DeltaTableDataset
dataset = DeltaTableDataset(catalog_type='UNITY', \
catalog_name='mycatalog', database='mydatabase', table='table1', save_args={'mode': 'overwrite'})
df_data = dataset.load()
I got the following error:
KeyError: 'unity'
Could anyone let me know why this error happens and how to fix it? Thank you

Clement
04/22/2024, 7:45 AM

Zubin Roy
04/22/2024, 10:22 AM

I'm using pd.read_sql_query to read in SQL tables, which I think is what causes it to take so long. I was wondering if there are quicker alternative ways to load tables that people use? Thanks!
kedro_athena_test:
  type: pandas.SQLQueryDataSet
  sql: "select * from sigma.fact_wiki_media limit 100000;"
marrrcin
04/22/2024, 11:45 AM

Is it expected that on_pipeline_error is not executed when there's an exception in the dataset implementation?

Tom McHale
04/22/2024, 11:50 AMformatted_event_df:
type: pandas.ParquetDataset
filepath: "<s3://s3_bucket/filename.parquet>"
kedro.io.core.DatasetError: Failed while saving data to data set ParquetDataset(filepath=s3_bucket/file.parquet, load_args={}, protocol=s3, save_args={}).
Any ideas for what's going wrong here.
My custom class code is below:
from typing import Any, Dict

import boto3
import numpy as np
import pandas as pd
from kedro.io import AbstractDataset
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor


class PyAthenaSQLDataset(AbstractDataset[np.ndarray, np.ndarray]):
    """``PyAthenaSQLDataset`` loads / saves data for a given SQL query
    as a pandas dataframe.
    """

    def __init__(self, sql_query: str, s3_staging_dir: str, region_name: str):
        """Creates a new instance of PyAthenaSQLDataset to load / save
        data for the given SQL query.

        Args:
            sql_query: SQL query for the Athena table
            s3_staging_dir: S3 staging dir for the env on AWS
            region_name: name of the region
        """
        self.sql_query = sql_query
        self.s3_staging_dir = s3_staging_dir
        self.region_name = region_name

    def _load(self) -> pd.DataFrame:
        """Runs the query through PyAthena and returns the result as a dataframe."""
        cursor = connect(
            s3_staging_dir=self.s3_staging_dir,
            region_name=self.region_name,
            cursor_class=PandasCursor,
        ).cursor()
        return cursor.execute(self.sql_query).as_pandas()

    def _save(self, data: pd.DataFrame) -> None:
        """Saves data to the specified filepath."""
        return data

    def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset."""
        return dict(s3_query=self.sql_query, save_location=self.s3_staging_dir)
Benjamin Wallyn
04/22/2024, 2:06 PM

Galen Seilis
04/22/2024, 11:10 PM

Juan Pablo Usuga Cadavid
04/23/2024, 9:29 AM

Brandon Meek
04/24/2024, 6:55 PM

globals in jupyter?

quantumtrope
04/24/2024, 7:37 PM

"Versioning doesn't work with PartitionedDataset. You can't use both of them at the same time." However, I found a GitHub pull request (#447) that implies that you can. So, which is it? Are there any docs for that example? Maybe my specific question is: in which order do you define things? Using the "pikachu versioned dataset" from the advanced tutorial, would it be:
pikachu:
  type: partitions.PartitionedDataset
  path: data/01_raw/pokemon-images-and-types/images/images
  filename_suffix: '.png'
  dataset:
    type: kedro_pokemon.datasets.image_dataset.ImageDataset
    versioned: true
Anh Van
04/24/2024, 8:13 PM

Brandon Meek
04/24/2024, 9:49 PM

runtime_params, but session.run() doesn't take runtime_params; it seems like those should be put in KedroContext, but creating a new context loses all of the other information, and there doesn't seem to be an easy way to create a new context from an existing one or just add extra_params. Is there something I'm missing, or is this a weird use-case?

Afiq Johari
04/25/2024, 6:58 AM

Giovanna Cavali
04/25/2024, 1:30 PM

Artur Dobrogowski
04/25/2024, 3:03 PM