# questions-so
  • r

    rss

    07/25/2023, 10:58 PM
how to create kedro catalog entry + custom DataSet init method with values from credentials.yml I'm trying to create a custom DataSet class within the kedro framework. I need some help understanding how to combine values from the credentials.yml file. What is the kedro way of handling the 'mongo_url' property in the catalog entry? How do I map the values from credentials to the catalog entry? What does the class init method look like? catalog.yml rss_feed_load: type: kedro_workbench.extras.datasets.RSSDataSet.RSSFeedLoad mongo_url:...
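For questions like this one, the usual Kedro pattern is to keep the secret in `credentials.yml` and reference it from the catalog by key; the DataCatalog resolves the key and passes the resulting dict into the dataset's constructor as a `credentials` argument. A sketch under those assumptions (the `mongo_creds` key and the `__init__` shape are illustrative, not taken from the question):

```yaml
# conf/local/credentials.yml  (kept out of version control)
mongo_creds:
  mongo_url: "mongodb://user:pass@host:27017"

# conf/base/catalog.yml
rss_feed_load:
  type: kedro_workbench.extras.datasets.RSSDataSet.RSSFeedLoad
  credentials: mongo_creds   # key above; the resolved dict is injected into __init__
```

The custom class would then accept something like `def __init__(self, credentials: dict = None, ...)` and read `credentials["mongo_url"]` inside; the parameter name has to match what the catalog injects.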
  • r

    rss

    07/26/2023, 9:48 PM
    logging in python and kedro, how to log only DEBUG info to a file and INFO to console I'm trying to configure logging so that INFO level messages go to the console and DEBUG level messages go to a file instead. So far, I am able to get working INFO to console and DEBUG to file, the problem is that the DEBUG is also being output to the console and I'm not sure why. In particular, I'm using kedro to organize my project and it has some other features I'm trying to figure out. The following works insofar as the console gets INFO and only DEBUG level messages are saved in the file....
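A common cause of DEBUG leaking to the console is leaving the console handler at its default level: the filtering has to happen on the handlers, not on the logger. A minimal stdlib sketch of the intended split (in a Kedro project the same handler levels would be declared in the project's logging configuration file):

```python
import logging

def configure_logging(log_path: str) -> logging.Logger:
    logger = logging.getLogger("demo")
    logger.setLevel(logging.DEBUG)        # let all records reach the handlers
    logger.handlers.clear()

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)        # console shows INFO and above only

    file_handler = logging.FileHandler(log_path)
    file_handler.setLevel(logging.DEBUG)  # file captures everything

    fmt = logging.Formatter("%(levelname)s %(message)s")
    console.setFormatter(fmt)
    file_handler.setFormatter(fmt)
    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```

If DEBUG still reaches the console, check for an extra handler on the root logger (records propagate upward by default).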
  • r

    rss

    08/21/2023, 4:38 AM
    Getting Kedro globals.yml to work with OmegaConfigLoader I just started evaluating Kedro for use, and I began with a small project where I read data from a MS-SQL Server. The pipeline will run with a few months in between, and with new date ranges every time. To get this parameter (date) into the pipeline I looked into using globals.yml. The kedro run command works, I have created just one node that loads the data. I load some parameters from globals.yml to use in catalog.yml. If I use the standard lines in settings.py from kedro.config import...
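With `OmegaConfigLoader` (the default loader in recent Kedro versions), values in `conf/base/globals.yml` are referenced through the `globals` resolver in other config files rather than imported in `settings.py`. A sketch with illustrative key and dataset names (not taken from the question; check the docs for your Kedro version):

```yaml
# conf/base/globals.yml
date_range:
  start: "2023-01-01"
  end: "2023-06-30"

# conf/base/catalog.yml
sales_data:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM sales WHERE date >= '${globals:date_range.start}'"
  credentials: mssql_creds
```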
  • r

    rss

    08/28/2023, 5:48 PM
    in Kedro, how to handle tar.gz archives from the web I have a tar.gz file that I am downloading from this link: http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html What is the best way to fully integrate this TSV data into kedro, perhaps with an API dataset first, and then a node to extract it? Tar.gz files are not a default supported kedro dataset type.
    ➕ 1
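One workable split is a first node that downloads the archive (for example with an API-style dataset or `requests`) and a second node that unpacks the TSV members with the stdlib `tarfile` module. A sketch of the extraction step (function name and flattening behaviour are illustrative):

```python
import tarfile
from pathlib import Path

def extract_tsvs(archive_path: str, out_dir: str) -> list[str]:
    """Extract all .tsv members from a tar.gz archive into out_dir, flattened."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tsv"):
                member.name = Path(member.name).name  # drop internal folder paths
                tar.extract(member, path=out)
                extracted.append(str(out / member.name))
    return extracted
```

The extracted TSVs could then be catalogued as ordinary CSV-style datasets with `sep: "\t"`.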
  • r

    rss

    08/28/2023, 8:28 PM
Define column names when reading a spark dataset in kedro With kedro, how can I define the column names when reading a spark.SparkDataSet? Below is my catalog.yaml. user-playlists: type: spark.SparkDataSet file_format: csv filepath: data/01_raw/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv load_args: sep: "\t" header: False # schema: # filepath: conf/base/playlists-schema.json save_args: index: False I have been trying to use the following schema, but it doesn't seem to be accepted (schema...
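Since Spark's CSV reader accepts a DDL string as a schema, one sketch is to pass it through `load_args` (column names are inferred from the filename in the question; the types are guesses, and whether `schema` is forwarded this way can depend on the dataset version):

```yaml
user-playlists:
  type: spark.SparkDataSet
  file_format: csv
  filepath: data/01_raw/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv
  load_args:
    sep: "\t"
    header: false
    # DDL schema string, passed to spark.read; avoids a separate JSON schema file
    schema: "userid STRING, timestamp TIMESTAMP, artid STRING, artname STRING, traid STRING, traname STRING"
```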
  • r

    rss

    08/30/2023, 8:08 AM
in kedro / pyspark how to use MemoryDataset I am trying to use a MemoryDataset with kedro, in order to not save the intermediate result to disk. # nodes.py def preprocess_format_tracksessions(tracksess: DataFrame, userid_profiles: pd.DataFrame, parameters: Dict) -> MemoryDataset: In the pipeline I am defining the node output and inputs: # pipeline.py def create_pipeline(**kwargs) -> Pipeline: return pipeline([ node( func=preprocess_format_tracksessions, inputs= ["track_sessions",...
    ✅ 1
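The usual answer here is that node functions never return `MemoryDataset` objects: a node returns the plain value, and Kedro stores any output that has no catalog entry in a `MemoryDataset` automatically. An illustrative (non-Kedro) sketch of the corrected node signature, mirroring the names from the question:

```python
# The body is a placeholder; the key point is the return value. The node
# returns the transformed object itself -- Kedro wraps uncatalogued outputs
# in MemoryDataset for you, so MemoryDataset never appears in node code.
def preprocess_format_tracksessions(tracksess, userid_profiles, parameters):
    # ... real transformation logic would go here ...
    return tracksess  # plain DataFrame, NOT a MemoryDataset
```

In PySpark specifically, the default copy mode may matter; Kedro documents using a `MemoryDataset` with `copy_mode: assign` for Spark DataFrames when an explicit catalog entry is wanted.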
  • r

    rss

    09/25/2023, 8:58 AM
How to import Custom json encoder class into data catalog I have a df which stores lists in a column. I am saving the df with all columns in json using config_new: type: json.JSONDataSet filepath: data/01_raw/new_config.json save_args: indent: 6 It's saving all columns OK, except the column with a list as a string. As in: "T":[{ "Col1": "9" "Col2": "[\"7\",\"9\",\"0\",\"5\"]" }] As you can see above, col2's list is coming out as a string. I am using a json encoder class as below in a python script and saving it under src: import json ...
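One pre-save fix that avoids a custom encoder altogether is to turn the stringified lists back into real Python lists before handing the records to `json.JSONDataSet`. A stdlib sketch (the heuristic and function name are illustrative):

```python
import ast

def decode_list_strings(record: dict) -> dict:
    """Convert values like '["7","9"]' (a list stored as a string) back to lists."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str) and value.startswith("[") and value.endswith("]"):
            try:
                out[key] = ast.literal_eval(value)  # safe eval of Python literals
            except (ValueError, SyntaxError):
                out[key] = value                    # leave non-literal strings alone
        else:
            out[key] = value
    return out
```

Run it in the node that produces the df (e.g. via `df.applymap` on the affected column) so the dataset receives genuine lists.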
  • r

    rss

    10/18/2023, 8:48 AM
    Kedro simple project run error: ModuleNotFoundError followed by ValueError I am trying to setup a simple python ML project in Kedro. It is very simple, one pipeline of three nodes, a data loading node, a model computing node and a model evaluating node that just prints the accuracy. This is just to learn to use Kedro for an upcoming project. Despite the simplicity I am struggling to make this work. Basically my kedro run outputs ModuleNotFoundError: No module named 'kedro_mnist.pipelines.' followed by KeyError: 'pipeline' The above exception was the direct cause...
    👀 1
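A trailing dot in `No module named 'kedro_mnist.pipelines.'` often means an empty pipeline package name was picked up during discovery, typically because a `__init__.py` is missing somewhere. A layout worth checking against (names are illustrative; every package level needs its `__init__.py`):

```
src/kedro_mnist/
├── __init__.py
├── pipeline_registry.py
└── pipelines/
    ├── __init__.py
    └── training/
        ├── __init__.py
        ├── nodes.py
        └── pipeline.py
```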
  • r

    rss

    10/26/2023, 11:18 PM
    Kedro viz blank page I created a sample pipeline that works correctly with kedro run, however when I try to visualise it with kedro viz I'm getting basically a blank page, even though the terminal doesn't show a single error. The only detail I've found was in the inspection mode:

    https://i.stack.imgur.com/46hpN.png

    The operating system I'm running it on is windows 10, when I launch it on WSL everything is completely normal. Any ideas why it is happening and how to solve it?
    ✔️ 1
  • r

    rss

    10/28/2023, 11:38 AM
    KEDRO - How to specify an arbitrary binary file in catalog.yml? I'm currently working on a datascience project using LLMs (Large language models). Weights for models usually come in different formats, most frequently .bin or .gguf, and I'd like to keep it that way. However the only way to store binary files I know is to use type: pickle.PickleDataset like so test_model: # simple example without compression type: pickle.PickleDataSet filepath: data/07_model_output/test_model.pkl backend: pickle I'm not okay with that as I want my model files to be...
  • r

    rss

    12/27/2023, 11:28 AM
    Kedro and Streamlit Integration - Running Kedro Pipeline with Custom DataCatalog I am working on integrating Kedro, a data pipeline framework, with Streamlit, a popular Python web app framework, to build a data processing application. The primary goal is to run a specific Kedro pipeline from within my Streamlit app, using a custom DataCatalog to manage and load DataFrames. Problem Details: Integration Background: I have successfully integrated Kedro and Streamlit, and I can run Kedro pipelines from my Streamlit app. However, I want to pass custom data loaded in Streamlit...
  • r

    rss

    01/05/2024, 6:08 PM
Kedro viz failing I am using kedro 0.19.1 on PyCharm. I have installed kedro and kedro viz. When I run 'kedro viz run' it is failing. Are there any additional steps I need to follow here? Error: kedro.framework.cli.utils.KedroCliError: func: , didn't return True within specified timeout Thanks. I tried kedro viz run and was expecting a pipeline graph for my kedro pipeline.
  • r

    rss

    01/08/2024, 6:28 PM
Cannot access runtime parameter when using kedro run --params Kedro fails to resolve runtime parameters passed via the CLI with the following error: InterpolationResolutionError( omegaconf.errors.InterpolationResolutionError: Runtime parameter 'start_date' not found and no default value provided. I am trying to run my pipeline with runtime parameters via the CLI as follows: kedro run --params=start_date='2023-10-10',end_date='2023-10-11' or kedro run --params start_date='2023-10-10',end_date='2023-10-11' I expected to be able to use these...
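With `OmegaConfigLoader`, the `runtime_params` resolver only sees keys actually passed on the CLI, so any config that references one needs a fallback for runs without `--params`, which is what the error message hints at. A sketch, assuming the value is read in `parameters.yml` (the default-after-comma form should be checked against the docs for your Kedro version):

```yaml
# conf/base/parameters.yml
start_date: "${runtime_params:start_date, 2023-10-10}"  # value after the comma is the default
end_date: "${runtime_params:end_date, 2023-10-11}"
```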
  • r

    rss

    01/19/2024, 2:18 PM
    kedro ipython, how to access the spark session I am able to load a spark dataset in a kedro ipython session. First, I configured the spark session as described here. Then I launched a kedro ipython session with ipython --ext kedro.extras.extensions.ipython or kedro ipython Then, I am able to load spark datasets as defined in the catalog from kedro.framework.session import KedroSession from kedro.framework.startup import...
  • r

    rss

    01/22/2024, 12:18 PM
    kedro spark session configuration values not found I am unable to access the kedro spark session configuration from an ipython console. # /conf/base/spark.yml spark.driver.maxResultSize: 30g spark.scheduler.mode: FAIR spark.driver.memory: 15g spark.executor.memory: 15g spark.executor.cores: 4 # settings for the UI spark.ui.port: 10 spark.ui.enabled: true As you can see the spark.driver.maxResultSize is defined. after running this I get a NoSuchElementException. Any idea why? kedro ipython %reload_kedro spark =...
  • r

    rss

    01/23/2024, 8:18 AM
kedro pyspark starter cannot load spark.SparkDataset The kedro starter project immediately fails to recognize the spark dataset kedro new --starter=spaceflights-pyspark-viz cd projectpath pip install -r requirements.txt kedro ipython Class 'spark.SparkDataset' not found, is this a typo? Tested with kedro 0.19.1 and 0.19.2. Any idea?
  • r

    rss

    02/07/2024, 5:18 PM
How can I deactivate the automatic type conversion in ParquetDataset? I'm trying to load a dataset to parquet, then save it in my s3 bucket. When I try this, it automatically tries to convert my columns to int or double. For example: I have a column named ventas_deals.person_phone_value, and that column saves something like this '571234567890'. The error is this: ParquetDataSet(filepath=analytics-datalake-prod-primary-s3bucket/datasets_kedro/menu_property_matching_match_inventory/data.parquet, load_args={}, protocol=s3, save_args={}). ("Could not convert...
  • r

    rss

    03/26/2024, 2:28 PM
How to run a kedro pipeline I am trying to run this algorithm which is in a kedro pipeline. I have read the documentation about Kedro, and managed to open a Jupyter notebook with a Kedro kernel and ran some cells with the commands that were in this kedro...
    👀 2
  • r

    rss

    04/02/2024, 8:38 AM
    How do I run a Kedro pipeline on a particular input csv dataset that contains a list of requests that have to be evaluated My main input is a list of checks (in a csv file) that I have to evaluate on a huge dataset. In my main node, I parse the csv file, and for each row, I get the corresponding data from the catalog, extract the data, conduct my analysis and finally, if I have 5 rows in my input csv file, I append all 5 outputs (which is a lot of rows) vertically on top of each other (because the number of datasets my pipeline returns depends on the number of rows in the input csv file, and in Kedro you can only...
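Since a Kedro pipeline cannot vary its number of outputs per run, a common shape for this problem is a single node that loops over the check rows and stacks all results into one output. An illustrative stdlib sketch, not Kedro API: `lookup` stands in for however the per-row data is fetched (e.g. a partitioned dataset of loaders passed into the node):

```python
import csv
import io

def evaluate_checks(checks_csv: str, lookup) -> list[dict]:
    """Run every check row against data fetched by `lookup`, stacking results."""
    results = []
    reader = csv.DictReader(io.StringIO(checks_csv))
    for row in reader:
        data = lookup(row["dataset"])  # in a real project: a dataset loader
        results.extend({"check": row["check"], "value": v} for v in data)
    return results
```

The single concatenated list (or DataFrame) is then the node's one catalogued output, regardless of how many rows the input CSV has.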
  • r

    rss

    04/05/2024, 2:28 PM
    Is there a way to overwrite a Kedro dataset query in code? I want to be able to overwrite the WHERE clause in Kedro dataset queries. Let's say I have the following catalogue entry: some_table.raw: type: pandas.GBQQueryDataset sql: SELECT * FROM database.table WHERE date >= {start_date} Then, in the code, I want to overwrite it with something like: catalog.load("some_table.raw", query={"start_date": "2024-01-01"}) I know it's impossible since the load method supports no arguments except for the dataset name. But perhaps there are some...
  • v

    Viorel Teodorescu

    04/10/2024, 11:09 AM
Hi everyone. Can I use PyArmor with Kedro? Namely, can I run python scripts after they have been obfuscated with PyArmor within the Kedro framework? https://github.com/dashingsoft/pyarmor
  • s

    sandesh devkatte

    06/13/2024, 4:41 AM
Hi Team, I am working on Kedro integration with Kubeflow, but it is not working; I am getting version errors. When I use a kedro version >0.19, the kedro-kubeflow plugin does not work, and when I use a kedro version <0.19, kedro-datasets is not supported. Why is this happening? Any idea?
  • r

    rss

    06/19/2024, 9:38 AM
    python directory not read as string I have a script in python where I want to read in a path to a directory in order to read in the files within this directory. When debugging it throws me an error. When I specify the path within the function it all works, however if I specify it outside the function it throws me an error path = "users/folder" def read_data(path): directory_csv = path for filename in os.listdir(path): # here it throws me the error if filename.endswith('.csv'): file_path = os.path.join(path,...
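`os.listdir` raises an error when the path does not exist relative to the current working directory, which often differs between a debugger and `kedro run`. A `pathlib` sketch that fails loudly with the resolved absolute path makes the mismatch visible (function name is illustrative):

```python
from pathlib import Path

def read_csv_names(path: str) -> list[str]:
    """List .csv files in `path`, failing with the absolute path if it's missing."""
    directory = Path(path)
    if not directory.is_dir():
        # resolve() shows exactly where the relative path pointed at runtime
        raise NotADirectoryError(f"not a directory: {directory.resolve()}")
    return sorted(str(p) for p in directory.glob("*.csv"))
```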
  • r

    rss

    06/20/2024, 2:58 PM
    How to update a Kedro pipeline instead of replacing it? I have a dataset which I have to increment every run, but it is instead replacing it. The catalog is generalized, except for the result.csv, which is a concatenated file of all the other files generated by the pipelines. "{namespace}.{dataset_name}@csv": type: pandas.CSVDataset filepath: data/01_raw/{namespace}/{dataset_name}.csv versioned: True "{namespace}.result": type: pandas.CSVDataset filepath: data/01_raw/{namespace}/result.csv In the nodes.py file, I'm using a simple...
    👀 1
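A CSV dataset save typically rewrites the whole file, so incrementing `result.csv` has to happen in the node (or in a custom dataset). A stdlib sketch of append-with-header-once, which a node could use instead of saving through the replacing catalog entry (names are illustrative):

```python
import csv
from pathlib import Path

def append_result(result_path: str, rows: list[dict], fieldnames: list[str]) -> None:
    """Append rows to result.csv, writing the header only when the file is new."""
    path = Path(result_path)
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerows(rows)
```

An alternative that stays inside the catalog is to load the existing `result.csv` as a node input, concatenate in the node, and save the combined frame back out.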
  • u

    user

    09/01/2024, 6:18 PM
How do I add multiple .md files to the catalog in Kedro I have multiple .md files that I want to process. I want to add them all under a single name in the catalog. But .md files aren't supported by the framework. Example: I have multiple files in data/01_raw/folder_name/ , I want to be able to read all the .md files in there.
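One route that avoids a custom dataset entirely is a partitioned dataset over plain-text datasets, since markdown is just text. A sketch (the `type` strings follow recent kedro-datasets naming and may differ by version):

```yaml
markdown_docs:
  type: partitions.PartitionedDataset
  path: data/01_raw/folder_name
  dataset: text.TextDataset
  filename_suffix: ".md"
```

Loading `markdown_docs` in a node then yields a dict mapping each filename to a loader for that file's contents.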
  • r

    rss

    10/01/2024, 9:38 PM
    Create Kedro PartitionedDataset of PartitionedDatasets I'm working in a kedro project where I want to automatically label thousands of audio files, apply transformations to them and then store them in a folder of folders, each subfolder corresponding to one label. I want that folder of folders to be a catalog entry on my yml file I followed this Kedro tutorial and created my own custom dataset for saving/loading .wav files in kedro...
  • r

    rss

    10/25/2024, 12:58 PM
How to Incrementally Append and Upsert Rows in Kedro directly to a PostgreSQL DB? I'm working on a Kedro project where I have a dataset defined in catalog.yml as follows: daily_stats_dataset: type: ${datasets.orm_table} orm_model: my_proj.schemas.sqla_schemas.DailyStats credentials: my_database monthly_stats_dataset: type: ${datasets.orm_table} orm_model: my_proj.schemas.sqla_schemas.MonthlyStats credentials: my_database I need to incrementally append new rows from my daily_stats_dataset to my monthly_stats_dataset. However the daily stats are aggregated...
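If the ORM-table dataset ends up replacing rows, one escape hatch is a node that issues a native PostgreSQL upsert through the same connection. A sketch that only builds the statement, so the moving parts are visible (table and column names are invented, and the monthly aggregation is illustrative):

```python
def build_monthly_upsert(daily: str = "daily_stats", monthly: str = "monthly_stats") -> str:
    """PostgreSQL upsert aggregating daily rows into monthly rows (hypothetical columns)."""
    return (
        f"INSERT INTO {monthly} (month, total) "
        f"SELECT date_trunc('month', day)::date, sum(value) "
        f"FROM {daily} GROUP BY 1 "
        # ON CONFLICT needs a unique constraint on monthly.month to target
        f"ON CONFLICT (month) DO UPDATE SET total = EXCLUDED.total"
    )
```

The node would execute this via the SQLAlchemy engine behind the `my_database` credentials; `ON CONFLICT ... DO UPDATE` is what makes the write an upsert rather than an append.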
  • r

    rss

    12/10/2024, 1:18 PM
    Conditional Nodes in Kedro Pipelines In Kedro, I have a pipeline with various nodes. However, based on the output of a certain node, I'd like to skip the next node. It is possible to determine a specific pipeline at the initialisation of the run, but not change the nodes to use during the run. A solution I thought of is to create a node class, which takes a specific "skip_node" argument as input, which can be set based on the output of a previous node (or perhaps through a hook). However, this is a bit of a hacky solution. Is it...
    👀 1
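Without branching support in the runner, the least hacky place for the condition is usually inside the node itself: make the skippable node act as an identity function when the upstream flag says so. A minimal sketch (`expensive_step` is a stand-in for the real work):

```python
def expensive_step(data):
    # placeholder for the costly transformation being skipped
    return [x * 2 for x in data]

def conditional_node(data, skip_flag: bool):
    """Kedro nodes always execute, so the condition lives inside the node:
    pass the input through unchanged when the upstream flag says skip."""
    return data if skip_flag else expensive_step(data)
```

The upstream node emits `skip_flag` as an ordinary output, and this node takes it as an ordinary input, so the pipeline graph itself never changes mid-run.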
  • r

    rss

    04/08/2025, 1:58 PM
Python Kedro - Retrieve and use params inside pipeline_registry.py In one of my projects, I want to build my pipeline dynamically (as a sequence of several pipelines) according to values passed as parameters from the kedro CLI. For example, I have 3 pipelines (pipelineA, pipelineB and pipelineC). If the parameter passed from the CLI is ID=X, then I want to build: pipeline_to_run = pipelineA + pipelineB + pipelineC If it is ID=Y, I want to build pipeline_to_run = pipelineA + pipelineC But I encounter an issue: I'm not able to retrieve the parameter value. Does someone...
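Because `register_pipelines()` runs before `--params` values are resolved, a common workaround is to drive the composition from an environment variable instead. A sketch using plain lists to stand in for `Pipeline` objects (which also support `+`); the variable name and default are invented:

```python
import os

def register_pipelines_sketch(pipeline_a, pipeline_b, pipeline_c):
    """Assemble the pipeline from an env var, since register_pipelines()
    cannot see CLI --params (illustrative, not the Kedro API itself)."""
    build_id = os.environ.get("PIPELINE_ID", "X")  # e.g. PIPELINE_ID=Y kedro run
    if build_id == "Y":
        return pipeline_a + pipeline_c
    return pipeline_a + pipeline_b + pipeline_c
```

Alternatives include registering both compositions under different names and selecting one with `kedro run --pipeline=<name>`, which keeps the registry static.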