# questions
  • Flavien

    07/31/2023, 3:42 PM
    Hi fellows, I followed the documentation for packaging Iris on Databricks and it works really well 👍. I wanted to go a step further, using ManagedTableDataset (which works great too) and running different independent pipelines defined in the same project, but I did not manage to do so. I modified databricks_run.py to account for a --pipeline option, but I think the problem is that packaging the project does not take into account pipelines created through kedro pipeline create, if I am not mistaken (but I probably am). Would you point me towards my mistake? Thanks!
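    (A minimal sketch of what such a --pipeline option can look like in the entry-point script, assuming the standard KedroSession API for a packaged Kedro 0.18.x project; the argument names, package name and paths below are illustrative, not taken from the original databricks_run.py.)
    # databricks_run.py (sketch): select a registered pipeline by name at run time.
    import argparse

    from kedro.framework.project import configure_project
    from kedro.framework.session import KedroSession


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--pipeline", default="__default__")
        parser.add_argument("--env", default=None)
        parser.add_argument("--conf-source", default=None)
        args = parser.parse_args()

        configure_project("my_package")  # placeholder: the packaged project's package name
        with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
            # Pipelines created with `kedro pipeline create` are available here,
            # provided register_pipelines() returns them.
            session.run(pipeline_name=args.pipeline)


    if __name__ == "__main__":
        main()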
  • Jon Cohen

    07/31/2023, 6:23 PM
    I'm setting up some better monitoring infrastructure for our data pipeline. I've only done observability for web servers before and don't know much about the monitoring ecosystem for data pipelines. Are there any services or systems that people here like using for this purpose?
  • Emilio Gagliardi

    07/31/2023, 8:32 PM
    Has anyone incorporated an LLM pipeline in a kedro project yet? I'd like to try using OpenAI to perform some processing on a collection of JSON documents, and I'd love to see a working example or hear about any lessons learned. Thanks kindly!
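    (Not a full working example, but a minimal sketch of a node that pushes JSON documents through OpenAI, assuming a PartitionedDataSet of JSON files and the pre-1.0 openai package; the model name, prompt and function name are placeholders.)
    # nodes.py (sketch): run each JSON document in a PartitionedDataSet through an LLM.
    from typing import Any, Callable, Dict

    import openai


    def summarise_documents(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, str]:
        # A PartitionedDataSet loads as a mapping of partition id -> callable
        # returning the parsed JSON document.
        summaries = {}
        for name, load in partitions.items():
            document = load()
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",  # placeholder model
                messages=[{"role": "user", "content": f"Summarise this document: {document}"}],
            )
            summaries[name] = response["choices"][0]["message"]["content"]
        return summaries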
  • meharji arumilli

    08/01/2023, 8:46 AM
    Hi, is anyone here running a kedro project in Apache Airflow? I have a question regarding logging. The DAG runs in Airflow; however, the logs we see in the console when a kedro project is run locally are not visible in the Airflow UI. The UI shows only
    *** Found local files:
    ***   * /opt/airflow/logs/dag_id=test-fi/run_id=scheduled__2023-07-02T08:24:20.451204+00:00/task_id=preprocess/attempt=1.log
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: test-fi.preprocess scheduled__2023-07-02T08:24:20.451204+00:00 [queued]>
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: test-fi.preprocess scheduled__2023-07-02T08:24:20.451204+00:00 [queued]>
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1308} INFO - Starting attempt 1 of 2
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1327} INFO - Executing <Task(KedroOperator): preprocess> on 2023-07-02 08:24:20.451204+00:00
    [2023-08-01, 08:24:21 UTC] {standard_task_runner.py:57} INFO - Started process 114 to run task
    [2023-08-01, 08:24:21 UTC] {standard_task_runner.py:84} INFO - Running: ['***', 'tasks', 'run', 'test-fi', 'preprocess', 'scheduled__2023-07-02T08:24:20.451204+00:00', '--job-id', '486', '--raw', '--subdir', 'DAGS_FOLDER/test_fi_dag.py', '--cfg-path', '/tmp/tmpzsz4yrlp']
    [2023-08-01, 08:24:21 UTC] {standard_task_runner.py:85} INFO - Job 486: Subtask preprocess
    [2023-08-01, 08:24:21 UTC] {task_command.py:410} INFO - Running <TaskInstance: test-fi.preprocess scheduled__2023-07-02T08:24:20.451204+00:00 [running]> on host 829fb522c236
    [2023-08-01, 08:24:21 UTC] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='***' AIRFLOW_CTX_DAG_ID='test-fi' AIRFLOW_CTX_TASK_ID='preprocess-rre' AIRFLOW_CTX_EXECUTION_DATE='2023-07-02T08:24:20.451204+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2023-07-02T08:24:20.451204+00:00'
    [2023-08-01, 08:24:21 UTC] {test_fi_dag.py:61} INFO - Executing task preprocess, using model version: 20230801
    [2023-08-01, 08:37:16 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
    Can anyone suggest a configuration that would show the complete process log in the Airflow UI? Thanks!!
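    (One possible direction, assuming a Kedro 0.18-style conf/logging.yml: route Kedro's records to a plain StreamHandler on stdout so the Airflow task runner can capture them. Handler and logger names mirror Kedro's default logging config; levels are illustrative.)
    version: 1
    disable_existing_loggers: False
    formatters:
      simple:
        format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    handlers:
      console:
        class: logging.StreamHandler
        level: INFO
        formatter: simple
        stream: ext://sys.stdout
    loggers:
      kedro:
        level: INFO
    root:
      handlers: [console]
      level: INFO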
  • Jordan Barlow

    08/01/2023, 9:33 AM
    Hi, a question regarding SQLQueryDataSet: can I point the catalog entry to a .sql file?
    shuttle_id_dataset:
      type: pandas.SQLQueryDataSet
      sql: data/path/to/query.sql
      credentials: db_credentials
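    (For reference, newer releases of pandas.SQLQueryDataSet also accept a filepath argument as an alternative to sql, which, if I recall correctly, is meant for exactly this case; names are reused from the entry above.)
    shuttle_id_dataset:
      type: pandas.SQLQueryDataSet
      filepath: data/path/to/query.sql
      credentials: db_credentials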
  • Elena Mironova

    08/01/2023, 1:24 PM
    Hi team, after yesterday's release of kedro-datasets==1.5.0, our CI started failing during system tests which do a kedro run for a pipeline with Spark (see the screenshot). As far as I can see, SparkDataSet is still defined with the same name as before. When we used kedro-datasets==1.4.2 the same tests were running smoothly. I also couldn't find anything specific in the release notes. Do we have to update our code (maybe some import statements, or how it is specified within the requirements)?
  • Erwin

    08/01/2023, 7:42 PM
    Hi! Anyone using pyspark + OmegaConfigLoader? I have an issue: I cannot even do a kedro run, since _resolve_credentials fails (I don't have any credentials in my project): AttributeError: 'str' object has no attribute 'items'
  • meharji arumilli

    08/02/2023, 9:12 AM
    Hi, I have my config loader as below. It mainly assigns the model_version variable to self.params; the model_version is generated using a timestamp.
    class MyTemplatedConfigLoader(TemplatedConfigLoader):
        def __init__(self, conf_source, env, runtime_params):
            os.environ["model_version"] = datetime.now().strftime('%Y%m%d-%H%M%S')
            self.params = os.environ
            super().__init__(conf_source=conf_source, env=env, runtime_params=runtime_params, globals_dict=self.params)

    CONFIG_LOADER_CLASS = MyTemplatedConfigLoader
    This generates a unique model_version when the project is run with kedro. The model_version is used in the file paths in the catalog to save the outputs from different nodes. However, when this kedro project is packaged and run in Airflow, each node generates a new model_version, which causes the subsequent nodes to fail, as they expect the output (file path with model_version) from the previous node as input. Can anyone working with kedro and Airflow offer a hack to keep the model_version unique across all nodes/tasks in Airflow?
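    (One possible hack, sketched under the assumption that each node runs as its own Airflow task: derive model_version from the Airflow run rather than from datetime.now(), e.g. from the AIRFLOW_CTX_EXECUTION_DATE variable Airflow exports to each task, visible in the log excerpt earlier on this page, so every task of the same DAG run sees the same value.)
    import os
    from datetime import datetime

    from kedro.config import TemplatedConfigLoader


    class MyTemplatedConfigLoader(TemplatedConfigLoader):
        def __init__(self, conf_source, env, runtime_params):
            # Same stamp for every task of a DAG run; fresh timestamp for local runs.
            run_stamp = os.environ.get("AIRFLOW_CTX_EXECUTION_DATE")
            if run_stamp is None:
                run_stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            os.environ["model_version"] = run_stamp.replace(":", "-")
            self.params = os.environ
            super().__init__(
                conf_source=conf_source,
                env=env,
                runtime_params=runtime_params,
                globals_dict=self.params,
            )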
  • Fazil B. Topal

    08/02/2023, 1:14 PM
    Hey everyone, quick question regarding using the data catalog with the Python API. Following this documentation, I have the following questions: • Should catalog.py be in the conf/ folder (same as where catalog.yaml is)? • Does that work the same with nodes when I do kedro run, or do I have to explicitly use this Python object and load the data on my own? • Is it possible to define some sections in the YAML file and other parts in Python? I know I can do something in the hooks, but I wanted to check if there is a way where this catalog variable would be accessible by the user. Thanks in advance! 🙂
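    (For context, a minimal sketch of the DataCatalog Python API side of this, with a made-up dataset name and file path; it lives in ordinary Python code rather than in conf/.)
    from kedro.io import DataCatalog
    from kedro_datasets.pandas import CSVDataSet

    catalog = DataCatalog({"reviews": CSVDataSet(filepath="data/01_raw/reviews.csv")})

    df = catalog.load("reviews")   # load by dataset name
    catalog.save("reviews", df)    # save by dataset name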
  • Trevor

    08/02/2023, 5:15 PM
    Is there a way to dump the parameters to a file or access the parameters of the current run conveniently? If I run my Kedro pipeline and override parameter xyz to be 5 instead of 3 for that run only, is it possible to dump the parameters.yml with the overwritten parameter xyz?
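    (A hedged sketch of one way to do this with a hook: after_context_created receives the context, whose params property already includes any run-time overrides, so it can be dumped to a file. The output path is arbitrary.)
    import yaml
    from kedro.framework.hooks import hook_impl


    class DumpParamsHook:
        @hook_impl
        def after_context_created(self, context):
            # context.params already includes any --params run-time overrides.
            with open("resolved_parameters.yml", "w") as f:
                yaml.safe_dump(dict(context.params), f)
    (Registered via HOOKS = (DumpParamsHook(),) in settings.py.)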
  • Trevor

    08/02/2023, 5:35 PM
    Sorry, thought I was putting those previous messages in a single thread. Fixed. New question, new thread: is there a way to set a parameter in a node? If my first node calls a function date() that simply gets the current date, can I assign that date to a parameter?
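    (A common workaround, sketched with made-up names: parameters are fixed for the duration of a run, but a node can return the value as a dataset that downstream nodes consume.)
    from datetime import date

    from kedro.pipeline import node, pipeline


    def current_date() -> str:
        return date.today().isoformat()


    def train_model(model_input, run_date: str):
        ...  # use run_date like any other input


    pipe = pipeline(
        [
            node(current_date, inputs=None, outputs="run_date"),
            node(train_model, inputs=["model_input", "run_date"], outputs="model"),
        ]
    )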
  • Fazil B. Topal

    08/03/2023, 4:08 PM
    Hey all, I have a slight problem with multiple catalog files. I'm using OmegaConfigLoader and I have the following structure: the catalog in bigquery gets recognized, but the one in s3 does not. Is this the expected behavior? Thanks in advance.
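    (For reference, which files OmegaConfigLoader treats as catalog configuration is driven by its config_patterns; one thing to check is widening them in settings.py. A sketch with illustrative patterns follows.)
    from kedro.config import OmegaConfigLoader

    CONFIG_LOADER_CLASS = OmegaConfigLoader
    CONFIG_LOADER_ARGS = {
        "config_patterns": {
            "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
        }
    }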
  • Ankit Kansal

    08/03/2023, 4:42 PM
    Hey Team,
  • Ankit Kansal

    08/03/2023, 4:43 PM
    What is the latest way of implementing kedro / Databricks in an Azure environment? Is there a standardised approach to setting things up from a development & production standpoint?
  • Daniel Kirel

    08/03/2023, 8:25 PM
    Hey team, two questions on `kedro-mlflow`: 1. Is there a way to log the git commit tag/SHA through kedro-mlflow? 2. Is there a good way to save input datasets without needing to create separate MLflow artifact datasets and a node to read and save datasets? Appreciate any help/guidance on this 🙏
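    (On question 1, a hedged sketch of one option: tag the active MLflow run from a project hook, assuming kedro-mlflow has already started the run by the time before_pipeline_run fires; the tag name is arbitrary.)
    import subprocess

    import mlflow
    from kedro.framework.hooks import hook_impl


    class GitShaHook:
        @hook_impl
        def before_pipeline_run(self, run_params):
            sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
            if mlflow.active_run():
                mlflow.set_tag("git_sha", sha)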
  • Sid Shetty

    08/04/2023, 3:29 PM
    Hey team, I am saving a partitioned dataset with pyspark parquet data types; catalog entry:
    cpa_llm.blocking_output@partitions:
      type: PartitionedDataSet
      path: data/cpa_llm/blocking_output
      overwrite: True
      filename_suffix: ".parquet"
      dataset:
        type: spark.SparkDataSet
        file_format: parquet
        save_args:
          mode: overwrite
    When I read the same data back as a spark dataset I get the error "AnalysisException: Unable to infer schema for Parquet. It must be specified manually.", but when I read from one particular partition it infers the schema. Was wondering if there is a step I am missing here, or if you would recommend some other format over parquet to store the files. Appreciate any help here 😄
  • Emilio Gagliardi

    08/04/2023, 5:12 PM
    Hi everyone, I have a basic question about how to save a dataset in a kedro notebook. I understand how to load a dataset, but I'm not clear on how to save one. I have a custom dataset that connects to a Mongo DB, so I need to pass in credentials; I'm not sure if I need to pass in the catalog properties/credentials manually, or how to pass the data. Thanks kindly,
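    (For reference, in a kedro ipython / kedro jupyter session the injected catalog object already has credentials resolved from conf/, so saving is a single call; the dataset name below is a placeholder for the catalog entry.)
    # Inside a `kedro jupyter notebook` / `kedro ipython` session:
    processed = ...  # whatever object the custom MongoDB dataset expects

    catalog.save("my_mongo_dataset", processed)   # credentials come from conf/
    # data = catalog.load("my_mongo_dataset")     # loading works the same way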
  • Emilio Gagliardi

    08/06/2023, 2:52 AM
    I was working with GPT-4 to brainstorm how to connect to an Azure blob container that stores 1-to-many JSON files. The suggestion it provided was not what I expected, and I wonder if someone can comment? I want to create a partitioned dataset where the underlying files are JSON. GPT-4 suggested the following, which references a kedro.contrib.io.azure.JSONBlobDataSet that I can't find in the documentation under 18.12, but can under 15.6. Did something change in the way kedro organizes contrib.io? GPT-4 also said that the built-in kedro JSON dataset doesn't work on Azure. Any guidance is appreciated. Thanks kindly,
    my_partitioned_dataset:
      type: kedro.io.PartitionedDataSet
      path: <your_blob_folder_path>
      credentials: azure_blob_storage
      dataset:
        type: kedro.contrib.io.azure.JSONBlobDataSet <- is this valid?
        container_name: <your_container_name>
        credentials: azure_blob_storage
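    (For what it's worth, kedro.contrib was removed long ago; a present-day equivalent would use json.JSONDataSet from kedro-datasets inside a PartitionedDataSet over an fsspec abfs:// path. The container, path and credential key names below are placeholders.)
    my_partitioned_dataset:
      type: PartitionedDataSet
      path: abfs://my-container/path/to/json_files   # placeholder container/path
      filename_suffix: ".json"
      credentials: azure_blob_storage   # e.g. account_name/account_key in credentials.yml
      dataset:
        type: json.JSONDataSet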
  • Jackson

    08/07/2023, 3:17 AM
    Hi, I am curious about where we should put our folders in a kedro project. For example, I have a dataset folder which stores my defined PyTorch Dataset class and another module called model, and I will need to import the dataset and model classes into my kedro nodes. What are the best practices for storing these modules?
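    (One common layout, sketched with the package name as a placeholder: keep importable code inside the package under src/ so nodes can import it.)
    src/<package_name>/
        datasets/            # e.g. the PyTorch Dataset classes
        models/              # model definitions
        pipelines/
            training/
                nodes.py     # from <package_name>.models import MyModel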
  • Jackson

    08/07/2023, 3:34 AM
    Also, why does it work when I run kedro run, but when I run it with python src/../nodes.py it shows "no module named xxx"?
  • Fazil B. Topal

    08/07/2023, 9:45 AM
    Hey all, I know it's been asked many times, but I have yet to find a solution for kedro node running order. I am building steps which create some tables in BigQuery (since the query is complex, it is done in a multi-stage way: 01-query1.sql, 02-query2.sql, etc.). Each of these is a node in kedro, but since my custom dataset implementation (creating tables in BigQuery) only implements a load method, I define the outputs as None in the node. The question is: how can I create an ordered Pipeline in kedro? I'm willing to hack the Pipeline class a bit, but there is too much going on there, so I'm seeking some help here. Thanks in advance! 🙂
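    (A common pattern, sketched with made-up names: Kedro orders nodes purely by data dependencies, so a small marker output from one node, fed as an input to the next, enforces the sequence without hacking the Pipeline class.)
    from kedro.pipeline import node, pipeline


    def run_query_1() -> bool:
        ...  # create the first BigQuery table
        return True  # marker value (a MemoryDataSet by default)


    def run_query_2(_query_1_done: bool) -> bool:
        ...  # runs only after query 1, because it consumes query 1's output
        return True


    pipe = pipeline(
        [
            node(run_query_1, inputs=None, outputs="query_1_done"),
            node(run_query_2, inputs="query_1_done", outputs="query_2_done"),
        ]
    )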
  • Debanjan Banerjee

    08/07/2023, 10:40 AM
    kedro versioned always points to a new version when writing the data, right? Can we ensure there is a prod version created that the rest of the datasets always read from in production, and which we can change in params or somewhere when we want to? For example, we can do this manually in parameters.yml
    run_date: &run_date 20230101
    
    version: *run_date  # this can also be prod/dev/uat etc.
    catalog.yml
    weather:
      type: spark.SparkDataSet
      filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
      file_format: csv
    but this won't utilise the versioned: True feature. Is there any way we can achieve the above functionality with versioned? That would be much cleaner imo.
  • Thomas Gölles

    08/08/2023, 9:40 AM
    Hi. Is there a way to get the current run name? Like in Kedro viz experiment tracking I get names like "2023-08-08T08.17.05.592Z". I am using mlflow and tensorboard as well at the moment and want to have consistent naming in every tracking tool.
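    (For reference, the name shown by Kedro-Viz experiment tracking is the session id, which hooks receive via run_params, so it can be forwarded to mlflow or tensorboard; only the retrieval is sketched here.)
    from kedro.framework.hooks import hook_impl


    class RunNameHook:
        @hook_impl
        def before_pipeline_run(self, run_params):
            run_name = run_params["session_id"]  # e.g. "2023-08-08T08.17.05.592Z"
            # forward run_name to mlflow / tensorboard naming here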
  • Rosana EL-JURDI

    08/08/2023, 9:50 AM
    Hello Everyone
  • Rosana EL-JURDI

    08/08/2023, 9:50 AM
    I hope you are all doing well.
  • Rosana EL-JURDI

    08/08/2023, 9:51 AM
    I am running into an issue with my kedro installation. The installation itself seems to work fine, with the proper version, and kedro info works well,
  • Rosana EL-JURDI

    08/08/2023, 9:52 AM
    but when I try to run kedro ipython I receive the following error message:
    Traceback (most recent call last):
      File "/home/usename/.local/lib/python3.10/site-packages/IPython/core/shellapp.py", line 282, in init_extensions
        self.shell.extension_manager.load_extension(ext)
      File "/home/usename/.local/lib/python3.10/site-packages/IPython/core/extensions.py", line 76, in load_extension
        return self._load_extension(module_str)
      File "/home/usename/.local/lib/python3.10/site-packages/IPython/core/extensions.py", line 91, in _load_extension
        mod = import_module(module_str)
      File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
      File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
      File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
      File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
      File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
    ModuleNotFoundError: No module named 'kedro'
  • Rosana EL-JURDI

    08/08/2023, 9:53 AM
    Is anyone familiar with this error?
  • Rosana EL-JURDI

    08/08/2023, 9:53 AM
    Thank you
  • Nok Lam Chan

    08/08/2023, 10:40 AM
    Hello everyone, I have a question regarding the usage of environments in combination with the OmegaConfigLoader. I have a file called catalog_globals.yml in my base/ config folder, and also in my prod/ config folder. When I execute kedro run --env=prod, the settings from the file in base/ are still used.
    cc @Gerrit Schoettler
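    (For context, a sketch of the settings.py knobs involved, with values mirroring the defaults: with OmegaConfigLoader, which environments get merged is controlled here, and entries in the run environment such as prod/ are expected to override identically-named keys from base/.)
    from kedro.config import OmegaConfigLoader

    CONFIG_LOADER_CLASS = OmegaConfigLoader
    CONFIG_LOADER_ARGS = {
        "base_env": "base",          # environment always loaded first
        "default_run_env": "local",  # replaced by --env, e.g. --env=prod
    }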