# questions
  • user
    08/05/2022, 6:28 PM
    How do I use kedro.versioning in the latest version of kedro? I previously used kedro 0.17.6 in my project. Now I have upgraded to 0.18.2, but the latest version of kedro has no kedro.versioning module, so I am getting an error that the module is not found. Can anyone please suggest something?
  • user
    08/06/2022, 7:38 AM
    ModuleNotFoundError: No module named 'kedro.versioning'. I have upgraded my kedro to the latest version, but my project uses kedro.versioning, and the latest kedro has no module with this name. Can anyone please suggest anything?
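    For context, kedro 0.18 removed the kedro.versioning module (including its Journal class) outright. A minimal sketch of the surviving API, assuming what you need is dataset versioning rather than the journal; the filepath and timestamp below are illustrative:

        # kedro 0.18.x: dataset versioning lives in kedro.io
        from kedro.io import Version
        from kedro.extras.datasets.pandas import CSVDataSet

        dataset = CSVDataSet(
            filepath="data/01_raw/iris.csv",
            # load a specific timestamped version; save a fresh one
            version=Version(load="2022-08-05T18.00.00.000Z", save=None),
        )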
  • Tom Taylor-Vigrass
    08/11/2022, 1:09 PM
    Has anyone seen this error on kedro-viz before? Just upgraded to 5.0 (wasn't seeing the error on 4.7.2):

        AttributeError: 'TranscodedDataNode' object has no attribute 'original_version'
  • user
    08/16/2022, 8:08 AM
    Is it possible to automate creating README content using Sphinx in kedro? Kedro uses Sphinx formatting already, and when creating a pipeline it automatically creates a README.md file. Sphinx can automate building documentation, so I want to know whether, and how, it is possible to make Sphinx write the README files automatically.
  • user
    08/20/2022, 10:58 PM
    Dynamic parameters on datasets in Kedro. I would like to call an API to enrich an existing dataset. My approach would be to wrap the API client in an APIDataSet; the other dataset is just a CSVDataSet. Then I'd use both as inputs to a node. For example, I've got keywords in the CSVDataSet and would like to enrich them with Google News using an API. I need the keywords from...
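    A minimal sketch of the two-inputs-one-node shape described above; the dataset names, the keyword column, and enrich_keywords are all hypothetical, and note that kedro's APIDataSet.load() hands the node a requests.Response:

        import pandas as pd
        import requests
        from kedro.pipeline import node

        def enrich_keywords(keywords: pd.DataFrame, response: requests.Response) -> pd.DataFrame:
            # Join whatever the API returned onto the keyword table.
            hits = response.json()
            return keywords.assign(news=keywords["keyword"].map(hits))

        enrich_node = node(
            func=enrich_keywords,
            inputs=["keywords", "google_news_api"],
            outputs="enriched_keywords",
        )

    One caveat with this shape: an APIDataSet's URL is fixed in the catalog, so if each request must depend on the keywords themselves, the API call may need to move inside the node instead.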
  • user
    08/23/2022, 2:28 PM
    How to use generators with kedro? Thanks to David Beazley's slides on generators, I'm quite taken with using generators for data processing in order to keep memory consumption minimal. Now I'm working on my first kedro project, and my question is how I can use generators in kedro. When I have a node that yields a generator and then run it with kedro run --node=example_node, I get the following error: DataSetError: Failed while saving data to data set...
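    The save fails because the output dataset receives a raw generator object. One hedged workaround is to keep the generator inside the node and return a concrete object at the end; transform and the chunk count below are illustrative:

        import numpy as np
        import pandas as pd

        def transform(chunk: pd.DataFrame) -> pd.DataFrame:  # stand-in for real work
            return chunk.assign(processed=True)

        def process(raw: pd.DataFrame) -> pd.DataFrame:
            # The generator expression processes one chunk at a time, but the
            # node still returns a single DataFrame that any dataset can save.
            chunks = (transform(c) for c in np.array_split(raw, 100))
            return pd.concat(chunks)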
  • Mavis Tian
    08/25/2022, 4:29 PM
    Hi everyone!
  • Mavis Tian
    08/25/2022, 4:30 PM
    Does anyone know how to avoid getting a software installation form while running a kedro command?
  • Andrew Stewart
    08/31/2022, 6:34 PM
    Quick question: where should one be managing the version number of a kedro project? project_version in pyproject.toml seems to correspond to the version of Kedro, not the actual project at hand. Is the package version in src/setup.py the right place, or is that being controlled by some higher-level process?
  • Faisal Malik
    09/07/2022, 9:13 AM
    Hi, quick question: I currently use kedro 0.17.4, but we want to convert our kedro pipeline into a Prefect flow using this approach. I notice this approach is only available starting from kedro 0.18.0, while on kedro 0.17 it's not present. I tried to install Prefect in the same environment as my kedro 0.17.4, but it looks like it causes a dependency issue. Should I upgrade my kedro? And if so, how hard will it be to upgrade from kedro 0.17.4 to kedro 0.18.2?
  • Toni
    09/09/2022, 1:21 PM
    Hello kedro team! I have a kedro issue, let's see if you can help me... We have a kedro pipeline that trains a model and generates a dataframe as output. The problem is that we now need to loop that pipeline to generate multiple dataframes (which, at the end, we want to concatenate into a single table). Given a parameter set_targets = ['a', 'b', 'c'], is it possible to loop the same pipeline over each value of that list without "copying" the pipeline? That set_of_targets may vary in length and names, so we want to avoid manual work... Also, we need the outputs to have "dynamic" names in the catalog in order to save all of them (score_{target}... score_a, score_b, score_c)... I think this could be done with jinja, but I have no idea where to start... Thank you very much!
  • user
    09/14/2022, 5:48 AM
    How to do SQL-like querying of parquet files in kedro? I'm new to kedro, and I'm just wondering if I could query the parquet files with SQL instead of using DataFrame APIs. Please help me out if there is a way. Thanks in advance!
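    Not a built-in kedro feature, but one option is to run the SQL inside a node with DuckDB, which queries parquet files directly; the path and query are illustrative:

        import duckdb

        # DuckDB runs SQL straight over parquet files on disk.
        df = duckdb.query(
            "SELECT gear, COUNT(*) AS n FROM 'data/01_raw/cars.parquet' GROUP BY gear"
        ).to_df()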
  • Yetunde
    09/14/2022, 8:43 AM
    I copied @Ashish Verma's question from the #C03RKNSN3U0 channel to here: Hey team, I am still struggling with Kedro + Databricks integration. After resolving all the package conflicts, I am encountering a never-seen-before error. While creating the Kedro session, I am facing a Py4JSecurityException. Error and screenshot below for reference: py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted. Can you please help me with this? The solutions I found on Google say to create a new cluster, which is not an option for us. I also tried removing the context.py that initializes the custom Spark context, but this is not working either. Let me know if there is something else I need to do, thanks. 🙂 Thanks, Ashish Verma
  • Toni
    09/14/2022, 9:38 AM
    Hi! Quick question: if an entry in the data catalog uses versioned: True, when I use catalog.load(...) in a notebook, does it always load the latest version of that entry? How can I indicate the version to load? Thank you!
  • Riley Brady
    09/14/2022, 8:16 PM
    (0.18.1) It seems that kedro run --tag some_tag1,some_tag2 will run any nodes with some_tag1 OR some_tag2. Is there any functionality to use AND instead of OR? My workaround right now is to create a custom tag some_tag1-some_tag2 and then call that directly. It would be nice if I could list out a few tags and only run nodes that have all of them, but I understand why OR is the default.
  • Kasper Janehag
    09/15/2022, 9:42 AM
    (0.17.7) Hi! I have some problems running Kedro on a self-hosted Hadoop cluster. As part of a pipeline, I have a transcoded registered dataset table@pandas and a table@spark, with the following settings:

        ...table@pandas:
          type: "${datasets.parquet}"
          filepath: "${base_path_spark}/…/master_table"

        ..._table@spark:
          <<: *pq
          filepath: "${base_path_spark}/…/master_table"
    The base_path_spark is an HDFS location. These are then used in a pipeline in the following manner:

        spark_to_pandas = pipeline(
            pipe=Pipeline(
                [
                    node(
                        func=spark_utils.to_pandas,
                        …
                        outputs="..._table@spark",
                    )
                ]
            )
        )

        data_cleaning = pipeline(
            pipe=Pipeline(
                [
                    node(
                        func=enforce_schema_using_dict,
                        inputs={
                            "data": "..._table@pandas",
                        },
                        …
                    )
                ]
            )
        )
    The data_cleaning node is supposed to pick up the output from the spark_to_pandas node via the transcoded dataset. However, a DataSetError is raised with the following message:

        Exception has occurred: DataSetError
        [Errno 2] No such file or directory: 'hadoop': 'hadoop'
        Failed to instantiate Dataset 'telco_churn.master_table@pandas' of type 'kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet'.

    If we remove the transcoding in the DataCatalog and register the datasets individually, the error disappears. Does anyone know how to proceed from this kind of error? Could it be related to the client-specific Hadoop environment? How can we proceed with troubleshooting?
  • Toni
    09/16/2022, 9:24 AM
    Hi team! How can I save an np.array with the catalog? Is there a way to save an np.array as CSV "easily"? I cannot use pandas.CSVDataSet because it is not a dataframe. I think this could be done with transcoding datasets, but I do not know if there is a dataset for np.arrays in kedro.
  • user
    09/17/2022, 10:58 AM
    DataSetError in Docker Kedro deployment. I am trying to deploy the example Kedro starter project (pandas-iris). I successfully ran it locally (kedro run) and then, having installed kedro-docker, initialised Docker, built the image, and pushed it to my registry. Unfortunately, both kedro docker run and docker run myDockerID/iris_image generate the same error: DataSetError: Failed while loading data from data set CSVDataSet(filepath=/home/kedro/data/01_raw/iris.csv, load_args={}, protocol=file, save_args={'index': False}). [Errno 2] No such file or...
  • Olivia Lihn
    09/20/2022, 11:11 PM
    Hi everyone! We are trying to deploy a kedro pipeline on spark, using --master yarn and --deploy-mode cluster, not locally or in client mode. Has anyone tried this? If so, what extra files/code did you add to make spark-submit work?
  • Jonas Kemper
    09/27/2022, 10:30 AM
    Hi friends, has anyone ever deployed kedro projects behind some kind of lightweight HTTP API? I'm thinking one POST request to start a run and then a GET request to poll the run status, etc.? Is there any reference material that you could point me to?
  • user
    10/03/2022, 1:48 PM
    How to run a kedro pipeline interactively, like a function? I would like to run kedro pipelines in a jupyter notebook with different inputs, something like this: data = catalog.load('my_dataset'); params = catalog.load('params:my_params'); pipelines['my_pipeline'](data=data, params=params). Is there a way to do this? Also, if I have to feed inputs to nodes other than the starting one (for example the second node), how would this be done?
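    Pipelines are not callable, but a hedged near-equivalent is to hand a runner an ad-hoc in-memory catalog; any free input of any node (not just the first) can be injected under the name the node declares:

        from kedro.io import DataCatalog, MemoryDataSet
        from kedro.runner import SequentialRunner

        data = catalog.load("my_dataset")
        params = catalog.load("params:my_params")

        ad_hoc = DataCatalog({
            "my_dataset": MemoryDataSet(data),
            "params:my_params": MemoryDataSet(params),
        })
        outputs = SequentialRunner().run(pipelines["my_pipeline"], ad_hoc)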
  • user
    10/07/2022, 7:58 AM
    How to change the kedro configuration environment in a jupyter notebook? I want to run a kedro pipeline in the base env using a jupyter notebook. I do it the following way: %reload_kedro --env=base, then session.run(pipeline_name='dpfm1'). Doing this, the %reload_kedro command raises the following error: RuntimeError: Could not find the project configuration file 'pyproject.toml' in --env=base. If you have created your project with Kedro version >> kedro, version 0.18.2. What's the matter here?
  • user
    10/07/2022, 2:18 PM
    Is there a way to include an Azure Databricks Lakehouse query as a DataCatalog dataset in kedro? We want to use kedro to control our ML pipelines in Azure Databricks. We are querying (and joining) relatively large tables in Databricks' Lakehouse, so we would like to include those joins in the DataCatalog without bringing the full precedent tables into memory. Something like:

        scooters_query:
          type: pandas.SQLQueryDataSet
          credentials: scooters_credentials
          sql: select * from cars where gear = 4
          load_args:
            index_col: [name]

    Is there a way to perform this in Databricks?
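    Pending a catalog-level answer, one hedged alternative is to push the join/filter down to Spark inside a node so only the result is ever collected; the query is illustrative:

        from pyspark.sql import DataFrame, SparkSession

        def load_scooters() -> DataFrame:
            # Executes on the Databricks cluster; nothing is pulled into pandas here.
            spark = SparkSession.builder.getOrCreate()
            return spark.sql("select * from cars where gear = 4")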
  • user
    10/07/2022, 4:58 PM
    import fsspec throws an error: AttributeError: 'EntryPoints' object has no attribute 'get'.
  • user
    10/08/2022, 6:38 PM
    Kedro on Databricks: cannot import SparkDataSet in Databricks using from kedro.extras.datasets.spark import SparkDataSet (screenshot: https://i.stack.imgur.com/wkDIJ.jpg).
  • user
    10/13/2022, 8:18 AM
    Kedro template configuration does not load the globals.yml configuration into catalog.yml for Jupyter Lab. It works from the CLI but not from Jupyter Lab. I have just recently upgraded from 0.17.1 to 0.18.3 and have made changes to settings.py to use the TemplatedConfigLoader. I have copied the content of https://github.com/kedro-org/kedro/blob/main/kedro/ipython/__init__.py to .ipython/profile_default/startup/00-kedro-init.py and I am still seeing the Jupyter notebook trying to read...
  • user
    10/13/2022, 2:38 PM
    TypeError: __init__() got an unexpected keyword argument 'config_loader'. I am getting this error while running Kedro with session.run() on Databricks.
  • user
    10/14/2022, 8:58 AM
    kedro PartitionedDataSet lazy writing to spare memory? I am working with PartitionedDataSet in kedro. One of the datasets is of type pillow.ImageDataSet:

        raw_images:
          type: PartitionedDataSet
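    PartitionedDataSet supports lazy saving: if the node returns a dict whose values are callables, each partition is materialised only at write time. A sketch, with the resizing standing in for real processing:

        from typing import Callable, Dict

        from PIL import Image

        def thumbnails(
            raw_images: Dict[str, Callable[[], Image.Image]]
        ) -> Dict[str, Callable[[], Image.Image]]:
            # Loading and processing happen one partition at a time at save
            # time, so only a single image is in memory at once.
            return {
                key: (lambda load=load: load().resize((128, 128)))
                for key, load in raw_images.items()
            }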
  • Maren Eckhoff
    10/14/2022, 6:17 PM
    Hi team, is it possible to pass a constant into a kedro node? Something like this:

        node(
            my_fun,
            inputs={"input_data": "my_data", "input_params": "params:my_params", "constant": 4},
            outputs="output_data",
        )
  • user
    10/17/2022, 3:18 PM
    Include Quarto rendering in a kedro pipeline and pass it inputs/outputs. I am using kedro to do some comparative analysis. In a Quarto report I have some chunks containing evaluations of output_var1 and output_var2, for example plot_function(output_var1) and plot_function(output_var2). At the end of the pipeline, I would like to compute my report with Quarto using the outcome of my pipeline, without saving it to the data catalog. from quarto import render def create_pipeline(**kwargs) -> Pipeline: return pipeline([node(func=function1,...