# questions

    Nikola Shahpazov

    03/15/2023, 1:17 PM
    Hi guys, quick question: is there a way to interpolate a SQLQueryDataSet query in catalog.yml, passing some argument/parameter? Example:
    Copy code
    yaml
    person:
      type: pandas.SQLQueryDataSet
      sql: "SELECT * FROM public.people WHERE id = ${id};"
      credentials: db_credentials
    Thanks in advance!
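    One way to do this in Kedro 0.18 is catalog templating: a minimal sketch using TemplatedConfigLoader, where ${id} is filled in from conf/base/globals.yml (e.g. id: 42) when the configuration is loaded, i.e. once per run rather than per node:
    Copy code
    python
    # settings.py
    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    # Pick up template values from any globals.yml in the conf folders
    CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}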

    Nikola Shahpazov

    03/15/2023, 2:43 PM
    Another question from me 😅 What would be the proper way to describe a dataset in the catalog with a MongoDB source? I can see there are pandas SQL datasets, but is there something similar for MongoDB?
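    There is no built-in MongoDB dataset in kedro.extras.datasets, so a common workaround is to query MongoDB inside a node (or wrap the same logic in a custom dataset). A rough sketch using pymongo, with the database and collection names as placeholders:
    Copy code
    python
    import pandas as pd
    from pymongo import MongoClient

    def load_people(mongo_uri: str) -> pd.DataFrame:
        # Pull a whole collection into a DataFrame
        client = MongoClient(mongo_uri)
        records = client["my_db"]["people"].find()
        return pd.DataFrame(list(records))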

    Olivier Ho

    03/16/2023, 3:00 PM
    Small question: is there any dataset that is glob-compatible? For example, if I have a folder of images.
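    PartitionedDataSet covers this: it treats every file under a path (optionally filtered by suffix) as one partition. A sketch for a folder of PNGs, using the pillow.ImageDataSet that ships with Kedro:
    Copy code
    yaml
    images:
      type: PartitionedDataSet
      path: data/01_raw/images
      dataset: pillow.ImageDataSet
      filename_suffix: ".png"
    Loading this returns a dictionary mapping each partition id to a callable that loads that file.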

    Ricardo Araújo

    03/16/2023, 11:31 PM
    Hey! Should OmegaConf work for data catalog entries? It works fine for parameters, but interpolation keys in the data catalog fail to resolve (InterpolationKeyError: Interpolation key 'temp' not found).

    Andrew Stewart

    03/16/2023, 11:49 PM
    Is kedro-docker intended more to facilitate a local interactive environment, as opposed to packaging a self-contained image artifact for distribution to something like an ECS compute cluster?

    Olivier Ho

    03/17/2023, 9:16 AM
    hello, is there a dataset type to read a file as raw bytes?
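    There is no dedicated bytes dataset among the built-ins, but a minimal custom dataset is only a few lines. A sketch using fsspec, so any supported filesystem works:
    Copy code
    python
    import fsspec
    from kedro.io import AbstractDataSet

    class BytesDataSet(AbstractDataSet):
        """Read/write a file as raw bytes (illustrative sketch)."""

        def __init__(self, filepath: str):
            self._filepath = filepath

        def _load(self) -> bytes:
            with fsspec.open(self._filepath, mode="rb") as f:
                return f.read()

        def _save(self, data: bytes) -> None:
            with fsspec.open(self._filepath, mode="wb") as f:
                f.write(data)

        def _describe(self) -> dict:
            return {"filepath": self._filepath}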

    Olivier Ho

    03/17/2023, 11:06 AM
    If you use yield in a node to produce an iterable, how can you store it in a partitioned dataset, given that the partitioned dataset requires a dictionary?
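    One way around this is to have the node collect the chunks into the dictionary that PartitionedDataSet expects, keyed by partition id. A sketch (the chunk size and key format are arbitrary):
    Copy code
    python
    def split_into_partitions(df):
        # Each key becomes a partition filename; each value is saved as one partition
        chunks = (df[i : i + 1000] for i in range(0, len(df), 1000))
        return {f"part_{n:04d}": chunk for n, chunk in enumerate(chunks)}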

    Abhishek Gupta

    03/17/2023, 2:52 PM
    Hi Everyone! Getting this error while executing a pipeline.

    Ricardo Araújo

    03/17/2023, 6:15 PM
    This is sort of a follow-up on a previous question that was solved. Say I have a project that ingests a large dataset, but I only process part of it -- it is a big time series and I want to process a specific month. I want to pass a CLI argument to do that, which I currently can by overriding a parameter. However, I'd also like the output to be written to different places depending on the argument (that is, I want e.g. the filename to be prefixed with the CLI argument).
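    One workaround sketch, similar to the TemplatedConfigLoader idea further up but sourcing the value from an environment variable instead of a --params flag (the MONTH variable name is arbitrary):
    Copy code
    python
    # settings.py
    import os
    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    CONFIG_LOADER_ARGS = {
        # e.g. run as: MONTH=2023-01 kedro run
        "globals_dict": {"month": os.environ.get("MONTH", "all")},
    }
    The catalog entry can then use a templated path such as filepath: data/07_model_output/${month}_output.csv.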

    Andrew Stewart

    03/18/2023, 2:46 AM
    Where in the Kedro project structure do most folks manage their SQL files?
    • data seems like it could be appropriate
    • maybe src?
    • maybe a separate sql dir?

    Andrej Zachar

    03/20/2023, 1:36 AM
    Hello, I would like to know how I can pass exactly the same output from multiple nodes living in different namespaces / tags, so it can be reused later
    Copy code
    node(
        first_namespace_fn,
        inputs=["some_input"],
        outputs="shared_name_so_it_can_reused_somewhere_else",
        namespace="first"
    ),
    
    node(
        second_namespace_fn,
        inputs=None,
        outputs="shared_name_so_it_can_reused_somewhere_else",
        namespace="second"
    ),
    
    node(
        third_common_fn,
        inputs='shared_name_so_it_can_reused_somewhere_else',
        outputs="final_output",
    ),
    Thank you!

    Andrej Zachar

    03/20/2023, 1:40 AM
    PS: It is exactly the opposite problem as described here - https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html?highlight=namespace#using-a-modular-pipeline-multiple-times.

    Chew Lee

    03/20/2023, 8:11 AM
    Hi all, does anyone have experience using Kedro with GCS? My user is able to use gsutil to read and write files to the bucket. kedro run also successfully reads/writes files from/to GCS. But when trying to load a dataset from the catalog in a Jupyter notebook, I get a 401 access denied. I have a credentials.yml file set up with
    Copy code
    my_gcp_credentials:
      client_id: <REDACTED>
      client_secret: <REDACTED>
      refresh_token: <REDACTED>
      type: <REDACTED>
    which was obtained using
    Copy code
    gcloud auth login
    gcloud auth application-default login
    and copying the contents of the resulting JSON file

    Armen Paronikyan

    03/20/2023, 11:25 AM
    Hi guys, I would like to know whether I can access the data in the credentials.yml file from Kedro hooks?
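    One way is to load them through the context's config loader from a hook. A sketch using the after_context_created hook, assuming the default ConfigLoader (register the class in HOOKS in settings.py); the patterns are the ones Kedro itself uses to pick up credentials files:
    Copy code
    python
    from kedro.framework.hooks import hook_impl

    class CredentialsAwareHook:
        @hook_impl
        def after_context_created(self, context):
            # Parsed contents of credentials.yml (and any matching files)
            self._credentials = context.config_loader.get("credentials*", "credentials*/**")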

    AK

    03/20/2023, 1:54 PM
    Hi all, can someone share the installation guide for Kedro?
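    For reference, installation is a one-liner, with the full guide in the documentation at https://docs.kedro.org:
    Copy code
    pip install kedro
    kedro info  # verify the installation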

    Javier del Villar

    03/20/2023, 4:01 PM
    Hi everybody, is there something equivalent to pandas.SQLQueryDataSet in Spark? Can I get the same functionality in Spark? I cannot run queries with spark.SparkJDBCDataSet; am I missing something? Thanks in advance!
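    SparkJDBCDataSet wraps spark.read.jdbc, which reads a whole table rather than an arbitrary query. One workaround sketch is to run the query in a node via the JDBC source's query option (available since Spark 2.4); the URL and credentials below are placeholders:
    Copy code
    python
    from pyspark.sql import DataFrame, SparkSession

    def query_people() -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        return (
            spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://host:5432/mydb")
            .option("query", "SELECT * FROM public.people WHERE id = 42")
            .option("user", "user")
            .option("password", "password")
            .load()
        )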

    Cyril Verluise

    03/20/2023, 9:53 PM
    Hello there, I hope this finds you well. Thanks for the awesome work! Sorry to come with an issue.
    Issue: I'm trying to set up kedro pipeline tests as part of my CI/CD using GH Actions. Everything goes well until I receive a git-related error:
    Copy code
    TypeError: HEAD is a detached symbolic reference as it points to 
    'dc15ea87ce9d917bafb09d5d7bddb2aaf44f5989'
    Full error log and GH Action config in thread.
    What I tried: I tried to check out with fetch depth 0, but this did not fix the issue (I had a similar issue when building docs from a GH Action, which was fixed using the above trick).
    Environment: kedro version 0.18.6, OS: ubuntu-latest.
    Any ideas?

    sujdurai

    03/21/2023, 2:39 AM
    Team, wondering if there is a way to control node execution order in kedro, or an option to wait before executing another node. Context: I have a node that is used in two pipelines. They use the same input tables, but I expect the node in the second pipeline to run only after my first pipeline, because the input files for the node in the second pipeline are updated as part of the first pipeline run. Because I have registered both pipelines to run as default in the registry, the node from the second pipeline runs sooner than I expect - I don't want that.
    Copy code
    # Pipeline A
    Input X, Y --> node1 + node2 + node3 --> Output X (i.e. Input X after update)
    
    # Pipeline B 
    Input X (after update from Pipeline A), Y --> node1 + node4 + node5 --> Output Z
    
    Order of execution (node_Pipelinename)
    node1_A
    node1_B
    node3_A
    node2_A
    node4_B
    node5_B
    
    Expected order of execution
    node1_A
    node3_A
    node2_A
    node1_B
    node4_B
    node5_B
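    Kedro derives execution order purely from the dataset dependency graph, not from registration order, so the usual fix is to make the dependency explicit: have the second pipeline's node declare the first pipeline's output as one of its inputs. A sketch where update_x and process are placeholder functions:
    Copy code
    python
    # Pipeline A: write the updated table under a new catalog name
    node(update_x, inputs=["input_x", "input_y"], outputs="input_x_updated")

    # Pipeline B: consume the updated name, so this node can only run after A
    node(process, inputs=["input_x_updated", "input_y"], outputs="output_z")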

    Dotun O

    03/21/2023, 1:28 PM
    Hi all, quick newbie Kedro question here. If I wanted to call catalog.load directly within the pipeline to observe the dataframes, how do I get the current catalog in the pipeline run? I see that kedro.io provides DataCatalog, but I'm not sure how to get the specific catalog context.
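    Nodes normally never touch the catalog directly (the runner injects loaded data as function arguments), but for interactive inspection you can build a session and take the catalog from its context. A sketch, assuming it is run from the project root and that "my_dataset" is a hypothetical catalog entry:
    Copy code
    python
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(".")
    with KedroSession.create(project_path=".") as session:
        catalog = session.load_context().catalog
        df = catalog.load("my_dataset")
    Inside kedro jupyter notebook or kedro ipython, a ready-made catalog variable is already available.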

    R P

    03/21/2023, 7:09 PM
    Hi everyone, I'm using Kedro with two main configuration envs: "conf/base" and "conf/test", and I'm running kedro run --env=test when I need to run a quick pipeline check. However, I have some code in my "settings.py" file that I must not run when I'm using the "conf/test" env, but I'm not managing to get this environment information in the "settings.py" code so I can write a simple if/else condition. What is the best way to do this? Thanks for this awesome open-source tool!
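    settings.py is imported before the --env flag is parsed, so the flag is not visible there. One workaround sketch: set the environment through the KEDRO_ENV environment variable (which Kedro also honours) instead of the CLI flag, and read it in settings.py:
    Copy code
    python
    # settings.py
    import os

    if os.environ.get("KEDRO_ENV") != "test":
        # code that must not run for the "conf/test" environment
        ...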

    Javier del Villar

    03/21/2023, 7:35 PM
    Hi everybody, I'm not new to kedro, but I'm new to using kedro, pyspark and databricks at the same time. The logs appear only after all the jobs have been completed; is there a way to see the logs as they occur? I think this is more of a Databricks question. Thanks in advance!

    Anjali Datta

    03/22/2023, 1:00 AM
    I'm inexperienced, so this is a basic question. I'm trying to add datasets programmatically. I've made a catalog.py file that contains:
    Copy code
    python
    from kedro.io import DataCatalog, PartitionedDataSet
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.config import ConfigLoader

    conf_paths = ['conf/base', 'conf/local']
    conf_loader = ConfigLoader(conf_paths)
    atlas_regions = conf_loader.get('atlas_regions*')  # a .yml file consisting of regions with names

    catalog_dictionary = {}
    for region in atlas_regions['regions']:
        name = region['name']
        # catalog_dictionary[f'{name}_data_right'] = PartitionedDataSet(
        #     path='../ClinicalDTI/R_VIM/',
        #     dataset='programmatic_datasets.io.nifti.NIfTIDataSet',
        #     filename_suffix=f'seedmasks/{name}_R_T1.nii.gz',
        # )
        catalog_dictionary[f'{name}_data_right'] = CSVDataSet(filepath="../data/01_raw/iris.csv")
        # catalog_dictionary[f'{name}_data_right_output'] = CSVDataSet(filepath="../data/01_raw/iris.csv")

    io = DataCatalog(catalog_dictionary)
    print(io.list())
    (Kedro version 0.17.7) Running catalog.py prints the expected list of datasets. But what do I need to do to be able to use these datasets in a pipeline?
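    One way to make such datasets visible to a pipeline run is to add them to the catalog from a hook instead of a standalone script. A sketch using the after_catalog_created hook (register an instance in HOOKS in settings.py; in practice the names would come from atlas_regions as above):
    Copy code
    python
    from kedro.framework.hooks import hook_impl
    from kedro.extras.datasets.pandas import CSVDataSet

    class AtlasRegionsHook:
        @hook_impl
        def after_catalog_created(self, catalog):
            for name in ["region_a", "region_b"]:  # placeholder region names
                catalog.add(f"{name}_data_right", CSVDataSet(filepath="data/01_raw/iris.csv"))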

    Balachandran Ponnusamy

    03/22/2023, 2:51 PM
    Hi Kedro team... Getting the attached error when we submit a job to a Dataproc cluster to run a Data Engineering pipeline; we have a data file in ".txt.gz" format. If we run it with .master(local[*]) it works fine, but it fails when we submit with spark.master: yarn and spark.submit.deployMode: client. Any idea where it is going wrong?

    Stephane Durfort

    03/22/2023, 4:22 PM
    Hello, while playing with the OmegaConfigLoader to eventually replace the TemplatedConfigLoader in my pipeline, I noticed that:
    • variable interpolation does not seem to be applied to nested parameters (as in the model_options example mentioned in the documentation)
    • using kedro run --params only updates the parameters themselves but does not propagate to references of these parameters in the configuration
    Am I doing something wrong?
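    For reference, interpolation within a single parameters file, including into nested keys, looks like this with OmegaConf syntax (a minimal example):
    Copy code
    yaml
    # conf/base/parameters.yml
    base_path: data/01_raw
    model_options:
      input_file: "${base_path}/input.csv"
    The second observation is expected behaviour as far as I can tell: interpolations are resolved when the configuration is loaded, and kedro run --params overrides are merged in afterwards, so references to an overridden key do not re-resolve.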

    Priyanka Patil

    03/22/2023, 5:43 PM
    Hello team, I have the following catalog entry in my YAML file. The columns load argument below is not working. Am I missing something here? Thank you in advance!
    Copy code
    raw_dataset:
      type: spark.SparkDataSet
      filepath: "/data/01_raw/data.csv"
      file_format: csv
      load_args:
        header: True
        inferSchema: True
        index: False
        columns: ["a", "b", "c"]
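    header and inferSchema are genuine Spark CSV reader options, but columns and index are pandas-style arguments that the Spark reader does not understand. One option is to select the columns in the first node instead, e.g.:
    Copy code
    python
    from pyspark.sql import DataFrame

    def select_columns(raw: DataFrame) -> DataFrame:
        # spark.read.csv has no "columns" option; project after loading
        return raw.select("a", "b", "c")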

    Valentin Martinez Gama

    03/22/2023, 9:42 PM
    Hello team. I have created a custom class that inherits from sklearn's BaseEstimator and TransformerMixin, i.e.
    class CustomClass(BaseEstimator, TransformerMixin)
    I have created an object of that class and saved it to my Kedro catalog as a pickle object. Now the problem is when I try using
    catalog.load()
    on a pipeline to load that object I get the following error:
    DataSetError: Failed while loading data from data set PickleDataSet(backend=pickle,
    filepath=……./data/06_models/custom_model_V1.pkl,
    load_args={}, protocol=file, save_args={}).
    Can't get attribute 'CustomClass' on <module '__main__' from '……venv/bin/kedro'>
    I was able to make it work in a notebook by first importing the class from the .py file where it was defined:
    from custom_classes import CustomClass
    But when running a kedro pipeline that uses this object as an input loaded from the catalog, adding the import at the top of the pipeline file did not fix it. Any suggestions on how to fix this?
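    The error arises because pickle stores the class's qualified module path; if CustomClass was defined in __main__ (a notebook or script) when the object was saved, kedro run cannot resolve it at load time. The usual fix is to define the class inside the project's package and import it from there both when saving and when loading (the module path below is illustrative):
    Copy code
    python
    # src/my_project/custom_classes.py
    from sklearn.base import BaseEstimator, TransformerMixin

    class CustomClass(BaseEstimator, TransformerMixin):
        ...
    Then re-create and re-save the pickle from code that imports CustomClass from that module, so the stored reference points at my_project.custom_classes.CustomClass rather than __main__.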

    Kenny B

    03/23/2023, 10:27 PM
    hello, I'm trying to see if the following functionality exists for versioned datasets: 1. list all available versions of a catalog item 2. limit the number of versions kept for a dataset, i.e. with a limit of 10, clean up the oldest version when I save an 11th
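    As of 0.18 there is no public API for either, but versioned datasets use a predictable layout (<filepath>/<timestamp-version>/<filename>), so both can be scripted. A rough sketch for a local filesystem:
    Copy code
    python
    import shutil
    from pathlib import Path
    from typing import List

    def list_versions(filepath: str) -> List[str]:
        # Each version lives in its own timestamped subdirectory
        return sorted(p.name for p in Path(filepath).iterdir() if p.is_dir())

    def prune_versions(filepath: str, keep: int = 10) -> None:
        # Delete everything except the newest `keep` versions
        for version in list_versions(filepath)[:-keep]:
            shutil.rmtree(Path(filepath) / version)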

    Maxime Steinmetz

    03/23/2023, 11:20 PM
    How can a predictive modelling project be designed for easy switching between steps, such as missing values imputation methods, class balancing methods, model types and so on? Should nodes be used to dispatch data to different implementations based on parameters, or should nodes containing the concrete logic be used? Alternatively, would a pipeline factory that produces a pipeline made of concrete nodes be more suitable?
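    A pipeline factory is one common answer: keep each concrete implementation as its own node function and let an argument (or a parameter) pick which node gets wired in. A sketch with hypothetical node functions:
    Copy code
    python
    from kedro.pipeline import Pipeline, node

    from .nodes import impute_knn, impute_mean, train_model  # hypothetical node functions

    def create_pipeline(imputation: str = "mean") -> Pipeline:
        imputers = {
            "mean": node(impute_mean, "raw_data", "imputed_data"),
            "knn": node(impute_knn, "raw_data", "imputed_data"),
        }
        return Pipeline([
            imputers[imputation],  # swap implementations without touching downstream nodes
            node(train_model, "imputed_data", "model"),
        ])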