# questions

    Jo Stichbury

    12/01/2022, 11:14 AM
    Hi team! I have a pair of questions about the plotly chart visualisation of the spaceflights tutorial, described in the docs for 18.3 here. Context: I'm working on a revision of those docs to make it a bit more straightforward by adding a new pipeline for
    reporting
    . More context: I took the basic spaceflights starter as it will be after 18.4, which means I stripped out the namespaces/modular pipelines, so the example code is more straightforward. You can see the starter on the repo here (when we put out release 18.4 it'll be merged and available immediately for access via
    kedro new --starter=spaceflights
    ) but right now you'll need to use
    kedro new --starter=spaceflights --checkout=68a27db42335366b07f9362f677d69684ec4e942
    OK, so here's my example code with a reporting pipeline but when I
    kedro run
    and then
    kedro viz
    I see a different graphic to the one in the docs: TL;DR -- what are the questions? Q1: Is this viz correct? If it is not supposed to look like this, please roast my pipeline. Q2: I tried to save my visualisation with
    kedro viz --save-file my_shareable_pipeline.json
    but when I then reload it with
    kedro viz --load-file my_shareable_pipeline.json
    I don't see the chart. So question 2 is: what's wrong with my viz?
    Thanks in advance for any advice. LMK if you need more information.
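For context on Q1, the reporting pipeline in the 0.18.3 spaceflights docs declares the chart as a plotly dataset in the catalog, so kedro-viz can render a preview. A sketch of such an entry (dataset name, columns and titles here are illustrative, following the docs' pattern):

```yaml
shuttle_passenger_capacity_plot:
  type: plotly.PlotlyDataSet
  filepath: data/08_reporting/shuttle_passenger_capacity_plot.json
  versioned: true
  plotly_args:
    type: bar
    fig:
      x: shuttle_type
      y: passenger_capacity
    layout:
      xaxis_title: Shuttle type
      yaxis_title: Average passenger capacity
```

If the chart node or dataset is missing from the reporting pipeline, the graph will differ from the docs' screenshot.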

    shawn

    12/01/2022, 3:34 PM
    Hey Everyone,

    shawn

    12/01/2022, 3:38 PM
Context: I am trying to run a job in Databricks based off the .whl file packaged by Kedro. Kedro version: 0.18.3. Error:
    ValueError: Given configuration path either does not exist or is not a valid directory: /databricks/driver/conf/base
Q1: Is the issue due to the .whl file itself or the way I am configuring the job? Q2: Could this be a permissions issue in the environment I am using?

    shawn

    12/01/2022, 3:38 PM
    Thank you so much for your help in advance!!

    Jan

    12/02/2022, 8:17 AM
Hi! I'm trying to run this run_only_missing example. However, the docs say I need to supply a _hook_manager_. I set up hooks (even though I don't need them at the moment), but I don't know what exactly to supply to the
    run_only_missing
    as _hook_manager._ Can anyone assist? 🙂
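When there are no hooks to register, the usual pattern (a sketch, assuming Kedro 0.18.x; `_create_hook_manager` is a private helper and may move between versions) is to pass an empty hook manager:

```python
def run_missing(pipeline, catalog):
    """Run only the nodes whose outputs are not already persisted (sketch)."""
    # Local imports so the sketch is readable without a full Kedro project.
    from kedro.framework.hooks.manager import _create_hook_manager
    from kedro.runner import SequentialRunner

    # An "empty" hook manager satisfies the required argument; if you do have
    # hooks, register them first with hook_manager.register(MyHooks()).
    hook_manager = _create_hook_manager()
    return SequentialRunner().run_only_missing(pipeline, catalog, hook_manager)
```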

    Anu Arora

    12/02/2022, 3:58 PM
Hi Team, one quick question: are you aware of a better way to orchestrate a Kedro pipeline on Databricks using ADF? So far I have been orchestrating a notebook with ADF, where the Databricks notebook contains the code to unzip the wheel contents of the Kedro project -> install the libraries through requirements.txt -> then run the Kedro pipeline.

    Eugene P

    12/02/2022, 4:41 PM
    Hi kedroids! Sorry for noob question. I’m working with sql database as source of data and pandas.SQLQueryDataSet works well
    sample_sql_query_data:
      type: pandas.SQLQueryDataSet
      credentials: postgres_re_db
      sql: SELECT * FROM rr_norm.sample_gov_torgi
Unfortunately, the number of queries grows fast and catalog.yaml starts bloating with long query strings. It also doesn't seem like a good idea to keep SQL query strings inside catalog.yaml itself, for reproducibility. What would be the most Kedroic/Pythonic approach to extracting queries from catalog.yaml into a separate folder/module? AFAIK (or as I understood from googling), YAML doesn't natively have include/import features?
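One option worth checking against your Kedro version: recent 0.18.x releases added a `filepath` argument to `pandas.SQLQueryDataSet` (mutually exclusive with `sql`), which lets the query live in its own `.sql` file. A sketch (the `queries/` folder is an arbitrary choice):

```yaml
sample_sql_query_data:
  type: pandas.SQLQueryDataSet
  credentials: postgres_re_db
  filepath: queries/sample_gov_torgi.sql   # query text lives in its own file
```

The catalog then stays small, and the SQL files can be reviewed and versioned like any other source code.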

    shawn

    12/05/2022, 3:07 PM
Hey everyone! I am getting the following error with Kedro, and I am not sure why conf/base is expected under site-packages. Context: I am trying to run a job in Databricks based off the .whl file packaged by Kedro. Kedro version: 0.18.3. Error:
    ValueError: Given configuration path either does not exist or is not a valid directory: /databricks/driver/conf/base

    marrrcin

    12/06/2022, 8:45 AM
    How is the release cycle of Kedro coordinated? Right now, kedro
    0.18.4
    is already in PyPI, but starters are not tagged yet, making our CI/CD pipelines fail:
    kedro.framework.cli.utils.KedroCliError: Kedro project template not found at git+https://github.com/kedro-org/kedro-starters.git. Specified tag 0.18.4. The following tags are available: 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.17.6, 0.17.7, 0.18.0, 0.18.1, 0.18.2, 0.18.3.
    Can we expect tagging today? 🤔 Maybe there should be some fallback mechanism for kedro starters to use versioning similar to Python (e.g.
    ~=0.18.0
    but for tags).

    Yifan

    12/06/2022, 10:40 AM
    Hey everyone! I would like to know if there is a tool or Kedro module capable of profiling each node in a pipeline? Basically I want to analyse the execution time of each node (from loading the first input dataset to the end of saving the last chunk of output to the storage) in my pipeline, and I am aware of the possibility of using log files. However, for a pipeline with hundreds of nodes, manually analysing the log files is almost impossible. Do you have any suggestions? Thank you!
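One Kedro-native approach is a small hooks class that records per-node wall-clock time. A sketch (the `try/except` fallback only exists so the timing logic can be read and run outside a Kedro installation):

```python
import time
from collections import defaultdict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch runs without kedro installed
    def hook_impl(func):
        return func


class NodeTimerHooks:
    """Record wall-clock duration of every node via node-run hooks."""

    def __init__(self):
        self._starts = {}
        self.durations = defaultdict(float)

    @hook_impl
    def before_node_run(self, node):
        # Hook implementations may declare a subset of the spec's arguments.
        self._starts[node.name] = time.perf_counter()

    @hook_impl
    def after_node_run(self, node):
        self.durations[node.name] += time.perf_counter() - self._starts.pop(node.name)
```

Register it in `src/<package>/settings.py` with `HOOKS = (NodeTimerHooks(),)`, then log or dump `durations` at the end of the run. Note that node-run hooks bracket only the node function itself; to include dataset I/O, the analogous `before_dataset_loaded`/`after_dataset_loaded` and `before_dataset_saved`/`after_dataset_saved` hook pairs can be timed the same way.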

    Pallavi Kumari

    12/06/2022, 11:41 AM
Hi everyone, I have to call Kedro nodes or pipelines in my Django project, e.g. for simulation we need to call a Kedro pipeline and use its output as input to Django APIs. Please suggest a solution for this.
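The usual way to drive Kedro from external code such as a Django view is `KedroSession`. A sketch, assuming Kedro 0.18.x; the project path is hypothetical:

```python
from pathlib import Path


def run_kedro_pipeline(pipeline_name="__default__"):
    """Run a Kedro pipeline from non-Kedro code such as a Django view (sketch)."""
    # Local imports keep Django importable even where kedro is not installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path("/srv/my-kedro-project")  # hypothetical location
    bootstrap_project(project_path)

    with KedroSession.create(project_path=project_path) as session:
        # session.run returns a dict of the pipeline's free (unsaved) outputs,
        # which the Django API can serialise into its response.
        return session.run(pipeline_name=pipeline_name)
```

A Django view would then call `run_kedro_pipeline("simulation")` and build its JSON response from the returned dict.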

    user

    12/06/2022, 12:18 PM
How to call Kedro pipelines or nodes in the Django framework? I have to call Kedro nodes or pipelines from a Django API and use their output as input in the API. I'm not finding any solution; please suggest one.

    Fabian

    12/07/2022, 9:14 AM
Hi Team, is it possible to save the same output in two different catalog entries? I want to save my data to a Parquet file for further usage and also as a CSV. Is that possible without modifying my nodes?
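One way to do this without touching the existing nodes is a tiny fan-out node that returns the same object twice, bound to two catalog entries (one Parquet, one CSV). A sketch; all dataset and node names are illustrative:

```python
def fan_out(data):
    """Return the same object twice so each copy gets its own catalog entry."""
    return data, data


def create_fanout_pipeline():
    # Local import so the sketch can be read without kedro installed.
    from kedro.pipeline import node, pipeline

    return pipeline([
        node(
            fan_out,
            inputs="model_output",             # the node's existing output
            outputs=["model_output_parquet",   # -> a pandas.ParquetDataSet entry
                     "model_output_csv"],      # -> a pandas.CSVDataSet entry
            name="fan_out_model_output",
        )
    ])
```

Kedro then persists each output through its own catalog entry, so the two file formats come from one run with no change to the original node functions.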

    Jan

    12/07/2022, 9:30 AM
    Hello! Is it possible to register a data catalog entry as a versioned file (versioned=True) via kedro.io.DataCatalog ? I only find information about how to do this in the yml file.
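In the Python API the equivalent of `versioned: true` is passing a `Version` to the dataset constructor. A sketch, assuming Kedro 0.18.x import paths (newer versions use `kedro_datasets`); the filepath is illustrative:

```python
def make_versioned_catalog():
    # Local imports so the sketch can be read without kedro installed.
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalog, Version

    dataset = CSVDataSet(
        filepath="data/01_raw/example.csv",
        # Version(None, None): load the latest version, save under a newly
        # generated timestamp, mirroring versioned: true in YAML.
        version=Version(load=None, save=None),
    )
    return DataCatalog({"example": dataset})
```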

    Fabian

    12/07/2022, 12:06 PM
Hello everyone, a kedro viz question: I have a modular pipeline with two outputs: some intermediate data that is further processed within the pipeline, and the final data. I instantiate the pipeline with a namespace and added both datasets to the catalog. In kedro viz, only the final data is shown as output of my modular pipeline; the intermediate data is shown separately, without any connections. However, when I expand the modular pipeline, the intermediate data is shown as output of the specific node. I want the intermediate data to be shown as a result of my unexpanded modular pipeline, especially when using it as input for other pipelines, but that is not the case. Is this the intended behaviour, and what could I do to change it?
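One thing worth trying (a sketch; names are illustrative): list the intermediate dataset in the modular pipeline's `outputs` mapping. That keeps its name un-namespaced, marking it as part of the pipeline boundary, which other pipelines can consume and which kedro-viz can draw on the collapsed view:

```python
def create_namespaced_pipeline(base_pipeline):
    # Local import so the sketch can be read without kedro installed.
    from kedro.pipeline import pipeline

    # Datasets listed in `outputs` keep their original names instead of being
    # prefixed with the namespace, so they sit at the modular-pipeline boundary.
    return pipeline(
        base_pipeline,
        namespace="my_namespace",
        outputs={"intermediate_data", "final_data"},
    )
```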

    Olivia Lihn

    12/07/2022, 12:29 PM
Hi everyone! I am running a kedro pipeline in a Databricks Repo, following the kedro docs. The pipeline runs end-to-end but I encountered an error:
    OperationalError: (sqlite3.OperationalError) unable to open database file
    (Background on this error at: <https://sqlalche.me/e/14/e3q8>)
My guess is that the run session info cannot be saved because of writing permissions on the Databricks Repo. We have deleted
logging.yml
and, to be honest, this is more of an annoying error (as the pipeline runs). Any ideas on how we can avoid this?
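If `settings.py` enables kedro-viz's `SQLiteStore` (as the experiment-tracking docs suggest), the store tries to open a SQLite file in a location that a Databricks Repo keeps read-only. Two workaround sketches, shown as a settings fragment (paths and package name are illustrative):

```python
# src/<your_package>/settings.py (sketch)
from kedro.framework.session.store import BaseSessionStore

# Option A: keep session data in memory; nothing is written to disk.
SESSION_STORE_CLASS = BaseSessionStore

# Option B (keep the SQLite store): point it at a writable path instead, e.g.
# from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore
# SESSION_STORE_CLASS = SQLiteStore
# SESSION_STORE_ARGS = {"path": "/tmp/kedro_sessions"}
```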

    Maurits

    12/07/2022, 5:29 PM
    Hi all, I'm facing a
    java.lang.OutOfMemoryError: Java heap space
error when storing a JSON file of 2.5M rows on AWS S3 via a Kedro pipeline. The ECS compute already has 104 GB of memory. Any recommendations on how to configure this? Any repartitioning experience? Spark config? Or a way to work around it?
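Two levers to try, sketched below; the values are illustrative and cluster-specific. First, raise the driver-side limits in the project's Spark config (a PySpark Kedro project typically loads `conf/base/spark.yml` in its context hooks):

```yaml
# conf/base/spark.yml (sketch; tune to your cluster)
spark.driver.memory: 16g
spark.driver.maxResultSize: 8g          # large single-driver writes can hit this
spark.sql.shuffle.partitions: 400       # more, smaller partitions per write
```

Second, repartition inside the node before saving (e.g. `df.repartition(200)`), so no single JVM has to materialise the whole JSON output at once.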

    Olga Chumakova

    12/07/2022, 9:33 PM
Hi all! Do Kedro nodes allow optional inputs and outputs? I have an evaluation function with in-time and out-of-time testing. However, I want to run both tests only for certain models and apply in-time testing for the rest. Do I need to build separate functions for these two scenarios, or can I set the out-of-time inputs/outputs as optional?

    Tooba Mukhtar

    12/07/2022, 9:53 PM
Hi team, I am trying to set up the layer functionality in kedro viz. I have defined all the layers in the yml files, but two of the layers are not being displayed in Kedro-Viz. I can see the nodes being displayed, but they are assigned to incorrect layers (for example, model output instead of reporting). What could be the reason for this?

    Jaakko

    12/08/2022, 8:53 AM
    The documentation still instructs to use
    kedro build-reqs
    but when running
    kedro build-reqs
    I get the following deprecation warning:
    DeprecationWarning: Command 'kedro build-reqs' is deprecated and will not be available from Kedro 0.19.0.
    How should project dependencies be managed after
    build-reqs
    is not available anymore? Can the documentation be updated accordingly?

    Jo Stichbury

    12/08/2022, 10:39 AM
    Please could I get a bit of help with an issue that's been reported over on GitHub, but looks more like it's a question for here? I've directed the user to come over here for some further help but thought I'd highlight it now to get the ball rolling: https://github.com/kedro-org/kedro/issues/2104

    Shreyas Nc

    12/08/2022, 1:09 PM
Hi, I want to use pillow.ImageDataSet but I am getting an error. Pasting the changes here. The documentation doesn't describe the YAML API either; am I missing something?
imageset:
  type: PartitionedDataSet
  dataset:
    type: pillow.ImageDataSet
  path: <path_to_data>
  filename_suffix: ".jpg"
    
    getting below error:
    
    kedro.io.core.DataSetError:
    Object 'ImageDataSet' cannot be loaded from 'kedro.extras.datasets.pillow'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.pillow.ImageDataSet:
    <https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html>.
    Failed to instantiate DataSet 'imageset' of type 'kedro.io.partitioned_dataset.PartitionedDataSet'.
    kedro.framework.cli.utils.KedroCliError:
    Object 'ImageDataSet' cannot be loaded from 'kedro.extras.datasets.pillow'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.pillow.ImageDataSet:
    <https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html>.
    Failed to instantiate DataSet 'imageset' of type 'kedro.io.partitioned_dataset.PartitionedDataSet'.
    Run with --verbose to see the full exception
    Error:
    Object 'ImageDataSet' cannot be loaded from 'kedro.extras.datasets.pillow'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.pillow.ImageDataSet:
    <https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html>.
    Failed to instantiate DataSet 'imageset' of type 'kedro.io.partitioned_dataset.PartitionedDataSet'.
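This error usually just means the dataset's optional dependency is missing from the environment; the dependencies page linked in the traceback suggests installing the extras group (e.g. `pip install "kedro[pillow.ImageDataSet]"` on 0.18.x), which for this dataset boils down to:

```shell
# pillow.ImageDataSet needs the Pillow library in the same environment as kedro
pip install Pillow
```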

    Manilson António Lussati

    12/09/2022, 2:19 AM
Hello everyone, I have been studying ways to use dbx with the Kedro template. Have any of you gone through this?

    Sebastian Pehle

    12/09/2022, 9:37 AM
Hello everyone. Let's say I created a reporting pipeline in a notebook (pull data, compute columns, export Excel/CSV). I then packaged everything into a Kedro project and everything is fine. Then the customer wants some alterations to the reports, new columns or something like that. How would I proceed to "develop" inside Kedro? Transferring dirty notebook code into clean nodes is one thing, but how would I proceed to develop once everything is a node in a pipeline? In Jupyter notebooks or regular .py files I can run the code until some point and then alter my dataframes as I wish. How would I approach this in the Kedro framework? I hope this makes sense ;)
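The usual workflow for this is Kedro's notebook integration, which gives interactive access to the project's catalog so intermediate datasets can be pulled and reshaped exactly as in a plain notebook. A sketch (dataset and function names are illustrative; available in 0.18.3):

```python
# In a notebook launched from the project root (e.g. `kedro jupyter notebook`):
%load_ext kedro.ipython        # injects catalog, context, pipelines, session

df = catalog.load("report_input")        # pull any intermediate dataset
df_new = add_new_columns(df)             # iterate on the node function itself
catalog.save("report_output", df_new)    # optionally persist for inspection
```

After editing the node's source, `%reload_kedro` refreshes the injected objects, so the loop "edit node function, re-run on real catalog data" replaces the old "run the notebook up to a cell" workflow.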

    Max S

    12/09/2022, 10:26 AM
    Hey Team, QQ regarding versioning. I think I am clear regarding versioned datasets. Searching the docs I could not find anything regarding versioned parameters. Given that I trigger a pipeline run, I create versioned datasets (if I choose to do so), but can I also create a versioned save of the used parameters (from one or more
    yaml
    files?) Or am I thinking about this the wrong way and there is a good reason that this is not possible? Thanks!

    Balazs Konig

    12/09/2022, 12:19 PM
    Hi Team! 🦜 Quick question hopefully: How can I specify
    schema
    for a
    SparkDataSet
    in the catalog entry itself? What’s the best practice to represent the
    StructType()
    object in yaml? EDIT: or is the best practice to always save the schema to a separate params file and add just the
    file_path
    to the catalog entry?

    Adam_D

    12/09/2022, 3:49 PM
    Hey Team! I am newer to AWS and I have followed the Kedro AWS Batch Deployment Guide but I am getting a dataset error like this stackoverflow question. I do not want to put datasets into the docker container. I want to be able to read from and write to S3. The AWS tutorial puts the S3 URL as an environment variable. Do I need to do this for each dataset? I'm really looking for how to connect my docker container to S3 to run in a kedro pipeline. Thanks in advance for your help and let me know if I need to provide more detail.
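For reference, the container itself does not need dataset files if the catalog points straight at S3; fsspec handles `s3://` paths. A sketch (bucket, entry names and credentials key are illustrative):

```yaml
# conf/base/catalog.yml (sketch)
companies:
  type: pandas.CSVDataSet
  filepath: s3://my-bucket/data/01_raw/companies.csv
  credentials: dev_s3

preprocessed_companies:
  type: pandas.ParquetDataSet
  filepath: s3://my-bucket/data/02_intermediate/preprocessed_companies.parquet
  credentials: dev_s3
```

`dev_s3` would hold the key/secret in `conf/local/credentials.yml`; alternatively, omit `credentials` entirely and let the AWS Batch task role supply access, so no per-dataset environment variables are needed.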

    John Melendowski

    12/10/2022, 12:41 AM
    Any plans to make a conda feedstock for kedro-viz?

    Mathilde Lavacquery

    12/12/2022, 2:54 PM
Hi Kedro Team, what would be the best practice for passing parameters both in the pipeline registry and in the catalog? E.g., I have a pipeline that runs for different countries and different brands; some pipelines/datasets are at country level, some are at country x brand level. All my pipelines use namespacing to deal with the "scope" (i.e. the countries/brands). My pipeline registry looks like this:
def register_pipelines():
    
        countries = ["a", "b"]
        brands = ["1", "2", "3"]
        return {
            "preprocess_macro": preprocess_macro_pipeline(countries=countries),
            "preprocess_brand": preprocess_brand_pipeline(countries=countries, brands=brands),
            "train_model": train_model_pipeline(countries=countries, brands=brands),
        }
and my catalog looks like this:
    {% for country in ["a", "b"] %}
    {% for brand in ["1", "2", "3"] %}
    
    {{ country }}.pre_master_macro:
        ...
    
    {{ country }}.{{ brand }}.master:
        ...
    
    {{ country }}.{{ brand }}.model:
        ...
Would there be a way to pass countries/brands once and have both pick them up? The use case is that we are developing a generic pipeline that can be replicated in different regions / for different brands according to the client.
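One way to get a single source of truth is to keep the scopes in one YAML file that both sides read. A sketch (`globals.yml` and all names are hypothetical); the registry loads it directly:

```python
from pathlib import Path

import yaml


def load_scopes(path="conf/base/globals.yml"):
    """Read the shared country/brand lists from one file so the pipeline
    registry and the catalog template use a single source of truth."""
    scopes = yaml.safe_load(Path(path).read_text())
    return scopes["countries"], scopes["brands"]


def register_pipelines():
    countries, brands = load_scopes()
    # Build the namespaced pipelines exactly as before, e.g.:
    # return {
    #     "preprocess_macro": preprocess_macro_pipeline(countries=countries),
    #     "preprocess_brand": preprocess_brand_pipeline(countries=countries, brands=brands),
    # }
```

On the catalog side, one option (worth verifying against your config-loader version) is `TemplatedConfigLoader` with a `globals_pattern` such as `*globals.yml`, so the template values come from the same file instead of being hard-coded in the Jinja loops.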