# questions

    Tooba Mukhtar

    12/12/2022, 3:30 PM
    Team, I am facing the following issue with kedro-viz: I have namespaced a modular pipeline called Reporting, which is formed by calling another pipeline and has several outputs. The problem is that there are 2 hanging nodes which don’t connect as outputs of the Reporting pipeline, even though they should: they are defined as outputs in the pipeline code. These hanging nodes are not terminal nodes (meaning they feed into another function within the pipeline). Note: when I collapse the Reporting pipeline, you can see these nodes being connected inside the pipeline (2nd and 3rd screenshots). However, in the aggregated view of the node, these 2 nodes are hanging and NOT connected as outputs of the Reporting pipeline. Does anyone have an idea how to deal with this, or is this a bug in kedro-viz? It would be great if someone from the kedro-viz team could connect and help resolve this. Thanks!

    Yetunde

    12/12/2022, 3:51 PM
    set up a reminder “The Kedro team is on break from Thursday, the 22nd of December. We would love for the community to band together to support each other in this time because we won't be available for support questions. We'll see you in the new year from the 4th of January.” in this channel at 09:00 Monday 19th December, Greenwich Mean Time.

    Yetunde

    12/12/2022, 3:53 PM
    set up a reminder “The Kedro team is on break from Thursday, the 22nd of December - Wednesday, 4th of January. We hope that you have a great holiday break (if you're taking one) and we'll see you in the new year.” in this channel at 09:00 Thursday 22nd December, Greenwich Mean Time.

    Anirudh Dahiya

    12/13/2022, 11:40 AM
    Hi team, I'm new to Kedro and am getting up to speed with it. When I try to run kedro viz, I get an error stating 'no such command as viz'. Could you please help me with this?

    Olivia Lihn

    12/13/2022, 9:30 PM
    [DATABRICKS - AZURE] Hi team! I'm loading data into the catalog from an Azure Blob Storage account, but I only have access to SAS auth and a fixed token. This is how the credentials would look in a Databricks notebook:

```python
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
```

    How should I set these credentials in the catalog? Do I need to create a custom dataset?
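    [Editor's note] One way this is commonly handled (a sketch, not verified against a specific Kedro version): keep the token in credentials config and apply the three `spark.conf` entries once, e.g. from an `after_context_created` hook, before any dataset loads. The helper below only builds the conf dict; the account name and token are placeholders.

```python
# Sketch: build the three Spark conf entries for fixed-SAS auth on ABFS.
# In a Kedro hook you could then apply them with spark.conf.set(key, value);
# storage account and token values are placeholders, not real credentials.
def fixed_sas_spark_conf(storage_account: str, sas_token: str) -> dict:
    """Return the spark.conf entries enabling fixed-SAS auth for one account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "SAS",
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }
```

    With the session configured this way, the catalog entries may only need the `abfss://` file paths; whether a custom dataset is needed depends on the dataset type.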

    Thaiza

    12/13/2022, 11:23 PM
    Guys, have you seen this error before? ('covered' is just the name of the env, which I'd prefer not to reveal...)

    Rickard Ström

    12/14/2022, 6:06 PM
    I would like to run a series of 3 pipelines individually for 30+ different input datasets and save the result from each individually too. What is the recommended way to do this? Would a combination of using hooks to register the datasets in the catalog with a loop to register the pipelines in pipeline_registry.py, as discussed here, work? https://kedro-org.slack.com/archives/C03RKP2LW64/p1667919274697439?thread_ts=1667910377.475889&cid=C03RKP2LW64 Or should I play with the environment? Tagging team member @Adrien Couetoux 👋🙏
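    [Editor's note] The loop approach from that thread can be reduced to generating, per input dataset, the namespace plus input/output renames that Kedro's modular `pipeline(...)` wrapper accepts. A minimal illustration of just the name-mapping part; the dataset names here are placeholders, and the actual `kedro.pipeline.modular_pipeline.pipeline` call is only referenced in the comments.

```python
# Sketch: one namespaced instance of the same template pipeline per input
# dataset. Each mapping would be passed to Kedro's modular pipeline helper:
#   pipeline(template, namespace=ns, inputs=..., outputs=...)
# "raw_input" / "result" stand in for the template's free dataset names.
def build_instances(dataset_names):
    instances = []
    for name in dataset_names:
        instances.append({
            "namespace": name,
            # rewire the template's datasets onto per-dataset catalog entries
            "inputs": {"raw_input": f"{name}.raw_input"},
            "outputs": {"result": f"{name}.result"},
        })
    return instances
```

    The registered pipelines (e.g. one per dataset, plus a `sum` of all of them as `__default__`) can then be run selectively with `kedro run --pipeline=<name>`.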

    Jordan

    12/15/2022, 10:55 AM
    I’m writing a bit about pipeline inputs and outputs in all of the README.md files of my project. How should I document the type of a partitioned dataset? Those function outputs need to be of the form `dict[str, <type>]`, but when the dataset is loaded back it’s going to be `dict[str, Callable[[], <type>]]`

    Simon Myway

    12/15/2022, 11:42 AM
    Hi everyone, I was wondering if there is a repo with some template pipelines for standard data science tasks? E.g. a basic pipeline to make a train-test split or to generate a model evaluation report. Thanks!

    Balazs Konig

    12/15/2022, 1:27 PM
    Hi Team, for PartitionedDataSets, or any other way of reading in multiple files at the same time, how can we specify regex-style notations? E.g. I have files called `data_type_a_1.csv`, `data_type_a_2.csv`… and I want to read those in together. I tried to simply put `*` in the filename_suffix, e.g. `"data_type_a_*.csv"`, but that’s not working, so I’m definitely missing something simple here 😅
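    [Editor's note] If memory serves, `filename_suffix` is a plain suffix filter, not a glob: PartitionedDataSet lists everything under `path` and keeps entries ending in the suffix. So one workaround is to group each file "type" into its own folder. A hedged catalog sketch; the entry name and paths are illustrative:

```yaml
data_type_a:
  type: PartitionedDataSet
  path: data/01_raw/data_type_a   # folder holding data_type_a_1.csv, data_type_a_2.csv, ...
  dataset: pandas.CSVDataSet
  filename_suffix: ".csv"
```

    If the files can't be moved into separate folders, another option is to load the whole folder and filter the partition keys (the dict keys) inside the node.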

    Anastasiia

    12/15/2022, 2:01 PM
    Hi all 🙂 we want to show a short demo on Experiment Tracking. Does anybody have a better-quality gif? Or slides that briefly showcase the process? We want to show something like this (but the gif quality is bad): https://kedro.readthedocs.io/en/0.17.7/_images/experiment-tracking_demo_small.gif Thank you so much in advance

    Jaakko

    12/15/2022, 6:59 PM
    At the end of my data science pipeline I need to save multiple plots. The number of plots depends on the hyperparameters of the model, and there could be around 5-30 plots. How would I do this with Kedro? I took a look at https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.matplotlib.MatplotlibWriter.html. However, there is only one example using the YAML API (which I think I need to use to be able to see pictures when looking at my experiments through kedro-viz), and in that example only one plot is saved. There are also examples where a list of plots is saved, but those use the Python API, and with the Python API approach I can't figure out how to get the list of images displayed in the experiments section of kedro-viz.
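    [Editor's note] If I recall correctly, MatplotlibWriter also accepts a dict of figures and writes one file per key under `filepath`, which handles a variable number of plots; whether the experiment-tracking panel in kedro-viz renders all of them I can't confirm. A sketch of the node side only, with illustrative names:

```python
# Sketch: a node returning a dict of figures. With a catalog entry of type
# matplotlib.MatplotlibWriter, each dict key would become a file name
# under the entry's `filepath` directory.
import matplotlib
matplotlib.use("Agg")  # headless backend, e.g. for a remote or CI run
import matplotlib.pyplot as plt

def make_diagnostic_plots(n_plots: int) -> dict:
    """The number of plots can depend on hyperparameters at runtime."""
    figures = {}
    for i in range(n_plots):
        fig, ax = plt.subplots()
        ax.plot([0, 1], [0, i])
        ax.set_title(f"diagnostic {i}")
        figures[f"plot_{i}.png"] = fig
    return figures
```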

    Maurits

    12/15/2022, 8:06 PM
    Hi all, Any experience on running Kedro on AWS EMR (with MWAA)? Or what's your recommended computing service in AWS?

    Slackbot

    12/19/2022, 9:00 AM
    Reminder: The Kedro team is on break from Thursday, the 22nd of December. We would love for the community to band together to support each other in this time because we won't be available for support questions. We'll see you in the new year from the 4th of January.

    Szymon Czop

    12/19/2022, 11:22 AM
    Hi guys, is there a possibility to get the code from which this https://demo.kedro.org/ viz was created? I would really appreciate it. ❤️

    Maurits

    12/19/2022, 11:24 AM
    Hi all, who is using Great Expectations with Kedro? Or alternatives to recommend for data validation within Kedro pipelines? Thanks! `kedro-great` seems outdated and generates an error for me while running `kedro great init`:

```
ImportError: cannot import name 'BatchMarkers' from 'great_expectations.datasource.types'
```

    Luiz Henrique Aguiar

    12/19/2022, 1:15 PM
    Hi, everyone! I have a question related to how to use Kedro inside Databricks, since whenever I try to use "kedro run" in the repository, an error happens related to Spark: apparently, Databricks' native Spark is conflicting with the Spark used inside the project, in a hook (below you can see the hook definition and the error).

```python
sc = SparkContext(conf=spark_conf, appName="Kedro")

_spark_session = (
    SparkSession.builder
    .appName(context._package_name)
    .enableHiveSupport()
    .master("local[*,4]")
    .getOrCreate()
)

_spark_session.sparkContext.setLogLevel("WARN")
```

    Error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: In Databricks, developers should utilize the shared SparkContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sc. When running a job, you can access the shared context by calling SparkContext.getOrCreate(). The other SparkContext was created at: CallSite(SparkContext at DatabricksILoop.scala:353,org.apache.spark.SparkContext.(SparkContext.scala:114)
```

    I've tried to delete the hook and make the Spark settings directly in the cluster, without success. I have tried to configure it directly in the Spark session, also without success. I also followed the instructions in the documentation for using a repository within Databricks, but since the base project does not use this hook, it did not give the error. Has anyone had a similar error? I thought I could run it if I turned the project into a wheel, but I can't use "kedro package" since the project can't run inside Databricks. I would be grateful for any ideas, thank you!
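    [Editor's note] The error message itself points at the likely fix: on Databricks, attach to the shared context instead of calling `SparkContext(...)`. A sketch of the session construction with the constructor and the `.master(...)` call removed; the builder is passed in as a parameter so the snippet stays Spark-free, but in the hook you would pass `SparkSession.builder`:

```python
# Sketch: build/attach to the shared Spark session on Databricks.
# No SparkContext(...) constructor and no .master(...): the cluster manager
# is already configured, and getOrCreate() reuses the shared context.
def build_spark_session(builder, package_name: str):
    return (
        builder
        .appName(package_name)
        .enableHiveSupport()
        .getOrCreate()
    )
```

    In the hook, `build_spark_session(SparkSession.builder, context._package_name)` would replace both the `SparkContext` line and the existing builder chain; the `setLogLevel("WARN")` call can stay as-is afterwards.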

    Matheus Sampaio

    12/19/2022, 2:53 PM
    Hi everyone, one quick help please 😄 Does anyone know how I can print the kedro nodes' execution order in a Databricks notebook? Thanks
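    [Editor's note] Not an authoritative answer, but the execution order is just a topological sort of the node dependency graph, and, if I remember the API correctly, `Pipeline.nodes` already returns nodes in that order, so `for n in pipeline.nodes: print(n.name)` in a notebook cell may be all that's needed. The ordering idea itself, in plain Python with illustrative names:

```python
# Sketch: Kahn's algorithm over {node: input datasets} / {node: output
# datasets} mappings, mirroring how a pipeline's execution order follows
# dataset dependencies.
from collections import deque

def execution_order(node_inputs: dict, node_outputs: dict) -> list:
    """Both arguments map node name -> set of dataset names."""
    producers = {ds: n for n, outs in node_outputs.items() for ds in outs}
    # upstream nodes each node still waits on
    deps = {n: {producers[ds] for ds in ins if ds in producers}
            for n, ins in node_inputs.items()}
    ready = deque(sorted(n for n, d in deps.items() if not d))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m, d in deps.items():
            if n in d:
                d.remove(n)
                if not d:
                    ready.append(m)
    return order
```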

    user

    12/19/2022, 5:08 PM
    Kedro, running inference on user input I have a pipeline with the model I want to use. Outside of the project, I have an app.py file where I'm going to create the UI/UX for my users to run my model. Right now I'm just using a sample string, but later on, you can imagine that there will be a textbox for users to type in. How can I pass the user input as an input to the pipeline? I thought I would be able to do so with kedro.framework.session.session.KedroSession as seen in the code below, but doing so results in the error...

    Dhaval Thakkar

    12/20/2022, 8:19 AM
    Hi, so I am currently trying to integrate the Great Expectations hook for my data loading process, but the hook is not getting registered and I am unable to move forward. Here is the issue. Please use the latest develop branch of the following project to look through the issue: https://github.com/DhavalThkkar/ecom-analytics All help would be appreciated. Also, can this be used directly as a hook while converting the pipeline to Prefect or Airflow?

    Simon Myway

    12/20/2022, 8:51 AM
    Hi team, has anyone tried to develop a multipage-PDF matplotlib dataset, where you would pass it a list of figures and it would save them in a single multipage PDF file?
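    [Editor's note] A custom dataset's `_save` could delegate to matplotlib's `PdfPages`. A sketch of just the saving function; the `AbstractDataSet` wrapper around it is left as a comment, since the class layout varies by Kedro version:

```python
# Sketch: write a list of matplotlib figures into a single multi-page PDF.
# In a custom Kedro dataset, _save(self, data) would call this with
# self._filepath.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def save_multipage_pdf(figures, filepath):
    with PdfPages(filepath) as pdf:
        for fig in figures:
            pdf.savefig(fig)  # one page per figure
```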

    Frits

    12/20/2022, 9:24 AM
    Hello! Does anyone know how to properly update an existing data set from the catalog? It is not possible for us to download the whole data set again (from a sql db), so we would love to be able to download only the data after a certain date, which is determined at runtime.

    Dhaval Thakkar

    12/20/2022, 10:26 AM
    Is there any easy way to use Great Expectations with Kedro? This is my current workflow:
    1. Create an expectations suite by following all the steps listed on the great-expectations documentation website
    2. Copy the expectation that is generated and paste the <expectation_name.json> in the data/01_raw/ folder
    3. Create the `hooks.py` file for great-expectations and register the hook in the `settings.py` file
    4. Execute `kedro run`
    5. Now I was expecting this to work directly, but I am getting this error: `ValueError: Unable to load datasource files_datasource -- no configuration found or invalid configuration.`
    Please use the latest develop branch of the following project to look through the issue: https://github.com/DhavalThkkar/ecom-analytics This is extremely difficult. Can someone guide me if I am doing anything wrong?

    Pedro Abreu

    12/20/2022, 11:24 AM
    Hey team 🙂 1. Is there a way to make dataset parameters depend on kedro parameters? What we’re trying to do: have a top-level parameter to define if a dataset write should be an append or upsert 2. Is it possible to run kedro-viz on databricks? Thanks

    Dhaval Thakkar

    12/20/2022, 3:02 PM
    Any help for great expectations would really be appreciated

    user

    12/20/2022, 3:38 PM
    How to use Kedro with Great-expectations? I am using Kedro to create a pipeline for ETL purposes and column specific validations are being done using Great-Expectations. There is a hooks.py file listed in Kedro documentation here. This hook is registered as per the instructions mentioned on Kedro-docs. This is my current workflow: Create expectations suite by following all the steps listed on the great-expectations...

    Dhaval Thakkar

    12/21/2022, 10:08 AM
    I have updated the question above for the steps that I used to recreate the issue again even on a docker container
  • d

    Daniel Bull

    12/21/2022, 12:00 PM
    Hi everyone! I am adding a data fabrication step to a project I'm working on. I've created a node which runs the fabricator I'm using, outputting a dictionary of inter-related pandas DataFrames. The rest of the pipeline, however, is built to use Spark; I'm struggling to figure out the best way to convert the fabricated pandas DataFrames to Spark DataFrames. Any suggestions would be greatly appreciated!
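    [Editor's note] One low-friction option, sketched under the assumption that a SparkSession is already available (e.g. via the usual Kedro Spark hook): a small bridging node that calls `spark.createDataFrame` per entry. The session is passed in explicitly here so the function stays testable; all names are illustrative.

```python
# Sketch: bridge node converting a dict of pandas DataFrames into Spark
# DataFrames. Downstream Spark-based nodes can then consume each frame
# (or the whole dict, if kept as one dataset) as usual.
def pandas_dict_to_spark(frames: dict, spark) -> dict:
    return {name: spark.createDataFrame(df) for name, df in frames.items()}
```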

    Seth

    12/21/2022, 1:11 PM
    Hi team! I’d like to read existing partitions and write new partitions to the same PartitionedDataSet in a single Pipeline. However, with a single DataCatalog entry this creates a CircularDependencyError. What is the proper way to handle such situations in Kedro? I can create identical Catalog entries, however it doesn’t feel like the correct solution for this problem.
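    [Editor's note] For reference, the workaround usually described (I can't promise it's the blessed one) is exactly the two-entry trick: two catalog entries over the same path, one used only as the node's input and one only as its output, which breaks the cycle at the cost of some duplication. A sketch with illustrative names:

```yaml
partitions_existing:
  type: PartitionedDataSet
  path: data/02_intermediate/partitions
  dataset: pandas.CSVDataSet

partitions_new:   # same path; used only as the node's output
  type: PartitionedDataSet
  path: data/02_intermediate/partitions
  dataset: pandas.CSVDataSet
```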

    Anu Arora

    12/21/2022, 1:25 PM
    Hi Team 🙂 I am trying to run kedro-viz on Databricks, following this article: https://kedro.readthedocs.io/en/latest/deployment/databricks.html#running-kedro-viz-on-databricks but even after installing kedro (version 0.18.4) and kedro-viz, the run_viz command is still not found. Am I missing something? Error:

```
UsageError: Line magic function `%run_viz` not found.
```