Jo Stichbury
11/20/2022, 4:11 PM
user
11/20/2022, 9:38 PM
Leo Casarsa
11/21/2022, 1:57 PM
Ahmed Afify
11/21/2022, 3:40 PM
Francisca Grandón
11/21/2022, 7:16 PM
user
11/21/2022, 7:48 PM
Zihao Xu
11/21/2022, 11:16 PM
INFO Loading data from 'modeling.model_best_params_' (JSONDataSet)... data_catalog.py:343
DataSetError: Loading not supported for 'JSONDataSet'
where we have the following catalog entry:
modeling.model_best_params_:
  type: tracking.JSONDataSet
  filepath: "${folders.tracking}/model_best_params.json"
  layer: reporting
The same code runs completely fine locally, but is failing within Databricks.
Could you please help us understand why?
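For context: Kedro's tracking datasets are save-only, so their _load raises exactly this DataSetError by design. The sketch below (with a hypothetical filepath) only shows how the tracked values could be read back with a loadable dataset class; it does not explain why the behaviour differs between local and Databricks runs.

from kedro.extras.datasets.json import JSONDataSet

# Read the tracked JSON back with a dataset class that supports loading.
# The filepath is illustrative; point it at wherever ${folders.tracking} resolves.
readback = JSONDataSet(filepath="data/09_tracking/model_best_params.json")
best_params = readback.load()  # plain dict parsed from the JSON file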
Moinak Ghosal
11/22/2022, 8:30 AM
Ankar Yadav
11/22/2022, 11:49 AM
Ankar Yadav
11/22/2022, 1:04 PM
Andreas Adamides
11/23/2022, 12:09 PM
Kedro now uses the Rich library to format terminal logs and tracebacks.
Is there any way to revert to plain console logging and not use rich logging when running a Kedro pipeline using the Sequential Runner from the API and not via kedro CLI?
runner = SequentialRunner()
runner.run(pipeline_object, catalog, hook_manager)
I tried to look for configuration, but I believe you can only add configuration if you are in a Kedro project and intend to run with the Kedro CLI.
Any ideas?
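One thing that may be worth trying (a sketch, not a confirmed fix): since the runner is being driven from plain Python, the handlers on the "kedro" logger can be swapped for a standard StreamHandler before the run, so no Rich formatting is applied to Kedro's log records. Rich tracebacks, if they have already been installed elsewhere, are a separate concern.

import logging

from kedro.runner import SequentialRunner

# Replace whatever handlers are attached to the "kedro" logger (Rich or otherwise)
# with a plain stdlib StreamHandler before running.
plain = logging.StreamHandler()
plain.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

kedro_logger = logging.getLogger("kedro")
kedro_logger.handlers = [plain]
kedro_logger.propagate = False

runner = SequentialRunner()
runner.run(pipeline_object, catalog, hook_manager)  # objects from the snippet above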
Afaque Ahmad
11/24/2022, 9:43 AM
cache made available inside the _load method of multiple Kedro datasets. How should we go about it? Can we use hooks, or is there anything simpler?
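Not an official pattern, just a sketch of one way this is sometimes done without hooks: wrap the datasets in question in a thin subclass whose _load checks a shared in-memory cache keyed by filepath. CachedCSVDataSet and the cache dict are hypothetical names; Kedro's built-in kedro.io.CachedDataSet, which wraps another dataset and keeps the loaded data in memory, may also already cover this.

from typing import Any, Dict

from kedro.extras.datasets.pandas import CSVDataSet

# Hypothetical module-level cache shared by every instance of the subclass.
_LOAD_CACHE: Dict[str, Any] = {}


class CachedCSVDataSet(CSVDataSet):
    """CSVDataSet whose _load result is memoised per filepath (sketch only)."""

    def _load(self) -> Any:
        key = str(self._filepath)
        if key not in _LOAD_CACHE:
            _LOAD_CACHE[key] = super()._load()
        return _LOAD_CACHE[key]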
Fabian
11/24/2022, 12:13 PM
Jose Alejandro Montaña Cortes
11/24/2022, 7:40 PM
Afaque Ahmad
11/25/2022, 6:53 AM
get_spark inside the ProjectContext which I need to access in the register_catalog hook. How can I access that function?
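One low-tech way around this (a sketch, with a hypothetical module and package name): move get_spark out of ProjectContext into a plain module and import it from both the context and the hook, so the hook never needs a reference to the context at all.

# src/my_project/spark_utils.py  (hypothetical module)
from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Return the active SparkSession, creating one if needed."""
    return SparkSession.builder.getOrCreate()


# The hook module can then call it directly instead of going through ProjectContext:
# from my_project.spark_utils import get_spark
# spark = get_spark()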
Elias
11/25/2022, 10:13 AM
kedro.io.core.DataSetError:
__init__() got an unexpected keyword argument 'table_name'.
DataSet 'inspection_output' must only contain arguments valid for the constructor of `kedro.extras.datasets.pandas.sql_dataset.SQLQueryDataSet`.
Elias
11/25/2022, 10:13 AM
catalog.yml:
inspection_output:
  type: pandas.SQLQueryDataSet
  credentials: postgresql_credentials
  table_name: shuttles
  layer: model_output
  save_args:
    index: true
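If it helps, the error is about the constructor arguments: pandas.SQLQueryDataSet expects a sql argument (and is load-only), while table_name and save_args belong to pandas.SQLTableDataSet. A sketch of the equivalent dataset built in Python, assuming the credentials carry a "con" connection string (the one below is a placeholder):

from kedro.extras.datasets.pandas import SQLTableDataSet

# A dataset that reads and writes the "shuttles" table.
inspection_output = SQLTableDataSet(
    table_name="shuttles",
    credentials={"con": "postgresql://user:password@host:5432/dbname"},
    save_args={"index": True},
)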
Elias
11/25/2022, 10:13 AM
Shreyas Nc
11/25/2022, 10:26 AM
from kedro.io import DataCatalog
from kedro.extras.datasets.pillow import ImageDataSet
io = DataCatalog(
    {
        "cauliflower": ImageDataSet(filepath="data/01_raw/cauliflower"),
    }
)
But I don't see this in the catalog, and I get an error when I reference it in the pipeline node saying that the entry doesn't exist in the catalog.
Am I missing something here?
Note: this is on the latest version of Kedro, version 0.18.3.
I just joined the channel, so if I am not using the right format or channel to ask this question, please let me know.
Thanks in advance!
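A couple of checks that might narrow this down (sketch only): DataCatalog.list() shows what the catalog object actually contains, and a programmatically built catalog is only used if it is the one handed to the runner. A run started with kedro run builds its catalog from conf/<env>/catalog.yml, so an entry created only in Python will not be visible to it.

from kedro.extras.datasets.pillow import ImageDataSet
from kedro.io import DataCatalog

io = DataCatalog(
    {
        "cauliflower": ImageDataSet(filepath="data/01_raw/cauliflower"),
    }
)
print(io.list())  # should include "cauliflower"

# A programmatic catalog is only used when passed to the run explicitly, e.g.
# SequentialRunner().run(my_pipeline, io, hook_manager);
# `kedro run` ignores it and reads conf/<env>/catalog.yml instead.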
Anu Arora
11/25/2022, 1:45 PM
When running dbx execute <workflow-name> --cluster-id=<cluster-id>, Kedro is failing with the error below:
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f0037269-19cc-4c81-9dc2-43bcd22cd8ff/lib/python3.8/site-packages/kedro/framework/startup.py in _get_project_metadata(project_path)
64
65 if not pyproject_toml.is_file():
---> 66 raise RuntimeError(
67 f"Could not find the project configuration file '{_PYPROJECT}' in {project_path}. "
68 f"If you have created your project with Kedro "
RuntimeError: Could not find the project configuration file 'pyproject.toml' in /databricks/driver.
I can see that the file was never packaged, but I am not sure whether it was supposed to be packaged or not. Plus, it is somehow pointing to the working directory /databricks/driver. Below is the Python file I am running as a spark_python_task:
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession
package_name = "project_comm"
configure_project(package_name)
with KedroSession.create(package_name, env="base") as session:
    session.run()
Any help would be great!!
PS: I have tried with dbx deploy and launch as well and am still facing the same issue.
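One hedged thought: the traceback shows Kedro looking for pyproject.toml relative to the working directory (/databricks/driver on the cluster). If the project sources, including pyproject.toml and conf/, are available somewhere on the cluster, passing an explicit project_path to KedroSession.create keeps the lookup away from the working directory. The DBFS path below is purely illustrative.

from pathlib import Path

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

package_name = "project_comm"
configure_project(package_name)

# Hypothetical location of the project root on the cluster; adjust it to
# wherever the repo (with pyproject.toml and conf/) actually lives.
project_root = Path("/dbfs/tmp/project_comm")

with KedroSession.create(package_name, project_path=project_root, env="base") as session:
    session.run()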
Karl
11/26/2022, 12:27 AM
Fabian
11/26/2022, 1:27 PM
Yousri
11/28/2022, 3:27 PM
python3 -m project_name.run
But I have a question about parameters. When I run the packaged project, I can no longer pass parameters to the project or modify parameters.yml, so my question is: how do I pass arguments when I run a packaged Kedro project?
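One option that avoids the CLI entirely (a sketch; the package name and parameter keys are placeholders): drive the run from a small Python entry script and pass overrides through extra_params on KedroSession.create, which take precedence over the values in parameters.yml.

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

configure_project("project_name")

runtime_params = {"model_options": {"test_size": 0.3}}  # overrides parameters.yml

with KedroSession.create("project_name", extra_params=runtime_params) as session:
    session.run()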
Afaque Ahmad
11/29/2022, 6:56 AM
catalog dict in the after_node_run hook?
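If the question is how to get at the catalog there: the after_node_run hook spec already receives the run's DataCatalog as an argument, so a hook implementation only needs to declare the catalog parameter. A minimal sketch:

from kedro.framework.hooks import hook_impl


class CatalogInspectionHooks:
    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        # `catalog` is the DataCatalog used for the current run; hook
        # implementations may declare only the arguments they need.
        print(f"{node.name} finished; catalog entries: {catalog.list()}")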
Fabian
11/29/2022, 9:59 AM
Hi Team,
another beginner's question: I have created a pipeline that nicely analyzes my DataFrame. Now I am adding a new level of complexity to my DataFrame and want to execute the pipeline on each level, similar to a function in groupby.apply.
Can I do this without modifying the pipeline itself? E.g., splitting the DataFrame ahead of the pipeline and re-merging it afterwards, while leaving the existing pipeline as it is?
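One pattern that keeps the existing pipeline untouched (a sketch; the level names and dataset names are made up, and create_pipeline() stands for the existing pipeline factory whose output is "report"): instantiate the same pipeline once per level under a namespace, then add a small node that re-merges the per-level results.

import pandas as pd

from kedro.pipeline import Pipeline, node, pipeline


def merge_reports(*reports: pd.DataFrame) -> pd.DataFrame:
    """Re-merge the per-level results."""
    return pd.concat(reports)


levels = ["level_a", "level_b"]  # hypothetical level names

per_level = Pipeline([])
for lvl in levels:
    # Each copy reads/writes namespaced datasets such as "level_a.report"; the
    # namespaced free inputs need their own catalog entries (or an upstream splitting node).
    per_level += pipeline(create_pipeline(), namespace=lvl)

merge = Pipeline(
    [node(merge_reports, [f"{lvl}.report" for lvl in levels], "full_report")]
)

full_pipeline = per_level + merge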
Ankar Yadav
11/29/2022, 11:37 AM
Balazs Konig
11/29/2022, 3:09 PM
conf/dev/)
2. by pipeline (conf/base/data_connectors/xyz/)
Is there a simple way to achieve this double filter without much hacking?
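For what it's worth, a sketch rather than a definitive answer: the default ConfigLoader catalog patterns also match files nested in subfolders of an environment, so a per-pipeline folder inside an environment (e.g. conf/dev/data_connectors/xyz/catalog.yml) can be picked up in a single pass. The snippet below only spells out that lookup; the folder names are taken from the question.

from kedro.config import ConfigLoader

# Load catalog entries for the "dev" environment, including catalog*.yml files
# nested in per-pipeline subfolders such as data_connectors/xyz/.
conf_loader = ConfigLoader(conf_source="conf", env="dev")
catalog_conf = conf_loader.get("catalog*", "catalog*/**", "**/catalog*")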
Jan
11/30/2022, 10:29 AM
conf/base will not be loaded? I would like to do something like kedro run --env=prod, and in the prod env I have a catalog that is prefixed (e.g. file: data/prod/01_raw/file.txt) so that I can keep the prod data separated. I would like to avoid leakage of development data into the prod env. For example, if I add a new step and create a new entry in the data catalog (base) but forget to add this entry in the prod catalog, it will later be used in the prod environment by default because it is not overwritten. Instead I would like to get an error, or implicitly use a MemoryDataset; in other words: don't load conf/base. Does this make sense? 😄
Edit: Just realized that this behaviour would be possible if I just used conf/base as the prod env and always developed in a conf/dev env. However, ideally I would like to use conf/base by default and only work in prod by specifying it explicitly, to avoid mistakenly changing something there 🤔
Qiuyi Chen
11/30/2022, 6:35 PM
from pyspark.sql import DataFrame
import pandas as pd
from typing import Dict


def function_a(params: Dict, *df_lst: DataFrame):
    report = pd.DataFrame()
    for df in df_lst:
        temp = function(df, params)  # `function` is the per-DataFrame helper
        report = pd.concat([report, temp])
    return report
I can run the function like this:
function_a(params, df1, df2, df3)
But in the pipeline, how can I define the node and catalog in this situation? Here is what I did; please let me know where I went wrong.
def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", "df_lst"],
                outputs="report",
            ),
        ]
    )

catalog = DataCatalog(
    data_sets={"df_lst": df1},
    feed_dict={"params": params},
)
I can only run the pipeline when df_lst is just one dataframe, but I want it to be something like “df_lst”: df_1, df_2, df_3, …, df_n (n>3).
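A sketch of how the variadic case is usually wired up (the dataset names below are placeholders and each would need its own catalog entry): list every dataframe as a separate node input; Kedro passes list inputs positionally, so everything after the first entry lands in *df_lst.

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):
    df_names = ["df_1", "df_2", "df_3"]  # extend to df_n as needed
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", *df_names],  # "params" matches the feed_dict key above
                outputs="report",
            ),
        ]
    )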
Fabian
12/01/2022, 10:59 AM