# questions
  • m

    Mark Pinches

    12/21/2022, 2:46 PM
    Hi Kedro,
    👋 2
  • j

    Jordan

    12/21/2022, 9:38 PM
    Can someone explain why the behaviour of
    kedro build-reqs
    was changed? It used to build a
    requirements.txt
    file from a
requirements.in
    file, and now it builds a
    requirements.lock
    file from a
    requirements.txt
    file.
  • v

    Vladimir Filimonov

    12/22/2022, 8:23 AM
Hey everyone! Am I missing something, or can https://github.com/kedro-org/kedro-plugins/blob/main/Makefile#L1 not work locally without manual pre-configuration?
$(plugin)
is never defined, nor did I find any instructions in the repo for defining it before running make.
  • s

    Slackbot

    12/22/2022, 9:00 AM
    Reminder: The Kedro team is on break from Thursday, the 22nd of December - Wednesday, 4th of January. We hope that you have a great holiday break (if you're taking one) and we'll see you in the new year.
  • e

    Eugene P

    12/22/2022, 2:52 PM
Hi everyone! Wanted to check with you, as more experienced kedroids, whether I'm doing something stupid. Several preliminary steps of my workflow require running some heavy-lifting SQL queries with Postgres/PostGIS (they must be executed in a particular order). At the moment I'm doing it the following way:
1. I have a separate folder with SQL queries.
2. I use the catalog to declare
pandas.SQLQueryDataSet
— one for each query.
3. I have a generic node function to call the SQL query, returning an empty df like this:
Copy code
import pandas as pd

def run_sql_script_node(sql_query_dataset: pd.DataFrame,
                        blank_df_for_nodes_order: pd.DataFrame) -> pd.DataFrame:
    # Loading sql_query_dataset runs the query; blank_df_for_nodes_order only enforces order.
    return pd.DataFrame()
4. I define the required nodes, controlling the execution order by chaining consecutive empty-df outputs/inputs:
    Copy code
node(
    func=run_sql_script_node,
    inputs=["create_rropen_cadcost_schema_and_tables_dataset", "empty_cadcost_df0"],
    outputs="empty_cadcost_df1",
    name="create_rropen_cadcost_schema_and_tables_node",
),
node(
    func=run_sql_script_node,
    inputs=["create_rropen_cadcost_staging_table_dataset", "empty_cadcost_df1"],
    outputs="empty_cadcost_df2",
    name="create_rropen_cadcost_staging_table_dataset_node",
),
I do understand that Kedro may not be 100% the appropriate tool to control SQL workflows, but for the sake of total DS-pipeline integrity and my Kedro learning I'd like to stick with it (it's amazing, btw!). This workaround works correctly, but I was wondering whether the approach could be simplified further. Maybe there is a way to execute SQL queries in a particular order without creating catalog entries for the datasets, for example? Thanks in advance for critique and suggestions!
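One possible simplification (a hedged sketch, not an official Kedro pattern — the folder layout and connection string are assumptions): run all the ordered scripts inside a single node with SQLAlchemy, so there are no per-query catalog entries and no blank-df chaining:
Copy code
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

def run_sql_scripts_in_order(sql_dir: str, con: str) -> pd.DataFrame:
    # Execute every .sql file in sql_dir in sorted (e.g. numbered) order.
    engine = create_engine(con)  # e.g. "postgresql://user:pass@host/db" (assumed)
    with engine.begin() as conn:
        for script in sorted(Path(sql_dir).glob("*.sql")):
            conn.execute(text(script.read_text()))
    return pd.DataFrame()  # placeholder output so downstream nodes can depend on it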
  • o

    Olivier Ho

    12/22/2022, 3:33 PM
Hello, I have some questions. What is the state of async support in Kedro? I tried to create an async node, where the function is async because I created a custom dataset that returns an async iterator (for performance purposes :~), so I had to define an async function in order to await on the iterator. To test async support, I created a fake node that saves the iterator values in a partitioned dataset. The error I get in this case is that the input data passed to the save function is actually a coroutine. I tried with and without the async flag, and with all three runners.
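For reference, the runners' is_async flag only makes dataset load/save concurrent; node functions are still called synchronously, so an async function's return value is passed along as an unawaited coroutine. A hedged workaround sketch (all names illustrative): drain the async iterator inside a synchronous node before handing values to a partitioned dataset:
Copy code
import asyncio
from typing import Any, AsyncIterator, Dict

def collect_async_iterator(aiter: AsyncIterator[Any]) -> Dict[str, Any]:
    # Synchronously drain the async iterator so Kedro receives concrete values.
    async def _collect():
        return [item async for item in aiter]

    items = asyncio.run(_collect())
    return {f"part_{i:05d}": item for i, item in enumerate(items)}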
  • m

    Mohammed Samir

    12/22/2022, 3:36 PM
Hello everyone, how can I run a Kedro preprocessing pipeline on an AWS SageMaker instance, just like
train_model_sagemaker
?
  • b

    Brandon Meek

    12/22/2022, 10:20 PM
Hey everyone! I'm working on a modular pipeline and I'm trying to freeze one of my parameter inputs, which is a dictionary:
    Copy code
    features:
      numeric:
        x: "x"
      categorical:
        y:
          col: "y"
          dropna: True
        z:
          col: "z"
          dropna: True
        i:
          col: "i"
          dropna: False
        j:
          col: "j"
          dropna: False
    but when I try to freeze the parameter:
    Copy code
    ingestion_pipeline = pipeline(
        pipe=ingestion_pipe,
        inputs={
            "a",
            "b",
            "c",
            "d"
        },
        parameters="features",
        namespace="ingestion"
    )
    I get
    Failed to map datasets and/or parameters: params:features
    When I namespace
    features
it works. Am I doing something wrong? I'm using
    kedro 0.18.3
    with spark
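One thing worth checking (an assumption based on the error text, not a confirmed diagnosis): the parameters argument of pipeline() must name parameters the inner pipeline consumes verbatim, so mapping features fails if the nodes actually take nested keys such as params:features.numeric. A minimal sketch of the shape that maps cleanly:
Copy code
from kedro.pipeline import node, pipeline

def build_features(features: dict) -> dict:
    # toy body; consumes the whole params:features dict at once
    return features

inner = pipeline(
    [node(build_features, inputs="params:features", outputs="feature_table")]
)

ingestion_pipeline = pipeline(
    pipe=inner,
    parameters="features",  # kept un-namespaced because the inner node uses it verbatim
    namespace="ingestion",
)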
  • s

    Suryansh Soni

    12/23/2022, 4:14 PM
Hello everyone! Wishing you all a Merry Christmas. I wanted to know if somebody has information on how to deploy a Kedro pipeline to AWS Step Functions and SageMaker to retrain the model, with CI/CD using GitHub Actions and CodeBuild. (edited)
  • r

    Rob

    12/26/2022, 4:21 PM
Hi everyone and happy holidays, I recently started using Kedro and I'm looking at its workflow with Spark, so I'm testing it with the
pyspark-iris
starter. I've already set up Spark 3.0 on my Windows machine and it's working, but I'm getting this `DataSetError`:
    Copy code
    DataSetError: Failed while saving data to data set 
    SparkDataSet(file_format=parquet, 
    filepath=C:/Users/rober/PycharmProjects/pyspark-test/data/02_intermediate/X_train.parquet, load_args={'header': True, 'inferSchema': True}, 
    save_args={'header': True, 'mode': overwrite}).
    An error occurred while calling o60.save.
So I already checked the
copy_mode
of the
MemoryDataSet
conf inside the
catalog.yml
and it's set to assign, since there are no actions executed in the previous node, so I guess it's the only saving mode. It's probably something simple, but if someone can help me, I'd appreciate it.
    ✅ 1
  • e

    Elior Cohen

    12/27/2022, 7:28 AM
Is there an option to dynamically execute nodes? I imagine a use case where I have node
A
which does some work and then, depending on how much data it produced, it can create multiple parallel executions of
B
, where each
B_i
executes the same logic on a subset of the data produced by
A
. Then maybe any data points in
B
that have errors go to
C
, while the good data points go to
D
.
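For context, and hedged: Kedro pipelines are static DAGs, so true runtime fan-out isn't supported; the usual workarounds are a PartitionedDataSet that B consumes partition by partition, or generating a fixed fan-out when the pipeline is constructed. A sketch of the latter, with all names illustrative:
Copy code
from kedro.pipeline import Pipeline, node, pipeline

N_SPLITS = 4  # fixed at construction time; cannot vary per run

def split(a_output: list) -> tuple:
    # carve A's output into N_SPLITS round-robin subsets
    return tuple(a_output[i::N_SPLITS] for i in range(N_SPLITS))

def process_subset(subset: list) -> list:
    return subset  # stand-in for B's logic

def create_fanout_pipeline() -> Pipeline:
    nodes = [
        node(split, "a_output", [f"subset_{i}" for i in range(N_SPLITS)], name="split_a")
    ]
    nodes += [
        node(process_subset, f"subset_{i}", f"b_output_{i}", name=f"b_{i}")
        for i in range(N_SPLITS)
    ]
    return pipeline(nodes)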
  • m

    meharji arumilli

    12/27/2022, 1:42 PM
How to save non-DataFrame Spark objects to S3?
For non-Spark objects I used to save/read from the catalog as:
lightgbm_model:
  type: pickle.PickleDataSet
  filepath: s3://bucket/data/lightgbm_model.pkl
  backend: pickle
How can I save the 'lightgbm_model' model if it comes from a Spark pipeline?
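For what it's worth, Spark ML objects can't be pickled; they ship their own writer. A hedged sketch (bucket and path are assumptions) of saving a fitted model straight to S3 from inside a node:
Copy code
from pyspark.ml import PipelineModel

def save_spark_model(model: PipelineModel) -> None:
    # Spark ML models use their own save/load machinery, not pickle.
    model.write().overwrite().save("s3a://bucket/data/lightgbm_spark_model")  # path assumed

# reload later with:
# model = PipelineModel.load("s3a://bucket/data/lightgbm_spark_model")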
  • m

    Manilson António Lussati

    12/27/2022, 6:48 PM
Hello everyone, have you ever had any difficulties running a Kedro project using spark-submit?
  • p

    Pawel Granat

    12/28/2022, 4:47 PM
Hello everyone and Happy New Year! My question is whether it is possible to fail the whole pipeline if one node fails? Currently, if one node fails, it only prints a warning and the whole pipeline shows as successful (exit code 0). In the log I have:
    Copy code
    2022-12-18 14:08:46,846] {ssh.py:476} INFO - 2022-12-18 14:08:46,845 - kedro.pipeline.node - INFO - Running node: test_node_1: <lambda>([test_1.fake_name.test_data_predictions,params:test_1.predictive_modeling.fake_name.target]) -> [test_1.fake_name.labels,test_1.fake_name.score]
    [2022-12-18 14:08:46,846] {ssh.py:476} INFO - 2022-12-18 14:08:46,845 - multi_runner.safeguards - ERROR - Node test_node_1, in the "test_1" run failed with the exception:
    'AttributeError' object is not subscriptable
    Traceback (most recent call last):
     [..]
    And further on in the same log:
    Copy code
    [2022-12-18 14:08:47,520] {ssh.py:476} INFO - 2022-12-18 14:08:47,510 - multi_runner.safeguards - WARNING - Node fake_name_post_modelling_analysis, in the "test_1"run is skipped due to an upstream error
    [2022-12-18 14:08:47,672] {ssh.py:476} INFO - 2022-12-18 14:08:47,671 - kedro.runner.sequential_runner - INFO - Completed 48 out of 48 tasks
    [2022-12-18 14:08:47,673] {ssh.py:476} INFO - 2022-12-18 14:08:47,671 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
    [2022-12-18 14:08:47,674] {ssh.py:476} INFO - 2022-12-18 14:08:47,672 - proj.hooks.project_hooks - INFO - fake_name pipeline execution completed successfully.
    [2022-12-18 14:08:50,263] {taskinstance.py:859} DEBUG - Refreshing TaskInstance <TaskInstance: test_dag.fake_name manual__2022-12-17T16:06:07.524573+00:00 [running]> from DB
    [2022-12-18 14:08:50,282] {base_job.py:226} DEBUG - [heartbeat]
    [2022-12-18 14:08:51,673] {channel.py:1212} DEBUG - [chan 0] EOF received (0)
[2022-12-18 14:08:51,711] {__init__.py:107} DEBUG - Lineage called with inlets: [], outlets: []
    [2022-12-18 14:08:51,711] {taskinstance.py:859} DEBUG - Refreshing TaskInstance <TaskInstance: test_dag.fake_name manual__2022-12-17T16:06:07.524573+00:00 [running]> from DB
    [2022-12-18 14:08:51,734] {taskinstance.py:1406} DEBUG - Clearing next_method and next_kwargs.
    [2022-12-18 14:08:51,734] {taskinstance.py:1400} INFO - Marking task as SUCCESS. dag_id=test_dag, task_id=fake_name, execution_date=20221217T160607, start_date=20221218T140430, end_date=20221218T140851
    [2022-12-18 14:08:51,735] {taskinstance.py:2336} DEBUG - Task Duration set to 261.096866
    [2022-12-18 14:08:51,751] {cli_action_loggers.py:84} DEBUG - Calling callbacks: []
    [2022-12-18 14:08:51,822] {local_task_job.py:156} INFO - Task exited with return code 0
As you can see: fake_name pipeline execution completed successfully. Run command:
    Copy code
    kedro run --pipeline fake_name
    Great hearing from you and all the best, Pawel
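Worth noting (hedged): stock Kedro runners re-raise node exceptions, so a plain kedro run exits non-zero on failure; the log above shows a custom multi_runner.safeguards layer catching errors and skipping downstream nodes instead. Assuming that layer still fires the standard hooks, a sketch of restoring fail-fast behaviour (the FailFastHook class is illustrative):
Copy code
from kedro.framework.hooks import hook_impl

class FailFastHook:
    """Collect node errors and raise at the end so the exit code reflects them."""

    def __init__(self):
        self.errors = []

    @hook_impl
    def on_node_error(self, error, node):
        self.errors.append((node.name, error))

    @hook_impl
    def after_pipeline_run(self):
        if self.errors:
            failed = ", ".join(name for name, _ in self.errors)
            raise RuntimeError(f"Nodes failed: {failed}")

# register in settings.py:  HOOKS = (FailFastHook(),)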
  • m

    meharji arumilli

    12/28/2022, 7:17 PM
Hi, I am writing a Spark DataFrame to storage by specifying it in the catalog.yml as shown below
  • m

    meharji arumilli

    12/28/2022, 7:17 PM
    Copy code
    preprocessed_data:
      type: spark.SparkDataSet
      filepath: data/${project}/05_model_input/df_preprocessed.parquet
      file_format: parquet
  • m

    meharji arumilli

    12/28/2022, 7:18 PM
    And it throws the error:
    raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while saving data to data set SparkDataSet(file_format=parquet, filepath=/Users/data/rre/05_model_input/df_preprocessed.parquet, load_args={}, save_args={}).
    An error occurred while calling o727.save.
  • m

    meharji arumilli

    12/28/2022, 7:19 PM
Any clue what the issue could have been? I could be missing some minor thing here!! @Rob is it similar to your issue? Could you help?
  • s

    Sebastian Cardona Lozano

    12/29/2022, 2:20 PM
    Hi everyone.
  • s

    Sebastian Cardona Lozano

    12/29/2022, 2:26 PM
Hi everyone. I recently started with Kedro. I'm working with a Vertex AI Workbench in GCP (an Ubuntu virtual machine with Anaconda pre-installed). For some reason I'm having these problems: 1. When I run
kedro info
in the CLI, the following warning appears:
    Copy code
    [12/29/22 14:22:07] WARNING  /opt/conda/lib/python3.7/site-packages/plotly/graph_objects/__init__.py:288:                   warnings.py:110
                                 DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.                  
                                   if LooseVersion(ipywidgets.__version__) >= LooseVersion("7.0.0"):
2. Nor can I use
kedro ipython
in the CLI:
    Copy code
    [12/29/22 14:24:12] INFO     Resolved project path as: /home/jupyter/bm-598-onboarding.                                     __init__.py:135
                                 To set a different path, run '%reload_kedro <project_root>'                                                   
    [TerminalIPythonApp] WARNING | Error in loading extension: kedro.ipython
    Check your config files in /home/jupyter/.ipython/profile_default
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/shellapp.py", line 301, in init_extensions
        self.shell.extension_manager.load_extension(ext)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/extensions.py", line 87, in load_extension
        if self._call_load_ipython_extension(mod):
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/extensions.py", line 134, in _call_load_ipython_extension
        mod.load_ipython_extension(self.shell)
      File "/opt/conda/lib/python3.7/site-packages/kedro/ipython/__init__.py", line 40, in load_ipython_extension
        reload_kedro()
      File "/opt/conda/lib/python3.7/site-packages/kedro/ipython/__init__.py", line 89, in reload_kedro
        context = session.load_context()
      File "/opt/conda/lib/python3.7/site-packages/kedro/framework/session/session.py", line 259, in load_context
        context=context
      File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265, in __call__
        return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
      File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80, in _hookexec
        return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
      File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 60, in _multicall
        return outcome.get_result()
      File "/opt/conda/lib/python3.7/site-packages/pluggy/_result.py", line 60, in get_result
        raise ex[1].with_traceback(ex[2])
      File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39, in _multicall
        res = hook_impl.function(*args)
      File "/opt/conda/lib/python3.7/site-packages/kedro_telemetry/plugin.py", line 120, in after_context_created
        catalog = context.catalog
      File "/opt/conda/lib/python3.7/site-packages/kedro/framework/context/context.py", line 232, in catalog
        return self._get_catalog()
      File "/opt/conda/lib/python3.7/site-packages/kedro/framework/context/context.py", line 287, in _get_catalog
        save_version=save_version,
      File "/opt/conda/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 272, in from_config
        ds_layer = ds_config.pop("layer", None)
    AttributeError: 'str' object has no attribute 'pop'
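A hedged reading of the traceback: it fails in ds_config.pop("layer", None) because one catalog entry resolved to a plain string instead of a mapping (e.g. my_dataset: pandas.CSVDataSet with no nested keys). A quick way to hunt for the offending entry:
Copy code
import yaml
from pathlib import Path

# Flag catalog entries whose value is not a mapping (dict).
for path in Path("conf").rglob("catalog*.yml"):
    entries = yaml.safe_load(path.read_text()) or {}
    for name, config in entries.items():
        if not isinstance(config, dict):
            print(f"{path}: entry '{name}' is a {type(config).__name__}, expected a mapping")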
  • s

    Sebastian Cardona Lozano

    12/29/2022, 2:29 PM
The same error appears when I try to use Jupyter notebooks. I'd really appreciate it if you could help me with these issues. Thanks! 🙂 (sorry for the multiple messages)
  • m

    meharji arumilli

    12/29/2022, 11:21 PM
Hi, I'm trying to save an intermediate PySpark object by specifying it in the catalog as below:
  • m

    meharji arumilli

    12/29/2022, 11:23 PM
    Copy code
    feature_engineering:
      type: MemoryDataSet
      copy_mode: assign
    
    preprocessed_data:
      type: spark.SparkDataSet
      filepath: data/${project}/05_model_input/df_preprocessed.parquet
      file_format: parquet
And this raises the error:
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o742.save.
: java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
Can someone hint at how to fix this issue?
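For what it's worth, PathOutputCommitProtocol lives in the optional spark-hadoop-cloud module, so this usually means a cloud committer is configured while the jar isn't on the classpath. A hedged sketch of pulling it in when the session is built (the version must match your Spark build — 3.3.1 here is an assumption):
Copy code
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("kedro")
    # provides org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
    .config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.3.1")
    .getOrCreate()
)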
  • s

    Sebastian Cardona Lozano

    12/30/2022, 2:37 PM
Hi. I'm trying to use kedro-viz in a Workbench in Vertex AI in GCP, but it doesn't work, maybe because it's a virtual machine with an Anaconda environment and no internet browser. Could anyone please tell me how to use it, or how to visualize the .json file of the pipeline? I just want to visualize the pipeline. Thanks
  • u

    user

    01/03/2023, 10:48 AM
Kedro: register a dataset from a board from the pins package. For my project I want to use a combination of Kedro for pipeline orchestration and pins for data and model versioning. I have some data which I stored on a board from the pins package. As I have multiple versions,

https://i.stack.imgur.com/ImXYi.png

I am not sure how to specify the catalog.yml file. In a simple Python script I would simply write:
Copy code
import pins

board = pins.board_folder("/path/to/my/folder/")
board.pin_read("df_all")
and would...
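One way to bridge the two (a sketch only — pins has no official Kedro dataset, and the PinsDataSet class below is hypothetical) is a minimal custom dataset wrapping the board:
Copy code
from typing import Any

import pins
from kedro.io import AbstractDataSet

class PinsDataSet(AbstractDataSet):
    """Hypothetical dataset that reads/writes a named pin on a folder board."""

    def __init__(self, board_path: str, pin_name: str):
        self._board = pins.board_folder(board_path)
        self._pin_name = pin_name

    def _load(self) -> Any:
        return self._board.pin_read(self._pin_name)

    def _save(self, data: Any) -> None:
        self._board.pin_write(data, self._pin_name)

    def _describe(self) -> dict:
        return {"pin_name": self._pin_name}

# catalog.yml entry (module path assumed):
# df_all:
#   type: my_project.datasets.PinsDataSet
#   board_path: /path/to/my/folder/
#   pin_name: df_all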
  • j

    Jo Stichbury

    01/03/2023, 2:55 PM
    Please could I ask for some help with the Kedro Viz example that uses Plotly? I've made some minor changes to the spaceflights tutorial example to add a reporting pipeline that uses Plotly express and Plotly graph objects (in order to improve the documentation in this area, as per this PR). I made a few changes to the example code in the original docs, so that there's a node each for express/graph objects, named uniquely. The graph objects node works perfectly and I see a plot.
    Copy code
    def compare_passenger_capacity_go(preprocessed_shuttles: pd.DataFrame):
    
        data_frame = preprocessed_shuttles.groupby(["shuttle_type"]).mean().reset_index()
        fig = go.Figure(
            [
                go.Bar(
                    x=data_frame["shuttle_type"],
                    y=data_frame["passenger_capacity"],
                )
            ]
        )
        
        return fig
    However, the code for Plotly express isn't working in a
    kedro run
    .
    Copy code
    def compare_passenger_capacity_exp(preprocessed_shuttles: pd.DataFrame):
        fig = px.bar(
            data_frame=preprocessed_shuttles.groupby(["shuttle_type"]).mean().reset_index(),
            x="shuttle_type",
            y="passenger_capacity",
        )
        return fig
    The error returned is
    Copy code
    PlotlyDataSet(filepath=/Users/jo_stichbury/Documents/GitHub/stichbury/kedro-projects/kedro-tutorial/data/08_reporting/shuttle_passenger_capacity_plot_exp.json, load_args={}, 
    plotly_args={'fig': {'orientation': h, 'x': shuttle_type, 'y': passenger_capacity}, 'layout': {'title': Shuttle Passenger capacity, 'xaxis_title': Shuttles, 'yaxis_title': Average 
    passenger capacity}, 'type': bar}, protocol=file, save_args={}, version=Version(load=None, save='2023-01-03T14.43.36.537Z')).
    Value of 'x' is not the name of a column in 'data_frame'. Expected one of [0] but received: shuttle_type
    Before the holiday, I did a fair amount of trial and error to re-write the function according to various stack overflow searches, but I couldn't find a way to fix it. 🚨 Please could I get some help from anyone who knows this code (maybe @Rashida Kanchwala?) or anyone who is familiar with Plotly to get the
    compare_passenger_capacity_exp
    method working? 🚨 My example is here so I hope it's just a matter of taking it and revising the method in the
    nodes.py
    file for the reporting pipeline. I should point out that it doesn't currently work on 0.18.4 (see this issue) so it's necessary to test against 0.18.3 (using the 'old' dataset notation) for now. Everything in my example is working apart from this node.
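A hedged explanation, based on how plotly.PlotlyDataSet is documented to behave: it builds the figure itself from the plotly_args in catalog.yml, so it expects the node to return the DataFrame rather than a px figure — the "Expected one of [0]" message suggests the figure object got coerced into a one-column frame. Under that assumption, the express node shrinks to returning the data (or the catalog entry switches to plotly.JSONDataSet, which saves a ready-made figure):
Copy code
import pandas as pd

def compare_passenger_capacity_exp(preprocessed_shuttles: pd.DataFrame) -> pd.DataFrame:
    # With plotly.PlotlyDataSet the node supplies only the data;
    # the bar chart itself comes from plotly_args in catalog.yml.
    return preprocessed_shuttles.groupby(["shuttle_type"]).mean().reset_index()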
  • s

    Sasha Collin

    01/03/2023, 9:17 PM
Hello team! Is it possible to call a subdataset from a partitioned dataset directly from a pipeline? I.e. doing something like this:
    Copy code
    node(func=func, inputs="partitioned_dataset_name:dataset_name", ....)
    thanks!
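As far as I know that syntax doesn't exist: a PartitionedDataSet loads as a dict of partition id → load callable, so the usual pattern is to pick the partition inside the node (names illustrative):
Copy code
from typing import Any, Callable, Dict

from kedro.pipeline import node

def pick_partition(partitions: Dict[str, Callable[[], Any]]) -> Any:
    # PartitionedDataSet gives {partition_id: load_callable}; call the one we need.
    return partitions["dataset_name"]()

node(func=pick_partition, inputs="partitioned_dataset_name", outputs="single_partition")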
  • t

    tingting wan

    01/04/2023, 4:25 PM
Hi Team, is it possible to parameterize the filepath in the catalog? I'm recursively loading CSV files, so it can't be hard coded.
    d
    j
    m
    • 4
    • 10
  • u

    user

    01/04/2023, 5:28 PM
How to avoid ECS Spot instance termination while processing user requests? I'm planning to run an ECS cluster with an ALB in front of spot instances. As an example: a user's request enters a container running on a spot instance, but before the response is returned, the spot instance is terminated. That will return an error, right? How can I resolve this type of issue? Is there any way to stop sending requests to the instance before it is gone?