# questions
  • Allen Ma (11/01/2022, 3:17 AM)
    Hey all: how do I get the session/context inside a node?
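    One possible workaround (a sketch, assuming a recent Kedro 0.18.x with the `after_context_created` hook, not an official recipe): nodes are meant to stay plain functions, so instead of fetching the session inside a node, capture the context in a hook and import it where needed.
    ```python
    # A minimal sketch: stash the context from a hook instead of reaching for
    # the session inside a node. Register the hook in settings.py via
    # HOOKS = (ContextCaptureHooks(),).
    from kedro.framework.hooks import hook_impl


    class ContextCaptureHooks:
        context = None  # populated when the session creates the context

        @hook_impl
        def after_context_created(self, context):
            ContextCaptureHooks.context = context


    def my_node(df):
        # Hypothetical node peeking at the captured context; use sparingly,
        # since it couples the node to framework state.
        params = ContextCaptureHooks.context.params
        return df
    ```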
  • Julian Waton (11/01/2022, 2:52 PM)
    Hello, I would like to use something like this dry runner https://kedro.readthedocs.io/en/stable/nodes_and_pipelines/run_a_pipeline.html#custom-runners but in a slightly different context: I would also like to check whether the data exists.
    • I am using multiple Kedro environments https://kedro.readthedocs.io/en/stable/kedro_project_setup/configuration.html for different model experiments.
    • When I do a partial run with `--from-nodes` and `--to-nodes` (to save time over a full pipeline run), I often discover that some data does not exist in my environment, but it takes a while to find out, since the code has to run first.
    • "Checking whether the data exists" is then a bit involved:
      ◦ either check whether it is an intermediate output of the provided pipeline,
      ◦ or check whether it can be read from the catalog using the `_exists` method of the abstract dataset class.
    Is this something that someone has already built, and is it a common use case?
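    A minimal sketch of the catalog-existence half of such a runner, assuming Kedro 0.18.x (the exact `_run` signature varies between releases, hence the `*args, **kwargs` passthrough):
    ```python
    # Fail-fast runner: before executing, verify that every "free" input of
    # the (sub-)pipeline can be found via the catalog.
    from kedro.runner import SequentialRunner


    class ExistenceCheckingRunner(SequentialRunner):
        def _run(self, pipeline, catalog, *args, **kwargs):
            # pipeline.inputs() lists datasets no node in this pipeline
            # produces; catalog.exists() delegates to each dataset's _exists().
            missing = sorted(
                name
                for name in pipeline.inputs()
                if not name.startswith("params:")
                and name != "parameters"
                and not catalog.exists(name)
            )
            if missing:
                raise ValueError(f"Datasets not found: {missing}")
            return super()._run(pipeline, catalog, *args, **kwargs)
    ```
    With `--from-nodes`/`--to-nodes`, the session hands the runner exactly the sub-pipeline being run, so the check covers just the datasets that partial run needs; it can be selected with `kedro run --runner`, as in the custom-runners doc linked above.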
  • Zirui Xu (11/01/2022, 3:55 PM)
    Hi team. Has anyone explored using Kedro with Spark Structured Streaming? What worked/did not work?
  • Jose Alejandro Montaña Cortes (11/01/2022, 4:11 PM)
    Hi everyone, I am developing a training pipeline with some parameters set by default. What are the best practices for running the same pipeline multiple times with different parameters? I am currently using a bash script that runs `kedro run` with the `--params` flag, because I want to record each experiment. However, it does not feel quite right, and I was wondering whether it is possible to run a pipeline multiple times with varying parameters via a hook implementation, or whether I should use a modular pipeline and register all the parameter sets in the YAML file (which does not seem very practical). Is there a better way of doing this kind of thing?
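    One alternative to the bash loop (a sketch, assuming Kedro 0.18.x; the `learning_rate` and `training` names are placeholders): drive the runs from a single Python script, creating a fresh session per parameter set via `extra_params`.
    ```python
    # A parameter sweep: one KedroSession per parameter set, so each run is
    # configured and recorded separately.
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    experiments = [{"learning_rate": 0.01}, {"learning_rate": 0.1}]

    metadata = bootstrap_project(Path.cwd())
    for extra_params in experiments:
        # extra_params is overlaid on conf/**/parameters.yml for this run only
        with KedroSession.create(
            metadata.package_name, extra_params=extra_params
        ) as session:
            session.run(pipeline_name="training")
    ```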
  • Lucie Gattepaille (11/02/2022, 1:01 PM)
    Hi everyone! I am new here and trying to get on board with Kedro. I am following the Spaceflights tutorial, specifically the part about adding plots to Kedro-Viz, and I am failing miserably 😅. When I add the node and pipeline code to the data_science code, I get an error saying that the data_science pipeline does not exist (??), and when I create a separate pipeline just for the visualisation, there is no error, but no graph shows up in Kedro-Viz either. The tutorial says the code should be pasted into `nodes.py` and `pipeline.py` respectively, but not which ones. Should it be added to the data_science files, or should we make a new pipeline? Notably, the example code also defines a `create_pipeline()` function, so it seemed wrong to just paste it into `data_science/pipeline.py`. I tried adding it as a new pipeline inside `create_pipeline` and returning something like `pipe = ds_pipeline_1 + ds_pipeline_2 + plotly_pipeline` at the end. No luck. Has anybody had experience adding visual outputs such as a Plotly graph?
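    A sketch of one way to wire this up, assuming the plot nodes live in their own modular pipeline (the `reporting` package name and imports are assumptions, not something the tutorial mandates): give the plots their own `create_pipeline()` and combine everything in `src/<package>/pipeline_registry.py`.
    ```python
    # src/spaceflights/pipeline_registry.py -- a minimal sketch; the package
    # and pipeline names are assumptions.
    from typing import Dict

    from kedro.pipeline import Pipeline

    from spaceflights.pipelines import data_processing as dp
    from spaceflights.pipelines import data_science as ds
    from spaceflights.pipelines import reporting  # hypothetical plot pipeline


    def register_pipelines() -> Dict[str, Pipeline]:
        data_processing = dp.create_pipeline()
        data_science = ds.create_pipeline()
        plots = reporting.create_pipeline()
        return {
            "__default__": data_processing + data_science + plots,
            "data_science": data_science,  # named lookups like --pipeline
            "reporting": plots,
        }
    ```
    For the figure to actually render in Kedro-Viz, the plot node's output also needs a catalog entry of type `plotly.PlotlyDataSet` or `plotly.JSONDataSet`; a plain in-memory output shows up as a dataset node but not as a plot.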
  • Vladimir Filimonov (11/02/2022, 1:56 PM)
    Hey everyone. I’m having trouble running `make test-no-spark` on the Kedro source code. Six tests are failing, and it seems like all of them are related to the parallel runner. Has anyone had a similar issue? My current hypothesis is that it all fails due to `ModuleNotFoundError: No module named 'tests.runner'`, but I am wondering what might have caused it. Here is the list of failed tests:
    ```
    FAILED tests/framework/cli/test_cli.py::TestRunCommand::test_run_successfully_parallel - assert not 1
    FAILED tests/framework/session/test_session_extension_hooks.py::TestNodeHooks::test_on_node_error_hook_parallel_runner - assert 0 == 2
    FAILED tests/framework/session/test_session_extension_hooks.py::TestNodeHooks::test_before_and_after_node_run_hooks_parallel_runner - assert 0 == 2
    FAILED tests/framework/session/test_session_extension_hooks.py::TestDataSetHooks::test_before_and_after_dataset_loaded_hooks_parallel_runner - as...
    FAILED tests/framework/session/test_session_extension_hooks.py::TestDataSetHooks::test_before_and_after_dataset_saved_hooks_parallel_runner - ass...
    FAILED tests/framework/session/test_session_extension_hooks.py::TestBeforeNodeRunHookWithInputUpdates::test_correct_input_update_parallel - asser...
    FAILED tests/framework/session/test_session_extension_hooks.py::TestBeforeNodeRunHookWithInputUpdates::test_broken_input_update_parallel - Failed...
    ```
    And here is the failing test in `test_cli.py`, reporting the ModuleNotFoundError:
    ```
    ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
    │ <string>:1 in <module>                            │
    │                                       │
    │ /Users/Vladimir_Filimonov/opt/anaconda3/envs/kedro-environment/lib/python3.8 │
    │ /multiprocessing/spawn.py:116 in spawn_main                 │
    │                                       │
    │  113 │  │  resource_tracker._resource_tracker._fd = tracker_fd      │
    │  114 │  │  fd = pipe_handle                        │
    │  115 │  │  parent_sentinel = os.dup(pipe_handle)             │
    │ ❱ 116 │  exitcode = _main(fd, parent_sentinel)               │
    │  117 │  sys.exit(exitcode)                         │
    │  118                                    │
    │  119                                    │
    │                                       │
    │ /Users/Vladimir_Filimonov/opt/anaconda3/envs/kedro-environment/lib/python3.8 │
    │ /multiprocessing/spawn.py:126 in _main                    │
    │                                       │
    │  123 │  │  try:                              │
    │  124 │  │  │  preparation_data = reduction.pickle.load(from_parent)   │
    │  125 │  │  │  prepare(preparation_data)                 │
    │ ❱ 126 │  │  │  self = reduction.pickle.load(from_parent)         │
    │  127 │  │  finally:                            │
    │  128 │  │  │  del process.current_process()._inheriting         │
    │  129 │  return self._bootstrap(parent_sentinel)              │
    ╰──────────────────────────────────────────────────────────────────────────────╯
    ModuleNotFoundError: No module named 'tests.runner'
    ```
  • Zirui Xu (11/02/2022, 4:28 PM)
    Hi team. Has anyone explored using Kedro with Spark Structured Streaming? What worked/did not work?
  • Earl Hammond (11/02/2022, 6:31 PM)
    Hi team, when running `kedro run` we see the following warnings. The pipeline runs fine; we are just wondering what Kedro is doing at this point: WARNING: Something went wrong with getting the username to send to the Heap. Exception: [Errno 6] No such device or address. WARNING: Failed to send data to Heap. Exception of type 'ConnectionError' was raised. Thanks in advance!
  • Allen Ma (11/03/2022, 5:59 AM)
    Hi all: what is the difference between `kedro run` and
    ```
    metadata = bootstrap_project(Path.cwd())
    with KedroSession.create(metadata.package_name) as session:
        session.run()
    ```
    `kedro run` succeeds, but the second way doesn’t.
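    One frequent source of differences (an assumption about this case, not a diagnosis): `kedro run` always starts from the project root in a fresh interpreter, while a hand-rolled session inherits the script's working directory and any live state, such as an existing Spark context. Pinning the project path explicitly removes one variable:
    ```python
    # A minimal sketch: pass an explicit project path instead of relying on
    # the script's working directory. The path below is a placeholder.
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path("/path/to/your/kedro/project")  # placeholder

    metadata = bootstrap_project(project_path)
    with KedroSession.create(
        metadata.package_name, project_path=project_path
    ) as session:
        session.run()
    ```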
  • Debanjan Banerjee (11/03/2022, 9:37 AM)
    Hi Kedro-Viz Picassos 🙂, I am on a machine with Kedro 0.18.3 and Kedro-Viz 5.1.1. When I try to run `kedro viz`, it fails with `Error: No such command 'viz'`. This has never happened to me before; any ideas what might be causing it? Did we change the way viz is supposed to be called?
  • Debanjan Banerjee (11/03/2022, 10:12 AM)
    UPDATE: after reinstalling, it fails with this error:
  • Debanjan Banerjee (11/03/2022, 10:12 AM)
    @Tynan @Merel @datajoely any ideas what might be wrong here ?
  • viveca (11/03/2022, 3:32 PM)
    Hi, I’m on Kedro 0.18.3, trying to override a templated variable in the data catalog with runtime configuration. So `catalog.yml` has `filepath: "${configurable_filepath}"` and I’d like to do `kedro run --params configurable_filepath:/path/to/file`. A similar question was asked previously https://linen-discord.kedro.org/t/2203662/Hi-all-I-have-a-beginner-question-on-Kedro-0-18-2-I-have-a-T with a custom TemplatedConfigLoader as the solution: https://github.com/noklam/kedro_gallery/blob/master/template_config_loader_demo/src/template_config_loader_demo/settings.py Is this the recommended approach, or is there a way of achieving this without writing a custom TemplatedConfigLoader that accesses private variables? Is there no other way to add all runtime parameters to the globals dict? I’d really like to avoid that if possible, in case a future Kedro update changes things.
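    For reference, a condensed sketch of the linked workaround, assuming Kedro 0.18.x, where config loaders receive the CLI's runtime params (constructor details may shift between patch releases); it lives in `settings.py`:
    ```python
    # settings.py -- sketch of a TemplatedConfigLoader that feeds
    # `kedro run --params ...` values into the templating globals.
    from kedro.config import TemplatedConfigLoader


    class RuntimeTemplatedConfigLoader(TemplatedConfigLoader):
        def __init__(self, conf_source, env=None, runtime_params=None, **kwargs):
            super().__init__(
                conf_source,
                env=env,
                runtime_params=runtime_params,
                globals_pattern="*globals.yml",
                # runtime params take precedence over values from globals.yml
                globals_dict=runtime_params or {},
            )


    CONFIG_LOADER_CLASS = RuntimeTemplatedConfigLoader
    ```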
  • Filip Panovski (11/03/2022, 3:35 PM)
    (Cross-posting from Discord since I saw the announcement there... a little too late.) Hello everyone! I'm relatively new to Kedro. I'm using it together with Dask for some data processing, and I have some issues/questions regarding data locality. I have a pipeline with three nodes whose datasets are loaded as follows: `dask.ParquetDataSet from s3 -> MemoryDataSet -> dask.ParquetDataSet to s3`. I run this pipeline from my local workstation for testing purposes. My Dask cluster is deployed on AWS EC2 (scheduler + workers), and they communicate privately. I noticed that on the last node, the `MemoryDataSet -> dask.ParquetDataSet to s3` step causes the data to be transferred to my local machine, where the Kedro pipeline is running, and then transferred back to S3. Needless to say, this introduces cost and lag and is not what I intended. Can I tell the workers to write this data directly to S3? If not, what is the intended way to do this? I read through the documentation; there is some very good information on running the pipeline as Step Functions or on AWS Batch, but that is not quite the deployment flow I had in mind. Is the pipeline intended to be run on the same infrastructure where the workers are deployed?
  • Seth (11/03/2022, 3:56 PM)
    Hi all, I want a config that reads from and writes to my local file system instead of S3 when the pipeline is executed locally. A `conf/dev/` folder already exists, containing the configuration for our cloud dev setup. I understand that I can manually create a `globals.yml` in my local folder, but then every developer has to create these files by hand for each of the use cases we maintain. With `conf/local/` being in `.gitignore`, what is the intended way to share local configurations?
  • Earl Hammond (11/03/2022, 6:16 PM)
    Hi team, does anyone have code examples showing nested namespaces with branching outputs? We are having trouble linking the namespace outputs together and getting them properly visualised in kedro-viz. The DS and reporting layers would sit under nested namespaces, e.g. `ds.ds1` and `ds.ds2`.
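    A minimal sketch of nested namespaces with a shared input, assuming Kedro 0.18.x modular pipelines (all function, node, and dataset names here are hypothetical); dots in the `namespace` argument produce the nesting that kedro-viz collapses:
    ```python
    from kedro.pipeline import Pipeline, node, pipeline


    def fit(model_input):
        return f"model fitted on {model_input}"  # stand-in for real training


    base = Pipeline([node(fit, inputs="model_input", outputs="model", name="fit")])

    # Two branches under the shared "ds" namespace. Mapping "model_input" to
    # itself keeps one un-namespaced dataset feeding both branches, which is
    # what links them together in kedro-viz.
    ds1 = pipeline(base, namespace="ds.ds1", inputs={"model_input": "model_input"})
    ds2 = pipeline(base, namespace="ds.ds2", inputs={"model_input": "model_input"})

    training = ds1 + ds2  # outputs land on ds.ds1.model and ds.ds2.model
    ```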
  • viveca (11/04/2022, 8:03 AM)
    Hi, are there any examples of using Kedro for inference, or is it mainly designed for training pipelines? The issue I have with inference is that the input varies from run to run, for instance a path on S3, even though it’s one entry in the data catalog. I would have liked to solve this by setting the `filepath` of the catalog entry as a parameter to `kedro run`, but according to my other discussion with @datajoely this is not allowed in Kedro by design. Has anyone else used Kedro this way, or should I just skip Kedro for inference and similar pipelines with varying inputs?
  • Allen Ma (11/04/2022, 1:55 PM)
    Hi all: when I use `session.run()` I get the following error message:
    ```
    22/11/04 21:25:59 ERROR SparkUI: Failed to bind SparkUI
    java.net.BindException: Failed to bind to /0.0.0.0:9016: Service 'SparkUI' failed after 16 retries (starting from 9000)! Consider explicitly setting the appropriate port for the service 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
    ```
    but when I use `kedro run`, it works.
  • Eduardo Lopez (11/04/2022, 4:23 PM)
    Hello everyone, I have a little conceptual question; maybe someone can help with it. I am creating a small computer vision project using Kedro. At the moment it consists of three main parts: preprocessing, object detection, and object tracking. I made a pipeline for each part. My initial idea was that the output of the object detection pipeline would be used as input to the tracking pipeline, but the only node in the tracking pipeline performs detection and tracking at the same time, so the detection pipeline becomes unnecessary. My doubt is what the correct flow would be. I am considering these approaches:
    1. Leave them as they are, with a pipeline for each part, and find a way for the tracking pipeline to do only the tracking and not the detection as well.
    2. Have a single pipeline with one node for detection and one node for tracking.
    3. Have a single pipeline with a single node that performs detection and tracking together.
    If anyone has a suggestion for the best flow to follow, thank you!!
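    For what it's worth, option 2 keeps the DAG most explicit. A sketch (all function and dataset names hypothetical) where detection and tracking stay separate nodes, so the intermediate detections remain a catalog dataset you can persist, inspect, or reuse:
    ```python
    from kedro.pipeline import Pipeline, node


    def detect_objects(frames):
        """Return per-frame detections, e.g. bounding boxes."""
        return [f"detections for {frame}" for frame in frames]


    def track_objects(detections):
        """Link detections across frames into object tracks."""
        return [f"track from {d}" for d in detections]


    def create_pipeline() -> Pipeline:
        return Pipeline(
            [
                node(detect_objects, inputs="preprocessed_frames", outputs="detections"),
                node(track_objects, inputs="detections", outputs="tracks"),
            ]
        )
    ```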
  • Jonathan Javier Velásquez Quesquén (11/06/2022, 6:09 PM)
    Hello everyone!! :) I can create a new Kedro project using `kedro new` via the command line. Can I replicate the same thing from a Python file / Jupyter notebook? 🤔
  • Jonathan Javier Velásquez Quesquén (11/06/2022, 6:16 PM)
    Sorry! I just figured out the way to do that. 😅
  • Sean Westgate (11/07/2022, 3:08 PM)
    Hi team, are there any projects or examples showing how to use `kedro build-docs` effectively? The Spaceflights tutorial is pretty minimalist; it works in that you can see the pipelines and nodes, but how is this used to document parameters or inputs and outputs in greater detail? Thank you!
  • user (11/07/2022, 5:48 PM)
    Kedro: how do I update a dataset in a Kedro pipeline, given that a dataset cannot be both input and output of a node (the pipeline must be a DAG)? In a Kedro project, I have a dataset in catalog.yml that I need to grow by appending a few lines each time I call my pipeline.
    ```
    # catalog.yml
    my_main_dataset:
      type: pandas.SQLTableDataSet
      credentials: postgrey_credentials
      save_args:
        if_exists: append
      table_name: my_dataset_name
    ```
    However, I cannot just rely on append in my catalog parameters, since I need to make sure I do not insert already-existing dates into my dataset, to avoid duplicates. I also cannot create a node taking my...
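    A sketch of one common workaround (an assumption, not an official Kedro recipe): register the same table under two catalog names, one read-only input and one append-only output, and deduplicate inside the node so the pipeline stays a DAG. The `date` column and all names below are hypothetical.
    ```python
    import pandas as pd


    def append_new_rows(existing: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
        """Return only the incoming rows whose date is not already stored."""
        return incoming[~incoming["date"].isin(existing["date"])]


    # catalog.yml (sketched as comments to keep one code language):
    #   my_main_dataset:        # input entry, plain read
    #     type: pandas.SQLTableDataSet
    #     credentials: postgrey_credentials
    #     table_name: my_dataset_name
    #   my_main_dataset_out:    # output entry, same table, append on save
    #     type: pandas.SQLTableDataSet
    #     credentials: postgrey_credentials
    #     table_name: my_dataset_name
    #     save_args:
    #       if_exists: append
    #
    # node(append_new_rows, inputs=["my_main_dataset", "incoming_rows"],
    #      outputs="my_main_dataset_out")
    ```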
  • user (11/08/2022, 7:58 AM)
    How can I register the Kedro data catalog programmatically in Kedro 0.18? For various reasons (mainly the ability to construct file paths dynamically), I like to define the data catalog programmatically rather than defining datasets in a YAML file, e.g. `DataCatalog({"products": ParquetDataSet(filepath=f"{PREFIX}/products.parquet"), ...})`. In Kedro 0.17 there was an easy way to...
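    In 0.18, one documented extension point that still allows this is the `after_catalog_created` hook; a minimal sketch (`PREFIX` and the dataset name are placeholders):
    ```python
    # Register the hook in settings.py: HOOKS = (ProgrammaticCatalogHooks(),)
    from kedro.extras.datasets.pandas import ParquetDataSet
    from kedro.framework.hooks import hook_impl

    PREFIX = "s3://my-bucket/data"  # hypothetical


    class ProgrammaticCatalogHooks:
        @hook_impl
        def after_catalog_created(self, catalog):
            # Datasets added here behave like entries loaded from catalog.yml.
            catalog.add("products", ParquetDataSet(filepath=f"{PREFIX}/products.parquet"))
    ```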
  • Safouane Chergui (11/08/2022, 9:38 AM)
    Hello everyone, is there a way to change the default versioning string? Instead of having the full YYYY-MM-DDThh.mm.ss.sssZ timestamp appended to dataset names, I’d like to append just part of it (e.g. YYYY-MM-DD). Thanks
  • Jordan (11/08/2022, 12:26 PM)
    Can I create a pipeline with a hook or does this always need to be done via the registry?
  • John Melendowski (11/09/2022, 1:51 AM)
    Are there any plans to release a Kedro light, or a version without the project management? I really like the package, but I have a hard time putting it into production with my conda environments at work, because Kedro has some heavy requirements, like specific `git` versions (which I assume are for the project-management features Kedro supplies) or `cookiecutter`, which needs to be downgraded from the latest Anaconda release.
  • Yuchu Liu (11/09/2022, 12:25 PM)
    Hello #C03RKP2LW64! I am running into an issue with `kedro jupyter notebook`. When I launch it from the terminal, in a virtual environment I set up for Kedro, it tries to load the `kedro.ipython` extension from the wrong version of Python. As a result, I don't get any Kedro-specific commands in the Jupyter notebook. Here is the warning I see when loading a notebook:
    ```
    [I 13:20:49.212 NotebookApp] Kernel started: 21bb83e7-2e5f-4463-a43e-23744ec3ed02, name: kedro_nfr_transactions
    [IPKernelApp] WARNING | Error in loading extension: kedro.ipython
    Check your config files in /Users/yuchu_liu/.ipython/profile_default
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/logging/config.py", line 544, in configure
        formatters[name] = self.configure_formatter(
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/logging/config.py", line 676, in configure_formatter
        c = _resolve(cname)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/logging/config.py", line 90, in _resolve
        found = __import__(used)
    ModuleNotFoundError: No module named 'pythonjsonlogger'
    ```
    I have tried loading Kedro from IPython in the terminal using the following, and it works perfectly fine:
    ```
    %load_ext kedro.extras.extensions.ipython
    %reload_kedro .
    ```
    Does anyone know how to debug this issue? Thank you!
  • Luis Gustavo Souza (11/09/2022, 12:26 PM)
    Hello, everyone! In a single `kedro run` command, I need to run a pipeline multiple times using different snapshot dates. Does anyone know how I can achieve that? For example: run the DS and DE pipelines for 2022-01-01, then for 2022-02-01, then for 2022-03-01.
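    This has the same shape as the parameter sweep sketched under Jose's question above: an outer Python loop with a fresh session per date. A condensed variant (the `snapshot_date` parameter name is hypothetical):
    ```python
    # One KedroSession per snapshot date; assumes Kedro 0.18.x and a
    # "snapshot_date" entry in parameters.yml (hypothetical name).
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    metadata = bootstrap_project(Path.cwd())
    for snapshot_date in ["2022-01-01", "2022-02-01", "2022-03-01"]:
        with KedroSession.create(
            metadata.package_name, extra_params={"snapshot_date": snapshot_date}
        ) as session:
            session.run()  # or session.run(pipeline_name=...) per DE/DS run
    ```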
  • Rosh (11/09/2022, 2:41 PM)
    Hello everyone, does anyone have experience running `kedro-airflow` with Spark on GCP? We want to understand how Kedro would work with Spark on GCP Composer and whether any integration for this is already available. We checked this GitHub issue but couldn't find anything further: https://github.com/quantumblacklabs/kedro-airflow/issues/65