# questions
  • Harsh Maheshwari
    04/01/2023, 9:02 PM
    Hi everyone, I am trying to run Kedro with Ray; given the potential of combining Ray's distributed execution with Kedro pipelines, an integration would help a lot. Has someone tried something similar? I need some guidance on how to proceed!
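    For reference, a rough sketch of what a minimal Kedro-on-Ray execution loop could look like (this is not an official integration; it runs nodes one at a time as Ray tasks and keeps intermediate results in a plain dict, assuming every free input, including params, is already present in that dict):

        import ray
        from kedro.pipeline import Pipeline

        ray.init()

        @ray.remote
        def _run_node(node, inputs):
            # Node.run() executes the wrapped function and returns {output_name: value}.
            return node.run(inputs)

        def run_with_ray(pipeline: Pipeline, data: dict) -> dict:
            # Pipeline.nodes is topologically sorted, so every dependency of a node
            # has been computed by the time the node is submitted.
            for node in pipeline.nodes:
                inputs = {name: data[name] for name in node.inputs}
                data.update(ray.get(_run_node.remote(node, inputs)))
            return data

    A real integration would more likely live in a custom runner and submit independent nodes concurrently instead of blocking on each task, but the loop above shows the basic handshake between the two libraries.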
  • pb95
    04/03/2023, 10:25 AM
    Hello everyone, I have been trying to add CLI parameters with "kedro run --pipeline mypipeline --params message:hello" that can be accessed inside my Dataset class added to the "catalog.yml". I can retrieve these parameters in "pipeline.py" using 'params:message', but I can't seem to find a way to pass them to my catalog / custom AbstractDataset class. Would you have any pointers?
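    For anyone searching later, one workaround that gets suggested for this is a hook that forwards the runtime params onto the dataset instance before the run starts; the dataset name and attribute below are placeholders, not an official recipe:

        from kedro.framework.hooks import hook_impl

        class RuntimeParamsHooks:
            @hook_impl
            def before_pipeline_run(self, run_params, catalog):
                # Whatever was passed via `--params message:hello` ends up in
                # run_params["extra_params"] as {"message": "hello"}.
                extra = run_params.get("extra_params") or {}
                if "message" in extra:
                    # Reaches into a "private" API and assumes the custom dataset
                    # exposes a `message` attribute it reads at load/save time.
                    catalog._get_dataset("my_custom_dataset").message = extra["message"]

    The hook class still needs to be registered in the project's settings.py via the HOOKS tuple.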
  • Damian Fiłonowicz
    04/03/2023, 11:11 AM
    Hey, is there any way to use an IncrementalDataSet with a SQL query to incrementally fetch partitions from RDBMS tables, or does it work only with object-storage partitioned data?
  • Damian Fiłonowicz
    04/03/2023, 11:12 AM
    Hey, does IncrementalDataSet support the SCD/merge statement to update/insert/delete specific rows from its (previous) partitions?
  • Dawid Bugajny
    04/03/2023, 12:02 PM
    Hello, is there any example of how to combine FastAPI with Kedro and handle the input and output of a Kedro pipeline?
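    A minimal sketch of one way to wire the two together (the project path and pipeline name are placeholders, and creating a KedroSession per request is the simplest option rather than the fastest):

        from pathlib import Path

        from fastapi import FastAPI
        from kedro.framework.session import KedroSession
        from kedro.framework.startup import bootstrap_project

        PROJECT_PATH = Path("/path/to/kedro-project")  # placeholder
        bootstrap_project(PROJECT_PATH)

        app = FastAPI()

        @app.post("/run")
        def run_pipeline(payload: dict):
            # The request body is handed to Kedro as runtime params; whatever the
            # pipeline returns as free (non-persisted) outputs comes back from
            # session.run() and must be JSON-serialisable to be returned here.
            with KedroSession.create(project_path=PROJECT_PATH, extra_params=payload) as session:
                return session.run(pipeline_name="inference")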
  • Dawid Bugajny
    04/03/2023, 12:23 PM
    Hi, I have a problem with creating an API and deploying it with Kubeflow. https://docs.kedro.org/en/0.18.1/deployment/kubeflow.html says "All node input/output DataSets must be configured in catalog.yml and refer to an external location (e.g. AWS S3); you cannot use the MemoryDataSet in your workflow". Let's assume that a request has to fire two nodes, and there is data which is the output of the first node and the input of the second one. If several requests arrived at the same time, there is a chance that one request would use data from another (if only one file is used to save data between nodes). What would be the best solution for this problem?
  • Guillaume Latour
    04/03/2023, 1:28 PM
    Hello everyone, I am currently trying to launch a pipeline with the dask runner in a sort of distributed fashion (it's all on the same machine but there are several workers). I am facing a not so enjoyable issue which is that my dask workers are killed by the os (likely oom) AND I can't retrieve any logging information. So the logs are "displayed" on the workers side (is it possible to collect them from the workers and output/store/centralize them?) and the only message that I get is "KilledWorker: Attempted to run task <id task> on 3 different workers, but all those workers died while running it. The last worker was <ip:port>." But since the worker has been killed, (and a new one took its place) there's no way of reading those valuable pieces of information. How do you guys usually debug your app when using a dask runner? I've found a dask deployment page and a debug page in the kedro documentation but not a page that is merging those two, have I missed it? I must admit that I am kind of new to dask deployment so if it's a trivial subject, may I ask to be oriented towards pertinent documentation in that regard? Thank you all!
  • Olivia Lihn
    04/03/2023, 10:36 PM
    Hi everyone! We are exploring Unity Catalog in Databricks. I have successfully implemented the catalog using SparkDataSet and S3 buckets, but was wondering if anyone has implemented tables in Unity Catalog inside the kedro catalog (it should be similar to SparkHive, or even Spark... but wanted to check).
  • Balachandran Ponnusamy
    04/03/2023, 10:37 PM
    Hi Kedro team... We ran a forecasting pipeline (runs 20 hours) with around 231 nodes and it failed around the 198th node. Now I want to run only the remaining 33 nodes, but the error/info logs don't provide a list of remaining nodes to rerun. Can you please help with this? We definitely do not want to run this forecasting pipeline again for 20 hours.
  • Zemeio
    04/04/2023, 4:52 AM
    Hey guys, does kedro support python 3.11? Is there any plan on supporting 3.11 soon?
  • FlorianGD
    04/04/2023, 7:55 AM
    Hi! Pandas just released 2.0.0, which looks quite promising. Do you know if/when Kedro will support this version? Could this be included in the next 0.19 release?
  • Melvin Kok
    04/04/2023, 10:58 AM
    Hi Kedro team/users! I found two unusual behaviours with kedro and would like to ask if anyone else is facing the same issues:
    1. The after_catalog_created hook is triggered before after_context_created. However, this is fixed when kedro-telemetry is uninstalled (I have raised an issue here).
    2. kedro-telemetry is still sending information about the data catalog, the default pipeline etc. to heapanalytics.com even if consent is set to false. Under KedroTelemetryProjectHooks, it is calling _send_heap_event without checking for consent.
  • Gary McCormack
    04/05/2023, 10:44 AM
    Hi Everyone, I'm hoping that you'll be able to answer my question. My current use case is a pipeline that does something like the following. I have massive datasets that contain information on 'trades':
    • node1: extracts data that depends on a timestamp range given by params in the parameter.yaml config file
    • node2: given this instance of the data, extracts the user_ids of those who 'traded' during this timestamp range
    • nodes 3, 4, 5, etc.: the subsequent nodes would then depend on which user_ids were found in the datasets
      ◦ user_id 123 has its own node
      ◦ user_id 789 has its own node
      ◦ etc.
    This last step is where I'm running into my issues 🫠 I suppose the best and most succinct way of asking this is: can one node be used to create subsequent nodes dynamically? If yes, could anyone explain how to do this, or potentially point me to some documentation? If no, I would be very grateful for any workaround that I might be able to use. Thanks for any help in advance 🙂
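    Kedro pipelines are static DAGs, so a node cannot create further nodes at run time; the usual workaround is to generate the per-user nodes when the pipeline object is built, from a list that is known up front (e.g. read from parameters). A sketch with made-up names:

        from kedro.pipeline import Pipeline, node

        def process_user_trades(trades, user_id):
            # Placeholder for whatever per-user logic each node should run.
            return trades[trades["user_id"] == user_id]

        def create_pipeline(user_ids=("123", "789")) -> Pipeline:
            return Pipeline(
                [
                    node(
                        func=lambda trades, uid=user_id: process_user_trades(trades, uid),
                        inputs="filtered_trades",
                        outputs=f"trades_user_{user_id}",
                        name=f"process_user_{user_id}",
                    )
                    for user_id in user_ids
                ]
            )

    If the list of user_ids is only known after node2 has run, the alternatives are a single node that loops over the ids internally, or a two-stage run where the first pipeline writes the id list somewhere the second pipeline's definition can read it from.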
  • Olivier Ho
    04/05/2023, 1:22 PM
    Hello, what is the best way to access credentials during a node run? I saw https://github.com/kedro-org/kedro/issues/575, but get_current_session has been deprecated in favour of hooks.
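    One pattern that comes up for this (sketched below; the credential key and dataset name are made up) is to read the credentials in a hook and publish them to the catalog, so that a node can simply declare them as an input:

        from kedro.framework.hooks import hook_impl

        class ExposeCredentialsHooks:
            @hook_impl
            def after_context_created(self, context):
                # ConfigLoader.get() is the 0.18.x way of reading conf/*/credentials*.
                self._credentials = context.config_loader.get("credentials*", "credentials*/**")

            @hook_impl
            def after_catalog_created(self, catalog):
                # Nodes can now list "my_api_credentials" among their inputs.
                catalog.add_feed_dict(
                    {"my_api_credentials": self._credentials.get("my_api", {})},
                    replace=True,
                )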
  • Franco Zentilli
    04/05/2023, 4:12 PM
    Hello! 🙂 I'm working on a project using kedro-glass / a hydra config loader with the following structure: 3 pipelines (master + model + prediction). For some reason kedro viz is not working properly: it displays all the nodes in one row, which is very weird. There are no errors in the terminal when it is executed, and my kedro-viz version is 6.0.1. Does anyone know what could be happening? It is as if kedro viz is not recognising the connections between nodes in the pipelines.
  • William Caicedo
    04/05/2023, 8:10 PM
    From the command line, is there a way to specify a catalog-wide dataset version at run time?
  • Jannik Wiedenhaupt
    04/05/2023, 8:54 PM
    How do you write unit tests for kedro nodes? I am doing data transformations in my pipeline and would like to make sure that they are done correctly. Is it even necessary to write tests for that?
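    Since Kedro nodes wrap plain Python functions, the usual approach is to unit test those functions directly with pytest and small hand-made inputs; clean_trades below is a made-up example node function:

        import pandas as pd

        def clean_trades(trades: pd.DataFrame) -> pd.DataFrame:
            # Example node function: drop rows that have no price.
            return trades.dropna(subset=["price"]).reset_index(drop=True)

        def test_clean_trades_drops_missing_prices():
            raw = pd.DataFrame({"price": [1.0, None, 3.0], "user_id": ["a", "b", "c"]})
            cleaned = clean_trades(raw)
            assert len(cleaned) == 2
            assert cleaned["price"].notna().all()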
  • Dawid Bugajny
    04/06/2023, 7:39 AM
    Hello, I'm using kedro-mlflow and I'm having some problems with the UI. I tried several times to shut down the UI (by killing the gunicorn processes) but it didn't help (after a few seconds the UI started working again). Do you know any other way to do it? Maybe there is a built-in kedro-mlflow command which I didn't find in the documentation?
  • Matthias Roels
    04/06/2023, 12:22 PM
    Quick question: I noticed the implementation of _load for the pandas parquet dataset is different in kedro.extras.datasets vs the kedro-datasets plugin! The difference is significant, as the one in kedro extras can be extremely slow (2 hours compared to 10 seconds to load a dataset). In our case, we had a dataset on S3 generated by a Spark job (hence a "directory" of (snappy) parquet files with a _SUCCESS file) with 137808 rows and 6410 columns. With that dataset, I could validate that

        pq.ParquetDataset(load_path, filesystem=self._fs).read(**self._load_args)

    indeed took longer than 15 minutes (after that, I ran out of patience, since pd.read_parquet() on the same dataset was loading within 10 seconds). So the question is: should we already switch from kedro extras datasets to the new kedro-datasets plugin to solve this issue? Is that plugin already ready to use with the current kedro version (v0.18.x)? And can we then simply remove the pandas extras from our requirements?
  • Guillaume Latour
    04/06/2023, 2:46 PM
    Hello everyone, I have two pipelines that use the same node, but when I look at the structure with kedro-viz, only the tags of the first pipeline have been applied to this node. Is it because of how I am registering my pipelines (using the built-in kedro.framework.project.find_pipelines)? Is this the desired behaviour? How can I get all the tags to show in kedro viz (without manually labelling this node with the tags of my 2 pipelines)?
  • Guilherme Parreira
    04/06/2023, 5:17 PM
    Hi Everyone! Can we specify the timezone for kedro to save the files? I ask that because when we use catalog.save() it creates data in the default UTC timezone. Thanks!
  • Ian Whalen
    04/07/2023, 1:07 PM
    Question on parameter namespacing. Say I have the following in my registry:

        return {"__default__": pipeline(Pipeline([node(foo, "params:value", None)]), namespace="bar")}

    and in my parameters yaml:

        value: 1

    I'll get an error that says "Pipeline input(s) {'params:bar.value'} not found in the DataCatalog" when I kick off a kedro run. Rather than defining a bar.value (and so on for each namespace), is there a way to use defaults as a fallback and only use bar.value if it appears in my parameters? I know I could do pipeline(…, parameters={"params:value": "params:value"}), but that would always use the default value, rather than only using it when it's defined.
  • Roman Shevtsiv
    04/07/2023, 4:08 PM
    What would be the best way to reuse a part of a pipeline, say, a hundred times with different input parameters? Example:
    • Pipeline parses a list of available datasets (web links) - Node A
    • Pipeline filters the list based on some fixed criteria to determine the final list of datasets to load and process - Node B
    • For each link in the list:
      ◦ Load dataset - Node C1
      ◦ Clean dataset - Node C2
      ◦ Write processed dataset to SQL server - Node C3
    • Provide a summary of the loaded data - Node D
    I understand that nodes C1-C3 can be organized as a modular pipeline and then reused manually with different input parameters. The problem is that the list of links is dynamic, and I need to reuse this modular pipeline C1-C3 in a loop depending on the result of node B. Is there a proper way to do the above using Kedro? Is this even supported, given that the root pipeline is no longer a DAG if we include all the nodes from C1-C3?
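    For what it's worth, if the list of links can be resolved before the run starts (e.g. from parameters or a lightweight lookup), the C1-C3 block can be instantiated once per link with namespaces at pipeline-definition time; a sketch with placeholder functions:

        from kedro.pipeline import Pipeline, node, pipeline

        def load_dataset(link):
            ...  # Node C1 placeholder

        def clean_dataset(raw):
            ...  # Node C2 placeholder

        def write_to_sql(clean):
            ...  # Node C3 placeholder

        def create_c_pipeline() -> Pipeline:
            return pipeline(
                [
                    node(load_dataset, inputs="params:link", outputs="raw", name="load"),
                    node(clean_dataset, inputs="raw", outputs="clean", name="clean"),
                    node(write_to_sql, inputs="clean", outputs="processed", name="write"),
                ]
            )

        def create_pipeline(link_names) -> Pipeline:
            # One namespaced copy per link; namespacing prefixes the datasets and
            # expects a `<name>.link` entry per namespace in parameters.yml.
            copies = [pipeline(create_c_pipeline(), namespace=name) for name in link_names]
            return sum(copies, Pipeline([]))

    A list that only materialises mid-run as the output of node B cannot drive this, because the DAG is fixed before execution; in that case the loop over links has to live inside a single node.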
  • Aaron Niskin (amniskin)
    04/07/2023, 4:48 PM
    Hey all. Is there a way to get a single node to execute in, say, aws batch?
  • Ryoji Kuwae Neto
    04/07/2023, 9:05 PM
    Is there a good way of using Kedro with structured streaming? Customizing the runner, or something along those lines? I just started studying streaming and it would be nice to have some good scaffolding for this. Also, keeping the overall project structure similar between batch and streaming would help the learning curve, as people would only see one single structure.
  • Iñigo Hidalgo
    04/10/2023, 3:02 PM
    Bit of a non-technical question, more on the pipeline design side. Discussing with a coworker, we realized we have very different approaches to designing pipelines.
    I tend to reduce individual nodes to the smallest logical increment: e.g. in my feature engineering pipeline I have one node which generates time-based features (feature engineering on the datetime index), then another node which generates synthetic variables, then another which generates lags, another for feature aggregations, etc., and I pass the datasets between these nodes as memory datasets.
    My coworker tends to make these nodes larger, basically encompassing an entire step, for example intermediate (or raw) to primary in one node, so performing all the different cleaning steps in one node, then all the feature engineering I described above in another node; and he doesn't use memory datasets frequently, as usually the inputs and outputs of each node will be in a form he wants to persist.
    In my view, each of these approaches has different benefits and drawbacks.
    Pros:
    - for mine, it is easier to see from kedro-viz what is going on and how different nodes depend on other steps
    - for his, the pipeline view is a lot cleaner, and if you want to dig, you have the code available
    Cons:
    - mine: my pipeline definitions start to become a bit unwieldy as their size grows, and refactoring them becomes more difficult, since not having these dependencies defined "in code" means IDEs and linters can't help me spot issues
    - his: there is much less visibility into the different steps performed, and if, for example, down the line we want to persist an intermediate dataset, it's basically impossible.
    I was wondering if there has been any prior discussion regarding this, either internal within QB or some article or documentation I could refer to, and I was curious to hear the thoughts of other people who work with Kedro daily, particularly with pipelines on the bigger side.
  • Tim
    04/11/2023, 2:45 AM
    Hi. I am using modular pipelines and I need to surface a list of parameters in my pipeline registry to loop through to create the pipelines I am running modularly. The only solution I can think of is to run a hook that makes an SQL call to get the list of parameters, then updates parameters.yml with the parameters I need; then import parameters.yml into my pipeline registry and pull the parameters I need to create the pipelines. Before I get too deep: is this even possible (pretty sure it is), and also, is there a simpler way? I am already making the SQL call from a node. But I don't see a way to run nodes prior to creating a second pipeline once the data is in memory. Thanks for your help.
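    One simpler route than rewriting parameters.yml from a hook might be to resolve the list directly in the registry, since register_pipelines() runs before any node does; a rough sketch with a placeholder query, connection string and template pipeline:

        import sqlalchemy
        from kedro.pipeline import Pipeline, node, pipeline

        def _fetch_keys():
            engine = sqlalchemy.create_engine("postgresql://...")  # placeholder DSN
            with engine.connect() as conn:
                rows = conn.execute(sqlalchemy.text("SELECT key FROM model_configs"))  # placeholder query
                return [row[0] for row in rows]

        def _template() -> Pipeline:
            # Placeholder modular pipeline that gets copied once per key.
            return pipeline([node(lambda x: x, inputs="params:input", outputs="output", name="copy")])

        def register_pipelines() -> dict:
            copies = [pipeline(_template(), namespace=key) for key in _fetch_keys()]
            return {"__default__": sum(copies, Pipeline([]))}

    The trade-off is that every kedro command that builds the pipeline object (including kedro viz) will hit the database.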
  • Christianne Rio Ortega
    04/11/2023, 3:38 AM
    Hi All, another noob question... We're currently using Databricks as the compute engine, and we would like to use Azure secrets to grab data from our Snowflake instance. Has anyone here done this? I would really appreciate the help. I've seen code snippets, but I'm not sure whether it should be a node or a function. TIA!
  • Dotun O
    04/11/2023, 1:25 PM
    Hey all, I am creating custom functionality to re-run all missing nodes when a node fails. I am currently trying to access the done_nodes set from runner.py within my custom hook's on_node_error function; is there an easy way for me to access done_nodes? Here is where done_nodes is saved, for reference: https://github.com/kedro-org/kedro/blob/fa8c56fa2e510e6a449f5ac7356f76c167be978a/kedro/runner/sequential_runner.py#L71
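    The runner keeps done_nodes in a local variable, so it is not exposed to hooks; one workaround is to track completion yourself in the same hooks class and compute the remainder when a node fails, roughly like this:

        from kedro.framework.hooks import hook_impl

        class TrackRemainingNodesHooks:
            def __init__(self):
                self._all_nodes = set()
                self._done = set()

            @hook_impl
            def before_pipeline_run(self, pipeline):
                self._all_nodes = {n.name for n in pipeline.nodes}

            @hook_impl
            def after_node_run(self, node):
                self._done.add(node.name)

            @hook_impl
            def on_node_error(self, error, node):
                remaining = self._all_nodes - self._done
                # Hand this off to whatever re-run mechanism you are building.
                print(f"Failed at {node.name}; nodes not yet run: {sorted(remaining)}")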
  • Gary McCormack
    04/11/2023, 2:08 PM
    Hi again everyone, I've another quick question. I have a hook that runs before a specific node. This hook checks the data from one of the previous steps and determines what the correct value of a certain param should be for the upcoming node. To begin with, the new params are defaulted as empty in the conf/base/parameters.yaml file:

        param_1: foo
        param_2: bar
        my_nice_new_params:

    If I have the following toy code block:

        @hook_impl
        def before_node_run(..args.., catalog: DataCatalog, ..more_args..):
            print(catalog._get_dataset('params:my_nice_new_params'))
            new_param1, new_param2 = run_some_super_cool_logic()
            catalog.add_feed_dict(
                {'params:my_nice_new_params': [new_param1, new_param2]},
                replace=True,
            )
            print(catalog._get_dataset('params:my_nice_new_params'))

    then the printed stdout will be something like:

        MemoryDataSet(data=<NoneType>)
        MemoryDataSet(data=<list>)

    which is what I would have hoped for. However, when the node itself is run and accesses 'params:my_nice_new_params', the original None value remains. Is there a step that I'm missing that saves the most recent instance of the catalog?
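    For context on why the catalog edit does not reach the node: by the time before_node_run fires, the runner has already loaded that node's inputs, so replacing the catalog entry only affects later loads. The hook spec does allow overriding inputs directly, though: if before_node_run returns a dict, those entries replace the matching node inputs for that run. A sketch reusing the names from the snippet above (the node name is illustrative):

        from kedro.framework.hooks import hook_impl

        class OverrideParamHooks:
            @hook_impl
            def before_node_run(self, node):
                if node.name == "my_target_node":  # illustrative node name
                    new_param1, new_param2 = run_some_super_cool_logic()  # defined elsewhere
                    # The returned mapping overrides the node's inputs for this run only.
                    return {"params:my_nice_new_params": [new_param1, new_param2]}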