# questions
  • Elias (10/18/2022, 1:00 PM)
    Hey Team, I was wondering if there is an elegant way to overwrite parameters dynamically. I am instantiating a pipeline 12 times, but each instance needs to run with a different value of a parameter called date_max, e.g. "07/01/22" for the first one, with each subsequent one decremented by a month, e.g. "06/01/22", and so on. The pipelines are generated dynamically from a template, and ideally I would just pass in the adjusted parameter.
  • Elias (10/18/2022, 1:00 PM)
    parameters.yml:
    ```yaml
    t_-0:
      filters:
        date_max: 2022/07/01
    t_-1:
      filters:
        date_max: 2022/06/01
    ```
    I want to avoid doing this, as I would need to pass 12 or more variables on each invocation, when they all actually depend on the first one.
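    A minimal sketch of one way to avoid spelling out all twelve entries: keep a single date_max value and derive the other eleven dates in Python when the pipelines are instantiated. The helper name and the month arithmetic are illustrative, not from the thread, and it assumes a day that exists in every month (e.g. the 1st).
    ```python
    from datetime import date

    def decrement_months(start, count):
        """Return `count` date strings (YYYY/MM/DD), starting at `start`
        and stepping back one month at a time."""
        year, month, day = (int(part) for part in start.split("/"))
        dates = []
        for _ in range(count):
            dates.append(date(year, month, day).strftime("%Y/%m/%d"))
            month -= 1
            if month == 0:
                year, month = year - 1, 12
        return dates

    # Build the per-instance parameter overrides from the single base date.
    overrides = {
        f"t_-{i}": {"filters": {"date_max": d}}
        for i, d in enumerate(decrement_months("2022/07/01", 12))
    }
    ```
    The overrides could then be fed to each pipeline instance however the project already wires its template (for example via extra_params or namespaced parameters).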
  • user (10/19/2022, 2:38 PM)
    How can I run `catalog.load` in a non-IPython context? In IPython I can run data = catalog.load('my_dataset') to load a dataset specified as 'my_dataset' in the catalog.yml file. What's the equivalent in a Python script? What do I need to import?
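    A sketch of how this can be done from a plain Python script, assuming it is run from the project root and a 0.18-era Kedro API (adjust for other versions):
    ```python
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path.cwd()          # assumes the script runs from the project root
    bootstrap_project(project_path)    # reads pyproject.toml and registers the project

    with KedroSession.create(project_path=project_path) as session:
        context = session.load_context()
        data = context.catalog.load("my_dataset")
    ```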
  • Sean Westgate (10/19/2022, 3:00 PM)
    Hi Team, working my way through the spaceflights tutorial I found that running `kedro build-docs` would pull down the latest Jinja2 version, 3.1.2, which then caused an error because `contextfunction` was removed in version 3.1.0. I manually downgraded Jinja2 to version 3.0.3 and all worked fine. Not sure if it is just me or a general issue. Is posting bugs like this here the right thing to do? I had a look at your open issues on the GitHub repo but couldn't find anything related.
  • Shubham Gupta (10/20/2022, 3:40 AM)
    Hi Team,
  • Shubham Gupta (10/20/2022, 3:43 AM)
    We are trying to build an API using Kedro. I understand that Kedro loads data lazily. Is there a way to persist this lookup data in the DataCatalog? We might be able to keep the API robust and fast using a combination of lazy and eager loading on the DataCatalog.
  • Shubham Gupta (10/20/2022, 3:44 AM)
    And runners can take care of the rest.
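    One option that might fit here (a sketch, not necessarily what the API should end up using): wrap the lookup dataset in Kedro's `CachedDataSet`, so the first load hits disk and later loads within the same process come from memory. The filepath and dataset name below are illustrative.
    ```python
    from kedro.extras.datasets.pandas import ParquetDataSet
    from kedro.io import CachedDataSet, DataCatalog

    # Lazy on first access, cached in memory afterwards.
    lookup = CachedDataSet(ParquetDataSet(filepath="data/01_raw/lookup.parquet"))

    catalog = DataCatalog({"lookup": lookup})
    df = catalog.load("lookup")   # reads from disk
    df = catalog.load("lookup")   # served from the in-memory cache
    ```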
  • Suryansh Soni (10/21/2022, 1:56 PM)
    Hello Team. Does anyone have experience deploying Kedro pipelines to AWS Step Functions? Please let me know, I need some urgent help with that.
  • Ian Whalen (10/21/2022, 5:01 PM)
    Just in time for Halloween, I'm trying to do some Jinja black magic 🧙 High level: I want to add a global variable to `globals_dict` in `settings.py` and use it in a loop in my catalog. See thread for an example. Any ideas?
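    For reference, a minimal sketch of where `globals_dict` plugs in with `TemplatedConfigLoader` (Kedro 0.18-era API; the variable name `datasets_to_build` is made up for illustration):
    ```python
    # settings.py
    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    CONFIG_LOADER_ARGS = {
        # values merged from globals.yml files ...
        "globals_pattern": "*globals.yml",
        # ... plus extra globals defined in code, available to catalog templates
        # (e.g. referenced as ${datasets_to_build} or from a Jinja2 loop)
        "globals_dict": {"datasets_to_build": ["alpha", "beta", "gamma"]},
    }
    ```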
  • Jordan (10/21/2022, 8:49 PM)
    Hi friends, I'm looking for some advice. I need to be able to batch process different custom partitioned datasets using the same modular pipeline, whenever required. It's quite tedious to write `catalog.yml` entries for the inputs and outputs of each batch process, so I was hoping to implement a solution using hooks that avoids this tedium. If possible, I would like the solution to: 1. Dynamically populate the catalog with input and output entries for each partitioned dataset. 2. Instantiate and run the modular pipeline using each partitioned dataset's dynamically populated catalog entries. 3. Make the output datasets of each run available via the data catalog at any time. This should (maybe) be possible with some combination of the `after_context_created`, `after_catalog_created` and `before_pipeline_run` hooks, but I'm unsure how to actually implement it. Any guidance would be much appreciated, cheers.
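    As a starting point, a hedged sketch of step 1 only: an `after_catalog_created` hook that registers a `PartitionedDataSet` input/output pair per batch. The batch names, paths and dataset types are placeholders.
    ```python
    from kedro.framework.hooks import hook_impl
    from kedro.io import PartitionedDataSet

    BATCHES = ["batch_a", "batch_b"]  # placeholder batch names


    class DynamicCatalogHook:
        @hook_impl
        def after_catalog_created(self, catalog):
            for batch in BATCHES:
                catalog.add(
                    f"{batch}.input",
                    PartitionedDataSet(
                        path=f"data/01_raw/{batch}",
                        dataset="pandas.CSVDataSet",
                    ),
                )
                catalog.add(
                    f"{batch}.output",
                    PartitionedDataSet(
                        path=f"data/07_model_output/{batch}",
                        dataset="pandas.CSVDataSet",
                    ),
                )
    ```
    The hook would still need to be registered in `HOOKS` in settings.py, and the modular pipeline instances wired to these entry names (e.g. via namespaces).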
  • user (10/23/2022, 7:58 AM)
    How to show a Plotly chart in Kedro: I am trying to use the data science tool Kedro according to this tutorial. I followed the instructions (wrote config.yaml, node.py, pipeline.py etc., exactly as in the documentation) and could run kedro run successfully. As a next step, I tried kedro viz; it shows the pipelines, but I cannot see the Plotly chart. Here is the result of the visualization. Please see the left...
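    For kedro-viz to render a chart, the node usually has to return a Plotly figure that is saved to one of the plotly dataset types. A hedged sketch (the dataset name and columns are illustrative, not from the tutorial):
    ```python
    import pandas as pd
    import plotly.express as px


    def make_price_chart(companies: pd.DataFrame):
        """Return a Plotly figure; save it via a plotly dataset so kedro viz can render it."""
        fig = px.bar(companies, x="company_rating", y="id")
        return fig

    # catalog.yml (illustrative):
    #
    # price_chart:
    #   type: plotly.JSONDataSet
    #   filepath: data/08_reporting/price_chart.json
    ```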
  • user (10/24/2022, 8:18 AM)
    How to generate Kedro pipelines automatically (like DataEngineerOne does)? Having seen the DataEngineerOne video "How To Use a Parameter Range to Generate Pipelines Automatically", I want to automate a pipeline that simulates an electronic circuit. I want to do a grid search over multiple central frequencies of a bandpass filter, and run the simulation pipeline for each one. In the pipeline registry, the grid search parameters are passed to the create_pipeline() function's kwargs. #...
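    A rough sketch of that pattern in pipeline_registry.py, assuming `create_pipeline()` accepts the swept value as a keyword argument (the module path, frequencies and names are illustrative):
    ```python
    from kedro.pipeline import Pipeline

    from my_project.pipelines.simulate import create_pipeline  # hypothetical module


    def register_pipelines():
        frequencies = [1.0e6, 2.0e6, 5.0e6]  # illustrative grid
        pipelines = {
            f"simulate_{i}": create_pipeline(central_frequency=freq)
            for i, freq in enumerate(frequencies)
        }
        # run everything with a plain `kedro run`
        pipelines["__default__"] = sum(pipelines.values(), Pipeline([]))
        return pipelines
    ```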
  • Yetunde (10/24/2022, 8:52 AM)
    has renamed the channel from ‘support’ to ‘questions’
  • user (10/25/2022, 8:18 AM)
    Can't run KedroSession with `from_inputs` parameter: ValueError: Pipeline does not contain data_sets named [...] In a Jupyter notebook, when I run session.run(pipeline_name='sim', from_inputs=['measurements', 'params:simulation']), passing datasets and params specified in catalog.yaml, everything works fine. However, when I want to run it with a dataset that I added during the session, a ValueError occurs:
    ```
    >> ds = GenMsmtsDataSet()
    >> catalog.add('ipy_msmts', ds)
    >> session.run(pipeline_name='sim', from_inputs=['ipy_msmts', 'params:simulation'])
    ValueError: Pipeline does not contain data_sets named...
    ```
  • Toni (10/25/2022, 10:35 AM)
    Hi community! I was wondering if I can save an output of the same node in two different formats. For instance:
    ```python
    node(
        func=some_function,
        inputs="some_input",
        outputs="the_output",
        name="node",
    ),
    ```
    ```yaml
    the_output:
      type: pandas.CSVDataSet
      filepath: data/output_csv.csv

    the_output:
      type: pandas.ParquetDataSet
      filepath: data/output_parquet.parquet
    ```
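    Duplicate keys like `the_output` aren't allowed in the catalog, so one workaround (a sketch, not the only option) is a tiny fan-out node that returns the same object twice and maps it to two differently named catalog entries, one per format:
    ```python
    from kedro.pipeline import node


    def fan_out(df):
        """Return the same dataframe twice so it can be bound to two catalog entries."""
        return df, df


    duplicate_node = node(
        func=fan_out,
        inputs="the_output",
        outputs=["the_output_csv", "the_output_parquet"],
        name="duplicate_output",
    )

    # catalog.yml would then declare the_output_csv as pandas.CSVDataSet and
    # the_output_parquet as pandas.ParquetDataSet, each with its own filepath.
    ```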
  • Luis Gustavo Souza (10/25/2022, 1:03 PM)
    Hello, everyone! I need to pass some complex parameters to the Kedro CLI (lists, dicts, lists of dicts; e.g. --params test:["a", "b", "c"]). Does anyone know how I can achieve that?
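    If the CLI string syntax gets too awkward, one hedged workaround is to skip the CLI parsing entirely and trigger the run from Python, where extra_params accepts arbitrary structures (the parameter names below are illustrative):
    ```python
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(Path.cwd())

    extra_params = {
        "test": ["a", "b", "c"],                    # list
        "thresholds": {"low": 0.1, "high": 0.9},    # dict
        "grids": [{"x": 1}, {"x": 2}],              # list of dicts
    }

    with KedroSession.create(project_path=Path.cwd(), extra_params=extra_params) as session:
        session.run()
    ```
    Another documented route is `kedro run --config config.yml`, where the run arguments, including params, live in a YAML file.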
  • Yuchu Liu (10/25/2022, 1:40 PM)
    Hello everyone! I am trying to set up Kedro on my machine for an existing project and pipeline. My colleague and I have similar dependencies, and the project works perfectly fine on their machine. The error I get is related to writing a Parquet file. To debug, I have: • validated that PySpark works when reading and writing a Parquet file, including overwriting an existing file • loaded Kedro Jupyter Lab and tried to load and write a Parquet file; loading works, but writing gives me the same error message as when I run the pipeline (Failed while saving data to data set).
  • Danhua Yan (10/25/2022, 2:00 PM)
    Hi team, a question that is probably about how to create a custom dataset with certain read/write behaviour. Details below: I am trying to use Kedro to read, in pandas, a `delta` dataset created by Databricks. The current configs look like this:
    ```yaml
    _pandas_parquet: &_pandas_parquet
      type: pandas.ParquetDataSet

    _spark_parquet: &_delta_parquet
      type: spark.SparkDataSet
      file_format: delta
    ```
    What I want to achieve:
    ```yaml
    node1:
      outputs: dataset@spark

    node2:
      inputs: dataset@pandas
    ```
    Unfortunately `pandas` doesn't support reading `delta` as is. I found the workaround below, which requires additional steps: https://mungingdata.com/pandas/read-delta-lake-dataframe/ How should I create a dataset that does something like this internally when being loaded?
    ```python
    from deltalake import DeltaTable
    dt = DeltaTable("resources/delta/1")
    df = dt.to_pandas()
    ```
    I tried looking into https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction but nothing is mentioned about using pandas to interact with `delta`. Thank you!
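    A hedged sketch of a custom dataset whose `_load` does exactly that (the class name is illustrative; `_save` is left unimplemented since the writes happen on the Spark side):
    ```python
    import pandas as pd
    from deltalake import DeltaTable
    from kedro.io import AbstractDataSet


    class PandasDeltaDataSet(AbstractDataSet):
        """Read a Delta table into a pandas DataFrame via the deltalake package."""

        def __init__(self, filepath: str):
            self._filepath = filepath

        def _load(self) -> pd.DataFrame:
            return DeltaTable(self._filepath).to_pandas()

        def _save(self, data) -> None:
            raise NotImplementedError("This dataset is read-only on the pandas side.")

        def _describe(self) -> dict:
            return {"filepath": self._filepath}
    ```
    Registered in the catalog under its import path (e.g. a `type:` pointing at wherever the class lives in the project), it could back the `dataset@pandas` side of the transcoding pair.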
  • Denis Araujo da Silva (10/25/2022, 2:40 PM)
    What's the future for `kedro build-reqs`? Just saw the message that it will be deprecated in 0.19.
  • Sasha Collin (10/26/2022, 9:53 AM)
    Hey! I have a question about best practice when dealing with several splitting methods for the same dataset. I was thinking about a structure as follows:
    ```
    - 05_model_input (folder)
    --- master_table_1 (folder)
    ------ master_table_1.csv (file)
    ------ split_1 (folder)
    --------- X_train.csv
    --------- X_test.csv
    --------- y_train.csv
    --------- y_test.csv
    ------ split_2 (folder)
    --------- X_train.csv
    --------- X_test.csv
    --------- y_train.csv
    --------- y_test.csv
    ```
    Would you say this is good practice? Or would you advise not saving the splits and instead parametrising the split method in parameters.yml, for instance? Thanks a lot for your help!
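    If the second option is chosen, a minimal sketch of a parameter-driven split node (assuming scikit-learn; the parameter names are illustrative):
    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split


    def split_data(master_table: pd.DataFrame, params: dict):
        """Split the master table according to parameters.yml instead of persisting splits."""
        X = master_table.drop(columns=[params["target"]])
        y = master_table[params["target"]]
        # returns X_train, X_test, y_train, y_test in that order
        return train_test_split(
            X, y, test_size=params["test_size"], random_state=params["random_state"]
        )

    # parameters.yml (illustrative):
    # split:
    #   target: price
    #   test_size: 0.2
    #   random_state: 42
    ```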
  • Nichita Morcotilo (10/26/2022, 10:02 AM)
    Hey! I have a question regarding Kedro 0.17.7. I have three conf folders: `conf/base`, `conf/test`, and `conf/local` (empty directory). My `conf/test/pipelines.yml` is an empty file, and executing `kedro run --env=test` results in folders being created in the `/data` directory for each of the nodes listed in `conf/base/pipelines.yml`. Is this expected behaviour for Kedro, i.e. that an environment with an empty `pipelines.yml` falls back to the base env? Thank you!
  • Erwin (10/26/2022, 11:43 AM)
    Hi! We are having an issue with GitHub CI. We have a Kedro project with experiment tracking enabled, so every run saves some information to the DB, and this always works in local mode. But when we test a minimal pipeline in GitHub Actions (just to make sure there are no circular dependencies, etc.), it fails because the repo is cloned at a specific commit rather than a branch (since it is a CI test run after a PR is opened). The key point is that the run cannot be associated with a branch, so this fails:
    branch = git.Repo(search_parent_directories=True).a
    Is there any way to disable experiment tracking at runtime? Or what would be a better approach to check that Kedro can at least create the graph and detect circular dependencies? Detailed log:
    ```
    Run kedro run --tag tag_dict
    As an open-source project, we collect usage analytics. 
    We cannot see nor store information contained in a Kedro project. 
    You can find out more by reading our privacy notice: 
    <https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice> 
    Do you opt into usage analytics?  [y/N]: [10/25/22 20:27:24] WARNING  Failed to confirm consent. No data    plugin.py:210
                                 was sent to Heap. Exception:                       
    [10/25/22 20:27:24] INFO     Kedro project            session.py:343
    
    
    
    
    Pipelines started
    
            
            
    [10/25/22 20:27:24] INFO     Seeding sklearn, numpy and random   seed_file.py:41
                                 libraries with the seed 42                         
                        INFO     Loading data from               data_catalog.py:343
                                 'tag_dictionary'                                   
                                 (ExcelDataSet)...                                  
    [10/25/22 20:27:25] INFO     Running node: create_td:                node.py:327
                                 create_td([tag_dictionary]) -> [td]                
                        INFO     Saving data to 'td'             data_catalog.py:382
                                 (PickleDataSet)...                                 
                        INFO     Completed 1 out of 1 tasks  sequential_runner.py:85
                        INFO     Pipeline execution completed           runner.py:90
                                 successfully.                                      
    ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
    │ /opt/hostedtoolcache/Python/3.8.0/x64/bin/kedro:8 in <module>                │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/cli/cli.py:211 in main                                                 │
    │                                                                              │
    │   208 │   """                                                                │
    │   209 │   _init_plugins()                                                    │
    │   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                 │
    │ ❱ 211 │   cli_collection()                                                   │
    │   212                                                                        │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1130 in __call__                                                         │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/cli/cli.py:139 in main                                                 │
    │                                                                              │
    │   136 │   │   )                                                              │
    │   137 │   │                                                                  │
    │   138 │   │   try:                                                           │
    │ ❱ 139 │   │   │   super().main(                                              │
    │   140 │   │   │   │   args=args,                                             │
    │   141 │   │   │   │   prog_name=prog_name,                                   │
    │   142 │   │   │   │   complete_var=complete_var,                             │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1055 in main                                                             │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1657 in invoke                                                           │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1404 in invoke                                                           │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:760 in invoke                                                            │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/cli/project.py:366 in run                                              │
    │                                                                              │
    │   363 │   node_names = _get_values_as_tuple(node_names) if node_names else n │
    │   364 │                                                                      │
    │   365 │   with KedroSession.create(env=env, extra_params=params) as session: │
    │ ❱ 366 │   │   session.run(                                                   │
    │   367 │   │   │   tags=tag,                                                  │
    │   368 │   │   │   runner=runner(is_async=is_async),                          │
    │   369 │   │   │   node_names=node_names,                                     │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/session/session.py:293 in __exit__                                     │
    │                                                                              │
    │   290 │   def __exit__(self, exc_type, exc_value, tb_):                      │
    │   291 │   │   if exc_type:                                                   │
    │   292 │   │   │   self._log_exception(exc_type, exc_value, tb_)              │
    │ ❱ 293 │   │   self.close()                                                   │
    │   294 │                                                                      │
    │   295 │   def run(  # pylint: disable=too-many-arguments,too-many-locals     │
    │   296 │   │   self,                                                          │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/session/session.py:285 in close                                        │
    │                                                                              │
    │   282 │   │   if `save_on_close` attribute is True.                          │
    │   283 │   │   """                                                            │
    │   284 │   │   if self.save_on_close:                                         │
    │ ❱ 285 │   │   │   self._store.save()                                         │
    │   286 │                                                                      │
    │   287 │   def __enter__(self):                                               │
    │   288 │   │   return self                                                    │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro_viz/ │
    │ integrations/kedro/sqlite_store.py:68 in save                                │
    │                                                                              │
    │   65 │   │   engine, session_class = create_db_engine(self.location)         │
    │   66 │   │   Base.metadata.create_all(bind=engine)                           │
    │   67 │   │   database = next(get_db(session_class))                          │
    │ ❱ 68 │   │   session_store_data = RunModel(id=self._session_id, blob=<http://self.to|self.to> │
    │   69 │   │   database.add(session_store_data)                                │
    │   70 │   │   database.commit()                                               │
    │   71                                                                         │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro_viz/ │
    │ integrations/kedro/sqlite_store.py:52 in to_json                             │
    │                                                                              │
    │   49 │   │   │   │   try:                                                    │
    │   50 │   │   │   │   │   import git  # pylint: disable=import-outside-toplev │
    │   51 │   │   │   │   │                                                       │
    │ ❱ 52 │   │   │   │   │   branch = git.Repo(search_parent_directories=True).a │
    │   53 │   │   │   │   │   value["branch"] = branch.name                       │
    │   54 │   │   │   │   except ImportError as exc:  # pragma: no cover          │
    │   55 │   │   │   │   │   logger.warning("%s:%s", exc.__class__.__name__, exc │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/git/repo/b │
    │ ase.py:865 in active_branch                                                  │
    │                                                                              │
    │    862 │   │   :raises TypeError: If HEAD is detached                        │
    │    863 │   │   :return: Head to the active branch"""                         │
    │    864 │   │   # reveal_type(self.head.reference)  # => Reference            │
    │ ❱  865 │   │   return self.head.reference                                    │
    │    866 │                                                                     │
    │    867 │   def blame_incremental(self, rev: str | HEAD, file: str, **kwargs: │
    │    868 │   │   """Iterator for blame information for the given file at the g │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/git/refs/s │
    │ ymbolic.py:309 in _get_reference                                             │
    │                                                                              │
    │   306 │   │   │   to a reference, but to a commit"""                         │
    │   307 │   │   sha, target_ref_path = self._get_ref_info(self.repo, self.path │
    │   308 │   │   if target_ref_path is None:                                    │
    │ ❱ 309 │   │   │   raise TypeError("%s is a detached symbolic reference as it │
    │   310 │   │   return self.from_path(self.repo, target_ref_path)              │
    │   311 │                                                                      │
    │   312 │   def set_reference(                                                 │
    ╰──────────────────────────────────────────────────────────────────────────────╯
    TypeError: HEAD is a detached symbolic reference as it points to 
    'b508cdadd62cf912ab26104388cef8e08d1066eb'
    Error: Process completed with exit code 1.
    ```
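    One possible way to switch tracking off for CI runs (a sketch only; it assumes the experiment-tracking store is configured via SESSION_STORE_CLASS in settings.py and that the CI environment sets the usual CI variable):
    ```python
    # settings.py
    import os
    from pathlib import Path

    from kedro.framework.session.store import BaseSessionStore

    if os.environ.get("CI"):
        # Plain in-memory store: no SQLite DB and no git lookups during CI runs.
        SESSION_STORE_CLASS = BaseSessionStore
    else:
        from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

        SESSION_STORE_CLASS = SQLiteStore
        SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
    ```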
  • Suryansh Soni (10/26/2022, 3:17 PM)
    Hello! Is there any update on the Kedro deployment issue with AWS Step Functions?
  • Michał Stachowicz (10/27/2022, 9:56 AM)
    In my project I would like to have a pipeline that uses a pre-trained model stored on MLflow. The problem I am facing is that I don't know how to carry the transformation information from preprocessing over to inference. For example, for normalisation we need the training set's minimum and maximum values to be able to transform the test set in the same way. Is there a solution to this issue? Do you recommend using sklearn Pipelines?
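    One common pattern (a sketch, assuming scikit-learn and a pickle-backed catalog entry for the fitted scaler) is to fit the transformer in one node, persist it, and reuse it in a later node rather than recomputing the statistics:
    ```python
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler


    def fit_scaler(train: pd.DataFrame) -> MinMaxScaler:
        """Fit on training data only; persist via e.g. a pickle.PickleDataSet entry."""
        return MinMaxScaler().fit(train)


    def apply_scaler(data: pd.DataFrame, scaler: MinMaxScaler) -> pd.DataFrame:
        """Reuse the fitted min/max so test data is transformed identically."""
        return pd.DataFrame(scaler.transform(data), columns=data.columns, index=data.index)
    ```
    Bundling the scaler and the model into a single sklearn Pipeline and logging that to MLflow is the other common option; either way the fitted transformer travels with the model instead of being refit.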
  • Zirui Xu (10/27/2022, 4:34 PM)
    Hello team. Is there a way to use `kedro.extras.datasets.spark.SparkDataSet` without installing the dependencies specified in `kedro[spark]`? I am on a Databricks cluster where the installation of pyspark is blocked.
  • Eivind Samseth (10/28/2022, 11:05 AM)
    Any news on Python 3.11 support? I saw on Hacker News that it brings quite a few speed-ups.
  • Seth (10/28/2022, 2:45 PM)
    Is there a way to access the config values of my conf/base/globals.yml inside a node? I know I can load them by creating a ConfigLoader, but since they are used to create the DataCatalog, I assume they have already been read into memory and may be accessible some other way?
  • Lorenzo Castellino (10/31/2022, 8:48 AM)
    Is there a suggested way to set a numpy random state globally? I thought about using a global parameter, but it might be verbose to add it to all the functions that require it. I was also thinking about a custom hook that would execute `np.random.set_state()` before the nodes that require it, but I would like to hear what your solution to "reproducible randomness" looks like 🙂
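    A minimal sketch of the hook idea, seeding once per run from a parameter rather than per node (the parameter name `seed` and the default of 42 are illustrative):
    ```python
    import random

    import numpy as np
    from kedro.framework.hooks import hook_impl


    class SeedEverythingHook:
        @hook_impl
        def before_pipeline_run(self, run_params):
            # e.g. `kedro run --params seed:1234`; falls back to 42 otherwise
            seed = (run_params.get("extra_params") or {}).get("seed", 42)
            random.seed(seed)
            np.random.seed(seed)
    ```
    The hook is registered via HOOKS in settings.py; seeding before every stochastic node instead would need `before_node_run` plus a way to mark which nodes need it.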
  • Jordan (10/31/2022, 5:03 PM)
    What is the best way to make credentials from `credentials.yml` available for use in a hook? I know I can just use the YAML loader, but this doesn't feel correct.
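    One hedged option: the `after_catalog_created` hook already receives the parsed credentials for the active environment as `conf_creds` (0.18-era hook spec), so no separate YAML loading is needed. The credential key below is illustrative.
    ```python
    from kedro.framework.hooks import hook_impl


    class CredentialsAwareHook:
        def __init__(self):
            self._credentials = {}

        @hook_impl
        def after_catalog_created(self, conf_creds):
            # conf_creds is the dict parsed from credentials.yml for the active env
            self._credentials = conf_creds or {}

        @hook_impl
        def before_pipeline_run(self, run_params):
            token = self._credentials.get("my_api", {}).get("token")  # illustrative key
            # ... use the credential here
    ```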
  • Pedro Arthur (10/31/2022, 7:52 PM)
    Hi! Is there a way to cache node outputs and avoid re-running a node if the cached result already exists?