# questions
  • o

    Olivier Ho

    07/05/2023, 2:49 PM
    hi! is there a way to force the use of the omegaconf oc.env resolver in the catalog (even though it's not recommended) ?
  • n

    Nicolas Oulianov

    07/06/2023, 8:01 AM
    Hello all. I want to log the number of rows for the datasets at each step of my pipeline. Do you know a good way to do that? I’ve seen this great pull request about logging the number of columns and rows. Is this already possible in the current module? I didn’t find much documentation about this.
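    One way to get row counts without touching the nodes is a dataset hook; a minimal sketch (illustrative, assuming pandas DataFrames), registered via HOOKS in settings.py:
    import logging

    import pandas as pd
    from kedro.framework.hooks import hook_impl

    logger = logging.getLogger(__name__)


    class DatasetShapeHooks:
        """Logs the shape of every pandas DataFrame that Kedro loads or saves."""

        @hook_impl
        def after_dataset_loaded(self, dataset_name, data):
            if isinstance(data, pd.DataFrame):
                logger.info("Loaded %s: %d rows x %d columns", dataset_name, *data.shape)

        @hook_impl
        def after_dataset_saved(self, dataset_name, data):
            if isinstance(data, pd.DataFrame):
                logger.info("Saved %s: %d rows x %d columns", dataset_name, *data.shape)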
  • f

    Felipe Vianna

    07/06/2023, 1:29 PM
    Hello folks. Is there any way I can use globals params from an env other than base? Problem: I have some configurations that are resolved from globals into other params files by reference
    my_param: ${my_param_in_globals.yml}
    For some reason we can only resolve params like that from the base globals.yml, and when I try to get them from live/globals.yml they’re not picked up.
  • a

    Andrew Doherty

    07/06/2023, 2:08 PM
    Hi All, I am developing a Kedro Namespace pipeline (
    kedro=0.18.7
    ). When running from the CLI I would like to pass the location of the parameters file. From the documentation I believe I should be able to run:
    kedro run  --pipeline=namespace_1 --config=<config_file_name>.yml
    However, when I run this I get
    KeyError: 'run'
    (traceback in thread). Is there a way to use
    --config
    with namespace pipelines? The alternative is to create a copy of the
    conf
    folder and use the following:
    kedro run  --pipeline=namespace_1 --conf-source=moved_params
    This works but the first option where we only override the single parameter file would be much neater for our purpose. Thanks again.
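    For reference, the KeyError: 'run' typically means the YAML passed to --config is missing the top-level run key that kedro run expects; a hedged sketch of the expected layout (values illustrative):
    # <config_file_name>.yml
    run:
      pipeline: namespace_1
      # other keys under `run:` mirror the CLI flags, e.g. tags, env, runner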
  • m

    Marc Gris

    07/06/2023, 2:22 PM
    Hi everyone, Is there some setting that could allow for
    kedro jupyter notebook
    to open notebooks directly in the IDE (vscode in my case) instead of in a browser ? Thx M.
  • m

    Marc Gris

    07/06/2023, 3:26 PM
    Hi again, Does the
    ParallelRunner
    really only parallelize nodes, or does it actually also parallelize modular / namespace pipelines ? Thx M
  • a

    Andreas_Kokolantonakis

    07/07/2023, 10:26 AM
    Hi everyone, has anyone tried to use the csv-logger package with Kedro's logging.yml? https://pypi.org/project/csv-logger/
  • m

    Mathilde Lavacquery

    07/07/2023, 12:11 PM
    Hi Team, is it not possible anymore to pass a namespace as a node argument? That was possible until v18.3, but I am currently having trouble replicating the behaviour with v18.8. Here is my use case: I have one raw input file for a whole country, but I run each region of the country separately. My first node filters out the
    region
    from the raw file, and then I apply the same modular pipeline to each region, using
    namespace=region
    def create_preprocess_modular_pipeline(region) -> Pipeline:
        def filter_scope_node(raw_national):
            return filter_scope(raw_national, region=region)
    
        return pipeline(
            [
                node(
                    func=filter_scope_node,
                    inputs=["raw_national"],
                    outputs="scope_filtered",
                ),
            ]
        )
    
    
    def create_preprocess_pipeline(**kwargs) -> Pipeline:
        regions = kwargs.get("regions")
        preprocess_pipeline = Pipeline([])
        for region in regions:
            preprocess_pipeline += pipeline(
                pipe=create_preprocess_modular_pipeline(region=region),
                namespace=region,
                # inputs for which we don't have namespaces
                inputs={"raw_national": "raw_national"},
            )
        return preprocess_pipeline
    Current behaviour: it seems that only the last region is used in
    filter_scope_node
    for all regions (seems to be an instantiation problem of this node)
    How can I dynamically pass my namespace to some nodes?
    How can I dynamically instantiate this filter_scope_node?
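    One possible workaround (a sketch, not a confirmed fix for the behaviour above) is to bind the region explicitly with functools.partial and give each node an explicit name, so every namespaced copy carries its own value:
    from functools import partial

    from kedro.pipeline import Pipeline, node, pipeline


    def create_preprocess_modular_pipeline(region: str) -> Pipeline:
        # filter_scope is the existing function from the snippet above
        return pipeline(
            [
                node(
                    func=partial(filter_scope, region=region),  # region bound per pipeline copy
                    inputs="raw_national",
                    outputs="scope_filtered",
                    name=f"filter_scope_{region}",  # partials have no __name__, so name the node explicitly
                ),
            ]
        )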
  • m

    Michel van den Berg

    07/07/2023, 1:50 PM
    In the data catalog, can we add extra metadata to a dataset?
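    For what it's worth, recent Kedro releases let you attach a free-form metadata block to a catalog entry (a sketch, assuming a version that supports it; Kedro itself does not use it for loading or saving):
    companies:
      type: pandas.CSVDataSet
      filepath: data/01_raw/companies.csv
      metadata:
        owner: data-team            # arbitrary keys, available on the dataset's metadata attribute
        description: raw companies extract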
  • e

    Emilio Gagliardi

    07/07/2023, 7:01 PM
    A quick question about using jupyter notebooks inside VSCode. I followed the instructions in the documentation to launch a proper kedro notebook that opens in a tab in my browser. In that scenario, the correct kernel is available and I can access the session/context etc. as directed. However, when I open the same notebook inside VSCode, the kedro kernel isn't available, only the python environment I created for my project. Is it frowned upon to open notebooks inside VSCode? Should I just stick to the method outlined in the instructions? I'd prefer to open the notebook inside VSCode so I don't have to tab between it and my browser. Any instructions to get this working are greatly appreciated. Thanks kindly,
  • m

    Michel van den Berg

    07/08/2023, 6:57 AM
    In a non-production environment - during development - data engineers can validate datasets between nodes to see whether a function executed correctly. However, I wonder if you have some guidance on how to do this in a production environment, especially when CI/CD is involved and the pipeline is not run locally but in, say, Airflow. I am aware that there are data quality tools like Great Expectations that can automate this within a CI/CD pipeline. Some questions I have: • When an automated data quality tool fails the (Airflow) pipeline that is running in a CI/CD pipeline, what is the recommended way of fixing the data and re-running the pipeline? Is it recommended to re-run the whole pipeline, or can we also run only a subset of it? Or is it really hard to find out where the data quality issue resides in the overall (master) pipeline, so it might be better to re-run the pipeline as a whole? • I understand that it is better to automate data quality testing; however, is there also something like manual data quality testing, especially in the context of a production system? Can we express something like a manual validation step within Kedro, where the pipeline waits until a user presses a button to continue, or - in case of a data error - (partially) re-run the pipeline with newly uploaded corrected data?
  • e

    Eduardo Romero López

    07/09/2023, 11:32 AM
    Hi all, I am starting with kedro and I have a lot of doubts. What is the recommended way to save and load data within the structure of the project? For example: I have a node that reads raw data, does data wrangling and saves intermediate data in parquet format. Is it good practice to do it that way, or is it better to use "from kedro.io import data_catalog" as I show in the image?
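    The idiomatic pattern is to declare the intermediate output in conf/base/catalog.yml and let the node simply return the DataFrame, rather than instantiating the DataCatalog inside the node; an illustrative sketch:
    # conf/base/catalog.yml (names illustrative)
    preprocessed_data:
      type: pandas.ParquetDataSet
      filepath: data/02_intermediate/preprocessed_data.parquet
    A node that lists preprocessed_data as its output then only has to return the wrangled DataFrame; Kedro performs the Parquet save through the catalog entry.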
  • e

    Eduardo Romero López

    07/09/2023, 11:33 AM
    thanks in advance
  • j

    Jackson

    07/10/2023, 9:10 AM
    Hi everyone, I have a question. I was trying to convert a kedro pipeline to airflow DAGs following the steps described in [here]. These are my steps: 1. kedro airflow create; the file is then generated inside the airflow_dags/ directory 2. I have another folder for airflow, so I copy the generated DAGs to the corresponding dags/ folder in the airflow directory 3. Package the kedro pipeline using kedro package and install it in the corresponding airflow folder 4. Copy the conf/ and data/ folders to the corresponding airflow folder The problem is, when I run the airflow task, it outputs the following error.
    ValueError: Given configuration path either does not exist or is not a valid directory: /home/jackson/MASS_AIR_PIPELINE/conf/base
    May I know if I missed something in my steps above? Shouldn't it locate the conf/base folder inside my airflow directory? FYI, /home/jackson/MASS_AIR_PIPELINE/ is the folder where I develop all my kedro pipelines and /home/jackson/airflow is where I store my airflow DAGs.
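    One hedged possibility: the error shows the task still resolving configuration against the original development path, which suggests the generated DAG has that project path baked in; pointing the session at the folder on the Airflow machine that actually contains conf/ and data/ may help (paths and package name below are illustrative):
    from pathlib import Path

    from kedro.framework.project import configure_project
    from kedro.framework.session import KedroSession

    project_path = Path("/home/jackson/airflow")  # folder holding the copied conf/ and data/

    configure_project("mass_air_pipeline")  # the packaged project's package name (assumed)
    with KedroSession.create(project_path=project_path, env="base") as session:
        session.run(pipeline_name="__default__")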
  • t

    Toni - TomTom - Madrid

    07/10/2023, 9:28 AM
    Hello 🙂 Has anyone already implemented a Kedro dataset type to read text files as an RDD in Spark? (Extra points if you have even done it for XML or KML files 😉.) If not, I would like to know the best (simplest) way to implement this class from the existing methods/templates in Kedro. Thanks a lot in advance!
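    I'm not aware of a built-in one; a rough, untested sketch of a custom dataset following Kedro's AbstractDataSet pattern (class and argument names are illustrative):
    from typing import Any, Dict

    from kedro.io import AbstractDataSet
    from pyspark import RDD
    from pyspark.sql import SparkSession


    class SparkTextRDDDataSet(AbstractDataSet):
        """Loads a text file as a Spark RDD of lines and saves an RDD back as text."""

        def __init__(self, filepath: str, min_partitions: int = None):
            self._filepath = filepath
            self._min_partitions = min_partitions

        def _get_spark_context(self):
            return SparkSession.builder.getOrCreate().sparkContext

        def _load(self) -> RDD:
            sc = self._get_spark_context()
            if self._min_partitions:
                return sc.textFile(self._filepath, self._min_partitions)
            return sc.textFile(self._filepath)

        def _save(self, data: RDD) -> None:
            data.saveAsTextFile(self._filepath)

        def _describe(self) -> Dict[str, Any]:
            return {"filepath": self._filepath, "min_partitions": self._min_partitions}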
  • a

    Abhishek Bhatia

    07/10/2023, 11:33 AM
    Hi folks! Is it possible to define multiple types of base datasets for PartitionedDataSet? I have a use case where my node may return both pandas DataFrames and dictionaries in the output. Having separate catalog entries for each doesn't suit my needs either. I was hoping something like the following was possible:
    my_multi_format_part_dataset:
      type: PartitionedDataSet
      path: "data/path/to/dataset"
      dataset:
        - type: pandas.CSVDataSet
          load_args:
            index_col: 0
          save_args:
            index: false
          filename_suffix: ".csv"
        - type: json.JSONDataSet
          filename_suffix: ".json"
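    As far as I know a single PartitionedDataSet takes one underlying dataset definition, so one workaround (a sketch, not an official recommendation) is two entries over the same folder, each filtered by its filename_suffix, with the node returning one dict per entry:
    my_part_dataset_csv:
      type: PartitionedDataSet
      path: data/path/to/dataset
      dataset:
        type: pandas.CSVDataSet
        load_args:
          index_col: 0
        save_args:
          index: false
      filename_suffix: ".csv"

    my_part_dataset_json:
      type: PartitionedDataSet
      path: data/path/to/dataset
      dataset: json.JSONDataSet
      filename_suffix: ".json"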
  • o

    Olivier Ho

    07/10/2023, 11:44 AM
    hello, a question about best practices. Say you have two pipelines, pipe1 & pipe2, that you want to micro-package. Following the documentation, we should have a
    requirements.txt
    in the pipeline folders for each pipeline; however, should we also copy those requirements into the global
    requirements.txt
    in the
    src/project_name
    folder? The issue is that duplicating pipeline-specific dependencies requires maintenance, so I would like to keep the dependencies in their specific folders and have a single command to install all project & pipeline dependencies.
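    One low-tech option (paths illustrative, not an official micro-packaging feature) is to keep the per-pipeline files authoritative and pull them into the project-level requirements.txt with pip's nested -r includes, so a single pip install -r installs everything:
    # src/project_name/requirements.txt
    -r pipelines/pipe1/requirements.txt
    -r pipelines/pipe2/requirements.txt
    kedro~=0.18.0   # project-level dependencies stay here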
  • m

    Marc Gris

    07/10/2023, 4:21 PM
    Hi everyone, I hope you all had a nice weekend 🙂 This is a subject I had already brought up, which I really think deserves “another shot”: dependency isolation (nodes, pipelines, namespaces…). (I’m dropping this here in order not to “clutter” the feature requests on GitHub “for nothing”. If you consider it to be a “reasonable” / “sensible” request, I’ll happily put it there.) Let’s start with an obvious & bitter truth: Python is wonderful… But like the snake in the garden of Eden, the fruit it offers comes at a high cost: 🔥 😈 Dependency Hell 😈 🔥 If it were only a deployment / production issue, I would “happily surrender” to the answer: just use Airflow’s VirtualenvOperator / ExternalOperator… But I hope most of you will agree that it also happens to be a nightmare in development… Granted, a “workaround” is always possible: creating multiple venvs in the repo and then “manually switching” between interpreters in the shell with (thx @Iñigo Hidalgo for the tip):
    .venv_model_*a*/bin/python -m kedro run --tags=model_*a*
    then
    .venv_model_*b*/bin/python -m kedro run --tags=model_*b*
    etc. But this, IMHO, is really far from an optimal “dev-comfort-centric” workflow… Hence my initial request / question: would there be some mechanism that could allow passing a path to a venv when creating a node / pipeline? (I must confess that, in my naïveté, I thought that this would be “quite easy” using a
    before_node_run
    callback… But, I quickly had to reckon that my skills were too meager for the task 😅 ) Many thanks in advance for taking the time to consider this suggestion / request. Regards Marc
  • m

    Michel van den Berg

    07/11/2023, 12:58 PM
    From this talk (

    https://www.youtube.com/watch?v=-FedSW2SN7A

    ), Lim mentions that your deployed pipeline does not need to have the same granularity as your development pipeline. Is this something that is built into Kedro? Or how would one achieve this?
  • m

    Michel van den Berg

    07/11/2023, 1:48 PM
    Does QB have something on top of Kedro? In our company we have data engineers (technical people) who can build pipelines (currently not Kedro), but don't actually run pipelines in production. We have built a small UI on top of the existing pipeline infrastructure that allows non-technical users to upload files, do manual validation and execute one or more pipeline tasks. Is this something you have seen at other companies as well?
  • j

    Jose Nuñez

    07/11/2023, 4:23 PM
    Hi fellow kedroids K🤖!! Suddenly, when trying to execute
    kedro viz
    I'm getting this error:
    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/bin/kedro:8 in <module>                     │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework │
    │ /cli/cli.py:211 in main                                                                          │
    │                                                                                                  │
    │   208 │   """                                                                                    │
    │   209 │   _init_plugins()                                                                        │
    │   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                                     │
    │ ❱ 211 │   cli_collection()                                                                       │
    │   212                                                                                            │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1 │
    │ 130 in __call__                                                                                  │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework │
    │ /cli/cli.py:139 in main                                                                          │
    │                                                                                                  │
    │   136 │   │   )                                                                                  │
    │   137 │   │                                                                                      │
    │   138 │   │   try:                                                                               │
    │ ❱ 139 │   │   │   super().main(                                                                  │
    │   140 │   │   │   │   args=args,                                                                 │
    │   141 │   │   │   │   prog_name=prog_name,                                                       │
    │   142 │   │   │   │   complete_var=complete_var,                                                 │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1 │
    │ 055 in main                                                                                      │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1 │
    │ 657 in invoke                                                                                    │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1 │
    │ 404 in invoke                                                                                    │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:7 │
    │ 60 in invoke                                                                                     │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_viz/launc │
    │ hers/cli.py:86 in viz                                                                            │
    │                                                                                                  │
    │    83 # pylint: disable=import-outside-toplevel, too-many-locals                                 │
    │    84 def viz(host, port, browser, load_file, save_file, pipeline, env, autoreload, params):     │
    │    85 │   """Visualise a Kedro pipeline using Kedro viz."""                                      │
    │ ❱  86 │   from kedro_viz.server import is_localhost, run_server                                  │
    │    87 │                                                                                          │
    │    88 │   installed_version = VersionInfo.parse(__version__)                                     │
    │    89 │   latest_version = get_latest_version()                                                  │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_viz/serve │
    │ r.py:13 in <module>                                                                              │
    │                                                                                                  │
    │    10 from kedro.pipeline import Pipeline                                                        │
    │    11 from watchgod import run_process                                                           │
    │    12                                                                                            │
    │ ❱  13 from kedro_viz.api import apps                                                             │
    │    14 from kedro_viz.api.rest.responses import EnhancedORJSONResponse, get_default_response      │
    │    15 from kedro_viz.constants import DEFAULT_HOST, DEFAULT_PORT                                 │
    │    16 from kedro_viz.data_access import DataAccessManager, data_access_manager                   │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_viz/api/a │
    │ pps.py:16 in <module>                                                                            │
    │                                                                                                  │
    │    13 from jinja2 import Environment, FileSystemLoader                                           │
    │    14                                                                                            │
    │    15 from kedro_viz import __version__                                                          │
    │ ❱  16 from kedro_viz.api.rest.responses import EnhancedORJSONResponse                            │
    │    17 from kedro_viz.integrations.kedro import telemetry as kedro_telemetry                      │
    │    18                                                                                            │
    │    19 from .graphql.router import router as graphql_router                                       │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_viz/api/r │
    │ est/responses.py:10 in <module>                                                                  │
    │                                                                                                  │
    │     7 from fastapi.responses import ORJSONResponse                                               │
    │     8 from pydantic import BaseModel                                                             │
    │     9                                                                                            │
    │ ❱  10 from kedro_viz.data_access import data_access_manager                                      │
    │    11                                                                                            │
    │    12                                                                                            │
    │    13 class APIErrorMessage(BaseModel):                                                          │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_viz/data_ │
    │ access/__init__.py:2 in <module>                                                                 │
    │                                                                                                  │
    │   1 """`kedro_viz.data_access` provides an interface to save and load data for viz backend."     │
    │ ❱ 2 from .managers import DataAccessManager                                                      │
    │   3                                                                                              │
    │   4 data_access_manager = DataAccessManager()                                                    │
    │   5                                                                                              │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_viz/data_ │
    │ access/managers.py:15 in <module>                                                                │
    │                                                                                                  │
    │    12 from sqlalchemy.orm import sessionmaker                                                    │
    │    13                                                                                            │
    │    14 from kedro_viz.constants import DEFAULT_REGISTERED_PIPELINE_ID, ROOT_MODULAR_PIPELINE_ID   │
    │ ❱  15 from kedro_viz.models.flowchart import (                                                   │
    │    16 │   DataNode,                                                                              │
    │    17 │   GraphEdge,                                                                             │
    │    18 │   GraphNode,                                                                             │
    │                                                                                                  │
    │ /Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_viz/model │
    │ s/flowchart.py:14 in <module>                                                                    │
    │                                                                                                  │
    │    11 from typing import Any, Dict, List, Optional, Set, Union, cast                             │
    │    12                                                                                            │
    │    13 from kedro.io import AbstractDataSet                                                      │
    │ ❱  14 from kedro.io.core import DatasetError                                                     │
    │    15 from kedro.pipeline.node import Node as KedroNode                                          │
    │    16 from kedro.pipeline.pipeline import TRANSCODING_SEPARATOR, _strip_transcoding              │
    │    17                                                                                            │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    ImportError: cannot import name 'DatasetError' from 'kedro.io.core' (/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/core.py)
    Any ideas on what the problem may be? thanks!
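    The failing import suggests a kedro / kedro-viz version mismatch: the installed kedro-viz expects a DatasetError that the installed kedro.io.core does not provide. A hedged first step is to check and align the two versions, e.g.:
    pip show kedro kedro-viz
    pip install -U kedro kedro-viz   # or pin kedro-viz to a release compatible with your kedro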
  • e

    Erwin

    07/11/2023, 7:03 PM
    Hello! I’m currently working with Kedro and Pandera for PySpark. I’m looking for some guidance on how to validate schemas and would appreciate any best practices or references you can provide. I have a few specific questions: 1. Where is the recommended place to define the schema? Should I build it using hooks from parameters (yaml)? Create a
    schema.py
    file? Or define the schema directly in
    nodes.py
    ? I would greatly appreciate any help or suggestions you can offer. Thank you!
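    One common layout (a sketch, not an official recommendation, and using the pandas-backed Pandera API for illustration) is a dedicated schemas.py next to nodes.py, so schemas are plain importable objects rather than YAML-driven hooks:
    # schemas.py (hypothetical module next to nodes.py)
    import pandera as pa

    raw_schema = pa.DataFrameSchema(
        {
            "id": pa.Column(int, nullable=False),
            "amount": pa.Column(float, pa.Check.ge(0)),
        }
    )

    # nodes.py
    from .schemas import raw_schema


    def clean_data(raw_df):
        validated = raw_schema.validate(raw_df)  # raises a SchemaError on violations
        return validated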
  • e

    Esteban Obando

    07/11/2023, 9:36 PM
    Hi Team, I need some guidance here. I'm trying to execute a kedro project on databricks, but the entire project is a wheel file. I'm currently using the main method to execute it, but it fails at the very end, after the pipeline completes, in all cases. Is there a way to fix that?
  • e

    Esteban Obando

    07/11/2023, 9:36 PM
    here is the error I'm having
  • e

    Esteban Obando

    07/11/2023, 9:37 PM
    ---------------------------------------------------------------------------
    Exit                                      Traceback (most recent call last)
    File /databricks/python/lib/python3.10/site-packages/click/core.py:1063, in BaseCommand.main(self, args, prog_name, complete_var, standalone_mode, windows_expand_args, **extra)
       1056         # it's not safe to `ctx.exit(rv)` here!
       1057         # note that `rv` may actually contain data like "1" which
       1058         # has obvious effects
       (...)
       1061         # even always obvious that `rv` indicates success/failure
       1062         # by its truthiness/falsiness
    -> 1063         ctx.exit()
       1064 except (EOFError, KeyboardInterrupt):
    
    File /databricks/python/lib/python3.10/site-packages/click/core.py:681, in Context.exit(self, code)
        680 """Exits the application with a given exit code."""
    --> 681 raise Exit(code)
    
    Exit: 0
    
    During handling of the above exception, another exception occurred:
    
    SystemExit                                Traceback (most recent call last)
        [... skipping hidden 1 frame]
    
    File <command-991559434375355>:41
         39 params_string = ','.join(list_params)
    ---> 41 main(
         42     [
         43         "--env", conf,
         44         "--pipeline", pipeline,
         45         f"--params={params_string}"
         46     ]
         47 )  # o
    
    File /local_disk0/.ephemeral_nfs/envs/pythonEnv-2abca7aa-105b-47df-bf33-dd90d478d3bc/lib/python3.10/site-packages/encounter_kedro/__main__.py:47, in main(*args, **kwargs)
         46 run = _find_run_command(package_name)
    ---> 47 run(*args, **kwargs)
    
    File /databricks/python/lib/python3.10/site-packages/click/core.py:1128, in BaseCommand.__call__(self, *args, **kwargs)
       1127 """Alias for :meth:`main`."""
    -> 1128 return self.main(*args, **kwargs)
    
    File /databricks/python/lib/python3.10/site-packages/click/core.py:1081, in BaseCommand.main(self, args, prog_name, complete_var, standalone_mode, windows_expand_args, **extra)
       1080 if standalone_mode:
    -> 1081     sys.exit(e.exit_code)
       1082 else:
       1083     # in non-standalone mode, return the exit code
       1084     # note that this is only reached if `self.invoke` above raises
       (...)
       1089     # `ctx.exit(1)` and to `return 1`, the caller won't be able to
       1090     # tell the difference between the two
    
    SystemExit: 0
    
    During handling of the above exception, another exception occurred:
    
    AssertionError                            Traceback (most recent call last)
        [... skipping hidden 1 frame]
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2047, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
       2044 if exception_only:
       2045     stb = ['An exception has occurred, use %tb to see '
       2046            'the full traceback.\n']
    -> 2047     stb.extend(self.InteractiveTB.get_exception_only(etype,
       2048                                                      value))
       2049 else:
       2050     try:
       2051         # Exception classes can customise their traceback - we
       2052         # use this in IPython.parallel for exceptions occurring
       2053         # in the engines. This should return a list of strings.
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/ultratb.py:585, in ListTB.get_exception_only(self, etype, value)
        577 def get_exception_only(self, etype, value):
        578     """Only print the exception type and message, without a traceback.
        579 
        580     Parameters
       (...)
        583     value : exception value
        584     """
    --> 585     return ListTB.structured_traceback(self, etype, value)
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/ultratb.py:452, in ListTB.structured_traceback(self, etype, evalue, etb, tb_offset, context)
        449     chained_exc_ids.add(id(exception[1]))
        450     chained_exceptions_tb_offset = 0
        451     out_list = (
    --> 452         self.structured_traceback(
        453             etype, evalue, (etb, chained_exc_ids),
        454             chained_exceptions_tb_offset, context)
        455         + chained_exception_message
        456         + out_list)
        458 return out_list
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/ultratb.py:1118, in AutoFormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
       1116 else:
       1117     self.tb = tb
    -> 1118 return FormattedTB.structured_traceback(
       1119     self, etype, value, tb, tb_offset, number_of_lines_of_context)
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/ultratb.py:1012, in FormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
       1009 mode = self.mode
       1010 if mode in self.verbose_modes:
       1011     # Verbose modes need a full traceback
    -> 1012     return VerboseTB.structured_traceback(
       1013         self, etype, value, tb, tb_offset, number_of_lines_of_context
       1014     )
       1015 elif mode == 'Minimal':
       1016     return ListTB.get_exception_only(self, etype, value)
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/ultratb.py:865, in VerboseTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
        856 def structured_traceback(
        857     self,
        858     etype: type,
       (...)
        862     number_of_lines_of_context: int = 5,
        863 ):
        864     """Return a nice text document describing the traceback."""
    --> 865     formatted_exception = self.format_exception_as_a_whole(etype, evalue, etb, number_of_lines_of_context,
        866                                                            tb_offset)
        868     colors = self.Colors  # just a shorthand + quicker name lookup
        869     colorsnormal = colors.Normal  # used a lot
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/ultratb.py:799, in VerboseTB.format_exception_as_a_whole(self, etype, evalue, etb, number_of_lines_of_context, tb_offset)
        796 assert isinstance(tb_offset, int)
        797 head = self.prepare_header(etype, self.long_header)
        798 records = (
    --> 799     self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
        800 )
        802 frames = []
        803 skipped = 0
    
    File /databricks/python/lib/python3.10/site-packages/IPython/core/ultratb.py:854, in VerboseTB.get_records(self, etb, number_of_lines_of_context, tb_offset)
        848     formatter = None
        849 options = stack_data.Options(
        850     before=before,
        851     after=after,
        852     pygments_formatter=formatter,
        853 )
    --> 854 return list(stack_data.FrameInfo.stack_data(etb, options=options))[tb_offset:]
    
    File /databricks/python/lib/python3.10/site-packages/stack_data/core.py:578, in FrameInfo.stack_data(cls, frame_or_tb, options, collapse_repeated_frames)
        562 @classmethod
        563 def stack_data(
        564         cls,
       (...)
        568         collapse_repeated_frames: bool = True
        569 ) -> Iterator[Union['FrameInfo', RepeatedFrames]]:
        570     """
        571     An iterator of FrameInfo and RepeatedFrames objects representing
        572     a full traceback or stack. Similar consecutive frames are collapsed into RepeatedFrames
       (...)
        576     and optionally an Options object to configure.
        577     """
    --> 578     stack = list(iter_stack(frame_or_tb))
        580     # Reverse the stack from a frame so that it's in the same order
        581     # as the order from a traceback, which is the order of a printed
        582     # traceback when read top to bottom (most recent call last)
        583     if is_frame(frame_or_tb):
    
    File /databricks/python/lib/python3.10/site-packages/stack_data/utils.py:97, in iter_stack(frame_or_tb)
         95 while frame_or_tb:
         96     yield frame_or_tb
    ---> 97     if is_frame(frame_or_tb):
         98         frame_or_tb = frame_or_tb.f_back
         99     else:
    
    File /databricks/python/lib/python3.10/site-packages/stack_data/utils.py:90, in is_frame(frame_or_tb)
         89 def is_frame(frame_or_tb: Union[FrameType, TracebackType]) -> bool:
    ---> 90     assert_(isinstance(frame_or_tb, (types.FrameType, types.TracebackType)))
         91     return isinstance(frame_or_tb, (types.FrameType,))
    
    File /databricks/python/lib/python3.10/site-packages/stack_data/utils.py:176, in assert_(condition, error)
        174 if isinstance(error, str):
        175     error = AssertionError(error)
    --> 176 raise error
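    For what it's worth, the chained exceptions above end in SystemExit: 0, i.e. click's normal shutdown after a successful run rather than a pipeline failure; a hedged workaround when invoking the packaged entry point from a notebook is to swallow that exit (names follow the snippet in the traceback):
    try:
        main(
            [
                "--env", conf,
                "--pipeline", pipeline,
                f"--params={params_string}",
            ]
        )
    except SystemExit as exc:
        if exc.code not in (0, None):  # re-raise only genuine failures
            raise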
  • j

    J. Camilo V. Tieck

    07/11/2023, 10:49 PM
    hi all, I need some help! We are a startup working with real estate data, and we are defining the deployment strategy for kedro on AWS. I would love to discuss this with some experienced kedro users. This is our use case: we have a set of kedro projects with different models and processing pipelines. A typical project for us has the following components: [1] a kedro data processing pipeline [2] a kedro model training and evaluation pipeline [3] an inference/prediction pipeline [4] there is always an S3 bucket for each project [5] source data is usually in our data lake in AWS, i.e. also an S3 bucket. Our models are not that big, so we usually run [1] and [2] locally, and the artifacts (model.pkl, and other stuff) are generated and stored in [4], the S3 location. So far so good. With [3], we have different requirements. [3.0] we have some post-processing steps here: running the prediction on the model and then, depending on the results from the model, extracting some data from other datasets for that prediction and formatting some of the data. We don’t want to duplicate this code for [3.1] and [3.2]. [3.1] scheduled execution: we need to run the inference pipeline every x time period, let’s say every month. So, every month the generated data about real estate properties [5] needs to be processed, and the output dataset is stored in S3 [4]. -> We have tried a docker image and EC2, with a scheduled task, which kind of works, but this doesn’t play well with [3.2]. [3.2] on-demand execution: we need to run the ‘same’ pipeline when a user registers a new property on our website. Basically we need the same output information as in [3.1] but for one single registry. -> We have tried a lambda function that has the code for [3.0] and has access to the S3 location [4]. This works, but the code for the post-processing [3.0] is duplicated. Questions: what would be a good architecture to solve this without duplicating the code in [3.0], the inference pipeline? How can we have an inference pipeline that can be run both on demand and on a schedule? What AWS services are recommended for this? Thank you very much for any help or advice you can give me! Bests, camilo
  • a

    Adith George

    07/12/2023, 6:45 AM
    Hi All, I am running a kedro pipeline alongside databricks notebooks, using the below code in a databricks notebook.
    reload_kedro("../../../", env="base", extra_params=extra_params)
    
     with KedroSession.create(project_path="../", env="base", extra_params=extra_params) as session:
            session.run(pipeline_name="kedro_pipeline_name")
    It runs a kedro session with the input parameters. Is there a way the kedro pipeline can return params/values back to the databricks notebook? Something similar to a
    return
    statement.
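    session.run() does hand something back: the dictionary of pipeline outputs that are not persisted through a catalog entry (i.e. those left as MemoryDataSets). A sketch under that assumption:
    with KedroSession.create(project_path="../", env="base", extra_params=extra_params) as session:
        outputs = session.run(pipeline_name="kedro_pipeline_name")

    # `outputs` maps free output names to their values; "my_free_output" is an illustrative name
    result = outputs["my_free_output"]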
  • z

    Zemeio

    07/12/2023, 9:35 AM
    Hey guys. How do I access the context/catalog during a node execution?
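    Nodes are deliberately kept unaware of the catalog, but a hook can capture it at startup and expose it to code that really needs it; a minimal sketch (register the class in HOOKS in settings.py):
    from kedro.framework.hooks import hook_impl


    class CatalogCaptureHooks:
        """Stores a reference to the catalog so it can be reached during a run."""

        catalog = None

        @hook_impl
        def after_catalog_created(self, catalog):
            CatalogCaptureHooks.catalog = catalog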
  • e

    Eduardo Romero López

    07/12/2023, 12:36 PM
    I am trying to register a "plotly.PlotlyDataSet" type in the catalog and get this error:
    DatasetError: Failed while saving data to data set
    PlotlyDataSet(filepath=C:/Users/eromero/Documents/Proyectos_bilbomatica_altia/0020004_Soporte_Analitica_Avanzada_e_Inte
    ligencia_Artificial_dedalo_altia/componentes-IA-Altia-git/kedro/queres/data/08_reporting/plotly_express_reclamaciones_p
    or_fecha.json, load_args={}, plotly_args={'fig': {'orientation': h, 'x': fecha, 'y': numero}, 'type': line},
    protocol=file, save_args={}).
    Value of 'x' is not the name of a column in 'data_frame'. Expected one of [0] but received: fecha
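    The message says Plotly expected a column named fecha but the DataFrame it received only has the single column 0, so the node feeding this dataset likely needs to return an explicit two-column frame; an illustrative reshape (function and input names assumed):
    import pandas as pd


    def reclamaciones_por_fecha(reclamaciones: pd.DataFrame) -> pd.DataFrame:
        # Produce explicit `fecha` / `numero` columns matching plotly_args in the catalog entry
        counts = reclamaciones.groupby("fecha").size()
        return counts.rename("numero").reset_index()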
  • e

    Eduardo Romero López

    07/12/2023, 12:36 PM