# questions
  • Elias (10/18/2022, 1:00 PM)
    Hey Team, I was wondering if there is an elegant way to overwrite parameters dynamically. I am instantiating a pipeline 12 times, but each instance needs to run with a different value of a parameter called date_max, e.g. "07/01/22" for the first one, with each subsequent one decremented by a month, e.g. "06/01/22", and so on. The pipelines are generated dynamically from a template, and ideally I would just pass in the adjusted parameter.
  • Elias (10/18/2022, 1:00 PM)
    parameters.yml:
    ```yaml
    t_-0:
      filters:
        date_max: 2022/07/01
    t_-1:
      filters:
        date_max: 2022/06/01
    ```
    I want to avoid doing this, as I would need to pass 12 or more variables on each invocation, when they all actually depend on the first one.
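    A minimal sketch of one way to avoid spelling out all twelve entries: keep a single date_max value and derive the other eleven dates in Python when the pipelines are instantiated. The helper name and the month arithmetic are illustrative, not from the thread, and it assumes a day that exists in every month (e.g. the 1st).
    ```python
    from datetime import date

    def decrement_months(start, count):
        """Return `count` date strings (YYYY/MM/DD), starting at `start`
        and stepping back one month at a time."""
        year, month, day = (int(part) for part in start.split("/"))
        dates = []
        for _ in range(count):
            dates.append(date(year, month, day).strftime("%Y/%m/%d"))
            month -= 1
            if month == 0:
                year, month = year - 1, 12
        return dates

    # Build the per-instance parameter overrides from the single base date.
    overrides = {
        f"t_-{i}": {"filters": {"date_max": d}}
        for i, d in enumerate(decrement_months("2022/07/01", 12))
    }
    ```
    The overrides could then be fed to each pipeline instance however the project already wires its template (for example via extra_params or namespaced parameters).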
  • user (10/19/2022, 2:38 PM)
    How can I run `catalog.load` in a non-IPython context? In IPython I can run data = catalog.load('my_dataset') to load a dataset specified as 'my_dataset' in the catalog.yml file. What's the equivalent in a Python script? What do I need to import?
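    A sketch of how this can be done from a plain Python script, assuming it is run from the project root and a 0.18-era Kedro API (adjust for other versions):
    ```python
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path.cwd()          # assumes the script runs from the project root
    bootstrap_project(project_path)    # reads pyproject.toml and registers the project

    with KedroSession.create(project_path=project_path) as session:
        context = session.load_context()
        data = context.catalog.load("my_dataset")
    ```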
  • Sean Westgate (10/19/2022, 3:00 PM)
    Hi Team, working my way through the spaceflights tutorial I found that running `kedro build-docs` would pull down the latest Jinja2 version, 3.1.2, which then caused an error because `contextfunction` was removed in version 3.1.0. I manually downgraded Jinja2 to version 3.0.3 and all worked fine. Not sure if it is just me or a general issue. Is posting bugs like this here the right thing to do? I had a look at your open issues on the GitHub repo but couldn't find anything related.
  • Shubham Gupta (10/20/2022, 3:40 AM)
    Hi Team,
  • Shubham Gupta (10/20/2022, 3:43 AM)
    We are trying to build an API using Kedro. I understand that Kedro loads data lazily. Is there a way to persist this lookup data in the DataCatalog? We might be able to keep the API robust and fast using a combination of lazy and eager loading on the DataCatalog.
  • Shubham Gupta (10/20/2022, 3:44 AM)
    And runners can take care of the rest.
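    One option that might fit here (a sketch, not necessarily what the API should end up using): wrap the lookup dataset in Kedro's `CachedDataSet`, so the first load hits disk and later loads within the same process come from memory. The filepath and dataset name below are illustrative.
    ```python
    from kedro.extras.datasets.pandas import ParquetDataSet
    from kedro.io import CachedDataSet, DataCatalog

    # Lazy on first access, cached in memory afterwards.
    lookup = CachedDataSet(ParquetDataSet(filepath="data/01_raw/lookup.parquet"))

    catalog = DataCatalog({"lookup": lookup})
    df = catalog.load("lookup")   # reads from disk
    df = catalog.load("lookup")   # served from the in-memory cache
    ```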
  • Suryansh Soni (10/21/2022, 1:56 PM)
    Hello Team. Does anyone have experience deploying Kedro pipelines to AWS Step Functions? Please let me know, I need some urgent help with that.
  • Ian Whalen (10/21/2022, 5:01 PM)
    Just in time for Halloween, I'm trying to do some Jinja black magic 🧙 High level: I want to add a global variable to `globals_dict` in `settings.py` and use it in a loop in my catalog. See thread for an example. Any ideas?
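    For reference, a minimal sketch of where `globals_dict` plugs in with `TemplatedConfigLoader` (Kedro 0.18-era API; the variable name `datasets_to_build` is made up for illustration):
    ```python
    # settings.py
    from kedro.config import TemplatedConfigLoader

    CONFIG_LOADER_CLASS = TemplatedConfigLoader
    CONFIG_LOADER_ARGS = {
        # values merged from globals.yml files ...
        "globals_pattern": "*globals.yml",
        # ... plus extra globals defined in code, available to catalog templates
        # (e.g. referenced as ${datasets_to_build} or from a Jinja2 loop)
        "globals_dict": {"datasets_to_build": ["alpha", "beta", "gamma"]},
    }
    ```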
  • Jordan (10/21/2022, 8:49 PM)
    Hi friends, I'm looking for some advice. I need to be able to batch process different custom partitioned datasets using the same modular pipeline, whenever required. It's quite tedious to write `catalog.yml` entries for the inputs and outputs of each batch process, so I was hoping to implement a solution using hooks that avoids this tedium. If possible, I would like the solution to: 1. Dynamically populate the catalog with input and output entries for each partitioned dataset. 2. Instantiate and run the modular pipeline using each partitioned dataset's dynamically populated catalog entries. 3. Make the output datasets of each run available via the data catalog at any time. This should (maybe) be possible with some combination of the `after_context_created`, `after_catalog_created` and `before_pipeline_run` hooks, but I'm unsure how to actually implement it. Any guidance would be much appreciated, cheers.
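    As a starting point, a hedged sketch of step 1 only: an `after_catalog_created` hook that registers a `PartitionedDataSet` input/output pair per batch. The batch names, paths and dataset types are placeholders.
    ```python
    from kedro.framework.hooks import hook_impl
    from kedro.io import PartitionedDataSet

    BATCHES = ["batch_a", "batch_b"]  # placeholder batch names


    class DynamicCatalogHook:
        @hook_impl
        def after_catalog_created(self, catalog):
            for batch in BATCHES:
                catalog.add(
                    f"{batch}.input",
                    PartitionedDataSet(
                        path=f"data/01_raw/{batch}",
                        dataset="pandas.CSVDataSet",
                    ),
                )
                catalog.add(
                    f"{batch}.output",
                    PartitionedDataSet(
                        path=f"data/07_model_output/{batch}",
                        dataset="pandas.CSVDataSet",
                    ),
                )
    ```
    The hook would still need to be registered in `HOOKS` in settings.py, and the modular pipeline instances wired to these entry names (e.g. via namespaces).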
  • user (10/23/2022, 7:58 AM)
    How to show a Plotly chart in Kedro: I am trying to use the data science tool Kedro according to this tutorial. I followed the instructions (wrote config.yaml, node.py, pipeline.py etc., exactly as in the documentation) and could run kedro run successfully. As a next step, I tried kedro viz; it shows the pipelines, but I cannot see the Plotly chart. Here is the result of the visualization. Please see the left...
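    For kedro-viz to render a chart, the node usually has to return a Plotly figure that is saved to one of the plotly dataset types. A hedged sketch (the dataset name and columns are illustrative, not from the tutorial):
    ```python
    import pandas as pd
    import plotly.express as px


    def make_price_chart(companies: pd.DataFrame):
        """Return a Plotly figure; save it via a plotly dataset so kedro viz can render it."""
        fig = px.bar(companies, x="company_rating", y="id")
        return fig

    # catalog.yml (illustrative):
    #
    # price_chart:
    #   type: plotly.JSONDataSet
    #   filepath: data/08_reporting/price_chart.json
    ```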
  • user (10/24/2022, 8:18 AM)
    How to generate Kedro pipelines automatically (like DataEngineerOne does)? Having seen the DataEngineerOne video "How To Use a Parameter Range to Generate Pipelines Automatically", I want to automate a pipeline that simulates an electronic circuit. I want to do a grid search over multiple central frequencies of a bandpass filter, and run the simulation pipeline for each one. In the pipeline registry, the grid search parameters are passed to the create_pipeline() function's kwargs. #...
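    A rough sketch of that pattern in pipeline_registry.py, assuming `create_pipeline()` accepts the swept value as a keyword argument (the module path, frequencies and names are illustrative):
    ```python
    from kedro.pipeline import Pipeline

    from my_project.pipelines.simulate import create_pipeline  # hypothetical module


    def register_pipelines():
        frequencies = [1.0e6, 2.0e6, 5.0e6]  # illustrative grid
        pipelines = {
            f"simulate_{i}": create_pipeline(central_frequency=freq)
            for i, freq in enumerate(frequencies)
        }
        # run everything with a plain `kedro run`
        pipelines["__default__"] = sum(pipelines.values(), Pipeline([]))
        return pipelines
    ```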
  • Yetunde (10/24/2022, 8:52 AM)
    has renamed the channel from ‘support’ to ‘questions’
  • user (10/25/2022, 8:18 AM)
    Can't run KedroSession with `from_inputs` parameter: ValueError: Pipeline does not contain data_sets named [...] In a Jupyter notebook, when I run session.run(pipeline_name='sim', from_inputs=['measurements', 'params:simulation']), passing datasets and params specified in catalog.yaml, everything works fine. However, when I want to run it with a dataset that I added during the session, a ValueError occurs:
    ```
    >> ds = GenMsmtsDataSet()
    >> catalog.add('ipy_msmts', ds)
    >> session.run(pipeline_name='sim', from_inputs=['ipy_msmts', 'params:simulation'])
    ValueError: Pipeline does not contain data_sets named...
    ```
  • Toni (10/25/2022, 10:35 AM)
    Hi community! I was wondering if I can save an output of the same node in two different formats. For instance:
    ```python
    node(
        func=some_function,
        inputs="some_input",
        outputs="the_output",
        name="node",
    ),
    ```
    ```yaml
    the_output:
      type: pandas.CSVDataSet
      filepath: data/output_csv.csv

    the_output:
      type: pandas.ParquetDataSet
      filepath: data/output_parquet.parquet
    ```
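    Duplicate keys like `the_output` aren't allowed in the catalog, so one workaround (a sketch, not the only option) is a tiny fan-out node that returns the same object twice and maps it to two differently named catalog entries, one per format:
    ```python
    from kedro.pipeline import node


    def fan_out(df):
        """Return the same dataframe twice so it can be bound to two catalog entries."""
        return df, df


    duplicate_node = node(
        func=fan_out,
        inputs="the_output",
        outputs=["the_output_csv", "the_output_parquet"],
        name="duplicate_output",
    )

    # catalog.yml would then declare the_output_csv as pandas.CSVDataSet and
    # the_output_parquet as pandas.ParquetDataSet, each with its own filepath.
    ```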
  • Luis Gustavo Souza (10/25/2022, 1:03 PM)
    Hello, everyone! I need to pass some complex parameters to the Kedro CLI (lists, dicts, lists of dicts; e.g. --params test:["a", "b", "c"]). Does anyone know how I can achieve that?
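    If the CLI string syntax gets too awkward, one hedged workaround is to skip the CLI parsing entirely and trigger the run from Python, where extra_params accepts arbitrary structures (the parameter names below are illustrative):
    ```python
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(Path.cwd())

    extra_params = {
        "test": ["a", "b", "c"],                    # list
        "thresholds": {"low": 0.1, "high": 0.9},    # dict
        "grids": [{"x": 1}, {"x": 2}],              # list of dicts
    }

    with KedroSession.create(project_path=Path.cwd(), extra_params=extra_params) as session:
        session.run()
    ```
    Another documented route is `kedro run --config config.yml`, where the run arguments, including params, live in a YAML file.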
  • Yuchu Liu (10/25/2022, 1:40 PM)
    Hello everyone! I am trying to set up Kedro on my machine for an existing project and pipeline. My colleague and I have similar dependencies, and the project works perfectly fine on their machine. The error I get is related to writing a Parquet file. To debug, I have: • validated that PySpark works when reading and writing a Parquet file, including overwriting an existing file • loaded Kedro Jupyter Lab and tried to load and write a Parquet file; loading works, but writing gives me the same error message as when I run the pipeline (Failed while saving data to data set).
  • Danhua Yan (10/25/2022, 2:00 PM)
    Hi team, a question that is probably about how to create a custom dataset with certain read/write behaviour. Details below: I am trying to use Kedro to read, in pandas, a `delta` dataset created by Databricks. The current configs look like this:
    ```yaml
    _pandas_parquet: &_pandas_parquet
      type: pandas.ParquetDataSet

    _spark_parquet: &_delta_parquet
      type: spark.SparkDataSet
      file_format: delta
    ```
    What I want to achieve:
    ```yaml
    node1:
      outputs: dataset@spark

    node2:
      inputs: dataset@pandas
    ```
    Unfortunately `pandas` doesn't support reading `delta` as is. I found the workaround below, which requires additional steps: https://mungingdata.com/pandas/read-delta-lake-dataframe/ How should I create a dataset that does something like this internally when being loaded?
    ```python
    from deltalake import DeltaTable
    dt = DeltaTable("resources/delta/1")
    df = dt.to_pandas()
    ```
    I tried looking into https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction but nothing is mentioned about using pandas to interact with `delta`. Thank you!
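    A hedged sketch of a custom dataset whose `_load` does exactly that (the class name is illustrative; `_save` is left unimplemented since the writes happen on the Spark side):
    ```python
    import pandas as pd
    from deltalake import DeltaTable
    from kedro.io import AbstractDataSet


    class PandasDeltaDataSet(AbstractDataSet):
        """Read a Delta table into a pandas DataFrame via the deltalake package."""

        def __init__(self, filepath: str):
            self._filepath = filepath

        def _load(self) -> pd.DataFrame:
            return DeltaTable(self._filepath).to_pandas()

        def _save(self, data) -> None:
            raise NotImplementedError("This dataset is read-only on the pandas side.")

        def _describe(self) -> dict:
            return {"filepath": self._filepath}
    ```
    Registered in the catalog under its import path (e.g. a `type:` pointing at wherever the class lives in the project), it could back the `dataset@pandas` side of the transcoding pair.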
  • Denis Araujo da Silva (10/25/2022, 2:40 PM)
    What's the future for `kedro build-reqs`? Just saw the message that it will be deprecated in 0.19.
  • Sasha Collin (10/26/2022, 9:53 AM)
    Hey! I have a question about best practice when dealing with several splitting methods for the same dataset. I was thinking about a structure as follows:
    ```
    - 05_model_input (folder)
    --- master_table_1 (folder)
    ------ master_table_1.csv (file)
    ------ split_1 (folder)
    --------- X_train.csv
    --------- X_test.csv
    --------- y_train.csv
    --------- y_test.csv
    ------ split_2 (folder)
    --------- X_train.csv
    --------- X_test.csv
    --------- y_train.csv
    --------- y_test.csv
    ```
    Would you say this is good practice? Or would you advise not saving the splits and instead parametrising the split method in parameters.yml, for instance? Thanks a lot for your help!
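    If the second option is chosen, a minimal sketch of a parameter-driven split node (assuming scikit-learn; the parameter names are illustrative):
    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split


    def split_data(master_table: pd.DataFrame, params: dict):
        """Split the master table according to parameters.yml instead of persisting splits."""
        X = master_table.drop(columns=[params["target"]])
        y = master_table[params["target"]]
        # returns X_train, X_test, y_train, y_test in that order
        return train_test_split(
            X, y, test_size=params["test_size"], random_state=params["random_state"]
        )

    # parameters.yml (illustrative):
    # split:
    #   target: price
    #   test_size: 0.2
    #   random_state: 42
    ```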
  • Nichita Morcotilo (10/26/2022, 10:02 AM)
    Hey! I have a question regarding Kedro 0.17.7. I have three conf folders: `conf/base`, `conf/test`, and `conf/local` (empty directory). My `conf/test/pipelines.yml` is an empty file, and executing `kedro run --env=test` results in folders being created in the `/data` directory for each of the nodes listed in `conf/base/pipelines.yml`. Is this expected behaviour for Kedro, i.e. that an environment with an empty `pipelines.yml` falls back to the base env? Thank you!
  • Erwin (10/26/2022, 11:43 AM)
    Hi! We are having an issue with GitHub CI. We have a Kedro project with experiment tracking enabled, so every run saves some information to the DB, and this always works in local mode. But when we test a minimal pipeline in GitHub Actions (just to make sure there are no circular dependencies, etc.), it fails because the repo is cloned at a specific commit rather than a branch (since it is a CI test run after a PR is opened). The key point is that the run cannot be associated with a branch, so this fails:
    branch = git.Repo(search_parent_directories=True).a
    Is there any way to disable experiment tracking at runtime? Or what would be a better approach to check that Kedro can at least create the graph and detect circular dependencies? Detailed log:
    ```
    Run kedro run --tag tag_dict
    As an open-source project, we collect usage analytics. 
    We cannot see nor store information contained in a Kedro project. 
    You can find out more by reading our privacy notice: 
    <https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice> 
    Do you opt into usage analytics?  [y/N]: [10/25/22 20:27:24] WARNING  Failed to confirm consent. No data    plugin.py:210
                                 was sent to Heap. Exception:                       
    [10/25/22 20:27:24] INFO     Kedro project            session.py:343
    
    
    
    
    Pipelines started
    
            
            
    [10/25/22 20:27:24] INFO     Seeding sklearn, numpy and random   seed_file.py:41
                                 libraries with the seed 42                         
                        INFO     Loading data from               data_catalog.py:343
                                 'tag_dictionary'                                   
                                 (ExcelDataSet)...                                  
    [10/25/22 20:27:25] INFO     Running node: create_td:                node.py:327
                                 create_td([tag_dictionary]) -> [td]                
                        INFO     Saving data to 'td'             data_catalog.py:382
                                 (PickleDataSet)...                                 
                        INFO     Completed 1 out of 1 tasks  sequential_runner.py:85
                        INFO     Pipeline execution completed           runner.py:90
                                 successfully.                                      
    ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
    │ /opt/hostedtoolcache/Python/3.8.0/x64/bin/kedro:8 in <module>                │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/cli/cli.py:211 in main                                                 │
    │                                                                              │
    │   208 │   """                                                                │
    │   209 │   _init_plugins()                                                    │
    │   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                 │
    │ ❱ 211 │   cli_collection()                                                   │
    │   212                                                                        │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1130 in __call__                                                         │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/cli/cli.py:139 in main                                                 │
    │                                                                              │
    │   136 │   │   )                                                              │
    │   137 │   │                                                                  │
    │   138 │   │   try:                                                           │
    │ ❱ 139 │   │   │   super().main(                                              │
    │   140 │   │   │   │   args=args,                                             │
    │   141 │   │   │   │   prog_name=prog_name,                                   │
    │   142 │   │   │   │   complete_var=complete_var,                             │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1055 in main                                                             │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1657 in invoke                                                           │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:1404 in invoke                                                           │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/click/core │
    │ .py:760 in invoke                                                            │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/cli/project.py:366 in run                                              │
    │                                                                              │
    │   363 │   node_names = _get_values_as_tuple(node_names) if node_names else n │
    │   364 │                                                                      │
    │   365 │   with KedroSession.create(env=env, extra_params=params) as session: │
    │ ❱ 366 │   │   session.run(                                                   │
    │   367 │   │   │   tags=tag,                                                  │
    │   368 │   │   │   runner=runner(is_async=is_async),                          │
    │   369 │   │   │   node_names=node_names,                                     │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/session/session.py:293 in __exit__                                     │
    │                                                                              │
    │   290 │   def __exit__(self, exc_type, exc_value, tb_):                      │
    │   291 │   │   if exc_type:                                                   │
    │   292 │   │   │   self._log_exception(exc_type, exc_value, tb_)              │
    │ ❱ 293 │   │   self.close()                                                   │
    │   294 │                                                                      │
    │   295 │   def run(  # pylint: disable=too-many-arguments,too-many-locals     │
    │   296 │   │   self,                                                          │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro/fram │
    │ ework/session/session.py:285 in close                                        │
    │                                                                              │
    │   282 │   │   if `save_on_close` attribute is True.                          │
    │   283 │   │   """                                                            │
    │   284 │   │   if self.save_on_close:                                         │
    │ ❱ 285 │   │   │   self._store.save()                                         │
    │   286 │                                                                      │
    │   287 │   def __enter__(self):                                               │
    │   288 │   │   return self                                                    │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro_viz/ │
    │ integrations/kedro/sqlite_store.py:68 in save                                │
    │                                                                              │
    │   65 │   │   engine, session_class = create_db_engine(self.location)         │
    │   66 │   │   Base.metadata.create_all(bind=engine)                           │
    │   67 │   │   database = next(get_db(session_class))                          │
    │ ❱ 68 │   │   session_store_data = RunModel(id=self._session_id, blob=<http://self.to|self.to> │
    │   69 │   │   database.add(session_store_data)                                │
    │   70 │   │   database.commit()                                               │
    │   71                                                                         │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/kedro_viz/ │
    │ integrations/kedro/sqlite_store.py:52 in to_json                             │
    │                                                                              │
    │   49 │   │   │   │   try:                                                    │
    │   50 │   │   │   │   │   import git  # pylint: disable=import-outside-toplev │
    │   51 │   │   │   │   │                                                       │
    │ ❱ 52 │   │   │   │   │   branch = git.Repo(search_parent_directories=True).a │
    │   53 │   │   │   │   │   value["branch"] = branch.name                       │
    │   54 │   │   │   │   except ImportError as exc:  # pragma: no cover          │
    │   55 │   │   │   │   │   logger.warning("%s:%s", exc.__class__.__name__, exc │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/git/repo/b │
    │ ase.py:865 in active_branch                                                  │
    │                                                                              │
    │    862 │   │   :raises TypeError: If HEAD is detached                        │
    │    863 │   │   :return: Head to the active branch"""                         │
    │    864 │   │   # reveal_type(self.head.reference)  # => Reference            │
    │ ❱  865 │   │   return self.head.reference                                    │
    │    866 │                                                                     │
    │    867 │   def blame_incremental(self, rev: str | HEAD, file: str, **kwargs: │
    │    868 │   │   """Iterator for blame information for the given file at the g │
    │                                                                              │
    │ /opt/hostedtoolcache/Python/3.8.0/x64/lib/python3.8/site-packages/git/refs/s │
    │ ymbolic.py:309 in _get_reference                                             │
    │                                                                              │
    │   306 │   │   │   to a reference, but to a commit"""                         │
    │   307 │   │   sha, target_ref_path = self._get_ref_info(self.repo, self.path │
    │   308 │   │   if target_ref_path is None:                                    │
    │ ❱ 309 │   │   │   raise TypeError("%s is a detached symbolic reference as it │
    │   310 │   │   return self.from_path(self.repo, target_ref_path)              │
    │   311 │                                                                      │
    │   312 │   def set_reference(                                                 │
    ╰──────────────────────────────────────────────────────────────────────────────╯
    TypeError: HEAD is a detached symbolic reference as it points to 
    'b508cdadd62cf912ab26104388cef8e08d1066eb'
    Error: Process completed with exit code 1.
    ```
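    One possible way to switch tracking off for CI runs (a sketch only; it assumes the experiment-tracking store is configured via SESSION_STORE_CLASS in settings.py and that the CI environment sets the usual CI variable):
    ```python
    # settings.py
    import os
    from pathlib import Path

    from kedro.framework.session.store import BaseSessionStore

    if os.environ.get("CI"):
        # Plain in-memory store: no SQLite DB and no git lookups during CI runs.
        SESSION_STORE_CLASS = BaseSessionStore
    else:
        from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

        SESSION_STORE_CLASS = SQLiteStore
        SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
    ```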
  • Suryansh Soni (10/26/2022, 3:17 PM)
    Hello! Is there any update on the Kedro deployment issue with AWS Step Functions?
  • Michał Stachowicz (10/27/2022, 9:56 AM)
    In my project I would like to have a pipeline that uses a pre-trained model stored on MLflow. The problem I am facing is that I don't know how to carry the transformation information from preprocessing over to inference. For example, for normalisation we need the training set's minimum and maximum values to be able to transform the test set in the same way. Is there a solution to this issue? Do you recommend using sklearn Pipelines?
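    One common pattern (a sketch, assuming scikit-learn and a pickle-backed catalog entry for the fitted scaler) is to fit the transformer in one node, persist it, and reuse it in a later node rather than recomputing the statistics:
    ```python
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler


    def fit_scaler(train: pd.DataFrame) -> MinMaxScaler:
        """Fit on training data only; persist via e.g. a pickle.PickleDataSet entry."""
        return MinMaxScaler().fit(train)


    def apply_scaler(data: pd.DataFrame, scaler: MinMaxScaler) -> pd.DataFrame:
        """Reuse the fitted min/max so test data is transformed identically."""
        return pd.DataFrame(scaler.transform(data), columns=data.columns, index=data.index)
    ```
    Bundling the scaler and the model into a single sklearn Pipeline and logging that to MLflow is the other common option; either way the fitted transformer travels with the model instead of being refit.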
  • Zirui Xu (10/27/2022, 4:34 PM)
    Hello team. Is there a way to use `kedro.extras.datasets.spark.SparkDataSet` without installing the dependencies specified in `kedro[spark]`? I am on a Databricks cluster where the installation of pyspark is blocked.
  • Eivind Samseth (10/28/2022, 11:05 AM)
    Any news on Python 3.11 support? I saw on Hacker News that it brings quite a few speed-ups.
  • Seth (10/28/2022, 2:45 PM)
    Is there a way to access the config values of my conf/base/globals.yml inside a node? I know I can load them by creating a ConfigLoader, but since they are used to create the DataCatalog, I assume they have already been read into memory and may be accessible some other way?
  • Lorenzo Castellino (10/31/2022, 8:48 AM)
    Is there a suggested way to set a numpy random state globally? I thought about using a global parameter, but it might be verbose to add it to all the functions that require it. I was also thinking about a custom hook that would execute `np.random.set_state()` before the nodes that require it, but I would like to hear what your solution to "reproducible randomness" looks like 🙂
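    A minimal sketch of the hook idea, seeding once per run from a parameter rather than per node (the parameter name `seed` and the default of 42 are illustrative):
    ```python
    import random

    import numpy as np
    from kedro.framework.hooks import hook_impl


    class SeedEverythingHook:
        @hook_impl
        def before_pipeline_run(self, run_params):
            # e.g. `kedro run --params seed:1234`; falls back to 42 otherwise
            seed = (run_params.get("extra_params") or {}).get("seed", 42)
            random.seed(seed)
            np.random.seed(seed)
    ```
    The hook is registered via HOOKS in settings.py; seeding before every stochastic node instead would need `before_node_run` plus a way to mark which nodes need it.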
  • Jordan (10/31/2022, 5:03 PM)
    What is the best way to make credentials from `credentials.yml` available for use in a hook? I know I can just use the YAML loader, but this doesn't feel correct.
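    One hedged option: the `after_catalog_created` hook already receives the parsed credentials for the active environment as `conf_creds` (0.18-era hook spec), so no separate YAML loading is needed. The credential key below is illustrative.
    ```python
    from kedro.framework.hooks import hook_impl


    class CredentialsAwareHook:
        def __init__(self):
            self._credentials = {}

        @hook_impl
        def after_catalog_created(self, conf_creds):
            # conf_creds is the dict parsed from credentials.yml for the active env
            self._credentials = conf_creds or {}

        @hook_impl
        def before_pipeline_run(self, run_params):
            token = self._credentials.get("my_api", {}).get("token")  # illustrative key
            # ... use the credential here
    ```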
  • Pedro Arthur (10/31/2022, 7:52 PM)
    Hi! Is there a way to cache node outputs and avoid re-running a node if the cached result already exists?