# questions
  • Mohammed Samir
    01/29/2023, 11:06 AM
    Hello, I have a quick question about pipeline and node ordering. I have created 5 pipelines, each with its own nodes. Whenever I run the full environment with
    `kedro run --env env_name`
    the nodes from different pipelines are interleaved in the running order, meaning it runs as below:
    pipeline 1 --> Node 1
    pipeline 2 --> Node 1
    pipeline 2 --> Node 2
    pipeline 3 --> Node 1
    pipeline 1 --> Node 2
    pipeline 3 --> Node 2
    (Note: the node order within each pipeline is correct, but Kedro runs a node from each pipeline in turn.) However, I want them to run in the order below:
    pipeline 1 --> Node 1
    pipeline 1 --> Node 2
    pipeline 2 --> Node 1
    pipeline 2 --> Node 2
    pipeline 3 --> Node 1
    pipeline 3 --> Node 2
    I have the following config in pipeline_registry:
    `return {"__default__": pipeline1 + pipeline2 + pipeline3 + pipeline4 + pipeline5}`
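    For context: Kedro schedules nodes by their dataset dependencies, and summing pipelines only merges their node sets, so the interleaving above is expected. A minimal sketch of one way to force the sequence, assuming hypothetical marker dataset names, is to give each pipeline's first node an input produced by the previous pipeline:
    ```python
    # A sketch: "p1_done" is a hypothetical marker dataset used purely
    # to create a dependency between the two pipelines.
    from kedro.pipeline import Pipeline, node

    def finish_p1():
        return True  # marker emitted by pipeline 1's last node

    def start_p2(p1_done):
        ...  # real work; the input exists only to enforce ordering

    pipeline1 = Pipeline([node(finish_p1, inputs=None, outputs="p1_done", name="p1_last")])
    pipeline2 = Pipeline([node(start_p2, inputs="p1_done", outputs=None, name="p2_first")])
    ```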
  • Rob
    01/29/2023, 6:21 PM
    Hi again everyone! When I set a `spark.yml` file in the configuration folder, to run the code from a `databricks cluster` (using a workflow job, so my `run.py` is in the DBFS), is it required to specify the Spark master URL? Or is there an alternative, such as omitting the `spark.yml`, to let Databricks manage my configuration? (I mean, to omit the manual setting of the master URL.) Thanks in advance!
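    For reference, the SparkHooks example in the Kedro docs builds the session with SparkSession.builder...getOrCreate(), which on Databricks attaches to the cluster's already-running session, so the master URL can simply be left out of spark.yml. A minimal sketch (the settings shown are illustrative, not required):
    ```yaml
    # conf/base/spark.yml -- no spark.master entry; on Databricks
    # getOrCreate() picks up the cluster's existing session and master
    spark.sql.shuffle.partitions: 200
    spark.sql.execution.arrow.pyspark.enabled: true
    ```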
  • Sergei Benkovich
    01/29/2023, 8:01 PM
    Is there any integration with Weights & Biases? Any ideas on how I can run several runs with varying configuration, where each one would be logged by W&B?
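    There is no official plugin referenced in this thread, but a run-level hook is one plausible integration point. A sketch, assuming the wandb client library and a hypothetical project name (register the class in settings.py via HOOKS):
    ```python
    import wandb
    from kedro.framework.hooks import hook_impl

    class WandbHooks:
        @hook_impl
        def before_pipeline_run(self, run_params):
            # run_params carries the CLI arguments, incl. --params overrides
            wandb.init(project="my-kedro-project", config=run_params.get("extra_params") or {})

        @hook_impl
        def after_pipeline_run(self, run_params):
            wandb.finish()
    ```
    Several runs with varying configuration would then be separate `kedro run --params ...` invocations, each logged as its own W&B run.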
  • Antoine Bon
    01/30/2023, 9:00 AM
    Hi, I've been trying to use the `load_version` functionality with a catalog that is built programmatically with a hook, but I fail to do so. From my understanding of the code this is not possible, so I raised the following ticket: https://github.com/kedro-org/kedro/issues/2233. Unless someone knows of a way to do it?
  • Massinissa Saïdi
    01/30/2023, 4:17 PM
    Hello! Has anyone tried kedro + SageMaker + a custom Docker image? Looking closer, I have the impression that it's quite difficult to achieve given the way SageMaker is run, and someone has already faced this problem without an answer. If anyone has any tips I'd love to hear them, thanks 🙂
  • Massinissa Saïdi
    01/30/2023, 5:34 PM
    Another question, sorry: is it possible to get the parameters updated with `--params` in code, via `KedroSession`? I have something like this:
    ```python
    def get_session() -> Optional[MyKedroSession]:
        bootstrap_project(Path.cwd())
        try:
            session = MyKedroSession.create()
        except RuntimeError as exc:
            _log.info(f"Session doesn't exist, creating a new one. Raise: {exc}")
            package_name = str(Path(__file__).resolve().parent.name)
            session = MyKedroSession.create(package_name)
        return session
    
    
    def get_parameters():
        context = get_session().load_context()
        return context.params
    ```
    But `get_parameters` gives the parameters set in the YAML, not the ones updated with `--params`. Thanks!
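    A hedged aside on why this happens: `--params` only reaches the session the CLI creates; a session built in your own code sees nothing of it. `KedroSession.create` accepts an `extra_params` dict that `context.params` merges over the YAML values ("learning_rate" below is a hypothetical key):
    ```python
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    bootstrap_project(Path.cwd())
    with KedroSession.create(extra_params={"learning_rate": 0.01}) as session:
        params = session.load_context().params  # YAML values + the override
    ```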
  • Andrew Stewart
    01/30/2023, 9:59 PM
    What's the use-case difference between
    ```python
    ## from https://kedro.readthedocs.io/en/stable/kedro_project_setup/session.html
    
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from pathlib import Path
    
    bootstrap_project(Path.cwd())
    with KedroSession.create() as session:
        session.run()
    ```
    vs
    ```python
    ## from https://kedro.readthedocs.io/en/stable/tutorial/package_a_project.html
    
    from kedro_tutorial.__main__ import main
    
    main(
        ["--pipeline", "__default__"]
    )  # or simply main() if you don't want to provide any arguments
    ```
  • Alexandra Lorenzo
    01/31/2023, 4:48 PM
    Hello! First, thanks a lot for creating such a community. I'm trying to connect my PartitionedDataSet to my S3 bucket, and I get the following error:
    "create_client() got multiple values for keyword argument 'aws_access_key_id'."
    credentials.yml:
    ```yaml
    dev_s3:
      client_kwargs:
        aws_access_key_id: AWS_ACCESS_KEY_ID
        aws_secret_access_key: AWS_SECRET_ACCESS_KEY
    ```
    catalog.yml:
    ```yaml
    raw_images:
      type: PartitionedDataSet
      dataset:
        type: flair_one.extras.datasets.satellite_image.SatelliteImageDataSet
      credentials: dev_s3
      path: s3://ignchallenge/train
      filename_suffix: .tif
      layer: raw
    ```
    kedro == 0.17.7, s3fs == 0.4.2. Does anyone have an idea? Thanks in advance!
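    For context, the error suggests the access key reaches botocore's create_client() twice. One layout that avoids the nested client_kwargs entirely is s3fs's top-level key/secret parameters; a sketch with placeholder values, untested against this exact setup:
    ```yaml
    # credentials.yml
    dev_s3:
      key: AWS_ACCESS_KEY_ID
      secret: AWS_SECRET_ACCESS_KEY
    ```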
  • João Areias
    01/31/2023, 5:01 PM
    Hi all, I guess I'm a little late to the party, but why is `kedro jupyter convert` being deprecated? And will there be an easy way of turning notebooks into nodes and pipelines following this decision, in Kedro 0.19?
  • Elias
    01/31/2023, 5:54 PM
    What would be the smartest way to query, through the catalog, only data from a database that is newer than 5 years (counting from today or a set end date)?
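    One catalog-level option is pandas.SQLQueryDataSet, which runs an arbitrary query at load time, so the date filter can live in the SQL itself. A sketch assuming a Postgres-style database; the table, column, and credential names are hypothetical:
    ```yaml
    recent_orders:
      type: pandas.SQLQueryDataSet
      sql: >
        SELECT * FROM orders
        WHERE order_date >= CURRENT_DATE - INTERVAL '5 years'
      credentials: db_credentials  # must supply the `con` connection string
    ```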
  • Olivia Lihn
    01/31/2023, 7:28 PM
    Hi everyone! Is there any way of taking a column from a CSV dataset as a parameter for another pipeline? We have a CSV file with features that need to be created, and we need to pass these features as a list to another node as a parameter. Any ideas?
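    One way that needs no parameter machinery at all: a small node can turn the CSV column into a plain list, which then flows to the downstream node as an ordinary in-memory dataset. A sketch with hypothetical dataset and column names:
    ```python
    import pandas as pd
    from kedro.pipeline import Pipeline, node

    def extract_feature_list(features_csv: pd.DataFrame) -> list:
        # "feature_name" is a hypothetical column listing the features to build
        return features_csv["feature_name"].tolist()

    def build_features(data: pd.DataFrame, feature_list: list) -> pd.DataFrame:
        return data[feature_list]

    feature_pipeline = Pipeline([
        node(extract_feature_list, inputs="features_csv", outputs="feature_list"),
        node(build_features, inputs=["model_input", "feature_list"], outputs="model_features"),
    ])
    ```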
  • Andrew Stewart
    02/01/2023, 1:35 AM
    I have a Kedro project where I want to use PySpark when running in a cloud/production environment, but for experimentation in a local environment I don't necessarily want to bother with standing up an entire Spark env. Looking for strategy advice. Solution areas as I see them so far: • somehow make SparkHooks conditional on the environment? • a really, really simple Spark setup (e.g. via Docker or something; I don't want to install Java natively)
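    On the first bullet, a sketch of conditioning the hook on the environment, assuming the SparkHooks pattern from the Kedro docs and that `context.env` carries the `--env` value (worth verifying on your Kedro version):
    ```python
    from kedro.framework.hooks import hook_impl
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    class ConditionalSparkHooks:
        @hook_impl
        def after_context_created(self, context):
            # skip Spark entirely for local experimentation
            if context.env in (None, "local"):
                return
            parameters = context.config_loader.get("spark*", "spark*/**")
            spark_conf = SparkConf().setAll(parameters.items())
            SparkSession.builder.config(conf=spark_conf).getOrCreate()
    ```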
  • Sebastian Cardona Lozano
    02/01/2023, 4:43 AM
    Hi everyone! My team started to use Kedro recently for data science projects; we have found many advantages and are very happy with it. Now we are facing some challenges regarding the implementation of the models on Google Cloud and Vertex AI. I would really appreciate your opinion on these points:
    1. We want to apply the data transformation steps (e.g. one-hot encoding, standardization, missing-value imputation) to the new data when the model is used for prediction. We know that we can do that with scikit-learn pipelines, but there are many disadvantages, which were discussed in this thread. There, some of you recommended the `kedro-mlflow` plugin to achieve what we want. Here are the questions: once you have the mlflow artifact, can we still use the kedro-docker plugin to create the image, or do we have to create the Docker image from scratch? On the other hand, can we still use the other plugins to export the pipeline to Airflow or Vertex Pipelines?
    2. On that basis, we have started to question whether it is better to use mlflow for tracking and model registry, taking advantage of the Kedro plugins, rather than the Vertex AI APIs. I would like to know your opinion or recommendations about how to combine both worlds. Thanks in advance. #C03RKP2LW64 #C03RKPCLYGY
  • Anirudh Dahiya
    02/01/2023, 1:14 PM
    Hi all! I have a Kedro project that is initiated with a PySpark session. To date, I never had any issues running pipelines or opening a Jupyter notebook from my project's directory. However, today I am facing this error:
    ```
    Exception: Java gateway process exited before sending its port number
    ```
    Has anyone faced this error before?
  • Massinissa Saïdi
    02/02/2023, 9:59 AM
    Hello kedroids! Is it possible to get the name of the running `tag` in code (`kedro run --tag NAME`)?
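    Not from the thread, but run-level hooks receive the CLI arguments, so a sketch like this (the hook must be registered in settings.py) can read the tags:
    ```python
    from kedro.framework.hooks import hook_impl

    class TagLoggingHooks:
        @hook_impl
        def before_pipeline_run(self, run_params):
            tags = run_params.get("tags")  # whatever was passed via --tag
            print(f"Running with tags: {tags}")
    ```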
  • Larissa Siqueira
    02/02/2023, 2:28 PM
    Hello everyone! Is it possible to access parameters other than through the node inputs? Our goal is to format the variable names and have them change depending on the global params passed to `kedro run`.
  • Artur Dobrogowski
    02/02/2023, 3:58 PM
    Hi, I'm getting to know Kedro hooks. I want my hook to run only for a specific pipeline. What should the approach be here? Should I detect which pipeline is running in settings.py and register the hook only if the pipeline is correct? Or can I somehow check which pipeline is being run in the hook itself? I don't see how to do it from the given hook parameters: https://kedro.readthedocs.io/en/latest/kedro.framework.hooks.specs.DataCatalogSpecs.html#kedro.framework.hooks.specs.DataCatalogSpecs
  • datajoely
    02/02/2023, 3:58 PM
    Which kind of hook do you want to run?
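    For run-level hooks there is a direct route: `before_pipeline_run` receives `run_params`, whose "pipeline_name" entry holds the `--pipeline` argument, so the hook can guard on it. A sketch ("my_pipeline" is hypothetical):
    ```python
    from kedro.framework.hooks import hook_impl

    class PipelineScopedHooks:
        @hook_impl
        def before_pipeline_run(self, run_params, pipeline, catalog):
            if run_params.get("pipeline_name") != "my_pipeline":
                return  # stay inert for every other pipeline
            ...  # pipeline-specific setup goes here
    ```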
  • Filip Panovski
    02/02/2023, 5:01 PM
    Hello everyone. I have a question with regard to environments, since I'm seemingly misunderstanding them. I searched a bit in this channel, but (unless I grossly misread something) didn't see this specific question. I have a `dask.yml` in my `conf/base` which contains the following (the real config is much larger, but this gets the point across):
    ```yaml
    dask_cloudprovider:
      region: eu-central-1
      instance_type: t3.xlarge
      n_workers: 36
    ```
    And a `dask.yml` in another environment, e.g. `conf/low`, with the following:
    ```yaml
    dask_cloudprovider:
      instance_type: t3.small
      n_workers: 8
    ```
    Which I activate using `kedro run --env=low`. Now, I would have expected the `config_loader` (`TemplatedConfigLoader`) to contain something like `{'dask_cloudprovider': {'region': 'eu-central-1', 'instance_type': 't3.small', 'n_workers': 8}}`. However, it overrides the entire entry, resulting in the `config_loader` containing `{'dask_cloudprovider': {'instance_type': 't3.small', 'n_workers': 8}}`. Is there any way to get what I was expecting out of the box? I don't really want to copy my entire configuration N times for each environment, especially since only a few of the keys change. Is the intended use case for environments different from what I'm trying to use them for (say, only for top-level entries)?
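    For reference, the environment mechanism replaces duplicated top-level keys rather than merging recursively, which matches what is observed above. The expected soft merge corresponds to a deep merge like the sketch below; applying it automatically would need a custom config loader, so this is only an illustration:
    ```python
    def deep_merge(base: dict, override: dict) -> dict:
        """Recursively merge `override` into `base` without mutating either."""
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    base = {"dask_cloudprovider": {"region": "eu-central-1", "instance_type": "t3.xlarge", "n_workers": 36}}
    low = {"dask_cloudprovider": {"instance_type": "t3.small", "n_workers": 8}}
    assert deep_merge(base, low)["dask_cloudprovider"] == {
        "region": "eu-central-1", "instance_type": "t3.small", "n_workers": 8,
    }
    ```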
  • WEN XIN (Jessie 文馨)
    02/03/2023, 4:47 AM
    Hi team, is there any guide on submitting a `spark` job to `EMR` through `livy` for a `kedro` project?
  • Evžen Šírek
    02/03/2023, 10:01 AM
    Hi everyone! Is it possible to use the `fastparquet` engine with the ParquetDataSet? There is the possibility to specify the engine in the catalog entry:
    ```yaml
    dataset:
      type: pandas.ParquetDataSet
      filepath: data/dataset.parquet
      load_args:
        engine: fastparquet
      save_args:
        engine: fastparquet
    ```
    However, when I do that, I get a `DataSetError` with `I/O operation on closed file` when Kedro tries to save the dataset. When I manually save the data with `pandas` and `engine=fastparquet` (which is what Kedro should do according to the docs), it works well. Is this expected? Thanks! :))
    Environment: `python==3.10.4, pandas==1.5.1, kedro==0.18.4, fastparquet==2023.1.0`
  • Massinissa Saïdi
    02/03/2023, 10:45 AM
    Hello kedroids! Has anyone ever used the kedro-argo plugin, and what is their feedback? Is it maintained and reliable with the new versions of Kedro (given that its last update was in 2020)?
  • Veenu Yadav
    02/03/2023, 1:18 PM
    Hi team, I am getting the error `Given configuration path either does not exist or is not a valid directory: /usr/local/airflow/conf/base` while deploying a Kedro pipeline on Apache Airflow with Astronomer. Any clues?
  • Veenu Yadav
    02/03/2023, 1:20 PM
    The directory `/usr/local/airflow/conf/base` is not even present in the webserver container.
  • Sergei Benkovich
    02/03/2023, 3:29 PM
    Hey 🙂 I want to output several HTML files, JSONs, and dataframes at once, as a single report. Is there any way to create them all in a single node and save them to a single zipped file?
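    With only the standard library, a single node can assemble everything into one archive. A sketch that writes the zip directly rather than through the catalog; the path and artifact names are hypothetical:
    ```python
    import json
    import zipfile

    import pandas as pd

    def write_report(df: pd.DataFrame, metrics: dict, html: str) -> str:
        path = "data/08_reporting/report.zip"
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("data.csv", df.to_csv(index=False))
            zf.writestr("metrics.json", json.dumps(metrics))
            zf.writestr("report.html", html)
        return path
    ```
    A catalog-native alternative would be a small custom AbstractDataSet that performs the same zipping in its `_save`.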
  • Rafał Nowak
    02/05/2023, 6:54 PM
    Hello all kedro enthusiasts, I am looking for an implementation of the Kedro dataset `json.JSONDataSet` supporting gzip compression, so that the filepath could be `*.json.gz`. I haven't found such a backend in `kedro.datasets`. Has anyone already implemented such a dataset?
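    One avenue worth trying before writing a custom dataset: JSONDataSet opens its file through fsspec, which can compress transparently via the open arguments. A sketch resting on the untested assumption that the compression kwarg passes through cleanly:
    ```yaml
    my_json:
      type: json.JSONDataSet
      filepath: data/01_raw/data.json.gz
      fs_args:
        open_args_save:
          compression: gzip
        open_args_load:
          compression: gzip
    ```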
  • Sergei Benkovich
    02/05/2023, 8:05 PM
    When saving a model using PickleDataSet with the dill backend, it packages the node in which the model instance was created and run; trying to `dill.load` it then raises
    ```
    ModuleNotFoundError: No module named 'pipelines'
    ```
    Any suggestions on how to handle it?
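    For context: dill pickles custom classes by module reference, so the "pipelines" module path recorded at save time must be importable at load time. The usual workaround is to put the project's source directory on sys.path before loading; a sketch with a hypothetical path:
    ```python
    import sys

    import dill

    # make the saved "pipelines" package importable again
    sys.path.append("/path/to/project/src/my_package")
    with open("data/06_models/model.pkl", "rb") as f:
        model = dill.load(f)
    ```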
  • Ankar Yadav
    02/06/2023, 12:19 PM
    Hi team, one quick question: I am using pandas.CSVDataSet to save a file; however, when I specify `sep` in save_args, it gives me an error:
    ```yaml
    prm_customer:
      type: pandas.CSVDataSet
      filepath: ${base_path}/${folders.prm}/
      save_args:
        index: False
        sep: "|"
    ```
    Any idea how to fix this? I am using `kedro 0.18.1`.
  • Yanni
    02/06/2023, 1:59 PM
    Hi guys, I am a newbie to Kedro and have a question about my Kedro project. I would like to integrate k-fold cross-validation into it. What is the best way to implement this with Kedro? I found many train_test_split examples with Kedro on GitHub, but none of them use cross-validation; the dataset is only split once into training and test sets. What would be the best way to implement this in Kedro? Or is Kedro not useful in this case?
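    Kedro ships no cross-validation construct, but nothing stops a single node from owning the whole CV loop and emitting the scores as a dataset. A sketch with scikit-learn; the column name is hypothetical:
    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def evaluate_with_cv(model_input: pd.DataFrame) -> dict:
        X = model_input.drop(columns=["target"])  # "target" is hypothetical
        y = model_input["target"]
        scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
        return {"cv_scores": scores.tolist(), "mean_accuracy": float(scores.mean())}
    ```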
  • Debanjan Banerjee
    02/06/2023, 2:03 PM
    Team, bit of a long shot, but is the Kedro catalog available as a separate data catalog API? Something like https://intake.readthedocs.io/en/latest/catalog.html
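    The catalog can be used standalone by feeding DataCatalog.from_config the parsed YAML yourself; a minimal sketch (the entry name is hypothetical):
    ```python
    import yaml
    from kedro.io import DataCatalog

    with open("conf/base/catalog.yml") as f:
        catalog = DataCatalog.from_config(yaml.safe_load(f))

    df = catalog.load("my_dataset")
    ```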