# ask-anything
e
great question! we have an open issue that will simplify this use case, but we haven't finished working on it yet! for now, I'd suggest adding one parameter to control which columns you train on (in this example, by naming a column to drop):
```yaml
# pipeline.yaml
tasks:
  - source: train.ipynb
    product:
      nb: 'output/{{experiment_name}}/train.ipynb'
    params:
      column_to_delete: '{{column_to_delete}}'
```
then, create an `env.yaml`:
```yaml
experiment_name: default_experiment
column_to_delete: null
```
then implement the logic that drops a column based on the `column_to_delete` value.
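for example, here's a minimal sketch of what that cell in train.ipynb could look like (the CSV path is just a placeholder, not something Ploomber requires):

```python
# cell in train.ipynb -- `column_to_delete` is injected by Ploomber
# from env.yaml (or from the --env-- override on the CLI)
import pandas as pd

df = pd.read_csv('data/train.csv')  # placeholder path for your training data

if column_to_delete is not None:
    df = df.drop(columns=[column_to_delete])

# ...train on the remaining columns...
```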
then run your pipeline with:

```sh
ploomber build
```

to run an experiment with a column dropped:
```sh
ploomber build --env--experiment_name my-experiment --env--column_to_delete some-column
```
then you'll have `output/default_experiment/train.ipynb` and `output/my-experiment/train.ipynb`, and you can compare them! if you have more than one task in your pipeline (e.g. the tasks that generate the training set; you can share those files across experiments), this will let you cache results, so next time you run an experiment you don't have to run all tasks:
```yaml
# pipeline.yaml
tasks:
  - source: prepare.ipynb
    product:
      # note that we don't use "experiment_name" here!
      nb: output/prepare.ipynb
      data: output/train.csv

  - source: train.ipynb
    product:
      nb: 'output/{{experiment_name}}/train.ipynb'
    params:
      column_to_delete: '{{column_to_delete}}'
```
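one thing to keep in mind: with Ploomber, the dependency between train.ipynb and prepare.ipynb is declared inside train.ipynb's cell tagged "parameters", roughly like this (the task name defaults to the file name, so 'prepare' here):

```python
# cell tagged "parameters" in train.ipynb
# declaring prepare.ipynb as upstream makes its products (e.g. output/train.csv)
# available via the injected `upstream` variable
upstream = ['prepare']
```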
let me know if this helps!
j
Thanks for the detailed explanation! In fact, this is what I've done for now: I created a 'v1' with the current columns and a 'v2' with the additional column by using env.yaml and params. However, I would like to compare the results in a last task that combines all (or selected) versions. This seems to be exactly what the issue you linked is about. The problem is that currently I cannot easily cascade the params to the other notebooks/scripts in my pipeline. And because the selection of the columns comes very early in the pipeline, I have to run everything multiple times. Of course that is unavoidable, but I lack the possibility to easily combine the results later on. Within a given pipeline I'm always stuck with the current parameter. Ideally I want to be able to change (or add) any task in my pipeline (e.g. add a feature) and compare the results to see how this change affects them. If I understand correctly, this will be possible with https://github.com/ploomber/ploomber/issues/602, but currently it is not possible to do this within the pipeline, i.e. you have to manually look at the different pipeline results or create a comparison script outside the pipelines. Is that correct?
e
> However, I would like to compare the results in a last task that combines all (or selected) versions. This seems to be exactly what the issue you linked is about.
ah, I thought you wanted one evaluation task per task generated by the grid. if you want to evaluate all models at once, there's another thing you can do: use a grid and make the next task depend on all of its tasks with a wildcard placeholder:
```yaml
tasks:
  - source: ...
    grid: ...
    name: train-  # since this is a grid, it'll generate train-0, train-1, ...

  # make this task depend on all train-* tasks
  # by declaring a wildcard upstream in evaluate.ipynb:
  # upstream = ["train-*"]
  - source: evaluate.ipynb
    product: ...
```
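for reference, here's roughly what the cell tagged "parameters" in evaluate.ipynb could look like (a sketch; the task name just matches the grid above):

```python
# cell tagged "parameters" in evaluate.ipynb
# the wildcard makes this task depend on every train-* task the grid generates
upstream = ['train-*']
```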
> The problem is that currently I cannot easily cascade the params to the other notebooks/scripts in my pipeline.
do you need to access the grid parameters in later tasks? one way would be to have train.ipynb store its parameters in a `parameter.json`, register it as a product, and then load them in the next stage. Alternatively, you can use our notebook introspector to extract values, charts, and tables from output cells. I think both of these things combined would get you what you want, but let me know if this is not what you want to implement.
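for example, a rough sketch of both halves (the product key, file name, and glob pattern are just assumptions based on the paths above, not anything Ploomber prescribes):

```python
# train.ipynb -- register e.g. 'output/{{experiment_name}}/parameter.json' under
# a product key such as "params" in pipeline.yaml, then write this run's
# parameters to it via the injected `product` variable
import json
from pathlib import Path

Path(product['params']).write_text(json.dumps({'column_to_delete': column_to_delete}))
```

```python
# evaluate.ipynb -- gather the parameter files written by every experiment;
# globbing the output folder avoids depending on the exact structure Ploomber
# injects for wildcard upstreams
import json
from pathlib import Path

all_params = {path.parent.name: json.loads(path.read_text())
              for path in Path('output').glob('*/parameter.json')}
```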
btw, a final alternative is to use the Python API directly, which is a lot more flexible for this kind of dynamic workflow; the notebook introspector docs include a complete example.
j
Thanks for the hint about the notebook introspector! This also goes in the direction I was looking for. The parameter cascading I can probably accomplish via the DAG as well. I'll try it and see if I can make it work 🙂
e
great, feel free to post other questions if you need help! I'd love to see that notebook pipeline up and running!