# ask-anything
e
yes, you can use the same `source` many times, but you have to turn automatic upstream extraction off:
```yaml
# pipeline.yaml
meta:
  extract_upstream: false

tasks:
  - source: task.py
    name: task-a
    product: ...
    upstream: [another]

  - source: task.py
    name: task-b
    product: ...
    upstream: [a-different-one]
```
in your case, since `drop_columns` will have different upstreams, you can refer to them without knowing their name, like this:
```python
import pandas as pd

def drop_columns(upstream, product):
    train = pd.read_csv(upstream.first["train"])
    holdout = pd.read_csv(upstream.first["holdout"])
    # cleaning....
```
does this solve your issue?
(note: if you turn `extract_upstream` off, all tasks must have an `upstream` key)
f
Ahm, not so sure that solves my problem though… I still have to hardcode `"train"` and `"holdout"` in my Python code… I was thinking of something to the tune of:
```yaml
tasks:
  - source: task.py
    name: task-a
    product: ...
    upstream: [another.train]

  - source: task.py
    name: task-b
    product: ...
    upstream: [another.holdout]

  - source: task.py
    name: task-c
    product: ...
    upstream: [another.some_other]
```
And then in Python:
```python
import pandas as pd

def drop_columns(upstream, product):
    generic_df = pd.read_csv(upstream)
    # code...
```
e
yeah, so in `drop_columns` you can do `upstream.first` and that's going to return the products from the upstream task (whatever that is, since you're not referring to the upstream by name). what's missing with this approach?
f
So I could be specific in the `upstream` key in the pipeline:
```yaml
tasks:
  - source: task.drop_columns
    name: task-a
    product: ...
    upstream: [another.train]
```
And then in `task.py`:
```python
import pandas as pd

def drop_columns(upstream, product):
    generic_df = pd.read_csv(upstream.first)
```
`upstream.first` would contain the product `train` of the `another` task?
e
ah ok, I think I see where the confusion is coming from. in `upstream` you put the name of the upstream task. then `upstream.first` returns the product of that upstream. example:
```yaml
tasks:
  - source: tasks.train
    name: train
    product:
      a: ...
      b: ...

  - source: tasks.drop_columns
    product:
      c: ...
    upstream: [train]
```
then `upstream.first` inside `tasks.drop_columns` is the same as doing `upstream['train']`; hence, it returns a dictionary with keys `a` and `b`, since those are the products of `train`
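so with that layout, the body of `drop_columns` would look something like this (just a sketch; I'm assuming `a` and `b` are CSV files, `unwanted` is a placeholder column name, and concatenating the two frames is only for illustration):
```python
import pandas as pd

def drop_columns(upstream, product):
    # upstream.first is the same as upstream['train']: a dict with keys 'a' and 'b'
    df_a = pd.read_csv(upstream.first['a'])
    df_b = pd.read_csv(upstream.first['b'])
    # drop the placeholder column and write the declared product 'c'
    cleaned = pd.concat([df_a, df_b]).drop(columns=['unwanted'])
    cleaned.to_csv(str(product['c']), index=False)
```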
f
Oh, still not quite my use case. If I understand correctly, I would still need to reference `a` and `b` in my Python code, as in `upstream.first["a"]`, right? I want to avoid having to reference a specific upstream product within my Python code. The way I was thinking about it is to be able to directly specify where the value of `upstream` comes from for a given task. Kind of:
• for task a with source drop_columns, its upstream is split.holdout
• for task b with source drop_columns, its upstream is split.train
• for task c with source drop_columns, its upstream is fetch_new_data.data
and then have the function `drop_columns` be as generic as possible:
```python
import pandas as pd

def drop_columns(upstream, product):
    df = pd.read_csv(upstream)
    # ...
```
e
ah ok, got it. interesting use case, never thought about it. right now, ploomber pushes all products downstream, so what you're suggesting isn't supported. I'll open an issue since this is an interesting use case. I think for now, what I suggested can help you. then, you have two options: 1) either standardize your products (e.g. tasks produce a single file or produce a dictionary with the same keys) or 2) add a bit of logic so you don't need to hardcode the product key. e.g. if they generate a dictionary with a single key:
```python
import pandas as pd

def drop_columns(upstream, product):
    key = list(upstream.first)[0]
    df = pd.read_csv(upstream.first[key])
```
if the upstream generates a single product (instead of a dictionary), then it's simpler:
```python
import pandas as pd

def drop_columns(upstream, product):
    df = pd.read_csv(upstream.first)
```
g
interesting request. So if I understand correctly, @feregrino, your drop task still has to have explicit knowledge of the two upstream products, since you are producing a respective cleaned output for each. how would you be able to generate differently named outputs without this understanding?
f
@Eduardo cool, I’ll keep an eye on the issues then. @Gaurav the task, as an abstraction (in the pipeline definition), yes, has to know about the upstream. However, at the code level (or, using Ploomber terms, the source level) I see no need for it to hold this knowledge.
g
Just wondering: instead of generating named outputs, can you define the product as a directory instead of a file? In the downstream task you then just have to get all the files from this folder and apply the drop task to them.
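something like this, roughly (a sketch; it assumes both the upstream product and this task's product are declared as directories of CSV files):
```python
import pathlib
import pandas as pd

def drop_columns(upstream, product):
    # the upstream task's product is a directory full of CSV files
    in_dir = pathlib.Path(str(upstream.first))
    # this task's product is a directory too; create it before writing
    out_dir = pathlib.Path(str(product))
    out_dir.mkdir(parents=True, exist_ok=True)

    for path in in_dir.glob('*.csv'):
        df = pd.read_csv(path)
        # drop columns here, then write one cleaned file per input file
        df.to_csv(out_dir / path.name, index=False)
```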
f
Sure, I could work around this in many ways, however they all sound hacky. To me, it would make more sense to allow true task reusability by letting users specify connections amongst tasks in the pipeline file, rather than hardcoding them within the source code itself.
The more I think about this, the more it makes sense to me... For example, I was thinking of having three pipelines: one for initial training, another for subsequent retraining, and one for batch inference. Reusing the same feature transformation code across the three pipelines would be great, and reusing the training/evaluation/validation code for initial training and subsequent retraining would be great too. At the moment I see this only being possible by creating thin wrapper methods.
e
for training <> serving, we have an `import_tasks_from` feature that allows you to put all the feature engineering code in a file and then have a `pipeline.train.yaml` and a `pipeline.serve.yaml` (see here) - but I agree, there's more we can do to enable this modularization
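roughly, it looks like this (a sketch; the file and task names are just examples):
```yaml
# pipeline.serve.yaml (pipeline.train.yaml is structured the same way)
meta:
  # pipeline.features.yaml holds the shared feature engineering task definitions
  import_tasks_from: pipeline.features.yaml

tasks:
  # serving-only tasks go here
  - source: tasks.predict
    product: output/predictions.csv
```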
hi, I created an issue. I think this should fix your problem @feregrino, feel free to comment with any feedback. @Gaurav, please also let me know what you think. I'd like to open the discussion so we can add a new example showing how to re-use tasks: https://github.com/ploomber/ploomber/issues/682