# ask-anything
e
great question! there isn't one particular feature to link the pipelines, but you can achieve this with parametrized pipelines. For example, you can define an `output_i_want_to_link` key in your `env.yaml` and map it to the output location (say `clean-data.csv`), then reference it in `pipeline.preprocess.yaml` (as a `product` in the final task) and in `pipeline.model.yaml` (as a `param` in the first task) with `{{output_i_want_to_link}}`. This way, you'll avoid hardcoding the path. To mark the training pipeline as outdated when your data changes, you can use the `resources_` feature. However, this will cause Ploomber to compute the hash of the file, which isn't scalable if the file is too big. What's the file size?
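A rough sketch of what that could look like (script names, product paths, and the param name below are made up for illustration; the only point is the shared `{{output_i_want_to_link}}` placeholder):

```yaml
# env.yaml -- one shared placeholder for the linked file (example value)
output_i_want_to_link: output/clean-data.csv
```

```yaml
# pipeline.preprocess.yaml -- the final task writes the linked file as a product
tasks:
  # ... earlier preprocessing tasks ...
  - source: scripts/clean.py              # hypothetical script
    product:
      nb: output/clean.ipynb
      data: '{{output_i_want_to_link}}'   # expands to the path defined in env.yaml
```

```yaml
# pipeline.model.yaml -- the first task receives the same path as a param
tasks:
  - source: scripts/train.py              # hypothetical script
    product:
      nb: output/train.ipynb
    params:
      input_path: '{{output_i_want_to_link}}'   # hypothetical param name
  # ... remaining modeling tasks ...
```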
👍 2
also, out of curiosity: why not have a single `pipeline.yaml`?
j
one file is ~77.5 MB and the other is ~25 MB… so not horrible but definitely not teeny tiny!
e
ah, I think using `resources_` will work. It'll probably take a few seconds to hash the file, but it will solve your problem. Let me know how it goes. We have a long-standing issue about adding a feature to facilitate composing pipelines, and maybe it's the right time to tackle it 🙂
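For reference, a minimal sketch of the `resources_` idea (script and key names are hypothetical): files listed under a task's `params.resources_` are hashed, and the task is flagged as outdated when their contents change.

```yaml
# pipeline.model.yaml -- mark the training task outdated whenever the input file changes
tasks:
  - source: scripts/train.py                  # hypothetical script
    product:
      nb: output/train.ipynb
    params:
      resources_:
        input_data: output/clean-data.csv     # the file produced by the preprocessing pipeline
```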
j
re: multiple pipelines, we’re getting to the point where we have ~15 tasks for the preprocessing steps and ~10 tasks in the modeling steps. We can keep them in one pipeline, but we’ve been conceptualizing them as two major steps. Plus, most of our collaborators and other scientists only realllly care about (for better or for worse) the initial output after we do all our data munging and cleaning plus the model results 🤣
e
ah, this makes sense. I didn't think about it this way before. I'm guessing it makes it simpler for each collaborator since they don't need to deal with understanding other parts of the pipeline. I think my suggestion will work but feel free to send any other questions!
j
Thanks, appreciate it!
meerkat 1
Hey! Just working on this now. I've defined the parameter in my `env.yml`, but I'm getting a warning that it's not defined… how do I check which env ploomber is pointing to?
e
Can you share the error message?
j
```
Error: Error replacing placeholders:
  * {{save_path}}: Ensure the placeholder is defined in the env

Loaded env: EnvDict({'cwd': '/Users/jessi...ense_pipeline', 'git': 'main', 'git_hash': 'c5441e3-dirty', 'here': '/Users/jessi...ense_pipeline', ...})
```
and my env.yml def has
```yaml
save_path: preprocessing/output/processed_data/preprocessed_semcor_tags.csv
```
and my pipeline.yml has
```yaml
- source: preprocessing/scripts/preprocess_semcor_tags.py
  product:
    nb: preprocessing/output/notebooks/preprocess_semcor_tags.ipynb
    data: {{save_path}}
```
I’m worried it’s finding a different env or something? I’d just like to check which env it’s parsing 🤔
e
Ah, it’s because the name should be `env.yaml`. Rename it and it should work!
We have an open issue about this :)
j
Ha! I knew it’d be something simple, thanks!
e
Sure. I’d recommend adding the `{{root}}` prefix (i.e., `{{root}}/preprocessing/…`); prefixing it will convert it to an absolute path and ensure you can load the file from anywhere
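A sketch of that change in `env.yaml` (assuming `{{root}}` is available as a built-in placeholder that resolves to the project root; quoting the value keeps YAML from tripping on the leading braces):

```yaml
# env.yaml -- {{root}} makes save_path absolute, so the pipeline loads from any working directory
save_path: '{{root}}/preprocessing/output/processed_data/preprocessed_semcor_tags.csv'
```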
🙏 1