# ask-anything
e
great question! there isn't one particular feature to link the pipelines, but you can achieve this with parametrized pipelines. For example, you can define an `output_i_want_to_link` key in your `env.yaml` and map it to the output location (say `clean-data.csv`), then reference it in `pipeline.preprocess.yaml` (as a `product` in the final task) and in `pipeline.model.yaml` (as a `param` in the first task) with `{{output_i_want_to_link}}`. This way, you'll avoid hardcoding the path. To mark the training pipeline as outdated when your data changes, you can use the `resources_` feature. However, this will cause Ploomber to compute the hash of the file, which isn't scalable if the file is too big. What's the file size?
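A rough sketch of what that could look like (script names, product paths, and the param name below are made up for illustration; the only point is the shared `{{output_i_want_to_link}}` placeholder):

```yaml
# env.yaml -- one shared placeholder for the linked file (example value)
output_i_want_to_link: output/clean-data.csv
```

```yaml
# pipeline.preprocess.yaml -- the final task writes the linked file as a product
tasks:
  # ... earlier preprocessing tasks ...
  - source: scripts/clean.py              # hypothetical script
    product:
      nb: output/clean.ipynb
      data: '{{output_i_want_to_link}}'   # expands to the path defined in env.yaml
```

```yaml
# pipeline.model.yaml -- the first task receives the same path as a param
tasks:
  - source: scripts/train.py              # hypothetical script
    product:
      nb: output/train.ipynb
    params:
      input_path: '{{output_i_want_to_link}}'   # hypothetical param name
  # ... remaining modeling tasks ...
```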
👍 2
also, out of curiosity: why not have a single `pipeline.yaml`?
j
one file is ~77.5 MB and the other is ~25 MB… so not horrible but definitely not teeny tiny!
e
ah, I think using `resources_` will work. It'll probably take a few seconds to hash the file, but it will solve your problem. Let me know how it goes. We have a long-standing issue about adding a feature to facilitate composing pipelines, and maybe it's the right time to tackle it 🙂
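For reference, a minimal sketch of the `resources_` idea (script and key names are hypothetical): files listed under a task's `params.resources_` are hashed, and the task is flagged as outdated when their contents change.

```yaml
# pipeline.model.yaml -- mark the training task outdated whenever the input file changes
tasks:
  - source: scripts/train.py                  # hypothetical script
    product:
      nb: output/train.ipynb
    params:
      resources_:
        input_data: output/clean-data.csv     # the file produced by the preprocessing pipeline
```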
j
re: multiple pipelines, we’re getting to the point where we have ~15 tasks for the preprocessing steps and ~10 tasks in the modeling steps. We can keep them in one pipeline, but we’ve been conceptualizing them as two major steps. Plus, most of our collaborators and other scientists only realllly care about (for better or for worse) the initial output after we do all our data munging and cleaning plus the model results 🤣
e
ah, this makes sense. I didn't think about it this way before. I'm guessing it makes it simpler for each collaborator since they don't need to deal with understanding other parts of the pipeline. I think my suggestion will work but feel free to send any other questions!
j
Thanks, appreciate it!
meerkat 1
Hey! Just working on this now. I've defined the parameter in my `env.yml`, but I'm getting a warning that it's not defined… how do I check which env ploomber is pointing to?
e
Can you share the error message?
j
```
Error: Error replacing placeholders:
  * {{save_path}}: Ensure the placeholder is defined in the env

Loaded env: EnvDict({'cwd': '/Users/jessi...ense_pipeline', 'git': 'main', 'git_hash': 'c5441e3-dirty', 'here': '/Users/jessi...ense_pipeline', ...})
```
and my env.yml def has
```yaml
save_path: preprocessing/output/processed_data/preprocessed_semcor_tags.csv
```
and my pipeline.yml has
```yaml
- source: preprocessing/scripts/preprocess_semcor_tags.py
  product:
    nb: preprocessing/output/notebooks/preprocess_semcor_tags.ipynb
    data: {{save_path}}
```
I’m worried it’s finding a different env or something? I’d just like to check which env it’s parsing 🤔
e
Ah, it’s because the name should be `env.yaml`. Rename it and it should work!
We have an open issue about this :)
j
Ha! I knew it’d be something simple, thanks!
e
Sure. I’d recommend adding the `{{root}}` prefix (i.e., `{{root}}/preprocessing/…`); prefixing it will convert it to an absolute path and ensure you can load the file from anywhere
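A sketch of that change in `env.yaml` (assuming `{{root}}` is available as a built-in placeholder that resolves to the project root; quoting the value keeps YAML from tripping on the leading braces):

```yaml
# env.yaml -- {{root}} makes save_path absolute, so the pipeline loads from any working directory
save_path: '{{root}}/preprocessing/output/processed_data/preprocessed_semcor_tags.csv'
```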
🙏 1