# ask-anything
e
thanks for the question @Ondřej Hubáček! you can check out the survey we did, where we include Kedro. A few things that come to mind (please correct me if I'm wrong on any of these points, since I'm not a Kedro user and it's been a while since I tested it):
• I've seen some Kedro examples, and my personal take is that they contain too much boilerplate in the pipeline definition. Our integration with Jupyter/VSCode/PyCharm allows data scientists to build pipelines without boilerplate, in a much leaner way.
• Their support for production platforms is limited to showing a script that turns a Kedro pipeline into something like an Argo pipeline (so essentially you need to copy-paste that script). We, on the other hand, have a separate package that streamlines this process.
• I don't think they have incremental builds (caching previous computations so that the next time you run the pipeline, only the tasks that changed are executed).
• We have native support for SQL, so you can create tasks from .sql files instead of writing a .py that runs the query (we manage the connection to the db for you); this simplifies pipelines a lot (here's an example). We also support R.
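to make the "less boilerplate" point concrete, a function-based task on our side can look roughly like this (just a sketch with made-up names - see the docs for the exact conventions):
Copy code
import pandas as pd


# a Ploomber task can be a plain function: it receives the path(s) where the
# upstream task stored its output and the path where this task should store its own
def clean(upstream, product):
    df = pd.read_csv(str(upstream['raw']))  # output of a (made-up) "raw" task
    df = df.dropna()
    df.to_csv(str(product), index=False)
the function is listed in the pipeline spec with a source and a product path, and the upstream relationship wires the paths for you - no extra pipeline-definition code around it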
may I ask what upsides and downsides you've seen in Kedro?
i
Just to add to the above: sometimes Kedro brings a lot of overhead and can be overkill for rapid experimentation; there's a lot to manage compared to a simple pip install.
o
I have read the survey.
• Yes, boilerplate is what bothers me in Kedro.
• Integration with Jupyter notebooks sounds like a terrible idea to me, but that's ok, I will just ignore the feature :]
• I don't have a lot of experience (yet) with deploying Kedro pipelines. I just figured from the docs that both Kedro and Ploomber support AWS Batch, which is something we might eventually use (as we use AWS for all our services).
• Caching in Kedro is a thing: you can run pipelines from a specific node/task. This is the same in Ploomber, if I understand correctly.
• SQL in tasks got my attention, although the SQL scripts we use are often constructed dynamically in Python and called via pandas.
Downsides of Kedro that come to mind right now:
• boilerplate
• it supports only DAGs (Ploomber too, currently)
• I really like using # %% + scientific mode in PyCharm when developing code and later moving it into functions. In Kedro, this does not work that well, as the inputs to a node/task are outputs of some node in the pipeline. A workaround is to call all the tasks yourself and handle the correct routing of the inputs/outputs yourself. What I tend to do is stop the pipeline in a debugger and basically develop my code there, which probably sounds worse than it actually is. I am not sure, however, how much this can be improved in a pipelining tool.
Upsides:
• while it makes code harder to write, it makes the project easier to read/navigate.
There are definitely more things, but these are the ones that came to mind first.
We will be starting a new project next week, and my plan is to pitch Ploomber, give it a shot, and then compare the hands-on experience.
@Ido (Ploomber) overhead from what, exactly? The biggest overhead I've had with Kedro comes from trying to avoid caching, so I often rerun the whole pipeline.
e
• Give the Jupyter integration a shot 😅 - we allow users to use the # %% format and open those files as notebooks in Jupyter, so the same file can be developed in Jupyter/VSCode without the complexities of the ipynb format. To check that nothing is broken, you can just call ploomber build and it'll orchestrate execution.
• You mention that Kedro only supports DAGs - can you expand on that? What features are missing?
• Yeah, the # %% format is very convenient for data scientists because they can still develop interactively, but since we simplify the modular pipeline-building part, they can create much more maintainable code (10 tasks with 20 cells each, instead of one big script with 200 cells).
• We support templated SQL, check this tutorial - same concept as generating SQL from Python, but a lot simpler. While it is not as powerful as Python, it covers 90% of use cases and simplifies the code a lot.
Please share your experience when you give Ploomber a try, and don't hesitate to post any questions - we are happy to help!
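for reference, a task in the # %% format can look roughly like this (a sketch with made-up names; the cell tagged "parameters" is where Ploomber injects the actual upstream/product paths at runtime):
Copy code
# %% tags=["parameters"]
# declare the dependency; at runtime Ploomber injects the real paths,
# overriding these placeholders
upstream = ['clean']
product = None

# %%
import pandas as pd

# load the output of the upstream "clean" task
df = pd.read_csv(upstream['clean']['data'])

# %%
# do the work and save to the path assigned to this task
features = df.select_dtypes('number')
features.to_csv(product['data'], index=False)
the same file opens as a notebook in Jupyter for interactive work, and ploomber build runs it as part of the pipeline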
o
"you mention that kedro only supports DAGs, can you expand on that? What features are missing?" Well, cycles. If I would like to get low level and run for example some optimization loop as pipeline of task/nodes this would not be currently possible either in kedro or ploomber. SQL templating looks interesting, I will look into it. (We have tried https://pugsql.org/ as an option between form and plain SQL, but there were some issues I can't exactly remember right now - something regarding performance i think)
👍 1
e
re cycles: ah, we've been chatting about this a couple of times with other members of the community. Since this keeps coming up, I think we should work on it 🙂 - just to make sure I get this right: you want a cycle that exits when some condition is met (e.g. model performance passes some threshold)?
👍 2
o
Yes. I probably want to call a user-defined function that decides whether the loop should stop or not.
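Roughly this kind of thing, in plain Python just to illustrate the behaviour I'm after (train_one_round is a made-up placeholder for a real pipeline task, not something either tool offers today):
Copy code
def train_one_round(metrics):
    # placeholder for the real train + evaluate step
    return {'score': metrics['score'] + 0.2, 'iteration': metrics['iteration'] + 1}


def should_stop(metrics, threshold=0.95, max_iterations=20):
    # user-defined exit condition: model is good enough, or we ran too many rounds
    return metrics['score'] >= threshold or metrics['iteration'] >= max_iterations


# the loop I'd like to express as pipeline tasks
metrics = {'score': 0.0, 'iteration': 0}
while not should_stop(metrics):
    metrics = train_one_round(metrics)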
e
cool. we actually have an open issue about this so feel free to share your thoughts https://github.com/ploomber/ploomber/issues/474
m
@Eduardo, apologies - I just noticed there was a request for feedback from early January. I will include some.
e
no worries, @Matej Uhrín!
o
@Eduardo Does Ploomber support any inputs/outputs "routing"? In Kedro I can have the following pipeline:
Copy code
from kedro.pipeline import Pipeline, node


def create_pipeline():
    # node1_func, node2_func and add are assumed to be defined elsewhere
    node1 = node(func=node1_func, inputs="a", outputs="b")
    node2 = node(func=node2_func, inputs="c", outputs="d")
    node3 = node(func=add, inputs=["b", "d"], outputs="sum")
    return Pipeline([node1, node2, node3])
Just by looking at the pipeline, I can see how the outputs are passed through it. I can, for example, split a dataset into training and testing sets in one node, and just by setting the inputs and outputs ensure that the node/task for model training receives only the training set. From that, I understand that in Ploomber you specify only the order in which the tasks should be evaluated (using upstream)?
m
Copy code
- source: add.py
  name: add
  upstream: [b, d]
or perhaps
Copy code
- source: node3
  name: add
  upstream: [node1, node2]
e
You got it, @Ondřej Hubáček: when you set the upstream relationship, you are implicitly saying that the outputs of one task become the inputs of the next one. No need to route the outputs manually.
IMO, setting both the order of execution and the inputs/outputs is redundant. Ploomber will pass the paths to all the inputs available; if you don't need some of those, that's fine - you can just ignore them.
o
For our use cases it is not redundant; being able to consume only part of the outputs is a feature.
e
Yeah, Ploomber leaves that up to you. So say you have A -> B, and A produces outputs A1 and A2. Ploomber will pass the paths to both A1 and A2 to B, and your code in B can decide whether it loads one of those or both. There is no need to tell Ploomber what you want to do in B.
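a rough sketch of that (names are made up; assuming A declares two products, a1 and a2, in its spec):
Copy code
import pandas as pd


# task B receives the paths to everything A produced...
def b(upstream, product):
    # ...but loads only what it needs (a1); a2 is simply ignored
    train = pd.read_csv(str(upstream['a']['a1']))
    result = train.dropna()
    result.to_csv(str(product), index=False)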