Welcome & great question!
Some things to keep in mind:
(1) Hamilton’s roots are in lineage: given an output, it should be easy to figure out what code created it. E.g. each function is a named piece of business logic. So for a little extra verbosity you get much simpler maintenance and hand-off.
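To make point (1) concrete, here's a minimal sketch of the Hamilton style in plain Python (no hamilton import needed to read it): the function's name is the output it produces, and its parameter names declare the upstream outputs it depends on. The function and parameter names here are hypothetical, just for illustration.

```python
# Hamilton-style functions: name = output, parameters = dependencies.
# These names (base_price, tax_rate, etc.) are made up for illustration.

def base_price(raw_price: float, tax_rate: float) -> float:
    """Price including tax; a named, documented piece of business logic."""
    return raw_price * (1 + tax_rate)


def discounted_price(base_price: float, discount: float) -> float:
    """Final price; its parameter name says it depends on base_price."""
    return base_price * (1 - discount)
```

Because each output maps to exactly one named function, "what code created this column?" is answered by reading the function with that name.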
(2) It’s up to you how granular you want to operate. E.g. do you want to operate over dataframes, or columns, or both?
With that in mind, generic building blocks like one-hot encoding are typically used within the body of the functions that need them (versus other systems where they’re a step in the pipeline). If there’s logic shared between functions, you can factor it out as usual with “helper functions” or by importing other code modules.
My recommendation is to draw out a granular DAG of the operations you want, and then the code in Hamilton should map pretty closely to it.
For an example using encoders, see our DBT example using the Titanic dataset (look at the diagram in the README and you’ll see how it’s set up).
Otherwise, in the next release (ETA Monday) we’ve got a new decorator, pipe, that will enable you to more explicitly include something generic as part of the “pipeline definition”.
If you have example code, we can ground some of what I said above with concrete options.