Hi all,
Could I get your opinion on data/process architecture?
Many of you work with big data in an S3 bucket or a SQL table. I don't have access to such resources, so I'd like to know what you think of the following setup.
“Data Lake” - Directory on Shared Drive
• Raw data files
ETL
• read CSV via pandas
• Hamilton to normalize the data (first sketch below)
• write to parquet
“Database” - parquet files holding the normalized historical data
Analysis
• read parquet via dask (anticipating the parquet files will balloon in size)
• processing via a Hamilton/dask workflow (second sketch below)
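To make the ETL step concrete, here's a minimal sketch of what I'm picturing, assuming Hamilton's Builder API; the module name, function names, and shared-drive paths are all placeholders:

```python
# etl_transforms.py -- Hamilton transform functions (names are illustrative)
import pandas as pd


def raw_data(raw_csv_path: str) -> pd.DataFrame:
    """Read one raw CSV out of the shared-drive 'data lake'."""
    return pd.read_csv(raw_csv_path)


def normalized_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Stand-in normalization: tidy column names, drop exact duplicates."""
    df = raw_data.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df.drop_duplicates()
```

```python
# run_etl.py -- executes the module above and lands parquet in the "database"
from hamilton import driver

import etl_transforms

dr = driver.Builder().with_modules(etl_transforms).build()
out = dr.execute(  # the Builder's default result builder returns a dict
    ["normalized_data"],
    inputs={"raw_csv_path": "//shared/lake/raw/2024-01.csv"},  # placeholder
)
out["normalized_data"].to_parquet(
    "//shared/database/normalized/2024-01.parquet", index=False
)
```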
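And for the analysis side, a sketch of the same idea with dask: Hamilton can pass dask DataFrames through the graph like any other Python object, so everything stays lazy until .compute() is called. The columns and paths here are made up:

```python
# analysis_transforms.py -- Hamilton functions over dask (illustrative names)
import dask.dataframe as dd


def history(database_path: str) -> dd.DataFrame:
    """Lazily open the full normalized parquet history."""
    return dd.read_parquet(database_path)


def daily_mean(history: dd.DataFrame) -> dd.DataFrame:
    """Example aggregation; the 'date' and 'value' columns are hypothetical."""
    return history.groupby("date")["value"].mean().to_frame()
```

```python
# run_analysis.py
from hamilton import driver

import analysis_transforms

dr = driver.Builder().with_modules(analysis_transforms).build()
result = dr.execute(
    ["daily_mean"],
    inputs={"database_path": "//shared/database/normalized/"},  # placeholder
)
print(result["daily_mean"].compute())  # data is only actually scanned here
```

As far as I can tell, Hamilton also ships a dask adapter (hamilton.plugins.h_dask) that can hand the whole graph to a distributed Client, which I'd look into once the files really do balloon.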