If I have `n` excel files to process each corresponding to a Hamilton Open Source #hamilton-help

Seth Stokes

05/01/2024, 10:52 PM

If I have `n` excel files to process, each corresponding to a day's snapshot of data, I could load them with `Parallelizable` and `Collect`, yielding over filepaths. But each file has `m=3` sheets that I need to load as separate data sets. The `Parallelizable` works on the `n` items but not the `m` sheets. Is there a Hamiltonian idiom for that yet?

👀 1

Stefan Krawczyk

05/01/2024, 10:54 PM

@Seth Stokes correct. We don’t allow nesting of parallelizable. It’s also not always useful to parallelize that much — since it may actually be slower than processing them serially due to serialization costs.

Stefan Krawczyk

05/01/2024, 10:56 PM

@Elijah Ben Izzy or @Thierry Jean if you wanted to add thoughts.

Elijah Ben Izzy

05/01/2024, 10:56 PM

Yep, so whether or not it's parallel is another decision (you can use parallel with synchronous execution). That said, I think that there are two solutions:
1. Break each into individual items (have just n*m outputs in parallel)
2. Deal with sets of (3)
I'd do (1), personally, assuming that makes sense for your use-case (and that I understood it right)
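Option (1) amounts to flattening the nested (file, sheet) structure into a single stream of n*m work items. Here is a minimal sketch in plain Python, assuming the sheet names are known up front and are the same in every file. `file_sheet_pairs`, the paths, and the sheet names are hypothetical and not part of Hamilton. In a Hamilton dataflow, a generator like this would presumably carry a `Parallelizable[...]` return annotation instead of `Iterator[...]`.

```python
from itertools import product
from typing import Iterator


def file_sheet_pairs(
    file_paths: list[str], sheet_names: list[str]
) -> Iterator[tuple[str, str]]:
    """Yield one (file_path, sheet_name) work item per sheet of every file."""
    # product() iterates files in the outer loop and sheets in the inner loop,
    # so n files with m sheets each give n * m items in a single flat stream.
    yield from product(file_paths, sheet_names)


# Hypothetical inputs: two daily snapshots, three sheets each.
paths = ["2024-04-30.xlsx", "2024-05-01.xlsx"]
sheets = ["positions", "trades", "prices"]
pairs = list(file_sheet_pairs(paths, sheets))
print(len(pairs))  # 6
```

Each pair is independent of the others, so one level of parallelism covers both files and sheets without nesting.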

Thierry Jean

05/01/2024, 10:58 PM

If it makes sense in your analysis, you can iterate over sheets:

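from hamilton.htypes import Collect, Parallelizable

# Excel, Sheet, and load_excel are placeholders; the real types depend on your loader.
def excel_files(file_paths: list[str]) -> list[Excel]:  # don't know the type
    return [load_excel(p) for p in file_paths]

def sheet(excel_files: list[Excel]) -> Parallelizable[Sheet]:
    # One item per (file, sheet) pair, so no nested Parallelizable is needed.
    for excel_file in excel_files:
        for s in excel_file.sheets:
            yield s

# transformed_sheet is whatever node processes a single sheet (not shown here).
def sheet_collection(transformed_sheet: Collect[...]) -> list[...]:
    return list(transformed_sheet)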

👍 1

Seth Stokes

05/01/2024, 10:59 PM

Would (1) be along the lines of `m` dataflows, one for each, and then have a driver for each one that yields over the filepaths?

Thierry Jean

05/01/2024, 10:59 PM

if you need ids of specific files, you can pass a dictionary

from hamilton.htypes import Parallelizable

def excel_files(file_paths: list[str]) -> list[Excel]:  # don't know the type
    return [load_excel(p) for p in file_paths]

def sheet(excel_files: list[Excel]) -> Parallelizable[dict]:
    for file_idx, excel_file in enumerate(excel_files):
        # Iterate over the sheets of the current file.
        for sheet_idx, s in enumerate(excel_file.sheets):
            yield dict(doc_id=file_idx, sheet_idx=sheet_idx, sheet=s)
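Once each item carries `doc_id` and `sheet_idx`, the flat list that comes back from a `Collect` step can be regrouped per file. Here is a minimal sketch in plain Python. `group_by_file` is a hypothetical helper, and the string payloads stand in for real sheet data.

```python
from collections import defaultdict
from typing import Any


def group_by_file(items: list[dict[str, Any]]) -> dict[int, list[Any]]:
    """Regroup flat {doc_id, sheet_idx, sheet} records into per-file lists."""
    grouped: dict[int, list[Any]] = defaultdict(list)
    # Sort first so each file's sheets come out in sheet_idx order, in case
    # the collected order is not guaranteed.
    for item in sorted(items, key=lambda d: (d["doc_id"], d["sheet_idx"])):
        grouped[item["doc_id"]].append(item["sheet"])
    return dict(grouped)


# Stand-in records, deliberately out of order.
collected = [
    dict(doc_id=1, sheet_idx=0, sheet="f1-s0"),
    dict(doc_id=0, sheet_idx=1, sheet="f0-s1"),
    dict(doc_id=0, sheet_idx=0, sheet="f0-s0"),
]
by_file = group_by_file(collected)
print(by_file)  # {0: ['f0-s0', 'f0-s1'], 1: ['f1-s0']}
```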
