If I have `n` excel files to process each correspo...
# hamilton-help
s
If I have `n` Excel files to process, each corresponding to a day's snapshot of data, I could load them with `Parallelizable` and `Collect`, yielding over file paths. But each file has `m=3` sheets that I need to load as separate datasets. The `Parallelizable` works on the `n` items but not the `m` sheets. Is there a Hamiltonian idiom for that yet?
s
@Seth Stokes correct. We don't allow nesting of `Parallelizable`. It's also not always useful to parallelize that much, since it may actually be slower than processing the items serially due to serialization costs.
@Elijah Ben Izzy or @Thierry Jean if you wanted to add thoughts.
e
Yep, so whether or not it's parallel is another decision (you can use parallel with synchronous execution). That said, I think there are two solutions:
1. Break each into individual items (have just `n*m` outputs in parallel)
2. Deal with sets of 3
I'd do (1), personally, assuming that makes sense for your use case (and that I understood it right).
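A rough sketch of what option (1) could look like, in plain Python rather than Hamilton: flatten the `n` files and `m` sheets into a single list of `n*m` work items, which one fan-out node could then yield. The file names, sheet names, and the `work_items` helper are made up for illustration.

```python
from itertools import product

# Hypothetical inputs: n=2 daily snapshot files, each with the same m=3 sheets.
file_paths = ["2024-01-01.xlsx", "2024-01-02.xlsx"]
sheet_names = ["orders", "customers", "inventory"]


def work_items(file_paths: list[str], sheet_names: list[str]) -> list[tuple[str, str]]:
    """Return one (file_path, sheet_name) pair per sheet: n*m items in total."""
    return list(product(file_paths, sheet_names))


items = work_items(file_paths, sheet_names)
print(len(items))  # 6
print(items[0])    # ('2024-01-01.xlsx', 'orders')
```

Each pair carries enough information to load exactly one sheet, so there is only one level of fan-out and nothing needs to be nested.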
t
If it makes sense in your analysis, you can iterate over sheets:
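```python
from hamilton.htypes import Collect, Parallelizable

# Excel, Sheet, and load_excel are placeholders for your own types and loader.
def excel_files(file_paths: list[str]) -> list[Excel]:  # don't know the type
    return [load_excel(p) for p in file_paths]

# This node yields, so it is the fan-out step and is annotated Parallelizable.
def sheet(excel_files: list[Excel]) -> Parallelizable[Sheet]:
    for excel_file in excel_files:
        for sheet in excel_file.sheets:
            yield sheet

# transformed_sheet is whatever per-sheet node you define downstream of sheet;
# ... stands for the type it returns.
def sheet_collection(transformed_sheet: Collect[...]) -> list[...]:
    return list(transformed_sheet)
```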
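To make the shape of that dataflow concrete without Hamilton installed, here is a loose standard-library analogue: a generator for the fan-out, a per-item step, and a collect step. It is an analogy only, not how Hamilton executes things; the workbooks are plain dicts and the row-counting transform is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator

# Stand-in for loaded workbooks: each "file" maps sheet name -> rows.
workbooks = [
    {"a": [1, 2], "b": [3], "c": []},
    {"a": [4], "b": [5, 6], "c": [7]},
]


def sheet(workbooks: list[dict]) -> Iterator[list[int]]:
    """Fan-out step: yield every sheet of every workbook."""
    for workbook in workbooks:
        for rows in workbook.values():
            yield rows


def transformed_sheet(rows: list[int]) -> int:
    """Per-item step (hypothetical): count the rows in one sheet."""
    return len(rows)


def sheet_collection(workbooks: list[dict]) -> list[int]:
    """Collect step: run the per-item step on each sheet and gather results in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(transformed_sheet, sheet(workbooks)))


result = sheet_collection(workbooks)
print(result)  # [2, 1, 0, 1, 2, 1]
```

Two files with three sheets each produce six results from a single fan-out, which is the point of iterating over sheets instead of nesting.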
s
Would (1) be along the lines of `m` dataflows, one for each, and then have a driver for each one that yields over the filepaths?
t
if you need ids of specific files, you can pass a dictionary
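```python
from hamilton.htypes import Parallelizable

# Excel and load_excel are placeholders for your own type and loader.
def excel_files(file_paths: list[str]) -> list[Excel]:  # don't know the type
    return [load_excel(p) for p in file_paths]

# Each yielded dict carries the file and sheet indices alongside the sheet.
def sheet(excel_files: list[Excel]) -> Parallelizable[dict]:
    for file_idx, excel_file in enumerate(excel_files):
        # enumerate the sheets of the current file
        for sheet_idx, sheet in enumerate(excel_file.sheets):
            yield dict(doc_id=file_idx, sheet_idx=sheet_idx, sheet=sheet)
```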
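Once items carry `doc_id` and `sheet_idx`, the collected results can be regrouped per file afterwards. A small sketch in plain Python, assuming the collect step hands back a list of such dicts in arbitrary order; `group_by_file` and the sample records are hypothetical, with strings standing in for sheets.

```python
from collections import defaultdict


def group_by_file(records: list[dict]) -> dict[int, list]:
    """Group sheet records by doc_id, ordering each file's sheets by sheet_idx."""
    grouped = defaultdict(list)
    for rec in sorted(records, key=lambda r: (r["doc_id"], r["sheet_idx"])):
        grouped[rec["doc_id"]].append(rec["sheet"])
    return dict(grouped)


# Hypothetical collected output, deliberately out of order.
records = [
    dict(doc_id=1, sheet_idx=0, sheet="day2/a"),
    dict(doc_id=0, sheet_idx=1, sheet="day1/b"),
    dict(doc_id=0, sheet_idx=0, sheet="day1/a"),
]

grouped = group_by_file(records)
print(grouped)  # {0: ['day1/a', 'day1/b'], 1: ['day2/a']}
```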