Hi all,
Could I get your opinion on data/process architecture?
Many of you work with big data in an S3 bucket or a SQL table. I don't have access to such resources, so I'd like to know what you think of the following setup.
“Data Lake” - Directory on Shared Drive
• Raw data files
ETL
• read CSV via pandas
• Hamilton to normalize the data (first sketch below)
• write to parquet
“Database” - parquet files holding the normalized historical data
Analysis
• read parquet via dask (anticipating the parquet files will balloon in size)
• processing via a Hamilton/dask workflow (second sketch below)
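To make the ETL step concrete, here's a minimal sketch of what I'm picturing, assuming Hamilton's Builder API; the module name, function names, and shared-drive paths are all placeholders:

```python
# etl_transforms.py -- Hamilton transform functions (names are illustrative)
import pandas as pd


def raw_data(raw_csv_path: str) -> pd.DataFrame:
    """Read one raw CSV out of the shared-drive 'data lake'."""
    return pd.read_csv(raw_csv_path)


def normalized_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Stand-in normalization: tidy column names, drop exact duplicates."""
    df = raw_data.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df.drop_duplicates()
```

```python
# run_etl.py -- executes the module above and lands parquet in the "database"
from hamilton import driver

import etl_transforms

dr = driver.Builder().with_modules(etl_transforms).build()
out = dr.execute(  # the Builder's default result builder returns a dict
    ["normalized_data"],
    inputs={"raw_csv_path": "//shared/lake/raw/2024-01.csv"},  # placeholder
)
out["normalized_data"].to_parquet(
    "//shared/database/normalized/2024-01.parquet", index=False
)
```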
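And for the analysis side, a sketch of the same idea with dask: Hamilton can pass dask DataFrames through the graph like any other Python object, so everything stays lazy until .compute() is called. The columns and paths here are made up:

```python
# analysis_transforms.py -- Hamilton functions over dask (illustrative names)
import dask.dataframe as dd


def history(database_path: str) -> dd.DataFrame:
    """Lazily open the full normalized parquet history."""
    return dd.read_parquet(database_path)


def daily_mean(history: dd.DataFrame) -> dd.DataFrame:
    """Example aggregation; the 'date' and 'value' columns are hypothetical."""
    return history.groupby("date")["value"].mean().to_frame()
```

```python
# run_analysis.py
from hamilton import driver

import analysis_transforms

dr = driver.Builder().with_modules(analysis_transforms).build()
result = dr.execute(
    ["daily_mean"],
    inputs={"database_path": "//shared/database/normalized/"},  # placeholder
)
print(result["daily_mean"].compute())  # data is only actually scanned here
```

As far as I can tell, Hamilton also ships a dask adapter (hamilton.plugins.h_dask) that can hand the whole graph to a distributed Client, which I'd look into once the files really do balloon.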