I’m mainly wondering what the most appropriate destination would be for our use case.
Some details:
• We want to sync the user’s data (for ex. Shopify/Prestashop/Amazon/… orders/customers/products/…) to an internal database (the destination) with Airbyte. This data is periodically used to train an ML model, which we then use to run inference on new data coming into the database.
◦ For training: we need to pull all the data
◦ For inference: we need to pull all data since some_timestamp/last_pull/… (a sketch of both read patterns follows this list)
• We also need to be able to join these different datasets (for ex. adding customer features to an order). We might be able to get away with joining them in the training/prediction code, but this does not seem ideal
• Afterwards, we store our predictions in another database (it probably makes sense for this to have the same architecture as the destination)
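To make the two read patterns (and the join from the point above) concrete, here is a minimal Python sketch against a SQL destination. The table and column names (`orders`, `customers`, `customer_id`, `updated_at`, `lifetime_value`) and the connection string are placeholders I made up for illustration, not necessarily what Airbyte produces for these sources:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string; the destination could be Postgres, MySQL, etc.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/warehouse")

# Training: pull everything, with customer features joined onto each order.
training_df = pd.read_sql(
    text(
        """
        SELECT o.*, c.country AS customer_country, c.lifetime_value
        FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        """
    ),
    engine,
)

# Inference: only pull rows that changed since the last pull we recorded.
last_pull = "2024-01-01T00:00:00"  # would come from whatever state we persist
inference_df = pd.read_sql(
    text(
        """
        SELECT o.*, c.country AS customer_country, c.lifetime_value
        FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE o.updated_at > :last_pull
        """
    ),
    engine,
    params={"last_pull": last_pull},
)
```

The only difference between the training and inference reads is the WHERE clause, which is what makes a SQL destination appealing for the incremental part.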
My thoughts:
• A SQL database seems like the most sensible option. This allows for incremental updates, selecting the most recent data for inference, and joining different tables. However:
◦ I’m worried about creating a large number of tables (with a separate set of tables per user, wouldn’t even 100s of users already be a problem?)
◦ It also seems a bit like overkill to store these types of datasets in separate SQL tables (some clients only have 100s of customers atm)
• NoSQL is an attractive option; however, I would lose the ability to join tables and would have to do this in my training and (more tragically) inference code (see the sketch below this list)
• Dumping records on S3 seems like a bad idea: no incremental updates and no partial reads for inference, so reads/writes would be very heavy
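For comparison, this is roughly what the join-in-code fallback looks like if the destination were a document store. MongoDB is only a stand-in here, and the collection/field names (`orders`, `customers`, `customer_id`, `updated_at`) are again made up:

```python
from datetime import datetime, timezone

import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["airbyte_destination"]

last_pull = datetime(2024, 1, 1, tzinfo=timezone.utc)  # persisted elsewhere

# Pull both collections into memory...
orders = pd.DataFrame(list(db["orders"].find({"updated_at": {"$gt": last_pull}})))
customers = pd.DataFrame(list(db["customers"].find({})))

# ...and do the join ourselves instead of letting the database do it.
inference_df = orders.merge(
    customers.add_prefix("customer_"),  # customers.id -> customer_id, etc.
    on="customer_id",
    how="left",
)
```

Workable for training, but having to re-pull (or cache) the whole customers collection on every inference run is exactly the part that feels wrong.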
Any insight would be much, much appreciated!