I’m mainly wondering what the most appropriate destination would be for our use case.
Some details:
• We want to sync the user’s data (for ex. Shopify/Prestashop/Amazon/… orders/customers/products/…) to an internal database (the destination) with Airbyte. This data is periodically used to train an ML model, which we then use to run inference on new data coming into the database.
◦ For training: we need to pull all the data
◦ For inference: we need to pull all data since some_timestamp/last_pull/… (a sketch of both read patterns follows this list)
• We also need to be able to join these different datasets (for ex. adding customer features to an order). We might be able to get away with joining them in the training/prediction code, but this does not seem ideal
• Afterwards, we store our predictions in another database (it probably makes sense for this to have the same architecture as the destination)
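To make the two read patterns (and the join from the point above) concrete, here is a minimal Python sketch against a SQL destination. The table and column names (`orders`, `customers`, `customer_id`, `updated_at`, `lifetime_value`) and the connection string are placeholders I made up for illustration, not necessarily what Airbyte produces for these sources:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string; the destination could be Postgres, MySQL, etc.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/warehouse")

# Training: pull everything, with customer features joined onto each order.
training_df = pd.read_sql(
    text(
        """
        SELECT o.*, c.country AS customer_country, c.lifetime_value
        FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        """
    ),
    engine,
)

# Inference: only pull rows that changed since the last pull we recorded.
last_pull = "2024-01-01T00:00:00"  # would come from whatever state we persist
inference_df = pd.read_sql(
    text(
        """
        SELECT o.*, c.country AS customer_country, c.lifetime_value
        FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE o.updated_at > :last_pull
        """
    ),
    engine,
    params={"last_pull": last_pull},
)
```

The only difference between the training and inference reads is the WHERE clause, which is what makes a SQL destination appealing for the incremental part.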
My thoughts:
• A SQL database seems like the most sensible option. This allows for incremental updates, selecting the most recent data for inference, and joining different tables. However:
◦ I’m worried about creating a large number of tables (with a separate set of tables per user, wouldn’t even 100s of users already be a problem?)
◦ It also seems a bit like overkill to store these types of datasets in separate SQL tables (some clients only have 100s of customers atm)
• NoSQL is an attractive option; however, I would lose the ability to join tables and would have to do this in my training and (more tragically) inference code (see the sketch below this list)
• Dumping records on S3 seems like a bad idea: no incremental updates and no partial reads for inference, so reads/writes would be very heavy
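For comparison, this is roughly what the join-in-code fallback looks like if the destination were a document store. MongoDB is only a stand-in here, and the collection/field names (`orders`, `customers`, `customer_id`, `updated_at`) are again made up:

```python
from datetime import datetime, timezone

import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["airbyte_destination"]

last_pull = datetime(2024, 1, 1, tzinfo=timezone.utc)  # persisted elsewhere

# Pull both collections into memory...
orders = pd.DataFrame(list(db["orders"].find({"updated_at": {"$gt": last_pull}})))
customers = pd.DataFrame(list(db["customers"].find({})))

# ...and do the join ourselves instead of letting the database do it.
inference_df = orders.merge(
    customers.add_prefix("customer_"),  # customers.id -> customer_id, etc.
    on="customer_id",
    how="left",
)
```

Workable for training, but having to re-pull (or cache) the whole customers collection on every inference run is exactly the part that feels wrong.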
Any insight would be much, much appreciated!