# feedback-and-requests
Hello All!! We have a requirement like below; could you please help us understand if Airbyte is a good fit for our use case for the EL of ELT?

1. We have 10,000+ customers who send data as CSV, XLSX, CSV/XLSX email attachments, and EDI files.
2. Each customer sends hundreds of files daily in one of the above formats; in total we receive around 100,000 files each day.
3. We are evaluating Airbyte with File as Source and Postgres as Destination, with basic normalization enabled to write to Postgres.
4. We are also evaluating dbt on top of the basic-normalized data to build a common data model as far as possible.

We ran a basic validation/POC with Airbyte to evaluate the above requirements, but we could not get close to a solution due to the issues below. Could you please help us by addressing these challenges:

1. How do we process 100,000 unique files using a File (Source) to Postgres (Destination) connection?
2. Do we need to create 10,000+ sources/connections, one per customer?
3. Do we need to create 10,000+ Postgres destinations/schemas, one per customer?
4. How do we handle 100+ files from the same customer on the same day? (Currently it seems only one file name can be configured per source/destination and connection, i.e. 1:1.)
5. How can we handle frequent data type and column (+/-) changes at the source with Airbyte?
6. What is the connection limit in Airbyte?
7. Can we manage this many sources, destinations, and connections?
8. Is there a better way to achieve the above requirement using Airbyte and dbt?
9. What is the plan/roadmap for accepting EDI files in the File source?
10. What is the plan/roadmap for accepting XML files in the File source?
11. Is there a source connector that can retrieve email attachments in the above file formats?
12. We converted an EDI file to JSON and sourced it into Airbyte, but we could not see the data in the tables created in Postgres with basic normalization. Any help/video to understand the E2E flow?

Could you please help us with a demo and/or a workshop on how Airbyte can be used to achieve the above requirement? Let us know if we can meet any of the Airbyte engineers/SMEs to learn more about this feature-rich, trending EL-of-ELT tool. Thanks in anticipation!!

Thanks,
Ravi Kottu
Hi @Ravi Kottu
1. & 2. Our File (Source) connector can currently only handle a single file, so you'd have to create one connection per file... Our S3/GCS connectors can handle multiple files with the same schema (using glob patterns), so you could try uploading your files to S3 if possible.
3. You can create a single Postgres destination and target different schemas via the namespace settings in each connection's sync settings.
4. Same answer as 1.: the S3/GCS source will let you sync new files in incremental mode.
5. We do not support schema evolution yet.
6. There is no hard limit; properly sizing your Airbyte instance/cluster is the solution here.
7. Yes.
8. Host your source files in cloud storage (S3/GCS) and perform incremental loads + prefix matching to handle multiple files.
9. Feel free to request support for this file format on our GitHub repo.
10. We have an open GitHub issue for XML support in the S3 connector; feel free to subscribe to it to get updates on this topic.
11. We only support structured data formats from which a schema can be inferred.
12. Feel free to open an issue / write a specific message in #troubleshooting so that we can investigate this together.
We have plenty of resources on our website to help you understand what you can do with Airbyte. Let us know if you have other specific questions, which we'll be glad to answer.
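To make the S3/GCS suggestion concrete, here is a minimal sketch of a per-customer key layout that would let a single S3 source with one glob pattern cover all 10,000+ customers. The bucket layout, key format, and helper name are assumptions for illustration, not anything Airbyte prescribes:

```python
from datetime import date

# Hypothetical layout: one prefix per customer, partitioned by receipt date,
# e.g. "cust-0042/2023-01-15/orders.csv". A single Airbyte S3 source with a
# glob such as "**/*.csv" could then match every customer's files, and
# incremental mode would pick up only newly added objects on each sync.
def s3_key(customer_id: str, received: date, filename: str) -> str:
    """Build the S3 object key for one incoming customer file."""
    return f"{customer_id}/{received.isoformat()}/{filename}"

key = s3_key("cust-0042", date(2023, 1, 15), "orders.csv")
print(key)  # cust-0042/2023-01-15/orders.csv
```

Uploading into this layout could be done with any S3 client (for example boto3's `upload_file`); the point is only that a consistent prefix scheme replaces 10,000+ per-file connections with one source per file format.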
@[DEPRECATED] Augustin Lafanechere Thank you for answering; we appreciate your help. Could you please provide a few more details so we can continue?

1. & 2. (How do we process 100,000 unique files? Do we need 10,000+ sources/connections, one per customer?) You said the File (Source) connector can only handle a single file, so we'd need one connection per file, and suggested the S3/GCS connector with glob patterns instead.
- Does Airbyte also support Azure the way it supports the S3/GCS connectors?

4. (How do we handle 100+ files from the same customer on the same day, given that currently only one file name can be configured per source/destination and connection, i.e. 1:1?) You suggested the S3/GCS source in incremental mode.
- Can Azure be used in place of the S3/GCS source?
- Can dbt track this?

12. (We converted an EDI file to JSON and sourced it into Airbyte, but could not see the data in the Postgres tables created by basic normalization; any help/video to understand the E2E flow?) You suggested opening an issue / writing in #troubleshooting.
- I will run a test with JSON as a source and post the details in #troubleshooting for more updates.
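Regarding the EDI-to-JSON step in point 12: basic normalization flattens records into columns, so flat, newline-delimited JSON tends to normalize more predictably than deeply nested objects. Below is a hedged sketch of turning an X12-style EDI snippet into flat JSON records. The sample segment string and the `el01`/`el02` field names are invented for illustration; real EDI files declare their own delimiters in the ISA envelope, so this is not a general parser:

```python
import json

# Hypothetical X12-style snippet: '~' terminates a segment, '*' separates
# elements. Real-world files may use different delimiters.
edi = "ST*850*0001~BEG*00*SA*PO123**20230115~"

def edi_to_records(raw: str) -> list:
    """Split segments on '~' and elements on '*', emitting one flat dict
    per segment so downstream normalization maps each key to a column."""
    records = []
    for segment in filter(None, raw.strip().split("~")):
        elements = segment.split("*")
        records.append({
            "segment_id": elements[0],
            **{f"el{i:02d}": value for i, value in enumerate(elements[1:], start=1)},
        })
    return records

# Emit newline-delimited JSON, one record per line.
for rec in edi_to_records(edi):
    print(json.dumps(rec))
```

Writing the output as one JSON object per line (JSON Lines) rather than a single nested document is one way to give basic normalization a stable, tabular shape to work with.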
For the 12th question, I am uploading the log file and screenshots from the Postgres DB. Please have a look and let us know what the expected behaviour is.