# feedback-and-requests
Hey team! Since I’m not a Java developer I won’t submit a PR for this myself, but I noticed something which I think makes your lives a lot harder than it needs to be. For a few destinations, we write simple files. All of these destinations currently implement the same code for all the file formats that are supported (see, for instance, https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/d[…]tination/azure_blob_storage/writer/ProductionWriterFactory.java). The downside of this approach is that even though we have (for instance) Parquet output for S3, we don’t have it for Azure Blob Storage, even though all the code to support it is already written. IMO it would save a lot of effort if there were a single “remote file destination” that handles all the type conversions, the schema, etc., and then let S3, Blob, FTP, … deal with the interface to the storage itself.
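To make the suggestion concrete, here is a minimal sketch of the kind of split being proposed: format serialization written once and shared, with each storage backend only responsible for moving bytes. It is in Python (which comes up later in the thread), and every class and function name here is hypothetical, not Airbyte’s actual code.

```python
# Sketch only: format writers are implemented once, storage backends stay thin.
import csv
import io
from typing import Iterable, Mapping, Protocol

import pyarrow as pa
import pyarrow.parquet as pq


class FormatWriter(Protocol):
    """Serializes a batch of records into bytes; shared by every destination."""

    def serialize(self, records: Iterable[Mapping]) -> bytes: ...


class CsvWriter:
    def serialize(self, records):
        rows = list(records)
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue().encode("utf-8")


class ParquetWriter:
    def serialize(self, records):
        table = pa.Table.from_pylist(list(records))
        buf = io.BytesIO()
        pq.write_table(table, buf)
        return buf.getvalue()


class StorageBackend(Protocol):
    """The only part S3 / Azure Blob / (S)FTP would each have to implement."""

    def upload(self, path: str, data: bytes) -> None: ...


class LocalBackend:
    """Stand-in backend so the sketch runs without any cloud credentials."""

    def upload(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


def write_stream(records, fmt: FormatWriter, storage: StorageBackend, path: str):
    # Format conversion happens once, independent of where the file ends up.
    storage.upload(path, fmt.serialize(records))


if __name__ == "__main__":
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    write_stream(rows, ParquetWriter(), LocalBackend(), "example.parquet")
    write_stream(rows, CsvWriter(), LocalBackend(), "example.csv")
```

With a split like this, adding Parquet support to a new destination means writing only the `upload` side, not re-implementing the format conversion.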
Feedback has been sent to Harvestr
Thanks @Huib, this feedback is really valuable. I think some sources/destinations were implemented to let users start using them quickly, but there is now plenty of room for improvement, as you pointed out. Are you looking to implement a new option in a connector and hitting a problem doing it?
Good question, and yes 🙂 I’ve got a few sources that use (S)FTP for data transport, so I thought I’d implement this myself. In doing so (again, not a Java developer, more at home in Python) I quickly realised this would mean implementing a Parquet reader, a CSV reader, a JSON reader, an Avro reader, etc.
Also, we’re using blob storage as our main destination, and writing to CSV files is rather painful 😉 For example, one of our tables is around 200M rows with a handful of columns. Storing it as CSV takes 50 GB, while the Parquet version takes less than 1 GB. Even though the costs associated with this table are small, it adds up over all the tables that we have. A 50x reduction in storage (and IO) costs is always welcome 🙂
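For anyone who wants to reproduce that CSV-vs-Parquet comparison on their own table, a quick pandas/pyarrow check along these lines should do it (the file name and columns are just placeholders; actual savings depend on the data):

```python
# Compare on-disk size of the same table as CSV and as Parquet.
import os

import pandas as pd

df = pd.read_csv("some_table.csv")      # the existing CSV output
df.to_parquet("some_table.parquet")     # columnar + compressed (snappy by default)

print("csv:    ", os.path.getsize("some_table.csv"))
print("parquet:", os.path.getsize("some_table.parquet"))
```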
I would be more than happy to create a PR for this type of enhancement for the Python ecosystem, if that helps.
Hi @Huib, I think I found a GitHub issue pretty similar to the valuable feedback you shared. Feel free to give it a thumbs-up, share your opinion and discoveries on it, and of course subscribe to receive updates 😄