Hi we recently updated our airbyte servers to the latest ver Airbyte #troubleshooting

Thiago Costa

04/01/2022, 3:59 AM

Hi, we recently updated our airbyte servers to the latest version and since then we've noticed that the S3 destination writing parquet files doesn't follow the naming convention from before the update. Seems like the new buffer file implementation changed something there

Thiago Costa

04/01/2022, 3:59 AM

PR: https://github.com/airbytehq/airbyte/pull/11294

Thiago Costa

04/01/2022, 4:00 AM

you can see the results from before and after here

Thiago Costa

04/01/2022, 4:01 AM

the logs do show that same thing:

2022-03-31 21:35:19 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):305 - Total records read: 444 (1 MB)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.b.FailureTrackingAirbyteMessageConsumer(close):65 - Airbyte message consumer: succeeded.
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.b.BufferedStreamConsumer(close):170 - executing on success close procedure.
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.r.SerializedBufferingStrategy(flushAll):92 - Flushing all 9 current buffers (524 KB in total)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.r.SerializedBufferingStrategy(lambda$flushAll$2):95 - Flushing buffer of stream shopify_abandoned_checkouts (8 KB)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.s.S3ConsumerFactory(lambda$flushBufferFunction$3):136 - Flushing buffer for stream shopify_abandoned_checkouts (8 KB) to storage
2022-03-31 21:35:19 INFO i.a.w.DefaultReplicationWorker(run):163 - One of source or destination thread complete. Waiting on the other.
2022-03-31 21:35:20 destination > 2022-03-31 21:35:20 INFO i.a.i.d.s.p.ParquetSerializedBuffer(flush):106 - Finished writing data to ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet (8 KB)
2022-03-31 21:35:20 destination > 2022-03-31 21:35:20 INFO i.a.i.d.s.u.StreamTransferManagerHelper(getDefault):55 - PartSize arg is set to 10 MB
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(getMultiPartOutputStreams):329 - Initiated multipart upload to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with full ID evEsKdJr5EPtQTe_CKm1svtdmT8FnXwg3dytZIWyUew1pjBZRV9vSnSFWt3NXiM4jfMkqgW8vIvwib0SgkO94dTG4fxeqmZK3DqBTX9M7DmtHNf94vlfXKC3kXKbePTt
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.MultiPartOutputStream(close):158 - Called close() on [MultipartOutputStream for parts 1 - 10000]
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(complete):367 - [Manager uploading to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with id evEsKdJr5...3kXKbePTt]: Uploading leftover stream [Part number 1 containing 0.06 MB]
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(uploadStreamPart):558 - [Manager uploading to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with id evEsKdJr5...3kXKbePTt]: Finished uploading [Part number 1 containing 0.06 MB]

Thiago Costa

04/01/2022, 4:03 AM

cc: @Zak Keener

Thiago Costa

04/01/2022, 4:06 AM

also cc'ing @Chris Duong [Airbyte] since they might be familiar with the changes

Zak Keener

04/01/2022, 5:59 AM

Note: we also updated the S3 destination to latest

Sherif Nada

04/01/2022, 6:05 AM

Cc @Greg Solovyev (Airbyte)

Chris Duong [Airbyte]

04/01/2022, 6:12 AM

Yes, the new buffering uses new classes to handle files (and filenames). It might generate multiple smaller files, and to avoid overwriting files (and file conflicts), filenames now follow a UUID. Maybe the new s3_path_format field could be improved so users can specify both the folder name format and the filename too.

Sherif Nada

04/01/2022, 3:29 PM

@Chris Duong [Airbyte] isn't changing the output path considered a breaking change though if user workflows expect a particular pattern?

Chris Duong [Airbyte]

04/01/2022, 3:34 PM

Yes, I guess we can consider that a breaking change then. Should we republish the version of the connector and update the docs too? In parallel, CSV/JSON files are also compressed now.

Sherif Nada

04/01/2022, 3:40 PM

Republish meaning revert the name path change and publish? I am not familiar with why the change was made, so not 100% sure. But in general I would say we should keep backwards compatibility unless there's a strong reason to break it

Thiago Costa

04/01/2022, 3:48 PM

The buffered output is a very welcome change, but the fact that the final file being written won't follow the pattern of the older files is a problem. Now I'd have to rely on reading the file's metadata to know which files I should read instead of using simple string patterns.

Thiago Costa

04/01/2022, 3:50 PM

On the other hand, it would be awesome to be able to set the full path using variables (as is already done internally for the file name). So I could choose the path as

/bucket/path/${YEAR}/${MONTH}/${DAY}

if I wanted (or even more granular for hourly batches).
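To make the request above concrete, here is a minimal sketch of how date variables in such a path could be filled in. It is not Airbyte's implementation: `render_path` is a hypothetical helper, the zero-padded values and the `${HOUR}` variable for hourly batches are assumptions, and Python's `string.Template` is used only because it happens to share the `${NAME}` placeholder syntax.

```python
from datetime import datetime, timezone
from string import Template


def render_path(path_format: str, ts: datetime) -> str:
    """Fill date/time placeholders such as ${YEAR} in a path format string."""
    # Assumption: values are zero-padded; the real connector may format them differently.
    variables = {
        "YEAR": f"{ts.year:04d}",
        "MONTH": f"{ts.month:02d}",
        "DAY": f"{ts.day:02d}",
        "HOUR": f"{ts.hour:02d}",
    }
    # substitute() raises KeyError on an unknown placeholder, so typos surface early.
    return Template(path_format).substitute(variables)


ts = datetime(2022, 4, 1, 15, 50, tzinfo=timezone.utc)
daily = render_path("/bucket/path/${YEAR}/${MONTH}/${DAY}", ts)
hourly = render_path("/bucket/path/${YEAR}/${MONTH}/${DAY}/${HOUR}", ts)
print(daily)   # /bucket/path/2022/04/01
print(hourly)  # /bucket/path/2022/04/01/15
```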

Chris Duong [Airbyte]

04/01/2022, 3:52 PM

Yes, along with the buffering, the format is now exposed for the path. It would require extra effort to mimic the same for the file name generation too, but if we do that, we could set a default value that is backward compatible with the previous behavior.

Thiago Costa

04/01/2022, 3:52 PM

Regarding overwriting files: if that's a concern, I'd recommend making the file prefix the time format (YYYY-MM-DD-????) and the suffix the UUID.

Thiago Costa

04/01/2022, 3:53 PM

That would be welcome (the feature of using the variables in the path name), but it's a secondary request. The most important thing for now would be to revert that specific change, so that the file is prefixed with the datetime again.

Thiago Costa

04/01/2022, 3:54 PM

(thanks for checking the case, btw)

Greg Solovyev (Airbyte)

04/01/2022, 4:28 PM

@Thiago Costa just to confirm, if filenames are prefixed with datetime (like they were before), but also have the UUID part (like they do now), that would solve the problem for you?

Thiago Costa

04/01/2022, 4:55 PM

I'd say so, yes

Thiago Costa

04/01/2022, 4:56 PM

so we don't have to rely on metadata to identify the files
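As an illustration of why the datetime prefix matters on the consumer side, the sketch below selects one day's files purely by name. The file names and the exact prefix layout (`YYYY_MM_DD-` followed by the rest of the name) are hypothetical examples, not the connector's documented format, and `files_for_day` is a made-up helper.

```python
import re

# Assumed layout: a "YYYY_MM_DD-" prefix, then anything (time parts, UUID, extension).
DATE_PREFIX = re.compile(r"^(\d{4})_(\d{2})_(\d{2})-")


def files_for_day(names, year, month, day):
    """Return the names whose date prefix matches the given day."""
    wanted = (f"{year:04d}", f"{month:02d}", f"{day:02d}")
    selected = []
    for name in names:
        match = DATE_PREFIX.match(name)
        if match and match.groups() == wanted:
            selected.append(name)
    return selected


# Hypothetical file names: two with a datetime prefix, one that is UUID-only.
names = [
    "2022_03_31-21-35-19-0f8fad5b-d9cb-469f-a165-70867728950e.parquet",
    "2022_04_01-17-18-02-7c9e6679-7425-40de-944b-e07fc1f90ae7.parquet",
    "ca6e80bf-cfc8-420f-ad2f-c481bf76e418.parquet",
]
march_31 = files_for_day(names, 2022, 3, 31)
print(march_31)
```

The UUID-only name never matches, which is the case where one would have to fall back to object metadata such as the last-modified time.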

Chris Duong [Airbyte]

04/01/2022, 5:00 PM

We have to test this idea, but the new connector now has a field

s3_path_format

and the default pattern is

${NAMESPACE}/${STREAM_NAME}/

But maybe try using:

${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}-${HOUR}-${MINUTE}-${SECOND}

instead (notice there is no / at the end). It might do what you want, while appending <uuid>.<extension> after it.

Chris Duong [Airbyte]

04/01/2022, 5:18 PM

yes it worked

Thiago Costa

04/01/2022, 5:24 PM

awesome, I'll test it from our side

Thiago Costa

04/01/2022, 5:24 PM

and update you

Thiago Costa

04/01/2022, 5:24 PM

Will the s3_path_format only affect the file name, or the actual path as well?

Chris Duong [Airbyte]

04/01/2022, 5:25 PM

it’s actually both
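One way to picture "both" is that the rendered format is a plain string prefix: everything up to its last `/` ends up as the folder, and whatever follows becomes the start of the file name. The sketch below assumes simple concatenation of `<uuid>.<extension>`, as described earlier in the thread; `object_key` and `split_key` are hypothetical helpers, `public` and `users` are made-up namespace and stream names, and the real connector may behave differently.

```python
import uuid
from typing import Optional, Tuple


def object_key(rendered_prefix: str, extension: str, file_id: Optional[str] = None) -> str:
    """Append '<uuid>.<extension>' to an already rendered path format (plain concatenation)."""
    file_id = file_id or str(uuid.uuid4())
    return f"{rendered_prefix}{file_id}.{extension}"


def split_key(key: str) -> Tuple[str, str]:
    """Split an object key into (folder, file name) at the last '/'."""
    folder, _, name = key.rpartition("/")
    return folder, name


fixed_id = "123e4567-e89b-12d3-a456-426614174000"  # fixed so the output is reproducible

# Default-style format ending in '/': the file name is just the UUID.
default_key = object_key("public/users/", "parquet", fixed_id)
# Format without a trailing '/': the last segment becomes a file name prefix.
custom_key = object_key("public/users/2022_04_01-17-00-00", "parquet", fixed_id)

print(split_key(default_key))
print(split_key(custom_key))
```

Under this assumption the timestamp and the UUID run together in the file name, so ending the format with a separator such as `-` or `_` may make the names easier to parse.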

Thiago Costa

04/01/2022, 6:45 PM

ok, that worked perfectly. guess the only thing needed would be to document this behavior

Thiago Costa

04/01/2022, 6:46 PM

thanks for the support @Chris Duong [Airbyte] @Greg Solovyev (Airbyte) @Sherif Nada
