# troubleshooting
t
Hi, we recently updated our Airbyte servers to the latest version, and since then we've noticed that the S3 destination writing Parquet files doesn't follow the naming convention from before the update. It seems like the new buffer file implementation changed something there.
you can see the results from before and after here
The logs show the same thing:
2022-03-31 21:35:19 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):305 - Total records read: 444 (1 MB)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.b.FailureTrackingAirbyteMessageConsumer(close):65 - Airbyte message consumer: succeeded.
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.b.BufferedStreamConsumer(close):170 - executing on success close procedure.
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.r.SerializedBufferingStrategy(flushAll):92 - Flushing all 9 current buffers (524 KB in total)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.r.SerializedBufferingStrategy(lambda$flushAll$2):95 - Flushing buffer of stream shopify_abandoned_checkouts (8 KB)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.s.S3ConsumerFactory(lambda$flushBufferFunction$3):136 - Flushing buffer for stream shopify_abandoned_checkouts (8 KB) to storage
2022-03-31 21:35:19 INFO i.a.w.DefaultReplicationWorker(run):163 - One of source or destination thread complete. Waiting on the other.
2022-03-31 21:35:20 destination > 2022-03-31 21:35:20 INFO i.a.i.d.s.p.ParquetSerializedBuffer(flush):106 - Finished writing data to ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet (8 KB)
2022-03-31 21:35:20 destination > 2022-03-31 21:35:20 INFO i.a.i.d.s.u.StreamTransferManagerHelper(getDefault):55 - PartSize arg is set to 10 MB
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(getMultiPartOutputStreams):329 - Initiated multipart upload to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with full ID evEsKdJr5EPtQTe_CKm1svtdmT8FnXwg3dytZIWyUew1pjBZRV9vSnSFWt3NXiM4jfMkqgW8vIvwib0SgkO94dTG4fxeqmZK3DqBTX9M7DmtHNf94vlfXKC3kXKbePTt
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.MultiPartOutputStream(close):158 - Called close() on [MultipartOutputStream for parts 1 - 10000]
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(complete):367 - [Manager uploading to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with id evEsKdJr5...3kXKbePTt]: Uploading leftover stream [Part number 1 containing 0.06 MB]
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(uploadStreamPart):558 - [Manager uploading to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with id evEsKdJr5...3kXKbePTt]: Finished uploading [Part number 1 containing 0.06 MB]
cc: @Zak Keener
also cc'ing @Chris Duong [Airbyte] since they might be familiar with the changes
z
Note: we also updated the S3 destination to latest
s
Cc @Greg Solovyev (Airbyte)
c
Yes, the new buffering uses new classes to handle files (and filenames). It might generate multiple smaller files, and to avoid overwriting files (and file conflicts), filenames now use a UUID. Maybe the new s3_path_format field could be improved so users can specify both the folder name format and the filename.
s
@Chris Duong [Airbyte] isn't changing the output path considered a breaking change though if user workflows expect a particular pattern?
c
Yes, I guess we can consider that a breaking change then. Should we republish the version of the connector and update the docs too? In parallel, CSV/JSON files are also compressed now.
s
Republish meaning revert the name path change and publish? I am not familiar with why the change was made, so not 100% sure. But in general I would say we should keep backwards compatibility unless there's a strong reason to break it
t
The buffered output is a very welcome change, but the fact that the final file being written won't follow the pattern of the older files is a problem. Now I'd have to rely on reading the files' metadata to know which files I should read, instead of using simple string patterns.
On the other hand, it would be awesome to be able to set the full path using variables (as is already done internally for the file name). So I could choose the path as
/bucket/path/${YEAR}/${MONTH}/${DAY}
if I wanted (or even more granular, for hourly batches).
c
Yes, along with the buffering, the format is now exposed for the path. It would require extra effort to mimic the same for the file name generation too, but if we do that, we could set a default value that is backward compatible with the previous behavior.
t
Regarding overwriting files: if that's a concern, I'd recommend making the file prefix the timestamp (YYYY-MM-DD-????) and the suffix the UUID.
Using variables in the path name would be welcome, but it's a secondary request. The most important thing for now would be to restore the datetime prefix on the file name.
(thanks for checking the case, btw)
g
@Thiago Costa just to confirm: if filenames are prefixed with the datetime (like they were before) but also have the UUID part (like they do now), would that solve the problem for you?
t
I'd say so, yes
so we don't have to rely on metadata to identify the files
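For context, this is roughly the kind of name-based selection a datetime prefix allows. It is only a sketch: the `YYYY-MM-DD-<anything>.parquet` layout follows the prefix suggested above, and `DATED_NAME` and `files_for_day` are hypothetical names, not part of Airbyte.

```python
import re

# Assumed (hypothetical) file name layout: <YYYY>-<MM>-<DD>-<anything>.parquet
DATED_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-.*\.parquet$")


def files_for_day(keys, year, month, day):
    """Return the object keys whose file name starts with the given date."""
    wanted = (f"{year:04d}", f"{month:02d}", f"{day:02d}")
    selected = []
    for key in keys:
        # Only the last path segment (the file name) is matched.
        name = key.rsplit("/", 1)[-1]
        match = DATED_NAME.match(name)
        if match and match.groups() == wanted:
            selected.append(key)
    return selected
```

A UUID-only name such as the one in the logs above has no date prefix, so it never matches, which is why those files would need metadata instead.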
c
we have to test this idea, but the new connector now has a field
s3_path_format
and default pattern is
${NAMESPACE}/${STREAM_NAME}/
But maybe trying to use:
${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}-${HOUR}-${MINUTE}-${SECOND}
instead (notice there is no
/
at the end). It might do what you want, with <uuid>.<extension> appended after.
yes it worked
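To illustrate why dropping the trailing slash turns the last part of the format into a file name prefix, here is a minimal sketch. It is not the connector's actual code: `expand_path_format` and `object_key` are hypothetical helpers, and the only behavior assumed is what is described above (variables are substituted, then `<uuid>.<extension>` is appended).

```python
import re
import uuid
from datetime import datetime

# Matches ${NAME} placeholders such as ${YEAR} or ${STREAM_NAME}.
_VARIABLE = re.compile(r"\$\{([A-Z_]+)\}")


def expand_path_format(path_format, namespace, stream_name, when):
    """Substitute the ${...} variables in a path format string."""
    values = {
        "NAMESPACE": namespace,
        "STREAM_NAME": stream_name,
        "YEAR": f"{when.year:04d}",
        "MONTH": f"{when.month:02d}",
        "DAY": f"{when.day:02d}",
        "HOUR": f"{when.hour:02d}",
        "MINUTE": f"{when.minute:02d}",
        "SECOND": f"{when.second:02d}",
    }
    # An unknown variable raises KeyError in this sketch.
    return _VARIABLE.sub(lambda m: values[m.group(1)], path_format)


def object_key(path_format, namespace, stream_name, when, extension, file_id=None):
    """Append <uuid>.<extension> directly after the expanded format."""
    file_id = file_id or str(uuid.uuid4())
    prefix = expand_path_format(path_format, namespace, stream_name, when)
    return f"{prefix}{file_id}.{extension}"


when = datetime(2022, 3, 31, 21, 35, 19)
# Trailing slash: the format is only a folder, the file name is the bare UUID.
default_key = object_key("${NAMESPACE}/${STREAM_NAME}/", "shopify",
                         "shopify_abandoned_checkouts", when, "parquet", file_id="abc")
# No trailing slash: the datetime part becomes a prefix of the file name.
dated_key = object_key(
    "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}-${HOUR}-${MINUTE}-${SECOND}",
    "shopify", "shopify_abandoned_checkouts", when, "parquet", file_id="abc")
```

In this sketch nothing separates the seconds from the UUID, so ending the format with a character such as `_` would keep the two parts readable.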
t
awesome, I'll test it from our side
and update you
Will the s3_path_format only affect the file name, or the actual path as well?
c
it’s actually both
t
OK, that worked perfectly. I guess the only thing needed would be to document this behavior.
thanks for the support @Chris Duong [Airbyte] @Greg Solovyev (Airbyte) @Sherif Nada