#troubleshooting

Thiago Costa

04/01/2022, 3:59 AM
Hi, we recently updated our airbyte servers to the latest version and since then we've noticed that the S3 destination writing parquet files doesn't follow the naming convention from before the update. Seems like the new buffer file implementation changed something there
you can see the results from before and after here
the logs do show that same thing:
2022-03-31 21:35:19 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):305 - Total records read: 444 (1 MB)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.b.FailureTrackingAirbyteMessageConsumer(close):65 - Airbyte message consumer: succeeded.
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.b.BufferedStreamConsumer(close):170 - executing on success close procedure.
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.r.SerializedBufferingStrategy(flushAll):92 - Flushing all 9 current buffers (524 KB in total)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.r.SerializedBufferingStrategy(lambda$flushAll$2):95 - Flushing buffer of stream shopify_abandoned_checkouts (8 KB)
2022-03-31 21:35:19 destination > 2022-03-31 21:35:19 INFO i.a.i.d.s.S3ConsumerFactory(lambda$flushBufferFunction$3):136 - Flushing buffer for stream shopify_abandoned_checkouts (8 KB) to storage
2022-03-31 21:35:19 INFO i.a.w.DefaultReplicationWorker(run):163 - One of source or destination thread complete. Waiting on the other.
2022-03-31 21:35:20 destination > 2022-03-31 21:35:20 INFO i.a.i.d.s.p.ParquetSerializedBuffer(flush):106 - Finished writing data to ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet (8 KB)
2022-03-31 21:35:20 destination > 2022-03-31 21:35:20 INFO i.a.i.d.s.u.StreamTransferManagerHelper(getDefault):55 - PartSize arg is set to 10 MB
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(getMultiPartOutputStreams):329 - Initiated multipart upload to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with full ID evEsKdJr5EPtQTe_CKm1svtdmT8FnXwg3dytZIWyUew1pjBZRV9vSnSFWt3NXiM4jfMkqgW8vIvwib0SgkO94dTG4fxeqmZK3DqBTX9M7DmtHNf94vlfXKC3kXKbePTt
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.MultiPartOutputStream(close):158 - Called close() on [MultipartOutputStream for parts 1 - 10000]
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(complete):367 - [Manager uploading to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with id evEsKdJr5...3kXKbePTt]: Uploading leftover stream [Part number 1 containing 0.06 MB]
2022-03-31 21:35:21 destination > 2022-03-31 21:35:21 INFO a.m.s.StreamTransferManager(uploadStreamPart):558 - [Manager uploading to stage-polyops-useast2-airbyte/drinkhaus/shopify/shopify_abandoned_checkouts/ca6e80bf-cfc8-420f-ad2f-c481bf76e41817573920884781195944.parquet with id evEsKdJr5...3kXKbePTt]: Finished uploading [Part number 1 containing 0.06 MB]
cc: @Zak Keener
also cc'ing @Chris Duong [Airbyte] since they might be familiar with the changes

Zak Keener

04/01/2022, 5:59 AM
Note: we also updated the S3 destination to latest

Sherif Nada

04/01/2022, 6:05 AM
Cc @Greg Solovyev (Airbyte)

Chris Duong [Airbyte]

04/01/2022, 6:12 AM
Yes, the new buffering uses new classes to handle files (and filenames). It might generate multiple smaller files, and to avoid overwriting files (and file conflicts), filenames are now UUIDs. Maybe the new s3_path_format field could be improved so users can specify both the folder name format and the filename too.

Sherif Nada

04/01/2022, 3:29 PM
@Chris Duong [Airbyte] isn't changing the output path considered a breaking change though if user workflows expect a particular pattern?

Chris Duong [Airbyte]

04/01/2022, 3:34 PM
Yes, I guess we can consider that a breaking change then. Should we republish the version of the connector/update docs too? In parallel, CSV/JSON files are also compressed now.

Sherif Nada

04/01/2022, 3:40 PM
Republish meaning revert the name path change and publish? I am not familiar with why the change was made, so not 100% sure. But in general I would say we should keep backwards compatibility unless there's a strong reason to break it

Thiago Costa

04/01/2022, 3:48 PM
The buffered output is a very welcome change, but the fact that the final file being written won't follow the pattern of the older files is a problem. Now I'd have to rely on reading the file's metadata to know which files I should read, instead of using simple string patterns.
On the other hand, it would be awesome to be able to set the full path with variables (as is already done internally for the file name). So I could choose the path as
/bucket/path/${YEAR}/${MONTH}/${DAY}
if I wanted (or even more granular for hourly batches).

Chris Duong [Airbyte]

04/01/2022, 3:52 PM
Yes, along with the buffering, the format is now exposed for the path. It would require extra effort to mimic the same for the file name generation too, but if we do that, we could set a default value that is backward compatible with before.

Thiago Costa

04/01/2022, 3:52 PM
regarding overwriting files: if that's a concern, I'd recommend having the file prefix being the time format (YYYY-MM-DD-????) and the suffix being the uuid
That would be welcome (the feature of using the variables on the path name), but it's a secondary request. The most important thing for now would be to restore the datetime prefix on the file name.
(thanks for checking the case, btw)

Greg Solovyev (Airbyte)

04/01/2022, 4:28 PM
@Thiago Costa just to confirm, if filenames are prefixed with datetime (like they were before), but also have the UUID part (like they do now) - that would solve the problem for you?

Thiago Costa

04/01/2022, 4:55 PM
I'd say so, yes
so we don't have to rely on metadata to identify the files
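For readers following along, here is a rough sketch of the kind of "simple string pattern" selection discussed above. It assumes file names start with a date prefix shaped like YYYY_MM_DD followed by a UUID; that exact shape, the `keys_for_day` helper, and the sample keys in the usage note are all hypothetical and only illustrate the idea.

```python
from datetime import date
from posixpath import basename


def keys_for_day(keys, day):
    """Return the object keys whose file name starts with the given day.

    Assumption: file names begin with a YYYY_MM_DD prefix. The real prefix
    depends on how the destination is configured.
    """
    prefix = day.strftime("%Y_%m_%d")
    # Only the file name is inspected, so folder names never match by accident.
    return [key for key in keys if basename(key).startswith(prefix)]
```

Calling `keys_for_day(["shopify/orders/2022_03_31-21-35-19abc.parquet", "shopify/orders/abc.parquet"], date(2022, 3, 31))` would keep only the first key, since a UUID-only name carries no date to match on.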

Chris Duong [Airbyte]

04/01/2022, 5:00 PM
We have to test this idea, but the new connector now has a field s3_path_format, and its default pattern is
${NAMESPACE}/${STREAM_NAME}/
But maybe try using:
${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}-${HOUR}-${MINUTE}-${SECOND}
instead (notice there is no / at the end). It might do what you want, with <uuid>.<extension> appended after.
yes it worked
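To make the effect of the missing trailing / concrete, here is a minimal sketch of how such a pattern could expand into an object key. It is not the connector's actual code: the `expand_path_format` function and its parameters are hypothetical, and it assumes the `<uuid>.<extension>` part is appended directly to the expanded pattern with no separator, as the thread describes.

```python
import uuid
from datetime import datetime
from string import Template


def expand_path_format(path_format, namespace, stream_name, when, extension, file_id=None):
    """Expand ${VAR} placeholders, then append <uuid>.<extension>.

    Hypothetical helper: only the variables mentioned in this thread are
    filled in; any other placeholder is left untouched.
    """
    values = {
        "NAMESPACE": namespace,
        "STREAM_NAME": stream_name,
        "YEAR": f"{when.year:04d}",
        "MONTH": f"{when.month:02d}",
        "DAY": f"{when.day:02d}",
        "HOUR": f"{when.hour:02d}",
        "MINUTE": f"{when.minute:02d}",
        "SECOND": f"{when.second:02d}",
    }
    prefix = Template(path_format).safe_substitute(values)
    # Assumption: the unique part is concatenated as-is, so a pattern ending
    # in "/" yields a folder, and one without it yields a file name prefix.
    file_id = file_id or str(uuid.uuid4())
    return f"{prefix}{file_id}.{extension}"
```

With the default pattern the result ends in `<stream>/<uuid>.parquet`, while the datetime pattern without the trailing / puts the timestamp at the start of the file name itself. This is why the same field controls both the path and the file name.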

Thiago Costa

04/01/2022, 5:24 PM
awesome, I'll test it from our side
and update you
Will s3_path_format only affect the file name, or the actual path as well?

Chris Duong [Airbyte]

04/01/2022, 5:25 PM
it’s actually both

Thiago Costa

04/01/2022, 6:45 PM
ok, that worked perfectly. guess the only thing needed would be to document this behavior
thanks for the support @Chris Duong [Airbyte] @Greg Solovyev (Airbyte) @Sherif Nada