Ken Krugler

12/03/2020, 2:58 PM
I ran into an issue where a segment I created was > 8GB when tarred, and thus failed during the “converting segment” phase:
Converting segment: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0 to v3 format
v3 segment location for segment: crawldata_OFFLINE_2018-10-13_2020-10-11_0 is /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3
Deleting files in v1 segment directory: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0
Computed crc = 1033854200, based on files [/tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3/columns.psf, /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3/index_map, /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0/v3/metadata.properties]
Driver, record read time : 236809
Driver, stats collector time : 0
Driver, indexing time : 122449
Tarring segment from: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0 to: /tmp/pinot-d6bab609-8906-4c84-966b-5f96d41b1d80/output/crawldata_OFFLINE_2018-10-13_2020-10-11_0.tar.gz
Failed to generate Pinot segment for file - s3://adbeat-pinot-files/compressed/3.gz
java.lang.RuntimeException: entry size '8991809155' is too big ( > 8589934591 ).
    at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.failForBigNumber(TarArchiveOutputStream.java:636) ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b849a1ecdec7a11203c7027e21]
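
For context on the number in that exception: 8589934591 is the largest value that fits in the size field of a classic tar (ustar) header, which holds 11 octal digits. A minimal sketch of the arithmetic follows; the names `USTAR_SIZE_DIGITS` and `fits_in_ustar` are made up for illustration and are not part of Pinot or commons-compress.

```python
# Classic tar (ustar) headers store an entry's size as 11 octal digits,
# so the largest representable size is 8**11 - 1 bytes (one byte short of 8 GiB).
USTAR_SIZE_DIGITS = 11  # hypothetical constant name, for illustration only
ustar_max_entry_bytes = 8 ** USTAR_SIZE_DIGITS - 1

# Entry size reported in the exception message above.
entry_bytes = 8991809155


def fits_in_ustar(size_bytes: int) -> bool:
    """Return True if a single tar entry of this size fits the ustar size field."""
    return 0 <= size_bytes <= ustar_max_entry_bytes


print(ustar_max_entry_bytes)       # 8589934591, the limit shown in the log
print(fits_in_ustar(entry_bytes))  # False
```

The entry that failed is about 402 MB over that limit, which is consistent with the tarring step being the one that threw.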

Kishore G

12/03/2020, 3:54 PM
Yes please. 8 GB is quite big. Can you break it up into smaller segments?

Ken Krugler

12/03/2020, 4:19 PM
Yes, I can - I need to figure out how to get Flink batch to bucket by day, as that’s how I’m segmenting.

Kishore G

12/03/2020, 4:20 PM
After 2 GB, we typically run into JVM limits on offset length, etc. Also, a segment is the unit of parallelism.
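
One reading of the 2 GB figure, stated here as an assumption rather than something confirmed in this thread: Java arrays and `ByteBuffer` are indexed with a signed 32-bit `int`, so a single buffer cannot address more than 2^31 - 1 bytes. A minimal sketch of that check, with hypothetical names:

```python
# Assumption: the "JVM limits on offset length" refer to signed 32-bit int offsets.
# Java's Integer.MAX_VALUE is 2**31 - 1, one byte short of 2 GiB.
JAVA_INT_MAX = 2**31 - 1


def exceeds_int_offsets(size_bytes: int) -> bool:
    """Return True if a buffer of this size cannot be addressed with an int offset."""
    return size_bytes > JAVA_INT_MAX


print(JAVA_INT_MAX)                     # 2147483647
print(exceeds_int_offsets(8991809155))  # True for the segment from the log above
```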

Ken Krugler

12/03/2020, 4:23 PM
Is there any rule of thumb for the target number of segments in a table? As in, say, one active/hot segment per server core?
Or is it fine to have a lot more (smaller) segments, to support finer-grained exclusion of segments and thus more efficient querying?

Kishore G

12/03/2020, 4:46 PM
150 MB to 500 MB is the sweet spot.
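
To turn that size range into a segment count, here is a rough sketch. It assumes "MB" means 1024 * 1024 bytes and that data splits evenly across segments, which bucketing by day will not guarantee; `segment_count_range` is a hypothetical helper, not a Pinot API.

```python
import math

MIB = 1024 * 1024
# Target segment sizes from the message above (assumed to be binary megabytes).
MIN_SEGMENT_BYTES = 150 * MIB
MAX_SEGMENT_BYTES = 500 * MIB


def segment_count_range(total_bytes: int) -> tuple[int, int]:
    """Return (fewest, most) segments that keep each segment within the target range."""
    fewest = math.ceil(total_bytes / MAX_SEGMENT_BYTES)  # every segment at 500 MB
    most = math.ceil(total_bytes / MIN_SEGMENT_BYTES)    # every segment at 150 MB
    return fewest, most


# The oversized segment from the log above was 8991809155 bytes.
print(segment_count_range(8991809155))  # (18, 58)
```

By this estimate, the single 8.99 GB segment would become roughly 18 to 58 segments.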