Dan Hill

06/02/2020, 12:13 AM
Hmm, what's the recommended structure for calling LaunchDataIngestionJob? I'm trying to run it with 300 million rows and I'm hitting a tar size issue. I can split the DataIngestion calls but I'm curious as to what is recommended.
java.lang.RuntimeException: entry size '14879990781' is too big ( > 8589934591 )
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.failForBigNumber(TarArchiveOutputStream.java:623) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ed26e8589fe5f91d2876d417aebf23575010cc76]
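For context on the number in the exception: 8589934591 equals 8^11 - 1, the largest value that fits in the 11-octal-digit size field of a classic tar header. The sketch below only reproduces that arithmetic for the size reported above; the function and constant names are illustrative and are not part of Pinot or Commons Compress.

```python
# Classic (ustar) tar headers store an entry's size as 11 octal digits.
TAR_SIZE_FIELD_OCTAL_DIGITS = 11
MAX_TAR_ENTRY_BYTES = 8 ** TAR_SIZE_FIELD_OCTAL_DIGITS - 1  # 8589934591


def fits_in_classic_tar(entry_bytes: int) -> bool:
    """Return True if an entry of this size fits the classic tar size field."""
    return 0 <= entry_bytes <= MAX_TAR_ENTRY_BYTES


# Size reported in the exception above (hypothetical variable name).
entry_bytes = 14879990781

print(MAX_TAR_ENTRY_BYTES)
print(fits_in_classic_tar(entry_bytes))
print(f"{entry_bytes / 2**30:.2f} GiB")
```

The entry in the exception is roughly 13.9 GiB, well past the limit of just under 8 GiB, so the output has to be made smaller before it can be archived as a single tar entry.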

Xiang Fu

06/02/2020, 12:41 AM
Typically we do a few hundred MB per segment; the hard limit here is that the index size of any single column should not exceed 2 GB.

kish

06/02/2020, 2:34 AM
Hi: Does that mean the input file can be larger than 2 GB if there are N (> 1) columns, as long as no single column's index size exceeds 2 GB?

Xiang Fu

06/02/2020, 3:01 AM
Yes

Dan Hill

06/02/2020, 4:27 PM
Hmm, even if I try to load a small number of rows (10 million rows), I hit this issue if I have too many star tree indices.
What sort of limits should I have for the star tree schemas?

Xiang Fu

06/02/2020, 5:43 PM
@Jackie ^^

Jackie

06/02/2020, 5:47 PM
@Dan Hill How many columns do you have in the star-tree?

Dan Hill

06/02/2020, 5:49 PM
Around 7-8 dimensions and 30 metrics. All of them are current numbers. There are 18 star-trees.

Jackie

06/02/2020, 6:10 PM
In that case, I would recommend ~1M records per segment
Also, why do you need 18 star-trees?
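One way to act on the ~1M records per segment suggestion is to split the input into smaller files before running the ingestion job. The sketch below is a minimal example under two assumptions: that the input is CSV with a header row, and that the batch job builds one segment per input file (worth confirming against the Pinot batch ingestion docs). The helper names `segments_needed` and `split_csv` are hypothetical and not part of Pinot.

```python
import csv
import math
from pathlib import Path
from typing import List


def segments_needed(total_rows: int, rows_per_segment: int) -> int:
    """Number of files (and, by assumption, segments) for a given row count."""
    return math.ceil(total_rows / rows_per_segment)


def split_csv(src: Path, out_dir: Path, rows_per_file: int) -> List[Path]:
    """Split a CSV into files of at most rows_per_file data rows each.

    The header row is repeated at the top of every output file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    with src.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return outputs
        out = None
        writer = None
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:
                # Start a new chunk file, closing the previous one first.
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}_{len(outputs):05d}.csv"
                out = path.open("w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                outputs.append(path)
            writer.writerow(row)
        if out is not None:
            out.close()
    return outputs


# 300 million rows at ~1M rows per segment.
print(segments_needed(300_000_000, 1_000_000))
```

With the numbers from this thread, 300 million rows at about 1M rows each works out to 300 input files.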

Dan Hill

06/02/2020, 6:15 PM
Gotcha. Are there any limits on the number of segments? I have roughly 40 dimensions. I can't rely on the built-in time support, so I have 4 date dimensions (depending on what scope is being used). I have an entity hierarchy that's about 5 levels deep. Then I have some separate star-trees depending on combinations of the remaining dimensions.
I'll send you the current one in a direct message.
Hmm, how does this scale with realtime ingestion? I can shard better in the offline case.