# troubleshooting
d
Hmm, what's the recommended structure for calling LaunchDataIngestionJob? I'm trying to run it with 300 million rows and I'm hitting a tar size issue. I can split the DataIngestion calls but I'm curious as to what is recommended.
java.lang.RuntimeException: entry size '14879990781' is too big ( > 8589934591 )
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.failForBigNumber(TarArchiveOutputStream.java:623) ~[pinot-all-0.4.0-SNAPSHOT-jar-with-dependencies.jar:0.4.0-SNAPSHOT-ed26e8589fe5f91d2876d417aebf23575010cc76]
x
Typically we do hundreds of MB per segment; the hard limit here is that the per-column index size should not exceed 2GB.
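A minimal sketch of the input-splitting approach mentioned above, assuming a single CSV with a header row and that the batch job builds one segment per input file (typically how the standalone LaunchDataIngestionJob runner behaves); the file names and the 1M-row chunk size are placeholders, not values from this thread:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Splits one large CSV into row-bounded chunks so that each chunk becomes
 * its own (smaller) Pinot segment when the ingestion job runs over the
 * output directory. Paths and the chunk size are illustrative.
 */
public class CsvChunker {

  public static void main(String[] args) throws IOException {
    Path input = Paths.get("events-300m.csv");  // hypothetical input file
    Path outDir = Paths.get("chunks");          // hypothetical output directory
    long rowsPerChunk = 1_000_000L;             // ~1M rows per segment, per the advice above

    Files.createDirectories(outDir);
    try (BufferedReader reader = Files.newBufferedReader(input)) {
      String header = reader.readLine();        // assumes a single header row
      if (header == null) {
        return;                                 // empty input, nothing to split
      }
      String line;
      long rowsInChunk = 0;
      int chunkIndex = 0;
      BufferedWriter writer = newChunkWriter(outDir, chunkIndex, header);
      while ((line = reader.readLine()) != null) {
        if (rowsInChunk == rowsPerChunk) {
          writer.close();                        // finish the current chunk
          chunkIndex++;
          rowsInChunk = 0;
          writer = newChunkWriter(outDir, chunkIndex, header);
        }
        writer.write(line);
        writer.newLine();
        rowsInChunk++;
      }
      writer.close();
    }
  }

  // Opens the next chunk file and repeats the header so each chunk parses standalone.
  private static BufferedWriter newChunkWriter(Path outDir, int chunkIndex, String header)
      throws IOException {
    BufferedWriter writer = Files.newBufferedWriter(outDir.resolve("chunk-" + chunkIndex + ".csv"));
    writer.write(header);
    writer.newLine();
    return writer;
  }
}
```

Running LaunchDataIngestionJob over the output directory then produces one smaller segment per chunk instead of one oversized tar entry.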
k
Hi: Does that mean the input file can be larger than 2GB if there are N (> 1) columns, as long as no single column's index size exceeds 2GB?
x
Yes
d
Hmm, even if I try to load a small number of rows (10 million rows), I hit this issue if I have too many star-tree indexes.
What sort of limits should I have for the star-tree schemas?
x
@Jackie ^^
j
@Dan Hill How many columns do you have in the star-tree?
d
Around 7-8 dimensions and 30 metrics. All of them are currently numbers. There are 18 star-trees.
j
In that case, I would recommend ~1M records per segment
Also, why do you need 18 star-trees?
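A back-of-the-envelope sketch of why the segments balloon here: each star-tree materializes pre-aggregated rows for combinations of its split-order dimensions (star-node rows add more on top), and that cost multiplies across the metric columns and across the 18 star-tree configs. The cardinalities below are hypothetical; only the row, metric, and star-tree counts come from the messages above:

```java
/**
 * Rough sizing illustration for star-tree indexes. All cardinalities are
 * made up; the point is that aggregated-row count grows with the product of
 * dimension cardinalities (capped by raw rows), then multiplies by metrics
 * and by the number of star-trees.
 */
public class StarTreeSizing {

  public static void main(String[] args) {
    long rawRows = 10_000_000L;                                 // rows in one segment
    long[] dimCardinalities = {100, 50, 20, 10, 10, 5, 5, 3};   // ~8 dimensions, hypothetical
    int metrics = 30;                                           // metric columns per star-tree
    int starTrees = 18;                                         // number of star-tree configs

    long combos = 1;
    for (long c : dimCardinalities) {
      // worst-case distinct dimension-value combinations (before star-node rows)
      combos = Math.multiplyExact(combos, c);
    }
    long aggregatedRows = Math.min(combos, rawRows);

    System.out.printf("distinct combinations (worst case): %,d%n", combos);
    System.out.printf("aggregated rows per star-tree (capped by raw rows): %,d%n", aggregatedRows);
    System.out.printf("aggregated values across %d metrics and %d star-trees: %,d%n",
        metrics, starTrees, aggregatedRows * metrics * (long) starTrees);
  }
}
```

With even modest cardinalities the aggregated row count hits the raw-row cap, so trimming the number of star-trees, or the dimensions per tree, is what keeps the per-column index size under the 2GB limit mentioned above.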
d
Gotcha. Are there any limits to the number of segments? I have roughly 40 dimensions. I can't rely on the built-in time support, so I have 4 date dimensions (depending on which scope is being used). I have an entity hierarchy that's about 5 levels deep. Then I have some separate star-trees depending on combinations of the remaining dimensions.
I'll send you the current one in a direct message.
Hmm, how does this scale with realtime ingestion? I can shard better in the offline case.