Ilya Yatsishin
07/05/2022, 3:39 PM
RecordReader initialized will read a total of 99997497 records.
But then it tries to push data after row 43172732
- not sure if it was planning to process the rest or had failed earlier.
3. There is still an empty list of segments that it is trying to push. I don't understand whether something is wrong or not. Nothing is written.
4. It looks like the job finished, and no data can be seen. I can't see any error.
at row 43172732. reading next block
block read in memory in 1682 ms. row count = 262144
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@7b7b3edb] for table hits
Full log https://pastila.nl/?017b92b4/361513a635141c533150a5b79f6a4848
Ken Krugler
07/05/2022, 3:41 PM
Ilya Yatsishin
07/05/2022, 3:42 PM
Ken Krugler
07/05/2022, 3:45 PM
Ken Krugler
07/05/2022, 3:46 PM
"Start pushing segments: []... to locations" means no segments were successfully generated, which is why you have no results. The "without errors" bit is what's weird in that case. Please post (or repost) your job spec, thanks.
Ken Krugler
07/05/2022, 3:49 PM
pinot-all.log file in the pinot install directory's logs sub-dir?
Ilya Yatsishin
07/05/2022, 3:55 PM
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.parquet'
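Ken's pointer to pinot-all.log above can be turned into a quick check. The following is a minimal sketch, not part of Pinot: it scans log lines for the empty push list and for the two failure messages quoted in this thread. The marker strings are copied from the log excerpts here; that they cover every failure Pinot can log is an assumption, and `scan_log` / `scan_log_file` are hypothetical helper names.

```python
# Marker strings copied from log lines quoted in this thread. Treating them
# as the complete set of failure markers is an assumption.
EMPTY_PUSH = "Start pushing segments: []"
FAILURE_MARKERS = (
    "Caught exception while gathering stats",
    "Failed to generate Pinot segment for file",
)


def scan_log(lines):
    """Return (empty_push_seen, failures).

    failures is a list of (line_number, text) for every line that
    contains one of FAILURE_MARKERS.
    """
    empty_push_seen = False
    failures = []
    for number, line in enumerate(lines, start=1):
        if EMPTY_PUSH in line:
            empty_push_seen = True
        if any(marker in line for marker in FAILURE_MARKERS):
            failures.append((number, line.rstrip("\n")))
    return empty_push_seen, failures


def scan_log_file(path):
    """Scan a log file on disk; the path is supplied by the caller."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log(handle)
```

If the empty push list shows up with no failure lines at all, the job output alone does not explain the missing segments, and the job spec becomes the next thing to inspect.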
Rong R
07/05/2022, 4:44 PM
Ilya Yatsishin
07/05/2022, 5:33 PM
Rong R
07/05/2022, 5:34 PM
Rong R
07/05/2022, 5:34 PM
Rong R
07/05/2022, 6:37 PM
Rong R
07/06/2022, 2:56 PM
Rong R
07/06/2022, 2:57 PM
Ilya Yatsishin
07/07/2022, 11:11 AM
split --additional-suffix .tsv --verbose -n l/10 hits.tsv input
After that I was finally able to get my error:
Caught exception while gathering stats
java.lang.RuntimeException: java.io.IOException: (startline 498626) EOF reached before encapsulated token finished
at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.plugin.inputformat.csv.CSVRecordReader.hasNext(CSVRecordReader.java:136) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:64) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:42) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:173) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:155) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:104) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:118) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:263) ~[pinot-batch-ingestion-standalone-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: (startline 498626) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:395) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
... 14 more
Failed to generate Pinot segment for file - file:/home/ubuntu/pinot/parts93.tsv
3. I minimized the file again and found a simple row with a value that contains a single " character. This one character broke the parsing. I can see there are many other occurrences of " in the file, but only this one broke the whole parsing. I attach a file with this bad row. Table definition and schema are the same as above.
Rong R
07/07/2022, 2:58 PM
Ilya Yatsishin
07/07/2022, 3:00 PM
COUNT(*)=94465149 instead of 99997497
You can get the original CSV file:
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits.tsv.gz
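To locate rows like the one described above without bisecting the file by hand, a rough scan may help. This is a heuristic sketch only. It assumes the CSV parser treats a double quote as special only when it opens a field, which would fit both the `parseEncapsulatedToken` frame in the stack trace and the observation that most other " characters are harmless; the thread does not confirm this. `find_unclosed_quotes` is a hypothetical helper, and it will misreport legitimately quoted fields that contain a tab.

```python
def find_unclosed_quotes(lines, delimiter="\t", quote='"'):
    """Yield (line_number, field_index, field) for suspicious fields.

    A field is suspicious if it opens with the quote character but does
    not also end with one on the same line. Quotes in the middle of a
    field are ignored, on the assumption that the parser treats them as
    ordinary text.
    """
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        for field_index, field in enumerate(fields):
            opens = field.startswith(quote)
            # A lone quote both starts and ends the field, so require at
            # least two characters before counting it as closed.
            closes = len(field) >= 2 and field.endswith(quote)
            if opens and not closes:
                yield line_number, field_index, field


if __name__ == "__main__":
    import sys

    # Usage: python find_quotes.py hits.tsv
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for line_number, field_index, field in find_unclosed_quotes(handle):
            print(line_number, field_index, field[:60])
```

Reading the file line by line keeps memory use flat, which matters for an input of this size.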
Rong R
07/07/2022, 3:03 PM
Rong R
07/07/2022, 3:04 PM
Ilya Yatsishin
07/07/2022, 3:11 PM