# troubleshooting
i
Continuation of 🧵 https://apache-pinot.slack.com/archives/C011C9JHN7R/p1656941781724149: I got the same dataset in Parquet and tried to import it without this CSV parser.
1. It again writes something about AVRO.
2. It sees this number of records: `RecordReader initialized will read a total of 99997497 records.` But then it starts pushing data after row 43172732, and I'm not sure if it was planning to process the rest or failed earlier.
3. There is still an empty list of segments that it tries to push. I don't understand whether something is wrong or not; nothing is written.
4. It looks like the job finished, yet no data can be seen, and I can't see an error:
```
at row 43172732. reading next block
block read in memory in 1682 ms. row count = 262144
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@7b7b3edb] for table hits
```
Full log: https://pastila.nl/?017b92b4/361513a635141c533150a5b79f6a4848
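One way to confirm from the outside that nothing was pushed (a hedged aside, assuming a local controller on the default port 9000 and the table name `hits` used above) is to ask the controller which segments it knows about:
```
# Lists the segments the controller has for the table; expect an
# empty OFFLINE list if the push really uploaded nothing.
curl -s "http://localhost:9000/segments/hits"
```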
k
See https://docs.pinot.apache.org/basics/data-import/pinot-input-formats#csv, and specifically the note that "Your CSV file may have raw text fields that cannot be reliably delimited using any character"; that's a possible issue with parsing the CSV.
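A quick pre-check along those lines (a sketch, assuming the file is tab-delimited and that the hits schema has 105 columns; adjust both to your data) is to count fields per line and flag rows where a stray delimiter or quote changed the count:
```
# Print the first few line numbers whose field count differs from
# the expected column count of the schema.
awk -F'\t' 'NF != 105 {print NR ": " NF " fields"; if (++c == 10) exit}' hits.tsv
```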
i
You're answering a different question. Here the job finished without errors and without results.
k
OK, sorry - I saw the “Hi! I’m trying to ingest some data from CSV file. Created schema and table”, and thought it was still the original question.
Seeing `Start pushing segments: []... to locations` means no segments were successfully generated, which is why you have no results. The "without errors" bit is what's weird in that case. Please post (or repost) your job spec, thanks.
Also, assuming you're running this ingestion job as stand-alone, there's nothing in the `pinot-all.log` file in the pinot install directory's `logs` sub-dir?
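For reference, a minimal way to scan that log for swallowed errors (assuming a stand-alone launch from the install directory, so the path is relative):
```
# Surface any error/exception lines buried in the standalone job log.
grep -niE 'error|exception' logs/pinot-all.log | tail -n 20
```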
i
I did not find anything interesting in pinot-all.log. Here are the data and the configs:
```
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.parquet'
```
r
One quick observation on the schema.json: there's no multi-value column, which means a space-separated row will fail ingestion. Do you know which column could have multiple values associated with it?
i
I don’t have any multi-value columns. Just STRING/LONG columns.
r
Oh... I see. You meant to ingest a string containing spaces, but Pinot ended up treating it as a multi-value column.
Got it. Let me take a look.
A quick update: I am still trying to download the Parquet file 😅. Will get to it once it finishes.
Update: I was unable to reproduce the error, but one thing seems obvious: the segment creation job runner's exception is not piped back to the main thread, and thus the segment upload job still runs, trying to upload an empty TAR file. Let me update the UX ticket to include this.
(I discovered this when I incorrectly packaged the Hadoop JAR and the class-not-found error was not propagated back.)
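The swallowed-exception behavior described above has a shell analogue (a toy sketch, not Pinot code): a background child's failure stays invisible until the parent explicitly collects its status, much as an executor task's exception stays hidden until something calls `Future.get()`:
```
#!/usr/bin/env bash
# The child fails immediately, but nothing is reported yet...
false &
pid=$!
# ...until the parent collects the status (the shell's Future.get()).
wait "$pid" || echo "child task failed with status $?"
```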
i
I figured out the reason for the issue.
1. You lose exceptions somehow when processing a large single file. I tried both CSV and Parquet and it looks the same: something is wrong, but no error. You can reproduce it locally with the data above.
2. I really needed to solve the issue, so I split the CSV file into parts (100 to be exact) with a simple `split --additional-suffix .tsv --verbose -n l/10 hits.tsv input`. After that I finally got my error:
```
Caught exception while gathering stats
java.lang.RuntimeException: java.io.IOException: (startline 498626) EOF reached before encapsulated token finished
        at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.inputformat.csv.CSVRecordReader.hasNext(CSVRecordReader.java:136) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:64) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:42) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:173) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:155) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:104) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:118) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:263) ~[pinot-batch-ingestion-standalone-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: (startline 498626) EOF reached before encapsulated token finished
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:395) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        ... 14 more
Failed to generate Pinot segment for file - file:/home/ubuntu/pinot/parts93.tsv
```
3. I minimized the file again and found a simple row with a value that contains a single `"`. This one character broke the parsing. I can see there are many other occurrences of `"` in the file, but only this one broke the whole parse. I attach a file with this bad row. Table definition and schema are the same as above.
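A hedged way to hunt for such rows (assuming `"` is the quote character the parser is configured with, and noting that a legitimately quoted multi-line field would also trip this per-line check) is to flag lines with an odd number of quote characters; splitting on `"`, a line gets an even field count exactly when its quotes are unbalanced:
```
# Print line numbers whose '"' count is odd, i.e. an encapsulated
# token is opened but never closed on that line.
awk -F'"' 'NF % 2 == 0 {print FNR}' hits.tsv | head
```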
r
I think there are two fundamental issues with the exception: it is not piped back, thus it doesn't stop the segment TAR push; and somehow, when the error occurs in large files, the logger stops working. Last time, on the large Parquet file you showed, the log abruptly stopped after ~43 million records were scanned, which is very weird. I've shared the context on the ingestion job util GitHub issue. Regarding these CSV exceptions, let me dig deeper today; this is great information. Thank you so much!
i
And this is what's sad: after I ingested the dataset split into 100 parts, I can see that the number of rows does not match. `COUNT(*)` = 94465149 instead of 99997497.
You can get the original CSV file:
```
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits.tsv.gz
```
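To compare the two counts end to end (a sketch, assuming the TSV has no header row and a local broker on the default port 8099):
```
# Rows in the source file.
wc -l hits.tsv

# Rows Pinot actually serves, via the broker's SQL endpoint.
curl -s -H 'Content-Type: application/json' \
     -d '{"sql":"SELECT COUNT(*) FROM hits"}' \
     http://localhost:8099/query/sql
```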
r
Isn't it because one of the CSV ingestions was failing?
Oh, but if it is split evenly, this is still missing a lot of data.
i
After the split I fixed this bad row. It should be a correct ingest, as I do not see errors. Probably some rows were treated as one value. I did not check where the actual diff is.
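One way to start locating that diff would be to rule out the split itself first (assuming the `input*.tsv` chunk names produced by the `split` command above): if the chunks sum to the original line count, the ~5.5M missing rows were lost during parsing, not splitting:
```
# Total lines across all chunks; compare against `wc -l hits.tsv`.
cat input*.tsv | wc -l
```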