# troubleshooting
i
Continuation of 🧵 https://apache-pinot.slack.com/archives/C011C9JHN7R/p1656941781724149: I got the same dataset in Parquet and tried to import it without this CSV parser.
1. It again writes something about AVRO.
2. It sees this number of records: `RecordReader initialized will read a total of 99997497 records.` But then it starts pushing data after row 43172732, and I'm not sure if it was planning to process the rest or failed earlier.
3. There is still an empty list of segments that it tries to push. I don't understand whether something is wrong or not; nothing is written.
4. It looks like the job finished, yet no data can be seen, and I can't see an error:
```
at row 43172732. reading next block
block read in memory in 1682 ms. row count = 262144
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@7b7b3edb] for table hits
```
Full log: https://pastila.nl/?017b92b4/361513a635141c533150a5b79f6a4848
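One way to confirm from the outside that nothing was pushed (a hedged aside, assuming a local controller on the default port 9000 and the table name `hits` used above) is to ask the controller which segments it knows about:
```
# Lists the segments the controller has for the table; expect an
# empty OFFLINE list if the push really uploaded nothing.
curl -s "http://localhost:9000/segments/hits"
```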
k
See https://docs.pinot.apache.org/basics/data-import/pinot-input-formats#csv, and specifically the note that "Your CSV file may have raw text fields that cannot be reliably delimited using any character"; that's a possible issue with parsing the CSV.
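A quick pre-check along those lines (a sketch, assuming the file is tab-delimited and that the hits schema has 105 columns; adjust both to your data) is to count fields per line and flag rows where a stray delimiter or quote changed the count:
```
# Print the first few line numbers whose field count differs from
# the expected column count of the schema.
awk -F'\t' 'NF != 105 {print NR ": " NF " fields"; if (++c == 10) exit}' hits.tsv
```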
i
You're answering a different question. Here the job finished without errors and without results.
k
OK, sorry - I saw the “Hi! I’m trying to ingest some data from CSV file. Created schema and table”, and thought it was still the original question.
Seeing `Start pushing segments: []... to locations` means no segments were successfully generated, which is why you have no results. The "without errors" bit is what's weird in that case. Please post (or repost) your job spec, thanks.
Also, assuming you're running this ingestion job as stand-alone, there's nothing in the `pinot-all.log` file in the pinot install directory's `logs` sub-dir?
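For reference, a minimal way to scan that log for swallowed errors (assuming a stand-alone launch from the install directory, so the path is relative):
```
# Surface any error/exception lines buried in the standalone job log.
grep -niE 'error|exception' logs/pinot-all.log | tail -n 20
```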
i
I did not find anything interesting in pinot-all.log. Here are the data and the configs:
```
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.parquet'
```
r
One quick observation on the schema.json: there's no multi-value column, which means a space-separated row will fail ingestion. Do you know which column could have multiple values associated with it?
i
I don’t have any multi-value columns. Just STRING/LONG columns.
r
Oh... I see. You meant to ingest a string containing spaces, but Pinot ended up treating it as a multi-value column.
Got it. Let me take a look.
A quick update: I am still trying to download the Parquet file 😅. Will get to it once it finishes.
Update: I was unable to reproduce the error, but one thing seems obvious: the segment creation job runner's exception is not piped back to the main thread, and thus the segment upload job still runs, trying to upload an empty TAR file. Let me update the UX ticket to include this.
(I discovered this when I incorrectly packaged the Hadoop JAR and the class-not-found error was not propagated back.)
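The swallowed-exception behavior described above has a shell analogue (a toy sketch, not Pinot code): a background child's failure stays invisible until the parent explicitly collects its status, much as an executor task's exception stays hidden until something calls `Future.get()`:
```
#!/usr/bin/env bash
# The child fails immediately, but nothing is reported yet...
false &
pid=$!
# ...until the parent collects the status (the shell's Future.get()).
wait "$pid" || echo "child task failed with status $?"
```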
i
I figured out the reason for the issue.
1. You lose exceptions somehow when processing a large single file. I tried both CSV and Parquet and it looks the same: something is wrong, but no error. You can reproduce it locally with the data above.
2. I really needed to solve the issue, so I split the CSV file into parts (100 to be exact) with a simple `split --additional-suffix .tsv --verbose -n l/10 hits.tsv input`. After that I finally got my error:
```
Caught exception while gathering stats
java.lang.RuntimeException: java.io.IOException: (startline 498626) EOF reached before encapsulated token finished
        at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.inputformat.csv.CSVRecordReader.hasNext(CSVRecordReader.java:136) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:64) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:42) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:173) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:155) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:104) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:118) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:263) ~[pinot-batch-ingestion-standalone-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: (startline 498626) EOF reached before encapsulated token finished
        at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:395) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        ... 14 more
Failed to generate Pinot segment for file - file:/home/ubuntu/pinot/parts93.tsv
```
3. I minimized the file again and found a simple row with a value that contains a single `"`. This one character broke the parsing. I can see there are many other occurrences of `"` in the file, but only this one broke the whole parse. I attach a file with this bad row. Table definition and schema are the same as above.
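A hedged way to hunt for such rows (assuming `"` is the quote character the parser is configured with, and noting that a legitimately quoted multi-line field would also trip this per-line check) is to flag lines with an odd number of quote characters; splitting on `"`, a line gets an even field count exactly when its quotes are unbalanced:
```
# Print line numbers whose '"' count is odd, i.e. an encapsulated
# token is opened but never closed on that line.
awk -F'"' 'NF % 2 == 0 {print FNR}' hits.tsv | head
```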
r
I think there are two fundamental issues with the exception: it is not piped back, thus it doesn't stop the segment TAR push; and somehow, when the error occurs in large files, the logger stops working. Last time, on the large Parquet file you showed, the log abruptly stopped after ~43 million records were scanned, which is very weird. I've shared the context on the ingestion job util GitHub issue. Regarding these CSV exceptions, let me dig deeper today; this is great information. Thank you so much!
i
And this is what's sad: after I ingested the dataset split into 100 parts, I can see that the number of rows does not match. `COUNT(*)` = 94465149 instead of 99997497.
You can get the original CSV file:
```
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits.tsv.gz
```
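To compare the two counts end to end (a sketch, assuming the TSV has no header row and a local broker on the default port 8099):
```
# Rows in the source file.
wc -l hits.tsv

# Rows Pinot actually serves, via the broker's SQL endpoint.
curl -s -H 'Content-Type: application/json' \
     -d '{"sql":"SELECT COUNT(*) FROM hits"}' \
     http://localhost:8099/query/sql
```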
r
Isn't it because one of the CSV ingestions was failing?
Oh, but if it is split evenly, this is still missing a lot of data.
i
After the split I fixed this bad row. It should be a correct ingest, as I do not see errors. Probably some rows were treated as one value. I did not check where the actual diff is.
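One way to start locating that diff would be to rule out the split itself first (assuming the `input*.tsv` chunk names produced by the `split` command above): if the chunks sum to the original line count, the ~5.5M missing rows were lost during parsing, not splitting:
```
# Total lines across all chunks; compare against `wc -l hits.tsv`.
cat input*.tsv | wc -l
```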