# troubleshooting
i
Hi! I'm trying to ingest some data from a CSV file. I created a schema and a table and started the ingestion job, but the output below is everything I see. I can't find any logs, and there is nothing in the UI about jobs. This is my second time trying to use Pinot, but there are no diagnostics and I'm failing.
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Creating an executor service with 10 threads(Job parallelism: 10, available cores: 16.)
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@73ab3aac] for table hits
m
Hi @Ilya Yatsishin do you mind sharing the command you ran, or point me to instructions you are following?
i
PINOT_VERSION=0.10.0

wget https://downloads.apache.org/pinot/apache-pinot-$PINOT_VERSION/apache-pinot-$PINOT_VERSION-bin.tar.gz
tar -zxvf apache-pinot-$PINOT_VERSION-bin.tar.gz

./apache-pinot-$PINOT_VERSION-bin/bin/pinot-admin.sh QuickStart -type batch &
sleep 30
./apache-pinot-$PINOT_VERSION-bin/bin/pinot-admin.sh AddTable -tableConfigFile offline_table.json -schemaFile schema.json -exec

./apache-pinot-$PINOT_VERSION-bin/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile local.yaml
r
can you share your environment variables and check whether JAVA_OPTS is set? if it is, then pinot will use the log4j configuration from your env var setting instead of the one in the conf directory
i
It is not set.
r
interesting. I ran exactly through your instructions and this is what I received:
at org.apache.pinot.spi.filesystem.LocalPinotFS.listFiles(LocalPinotFS.java:113) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:169) ~[pinot-batch-ingestion-standalone-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:146) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
	... 13 more
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:148)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:117)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:121)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:161)
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:192)
Caused by: java.nio.file.NoSuchFileException: /home/ubuntu
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
	at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:148)
	at java.base/java.nio.file.Files.readAttributes(Files.java:1851)
	at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:220)
	at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:277)
	at java.base/java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:323)
	at java.base/java.nio.file.FileTreeIterator.<init>(FileTreeIterator.java:71)
	at java.base/java.nio.file.Files.walk(Files.java:3918)
	at java.base/java.nio.file.Files.walk(Files.java:3973)
	at org.apache.pinot.spi.filesystem.LocalPinotFS.listFiles(LocalPinotFS.java:113)
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:169)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:146)
	... 13 more
i
It is probably because you have a different path where you store those files. It was an absolute path.
r
I see.
g
I ran it (changing the path in local.yml) and it just ran successfully
i
Can you see the data in the table?
g
the last log is
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@390877d2] for table hits
and then the program finished with status 0
i
I also see this log; it is in the first message. But no data is loaded. This mini.tsv is just the head of the real table.
g
oh, ok
yeah, in that case I have the same scenario. The table looks empty
r
ah.. I see
so I just ran your job and here is what happened. TL;DR: your glob setting is incorrect. It should be 'glob:**/*.tsv' instead of just '*.tsv'.
long answer: because we can't find the proper tsv file, it tries to ingest [] files into segments, which means empty. And because the list of files to be ingested is empty, there's no success message for any segment being generated. pinot should probably still print a summary of all the files that are pushed, like
Summary, successfully pushed segments: [ xxx, yyy ], created from list of files [ xxx, yyy ]
then we would clearly know that we didn't filter out any files in our processing list
r
yeah, this was a bit weird, because we default to a recursive search for file matching, so our search list contains absolute paths as well. This means 'glob:*.tsv' matches tsv files only in your root directory.
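To illustrate the point above, here is a minimal sketch (plain Python, not Pinot code) of the glob semantics involved: in Java NIO-style globs, which includeFileNamePattern uses, a single '*' does not cross directory separators, and the pattern is matched against the full paths produced by the recursive listing.

```python
import re

def glob_to_regex(pattern: str) -> str:
    """Tiny subset of Java NIO glob semantics: '*' stops at '/', '**' does not."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith('**', i):
            out.append('.*')      # '**' crosses directory separators
            i += 2
        elif pattern[i] == '*':
            out.append('[^/]*')   # '*' matches within one path component only
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return '^' + ''.join(out) + '$'

path = '/home/ubuntu/pinot/mini.tsv'
# 'glob:*.tsv' is matched against the absolute path, and '*' cannot cross '/',
# so nothing matches and the job ingests an empty file list
print(bool(re.match(glob_to_regex('*.tsv'), path)))      # False
# 'glob:**/*.tsv' matches .tsv files at any depth
print(bool(re.match(glob_to_regex('**/*.tsv'), path)))   # True
```

This is why the job silently pushed `[]` segments: the pattern was valid, it just matched nothing once applied to absolute paths.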
i
But what if I don't want to traverse subdirectories? Can I just provide a path instead of these strange pattern-matching hacks?
r
it seems like the recursive flag is hard-coded to default to true. Let me file an issue to support a command-line argument for ingesting only explicitly listed files.
k
@Ilya Yatsishin - what happens if you give it a pattern like 'glob:/absolute/path/to/your/input/directory/*.tsv'?
Just FYI, I use https://www.digitalocean.com/community/tools/glob?comments=true to try out glob patterns…
i
I just added it to make it find the file. Patterns that are not relative to inputDir are really confusing. You should at least add a note about this to the docs/examples.
inputDirURI: '/home/ubuntu/pinot'
includeFileNamePattern: 'glob:/home/ubuntu/pinot/mini.tsv'
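For reference, a sketch of the relevant job-spec fields using the recursive glob instead of an absolute pattern (paths and the reader class are the ones from this thread; adjust to your layout):

```yaml
# Files are listed recursively under inputDirURI and matched as absolute paths,
# so 'glob:**/*.tsv' picks up .tsv files at any depth.
inputDirURI: '/home/ubuntu/pinot'
includeFileNamePattern: 'glob:**/*.tsv'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
```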
But once you get past that step, you hit the next issue. 1. I don't get why it logs something about file format AVRO when I explicitly have
recordReaderSpec:
  dataFormat: 'csv'
2. It does not work with a string column that has a space in it? Or maybe I don't understand what the issue is.
Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to read segment, ignoring configured file format: AVRO
RecordReaderSegmentCreationDataSource is used
Caught exception while gathering stats
java.lang.RuntimeException: Caught exception while transforming data type for column: URL
        at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:95) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:83) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:80) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:42) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:173) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:155) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:104) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:118) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:263) ~[pinot-batch-ingestion-standalone-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.IllegalArgumentException: Cannot read single-value from Object[]: [<http://bonprix.ru/index.ru/cinema/art/A00387,3797>),  ru)&bL] for column: URL
        at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.standardize(DataTypeTransformer.java:145) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:63) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        ... 13 more
Failed to generate Pinot segment for file - file:/home/ubuntu/ClickHouse/benchmark/pinot/mini.tsv
java.lang.RuntimeException: Caught exception while transforming data type for column: URL
        at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:95) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:83) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:80) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:42) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:173) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:155) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:104) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:118) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:263) ~[pinot-batch-ingestion-standalone-0.10.0-shaded.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.IllegalArgumentException: Cannot read single-value from Object[]: [<http://bonprix.ru/index.ru/cinema/art/A00387,3797>),  ru)&bL] for column: URL
        at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.standardize(DataTypeTransformer.java:145) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:63) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        ... 13 more
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@65ae095c] for table hits
k
@Ilya Yatsishin - I would start a new thread for your question about what is now going wrong, since you’ve got a workaround for finding the right input files.
But if you know your input file is CSV (comma-separated values), and you’ve got this error message:
Cannot read single-value from Object[]: [<http://bonprix.ru/index.ru/cinema/art/A00387,3797>),  ru)&bL] for column: URL
Then I’d guess you haven’t encoded the CSV data properly, and the ‘,’ in the URL is confusing the CSV parser.
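To illustrate that guess, here is a minimal sketch using Python's csv module (not Pinot's reader, but the default comma-delimiter behavior is analogous):

```python
import csv
import io

url = 'http://bonprix.ru/index.ru/cinema/art/A00387,3797'

# Unquoted: a comma-delimited parser splits the URL into two fields
row = next(csv.reader(io.StringIO(url + '\n')))
print(len(row))   # 2

# Quoted on write (csv.writer quotes fields containing the delimiter),
# the field round-trips intact
buf = io.StringIO()
csv.writer(buf).writerow([url])
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(len(row))   # 1
```

If the input is actually tab-separated, the comma is harmless only when the reader is really using tabs; it may be worth checking whether the CSV record reader here is configured with a tab delimiter rather than the comma default.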
i
All the other DBMSs I checked worked perfectly with this file. There is no delimiter in this value, so quotes are not required. For example, the last one was Druid. I also got the same dataset in Parquet and tried to import it without this CSV parser. 1. It again writes something about AVRO. 2. It sees this number of records:
RecordReader initialized will read a total of 99997497 records.
But then it tries to push data after 43172732 - not sure if it was planning to process the rest or failed earlier. 3. There is still an empty list of segments that it is trying to push. I don't understand whether something is wrong or not; nothing is written. 4. It looks like the job finished, and no data can be seen. I can't see an error.
at row 43172732. reading next block
block read in memory in 1682 ms. row count = 262144
Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@7b7b3edb] for table hits
Full log https://pastila.nl/?017b92b4/361513a635141c533150a5b79f6a4848
k
@Ilya Yatsishin - OK, this might be a multi-value delimiter issue… please repost as a new troubleshooting question, thanks
r
looks like: 1. the AVRO error message is extracted from a deprecated field; let me remove that (this is definitely confusing, sorry for the bad UX). 2. I think there's something wrong with the schema you defined; could you share what data type you are using for that column with space-separated values? 3./4. it's most likely because the segment didn't get generated successfully, so it is not pushing anything (again, bad UX). I've filed https://github.com/apache/pinot/issues/9016
oops you already started another thread. we can continue there
h
@Laila Sabar