# troubleshooting
a
Is anybody using Pinot with an on-prem S3-like filesystem rather than AWS' S3? I am doing this and trying to run a batch ingest, and I get this error:
```
Got exception to kick off standalone data ingestion job -
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
        at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:144) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
        at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
        at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
        at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:164) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
        at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:184) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
Caused by: java.io.IOException: software.amazon.awssdk.services.s3.model.S3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: S3, Status Code: 403, Request ID: 0306422796023ADB, Extended Request ID: njXFdh82iDAWK78LUjRq1SCfJDgSD0Dcr9EhworrYh4CT7X0ZsPFVmHl2TUSmLK9eP/EyAwhAm8=)
        at org.apache.pinot.plugin.filesystem.S3PinotFS.mkdir(S3PinotFS.java:308) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
        at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:127) ~[pinot-batch-ingestion-standalone-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
        at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-7ac8650777d6b25c8cae4ca1bd5460f25488a694]
        ... 4 more
```
Ok so -- it looks like the batch ingest job was loading my credentials from `~/.aws/credentials`, which 1) were not for this filer and 2) don't give me a way to specify my endpoint.
I've configured the controller and server with the right credentials and endpoint as documented here: https://docs.pinot.apache.org/basics/data-import/pinot-file-system/amazon-s3
i.e. I'm setting:
```
pinot.controller.storage.factory.s3.region=ap-southeast-1
pinot.controller.storage.factory.s3.accessKey=foo
pinot.controller.storage.factory.s3.secretKey=foo
pinot.controller.storage.factory.s3.endpoint=http://foo
```
(and s/controller/server as well for the server conf)
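Spelled out, that just means the server conf carries the same keys with the `pinot.server.` prefix (same placeholder values as above):
```
pinot.server.storage.factory.s3.region=ap-southeast-1
pinot.server.storage.factory.s3.accessKey=foo
pinot.server.storage.factory.s3.secretKey=foo
pinot.server.storage.factory.s3.endpoint=http://foo
```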
How can I pick up these settings for the batch ingest job? After deleting `~/.aws/credentials` I get this error on batch ingest:
```
Caused by: java.io.IOException: software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., WebIdentityTokenCredentialsProvider(): Either the environment variable AWS_WEB_IDENTITY_TOKEN_FILE or the javaproperty aws.webIdentityTokenFile must be set., ProfileCredentialsProvider(): Profile file contained no credentials for profile 'default': ProfileFile(profiles=[]), ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., InstanceProfileCredentialsProvider(): Unable to load credentials from service endpoint.]
```
Is there any way to set my own endpoint for batch ingestion?
n
When you say S3-like, can you give more detail? I don't know the low-level details of the S3 plugin, but I'm guessing you won't want to use it unless it's actually S3 you're grabbing from.
a
Oh sure -- it's literally API-compatible with S3; I just need to point the endpoint at something on-prem rather than at AWS's servers.
In other words, from the Pinot docs, if I set `pinot.controller.storage.factory.s3.endpoint` and the server equivalent, I should be good -- but somehow this doesn't seem to be working for the batch ingest?
I think the S3 plugin should work. I already do this with Trino, and Trino's built-in S3 support (which uses the AWS SDK) works fine.
Ok, I think I figured this out -- in addition to the S3PinotFS config options in the controller and server configuration files, I needed to set them in the job spec
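For reference, a sketch of the relevant part of the job spec (the values are placeholders, same as the controller/server conf above) -- the point is that the S3 settings go under the `configs` map of the `pinotFSSpecs` entry:
```
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'ap-southeast-1'
      accessKey: 'foo'
      secretKey: 'foo'
      endpoint: 'http://foo'
```
With that in place, `pinot-admin.sh LaunchDataIngestionJob -jobSpecFile <spec>` uses these credentials and endpoint instead of falling back to `~/.aws/credentials`.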
n
I had to do the same for GCP. Not sure if you've seen it, but this doc has an example job file here
a
Thanks! I wasn't aware that I could put more than `region` under `configs` -- this seems to work!
You can put `endpoint` and other settings under `configs` as well.