# general
h
Hello, I'm new to Pinot and I'm trying to run some tests on the tool using a GCP deployment. I have deployed Pinot on GKE using the official documentation and it seems to work well. The next step is to import batch data from Google Cloud Storage in order to run some queries. The data on Cloud Storage is 300 GB of CSV files. First thing: I couldn't figure out how to link Cloud Storage to Apache Pinot via the Pinot plugin. Second thing: how can I transform the CSV files into Pinot tables? Can I have some guidance on these two subjects? Thanks in advance 😄
h
I have checked this page, but it isn't clear where I can access the controller and server config
m
These are config files used to start pinot components. For example: https://docs.pinot.apache.org/configuration-reference/server
@User How do we customize these configs when using GKE?
h
I have also followed the second link and I can successfully access the pinot UI via localhost:9000 while it's running on GKE
But I can't figure out how and where to edit the config files in order to enable the gcs plugin
m
So these are config files you provide when starting Pinot components. I'll check how these configs get hooked up when using GKE. Also tagging @User who did this part.
x
for the GKE part, if you deploy under k8s then you can edit the configMap to add all the required configs
m
Could we add this in the doc?
x
those things can be edited inside the
values.yaml
when deploying with helm
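To make this concrete, here is a sketch of what enabling the GCS plugin through the helm values.yaml might look like. The `extra.configs` blocks follow the official Pinot helm chart and the Pinot GCS filesystem docs, but verify the exact structure against your chart version; `my-project` and the key path are placeholders:

```yaml
controller:
  extra:
    configs: |-
      pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
      pinot.controller.storage.factory.gs.projectId=my-project
      pinot.controller.storage.factory.gs.gcpKey=/var/secrets/gcp/key.json
      pinot.controller.segment.fetcher.protocols=file,http,gs
      pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
server:
  extra:
    configs: |-
      pinot.server.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
      pinot.server.storage.factory.gs.projectId=my-project
      pinot.server.storage.factory.gs.gcpKey=/var/secrets/gcp/key.json
      pinot.server.segment.fetcher.protocols=file,http,gs
      pinot.server.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```

After editing, re-run `helm upgrade` so the pods restart with the new configs.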
h
Thank you ! That was helpful
In order to launch an ingestion job I created a yml file using this template :
executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'gs://my-bucket/path/to/input/directory/'
outputDirURI: 'gs://my-bucket/path/to/output/directory/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: gs
      className: org.apache.pinot.plugin.filesystem.GcsPinotFS
      configs:
        projectId: 'my-project'
        gcpKey: 'path-to-gcp json key file'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'
I created the file under :
/incubator-pinot/kubernetes/helm/pinot
x
in k8s?
you can do that by changing the
controllerURI: 'http://localhost:9000'
and for gcp Key
you need to add that into your k8s as a secret
then mount it to the container
h
and what should I do to launch it ?
x
you can create a k8s batch job to launch it as a one time job
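As a sketch, such a one-time ingestion Job could look like the manifest below. The names, image tag, and the job-spec path are assumptions (the spec file would be mounted from a ConfigMap or baked into the image); this also assumes the `apachepinot/pinot` image's entrypoint is `pinot-admin.sh` — if not, set `command:` to invoke `bin/pinot-admin.sh` explicitly:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-gcs-ingestion
spec:
  backoffLimit: 0            # don't retry a failed ingestion automatically
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pinot-ingestion
          image: apachepinot/pinot:latest   # pin a concrete release tag in practice
          args:
            - LaunchDataIngestionJob
            - -jobSpecFile
            - /var/pinot/example/gcs_ingestion_job.yaml
```

Launch it with `kubectl apply -f` and inspect progress with `kubectl logs job/pinot-gcs-ingestion`.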
h
Thank you I'll try to go in depth !
Hello @User , I tried to follow the steps and things seem to work fine. But I can't run the gcp ingestion job. When I try to inspect the problem in logs, the error comes from reading the gcp key. I tried to generate the gcp key as a k8s secret using :
kubectl create secret generic gcpkey --from-file ~/pinot-key.json
But then I don't know how to use this key in my job yaml, i.e. what to pass to the parameter
gcpKey
. Do you have an idea of how I can pass the k8s secret (gcp key) to the ingestion job?
x
For k8s, you need to mount the secret key to your ingestion job container, then you can use this key file in your container
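Concretely, the mount could be sketched like this inside the Job's pod template (names and paths here are illustrative, not prescribed by Pinot):

```yaml
spec:
  containers:
    - name: pinot-ingestion
      volumeMounts:
        - name: gcp-key
          mountPath: /var/secrets/gcp   # key file appears under this directory
          readOnly: true
  volumes:
    - name: gcp-key
      secret:
        secretName: gcpkey   # the secret created with kubectl above
```

Then point the job spec at the mounted file, e.g. `gcpKey: '/var/secrets/gcp/pinot-key.json'`.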
h
Thank you I succeeded doing that. But I get this error :
java.lang.RuntimeException: Failed to read from Schema URI - 'http://localhost:9000/tables/fhv_trips_data/schema'
	at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.getSchema(SegmentGenerationUtils.java:87) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.init(SegmentGenerationJobRunner.java:144) ~[pinot-batch-ingestion-standalone-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:140) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:166) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:186) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_292]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_292]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_292]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_292]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_292]
	at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_292]
	at java.net.Socket.connect(Socket.java:556) ~[?:1.8.0_292]
	at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[?:1.8.0_292]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:1.8.0_292]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:1.8.0_292]
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:1.8.0_292]
	at sun.net.www.http.HttpClient.New(HttpClient.java:339) ~[?:1.8.0_292]
	at sun.net.www.http.HttpClient.New(HttpClient.java:357) ~[?:1.8.0_292]
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226) ~[?:1.8.0_292]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162) ~[?:1.8.0_292]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056) ~[?:1.8.0_292]
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990) ~[?:1.8.0_292]
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570) ~[?:1.8.0_292]
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[?:1.8.0_292]
	at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.fetchUrl(SegmentGenerationUtils.java:231) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.getSchema(SegmentGenerationUtils.java:85) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
	... 6 more
This is my schema :
{
  "schemaName": "fhvTrips",
  "dimensionFieldSpecs": [
    {
      "name": "Dispatching_base_num",
      "dataType": "STRING"
    },
    {
      "name": "locationID",
      "dataType": "INT"
    }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "SecondsSinceEpoch",
      "dataType": "INT",
      "timeType": "SECONDS"
    }
  },
  "dateTimeFieldSpecs": [
    {
      "name": "Pickup_date",
      "dataType": "INT",
      "format": "1:DAYS:EPOCH",
      "granularity": "1:DAYS"
    }
  ]
}
x
for k8s you need to change the schema call and table call to use your pinot-controller service name
http://<pinot-controller-svc>:9000/tables/fhv_trips_data/schema
not
http://localhost:9000/tables/fhv_trips_data/schema
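In the job spec this means pointing the controller and the schema/table lookups at the in-cluster service DNS name. A sketch (`pinot-controller` is the default service name from the Pinot helm chart; confirm yours with `kubectl get svc`):

```yaml
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
tableSpec:
  tableName: 'fhv_trips_data'
  schemaURI: 'http://pinot-controller:9000/tables/fhv_trips_data/schema'
  tableConfigURI: 'http://pinot-controller:9000/tables/fhv_trips_data'
```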
h
Alright that works fine, thanks ! But still it can't load the data. I tried ingesting one file and it worked but when I try to ingest 70 files it doesn't.
x
must be oom 😛
how big is your single file
do you have the errors
h
Files can go up to 1.4 GB
x
then give the k8s container more resources and increase the JVM heap size
Also try not to create segments in parallel
h
I'm using
SegmentCreationAndTarPush
, is it the right one to pass as jobType ?
x
I think it’s fine
What’s the exceptions?
Have you tried bigger container?
h
By bigger container, do you mean changing the resources in this block?
resources:
  limits:
    memory: 512Mi
    cpu: 1
  requests:
    memory: 512Mi
    cpu: 1
The error I get is :
/var/pinot/example/ingest_gcs.sh: line 3: 8 Killed bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /var/pinot/example/gcs_ingestion_job.yaml
I solved the issue by changing the memory parameter in resources!
But it seems that the segment creation takes a long time, is there a way to reduce the segment creation time?
m
Right now a single input file corresponds to one output Pinot segment. One way to improve the time would be to create a larger number of smaller input files. Ideally you want to get the Pinot segment size in the range of a few hundred MBs. What's the input size right now (overall, and per file)?
h
I'm testing on an input folder of 72 files, each file goes from 100MB to 1.6 GB
m
What's the parallelism you are using?
h
I'm not using parallelism
m
Ok, please use parallelism (you may need to give ample xmx in that case)
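For reference, parallelism is a top-level field of the ingestion job spec, so it slots into the yaml from earlier; the value below is just an example to tune against your container's memory:

```yaml
jobType: SegmentCreationAndTarPush
segmentCreationJobParallelism: 4   # example value; each parallel worker needs its share of the heap
```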
h
Editing the Xmx should be in this block?
env:
  - name: JAVA_OPTS
    value: "-Xms4G -Xmx4G -Dpinot.admin.system.exit=true"
m
yes
h
Alright thanks I will try that