#general

Hamza

05/21/2021, 3:06 PM
Hello, I'm new to Pinot and I'm trying to run some tests on the tool using a GCP deployment. I have deployed Pinot on GKE using the official documentation and it seems to work well. The next step is to import batch data from Google Cloud Storage in order to run some queries. The data on Cloud Storage is 300 GB of CSV files. First thing: I couldn't figure out how to link Cloud Storage to Apache Pinot via the Pinot plugin. Second thing: how can I transform the CSV files into Pinot tables? Can I have some guidance on these two subjects? Thanks in advance 😄

Hamza

05/21/2021, 3:20 PM
I have checked this page, but it isn't clear where I can access the controller and server configs.

Mayank

05/21/2021, 3:23 PM
These are config files used to start pinot components. For example: https://docs.pinot.apache.org/configuration-reference/server
@User How do we customize these configs when using GKE?

Hamza

05/21/2021, 3:28 PM
I have also followed the second link and I can successfully access the Pinot UI via localhost:9000 while it's running on GKE.
But I can't figure out how and where to edit the config files in order to enable the GCS plugin.

Mayank

05/21/2021, 3:33 PM
So these are config files you provide when starting Pinot components. I'll check how these configs get hooked up when using GKE. Also tagging @User who did this part.

Xiang Fu

05/21/2021, 5:13 PM
For the GKE part: if you deploy under k8s, you can edit the ConfigMap to add all the required configs.
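As a sketch of what that ConfigMap edit might look like — the property keys below follow the Pinot GCS filesystem documentation, but the ConfigMap name, data key, and paths are illustrative assumptions, not the chart's actual defaults:

```yaml
# Illustrative ConfigMap fragment enabling the GCS plugin on the controller.
# ConfigMap/file names and paths are assumptions; property keys follow the Pinot docs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: pinot-controller-config
data:
  pinot-controller.conf: |
    controller.data.dir=gs://my-bucket/pinot-data/controller
    controller.local.temp.dir=/tmp/pinot-tmp-data
    pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
    pinot.controller.storage.factory.gs.projectId=my-project
    pinot.controller.storage.factory.gs.gcpKey=/var/secrets/google/key.json
    pinot.controller.segment.fetcher.protocols=file,http,gs
    pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```

Analogous `pinot.server.*` keys would go in the server config.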

Mayank

05/21/2021, 5:17 PM
Could we add this in the doc?

Xiang Fu

05/21/2021, 5:17 PM
Those can be edited inside values.yaml when deploying with Helm.
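For example, a values.yaml fragment might look like the following — this assumes the Apache Pinot Helm chart's `extra.configs` block per component, which may differ between chart versions:

```yaml
# values.yaml fragment (field layout assumes the Apache Pinot Helm chart;
# project id and key path are placeholders)
controller:
  extra:
    configs: |-
      pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
      pinot.controller.storage.factory.gs.projectId=my-project
      pinot.controller.storage.factory.gs.gcpKey=/var/secrets/google/key.json
      pinot.controller.segment.fetcher.protocols=file,http,gs
      pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```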

Hamza

05/21/2021, 6:55 PM
Thank you! That was helpful.
In order to launch an ingestion job I created a YAML file using this template:
```yaml
executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'gs://my-bucket/path/to/input/directory/'
outputDirURI: 'gs://my-bucket/path/to/output/directory/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: gs
      className: org.apache.pinot.plugin.filesystem.GcsPinotFS
      configs:
        projectId: 'my-project'
        gcpKey: 'path-to-gcp json key file'
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
    tableName: 'students'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'
```
I created the file under /incubator-pinot/kubernetes/helm/pinot.

Xiang Fu

05/21/2021, 6:58 PM
In k8s? You can do that by changing the
```yaml
controllerURI: 'http://localhost:9000'
```
And for the GCP key, you need to add it into your k8s as a secret, then mount it to the container.

Hamza

05/21/2021, 6:59 PM
And what should I do to launch it?

Xiang Fu

05/21/2021, 7:02 PM
You can create a k8s batch Job to launch it as a one-time job.
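A sketch of such a Job — the image tag, resource names, and the ConfigMap holding the job spec are all illustrative, and it assumes the apachepinot/pinot image forwards its args to pinot-admin.sh:

```yaml
# One-shot batch Job running the standalone ingestion job spec.
# Image tag, names, and the job-spec ConfigMap are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-gcs-ingestion
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pinot-ingestion
          image: apachepinot/pinot:latest
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/var/pinot/jobs/gcs_ingestion_job.yaml"]
          volumeMounts:
            - name: job-spec
              mountPath: /var/pinot/jobs
      volumes:
        - name: job-spec
          configMap:
            name: pinot-ingestion-job-spec
```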

Hamza

05/21/2021, 7:06 PM
Thank you, I'll try to go in depth!
Hello @User, I tried to follow the steps and things seem to work fine. But I can't run the GCP ingestion job. When I inspect the problem in the logs, the error comes from reading the GCP key. I tried to generate the GCP key as a k8s secret using:
kubectl create secret generic gcpkey --from-file ~/pinot-key.json
But then I don't know how to use this key in the gcpKey parameter of my job YAML. Do you have an idea how I can pass the k8s secret (GCP key) to the ingestion job?

Xiang Fu

06/08/2021, 10:25 AM
For k8s, you need to mount the secret key into your ingestion job container; then you can use the key file in your container.
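Concretely, that mount might look like the following pod-spec fragment (container name and mount path are illustrative); the gcpKey value in the ingestion job YAML then points at the mounted file:

```yaml
# Pod spec fragment: mount the 'gcpkey' secret created with
#   kubectl create secret generic gcpkey --from-file ~/pinot-key.json
# Mount path and container name are assumptions.
volumes:
  - name: gcpkey
    secret:
      secretName: gcpkey
containers:
  - name: pinot-ingestion
    volumeMounts:
      - name: gcpkey
        mountPath: /var/secrets/google
        readOnly: true
```

The job spec would then use `gcpKey: '/var/secrets/google/pinot-key.json'`.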

Hamza

06/08/2021, 2:49 PM
Thank you, I succeeded in doing that. But I get this error:
```
java.lang.RuntimeException: Failed to read from Schema URI - 'http://localhost:9000/tables/fhv_trips_data/schema'
    at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.getSchema(SegmentGenerationUtils.java:87) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.init(SegmentGenerationJobRunner.java:144) ~[pinot-batch-ingestion-standalone-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:140) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:166) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:186) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_292]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_292]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_292]
    at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_292]
    at java.net.Socket.connect(Socket.java:556) ~[?:1.8.0_292]
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.New(HttpClient.java:339) ~[?:1.8.0_292]
    at sun.net.www.http.HttpClient.New(HttpClient.java:357) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570) ~[?:1.8.0_292]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[?:1.8.0_292]
    at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.fetchUrl(SegmentGenerationUtils.java:231) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.getSchema(SegmentGenerationUtils.java:85) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-e91f426ac179d12e6ef310ddb849912b693bef96]
    ... 6 more
```
This is my schema:
```json
{
  "schemaName": "fhvTrips",
  "dimensionFieldSpecs": [
    {
      "name": "Dispatching_base_num",
      "dataType": "STRING"
    },
    {
      "name": "locationID",
      "dataType": "INT"
    }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "SecondsSinceEpoch",
      "dataType": "INT",
      "timeType": "SECONDS"
    }
  },
  "dateTimeFieldSpecs": [
    {
      "name": "Pickup_date",
      "dataType": "INT",
      "format": "1:DAYS:EPOCH",
      "granularity": "1:DAYS"
    }
  ]
}
```

Xiang Fu

06/08/2021, 5:20 PM
For k8s you need to change the schema call and table call to use your pinot-controller service name:
```
http://<pinot-controller-svc>:9000/tables/fhv_trips_data/schema
```
not
```
http://localhost:9000/tables/fhv_trips_data/schema
```

Hamza

06/08/2021, 6:43 PM
Alright, that works fine, thanks! But it still can't load the data. I tried ingesting one file and it worked, but when I try to ingest 70 files it doesn't.

Xiang Fu

06/08/2021, 6:59 PM
Must be OOM 😛
How big is your single file?
Do you have the errors?

Hamza

06/08/2021, 7:05 PM
Files can go up to 1.4 GB.

Xiang Fu

06/08/2021, 7:44 PM
Then give the k8s container more resources and increase the JVM size.
Also try to not create segments in parallel.

Hamza

06/09/2021, 9:45 AM
I'm using SegmentCreationAndTarPush, is it the right one to pass as jobType?

Xiang Fu

06/09/2021, 10:02 AM
I think it’s fine
What’s the exceptions?
Have you tried bigger container?

Hamza

06/09/2021, 10:04 AM
By bigger container you mean changing the resources here?
```yaml
resources:
  limits:
    memory: 512Mi
    cpu: 1
  requests:
    memory: 512Mi
    cpu: 1
```
The error I get is:
/var/pinot/example/ingest_gcs.sh: line 3: 8 Killed bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /var/pinot/example/gcs_ingestion_job.yaml
I solved the issue by changing the memory parameter in resources!
But it seems that the segment creation takes a long time; is there a way to reduce the segment creation time?

Mayank

06/09/2021, 4:58 PM
Right now a single input file corresponds to one output Pinot segment. One way to improve time would be to create a larger number of input files. Ideally you want to get the Pinot segment size in the range of a few hundred MBs. What's the input size right now (overall, and per file)?
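If re-cutting the existing CSVs is an option, a small pre-processing script can rebalance file sizes before upload so each file (and thus each segment) lands near the target size. A minimal sketch using only the standard library — the target size, naming scheme, and the rough per-row byte estimate are arbitrary choices:

```python
import csv
import os

def split_csv(src_path, out_dir, target_bytes=256 * 1024 * 1024):
    """Split one large CSV into several files of roughly target_bytes each,
    repeating the header row in every output file. Returns the output paths."""
    os.makedirs(out_dir, exist_ok=True)
    out_paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        part, out, writer, written = 0, None, None, 0
        for row in reader:
            # Start a new part file when none is open or the size budget is spent.
            if out is None or written >= target_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part_{part:04d}.csv")
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                out_paths.append(path)
                part += 1
                written = 0
            writer.writerow(row)
            # Rough byte estimate: field lengths plus separators/newline.
            written += sum(len(f) for f in row) + len(row)
        if out is not None:
            out.close()
    return out_paths
```

Pointing `inputDirURI` at the resulting directory then yields one reasonably-sized segment per part file.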

Hamza

06/09/2021, 5:24 PM
I'm testing on an input folder of 72 files, each file goes from 100MB to 1.6 GB

Mayank

06/09/2021, 5:31 PM
What's the parallelism you are using?

Hamza

06/09/2021, 5:33 PM
I'm not using parallelism

Mayank

06/09/2021, 5:33 PM
Ok, please use parallelism (you may need to give ample xmx in that case)
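For reference, the batch ingestion job spec exposes a parallelism knob; a sketch assuming the `segmentCreationJobParallelism` field from the Pinot job spec reference (verify the field name against your Pinot version):

```yaml
jobType: SegmentCreationAndTarPush
# Number of input files processed concurrently by the standalone runner;
# each concurrent segment build needs its own share of the JVM heap.
segmentCreationJobParallelism: 4
```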

Hamza

06/09/2021, 5:35 PM
Editing the Xmx should be in this block?
```yaml
env:
  - name: JAVA_OPTS
    value: "-Xms4G -Xmx4G -Dpinot.admin.system.exit=true"
```

Mayank

06/09/2021, 5:35 PM
yes

Hamza

06/09/2021, 5:35 PM
Alright thanks I will try that