How to ingest data into Pinot on Kubernetes using native batch ingestion?
# general
a
How to ingest data into Pinot on Kubernetes using native batch ingestion? Hi, I am trying to ingest CSV data into Pinot deployed on Kubernetes using the LaunchDataIngestionJob argument. I have verified that the table has been created in Pinot and that the job spec and CSV data are present on the node. This is my Kubernetes Job manifest:
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-case-offline-ingestion
  namespace: my-pinot-kube
spec:
  template:
    spec:
      containers:
        - name: pinot-load-case-offline
          image: apachepinot/pinot:0.3.0-SNAPSHOT
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/opt/data/table-configs/case_history/job-spec.yml"]
          volumeMounts:
            - name: mount-data
              mountPath: /opt/data
      restartPolicy: OnFailure
      volumes:
        - name: mount-data
          hostPath:
            path: /opt/data
  backoffLimit: 100
After applying this job to the node, nothing happens. This is the log of the pod:
SegmentGenerationJobSpec: 
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.csv
inputDirURI: /opt/data/csv_data/case_prod_data
jobType: SegmentCreationAndTarPush
outputDirURI: /pinot-segments/case_history
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://192.168.49.2:30892/'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: null
recordReaderSpec:
  className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
  configs: {delimiter: '|', multiValueDelimiter: ''}
  dataFormat: csv
segmentNameGeneratorSpec:
  configs: {segment.name.prefix: case_history, exclude.sequence.id: 'true'}
  type: normalizedDate
tableSpec: {schemaURI: null, tableConfigURI: null, tableName: case_history}

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
Am I ingesting the data incorrectly?
x
I think you are missing pushJobSpec?
pushJobSpec: null
a
Hi @Xiang Fu, thank you for helping. I tried adding pushJobSpec to the job spec:
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
But the job completes with no errors, and the pod log is:
SegmentGenerationJobSpec: 
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.csv
inputDirURI: /opt/data/csv_data/case_prod_data
jobType: SegmentCreationAndTarPush
outputDirURI: /pinot-segments/case_history
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: 'http://192.168.49.2:30892/'}
pinotFSSpecs:
- {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
pushJobSpec: {pushAttempts: 2, pushParallelism: 2, pushRetryIntervalMillis: 1000,
  segmentUriPrefix: null, segmentUriSuffix: null}
recordReaderSpec:
  className: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  configClassName: org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
  configs: {delimiter: '|', multiValueDelimiter: ''}
  dataFormat: csv
segmentNameGeneratorSpec:
  configs: {segment.name.prefix: case_history, exclude.sequence.id: 'true'}
  type: normalizedDate
tableSpec: {schemaURI: null, tableConfigURI: null, tableName: case_history}

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
Initializing PinotFS for scheme file, classname org.apache.pinot.spi.filesystem.LocalPinotFS
x
ok
what are the logs for the job?
a
Here is the log of the job:
$ kubectl -n my-pinot-kube describe jobs.batch pinot-case-offline-ingestion
Name:           pinot-case-offline-ingestion
Namespace:      my-pinot-kube
Selector:       controller-uid=25b4e843-b600-4de2-a2ad-584ac8ce17b5
Labels:         controller-uid=25b4e843-b600-4de2-a2ad-584ac8ce17b5
                job-name=pinot-case-offline-ingestion
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Fri, 05 Mar 2021 16:26:41 -0500
Completed At:   Fri, 05 Mar 2021 16:26:44 -0500
Duration:       3s
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=25b4e843-b600-4de2-a2ad-584ac8ce17b5
           job-name=pinot-case-offline-ingestion
  Containers:
   pinot-load-case-offline:
    Image:      apachepinot/pinot:0.3.0-SNAPSHOT
    Port:       <none>
    Host Port:  <none>
    Args:
      LaunchDataIngestionJob
      -jobSpecFile
      /opt/data/table-configs/case_history/job-spec.yml
    Environment:  <none>
    Mounts:
      /opt/data from mount-data (rw)
  Volumes:
   mount-data:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/data
    HostPathType:  
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  27s   job-controller  Created pod: pinot-case-offline-ingestion-mfvrx
  Normal  Completed         24s   job-controller  Job completed
Here is the job spec file for reference. What should the value of pinotClusterSpecs.controllerURI be? I tried changing it to gibberish and got the same logs, so I think my value of pinotClusterSpecs.controllerURI is incorrect.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/opt/data/csv_data/case_prod_data'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/pinot-segments/case_history'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: '|'
    multiValueDelimiter: ''
tableSpec:
  tableName: 'case_history'
pinotClusterSpecs:
  # - controllerURI: 'pinot-controller:9000'
  - controllerURI: 'http://192.168.49.2:30892/'
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'case_history'
    exclude.sequence.id: true
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
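On the controllerURI question: because the ingestion Job runs as a pod inside the cluster, one option is to address the controller through its in-cluster Service DNS name instead of a node IP and NodePort. A minimal sketch, assuming the Helm chart exposes a Service named pinot-controller on port 9000 in the my-pinot-kube namespace. The Service name and port are assumptions taken from the commented-out line in the spec above, here written with an explicit http:// scheme.

```yaml
pinotClusterSpecs:
  # Assumed Service name, namespace, and port; confirm them with
  # `kubectl -n my-pinot-kube get svc` before relying on this value.
  - controllerURI: 'http://pinot-controller.my-pinot-kube.svc.cluster.local:9000'
```

The NodePort address used in the thread also ended up working once the image was updated, so this is an alternative rather than a required change.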
x
then is there data in
/opt/data/csv_data/case_prod_data
a
Yes. I checked by running an Ubuntu container and opening a bash shell in it. The data is present at this path.
x
can you try a newer image as well
apachepinot/pinot:0.6.0
0.3.0 is a very old image, and I cannot recall its details
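For reference, the suggestion amounts to changing a single line of the Job manifest posted at the top of the thread. A sketch of the same manifest with only the image tag swapped; everything else is copied unchanged from the original post:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-case-offline-ingestion
  namespace: my-pinot-kube
spec:
  template:
    spec:
      containers:
        - name: pinot-load-case-offline
          # Only this line differs from the original manifest.
          image: apachepinot/pinot:0.6.0
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/opt/data/table-configs/case_history/job-spec.yml"]
          volumeMounts:
            - name: mount-data
              mountPath: /opt/data
      restartPolicy: OnFailure
      volumes:
        - name: mount-data
          hostPath:
            path: /opt/data
  backoffLimit: 100
```

Since a Job's pod template cannot be changed once the Job exists, the old Job most likely has to be deleted before the updated manifest is applied.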
a
OK, so I changed the image and it worked. At the very end of the log it says:
Response for pushing table case_history segment case_history to location http://192.168.49.2:30892 - 200: {"status":"Successfully uploaded segment: case_history of table: case_history"}
But I am wondering why I cannot query it in the Pinot query UI.
No records are returned from the query select * from case_history limit 10
x
hmm
it should be
a
Seems like another issue that I have to look into. But anyway, thank you very much @Xiang Fu for your prompt responses and help. The new image worked out well.
x
can you check pinot server log?
seems like so
a
ok. I do see some errors on pinot-server.
2021/03/05 20:45:00.943 INFO [HelixServerStarter] [Start a Pinot [SERVER]] Starting Pinot server
2021/03/05 20:45:00.944 INFO [HelixServerStarter] [Start a Pinot [SERVER]] Initializing Helix manager with zkAddress: pinot-zookeeper:2181, clusterName: pinot-quickstart, instanceId: Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098
2021/03/05 20:45:02.560 INFO [HelixServerStarter] [Start a Pinot [SERVER]] Initializing server instance and registering state model factory
2021/03/05 20:45:51.252 INFO [HelixServerStarter] [Start a Pinot [SERVER]] Connecting Helix manager
2021/03/05 20:46:42.537 WARN [ClientCnxn] [Start a Pinot [SERVER]-SendThread(pinot-zookeeper:2181)] Client session timed out, have not heard from server in 31084ms for sessionid 0x0
2021/03/05 20:46:44.353 WARN [ParticipantHealthReportTask] [Start a Pinot [SERVER]] ParticipantHealthReportTimerTask already stopped
2021/03/05 20:47:10.343 WARN [CallbackHandler] [Start a Pinot [SERVER]] Callback handler received event in wrong order. Listener: org.apache.helix.messaging.handling.HelixTaskExecutor@2767bcd8, path: /pinot-quickstart/INSTANCES/Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098/MESSAGES, expected types: [CALLBACK, FINALIZE] but was INIT
2021/03/05 20:47:11.245 INFO [HelixServerStarter] [Start a Pinot [SERVER]] Instance config for instance: Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098 has instance tags: [DefaultTenant_OFFLINE, DefaultTenant_REALTIME], host: pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local, port: 8098, no need to update
2021/03/05 20:47:11.249 INFO [HelixServerStarter] [Start a Pinot [SERVER]] Using class: org.apache.pinot.server.api.access.AllowAllAccessFactory as the AccessControlFactory
2021/03/05 20:47:11.455 INFO [HelixServerStarter] [Start a Pinot [SERVER]] Starting server admin application on: http://0.0.0.0:8097
2021/03/05 20:47:13.650 WARN [ClientCnxn] [Start a Pinot [SERVER]-SendThread(pinot-zookeeper:2181)] Session 0x10001285ff10004 for server pinot-zookeeper/10.107.87.233:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_282]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_282]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_282]
	at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_282]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[?:1.8.0_282]
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
2021/03/05 20:47:46.344 WARN [ZKHelixManager] [ZkClient-EventThread-16-pinot-zookeeper:2181] KeeperState:Disconnected, SessionId: 10001285ff10004, instance: Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098, type: PARTICIPANT
Mar 05, 2021 8:48:39 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8097]
Mar 05, 2021 8:48:40 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
2021/03/05 20:48:41.841 WARN [ZKHelixManager] [ZkClient-EventThread-16-pinot-zookeeper:2181] KeeperState:Disconnected, SessionId: 10001285ff10004, instance: Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098, type: PARTICIPANT
2021/03/05 20:50:17.063 WARN [ZKHelixManager] [ZkClient-EventThread-16-pinot-zookeeper:2181] KeeperState:Disconnected, SessionId: 10001285ff10004, instance: Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098, type: PARTICIPANT
2021/03/05 20:51:06.653 ERROR [StartServiceManagerCommand] [Start a Pinot [SERVER]] Failed to start a Pinot [SERVER] at 368.2 since launch
org.apache.helix.HelixException: fail to set config. cluster: pinot-quickstart is NOT setup.
	at org.apache.helix.ConfigAccessor.set(ConfigAccessor.java:300) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.helix.manager.zk.ZKHelixAdmin.setConfig(ZKHelixAdmin.java:1092) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.pinot.server.starter.helix.HelixServerStarter.start(HelixServerStarter.java:361) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.pinot.tools.service.PinotServiceManager.startServer(PinotServiceManager.java:150) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:95) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:260) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:286) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.access$000(StartServiceManagerCommand.java:57) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
	at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:260) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-b2d716d9c465eaf69685f8e284015de5cd7b038e]
2021/03/05 21:37:47.170 WARN [ConfigAccessor] [ZkClient-EventThread-16-pinot-zookeeper:2181] No config found at /pinot-quickstart/CONFIGS/RESOURCE/case_history_OFFLINE
I don't know why it is looking for pinot-quickstart configs
x
hmm, when you started the pinot server, did you give a cluster name?
a
I start Pinot using helm like this:
kubectl create ns my-pinot-kube
helm install pinot /home/ayush/spyne/incubator-pinot/kubernetes/helm/pinot -n my-pinot-kube --set replicas=1
x
hmmm
can you describe the statefulsets of pinot-controller and pinot-server and see what their arguments are
a
OK. All the Pinot workers are in the running state. I do see these 2 warnings on pinot-controller:
WARN [PinotInstanceRestletResource] [grizzly-http-server-1] Admin port is not set for instance: Server_pinot-server-0.pinot-server-headless.my-pinot-kube.svc.cluster.local_8098
...
...
WARN [PinotInstanceRestletResource] [grizzly-http-server-1] Grpc port is not set for instance: Controller_pinot-controller-0.pinot-controller-headless.my-pinot-kube.svc.cluster.local_9000
...
...
or, I think this could mean something (log on pinot-controller)
WARN [SegmentStatusChecker] [pool-7-thread-2] Table case_history_OFFLINE has 1 segments with no online replicas
WARN [SegmentStatusChecker] [pool-7-thread-2] Table case_history_OFFLINE has 0 replicas, below replication threshold :1
x
this means your controller is up, but no pinot server is connected to the cluster
I feel something went wrong with the server setup
can you try restarting the pinot-server pod and see if it reconnects?
a
yes. restarting the node
yes. restarting the node worked out! Thank you very much @Xiang Fu. 🙏
x
cool!
I think the issue is that the pinot server pod started before the pinot controller, which has to set up the ZooKeeper structure first
so restart should fix it
a
Yes. Whenever I start using helm, zookeeper and controller are the last ones to start, and because of that the server and broker take multiple restarts.
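One possible way to avoid this startup race is to make the server (and broker) pods wait until the controller answers before their main container starts. A minimal sketch of an init container that could be added to the pinot-server StatefulSet pod template. The container name, the busybox image tag, the Service name pinot-controller, port 9000, and the use of the controller's /health endpoint are all assumptions for illustration, not something the Helm chart is confirmed to provide:

```yaml
# Fragment of a pod template (spec.template.spec), not a complete manifest.
initContainers:
  - name: wait-for-controller        # hypothetical name
    image: busybox:1.36              # any small image with sh and wget would do
    command:
      - sh
      - -c
      - |
        # Poll the assumed controller health endpoint until it responds successfully.
        until wget -q -O /dev/null http://pinot-controller:9000/health; do
          echo "waiting for pinot-controller"
          sleep 5
        done
```

With a gate like this, the server would only attempt to join the cluster after the controller is reachable, which is the ordering the restart achieved by hand.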