# general
s
Hi, I have set up Pinot on EKS. How can I start with batch upload from S3? How do I set configs and create tables? I have started with the incubator release.
x
you can check this doc for the corresponding configs to set in the `values.yml` file: https://docs.pinot.apache.org/users/tutorials/use-s3-as-deep-store-for-pinot
n
I’m fairly new to Pinot, but I just did something similar in GCP, so this should be the general direction:
1. Add the S3 plugin and configuration from here: https://docs.pinot.apache.org/basics/data-import/pinot-file-system/amazon-s3
2. Create and upload your schema definition (I used an API call to the controller; see the sketch after this list).
3. Create and upload your table definition (API call again).
4. Create a job spec (the S3 link above has an example; use `jobType: SegmentCreationAndUriPush` if you want your server to both create the segments and make the data query-able).
5. ssh into your server pod, put your job spec file somewhere in the pod, and run
```
/opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job/spec &
```
The script location may be different depending on how/where you installed. I also added the `&` at the end so I could disconnect from the pod and leave it running.
If anyone has an easier way to do 5 I’m all ears, but I didn’t see an endpoint in the swagger docs to upload a job spec file for processing on a specific worker.
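(For steps 2 and 3, here’s a minimal sketch of those controller calls, assuming the controller is reachable at localhost:9000; `my-schema.json` and `my-table.json` are hypothetical file names:)
```
# Upload a schema definition to the controller's /schemas endpoint
curl -X POST -H "Content-Type: application/json" \
  -d @my-schema.json http://localhost:9000/schemas

# Upload a table config to the /tables endpoint
curl -X POST -H "Content-Type: application/json" \
  -d @my-table.json http://localhost:9000/tables
```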
x
Thanks Nick! For 5, we recently added a new ImportData command:
```
bin/pinot-admin.sh ImportData -dataFilePath file:/Users/xiangfu/temp/github-data/part-00006-493ff49d-d946-4437-9e37-523b90c3d96f-c000.gz.parquet -format parquet -table githubEvents
```
n
Also, my pods kept dying because they would fill up the `tmp` directory, so to get around this I modified the helm chart’s statefulset and added a PV mount, since my nodes didn’t have enough space.
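(A rough sketch of that kind of statefulset tweak; the volume name, size, and mount path here are assumptions:)
```
# Hypothetical fragments of the server statefulset:
# volumeMounts goes under the server container,
# volumeClaimTemplates under the statefulset spec.
volumeMounts:
  - name: segment-tmp
    mountPath: /tmp
volumeClaimTemplates:
  - metadata:
      name: segment-tmp
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
```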
x
Thanks for reporting this, are you able to see what’s inside the tmp directory?
n
Thanks Xiang!
Yes, it’s downloading the files from my bucket in there. I have a lot, so it fills up, the node gets diskPressure, and the pod gets evicted.
Also, I wanted to help with some edits to the documentation, but it says it moved. How can I contribute?
x
Pinot docs are at this repo: https://github.com/pinot-contrib/pinot-docs
for the temp directory, I will add an option to allow a configurable temp directory for segment generation
n
Awesome thank you so much. That’ll save me from having to make a custom helm chart.
x
yes, i know someone tried to mount a big pvc to /tmp 😂
n
guilty as charged ¯\_(ツ)_/¯
s
Thanks @Xiang Fu, @Nick Bowles.
@Nick Bowles, about the steps you mentioned for GCP: how did you set up the S3 plugin? Did you change the configuration in YAML and redeploy? Can you please describe the steps to apply the controller config, server config, and job spec, plus any other steps?
@Xiang Fu, you mentioned the import command, and the steps say it has to be run from the pod. If we want to automate that process, is there a way to push data to Pinot from S3 from some other VM, or do we need to add a cron job in the kube YAML to run this command periodically? Is that the correct understanding? Also, is there a way to configure Kinesis?
@Nick Bowles, "I used API call to controller": what kind of call, and which API do I need to use?
x
yes, actually you can run those commands from a different pod
x
but that requires you to run a k8s batch job or a cron job
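(A minimal sketch of such a k8s CronJob, assuming the job spec lives in a ConfigMap named `pinot-ingestion-job-spec` and that the `apachepinot/pinot` image’s entrypoint is `pinot-admin.sh`; the schedule and names are placeholders:)
```
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pinot-s3-ingestion
spec:
  schedule: "0 * * * *"   # hourly
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: pinot-ingestion
              image: apachepinot/pinot:latest
              # Passed as arguments to the image entrypoint
              args:
                - LaunchDataIngestionJob
                - -jobSpecFile
                - /config/job-spec.yaml
              volumeMounts:
                - name: job-spec
                  mountPath: /config
          volumes:
            - name: job-spec
              configMap:
                name: pinot-ingestion-job-spec
          restartPolicy: Never
```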
s
but I’m not getting how I can apply these properties
I installed Pinot with default settings from the documentation
x
oh?
which properties?
s
I want to configure S3 with Pinot
I installed Pinot on EKS using the documentation
x
right
have you seen this one?
you need to add these configs to the pinot-controller conf file:
s
now I want to configure S3 parquet files as the source
x
```
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2
```
in k8s, you can add them under `controller.extra.configs` in the `values.yml` file
```
# Extra configs will be appended to pinot-controller.conf file
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      controller.task.scheduler.enabled=true
```
you can append those configs there
similar for servers
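(From the same S3 doc, the server-side analog of those configs is below; presumably it goes under `server.extra.configs` in the same `values.yml`:)
```
pinot.server.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.server.storage.factory.s3.region=us-west-2
pinot.server.segment.fetcher.protocols=file,http,s3
pinot.server.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```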
s
in kubernetes/helm/templates/controller here?
x
you can add
```
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-west-2

pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
under this block in the values.yml file:
```
# Extra configs will be appended to pinot-controller.conf file
  extra:
    configs: |-
```
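(Put together, that section of values.yml would end up looking something like this:)
```
# Extra configs will be appended to pinot-controller.conf file
  extra:
    configs: |-
      pinot.set.instance.id.to.hostname=true
      controller.task.scheduler.enabled=true
      pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
      pinot.controller.storage.factory.s3.region=us-west-2
      pinot.controller.segment.fetcher.protocols=file,http,s3
      pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```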
ah, so I guess you installed Pinot using `helm install ...`
s
yes, I started from the documentation
using helm on EKS
should I go with some other approach?
x
it’s still the same
similar to this, you need to do
```
helm inspect values pinot/pinot > /tmp/pinot-values.yaml
```
then you can edit the pinot values
then install pinot using this values.yaml file:
```
helm install pinot pinot/pinot -n pinot-quickstart --values /tmp/pinot-values.yaml
```
s
and then I should do helm upgrade?
x
yes you can
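(For example, the upgrade counterpart of the install command above would be something like:)
```
helm upgrade pinot pinot/pinot -n pinot-quickstart --values /tmp/pinot-values.yaml
```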
s
ok thanks @Xiang Fu, let me try this
n
Sorry for the late reply, I was asleep. As Xiang mentioned, you could kick the jobs off through a k8s job/cronjob, or alternatively you could use something like Jenkins or Airflow.
x
True, I will add an example for k8s cronjobs
also, we recently added the minion task framework
it allows setting up scheduled jobs to create segments. will add docs for this as well.