# troubleshooting
l
hey friends, i’m seeing our ZooKeeper disk space almost reach max usage. we have 5GB of disk space, and we currently have it set up with
Copy code
- name: ZK_PURGE_INTERVAL
              value: "1"
            - name: ZK_SNAP_RETAIN_COUNT
              value: "3"
in the logs i can see things getting set:
Copy code
2022-04-15 16:14:35,914 [myid:1] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2022-04-15 16:14:35,915 [myid:1] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 1
2022-04-15 16:14:35,979 [myid:1] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2022-04-15 16:14:35,980 [myid:1] - INFO  [main:ManagedUtil@46] - Log4j found with jmx enabled.
2022-04-15 16:14:35,988 [myid:1] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
however i don’t see any more cleanup logs after that, is there a reason for this? also, i can see that the space is being chugged by the logs. does anyone know why things may not be getting cleaned up? thank you, would appreciate your help.
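A minimal sketch of how to check what is actually consuming the disk inside a ZooKeeper pod (pod names follow the pinot-zookeeper-N pattern used later in this thread and may differ in your deployment):
```
# overall usage of the mounted data volume
kubectl exec pinot-zookeeper-0 -- df -h /data

# usage split between snapshots and transaction logs
kubectl exec pinot-zookeeper-0 -- du -sh /data/version-2 /data/log/version-2
```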
m
@Xiaoman Dong ^^
l
i did an ls -lh
/data/log/version-2
and things are getting filled up there
Copy code
$ ls -lh
total 3.2G
-rw-rw-r-- 1 zookeeper zookeeper 768M Apr  5 00:45 log.700000001
-rw-rw-r-- 1 zookeeper zookeeper 1.0G Apr 11 17:51 log.700014b0e
-rw-rw-r-- 1 zookeeper zookeeper 1.1G Apr 15 16:01 log.8000001e6
-rw-r--r-- 1 zookeeper zookeeper 448M Apr 18 20:11 log.80000ea05
x
A bit off topic, but I don’t think this is related to the zookeeper node size too big issue; this looks very zookeeper specific. cc @Mayank. I will try to have a quick look at zookeeper, but we may need some zookeeper expertise here
m
What’s filling the logs @Luis Fernandez? Also, for prod ZK, you probably want to have a persistent store for snapshots.
l
whatever gets stored in this path
Copy code
/data/log/version-2
they do not look like logs tho lol they look like our configs
m
Transaction logs?
l
example
Copy code
`ZKLG
?5
  "id" : "WorkflowContext",
  "simpleFields" : {
    "LAST_PURGE_TIME" : "1646699352774",
    "NAME" : "TaskQueue_RealtimeToOfflineSegmentsTask",
    "START_TIME" : "1646234218213",
    "STATE" : "IN_PROGRESS"
  },
  "mapFields" : {
    "JOB_STATES" : {
      "TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1646698682995" : "COMPLETED"
    },
    "StartTime" : {
      "TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1646698682995" : "1646698697926"
    }
  },
  "listFields" : { }
}
  "id" : "etsyads_metrics_dev__0__5__20220307T2037Z",
  "simpleFields" : {
m
Yeah, I am saying it is the ZK transaction logs
l
yes so this is filling up
any way to purge them?
m
You should mount snapshots on EVS, otherwise you are running the risk of data loss
x
Not sure if it is helpful but I searched and found this: https://github.com/31z4/zookeeper-docker/issues/30
m
cc: @Daniel Lavoie
l
aren’t snapshots here?
Copy code
/data/version-2
content:
Copy code
total 3.5M
-rw-r--r-- 1 zookeeper zookeeper    1 Apr 15 19:06 acceptedEpoch
-rw-r--r-- 1 zookeeper zookeeper    1 Apr 15 19:06 currentEpoch
-rw-rw-r-- 1 zookeeper zookeeper 481K Mar  8 19:58 snapshot.500a20f00
-rw-rw-r-- 1 zookeeper zookeeper 1.3M Apr  5 00:45 snapshot.700014b0d
-rw-rw-r-- 1 zookeeper zookeeper 1.8M Apr 11 17:53 snapshot.8000001e5
also noob question what’s EVS?
@Xiaoman Dong we are using
apache-zookeeper-3.5.5
m
Typo EBS - persistent storage
l
we are using whatever is there in the helm image
d
I think the ZK env variables changed between 3.5 and 3.7 for the snapshot configuration.
can you share the content of
/conf/zoo.cfg
from within the ZK container?
l
yes
Copy code
clientPort=2181
dataDir=/data
dataLogDir=/data/log
tickTime=2000
initLimit=10
syncLimit=10
maxClientCnxns=60
minSessionTimeout= 4000
maxSessionTimeout= 40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
4lw.commands.whitelist=*
server.1=pinot-zookeeper-0.pinot-zookeeper-headless.pinot-dev.svc.cluster.local:2888:3888
server.2=pinot-zookeeper-1.pinot-zookeeper-headless.pinot-dev.svc.cluster.local:2888:3888
server.3=pinot-zookeeper-2.pinot-zookeeper-headless.pinot-dev.svc.cluster.local:2888:3888
d
Auto purge seems to be configured as expected.
l
is it that it still hasn’t hit those parameters whatever they are?
d
I see that both dataDir and dataLogDir are within the same directory. I’ve personally encountered issues when these two are nested.
Is that a default config that got you that directory configuration?
l
yes that was default
from the helm chart
Copy code
ZK_DATA_DIR=${ZK_DATA_DIR:-"/data"}
      ZK_DATA_LOG_DIR=${ZK_DATA_LOG_DIR:-"/data/log"}
i’m not super knowledgeable about zookeeper but how are these 2 directories related?
d
the log dir stores the snapshots of your ZK state.
sorry
Data Dir stores the snapshots of your ZK state. That’s a complete copy of the state.
The Data Log Dir holds the transaction logs. That’s the critical data.
When a snapshot is made, ZK can purge the transaction logs to reduce their size.
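For reference, a sketch of what a zoo.cfg with the two directories separated might look like; the paths are illustrative and the real values come from the helm chart:
```
# snapshots (complete copies of the ZK state)
dataDir=/data/snapshots
# transaction logs, in a sibling rather than a nested directory
dataLogDir=/data/log
# keep 3 snapshots and run the purge task every hour
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
```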
l
probably don't wanna lose that transaction log yes?
d
You can lose the transactions if you have a fresh snapshot.
But you would want to let ZK do its thing.
I have reason to believe you could be impacted by a bug in the helm chart configuration. Can you manually edit the PVC to increase its size? If you are on AWS it can be done without a restart.
l
i will have to ask my team but i think we can increase it yes
/data/log/version-2
Copy code
total 3.2G
-rw-rw-r-- 1 zookeeper zookeeper 768M Apr  5 00:45 log.700000001
-rw-rw-r-- 1 zookeeper zookeeper 1.0G Apr 11 17:51 log.700014b0e
-rw-rw-r-- 1 zookeeper zookeeper 1.1G Apr 15 16:01 log.8000001e6
-rw-r--r-- 1 zookeeper zookeeper 448M Apr 18 20:37 log.80000ea05
/data/version-2
Copy code
total 3.5M
-rw-r--r-- 1 zookeeper zookeeper    1 Apr 15 19:06 acceptedEpoch
-rw-r--r-- 1 zookeeper zookeeper    1 Apr 15 19:06 currentEpoch
-rw-rw-r-- 1 zookeeper zookeeper 481K Mar  8 19:58 snapshot.500a20f00
-rw-rw-r-- 1 zookeeper zookeeper 1.3M Apr  5 00:45 snapshot.700014b0d
-rw-rw-r-- 1 zookeeper zookeeper 1.8M Apr 11 17:53 snapshot.8000001e5
d
Snapshots are happening but the log isn’t cleaned.
I observed that behaviour when dataLogDir and dataDir are nested in the same folder.
I’ll do some testing tomorrow and see if we can get a recipe to fix this.
l
oh shoot D-:
m
Thanks @Daniel Lavoie for jumping on this.
@Harish Bohara is also running into a ZK disk space issue
h
I am using EBS and one of the ZK 20GB disks was 100% consumed. This happened in a cluster which had been running for only 2 days
l
can you also share your zk configs?
d
Can you also both share the helm values of your deployment?
By removing sensitive stuff of course.
l
i have the yaml file
h
Using configs from the helm chart which is on the Pinot site. The only change I made: I increased the disk + CPU and memory for the zk nodes
l
same here and added autopurge
Copy code
env:
            - name: ZK_REPLICAS
              value: "3"
            - name: JMXAUTH
              value: "false"
            - name: JMXDISABLE
              value: "false"
            - name: JMXPORT
              value: "1099"
            - name: JMXSSL
              value: "false"
            - name: ZK_HEAP_SIZE
              value: "256M"
            - name: ZK_SYNC_LIMIT
              value: "10"
            - name: ZK_TICK_TIME
              value: "2000"
            - name: ZK_PURGE_INTERVAL
              value: "1"
            - name: ZK_SNAP_RETAIN_COUNT
              value: "3"
            - name: ZOO_INIT_LIMIT
              value: "5"
            - name: ZOO_MAX_CLIENT_CNXNS
              value: "60"
            - name: ZOO_PORT
              value: "2181"
            - name: ZOO_STANDALONE_ENABLED
              value: "false"
            - name: ZOO_TICK_TIME
              value: "2000"
Copy code
resources:
            limits: 
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /config-scripts
volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: "5Gi"
d
Setting ZK_DATA_DIR to /data/snapshots should do the trick. Big big spoiler: do not attempt this on an environment where you cannot afford to lose data. I have reason to believe that this could cause problems if all pods are updated at once. I would recommend updating your statefulset update policy to OnDelete and setting the env variable. Delete the pods one by one, checking the quorum status. Once the cluster rollout is done, you can apply the changes with helm to ensure they are permanent.
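A minimal sketch of that rollout, assuming the statefulset and pods are named pinot-zookeeper as elsewhere in this thread; adapt the names, and how you set the env variable, to your own deployment:
```
# 1. switch the statefulset to manual rollout so pods only restart when deleted
kubectl patch statefulset pinot-zookeeper \
  -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'

# 2. set ZK_DATA_DIR=/data/snapshots in the pod template (helm values / jsonnet),
#    then delete the pods one at a time and wait for each to rejoin the quorum
kubectl delete pod pinot-zookeeper-0
kubectl exec pinot-zookeeper-0 -- zkCli.sh ls /   # should list [pinot, zookeeper] once synced
```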
l
you mean this
Copy code
updateStrategy:
    type: RollingUpdate
instead of that
OnDelete
?
d
Yes.
There is a fundamental problem with Zookeeper in Kubernetes.
ZK needs to be ready before it can start synchronizing with other nodes.
Statefulset rollout only looks at pod readiness.
But when a ZK pod is ready, that doesn’t mean it’s synced.
l
which is this
Copy code
readinessProbe:
            exec:
              command:
                - sh
                - /config-scripts/ready
d
If the k8s health check is adapted to sync readiness, then the pod can never join and sync with the quorum, given that there is no networking available while readiness is not reached.
By switching to
OnDelete
you can control the rollout yourself.
l
so basically i first make the change to
OnDelete
on the updateStrategy, is that a no op change?
d
That will trigger no change
How many ZK nodes do you have?
l
right, then even when i change the folder and i merge that change that also won’t do anything yes?
I have 3
it will only take effect when i manually delete the pod and the pod comes back
so i can do it one by one
d
1. Set OnDelete
2. Set the env var
3. Delete the first pod and wait for it to be synced
4. Delete the second pod and wait
5. Delete the third pod and wait
I would also assert between 3 and 4 that the issue is fixed, and that you don’t see the transaction log growing to crazy sizes again.
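One way to check that a restarted node has rejoined and synced, sketched with the four-letter-word commands (the zoo.cfg above whitelists them with 4lw.commands.whitelist=*; this assumes nc is available in the image):
```
# role (leader/follower) and last zxid of this node
kubectl exec pinot-zookeeper-0 -- sh -c 'echo srvr | nc localhost 2181'

# against the node that reports Mode: leader, mntr shows zk_synced_followers
# (should be 2 in a healthy 3-node ensemble)
kubectl exec pinot-zookeeper-1 -- sh -c 'echo mntr | nc localhost 2181'
```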
l
cause the space usage should be better yes?
d
Yes.
Oh.
Just a minute
You will need to delete
/data/version-2
manually.
l
in each of the pods as they are coming back yes?
d
Once you’ve confirmed that the new directory is stable and not growing out of control
You can wait until after the full rollout
h
Copy code
limits:
              memory: 7G
            requests:
              cpu: 1000m
              memory: 6G
          env:
            - name: ZOO_DATA_LOG_DIR
              value: ""
            - name: ZOO_PORT_NUMBER
              value: "2181"
            - name: ZOO_TICK_TIME
              value: "2000"
            - name: ZOO_INIT_LIMIT
              value: "10"
            - name: ZOO_SYNC_LIMIT
              value: "5"
            - name: ZOO_MAX_CLIENT_CNXNS
              value: "60"
            - name: ZOO_4LW_COMMANDS_WHITELIST
              value: "srvr, mntr, ruok"
            - name: ZOO_LISTEN_ALLIPS_ENABLED
              value: "no"
            - name: ZOO_AUTOPURGE_INTERVAL
              value: "0"
            - name: ZOO_AUTOPURGE_RETAIN_COUNT
              value: "3"
            - name: ZOO_MAX_SESSION_TIMEOUT
              value: "40000"
            - name: ZOO_SERVERS
              value: zookeeper-0.zookeeper-headless.pinot.svc.cluster.local:2888:3888::1 zookeeper-1.zookeeper-headless.pinot.svc.cluster.local:2888:3888::2 zookeeper-2.zookeeper-headless.pinot.svc.cluster.local:2888:3888::3 
            - name: ZOO_ENABLE_AUTH
              value: "no"
            - name: ZOO_HEAP_SIZE
              value: "1024"
            - name: ZOO_LOG_LEVEL
              value: "ERROR"
            - name: ALLOW_ANONYMOUS_LOGIN
              value: "yes"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
          ports:
            - name: client
              containerPort: 2181
            - name: follower
              containerPort: 2888
            - name: election
              containerPort: 3888
          livenessProbe:
            exec:
              command: ['/bin/bash', '-c', 'echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok']
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 6
          readinessProbe:
            exec:
              command: ['/bin/bash', '-c', 'echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok']
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 6
          volumeMounts:
            - name: data
              mountPath: /bitnami/zookeeper
      volumes:
  volumeClaimTemplates:
    - metadata:
        name: data
        annotations:
      spec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: "100Gi"
l
@Daniel Lavoie when you said that we should be able to afford some data loss, what data would that be?
d
What I said was to test this anywhere you can afford to lose data, before rolling it out in production.
Anything stored in ZK could be corrupted if not done properly
l
oh yea 💯
d
Ensure you reproduce the FS-growth problem too on that non-critical environment.
l
that may be a tricky one, it has only been getting filled up as several days have gone by 😄
(3 weeks)
d
Yeah. Ok, well at least test the rollout recipe so you can confirm you are comfortable with it.
h
These are the steps? 1. Update StatefulSet
Copy code
updateStrategy:
      type: RollingUpdate
To
Copy code
updateStrategy:
      type: OnDelete
Also add
Copy code
env:
  - name: ZK_DATA_DIR
    value: "/bitnami/zookeeper/data/snapshots"
This will not restart the ZK nodes.
2. Delete the first ZK node manually - wait for it to be synced
3. Delete the second ZK node manually - wait for it to be synced
4. Delete the third ZK node manually - wait for it to be synced
5. Roll back the change in the StatefulSet and switch updateStrategy back to RollingUpdate
Copy code
updateStrategy:
      type: RollingUpdate
d
Copy code
env:
  - name: ZK_DATA_DIR
    value: "/data/snapshots"
and once the cluster is stabilized and the FS is not growing anymore, clean
/data/version-2
on all pods
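A hedged sketch of that cleanup, to be run only once the new snapshot directory is confirmed stable; pod names follow the pinot-zookeeper-N pattern from this thread, so double-check the path before running rm -rf:
```
# remove the old, now-unused snapshot directory on each pod, one at a time
for i in 0 1 2; do
  kubectl exec pinot-zookeeper-$i -- rm -rf /data/version-2
done
```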
h
In my setup i have
Copy code
volumeMounts:
   - name: data
     mountPath: /bitnami/zookeeper

I have no name!@zookeeper-0:/$ df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          50G  3.4G   47G   7% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1   50G  3.4G   47G   7% /etc/hosts
/dev/nvme1n1     99G  441M   98G   1% /bitnami/zookeeper
shm              64M     0   64M   0% /dev/shm
tmpfs           3.8G   12K  3.8G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           3.8G     0  3.8G   0% /proc/acpi
tmpfs           3.8G     0  3.8G   0% /sys/firmware


zookeeper-0:/bitnami/zookeeper/data$ ls -la
total 16
drwxrwsr-x 3 1001 1001 4096 Apr 19 07:58 .
drwxrwsr-x 4 root 1001 4096 Apr 19 07:58 ..
-rw-rw-r-- 1 1001 1001    2 Apr 19 07:58 myid
drwxrwsr-x 2 1001 1001 4096 Apr 19 15:08 version-2
In the setup I have, the data dir is “/bitnami/zookeeper/data”
d
Ok, I see. Indeed your change makes sense
h
However I still don’t understand why this fix should work 🙂.
What will happen after making this change? Will the logs go to the “/data/snapshots” dir??
d
Yes
You have the snapshot and log dir in the same
/data
path.
h
Copy code
I think I should do 2 changes: my configs do now have both ZK_DATA_DIR and ZK_DATA_LOG_DIR.
 
env:
   - name: ZK_DATA_DIR
     value: /bitnami/zookeeper/data/snapshots
   - name: ZK_DATA_LOG_DIR
     value: /bitnami/zookeeper/data/log
d
Yes indeed.
ZK_DATA_LOG_DIR
will default to
/data/log/
@Luis Fernandez I would recommend explicitly setting the value too if it’s not too late. I know that with ZK 3.7, the default value changed in the bitnami docker image.
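A sketch of how both variables could be set explicitly on the statefulset for this chart; the variable names are the ones already used in this thread and the values mirror the recommendation above:
```
env:
  - name: ZK_DATA_DIR
    value: "/data/snapshots"
  - name: ZK_DATA_LOG_DIR
    value: "/data/log"
```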
l
will do so
also another change for the OSS may be to change this to
/data/snapshot
instead of that so that by default future clients don’t run into it
d
Yes but let’s confirm this fixes the issue from your end.
We don’t recommend using Zookeeper in helm for production.
l
what should we use
d
/data/log
l
oh wait sorry, you said you don’t recommend using zookeeper in helm for production, what should it be instead?
d
Sorry 🤣
Zookeeper on baremetal
or VM.
there is a Zookeeper operator in the wild but I have no experience with it
We have written our own Operator for zookeeper.
h
Copy code
$printenv
ZK_DATA_LOG_DIR=/bitnami/zookeeper/data/log
ZK_DATA_DIR=/bitnami/zookeeper/data/snapshots
...

I dont see /bitnami/zookeeper/data/snapshots or  /bitnami/zookeeper/data/logs dir after restart..
d
what is your
/conf/zoo.cfg
config?
h
This is from the pod
Copy code
I have no name!@zookeeper-2:/$ cat /opt/bitnami/zookeeper/bin/../conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.
dataDir=/bitnami/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=60
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# <http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance>
#
# The number of snapshots to retain in dataDir
autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
autopurge.purgeInterval=0

## Metrics Providers
#
# <https://prometheus.io> Metrics Exporter
#metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
#metricsProvider.httpPort=7000
#metricsProvider.exportJvmInfo=true
preAllocSize=65536
snapCount=100000
maxCnxns=0
reconfigEnabled=false
quorumListenOnAllIPs=false
4lw.commands.whitelist=srvr, mntr, ruok
maxSessionTimeout=40000
server.1=zookeeper-0.zookeeper-headless.pinot.svc.cluster.local:2888:3888;2181
server.2=zookeeper-1.zookeeper-headless.pinot.svc.cluster.local:2888:3888;2181
server.3=zookeeper-2.zookeeper-headless.pinot.svc.cluster.local:2888:3888;2181
I have no name!@zookeeper-2:/$
l
btw you also need to set purgeInterval=1
d
I don’t see
dataDir
l
and I don’t see the
dataLogDir
d
Harish you could be facing a different issue than Luis.
h
I am using the helm chart from the Pinot site. I don’t think it takes the config file from the volume mount. It must be the default config from the ZK docker image
d
Yeah
h
I think I can change “ZOO_DATA_LOG_DIR=/bitnami/zookeeper/data/log” but “ZK_DATA_DIR” will remain “/bitnami/zookeeper/data”? Will this work out?
d
What we are suspecting is that having both dirs in the same /data dir causes the issue.
Harish, I think you are simply encountering a snapshot misconfiguration issue.
Not the same issue as Luis’s.
h
Ok.. However, I did get the problem where my ZK failed to start due to a “No space left” issue when this disk was 100% full. Is this not due to this config?
l
I think you may have 2 issues, one of them the same one about space getting filled up: looking at your zookeeper config I can tell that you don’t have purging activated, plus the issue that Daniel and I have been trying to solve
m
This thread is now blog worthy 😅
Or at least a recipe. @Mark Needham
d
Zookeeper on kubernetes should be a book by itself 😞
l
no wonder why 😄
there were some issues with our builds, i’m gonna try to roll this out on dev now
how do i know when a particular zookeeper is ready and good to go
d
You can run
zkCli.sh ls
command on it
You should see the pinot state
l
getting this in one of the zookeepers when it’s coming back:
Copy code
2022-04-19 17:47:13,503 [myid:3] - ERROR [main:QuorumPeer@955] - Unable to load database on disk
java.io.IOException: No snapshot found, but there are log entries. Something is broken!
	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:211)
	at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
	at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:919)
	at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:905)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
2022-04-19 17:47:13,505 [myid:3] - ERROR [main:QuorumPeerMain@101] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server 
	at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:956)
	at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:905)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Caused by: java.io.IOException: No snapshot found, but there are log entries. Something is broken!
	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:211)
	at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
	at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:919)
d
Yeah, the snapshot dir changed and the logs are still present, so it complains.
When you delete the pod, add the pvc delete in the same command. It will reprovision an empty pvc
l
noob question how do i tell it to remove the pvc as well?
d
kubectl delete pod <pod-name> && kubectl delete pvc <pvc-name>
Make sure you delete the right pvc 😛
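For context, the PVC created from a volumeClaimTemplate named data is typically called data-&lt;pod-name&gt;, so the pair of commands might look like this sketch (verify the PVC name first):
```
# confirm the PVC naming before deleting anything
kubectl get pvc | grep zookeeper

# delete the pod and its claim together so an empty volume is reprovisioned
kubectl delete pod pinot-zookeeper-0 && kubectl delete pvc data-pinot-zookeeper-0
```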
l
i run
zkCli.sh ls
within the box?
it came back now
d
`kubectl exec <pod-name> -- zkCli.sh ls` is your friend
l
Copy code
clientPort=2181
dataDir=/data/snapshot
dataLogDir=/data/log
tickTime=2000
initLimit=10
syncLimit=10
maxClientCnxns=60
minSessionTimeout= 4000
maxSessionTimeout= 40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
4lw.commands.whitelist=*
server.1=pinot-zookeeper-0.pinot-zookeeper-headless.pinot-dev.svc.cluster.local:2888:3888
server.2=pinot-zookeeper-1.pinot-zookeeper-headless.pinot-dev.svc.cluster.local:2888:3888
server.3=pinot-zookeeper-2.pinot-zookeeper-headless.pinot-dev.svc.cluster.local:2888:3888
is it okay if i don’t see anything on
data/log
?
d
Does
zkCli.sh ls
show anything?
l
Copy code
Connecting to localhost:2181
2022-04-19 17:59:27,717 [myid:] - INFO  [main:Environment@109] - Client environment:zookeeper.version=3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
2022-04-19 17:59:27,720 [myid:] - INFO  [main:Environment@109] - Client environment:host.name=pinot-zookeeper-2.pinot-zookeeper-headless.pinot-dev.svc.cluster.local
2022-04-19 17:59:27,720 [myid:] - INFO  [main:Environment@109] - Client environment:java.version=1.8.0_232
2022-04-19 17:59:27,721 [myid:] - INFO  [main:Environment@109] - Client environment:java.vendor=Oracle Corporation
2022-04-19 17:59:27,721 [myid:] - INFO  [main:Environment@109] - Client environment:java.home=/usr/local/openjdk-8
2022-04-19 17:59:27,721 [myid:] - INFO  [main:Environment@109] - Client environment:java.class.path=/apache-zookeeper-3.5.5-bin/bin/../zookeeper-server/target/classes:/apache-zookeeper-3.5.5-bin/bin/../build/classes:/apache-zookeeper-3.5.5-bin/bin/../zookeeper-server/target/lib/*.jar:/apache-zookeeper-3.5.5-bin/bin/../build/lib/*.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/zookeeper-jute-3.5.5.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/zookeeper-3.5.5.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/slf4j-log4j12-1.7.25.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/slf4j-api-1.7.25.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/netty-all-4.1.29.Final.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/log4j-1.2.17.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/json-simple-1.1.1.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jline-2.11.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jetty-util-9.4.17.v20190418.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jetty-servlet-9.4.17.v20190418.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jetty-server-9.4.17.v20190418.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jetty-security-9.4.17.v20190418.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jetty-io-9.4.17.v20190418.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jetty-http-9.4.17.v20190418.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/javax.servlet-api-3.1.0.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jackson-databind-2.9.8.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jackson-core-2.9.8.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/jackson-annotations-2.9.0.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/commons-cli-1.2.jar:/apache-zookeeper-3.5.5-bin/bin/../lib/audience-annotations-0.5.0.jar:/apache-zookeeper-3.5.5-bin/bin/../zookeeper-*.jar:/apache-zookeeper-3.5.5-bin/bin/../zookeeper-server/src/main/resources/lib/*.jar:/conf:
2022-04-19 17:59:27,724 [myid:] - INFO  [main:Environment@109] - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2022-04-19 17:59:27,724 [myid:] - INFO  [main:Environment@109] - Client environment:java.io.tmpdir=/tmp
2022-04-19 17:59:27,724 [myid:] - INFO  [main:Environment@109] - Client environment:java.compiler=<NA>
2022-04-19 17:59:27,724 [myid:] - INFO  [main:Environment@109] - Client environment:os.name=Linux
2022-04-19 17:59:27,724 [myid:] - INFO  [main:Environment@109] - Client environment:os.arch=amd64
2022-04-19 17:59:27,724 [myid:] - INFO  [main:Environment@109] - Client environment:os.version=5.4.144+
2022-04-19 17:59:27,724 [myid:] - INFO  [main:Environment@109] - Client environment:user.name=zookeeper
2022-04-19 17:59:27,725 [myid:] - INFO  [main:Environment@109] - Client environment:user.home=/home/zookeeper
2022-04-19 17:59:27,725 [myid:] - INFO  [main:Environment@109] - Client environment:user.dir=/apache-zookeeper-3.5.5-bin
2022-04-19 17:59:27,725 [myid:] - INFO  [main:Environment@109] - Client environment:os.memory.free=11MB
2022-04-19 17:59:27,726 [myid:] - INFO  [main:Environment@109] - Client environment:os.memory.max=247MB
2022-04-19 17:59:27,726 [myid:] - INFO  [main:Environment@109] - Client environment:os.memory.total=15MB
2022-04-19 17:59:27,728 [myid:] - INFO  [main:ZooKeeper@868] - Initiating client connection, connectString=localhost:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@3b95a09c
2022-04-19 17:59:27,731 [myid:] - INFO  [main:X509Util@79] - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2022-04-19 17:59:27,800 [myid:] - INFO  [main:ClientCnxnSocket@237] - jute.maxbuffer value is 4194304 Bytes
2022-04-19 17:59:27,806 [myid:] - INFO  [main:ClientCnxn@1653] - zookeeper.request.timeout value is 0. feature enabled=
ls [-s] [-w] [-R] path
command terminated with exit code 1
d
zkCli.sh ls /
l
Copy code
[pinot, zookeeper]
that means it’s cool right?
d
ZK seems happy
Can you share the startup logs?
l
Copy code
2022-04-19 17:54:15,446 [myid:3] - WARN  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):Learner@282] - Unexpected exception, tries=2, remaining init limit=17995, connecting to pinot-zookeeper-0.pinot-zookeeper-headless.pinot-dev.svc.cluster.local/10.12.177.117:2888
java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:233)
	at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:262)
	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:77)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1271)
2022-04-19 17:54:16,448 [myid:3] - WARN  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):Learner@282] - Unexpected exception, tries=3, remaining init limit=16993, connecting to pinot-zookeeper-0.pinot-zookeeper-headless.pinot-dev.svc.cluster.local/10.12.177.117:2888
java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:233)
	at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:262)
	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:77)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1271)
2022-04-19 17:54:17,450 [myid:3] - ERROR [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):Learner@277] - Unexpected exception, retries exceeded. tries=4, remaining init limit=15991, connecting to pinot-zookeeper-0.pinot-zookeeper-headless.pinot-dev.svc.cluster.local/10.12.177.117:2888
java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:233)
	at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:262)
	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:77)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1271)
2022-04-19 17:54:17,450 [myid:3] - WARN  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):Follower@96] - Exception when following the leader
java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:233)
	at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:262)
	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:77)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1271)
2022-04-19 17:54:17,450 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):Follower@201] - shutdown called
java.lang.Exception: shutdown Follower
	at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:201)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1275)
2022-04-19 17:54:17,451 [myid:3] - WARN  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1318] - PeerState set to LOOKING
2022-04-19 17:54:17,451 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1193] - LOOKING
2022-04-19 17:54:17,451 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id =  3, proposed zxid=0x0
2022-04-19 17:54:17,452 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x2 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2022-04-19 17:54:17,453 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0xa00000018 (n.zxid), 0x1 (n.round), FOLLOWING (n.state), 1 (n.sid), 0xb (n.peerEPoch), LOOKING (my state)0 (n.config version)
2022-04-19 17:54:17,454 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0xa00000018 (n.zxid), 0x1 (n.round), LEADING (n.state), 2 (n.sid), 0xb (n.peerEPoch), LOOKING (my state)0 (n.config version)
2022-04-19 17:54:17,454 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1269] - FOLLOWING
Copy code
+ /config-scripts/run
+ exec java -cp '/apache-zookeeper-3.5.5-bin/lib/*:/apache-zookeeper-3.5.5-bin/*jar:/conf:' -Xmx256M -Xms256M org.apache.zookeeper.server.quorum.QuorumPeerMain /conf/zoo.cfg
2022-04-19 17:54:12,120 [myid:] - INFO  [main:QuorumPeerConfig@133] - Reading configuration from: /conf/zoo.cfg
2022-04-19 17:54:12,125 [myid:] - INFO  [main:QuorumPeerConfig@385] - clientPortAddress is 0.0.0.0/0.0.0.0:2181
2022-04-19 17:54:12,125 [myid:] - INFO  [main:QuorumPeerConfig@389] - secureClientPort is not set
2022-04-19 17:54:12,203 [myid:3] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2022-04-19 17:54:12,205 [myid:3] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 1
2022-04-19 17:54:12,205 [myid:3] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2022-04-19 17:54:12,206 [myid:3] - INFO  [main:ManagedUtil@46] - Log4j found with jmx enabled.
2022-04-19 17:54:12,213 [myid:3] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2022-04-19 17:54:12,315 [myid:3] - INFO  [main:QuorumPeerMain@141] - Starting quorum peer
2022-04-19 17:54:12,320 [myid:3] - INFO  [main:ServerCnxnFactory@135] - Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory
2022-04-19 17:54:12,321 [myid:3] - INFO  [main:NIOServerCnxnFactory@673] - Configuring NIO connection handler with 10s sessionless connection timeout, 1 selector thread(s), 2 worker threads, and 64 kB direct buffers.
2022-04-19 17:54:12,399 [myid:3] - INFO  [main:NIOServerCnxnFactory@686] - binding to port 0.0.0.0/0.0.0.0:2181
2022-04-19 17:54:12,419 [myid:3] - INFO  [main:Log@193] - Logging initialized @583ms to org.eclipse.jetty.util.log.Slf4jLog
2022-04-19 17:54:12,902 [myid:3] - WARN  [main:ContextHandler@1588] - o.e.j.s.ServletContextHandler@1bce4f0a{/,null,UNAVAILABLE} contextPath ends with /*
2022-04-19 17:54:12,902 [myid:3] - WARN  [main:ContextHandler@1599] - Empty contextPath
2022-04-19 17:54:12,911 [myid:3] - INFO  [main:X509Util@79] - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2022-04-19 17:54:12,912 [myid:3] - INFO  [main:QuorumPeer@1488] - Local sessions disabled
2022-04-19 17:54:12,912 [myid:3] - INFO  [main:QuorumPeer@1499] - Local session upgrading disabled
2022-04-19 17:54:12,912 [myid:3] - INFO  [main:QuorumPeer@1466] - tickTime set to 2000
2022-04-19 17:54:12,913 [myid:3] - INFO  [main:QuorumPeer@1510] - minSessionTimeout set to 4000
2022-04-19 17:54:12,913 [myid:3] - INFO  [main:QuorumPeer@1521] - maxSessionTimeout set to 40000
2022-04-19 17:54:12,913 [myid:3] - INFO  [main:QuorumPeer@1536] - initLimit set to 10
2022-04-19 17:54:12,921 [myid:3] - INFO  [main:ZKDatabase@117] - zookeeper.snapshotSizeFactor = 0.33
2022-04-19 17:54:12,922 [myid:3] - INFO  [main:QuorumPeer@1781] - Using insecure (non-TLS) quorum communication
2022-04-19 17:54:12,997 [myid:3] - INFO  [main:QuorumPeer@1787] - Port unification disabled
2022-04-19 17:54:12,997 [myid:3] - INFO  [main:QuorumPeer@2154] - QuorumPeer communication is not secured! (SASL auth disabled)
2022-04-19 17:54:12,997 [myid:3] - INFO  [main:QuorumPeer@2183] - quorum.cnxn.threads.size set to 20
2022-04-19 17:54:12,998 [myid:3] - INFO  [main:FileTxnSnapLog@372] - Snapshotting: 0x0 to /data/snapshot/version-2/snapshot.0
2022-04-19 17:54:13,001 [myid:3] - INFO  [main:QuorumPeer@931] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
2022-04-19 17:54:13,006 [myid:3] - INFO  [main:QuorumPeer@946] - acceptedEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
2022-04-19 17:54:13,012 [myid:3] - INFO  [main:Server@370] - jetty-9.4.17.v20190418; built: 2019-04-18T19:45:35.259Z; git: aa1c656c315c011c01e7b21aabb04066635b9f67; jvm 1.8.0_232-b09
d
Can you get it from the beginning? I’m interested in the bit that shows the startup config.
thank you
Try moving on to the next server now
l
moving to next one
Copy code
[pinot, zookeeper]
d
can you run
ls /data/log/
on the first and second zk pods?
l
Copy code
$ kubectl exec pinot-zookeeper-1 -- ls /data/log/
version-2
Copy code
$ kubectl exec pinot-zookeeper-2 -- ls /data/log/
version-2
d
then check within
/data/log/version-2
l
Copy code
$ kubectl exec pinot-zookeeper-2 -- ls /data/log/version-2
log.b000011cd
Copy code
$ kubectl exec pinot-zookeeper-1 -- ls /data/log/version-2
log.c00000001
d
alright. Now for the last one
l
here we go
Copy code
[pinot, zookeeper]
Copy code
$ kubectl exec pinot-zookeeper-0 -- ls /data/log/version-2
log.c000002a3
d
So you had the issue on dev?
l
prod is gonna get to the same level 😄
but for some reason it hasn’t gotten to the level dev was at
d
Sounds like you are ready for surgery
l
my hands are sweating 😄 haha
d
Oh before you proceed
I would add the helm values and do a helm upgrade to ensure things are not blowing up for your next rollout
l
i’m not sure i follow 😄
d
So you manually added an env variable outside of helm.
Next time you run helm upgrade, that variable will be missing.
l
ohh you mean that when there’s a helm upgrade that i may need to apply, then it would be gone from here?
Copy code
env:
  - name: ZK_REPLICAS
    value: "3"
  - name: JMXAUTH
    value: "false"
  - name: JMXDISABLE
    value: "false"
on the statefulset.yaml
d
you can add the env variable as part of a helm value
zookeeper.env.ZK_DATA_LOG_DIR=/data/log
zookeeper.env.ZK_DATA_DIR=/data/snapshots
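A hedged sketch of what that could look like in a values file, assuming the chart exposes a zookeeper.env map as the paths above suggest (key names depend on the chart version):
```
zookeeper:
  env:
    ZK_DATA_LOG_DIR: /data/log
    ZK_DATA_DIR: /data/snapshots
```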
l
i think we have sort of a different setup, we use jsonnet and that’s what i was editing; we don’t regularly use
helm upgrade
d
As long as you don’t lose the env var during your next upgrade 😛
l
i will keep note of that for sure
h
FYI - I had a setup where the ZK logs reached 20GB. I am using “docker.io/bitnami/zookeeper:3.7.0-debian-10-r56” -> setting ZOO_AUTOPURGE_INTERVAL=1 and restarting ZK fixed it
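For the bitnami-based chart shown earlier in this thread, that change is just the autopurge env vars on the ZK statefulset (a sketch; the rest of the env list is the one posted above):
```
env:
  - name: ZOO_AUTOPURGE_INTERVAL    # purge task interval in hours; 0 disables it
    value: "1"
  - name: ZOO_AUTOPURGE_RETAIN_COUNT
    value: "3"
```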
l
this has been deployed to prod flawlessly, no issues at all
@Daniel Lavoie thank you so much for your support and help on how to best deploy this
i’m gonna keep a watch on the logs and see if things get filled up again
d
Sweet, happy this worked out all the way up to production
🍷