Lari Hotari
05/21/2025, 8:44 AM
Lari Hotari
05/22/2025, 7:54 AM
Wallace Peng
05/23/2025, 4:45 AM
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-5312-2718615 can not send message to the topic <persistent://14748/change/log> within given timeout : createdAt 126.9 seconds ago, firstSentAt 126.9 seconds ago, lastSentAt 126.9 seconds ago, retryCount 1
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2094)
at org.apache.pulsar.client.impl.ProducerImpl$DefaultSendMessageCallback.onSendComplete(ProducerImpl.java:405)
at org.apache.pulsar.client.impl.ProducerImpl$DefaultSendMessageCallback.sendComplete(ProducerImpl.java:386)
at org.apache.pulsar.client.impl.ProducerImpl$OpSendMsg.sendComplete(ProducerImpl.java:1568)
at org.apache.pulsar.client.impl.ProducerImpl.lambda$failPendingMessages$18(ProducerImpl.java:2124)
at java.base/java.util.ArrayDeque.forEach(ArrayDeque.java:889)
at org.apache.pulsar.client.impl.ProducerImpl$OpSendMsgQueue.forEach(ProducerImpl.java:1659)
at org.apache.pulsar.client.impl.ProducerImpl.failPendingMessages(ProducerImpl.java:2114)
at org.apache.pulsar.client.impl.ProducerImpl.lambda$failPendingMessages$19(ProducerImpl.java:2146)
at org.apache.pulsar.shade.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at org.apache.pulsar.shade.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at org.apache.pulsar.shade.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at org.apache.pulsar.shade.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
at org.apache.pulsar.shade.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
at org.apache.pulsar.shade.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-5312-2718615 can not send message to the topic <persistent://14748/change/log> within given timeout : createdAt 126.9 seconds ago, firstSentAt 126.9 seconds ago, lastSentAt 126.9 seconds ago, retryCount 1
at org.apache.pulsar.client.impl.ProducerImpl$OpSendMsg.sendComplete(ProducerImpl.java:1565)
... 13 common frames omitted
Broker:
``````
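For context on the exception above: this is the producer send timeout firing, i.e. the pending message was not acknowledged by the broker within the configured window. A minimal Java sketch of where that timeout is set (service URL, topic, and the 30-second value are illustrative; 30s is the client default, not necessarily what this setup uses):

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SendTimeoutSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // illustrative service URL
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/change-log") // illustrative topic
                // If a message is not acknowledged by the broker within this window,
                // the pending send fails with PulsarClientException$TimeoutException.
                .sendTimeout(30, TimeUnit.SECONDS) // 30s is the client default
                .create();

        producer.sendAsync("hello").exceptionally(ex -> {
            ex.printStackTrace(); // a TimeoutException surfaces here
            return null;
        });

        producer.flush();
        producer.close();
        client.close();
    }
}
```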
Wallace Peng
05/23/2025, 4:47 AM
Lari Hotari
05/23/2025, 8:27 AM
Lari Hotari
05/23/2025, 2:01 PM
Lari Hotari
05/30/2025, 6:01 AM
SiNan Liu
06/01/2025, 1:28 PM
SiNan Liu
06/02/2025, 12:54 PM
`ackTimeout`, `negativeAck`, or `reconsumeLater` all cause data re-delivery, but these do not support long delays.
This PR aims to support sending delayed messages on the consumer side, without issues like write amplification or read amplification.
I hope everyone can participate in the discussion. If this plan is feasible, I will propose a PIP.
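To make the three redelivery paths mentioned above concrete, a minimal Java sketch (topic, subscription, and delay values are placeholders; note that `reconsumeLater` requires `enableRetry(true)` on the consumer):

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class RedeliveryPathsSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder
                .build();

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/my-topic") // placeholder
                .subscriptionName("my-sub")                    // placeholder
                // 1. ackTimeout: anything left unacked this long is redelivered.
                .ackTimeout(30, TimeUnit.SECONDS)
                // Needed for reconsumeLater (routes redelivery through a retry topic).
                .enableRetry(true)
                .subscribe();

        Message<String> msg = consumer.receive();

        // 2. negativeAck: request redelivery after the negative-ack delay, or
        // 3. reconsumeLater: redeliver after an explicit delay via the retry topic.
        // In practice you would pick one path per message; here we use reconsumeLater.
        // consumer.negativeAcknowledge(msg);
        consumer.reconsumeLater(msg, 10, TimeUnit.MINUTES);

        consumer.close();
        client.close();
    }
}
```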
Lari Hotari
06/02/2025, 3:20 PM
Wallace Peng
06/04/2025, 1:06 AM
Lari Hotari
06/04/2025, 12:34 PM
The expectation that the subscription state (`markDeletePosition`) on the remote cluster (`c2`) loses at most `replicatedSubscriptionsSnapshotFrequencyMillis` worth of acknowledgements can be unmet due to several factors inherent in the design and operational conditions:
> 1. Snapshot Lifecycle Latency:
> ◦ Process: A snapshot isn't instantaneous. It involves `c1` sending a request marker, `c2` receiving it, `c2` sending a response marker (with its current end-of-topic position), and `c1` receiving this response to finalize the snapshot. This entire round-trip takes time, potentially affected by replication lag between `c1` and `c2`, and then back from `c2` to `c1`.
> ◦ Impact: Even if snapshots are initiated every `replicatedSubscriptionsSnapshotFrequencyMillis`, the actual completion and availability of a new snapshot for use can be later. If the `markDeletePosition` on `c1` advances, it can only use the latest completed snapshot. If many messages are acked on `c1` while a snapshot is in progress or delayed, the update to `c2` will be based on an older snapshot, making `c2`'s position appear more stale than just the snapshot frequency.
> 2. Snapshot Timeouts (`replicatedSubscriptionsSnapshotTimeoutSeconds`):
> ◦ Mechanism: If a remote cluster (e.g., `c2`) fails to send a `SNAPSHOT_RESPONSE` within `replicatedSubscriptionsSnapshotTimeoutSeconds` (default 30s), the snapshot attempt is considered timed out on `c1`.
> ◦ Impact: When snapshots time out, new valid mappings between `c1` and `c2` positions are not established. `ReplicatedSubscriptionsController` on `c1` will skip creating new snapshots if a pending one has never succeeded (to avoid flooding) or if recent attempts have timed out. It will rely on the `lastCompletedSnapshotId`. If this last successful snapshot is old, the `markDeletePosition` on `c2` will not reflect recent acknowledgements on `c1`, leading to a staleness far greater than `replicatedSubscriptionsSnapshotFrequencyMillis`. This is a critical point where the expectation can be significantly missed.
> 3. Disconnected Replicators or Cluster Unavailability:
> ◦ Mechanism: The `ReplicatedSubscriptionsController` explicitly checks if all remote replicators are connected before initiating a new snapshot. If any replicator is down, it skips the snapshot.
> ◦ Impact: Similar to timeouts, if `c2` is unreachable or its replicator link is broken, no new snapshots will be formed. `c1` will continue to use the last known good snapshot, causing `c2`'s subscription state to become increasingly stale as messages are acknowledged on `c1`.
> 4. No New Message Production on the Topic:
> ◦ Mechanism: `ReplicatedSubscriptionsController` has an optimization: if no new user messages have been published to the topic on `c1` since the last completed snapshot (`topic.getLastMaxReadPositionMovedForwardTimestamp() < lastCompletedSnapshotStartTime`), it skips creating a new snapshot.
> ◦ Impact: If the topic is idle in terms of new publishes, but consumers on `c1` are catching up and acknowledging older messages (thus advancing `markDeletePosition`), new snapshots correlating these newer `markDeletePosition` states on `c1` with `c2` positions might not be created frequently. The updates to `c2` will use older snapshot data.
> 5. Nature of `markDeletePosition` with Individual Acknowledges:
> ◦ PIP-33 Limitation: The system replicates the `markDeletePosition`. Individual acknowledgements only advance the `markDeletePosition` when a contiguous block of messages from the current `markDeletePosition` is acknowledged. If there are "holes" due to out-of-order individual acks, the `markDeletePosition` lags.
> ◦ Impact: Even if snapshots are frequent and successful, the state replicated to `c2` is the `markDeletePosition` from `c1`. If this position itself is lagging on `c1` due to ack patterns, `c2` will reflect that same lag. This isn't a loss of acknowledged state by the replication mechanism beyond `replicatedSubscriptionsSnapshotFrequencyMillis`, but rather a characteristic of what is being replicated.
> 6. Initial State and First Snapshot:
> ◦ When a replicated subscription is first enabled or a topic is new, it takes time for the first successful snapshot cycle to complete. Acknowledgements on `c1` before this first usable snapshot is available won't be immediately reflected on `c2`.
> In essence, `replicatedSubscriptionsSnapshotFrequencyMillis` is the attempt frequency, not a guaranteed bound on state divergence. The actual synchronization delay is subject to the successful and timely completion of the entire snapshotting and update marker propagation process, which can be affected by network conditions, cluster load, and remote cluster responsiveness. The `replicatedSubscriptionsSnapshotTimeoutSeconds` plays a more significant role in how long state can remain divergent if issues arise.
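For readers skimming the thread, a minimal Java sketch of the consumer-side switch this snapshot mechanism serves, with the broker settings discussed above noted in comments (topic, subscription, and URLs are placeholders; the 1000 ms frequency default is my assumption, the 30 s timeout default is as stated above):

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class ReplicatedSubscriptionSketch {
    public static void main(String[] args) throws Exception {
        // Broker-side settings discussed above (per cluster, broker.conf):
        //   replicatedSubscriptionsSnapshotFrequencyMillis=1000  (assumed default)
        //   replicatedSubscriptionsSnapshotTimeoutSeconds=30     (default, per the thread)
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://c1-broker:6650") // placeholder URL for cluster c1
                .build();

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/replicated-topic") // placeholder
                .subscriptionName("my-sub")                             // placeholder
                // PIP-33: replicate this subscription's state to remote clusters.
                // What gets replicated is the markDeletePosition, subject to the
                // snapshot caveats in points 1-6 above.
                .replicateSubscriptionState(true)
                .subscribe();

        // Point 5 above: individual acks only move the markDeletePosition forward
        // when they are contiguous from the current position; out-of-order acks
        // leave holes, and the position replicated to c2 lags accordingly.
        consumer.acknowledge(consumer.receive());

        consumer.close();
        client.close();
    }
}
```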
Chris Bono
06/12/2025, 2:30 PM
I would like to get a new release of `pulsar-client-reactive`
to be included in next week’s Spring Pulsar service release. I have a small PR that updates the version number and begins this process. However, the PR requires at least one reviewer, and Lari, who usually handles the reviews, is on PTO.
If anyone has the ability to review https://github.com/apache/pulsar-client-reactive/pull/228 it would be greatly appreciated.
Thank you!
Chris Bono
06/12/2025, 6:17 PM
SiNan Liu
06/13/2025, 12:16 PM
Fabian Wikstrom
06/18/2025, 3:49 PM
Lari Hotari
06/19/2025, 9:25 PM
孟祥迎
06/25/2025, 6:26 AM
Yabin m
06/26/2025, 4:18 PM
Thomas MacKenzie
07/12/2025, 4:56 AM
太上玄元道君
07/16/2025, 10:30 AM
Lari Hotari
07/17/2025, 4:37 PM
Lari Hotari
07/23/2025, 1:04 PM
Lari Hotari
07/25/2025, 1:05 PM
Lari Hotari
07/25/2025, 2:30 PM
Lari Hotari
07/26/2025, 8:24 PM
Hideaki Oguni
07/29/2025, 8:28 AM
I ran the test with `dispatcherRetryBackoffInitialTimeInMs=1` and `dispatcherRetryBackoffMaxTimeInMs=1000`.
As a result, I observed 27,196 ack holes, which is nearly the same as before the backoff mechanism was introduced.
2025-07-25T20:45:04,685+0900 [main] INFO playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
max latency difference of subsequent messages: 58393 ms
max ack holes: 27196
2025-07-25T20:45:04,686+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer1 received 263558 unique messages 0 duplicates in 472 s, max latency difference of subsequent messages 58310 ms
2025-07-25T20:45:04,686+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer2 received 241250 unique messages 0 duplicates in 463 s, max latency difference of subsequent messages 58393 ms
2025-07-25T20:45:04,686+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer3 received 221278 unique messages 0 duplicates in 432 s, max latency difference of subsequent messages 52344 ms
2025-07-25T20:45:04,686+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer4 received 273914 unique messages 0 duplicates in 472 s, max latency difference of subsequent messages 58393 ms
Additionally, when I applied a modification to limit the scope of backoff only to replay messages, the number of ack holes reduced to 835.
2025-07-25T19:04:26,485+0900 [main] INFO playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
max latency difference of subsequent messages: 3309 ms
max ack holes: 835
2025-07-25T19:04:26,485+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer1 received 263558 unique messages 0 duplicates in 472 s, max latency difference of subsequent messages 2473 ms
2025-07-25T19:04:26,485+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer2 received 241250 unique messages 0 duplicates in 471 s, max latency difference of subsequent messages 1230 ms
2025-07-25T19:04:26,485+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer3 received 221278 unique messages 0 duplicates in 471 s, max latency difference of subsequent messages 1230 ms
2025-07-25T19:04:26,485+0900 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer4 received 273914 unique messages 0 duplicates in 471 s, max latency difference of subsequent messages 3309 ms
This suggests that it is the replay messages that are failing to be delivered, and that the backoff is being reset by the delivery of normal messages.
There is a possibility that the same behavior exists on the master branch.
This would mean that, with the default settings, the backoff mechanism has little effect.
However, I don't consider this a serious issue, since a look-ahead limit has been introduced (https://github.com/apache/pulsar/pull/23231).
The default values were changed in https://github.com/apache/pulsar/pull/23340 due to issues caused by backoff under low traffic.
However, even when I set `dispatcherRetryBackoffInitialTimeInMs=100` and `dispatcherRetryBackoffMaxTimeInMs=1000`, I was unable to reproduce an increase in consumption latency under low traffic.
I couldn't find any mention of this issue in the mailing list or Slack archives.
Could you share the specific conditions under which this issue was observed?
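As a side note on the ack-hole counts reported above: a hedged sketch of one way to observe them outside the test harness, via the admin API's internal topic stats (URL, topic, and subscription are placeholders, and I'm assuming `totalNonContiguousDeletedMessagesRange` is the counter that corresponds to those ack-hole numbers):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistentTopicInternalStats;

public class AckHoleCheck {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build();

        String topic = "persistent://public/default/key-shared-test"; // placeholder
        PersistentTopicInternalStats stats = admin.topics().getInternalStats(topic);

        // Each cursor is a subscription; the number of non-contiguous acknowledged
        // ranges beyond markDeletePosition is commonly read as the ack-hole count
        // (field name assumed, see note above).
        stats.cursors.forEach((subscription, cursor) ->
                System.out.println(subscription + " ack holes: "
                        + cursor.totalNonContiguousDeletedMessagesRange));

        admin.close();
    }
}
```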
Lari Hotari
07/30/2025, 12:23 PM
Lari Hotari
08/04/2025, 12:50 PM
SiNan Liu
08/04/2025, 1:08 PM