# dev
  • l

    Lari Hotari

    05/21/2025, 8:44 AM
    Please help validate the Apache Pulsar releases and vote: • [VOTE] Release Apache Pulsar 3.0.12 based on 3.0.12-candidate-1 • [VOTE] Release Apache Pulsar 4.0.5 based on 4.0.5-candidate-1 • [VOTE] Release Apache Pulsar 3.3.7 based on 3.3.7-candidate-1 /cc @Zixuan Liu @Tboy @Enrico Olivelli @Nicoló Boschi @Entvex
    ✅ 1
    pulsarlogo 1
  • l

    Lari Hotari

    05/22/2025, 7:54 AM
    GitHub Actions degraded status: https://www.githubstatus.com/ "We're investigating delays with the execution of queued GitHub Actions jobs."
  • w

    Wallace Peng

    05/23/2025, 4:45 AM
    Got an error in the log, but it seems to have gone away after some time. @Lari Hotari or someone, any idea? The bookie failed to write, then wrote successfully after some retries. I noticed that the bookie had some compaction-related work running around the same time. Is this related?
    java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-5312-2718615 can not send message to the topic <persistent://14748/change/log> within given timeout : createdAt 126.9 seconds ago, firstSentAt 126.9 seconds ago, lastSentAt 126.9 seconds ago, retryCount 1
    	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
    	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
    	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2094)
    	at org.apache.pulsar.client.impl.ProducerImpl$DefaultSendMessageCallback.onSendComplete(ProducerImpl.java:405)
    	at org.apache.pulsar.client.impl.ProducerImpl$DefaultSendMessageCallback.sendComplete(ProducerImpl.java:386)
    	at org.apache.pulsar.client.impl.ProducerImpl$OpSendMsg.sendComplete(ProducerImpl.java:1568)
    	at org.apache.pulsar.client.impl.ProducerImpl.lambda$failPendingMessages$18(ProducerImpl.java:2124)
    	at java.base/java.util.ArrayDeque.forEach(ArrayDeque.java:889)
    	at org.apache.pulsar.client.impl.ProducerImpl$OpSendMsgQueue.forEach(ProducerImpl.java:1659)
    	at org.apache.pulsar.client.impl.ProducerImpl.failPendingMessages(ProducerImpl.java:2114)
    	at org.apache.pulsar.client.impl.ProducerImpl.lambda$failPendingMessages$19(ProducerImpl.java:2146)
    	at org.apache.pulsar.shade.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    	at org.apache.pulsar.shade.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    	at org.apache.pulsar.shade.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    	at org.apache.pulsar.shade.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
    	at org.apache.pulsar.shade.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
    	at org.apache.pulsar.shade.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    	at org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    	at java.base/java.lang.Thread.run(Thread.java:829)
    Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-5312-2718615 can not send message to the topic <persistent://14748/change/log> within given timeout : createdAt 126.9 seconds ago, firstSentAt 126.9 seconds ago, lastSentAt 126.9 seconds ago, retryCount 1
    	at org.apache.pulsar.client.impl.ProducerImpl$OpSendMsg.sendComplete(ProducerImpl.java:1565)
    	... 13 common frames omitted
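On the client, the failure above surfaces as a `CompletionException` wrapping the timeout. A minimal sketch of unwrapping it to decide whether a retry makes sense (this uses the JDK's `TimeoutException` as a stand-in for Pulsar's `PulsarClientException.TimeoutException`, and `isTimeout` is a hypothetical helper, not Pulsar client code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

public class SendTimeoutCheck {
    // Unwraps a CompletionException and reports whether the root cause is a
    // timeout, so the caller can retry the send instead of failing hard.
    public static boolean isTimeout(Throwable t) {
        Throwable cause = (t instanceof CompletionException && t.getCause() != null)
                ? t.getCause() : t;
        return cause instanceof TimeoutException;
    }

    public static void main(String[] args) {
        CompletableFuture<Void> send = new CompletableFuture<>();
        // Simulate the producer failing a pending message after its send timeout.
        send.completeExceptionally(new TimeoutException("send timed out"));
        try {
            send.join(); // join() wraps the checked cause in CompletionException
        } catch (CompletionException e) {
            System.out.println("retryable timeout: " + isTimeout(e));
        }
    }
}
```

Transient bookie-side slowness (e.g. during compaction) typically shows up exactly like this: the send times out, and a later retry succeeds once the bookie recovers.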
  • w

    Wallace Peng

    05/23/2025, 4:47 AM
    image.png
  • l

    Lari Hotari

    05/23/2025, 8:27 AM
    When merging bug fix PRs in apache/pulsar, please pay attention to adding necessary release labels before merging. https://pulsar.apache.org/contribute/maintenance-process/#current-process-and-responsibilities
  • l

    Lari Hotari

    05/23/2025, 2:01 PM
    Please vote on the Pulsar helm chart release: • [VOTE] Release Apache Pulsar Helm Chart 4.1.0 based on 4.1.0-candidate-1 /cc @Entvex @Zixuan Liu @Tboy @Enrico Olivelli @Nicoló Boschi
    ✅ 1
  • l

    Lari Hotari

    05/30/2025, 6:01 AM
    Please review a refactoring and improvement to the broker entry cache: https://github.com/apache/pulsar/pull/24363 Mailing list: https://lists.apache.org/thread/ddzzc17b0c218ozq9tx0r3rx5sgljfb0
    > Broker entry cache refactoring - PR #24363 review requested
    >
    > Hi all,
    >
    > I've submitted PR #24363 that refactors the broker entry cache eviction algorithm to fix some correctness and performance issues in the current implementation.
    >
    > The current EntryCacheDefaultEvictionPolicy has a few problems. The size-based eviction doesn't guarantee that the oldest entries are removed first - it sorts caches by size and evicts proportionally, which can leave older entries while removing newer ones. The timestamp eviction iterates through all cache instances every 10ms by default, which creates unnecessary CPU overhead, especially with many topics. There are also correctness issues when entries aren't ordered by both position and timestamp, which happens with catch-up reads and mixed read patterns.
    >
    > The refactored implementation introduces a centralized RangeCacheRemovalQueue that tracks entry insertion order using JCTools MpscUnboundedArrayQueue. This ensures both size-based and timestamp-based eviction process entries from oldest to newest, guaranteeing proper FIFO behavior. The approach also consolidates eviction handling to a single thread to avoid contention and removes some unnecessary generic type parameters from RangeCache.
    >
    > The key difference is that instead of the complex size-sorting algorithm, entries are simply processed in insertion order until the eviction target is met. This should improve cache hit rates and reduce CPU overhead from the frequent iterations.
    >
    > This work prepares the broker entry cache for further improvements to efficiently handle catch-up reads and Key_Shared subscription scenarios. The current cache has limitations around unnecessary BookKeeper reads during catch-up scenarios, poor cache hit rates for Key_Shared subscriptions with slow consumers, and cascading performance issues under high fan-out catch-up reads. The centralized eviction queue design provides a foundation for addressing these issues in future work.
    >
    > Would appreciate reviews, particularly around the eviction logic and any potential edge cases I might have missed.
    >
    > PR: https://github.com/apache/pulsar/pull/24363
    >
    > Thanks,
    > Lari
    /cc @Tboy @太上玄元道君 @Zixuan Liu @Yubiao Feng @Enrico Olivelli @Andrey @merlimat @孟祥迎 @the tumbled
    👀 2
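The eviction approach described in the PR summary (process entries in insertion order until the eviction target is met) can be sketched as follows. This is a minimal, single-threaded stand-in, not the PR's actual code: the real implementation uses a JCTools `MpscUnboundedArrayQueue`, and the names `insert`/`evictToSize` here are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class FifoEvictionSketch {
    // Entries are appended in insertion order, so the head is always the oldest.
    private final Queue<Long> removalQueue = new ArrayDeque<>();
    private long totalSize = 0;

    public void insert(long entrySizeBytes) {
        removalQueue.add(entrySizeBytes);
        totalSize += entrySizeBytes;
    }

    // Evict oldest-first until the cache fits the target size; returns the
    // number of entries evicted. No size-sorting across caches is needed.
    public int evictToSize(long targetSizeBytes) {
        int evicted = 0;
        while (totalSize > targetSizeBytes && !removalQueue.isEmpty()) {
            totalSize -= removalQueue.poll();
            evicted++;
        }
        return evicted;
    }

    public long totalSize() { return totalSize; }
}
```

The point of the design is that a single insertion-ordered queue gives both size-based and time-based eviction the same oldest-to-newest traversal, instead of the proportional per-cache eviction the old policy used.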
  • s

    SiNan Liu

    06/01/2025, 1:28 PM
    Hi all! Please discuss PIP-423: Add Support for Cancelling Individual Delayed Messages. Proposal link: https://github.com/apache/pulsar/pull/24370 Mailing list: https://lists.apache.org/thread/lo182ztgrkzlq6mbkytj8krd050yvb9w I’m looking forward to hearing from you. 👀
  • s

    SiNan Liu

    06/02/2025, 12:54 PM
    Hello everyone, I’ve raised a new feature proposal regarding delayed messages: https://github.com/apache/pulsar/pull/24372 As we all know, using `ackTimeout`, `negativeAck`, or `reconsumeLater` all cause data re-delivery, but these do not support long delays. This PR aims to support sending delayed messages on the consumer side, without issues like write amplification or read amplification. I hope everyone can participate in the discussion. If this plan is feasible, I will propose a PIP.
  • l

    Lari Hotari

    06/02/2025, 3:20 PM
    Please review https://github.com/apache/pulsar/pull/16651. There's a long-standing issue in replicated subscriptions: a replication snapshot can get missed, since it's currently processed only after the next mark-delete update. Many users have complained that subscription replication doesn't work as expected and that replication only happens after there's a pause in the message flow. This PR should address many of those issues. It's been open for several years, so it would be great to finally get the fix merged. One of the reported issues is https://github.com/apache/pulsar/issues/10054
    👏 3
  • w

    Wallace Peng

    06/04/2025, 1:06 AM
    Is there an AI agent that can help us upgrade a Pulsar cluster? And is there guidance for this as well?
  • l

    Lari Hotari

    06/04/2025, 12:34 PM
    Today I finally started experimenting with Brokk.ai, an AI tool for large code bases created by @Jonathan Ellis, and I'm amazed how useful this tool is for working on the Pulsar code base. To learn how to use Brokk, I read https://blog.brokk.ai/brokk-ai-for-large-java-codebases/ and watched this YouTube video: https://www.youtube.com/watch?v=t_7MqowT638 After this, it's fairly easy to get started, especially if you have used other similar AI tools before. What I wanted to do was a MermaidJS diagram to visualize Replicated Subscription interactions. This is the resulting diagram after a few rounds: https://gist.github.com/lhotari/96fda511a70d7de93744d868b4472b92 I gave some additional instructions in the rounds to get it laid out the way I wanted it visualized. I also used Claude AI directly to fix some issues in the resulting syntax, plus some manual edits in VSCode, before I was happy with the resulting diagram. Besides asking to create the diagram, I also asked Brokk: "Also explain the problems that might lead to not being able to replicate subscription state with the given replicatedSubscriptionsSnapshotFrequencyMillis. The users have the expectation that subscription replication loses at most up to replicatedSubscriptionsSnapshotFrequencyMillis of acknowledgement state (mark delete position) on the remote cluster side." The analysis is a really great starting point for further analysis:
    > Potential Problems Affecting Subscription State Replication Freshness:
    > The user expectation that replicated subscription state (specifically, the `markDeletePosition`) on the remote cluster (`c2`) loses at most `replicatedSubscriptionsSnapshotFrequencyMillis` worth of acknowledgements can be unmet due to several factors inherent in the design and operational conditions:
    > 1. Snapshot Lifecycle Latency:
    > ◦ Process: A snapshot isn't instantaneous. It involves `c1` sending a request marker, `c2` receiving it, `c2` sending a response marker (with its current end-of-topic position), and `c1` receiving this response to finalize the snapshot. This entire round-trip takes time, potentially affected by replication lag between `c1` and `c2`, and then back from `c2` to `c1`.
    > ◦ Impact: Even if snapshots are initiated every `replicatedSubscriptionsSnapshotFrequencyMillis`, the actual completion and availability of a new snapshot for use can be later. If the `markDeletePosition` on `c1` advances, it can only use the latest completed snapshot. If many messages are acked on `c1` while a snapshot is in progress or delayed, the update to `c2` will be based on an older snapshot, making `c2`'s position appear more stale than just the snapshot frequency.
    > 2. Snapshot Timeouts (`replicatedSubscriptionsSnapshotTimeoutSeconds`):
    > ◦ Mechanism: If a remote cluster (e.g., `c2`) fails to send a `SNAPSHOT_RESPONSE` within `replicatedSubscriptionsSnapshotTimeoutSeconds` (default 30s), the snapshot attempt is considered timed out on `c1`.
    > ◦ Impact: When snapshots time out, new valid mappings between `c1` and `c2` positions are not established. `ReplicatedSubscriptionsController` on `c1` will skip creating new snapshots if a pending one has never succeeded (to avoid flooding) or if recent attempts have timed out. It will rely on the `lastCompletedSnapshotId`. If this last successful snapshot is old, the `markDeletePosition` on `c2` will not reflect recent acknowledgements on `c1`, leading to a staleness far greater than `replicatedSubscriptionsSnapshotFrequencyMillis`. This is a critical point where the expectation can be significantly missed.
    > 3. Disconnected Replicators or Cluster Unavailability:
    > ◦ Mechanism: The `ReplicatedSubscriptionsController` explicitly checks if all remote replicators are connected before initiating a new snapshot. If any replicator is down, it skips the snapshot.
    > ◦ Impact: Similar to timeouts, if `c2` is unreachable or its replicator link is broken, no new snapshots will be formed. `c1` will continue to use the last known good snapshot, causing `c2`'s subscription state to become increasingly stale as messages are acknowledged on `c1`.
    > 4. No New Message Production on the Topic:
    > ◦ Mechanism: `ReplicatedSubscriptionsController` has an optimization: if no new user messages have been published to the topic on `c1` since the last completed snapshot (`topic.getLastMaxReadPositionMovedForwardTimestamp() < lastCompletedSnapshotStartTime`), it skips creating a new snapshot.
    > ◦ Impact: If the topic is idle in terms of new publishes, but consumers on `c1` are catching up and acknowledging older messages (thus advancing `markDeletePosition`), new snapshots correlating these newer `markDeletePosition` states on `c1` with `c2` positions might not be created frequently. The updates to `c2` will use older snapshot data.
    > 5. Nature of `markDeletePosition` with Individual Acknowledges:
    > ◦ PIP-33 Limitation: The system replicates the `markDeletePosition`. Individual acknowledgements only advance the `markDeletePosition` when a contiguous block of messages from the current `markDeletePosition` is acknowledged. If there are "holes" due to out-of-order individual acks, the `markDeletePosition` lags.
    > ◦ Impact: Even if snapshots are frequent and successful, the state replicated to `c2` is the `markDeletePosition` from `c1`. If this position itself is lagging on `c1` due to ack patterns, `c2` will reflect that same lag. This isn't a loss of acknowledged state by the replication mechanism beyond `replicatedSubscriptionsSnapshotFrequencyMillis`, but rather a characteristic of what is being replicated.
    > 6. Initial State and First Snapshot:
    > ◦ When a replicated subscription is first enabled or a topic is new, it takes time for the first successful snapshot cycle to complete. Acknowledgements on `c1` before this first usable snapshot is available won't be immediately reflected on `c2`.
    > In essence, `replicatedSubscriptionsSnapshotFrequencyMillis` is the attempt frequency, not a guaranteed bound on state divergence. The actual synchronization delay is subject to the successful and timely completion of the entire snapshotting and update marker propagation process, which can be affected by network conditions, cluster load, and remote cluster responsiveness. The `replicatedSubscriptionsSnapshotTimeoutSeconds` plays a more significant role in how long state can remain divergent if issues arise.
    👍 3
    pulsarlogo 2
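Point 5 of the analysis (the `markDeletePosition` only advancing over a contiguous acked prefix) can be illustrated with a minimal sketch. This is not Pulsar's `ManagedCursor` code, just a hedged model of the semantics, with hypothetical method names:

```java
import java.util.TreeSet;

public class MarkDeleteSketch {
    private final TreeSet<Long> individuallyAcked = new TreeSet<>();
    private long markDelete = -1; // highest contiguously-acked message id

    // Record an individual ack. The mark-delete position advances only while
    // the id immediately after it has been acked, so out-of-order acks leave
    // "holes" behind and the position lags.
    public void ack(long id) {
        individuallyAcked.add(id);
        while (individuallyAcked.remove(markDelete + 1)) {
            markDelete++;
        }
    }

    public long markDeletePosition() { return markDelete; }

    // Acked ids stranded above an unacked hole.
    public int ackHoles() { return individuallyAcked.size(); }
}
```

For example, acking ids 0, 2, 3 leaves the mark-delete position at 0 with two acked ids stranded behind the hole at 1; only after 1 is acked does the position jump to 3. Since only the mark-delete position is replicated, the remote cluster inherits exactly this lag.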
  • c

    Chris Bono

    06/12/2025, 2:30 PM
    👋 my pulsarlogo peeps… I am trying to release the next version of `pulsar-client-reactive` to be included in next week’s Spring Pulsar service release. I have a small PR that updates the version number and begins this process. However, the PR requires at least 1 reviewer, and Lari, who usually handles the reviews, is on PTO. If anyone has the ability to review https://github.com/apache/pulsar-client-reactive/pull/228 it would be greatly appreciated. Thank you!
    ✅ 1
  • c

    Chris Bono

    06/12/2025, 6:17 PM
    Release vote started for the Reactive Java client for Apache Pulsar 0.7.0 based on 0.7.0-candidate-1: https://lists.apache.org/thread/4rjybn5zwpdo02gw6j2032n06ghqjmbv Please help validate the release and vote!
    👏 2
  • s

    SiNan Liu

    06/13/2025, 12:16 PM
    Inspired by @Lari Hotari, I wrote a system prompt that enhances an LLM's ability to polish and generate PIP documents. https://github.com/apache/pulsar/pull/24328#issuecomment-2970193088 The system prompt is here: https://gist.github.com/Denovo1998/163e55b3a612873364a00cf0df5a1b95
    👍 3
    👀 1
    🥇 1
  • f

    Fabian Wikstrom

    06/18/2025, 3:49 PM
    Hi, we are seeing some issues and just submitted a high severity incident ticket. Is anyone around to help out? ❤️
    👀 2
  • l

    Lari Hotari

    06/19/2025, 9:25 PM
    ExtensibleLoadManager tests are broken in branch-3.0 and branch-3.3. Dev mailing list thread: https://lists.apache.org/thread/005qvhjvr5kpcf9qcgbjw82c51of28jy
  • u

    孟祥迎

    06/25/2025, 6:26 AM
    @Lari Hotari @Penghui @lin lin @Enrico Olivelli @Yubiao Feng @Tboy PIP-425 has already been approved by multiple reviewers during the PR review process, but we still need more votes on the mailing list. Could you please help vote when you have a moment? https://github.com/apache/pulsar/pull/24394 https://lists.apache.org/thread/c2zvjwf7bqp8nc2rpzbxd4kdtztk23xp
  • t

    Thomas MacKenzie

    07/12/2025, 4:56 AM
    Hi 👋 I have one PR up, https://github.com/apache/pulsar-client-go/pull/1390, adding properties management at the namespace level; we have a use case for it in our platform. It's my first time contributing, so feel free to give me feedback here if you need to. (also posted in #CFTLCF52N) Thank you
    ✅ 1
  • u

    太上玄元道君

    07/16/2025, 10:30 AM
    https://github.com/apache/pulsar/pull/24522 @Ran Gao @Penghui @Zixuan Liu @Lari Hotari PTAL
  • l

    Lari Hotari

    07/17/2025, 4:37 PM
    [DISCUSS] Pulsar releases 3.0.13, 3.3.8 and 4.0.6 thread: https://lists.apache.org/thread/mt2336hjvntnzvky109x77j6gm6zs6w7
    👀 1
    👍 1
  • l

    Lari Hotari

    07/23/2025, 1:04 PM
    We don't have Pulsar 3.0.x and 3.3.x CI in a green state, and that's delaying the next releases. Latest runs: • branch-3.0 https://github.com/apache/pulsar/actions/runs/16466552585 • branch-3.3 https://github.com/apache/pulsar/actions/runs/16466555539 Since I've cherry-picked more changes, I'll trigger new runs and start investigating. @Zixuan Liu, you were recently fixing some problems around failing ExtensibleLoadManager tests. Did the tests pass in your env?
  • l

    Lari Hotari

    07/25/2025, 1:05 PM
    I need a quick review on https://github.com/apache/pulsar/pull/24562 to get one CVE off the list for next releases. Any committer available for review?
    👀 1
  • l

    Lari Hotari

    07/25/2025, 2:30 PM
    Another one: https://github.com/apache/pulsar/pull/24564
  • l

    Lari Hotari

    07/26/2025, 8:24 PM
    Latest update about the 3.0.13/3.3.8/4.0.6 release progress: https://lists.apache.org/thread/d7zn2sns80o6mqbq6cmnllnpl7kml62x
  • h

    Hideaki Oguni

    07/29/2025, 8:28 AM
    Hi, @Yubiao Feng @Lari Hotari, our company has encountered issues several times when there is a slow consumer in Key_Shared. As one of the potential solutions, I’ve been cherry-picking the backoff feature (https://github.com/apache/pulsar/pull/23226) and verifying its behavior. During this process, I observed a few concerns. First, I ran the reproduction code for the ack holes issue (https://github.com/apache/pulsar/issues/23200) with `dispatcherRetryBackoffInitialTimeInMs=1` and `dispatcherRetryBackoffMaxTimeInMs=1000`. As a result, I observed 27,196 ack holes, which is nearly the same as before the backoff mechanism was introduced.
    2025-07-25T20:45:04,685+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
    max latency difference of subsequent messages: 58393 ms
    max ack holes: 27196
    2025-07-25T20:45:04,686+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer1 received 263558 unique messages 0 duplicates in 472 s, max latency difference of subsequent messages 58310 ms
    2025-07-25T20:45:04,686+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer2 received 241250 unique messages 0 duplicates in 463 s, max latency difference of subsequent messages 58393 ms
    2025-07-25T20:45:04,686+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer3 received 221278 unique messages 0 duplicates in 432 s, max latency difference of subsequent messages 52344 ms
    2025-07-25T20:45:04,686+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer4 received 273914 unique messages 0 duplicates in 472 s, max latency difference of subsequent messages 58393 ms
    Additionally, when I applied a modification to limit the scope of backoff only to replay messages, the number of ack holes was reduced to 835.
    2025-07-25T19:04:26,485+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
    max latency difference of subsequent messages: 3309 ms
    max ack holes: 835
    2025-07-25T19:04:26,485+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer1 received 263558 unique messages 0 duplicates in 472 s, max latency difference of subsequent messages 2473 ms
    2025-07-25T19:04:26,485+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer2 received 241250 unique messages 0 duplicates in 471 s, max latency difference of subsequent messages 1230 ms
    2025-07-25T19:04:26,485+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer3 received 221278 unique messages 0 duplicates in 471 s, max latency difference of subsequent messages 1230 ms
    2025-07-25T19:04:26,485+0900 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer4 received 273914 unique messages 0 duplicates in 471 s, max latency difference of subsequent messages 3309 ms
    This suggests that it is the replay messages that are failing to be delivered, and that the backoff is being reset by the delivery of normal messages. The same behavior may exist on the master branch. This would mean that, with the default settings, the backoff mechanism has little effect. However, I don't consider this a serious issue, since a look-ahead limit has been introduced (https://github.com/apache/pulsar/pull/23231). The default values were changed in https://github.com/apache/pulsar/pull/23340 due to issues caused by backoff under low traffic. However, even when I set `dispatcherRetryBackoffInitialTimeInMs=100` and `dispatcherRetryBackoffMaxTimeInMs=1000`, I was unable to reproduce an increase in consumption latency under low traffic. I couldn't find any mention of this issue in the mailing list or Slack archives. Could you share the specific conditions under which this issue was observed?
    👀 1
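The reset-on-normal-delivery behavior described above can be modeled with a small sketch. This is a hypothetical model of how the `dispatcherRetryBackoffInitialTimeInMs`/`dispatcherRetryBackoffMaxTimeInMs` settings interact, not the broker's actual backoff class; names and the doubling policy are assumptions for illustration:

```java
public class RetryBackoffSketch {
    private final long initialMs;
    private final long maxMs;
    private long currentMs;

    public RetryBackoffSketch(long initialMs, long maxMs) {
        this.initialMs = initialMs;
        this.maxMs = maxMs;
        this.currentMs = initialMs;
    }

    // Each consecutive failed dispatch round doubles the delay, capped at maxMs.
    public long nextDelayOnFailure() {
        long delay = currentMs;
        currentMs = Math.min(currentMs * 2, maxMs);
        return delay;
    }

    // A successful normal delivery resets the backoff. If replay-message
    // failures are interleaved with successful normal deliveries, the delay
    // keeps resetting to initialMs and never reaches a meaningful value.
    public void onSuccess() {
        currentMs = initialMs;
    }
}
```

With `initialMs=100` and `maxMs=1000`, five consecutive failures yield delays 100, 200, 400, 800, 1000 ms; a single success in between drops the next delay back to 100 ms, which matches the observation that scoping backoff to replay messages only (so normal deliveries don't reset it) is what actually reduces ack holes.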
  • l

    Lari Hotari

    07/30/2025, 12:23 PM
    Please help validate the Apache Pulsar releases and vote: • [VOTE] Release Apache Pulsar 3.0.13 based on 3.0.13-candidate-1 • [VOTE] Release Apache Pulsar 3.3.8 based on 3.3.8-candidate-1 • [VOTE] Release Apache Pulsar 4.0.6 based on 4.0.6-candidate-1 /cc @Zixuan Liu @Tboy @Enrico Olivelli @Nicoló Boschi @Entvex @Yunze Xu @孟祥迎
  • l

    Lari Hotari

    08/04/2025, 12:50 PM
    Please help validate the Apache Pulsar Helm chart release 4.2.0 and vote: • [VOTE] Release Apache Pulsar Helm Chart 4.2.0 based on 4.2.0-candidate-1 /cc @Zixuan Liu @Tboy @Entvex @孟祥迎
  • s

    SiNan Liu

    08/04/2025, 1:08 PM
    Hope everyone can discuss the potential roadmap of the delayed-message module together 🤪 https://github.com/apache/pulsar/issues/24600