# troubleshooting
w
Hi team, there is a performance issue when using 'JSONPATH' transformation functions, which may be caused by jayway's inefficient LRUCache design. Under high concurrency during data ingestion there is fierce contention for the CPU, all ingestion threads end up waiting on the lock, and data consumption is delayed.
👀 1
r
Hi @wentao jin, good find 👍 In order to confirm your suspicion that the problem is due to contention, it would help to disable the cache entirely. I'm new to pinot, but I took a look at the jayway library and it looks like you can override the cache implementation by calling
com.jayway.jsonpath.spi.cache.CacheProvider.setCache(new NOOPCache());
but you need to do it before anything else in the same ClassLoader calls CacheProvider.getCache, to make sure the registration works. To really confirm this is related to contention, it would be great to get a contention profile from async-profiler without overriding the cache provider. You can run it as a native agent as outlined here, and use this JVM setting:
-agentpath:/path/to/libasyncProfiler.so=start,event=lock,file=contention.html
(Or set
-agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=cpu.html
to get a cpu profile which should identify the bottleneck without proving the issue is contention) Getting a look at the flamegraph will really help to understand whether this is contention or just slow JSON path evaluation.
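To illustrate the ordering constraint, here is a minimal standalone sketch (the class name and the sample document are made up for the example, not Pinot code):

import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.spi.cache.CacheProvider;
import com.jayway.jsonpath.spi.cache.NOOPCache;

public class NoopCacheBootstrap {
  public static void main(String[] args) {
    // Must run before the first CacheProvider.getCache() call, i.e. before any
    // path is compiled or evaluated in this ClassLoader, otherwise jayway has
    // already latched onto its default LRUCache.
    CacheProvider.setCache(new NOOPCache());

    // Every evaluation now recompiles the path (fine as a diagnostic, slow in production).
    Object correlationId = JsonPath.read(
        "{\"identifiers\":{\"correlationId\":\"abc\"}}",
        "$.identifiers.correlationId");
    System.out.println(correlationId);
  }
}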
w
Hi @Richard Startin. I have run the async-profiler tool to analyze CPU and lock.
Pinot Server Flavor: 28vcpu.110mem.2000ssd
"transformConfigs": [
  {
    "columnName": "correlationId",
    "transformFunction": "jsonPathString(report,'$.identifiers.correlationId','')"
  }
...
]
I have profiled several times, each run lasting one minute, and the results are basically the same: lock wait time in LRUCache and CPU usage of JsonPath are both high.
About replacing LRUCache: I don't think NOOPCache is a very good solution, because we still need to cache JsonPath to avoid unnecessary path compilation.
We have used ConcurrentHashMap to implement a simple cache to fix this issue, and it works well in our production. Maybe I can try to contribute it back to open source.
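To make it concrete, here is roughly what such a cache looks like against jayway's Cache SPI (the class name is just illustrative, not the exact code we run in production):

import java.util.concurrent.ConcurrentHashMap;

import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.spi.cache.Cache;

// Unbounded, lock-free cache keyed by the raw json path string.
public class JsonPathConcurrentCache implements Cache {
  private final ConcurrentHashMap<String, JsonPath> _cache = new ConcurrentHashMap<>();

  @Override
  public JsonPath get(String key) {
    return _cache.get(key);
  }

  @Override
  public void put(String key, JsonPath jsonPath) {
    _cache.put(key, jsonPath);
  }
}

// Registered once at startup, before any path is compiled:
// com.jayway.jsonpath.spi.cache.CacheProvider.setCache(new JsonPathConcurrentCache());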
r
I wasn't suggesting the NOOPCache as a solution, but as a diagnostic step
i.e. remove the suspected point of contention and see what happens
Can you attach the flamegraphs here?
w
Sure
πŸ™ 1
CacheProvider.setCache(new NOOPCache());
Below are the flamegraphs after using NOOPCache.
r
oh sorry, I meant can I have the actual files, not screenshots?
I'd like to have a dig around in them because I think we can make some simple but high impact improvements here, and not just by replacing LRUCache with a better implementation
πŸ‘ 1
(wow, so much time in prometheus client there!)
w
cpu-analysis.html, lock-analysis.html, noopcache-cpu-analysis.html, noopcache-lock-analysis.html
Here are the HTML files.
r
yep, this is mostly the LRUCache indeed
btw you have a bad jsonpath it seems - might be worth fixing (but tiny compared to the cache management)
w
Prometheus JMX client performance is bad too; it gets better after adding the whitelistObjectNames configuration.
Some table transformConfigs configurations may not be right in my test env. 😅
😅 1
r
In the short term, I suggest you keep the ConcurrentHashMap cache you have implemented if it solves your problem, but I'm not sure it will be suitable for every pinot user if it caches every jsonpath. Since pinot has a guava dependency, we could instead use guava's cache with a sensible size limit and register it to avoid jayway's own LRUCache. If you're interested, it could be a nice open source contribution to propose.
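Something along these lines, as a rough sketch (the size limit and class name are placeholders, not a tested patch):

import com.google.common.cache.CacheBuilder;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.spi.cache.Cache;

// Bounded, concurrent json path cache backed by Guava instead of jayway's synchronized LRUCache.
public class GuavaJsonPathCache implements Cache {
  // The limit is an arbitrary placeholder; a sensible default would need measuring.
  private final com.google.common.cache.Cache<String, JsonPath> _delegate =
      CacheBuilder.newBuilder().maximumSize(500).build();

  @Override
  public JsonPath get(String key) {
    return _delegate.getIfPresent(key);
  }

  @Override
  public void put(String key, JsonPath jsonPath) {
    _delegate.put(key, jsonPath);
  }
}

// As before, it has to be registered before the first path is compiled:
// com.jayway.jsonpath.spi.cache.CacheProvider.setCache(new GuavaJsonPathCache());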
There's a lot of interesting information in these profiles, so thanks for sharing them
πŸ™ 1
w
I'm thinking about this: do we really need an LRU cache in this scenario? In fact, a HashMap still has very high query performance at around 100,000 entries, and for big data scenarios this does not take up much JVM memory; it is basically impossible to have more json path transformation configurations than that. With frequent data ingestion, even if the count exceeds 100,000 (or some other sensible size), not caching the json path at all may be better than frequently swapping entries in and out of the LRU.
r
I think you might be right. WDYT @Mayank?
m
Yeah, seems reasonable.
r
Hi @wentao jin I created an issue here, you can link to it if you want to contribute your change πŸ‘ https://github.com/apache/pinot/issues/7403
w
Thanks Richard.
🙌 1