Apache Pinot #general

Yupeng Fu

04/13/2021, 11:37 PM

hey, is there a plan to add a table creation module in the cluster management UI?

Xiang Fu

04/14/2021, 8:13 AM

Hello Community, We’re happy to announce the release of 🍷 Apache Pinot 0.7.1! This release includes several awesome new features 📃 🌎 🔓 :

- JSON index
- Lookup-based join support
- Geospatial support
- TLS support for pinot connections
- Introduced new APIs for segment management and offline table push.
- Various performance optimizations, improvements and bug fixes.

Please also see the full release notes here: https://docs.pinot.apache.org/basics/releases/0.7.1
The release can be downloaded at https://pinot.apache.org/download

Additional resources:
- Project website: https://pinot.apache.org
- Getting started: https://docs.pinot.apache.org/getting-started
- Pinot developer blogs: https://medium.com/apache-pinot-developer-blog
- Intro to Pinot Video: https://www.youtube.com/watch?v=T70jTTYhYyM
- Twitter: https://twitter.com/ApachePinot
- Meetup: https://www.meetup.com/apache-pinot

🚀 5

🎉 8

🥂 5

🔓 2

🥳 4

pinot 5

🌎 4

🍷 12

🙌 1

Gabriel Lucano

04/14/2021, 5:23 PM

image.png

Aaron Wishnick

04/14/2021, 8:11 PM

Is there anything I can do to make batch import faster? It seems like most of the time is spent processing the Parquet files I'm importing, but I still don't see very high CPU usage on my machine (particularly, most cores are not busy). I see stuff like this in the logs:

Apr 14, 2021 3:16:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: time spent so far 0% reading (1854 ms) and 99% processing (311813 ms)

Is there a setting to use more cores to process segments in parallel or anything like that?

Neil Teng

04/15/2021, 5:14 PM

Hey, I have a question about the pinot-stream-ingestion. What will happen to the data after flush? Will it be consolidated into disk or just removed?

Ming Liang

04/15/2021, 10:54 PM

Hey, build on master failed with message:

[INFO] Reactor Summary for Pinot 0.8.0-SNAPSHOT:
[INFO]
[INFO] Pinot .............................................. SUCCESS [ 15.773 s]
[INFO] Pinot Service Provider Interface ................... SUCCESS [  2.762 s]
[INFO] Pinot Segment Service Provider Interface ........... SUCCESS [  1.093 s]
[INFO] Pinot Plugins ...................................... SUCCESS [  4.414 s]
[INFO] Pinot Metrics ...................................... SUCCESS [  0.145 s]
[INFO] Pinot Yammer Metrics ............................... SUCCESS [  4.240 s]
[INFO] Pinot Common ....................................... SUCCESS [ 16.085 s]
[INFO] Pinot Input Format ................................. SUCCESS [  0.993 s]
[INFO] Pinot Avro Base .................................... SUCCESS [  1.109 s]
[INFO] Pinot Avro ......................................... SUCCESS [  0.902 s]
[INFO] Pinot Csv .......................................... SUCCESS [  0.354 s]
[INFO] Pinot JSON ......................................... SUCCESS [  0.344 s]
[INFO] Pinot local segment implementations ................ SUCCESS [  6.542 s]
[INFO] Pinot Core ......................................... SUCCESS [  9.313 s]
[INFO] Pinot Server ....................................... SUCCESS [  4.548 s]
[INFO] Pinot Segment Uploader ............................. SUCCESS [  1.528 s]
[INFO] Pinot Segment Uploader Default ..................... SUCCESS [ 15.425 s]
[INFO] Pinot Controller ................................... SUCCESS [ 50.838 s]
[INFO] Pinot Broker ....................................... SUCCESS [  4.736 s]
[INFO] Pinot Clients ...................................... SUCCESS [  0.117 s]
[INFO] Pinot Java Client .................................. SUCCESS [  0.474 s]
[INFO] Pinot JDBC Client .................................. SUCCESS [  0.555 s]
[INFO] Pinot Batch Ingestion .............................. SUCCESS [  1.501 s]
[INFO] Pinot Batch Ingestion Common ....................... SUCCESS [  0.393 s]
[INFO] Pinot Minion ....................................... SUCCESS [  1.718 s]
[INFO] Pinot Confluent Avro ............................... FAILURE [  0.586 s]
[INFO] Pinot ORC .......................................... SKIPPED
[INFO] Pinot Parquet ...................................... SKIPPED
[INFO] Pinot Thrift ....................................... SKIPPED
[INFO] Pinot Protocol Buffers ............................. SKIPPED
[INFO] Pluggable Pinot file system ........................ SKIPPED
[INFO] Pinot Azure Data Lake Storage ...................... SKIPPED
[INFO] Pinot Hadoop Filesystem ............................ SKIPPED
[INFO] Pinot Google Cloud Storage ......................... SKIPPED
[INFO] Pinot Amazon S3 .................................... SKIPPED
[INFO] Pinot Batch Ingestion for Spark .................... SKIPPED
[INFO] Pinot Batch Ingestion for Hadoop ................... SKIPPED
[INFO] Pinot Batch Ingestion Standalone ................... SKIPPED
[INFO] Pinot Batch Ingestion .............................. SKIPPED
[INFO] Pinot Ingestion Common ............................. SKIPPED
[INFO] Pinot Hadoop ....................................... SKIPPED
[INFO] Pinot Spark ........................................ SKIPPED
[INFO] Pinot Stream Ingestion ............................. SKIPPED
[INFO] Pinot Kafka Base ................................... SKIPPED
[INFO] Pinot Kafka 0.9 .................................... SKIPPED
[INFO] Pinot Kafka 2.0 .................................... SKIPPED
[INFO] Pinot Minion Tasks ................................. SKIPPED
[INFO] Pinot Minion Built-In Tasks ........................ SKIPPED
[INFO] Pinot Segment Writer ............................... SKIPPED
[INFO] Pinot Segment Writer File Based .................... SKIPPED
[INFO] Pinot Tools ........................................ SKIPPED
[INFO] Pinot Integration Tests ............................ SKIPPED
[INFO] Pinot Perf ......................................... SKIPPED
[INFO] Pinot Distribution ................................. SKIPPED
[INFO] Pinot Connectors ................................... SKIPPED
[INFO] Pinot Spark Connector .............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:26 min
[INFO] Finished at: 2021-04-15T15:49:12-07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project pinot-confluent-avro: Could not resolve dependencies for project org.apache.pinot:pinot-confluent-avro:jar:0.8.0-SNAPSHOT: Failed to collect dependencies at io.confluent:kafka-schema-registry-client:jar:5.3.1: Failed to read artifact descriptor for io.confluent:kafka-schema-registry-client:jar:5.3.1: Could not transfer artifact io.confluent:kafka-schema-registry-client:pom:5.3.1 from/to maven-default-http-blocker (<http://0.0.0.0/>): Blocked mirror for repositories: [confluent (<http://packages.confluent.io/maven/>, default, releases+snapshots)] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] <http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException>
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :pinot-confluent-avro
➜ incubator-pinot git:(master)

Badri Tripathy

04/15/2021, 10:55 PM

Hello guys. I am new to Pinot and going through the architecture. I have a question: what will be the format of the segments in deep storage if they are kept on S3?

Gabriel Lucano

04/15/2021, 11:17 PM

Hello guys, what is the correct value for "stream.kafka.decoder.class.name" when decoding an avro message from Schema Registry?

Mohan Pandiyan

04/16/2021, 8:57 PM

Hi folks, I am looking into the Pinot JDBC connector. It looks like the access control is ignored by the driver?

Matt

04/17/2021, 12:27 AM

Hello, I managed to deploy Pinot to one of our AWS production environments. It’s been up for a few days and all looks well so far. Thanks to everyone in this group for their support, especially @User, @User, @User, @User, @User, @User, @User and many others who replied to my queries. Also, special thanks to @User, who helped me extensively to get the Text Index up and running, which is one of the main features I am using. There were a few issues initially; however, Sidd was willing to jump on Zoom and helped me get them all resolved. Thanks again!

🎉 2

🥂 2

👍 1

🍷 7

Dileep Reddy

04/18/2021, 4:06 PM

We have heard that Codecov was impacted by a cyber security incident: https://about.codecov.io/security-update/ Has the Pinot team looked at it?

Charles

04/19/2021, 2:00 AM

Hi all, we want to build a new environment for Pinot with some hybrid tables. Can we assign realtime table data to specific VMs and offline table data to other VMs? Thanks.

Charles

04/19/2021, 2:04 AM

Another question: can Pinot use Hadoop HDFS as storage?

John Knapp

04/20/2021, 3:46 AM

Hi, I'm new to Pinot and have a possibly dumb question - with so many features and horizontal scalability, when should one not use Pinot as a general relational database solution?

Hector D

04/22/2021, 4:53 PM

Why does Pinot exist? It looks like a rip off of Apache Druid and Clickhouse. Does the world really need Pinot too...what are the differences? Or was this just done because LinkedIn and Uber engineers had nothing better to do?

S Das

04/22/2021, 5:17 PM

Hi folks, I noticed that there are some old Pinot docker images sitting here: https://hub.docker.com/r/linkedin/pinot-controller : are these maintained by the community?

Josh Highley

04/22/2021, 5:25 PM

When doing upsert, can the table's time column be a string or does it have to be a long?

Yupeng Fu

04/22/2021, 9:17 PM

is it possible to add the table creation/update time to the cluster management UI (under table)? so we can sort the tables to view the recently added/modified tables?

Pedro Silva

04/23/2021, 10:16 AM

Hello, Does Pinot support updating an existing Schema & Table's definition? I have a dimension which is a string representation of a JSON. The schema of this json payload is dynamic. Some inner fields exist for some rows but not others and will change over time. I have a business requirement to deconstruct the json such that users can use the inner fields in the json for queries. I've seen that it is possible to deconstruct json fields: https://stackoverflow.com/questions/65886253/pinot-nested-json-ingestion but my question is whether pinot allows this deconstruct to change over time. Thank you.

Akash

04/23/2021, 12:17 PM

A couple of starter questions. Let’s assume we have a table in HDFS that gets loaded every 30 minutes with the following structure, e.g.: /tmp/event/dt=2021-01-01/batch_id=2021-01-01_01_00_00
1. How do we incrementally load the data into Pinot, atomically?
2. Let’s assume we have to fix historical data. How do we reload an older batch that is already loaded into Pinot (e.g. /tmp/event/dt=2020-01-01/batch_id=2020-01-01_01_00_00)?
3. Is there a way to build Pinot segments directly from a Spark DataFrame? Is there a specific implementation interface I can use in our existing Spark app?

Pedro Silva

04/23/2021, 2:00 PM

Hello again, is it normal when trying to create a realtime table in Pinot's UI to receive a popup saying the table has been saved but not seeing an entry in the UI?

Arun Lakshman Ravichandran

04/24/2021, 11:50 AM

Hi all, is Apache Pinot a good fit for use cases like website traffic analysis (similar to Google Analytics), where we would need to perform aggregations on multiple dimensions (like browser, OS, country, campaign, referrer, etc.)?

Erjan G.

04/26/2021, 12:56 PM

both seem to be about fast scalable olap analytics, but what is the difference?

Jonathan Meyer

04/26/2021, 4:58 PM

Hello, is it possible to benefit from Pinot's Star-Tree index when performing aggregation queries on known time ranges? For example, if I know in advance that I will get queries like this:

SELECT SUM(value) FROM values WHERE timeString BETWEEN '2021-01-01' AND '2021-01-08'

(ex: rolling week) Can some configuration of StarTree index precompute this sort of query ? (or even part of it) [I know this looks like a TSDB use case, but still, I'm hopeful 😄]
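
To make the "precompute part of it" idea concrete, here is a minimal sketch of daily pre-aggregation answering the rolling-week query above. It is a generic illustration only and makes no claim about how the Star-Tree index stores or serves data; the sample rows and the helper names `build_daily_sums` and `range_sum` are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical raw rows of (timeString, value); not taken from the thread.
rows = [
    ("2021-01-01", 5), ("2021-01-01", 7),
    ("2021-01-03", 2), ("2021-01-08", 10), ("2021-01-09", 4),
]


def build_daily_sums(rows):
    """Pre-aggregate: one SUM(value) per day, computed once at load time."""
    daily = defaultdict(int)
    for day, value in rows:
        daily[day] += value
    return dict(daily)


def range_sum(daily, start, end):
    """SUM(value) for start <= timeString <= end (inclusive, like BETWEEN)."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    total = 0
    while current <= last:
        # Reads one pre-aggregated value per day instead of scanning raw rows.
        total += daily.get(current.isoformat(), 0)
        current += timedelta(days=1)
    return total


daily = build_daily_sums(rows)
print(range_sum(daily, "2021-01-01", "2021-01-08"))
```

With this layout a one-week query touches at most eight daily totals regardless of how many raw rows each day holds.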

Erjan G.

04/26/2021, 5:14 PM

Frankly, I found out about Pinot only today. I asked my friend if they use Cassandra, but he told me they use Pinot 🙂 My question: Cassandra is OLTP, but Pinot is OLAP. Has anyone connected Pinot and Cassandra? How would I stress test or load test reading from Cassandra and analyzing in Pinot?

Erjan G.

04/26/2021, 5:14 PM

I want to do some benchmarking with this combination: Cassandra as storage plus analytics on Pinot.

Amine Chraïbi

04/27/2021, 2:15 PM

Hello all. I’m considering using Pinot as a backend for a “business intelligence-like” tool for sensor data, that is: user-customizable web-facing dashboards plus some reporting. I’m wondering if Pinot would be a good fit, as my requirements are:
• a derived-metric definition mechanism that would allow developers to define metrics as a function of other metrics, either timestamped or static
• a fair amount of updates, due to errors that might occur in the data transmission chain and to data being ingested in batches
• an “alternative data” mechanism that would set the value of a given metric A based on some criteria (like the existence of B for the same timestamp)
• I need to handle approximately 1 million readings a day per production site. These metrics are transmitted every 5 minutes. Useful information can lie within this level of detail, but most of the time dashboards will present daily or hourly aggregates
• each ingestion must trigger a set of analyses on the data that will perform actions based on business rules (like alerting, creating an issue ticket, etc.)
I’m new to big data technologies and, despite reading the documentation, I feel I’m missing a building block somewhere between PostgreSQL and Pinot. I’m thus seeking friendly advice that could point me in the right direction. Thanks all.

kauts shukla

04/28/2021, 6:36 AM

But through the API it shows 2?

{
  "tenants": {
    "DefaultTenant": [
      {
        "port": 8099,
        "host": "Broker_1",
        "instanceName": "Broker_1"
      },
      {
        "port": 8099,
        "host": "Broker_1",
        "instanceName": "Broker_1"
      }
    ]
  },
  "tables": {}
}
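
A quick way to check whether the two entries in the response above are really two brokers is to compare the number of listed entries with the number of distinct instance names. This is a minimal sketch that only inspects the pasted JSON; the helper name `count_instances` is hypothetical and nothing here calls a Pinot API.

```python
import json

# Response body copied from the message above.
RESPONSE = json.loads("""
{
  "tenants": {
    "DefaultTenant": [
      {"port": 8099, "host": "Broker_1", "instanceName": "Broker_1"},
      {"port": 8099, "host": "Broker_1", "instanceName": "Broker_1"}
    ]
  },
  "tables": {}
}
""")


def count_instances(tenants):
    """Return {tenant: (listed_entries, distinct_instance_names)}."""
    counts = {}
    for tenant, instances in tenants.items():
        names = [entry["instanceName"] for entry in instances]
        counts[tenant] = (len(names), len(set(names)))
    return counts


counts = count_instances(RESPONSE["tenants"])
for tenant, (listed, distinct) in counts.items():
    print(f"{tenant}: {listed} entries listed, {distinct} distinct instance(s)")
```

For the payload above, both entries carry the same host, port, and instance name, so the listing contains a duplicate rather than a second broker.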

Pedro Silva

04/28/2021, 9:23 AM

Hello, how does Pinot handle processing incomplete messages for realtime tables? If I have the following schema:

{
  "schemaName": "hitexecutionview",
  "dimensionFieldSpecs": [
    {
      "name": "id",
      "dataType": "STRING"
    },
    {
      "name": "jobId",
      "dataType": "STRING"
    },
    {
      "name": "crowdMemberId",
      "dataType": "STRING"
    },
    {
      "name": "projectId",
      "dataType": "STRING"
    },
    {
      "name": "result",
      "dataType": "STRING"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestamp",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ],
  "primaryKeyColumns": [
    "id"
  ]
}

And the following kafka message:

{
  "id": "19281-3123n1283-12312-312",
  "jobId": "245d-2334-fs33-23f4",
  "crowdMemberId": "xxxxxxxxxx",
  "projectId": "49mf-f39f-25v2-989m",
  "timestamp": "1238648237"
}

The field result is not passed. Will Pinot assume a null value? What happens if there are computed fields based on this non-existing result?
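
To pin down exactly what is "incomplete" in the example, the following sketch compares the schema's field names with the keys of the Kafka message. It only restates the two JSON documents above in Python and does not model what Pinot does with a missing field; the variable names are hypothetical.

```python
# Field names and declared types taken from the schema above.
schema_fields = {
    "id": "STRING",
    "jobId": "STRING",
    "crowdMemberId": "STRING",
    "projectId": "STRING",
    "result": "STRING",
    "timestamp": "LONG",
}

# Kafka message copied from the example above.
message = {
    "id": "19281-3123n1283-12312-312",
    "jobId": "245d-2334-fs33-23f4",
    "crowdMemberId": "xxxxxxxxxx",
    "projectId": "49mf-f39f-25v2-989m",
    "timestamp": "1238648237",
}

# Schema fields that the message does not carry at all.
missing = [name for name in schema_fields if name not in message]

# Message keys that the schema does not declare.
extra = [name for name in message if name not in schema_fields]

print("missing from message:", missing)
print("not in schema:", extra)
```

Note also that the message sends timestamp as a JSON string while the schema declares it as LONG, which may be worth checking separately.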

Xiang Fu

04/28/2021, 10:56 PM

I want to bring this up here again: the JDK 11 upgrade and dropping JDK 8 support moving forward. I'd also like to find out how many users are still on JDK 8/10 and have no plans to upgrade to JDK 11.