# troubleshooting
d
I'm tasked with doing a Pinot POC for my organization, as we're considering switching to it as our primary data store for reporting data. I followed the Advanced Pinot Setup guide and was able to create a realtime table ingesting streaming GitHub events. I'm now trying to set up my own realtime table ingesting dummy data with a JSON column and upserts enabled (this will be required for our use case). I have successfully uploaded both a table config and a schema to the Pinot controller, and I also created a little app to push dummy data into a Kafka topic. I confirmed that the data is successfully being added to the topic; however, my table is not ingesting any records. Can someone help me troubleshoot why that may be happening? I will post the table config and schema in this message's thread.
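For context, the seeding app described above is essentially a plain Kafka producer that writes one JSON document per message. A minimal sketch, assuming a local broker and a made-up topic name (illustrative only, not the actual app from this thread):

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DummyDataSeeder {
  public static void main(String[] args) {
    Properties props = new Properties();
    // assumptions: local broker and topic name; not the actual setup from the thread
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      for (int i = 0; i < 1_000; i++) {
        String uid = UUID.randomUUID().toString();
        // attr_json is a nested JSON object (not a quoted string) so the decoder can parse it
        String payload = "{\"uid\":\"" + uid + "\","
            + "\"attr_json\":{\"A\":{\"type\":\"numTickets\",\"val\":" + (i % 100) + "}},"
            + "\"createdDateInEpoch\":" + System.currentTimeMillis() + "}";
        producer.send(new ProducerRecord<>("simplejson-events", uid, payload));
      }
      producer.flush();
    }
  }
}
```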
Table config.js
m
Any errors in the controller or server logs?
d
Schema
Just a moment @Mayank, I will give it a look. (I thought I was already tailing them, but it turned out I was looking at the Kafka server.)
m
Also, what release of Pinot are you using? You can try the table debug API in Swagger with the latest 0.8.0.
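For reference, the controller in 0.8.0 exposes a per-table debug endpoint under the Swagger UI's "Debug" section; invoking it looks roughly like this (controller host/port are assumptions; check the Swagger UI for the exact path and parameters):

```bash
# a sketch, assuming the controller runs locally on the default port 9000;
# <tableName> is a placeholder for the realtime table being debugged
curl -s "http://localhost:9000/debug/tables/<tableName>"
```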
d
I am on 0.8.0. I was unaware of that API. I'll give that a look too
org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata

java.lang.RuntimeException: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
It appears my payloads to the Kafka topic are malformed as well. I will debug that and report back.
2021/08/31 21:54:12.977 ERROR [JSONMessageDecoder] [simplejson__0__1__20210831T2011Z] Caught exception while decoding row, discarding row. Payload is {"uid":"ad23a2ea-1fac-4a57-8d47-597d3b77a52a","attr_json": {"A": "{"type": "numTickets", "val": 83}","B": "{"type": "numTickets", "val": 51}","C": "{"type": "numTickets", "val": 61}"},"createdDateInEpoch":1570000000247}

shaded.com.fasterxml.jackson.core.JsonParseException: Unexpected character ('t' (code 116)): was expecting comma to separate Object entries

 at [Source: (ByteArrayInputStream); line: 1, column: 70]
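The decoder is tripping over the inner values: each attribute embeds a JSON object as a quoted string without escaping the inner quotes. A well-formed version of that payload would either escape those quotes or, more simply, nest the objects directly, e.g.:

```json
{
  "uid": "ad23a2ea-1fac-4a57-8d47-597d3b77a52a",
  "attr_json": {
    "A": {"type": "numTickets", "val": 83},
    "B": {"type": "numTickets", "val": 51},
    "C": {"type": "numTickets", "val": 61}
  },
  "createdDateInEpoch": 1570000000247
}
```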
n
You're missing ingestion config in your table config
You need to set a transform function on attr_json
"columnName":"attr_json_str", "transformFunction":"jsonFormat(attr_json)"
and change the column name in the schema to attr_json_str
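In the table config, that suggestion corresponds to a fragment roughly like the following (a sketch showing only the ingestion section; the rest of the config is unchanged):

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "attr_json_str",
        "transformFunction": "jsonFormat(attr_json)"
      }
    ]
  }
}
```

The schema then declares `attr_json_str` (rather than `attr_json`) as the stored column, as noted above.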
d
Thank you both. After fixing my data seeding app and adding an `ingestionConfig`, I'm now able to ingest data into the table with a JSON column. I'm seeing some behavior I don't quite understand, however. Prior to adding the `ingestionConfig`, I ingested some rows where `attr_json` was null. After adding the config, I saw new rows where `attr_json` was populated. In my schema, I have defined `uid` as the primary key column. I am seeding 1,000 rows at a time, so I would expect to see (number of runs before the `ingestionConfig` × 1,000) + (number of runs after the config × 1,000) rows. However, after adding the `ingestionConfig` and seeding 1,000 more rows, my table now has 1,002 rows. My understanding of upserts is that the primary key column and event time are used in conjunction to determine which records should be overwritten. That being the case, how is it that so many of my rows were overwritten / deleted?* It is of course exceedingly unlikely that I managed to generate 998 of the same UIDs during my second round of ingestion.

*I'm aware that Pinot does not support deletes. I'm using "delete" here because I'm not sure how else to explain my doc count going from 2,000 (prior to fixing the ingestion config) to 1,002.
n
@Jackie @Yupeng Fu
j
@David Cyze Pinot overwrites records based on the primary key only, and the record with the newer timestamp is preserved
So the expected behavior should be one record for each different `uid`
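A quick way to verify that from the query console is to compare the total row count with the number of distinct keys (table name inferred from the segment name in the log above):

```sql
-- run these separately in the Pinot query console;
-- if upsert is resolving purely on uid, the two counts should match
SELECT COUNT(*) FROM simplejson
SELECT DISTINCTCOUNT(uid) FROM simplejson
```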
d
So there is no explanation for why so many records disappeared? I had run two iterations of my faulty ingestion application (i.e., before adding the config, thus generating null `attr_json` values). There were 2,000 records before I ran ingestion with the fixed application. That means the minimum number of records that should have been present is 2,000, even assuming the exceedingly unlikely possibility that every randomly generated UID was a duplicate of a previously generated one.
Note too that if there were an error in the UID-generating logic of my application (doubtful; I used Java's `UUID.randomUUID()`) such that each run of my app produced identical `uid` values, the total number of records should never have exceeded 1,000.
When adding a `transformConfig`, does Pinot re-process all records with the updated config? That could explain the record loss:
• 2k records exist whose JSON is malformed
• the `transformConfig` is updated
• Pinot re-processes those records; they fail the `transformFunction`; Pinot writes a new segment with them excluded
• 0 records now
• records are ingested with the fixed application
• 1k well-formed records are ingested (actually 1,001, as I had an off-by-one "error" in my app and generate 1,001 records each run; this doesn't explain why I saw 1,002 records, however)
j
No, Pinot won't re-process the already-consumed data
Since there isn't much data, you may re-create the table to get a fresh start
d
Thanks for the suggestion. As I mentioned, I'm doing a POC. Unexplained data loss has me a bit worried, and I will continue to explore to see if anything else pops up
j
Understood. Once the table is correctly configured, there should be no data loss
d
Thank you all for your time and help. It is much appreciated 🙂