# troubleshooting
d
I'm tasked with doing a Pinot POC for my organization, as we're considering switching to it as our primary data store for reporting data. I followed the Advanced Pinot Setup guide and was able to create a realtime table ingesting streaming GitHub events. I'm now trying to set up my own realtime table ingesting dummy data with a JSON column and upserts enabled (this will be required for our use case). I have successfully uploaded both a table config and a schema to the Pinot controller, and I also created a little app to push dummy data into a Kafka topic. I confirmed that the data is successfully being added to the topic; however, my table is not ingesting any records. Can someone help me troubleshoot why that may be happening? I will post the table config and schema in this message's thread.
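For context, the seeding app described above is essentially a plain Kafka producer that writes one JSON document per message. A minimal sketch, assuming a local broker and a made-up topic name (illustrative only, not the actual app from this thread):

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DummyDataSeeder {
  public static void main(String[] args) {
    Properties props = new Properties();
    // assumptions: local broker and topic name; not the actual setup from the thread
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      for (int i = 0; i < 1_000; i++) {
        String uid = UUID.randomUUID().toString();
        // attr_json is a nested JSON object (not a quoted string) so the decoder can parse it
        String payload = "{\"uid\":\"" + uid + "\","
            + "\"attr_json\":{\"A\":{\"type\":\"numTickets\",\"val\":" + (i % 100) + "}},"
            + "\"createdDateInEpoch\":" + System.currentTimeMillis() + "}";
        producer.send(new ProducerRecord<>("simplejson-events", uid, payload));
      }
      producer.flush();
    }
  }
}
```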
Table config.js
m
Any errors in the controller or server logs?
d
Schema
Just a moment @Mayank, I will give it a look. (I thought I was already tailing them, but it turned out I was looking at the Kafka server.)
m
Also, what release of Pinot are you using? You can try the table debug API in Swagger with the latest 0.8.0.
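For reference, the controller in 0.8.0 exposes a per-table debug endpoint under the Swagger UI's "Debug" section; invoking it looks roughly like this (controller host/port are assumptions; check the Swagger UI for the exact path and parameters):

```bash
# a sketch, assuming the controller runs locally on the default port 9000;
# <tableName> is a placeholder for the realtime table being debugged
curl -s "http://localhost:9000/debug/tables/<tableName>"
```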
d
I am on 0.8.0. I was unaware of that API. I'll give that a look too
org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata

java.lang.RuntimeException: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
It appears my payloads to the Kafka topic are malformed as well. I will debug that and report back.
2021/08/31 21:54:12.977 ERROR [JSONMessageDecoder] [simplejson__0__1__20210831T2011Z] Caught exception while decoding row, discarding row. Payload is {"uid":"ad23a2ea-1fac-4a57-8d47-597d3b77a52a","attr_json": {"A": "{"type": "numTickets", "val": 83}","B": "{"type": "numTickets", "val": 51}","C": "{"type": "numTickets", "val": 61}"},"createdDateInEpoch":1570000000247}

shaded.com.fasterxml.jackson.core.JsonParseException: Unexpected character ('t' (code 116)): was expecting comma to separate Object entries

 at [Source: (ByteArrayInputStream); line: 1, column: 70]
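The decoder is tripping over the inner values: each attribute embeds a JSON object as a quoted string without escaping the inner quotes. A well-formed version of that payload would either escape those quotes or, more simply, nest the objects directly, e.g.:

```json
{
  "uid": "ad23a2ea-1fac-4a57-8d47-597d3b77a52a",
  "attr_json": {
    "A": {"type": "numTickets", "val": 83},
    "B": {"type": "numTickets", "val": 51},
    "C": {"type": "numTickets", "val": 61}
  },
  "createdDateInEpoch": 1570000000247
}
```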
n
You're missing ingestion config in your table config
You need to set a transform function on attr_json
"columnName":"attr_json_str", "transformFunction":"jsonFormat(attr_json)"
and change the column name in the schema to attr_json_str
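In the table config, that suggestion corresponds to a fragment roughly like the following (a sketch showing only the ingestion section; the rest of the config is unchanged):

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "attr_json_str",
        "transformFunction": "jsonFormat(attr_json)"
      }
    ]
  }
}
```

The schema then declares `attr_json_str` (rather than `attr_json`) as the stored column, as noted above.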
d
Thank you both. After fixing my data seeding app and adding an `ingestionConfig`, I'm now able to ingest data into the table with a JSON column. I'm seeing some behavior I don't quite understand, however. Prior to adding the `ingestionConfig`, I ingested some rows where `attr_json` was null. After adding the config, I saw new rows where `attr_json` was populated. In my schema, I have defined `uid` as the primary key column. I am seeding 1,000 rows at a time, so I would expect to see (number of runs before the `ingestionConfig` × 1,000) + (number of runs after the config × 1,000) rows. However, after adding the `ingestionConfig` and seeding 1,000 more rows, my table now has 1,002 rows. My understanding of upserts is that the primary key column and event time are used in conjunction to determine which records should be overwritten. That being the case, how is it that so many of my rows were overwritten / deleted?* It is of course exceedingly unlikely that I managed to generate 998 of the same UIDs during my second round of ingestion.

*I'm aware that Pinot does not support deletes. I'm using "delete" here because I'm not sure how else to explain my doc count going from 2,000 (prior to fixing the ingestion config) to 1,002.
n
@Jackie @Yupeng Fu
j
@David Cyze Pinot overwrites records based on the primary key only, and the record with the newer timestamp is preserved
So the expected behavior should be one record for each different `uid`
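A quick way to verify that from the query console is to compare the total row count with the number of distinct keys (table name inferred from the segment name in the log above):

```sql
-- run these separately in the Pinot query console;
-- if upsert is resolving purely on uid, the two counts should match
SELECT COUNT(*) FROM simplejson
SELECT DISTINCTCOUNT(uid) FROM simplejson
```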
d
So there is no explanation for why so many records disappeared? I had run two iterations of my faulty ingestion application (i.e., before adding the config, thus generating null `attr_json` values). There were 2,000 records before I ran ingestion with the fixed application. That means the minimum number of records that should have been present is 2,000, even assuming the exceedingly unlikely possibility that every randomly generated UID was a duplicate of a previously generated one.
Note too that if there were an error in the UID-generating logic of my application (doubtful; I used Java's `UUID.randomUUID()`) such that each run of my app produced identical `uid` values, the total number of records should never have exceeded 1,000.
When adding a `transformConfig`, does Pinot re-process all records with the updated config? That could explain the record loss:
• 2k records exist whose JSON is malformed
• the `transformConfig` is updated
• Pinot re-processes those records; they fail the `transformFunction`; Pinot writes a new segment with them excluded
• 0 records now
• records are ingested with the fixed application
• 1k well-formed records are ingested (actually 1,001, as I had an off-by-one "error" in my app and generate 1,001 records each run; this doesn't explain why I saw 1,002 records, however)
j
No, Pinot won't re-process the already-consumed data
Since there isn't much data, you may re-create the table to get a fresh start
d
Thanks for the suggestion. As I mentioned, I'm doing a POC. Unexplained data loss has me a bit worried, and I will continue to explore to see if anything else pops up
j
Understood. Once the table is correctly configured, there should be no data loss
d
Thank you all for your time and help. It is much appreciated 🙂