Nicolas Hennebelle
01/27/2023, 3:54 PM
Steven Hall
01/27/2023, 5:47 PM
transcripts_bucket = "s3://transcript-parquet/1/"
transcript_df.write.mode("overwrite").parquet(transcripts_bucket)
Then I run a Spark job to transform the data into segments and import them into Pinot:
spark_args = {
    'master': 'spark://spark-master:7077',
    'deploy_mode': 'cluster',
    'name': 'segments-from-parquet',
    'class': 'org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': [f"spark.driver.extraJavaOptions='{EXTRA_JAVA_OPTIONS}'"],
    'main_file_args': '-jobSpecFile=/home/job_specs/transcript_job_spec.yml'
}
main_file = f'{PINOT_DISTRIBUTION_DIR}/lib/pinot-all-{PINOT_VERSION}-jar-with-dependencies.jar'
app = SparkJob(main_file, **spark_args)
app.submit()
All columns populate as expected. The schema is:
{
  "schemaName": "transcript_indexed",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "examTime",
      "dataType": "LONG",
      "timeType": "MILLISECONDS"
    }
  }
}
If I partition my data by subject when I write the parquet files, I get an unexpected outcome: the subject field in the Pinot segments is null.
transcripts_bucket = "s3://transcript-parquet/1/"
transcript_df.write.mode("overwrite").partitionBy("subject").parquet(transcripts_bucket)
Are we thinking about this incorrectly, in a way that Pinot does not support? Alternatively, is there some change we need to make in the configs to work with data-lake data that is normally partitioned?
The data on MinIO (my stand-in S3 service) looks like this once partitioned by subject: one subject=<value>/ directory per subject, each containing the parquet part files.
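A possible explanation and workaround, as an untested sketch: Spark's partitionBy() moves the partition column into those directory names and removes it from the parquet files themselves, so the segment-generation job never sees a subject column and Pinot falls back to null. Keeping a copy of the column inside the files sidesteps that; the duplicated subject_partition column name below is made up for illustration.

from pyspark.sql.functions import col

transcripts_bucket = "s3://transcript-parquet/1/"

# Keep "subject" inside the data files and let a copy drive the directory layout,
# so the Pinot ingestion job still finds a populated subject column.
(transcript_df
    .withColumn("subject_partition", col("subject"))  # copy used only for partitioning
    .write.mode("overwrite")
    .partitionBy("subject_partition")
    .parquet(transcripts_bucket))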
vishal
01/27/2023, 6:03 PM
Steven Hall
01/27/2023, 6:39 PM
Arthur Zhou
01/27/2023, 10:41 PM
{
  "REALTIME": {
    "tableName": "howler_ad_mainst_battlestation_order_updates_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "schemaName": "howler_ad_mainst_battlestation_order_updates",
      "replication": "1",
      "replicasPerPartition": "1",
      "timeColumnName": "time",
      "minimizeDataMovement": false
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant",
      "tagOverrideConfig": {}
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "howler_ad_mainst_battlestation_order_updates",
        "stream.kafka.broker.list": "confluent-broker.roles.service.robinhood:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "realtime.segment.flush.threshold.rows": "0",
        "realtime.segment.flush.threshold.time": "24h",
        "realtime.segment.flush.segment.size": "100M"
      },
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "onHeapDictionaryColumns": [],
      "varLengthDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "optimizeDictionaryForMetrics": false,
      "noDictionarySizeRatioThreshold": 0
    },
    "metadata": {},
    "quota": {},
    "routing": {
      "instanceSelectorType": "strictReplicaGroup"
    },
    "query": {},
    "upsertConfig": {
      "mode": "FULL",
      "hashFunction": "NONE"
    },
    "ingestionConfig": {},
    "isDimTable": false
  }
}
Peter Pringle
01/28/2023, 10:01 AM
Josh Black
01/28/2023, 7:26 PM
Yanwen
01/30/2023, 6:06 PM
Sathish Kumar
01/30/2023, 6:50 PM
Mayank
Subbu Subramaniam
01/31/2023, 2:15 AM
ME: Can you explain pinot to me
CG: Sure! Pinot is a type of wine grape that is used to make a variety of wines, including Pinot Noir, Pinot Grigio, and Pinot Blanc.
Pinot Noir is a red wine grape variety that is known ...
(goes on for half page)
ME: But I thought pinot is a database
CG: I apologize for any confusion. Pinot is a real-time distributed OLAP datastore, which is a type of database designed for real-time analytics. ...
(goes on a bit)
ME: Then why did you say it is a wine grape?
CG: I apologize for any confusion, I made a mistake. Pinot is indeed a database technology, not a type of wine grape. I apologize for any confusion my previous message may have caused.
ME: OK, please note it for the next person who asks this question.
CG: As a language model, I do not have the ability to remember previous interactions or to store information for future use. I apologize for any confusion my previous response may have caused. Each time someone interacts with me, I generate responses based on the input given to me at that specific moment. I am only able to provide information based on the knowledge that I have been trained on, which has a cutoff date of 2021. Please let me know if you have any other question, I'll do my best to assist you.
[This last part is, I think, an "I give up" answer. I got it a few times today.]
vishal
02/01/2023, 7:32 AM
Nickel Fang
02/03/2023, 9:48 AM
select region, clock,
cast(sum(error_count)/sum(request_count)>=0.2 as int) as "critical_status",
cast(sum(error_count)/sum(request_count) between 0.1 and 0.2 as int) as "warning_status"
from system_health where
clock>=1675239600000 and clock<1675240200000
group by region, clock limit 500000
If I want to define the threshold for "critical_status" as sum(error_count)/sum(request_count) >= 0.2 or sum(error_count) >= 5, how can I write the correct SQL?
cast(sum(error_count)/sum(request_count)>=0.2 or sum(error_count)>=5 as int) as "critical_status"
it doesn’t work.
Thanks!
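An untested suggestion rather than a confirmed answer: folding both conditions into a CASE expression over the aggregates may work where casting the compound boolean does not, depending on your Pinot version. The broker address below is a placeholder for illustration.

import requests

# Untested sketch: express the compound critical threshold with CASE WHEN
# over the aggregates instead of cast(<bool> or <bool> as int).
BROKER = "http://localhost:8099"  # placeholder broker address

sql = """
select region, clock,
  case when sum(error_count)/sum(request_count) >= 0.2
         or sum(error_count) >= 5
       then 1 else 0 end as critical_status,
  case when sum(error_count)/sum(request_count) >= 0.1
        and sum(error_count)/sum(request_count) < 0.2
       then 1 else 0 end as warning_status
from system_health
where clock >= 1675239600000 and clock < 1675240200000
group by region, clock
limit 500000
"""

resp = requests.post(f"{BROKER}/query/sql", json={"sql": sql})
resp.raise_for_status()
print(resp.json()["resultTable"]["rows"][:10])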
Uday Vallamsetty
02/04/2023, 5:19 AM
Ankit Sultana
02/06/2023, 11:42 PM
select * from table where col1 >= 10 and col2 = 'foo' and col3 = 'bar' and col4 < 20
Assuming col1 and col4 have range indices and col2 and col3 have inverted indices, would the indices be picked based on the order of the filter conditions?
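One way to check this empirically rather than guess, sketched here with the broker address as a placeholder and the table/column names taken from the question: run the query through EXPLAIN PLAN and look at which filter operator shows up for each predicate.

import requests

# Sketch, not a confirmed answer: inspect the query plan to see which index
# each predicate uses (e.g. FILTER_RANGE_INDEX vs FILTER_INVERTED_INDEX).
BROKER = "http://localhost:8099"  # placeholder broker address

query = (
    "EXPLAIN PLAN FOR "
    "select * from table where col1 >= 10 and col2 = 'foo' "
    "and col3 = 'bar' and col4 < 20"
)

resp = requests.post(f"{BROKER}/query/sql", json={"sql": query})
resp.raise_for_status()

# Each row of the result names a plan operator node; the filter operators
# show the index chosen per predicate.
for row in resp.json()["resultTable"]["rows"]:
    print(row)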
Lakshmanan Velusamy
02/07/2023, 1:21 AM
piby
02/07/2023, 8:41 AM
Rohit Anilkumar
02/07/2023, 5:29 PM
Sharayu Gandhi
02/07/2023, 11:23 PM
Hassan Ait Brik
02/08/2023, 4:36 PM
Richard Walker
02/08/2023, 10:14 PM
pinot-client-go
necessarily if there's a lower-level option...
abhinav wagle
02/09/2023, 10:25 PM
chandarasekaran m
02/10/2023, 11:26 AM
Abhijeet Kushe
02/13/2023, 4:19 PM
Abhijeet Kushe
02/13/2023, 4:20 PM
Weixiang Sun
02/14/2023, 4:47 AM
Shriniket
02/14/2023, 11:48 AM
Irtisaam
02/14/2023, 1:38 PM
Aamish Tariq
02/16/2023, 8:34 AM
vishal
02/16/2023, 1:02 PM
I am using the /segments/{tableName}/startReplaceSegments API in the Swagger UI to replace one segment with another, but I am not able to do it. I am passing segmentsFrom as the segments we want to replace with a new one, and segmentsTo as the other segment, but before that I have to create and upload that new segment. How do I create and upload it? I tried using a segment from the same table, but I get the error below:
Input:
{
  "segmentsTo": ["tmptab__4__1__20230214T1305Z"],
  "segmentsFrom": ["tmptab__0__0__20230214T0611Z"]
}
Error:
{
  "code": 500,
  "error": "Any segments from 'segmentsTo' should not be available in the table at this point. (tableName = 'tmptab_REALTIME', segmentsFrom = '[tmptab__0__0__20230214T0611Z]', segmentsTo = '[tmptab__4__1__20230214T1305Z]', segmentsFromTable = '[tmptab__0__1__20230214T1305Z, tmptab__2__1__20230214T1305Z, tmptab__3__1__20230214T1305Z, tmptab__0__0__20230214T0611Z, tmptab__1__1__20230214T1305Z, tmptab__4__1__20230214T1305Z, tmptab__5__1__20230214T1305Z, tmptab__2__0__20230214T0611Z, tmptab__1__0__20230214T0611Z, tmptab__4__0__20230214T0611Z, tmptab__3__0__20230214T0611Z, tmptab__5__0__20230214T0611Z]')"
}
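The error itself says what is going wrong: the names in segmentsTo must not already exist in the table when startReplaceSegments is called; they are new segments you still have to build (for example with the LaunchDataIngestionJobCommand shown earlier in the thread) and upload between the start and end calls. A rough, untested sketch of that sequence follows; the controller URL, file path and new segment name are made-up placeholders, and the exact endpoint and parameter names should be verified against your cluster's Swagger page.

import requests

# All names below are illustrative placeholders; verify endpoints/params in Swagger.
CONTROLLER = "http://localhost:9000"
TABLE = "tmptab"

# 1. Start the replacement: segmentsTo lists NEW segment names not yet in the table.
start = requests.post(
    f"{CONTROLLER}/segments/{TABLE}/startReplaceSegments",
    params={"type": "REALTIME"},
    json={
        "segmentsFrom": ["tmptab__0__0__20230214T0611Z"],
        "segmentsTo": ["tmptab_replacement_segment_0"],  # hypothetical new segment name
    },
)
start.raise_for_status()
# Field name taken from memory of the lineage API; confirm in your response payload.
lineage_entry_id = start.json()["segmentLineageEntryId"]

# 2. Build the new segment (e.g. with an ingestion job) and upload its tar.gz.
with open("tmptab_replacement_segment_0.tar.gz", "rb") as f:
    upload = requests.post(
        f"{CONTROLLER}/v2/segments",
        params={"tableName": TABLE},
        files={"file": f},
    )
upload.raise_for_status()

# 3. Finish the replacement so queries atomically switch from the old segment to the new one.
end = requests.post(
    f"{CONTROLLER}/segments/{TABLE}/endReplaceSegments",
    params={"type": "REALTIME", "segmentLineageEntryId": lineage_entry_id},
)
end.raise_for_status()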
@saurabh dubey @Xiang Fu @Tim Santos @Xiaobing @Jackie @Kartik Khare