# general
w
I apologize in advance for my ignorant question, but I'm struggling conceptually with how to handle dateTime column definitions in my table schema and segmentsConfig. I have a millisecond-level epoch field on my incoming realtime data (creatively named `eventTimestamp`). I would like to keep millisecond precision when querying/filtering records at the individual event level, but I would also like to define an hourly derived timestamp to use for pre-aggregation with a star-tree index. My segmentsConfig looks like this:
```json
"segmentsConfig": {
  "timeColumnName": "eventTimestamp",
  "timeType": "MILLISECONDS",
  "retentionTimeUnit": "HOURS",
  "retentionTimeValue": "48",
  "segmentPushType": "APPEND",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "schemaName": "mySchema",
  "replication": "1",
  "replicasPerPartition": "1"
},
```
My star tree index looks like this:
```json
"starTreeIndexConfigs": [{
  "dimensionsSplitOrder": [
    "dimension1",
    "dimension2"
  ],
  "skipStarNodeCreationForDimensions": [],
  "functionColumnPairs": [
    "SUM__metric1",
    "SUM__metric2",
    "SUM__metric3",
    "DISTINCT_COUNT_HLL__dimension3",
    "DISTINCT_COUNT_HLL__dimension4"
  ],
  "maxLeafRecords": 10000
}],
```
And my dateTimeFieldSpecs:
```json
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:HOURS",
    "dateTimeType": "PRIMARY"
  }
],
```
Can anyone confirm that this is the correct approach? Or should I instead be using an ingestion transformation like `toEpochHoursRounded`, specifying that as a DERIVED dateTimeField in the dateTimeFieldSpecs configuration, and manually adding it to the dimensionsSplitOrder of my star-tree index?
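In other words, something along these lines (just a sketch of what I mean; `eventTimestampHourly` is a made-up name for the derived column):
```json
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  },
  {
    "name": "eventTimestampHourly",
    "dataType": "LONG",
    "format": "1:HOURS:EPOCH",
    "granularity": "1:HOURS"
  }
]
```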
x
@Jackie I think in this case we need to add a new column for the hour-rounded time value and then build the star-tree on it, right?
w
@Xiang Fu Thank you, that makes sense to me, but I was confused as to why the dateTimeFieldSpec allows me to enter a granularity different from the incoming format. Also, the current airport examples all use the deprecated `timeFieldSpec`, which meant I had to go digging in the wiki and read the 0.4.0 release notes on deprecating `timeFieldSpec` before I realized I should be using `dateTimeFieldSpecs` instead. I might take a stab at updating the example + docs once I get this all straight in my head, to save other people the pain (as long as I'm on the right track here).
x
true, we are updating the code base with this PR: https://github.com/apache/incubator-pinot/pull/6392/files
will update the wiki as well
w
Heh, awesome - I also made the change locally for the `latest` image for submitting admin commands as jobs 🙂
x
the link you put was the outdated wiki
let me know if docs.pinot helps
we will keep that site updated
w
Thanks
So it looks like `dateTimeType` (e.g., `PRIMARY`, `SECONDARY`, or `DERIVED`) is no longer necessary?
x
it’s not
you can define multiple dateTimeFields
and specify the transform in the table
you can set `ingestionConfig` in the table config, e.g.
```json
{
  "tableName": "githubEvents",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "schemaName": "githubEvents",
    "replication": "1",
    "timeColumnName": "event_time",
    "timeType": "MILLISECONDS"
  },
  "tenants": {},
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "type",
          "repo_id"
        ],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "SUM__pull_request_additions",
          "SUM__pull_request_deletions",
          "SUM__pull_request_changed_files",
          "COUNT__star",
          "DISTINCT_COUNT_HLL__actor_id"
        ],
        "maxLeafRecords": 1000
      }
    ],
    "enableDynamicStarTreeCreation": true,
    "loadMode": "MMAP",
    "invertedIndexColumns": [],
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "repo_id": {
          "functionName": "Murmur",
          "numPartitions": 1024
        }
      }
    },
    "noDictionaryColumns": []
  },
  "routing": {
    "segmentPrunerTypes": [
      "partition"
    ]
  },
  "metadata": {
    "customConfigs": {}
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY",
      "batchConfigMaps": [],
      "segmentNameSpec": {},
      "pushSpec": {}
    },
    "transformConfigs": [
      {
        "columnName": "event_time",
        "transformFunction": "fromDateTime(created_at, \"yyyy-MM-dd'T'HH:mm:ssZ\")"
      }
    ]
  }
}
```
here I convert the string column `created_at` (format `yyyy-MM-dd'T'HH:mm:ssZ`) in the raw data to an epoch-millis value in `event_time`
you can specify more time fields and add them to transformConfigs, fyi: https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations#column-transformation
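for example, for the hourly column from your earlier question, an extra entry like this should do it (a sketch only; `eventTimestampHourly` is a placeholder name, and `toEpochHours` is the built-in that converts epoch millis to epoch hours):
```json
"transformConfigs": [
  {
    "columnName": "eventTimestampHourly",
    "transformFunction": "toEpochHours(eventTimestamp)"
  }
]
```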
w
Perfect, thank you. One more stupid question (hopefully the last one for the day)… what should I look for in the trace to verify that my query is using the star-tree index? Is there a Pinot equivalent of SQL `EXPLAIN`?
x
typically, from the results you can see `numDocsScanned`
which should be way less than the total docs
e.g.
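(For illustration, the relevant broker response metadata looks roughly like this — the numbers here are made up; the signal is `numDocsScanned` being orders of magnitude below `totalDocs`:)
```json
{
  "numDocsScanned": 1024,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 2048,
  "totalDocs": 52000000,
  "timeUsedMs": 12
}
```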
@Jackie might provide more insights here
w
Ok. I have inverted indices as well, so I was just trying to figure out how to ensure it was using the star-tree index instead - it is definitely showing far fewer docs scanned than total.
I just barely started ingestion, so I need to let it build up some more data 🙂
x
ic
for the consuming segment, I think there is no star-tree built
it will go to the inverted index
w
Ah
x
once the segment is sealed, star-tree will be built
w
That makes sense
j
Another way is to enable tracing for the query and see if it uses the `StarTreeFilterOperator`
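(For reference, per-query tracing is turned on with a `trace` flag in the request body POSTed to the broker's `/query/sql` endpoint — a minimal sketch, where `myTable` is a placeholder table name:)
```json
{
  "sql": "SELECT SUM(metric1) FROM myTable GROUP BY dimension1",
  "trace": true
}
```
(The `traceInfo` section of the response then lists the operators used, which is where `StarTreeFilterOperator` would appear.)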
For the date time fields, is this column already rounded to each hour?
```json
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:HOURS"
  }
],
```
If so, you can directly use it as a star-tree dimension; if not, you can create a new rounded time column and use that in the star-tree.
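(Putting the pieces together, the rounded column would simply become another entry in `dimensionsSplitOrder` — `eventTimestampHourly` is still a placeholder name:)
```json
"starTreeIndexConfigs": [{
  "dimensionsSplitOrder": [
    "dimension1",
    "dimension2",
    "eventTimestampHourly"
  ],
  "skipStarNodeCreationForDimensions": [],
  "functionColumnPairs": [
    "SUM__metric1",
    "SUM__metric2",
    "SUM__metric3",
    "DISTINCT_COUNT_HLL__dimension3",
    "DISTINCT_COUNT_HLL__dimension4"
  ],
  "maxLeafRecords": 10000
}]
```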