# general
w
I apologize in advance for my ignorant question, but I'm struggling conceptually with how to handle dateTime column definitions in my table schema and segmentsConfig. I have a millisecond-level epoch field on my incoming realtime data (creatively named `eventTimestamp`). I would like to keep millisecond precision when querying/filtering records at the individual event level, but I would also like to define an hourly derived timestamp to use for pre-aggregation with a star-tree index. My segmentsConfig looks like this:
```json
"segmentsConfig": {
  "timeColumnName": "eventTimestamp",
  "timeType": "MILLISECONDS",
  "retentionTimeUnit": "HOURS",
  "retentionTimeValue": "48",
  "segmentPushType": "APPEND",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "schemaName": "mySchema",
  "replication": "1",
  "replicasPerPartition": "1"
},
```
My star tree index looks like this:
```json
"starTreeIndexConfigs": [{
  "dimensionsSplitOrder": [
    "dimension1",
    "dimension2"
  ],
  "skipStarNodeCreationForDimensions": [],
  "functionColumnPairs": [
    "SUM__metric1",
    "SUM__metric2",
    "SUM__metric3",
    "DISTINCT_COUNT_HLL__dimension3",
    "DISTINCT_COUNT_HLL__dimension4"
  ],
  "maxLeafRecords": 10000
}],
```
And my dateTimeFieldSpecs:
```json
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:HOURS",
    "dateTimeType": "PRIMARY"
  }
],
```
Can anyone confirm that this is the correct approach? Or should I instead be using an ingestion transformation like `toEpochHoursRounded`, specifying that as a DERIVED dateTimeField in the dateTimeFieldSpecs configuration, and manually adding it to the dimensionsSplitOrder of my star-tree index?
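In other words, something along these lines (just a sketch of what I mean; `eventTimestampHourly` is a made-up name for the derived column):
```json
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  },
  {
    "name": "eventTimestampHourly",
    "dataType": "LONG",
    "format": "1:HOURS:EPOCH",
    "granularity": "1:HOURS"
  }
]
```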
x
@Jackie I think in this case we need to add a new column for the hour-rounded time value and then build the star-tree on it, right?
w
@Xiang Fu Thank you, that makes sense to me, but I was confused as to why the dateTimeFieldSpec allows me to enter a granularity different from the incoming format. Also, the current airport examples all use the deprecated `timeFieldSpec`, which meant I had to go digging in the wiki and read the 0.4.0 release notes on deprecating `timeFieldSpec` before I realized I should be using `dateTimeFieldSpecs` instead. I might take a stab at updating the example + docs once I get this all straight in my head, to save other people the pain (as long as I'm on the right track here).
x
true, we are updating the code base with this PR: https://github.com/apache/incubator-pinot/pull/6392/files
will update the wiki as well
w
Heh, awesome - I also made the change locally for the `latest` image for submitting admin commands as jobs 🙂
x
the link you put was the outdated wiki
let me know if docs.pinot helps
we will keep that site updated
w
Thanks
So it looks like `dateTimeType` (e.g., `PRIMARY`, `SECONDARY`, or `DERIVED`) is no longer necessary?
x
it’s not
you can define multiple dateTimeFields
and specify the transform in the table
you can set `ingestionConfig` in the table config, e.g.
```json
{
  "tableName": "githubEvents",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "schemaName": "githubEvents",
    "replication": "1",
    "timeColumnName": "event_time",
    "timeType": "MILLISECONDS"
  },
  "tenants": {},
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "type",
          "repo_id"
        ],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "SUM__pull_request_additions",
          "SUM__pull_request_deletions",
          "SUM__pull_request_changed_files",
          "COUNT__star",
          "DISTINCT_COUNT_HLL__actor_id"
        ],
        "maxLeafRecords": 1000
      }
    ],
    "enableDynamicStarTreeCreation": true,
    "loadMode": "MMAP",
    "invertedIndexColumns": [],
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "repo_id": {
          "functionName": "Murmur",
          "numPartitions": 1024
        }
      }
    },
    "noDictionaryColumns": []
  },
  "routing": {
    "segmentPrunerTypes": [
      "partition"
    ]
  },
  "metadata": {
    "customConfigs": {}
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY",
      "batchConfigMaps": [],
      "segmentNameSpec": {},
      "pushSpec": {}
    },
    "transformConfigs": [
      {
        "columnName": "event_time",
        "transformFunction": "fromDateTime(created_at, \"yyyy-MM-dd'T'HH:mm:ssZ\")"
      }
    ]
  }
}
```
here I convert the string column `created_at` (format `yyyy-MM-dd'T'HH:mm:ssZ`) in the raw data to an epoch-millis value in `event_time`
you can specify more time fields and add them to transformConfigs, fyi: https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations#column-transformation
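for example, for the hourly column from your earlier question, an extra entry like this should do it (a sketch only; `eventTimestampHourly` is a placeholder name, and `toEpochHours` is the built-in that converts epoch millis to epoch hours):
```json
"transformConfigs": [
  {
    "columnName": "eventTimestampHourly",
    "transformFunction": "toEpochHours(eventTimestamp)"
  }
]
```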
w
Perfect, thank you. One more stupid question (hopefully the last one for the day)… what should I look for in the trace to verify that my query is using the star-tree index? Is there a Pinot equivalent of SQL `EXPLAIN`?
x
typically, from the results you can see `numDocsScanned`
which should be way less than the total docs
e.g.
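(For illustration, the relevant broker response metadata looks roughly like this — the numbers here are made up; the signal is `numDocsScanned` being orders of magnitude below `totalDocs`:)
```json
{
  "numDocsScanned": 1024,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 2048,
  "totalDocs": 52000000,
  "timeUsedMs": 12
}
```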
@Jackie might provide more insights here
w
Ok. I have inverted indices as well, so I was just trying to figure out how to ensure it was using the star-tree index instead - it is definitely showing far fewer docs scanned than total.
I just barely started ingestion, so I need to let it build up some more data 🙂
x
ic
for the consuming segment, I think there is no star-tree built
it will go to the inverted index
w
Ah
x
once the segment is sealed, star-tree will be built
w
That makes sense
j
Another way is to enable tracing for the query and see if it uses the `StarTreeFilterOperator`
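(For reference, per-query tracing is turned on with a `trace` flag in the request body POSTed to the broker's `/query/sql` endpoint — a minimal sketch, where `myTable` is a placeholder table name:)
```json
{
  "sql": "SELECT SUM(metric1) FROM myTable GROUP BY dimension1",
  "trace": true
}
```
(The `traceInfo` section of the response then lists the operators used, which is where `StarTreeFilterOperator` would appear.)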
For the date time fields, is this column already rounded to each hour?
```json
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:HOURS"
  }
],
```
If so, you can directly use it as a star-tree dimension; if not, you can create a new rounded time column and use that in the star-tree.
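(Putting the pieces together, the rounded column would simply become another entry in `dimensionsSplitOrder` — `eventTimestampHourly` is still a placeholder name:)
```json
"starTreeIndexConfigs": [{
  "dimensionsSplitOrder": [
    "dimension1",
    "dimension2",
    "eventTimestampHourly"
  ],
  "skipStarNodeCreationForDimensions": [],
  "functionColumnPairs": [
    "SUM__metric1",
    "SUM__metric2",
    "SUM__metric3",
    "DISTINCT_COUNT_HLL__dimension3",
    "DISTINCT_COUNT_HLL__dimension4"
  ],
  "maxLeafRecords": 10000
}]
```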