# troubleshooting
t
Hi everybody, I have a question for you. I have a table/schema configured like:
{
  "OFFLINE": {
    "tableName": "DailyUniqHll_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeType": "DAYS",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "365",
      "replication": "1",
      "timeColumnName": "partition",
      "allowNullTimeValue": false
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "enableDefaultStarTree": false,
      "starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": [
            "partition",
            "fields.1",
            "fields.2",
            "fields.3",
            "fields.4",
            "fields.5",
            "fields.6",
            "fields.7",
            "fields.8",
            "fields.9"
          ],
          "functionColumnPairs": [
            "SUM__counters.c",
            "DISTINCTCOUNTHLL__hllState"
          ],
          "maxLeafRecords": 1000
        }
      ],
      "enableDynamicStarTreeCreation": true,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false
    },
    "metadata": {},
    "ingestionConfig": {
      "batchIngestionConfig": {
        "segmentIngestionType": "APPEND",
        "segmentIngestionFrequency": "DAILY"
      },
      "complexTypeConfig": {
        "fieldsToUnnest": [
          "fields",
          "counters"
        ],
        "delimiter": ".",
        "collectionNotUnnestedToJson": "NON_PRIMITIVE"
      }
    },
    "isDimTable": false
  }
}
Schema:
{
  "schemaName": "ViewElementDailyUniqHll",
  "dimensionFieldSpecs": [
    {
      "name": "fields.1",
      "dataType": "STRING"
    },
    {
      "name": "fields.2",
      "dataType": "STRING"
    },
    {
      "name": "fields.3",
      "dataType": "STRING"
    },
    {
      "name": "fields.4",
      "dataType": "STRING"
    },
    {
      "name": "fields.5",
      "dataType": "STRING"
    },
    {
      "name": "fields.6",
      "dataType": "STRING"
    },
    {
      "name": "fields.7",
      "dataType": "STRING"
    },
    {
      "name": "fields.8",
      "dataType": "STRING"
    },
    {
      "name": "fields.9",
      "dataType": "STRING"
    },
    {
      "name": "cubeName",
      "dataType": "STRING"
    },
    {
      "name": "list",
      "dataType": "LONG",
      "singleValueField": false
    },
    {
      "name": "hllState",
      "dataType": "BYTES"
    },
    {
      "name": "counters.c",
      "dataType": "INT"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "partition",
      "dataType": "STRING",
      "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}
When I ingest some data I get a ~10x size increase because of DISTINCTCOUNTHLL__hllState in the star-tree index. Is this expected? Is something misconfigured?
m
do you mean 10x more than if that field isn't included in the index?
m
What does hllState contain? From your config, another HLL will be created in which each hllState value is an element of the set. Is that what you intend?
t
"do you mean 10x more than if that field isn't included in the index?"
Yes.
hllState contains a byte array representing the pre-estimation state of the HLL algorithm.
k
Which column is HLL representing?
m
It is the hllState column, which is a serialized HLL, from what I understand @Kishore G.
You do split on several dimensions and have a low maxLeafRecords value; that may be contributing to some (if not all) of the increase.
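To make the point above concrete, here is a rough back-of-the-envelope sketch of why an HLL metric can dominate the index size when many star-tree records are kept. The byte sizes and the record count are assumptions chosen for illustration, not values measured from Pinot; the real size of a serialized HLL depends on its precision setting.

```python
# Rough, illustrative estimate only. The byte sizes and record count below
# are assumptions for the sake of the example, not measured Pinot values.

def metric_bytes(num_star_tree_records: int, bytes_per_value: int) -> int:
    """Total bytes one pre-aggregated metric column would need."""
    return num_star_tree_records * bytes_per_value

ASSUMED_SUM_BYTES = 8    # assumption: one 8-byte number per record
ASSUMED_HLL_BYTES = 180  # assumption: one serialized HLL state per record

records = 1_000_000      # hypothetical number of star-tree records
sum_total = metric_bytes(records, ASSUMED_SUM_BYTES)
hll_total = metric_bytes(records, ASSUMED_HLL_BYTES)

print(f"SUM column: {sum_total / 1e6:.1f} MB")
print(f"HLL column: {hll_total / 1e6:.1f} MB")
print(f"ratio:      {hll_total / sum_total:.1f}x")
```

Under these assumptions every record that survives pre-aggregation carries a full HLL state, so the fewer rows that collapse together, the more the HLL column costs.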
k
The size increase basically means that at least one of the fields 1 to 9 has very high cardinality, so not much aggregation is happening when the star-tree index is created.
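One way to check this before building the index is to sample some raw rows and count how many distinct dimension combinations they contain. The sketch below is a minimal illustration, assuming the rows are available as Python dicts keyed by column name; the helper names and the sample data are hypothetical. It only counts the base pre-aggregated groups, and a real star-tree also stores extra star-node aggregates on top of that.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]

def aggregation_ratio(rows: List[Row], dimensions: Sequence[str]) -> Tuple[int, int, float]:
    """Return (raw rows, distinct dimension combinations, rows per combination)."""
    groups = {tuple(row[d] for d in dimensions) for row in rows}
    ratio = len(rows) / len(groups) if groups else 0.0
    return len(rows), len(groups), ratio

def cardinality_per_dimension(rows: List[Row], dimensions: Sequence[str]) -> Dict[str, int]:
    """Count distinct values per dimension to spot the high-cardinality one."""
    return {d: len({row[d] for row in rows}) for d in dimensions}

# Hypothetical sample: fields.2 is unique per row, so nothing aggregates.
sample = [
    {"partition": "2023-01-01", "fields.1": "a", "fields.2": "u1"},
    {"partition": "2023-01-01", "fields.1": "a", "fields.2": "u2"},
    {"partition": "2023-01-01", "fields.1": "b", "fields.2": "u3"},
    {"partition": "2023-01-01", "fields.1": "b", "fields.2": "u4"},
]

all_dims = ["partition", "fields.1", "fields.2"]
print(aggregation_ratio(sample, all_dims))
print(aggregation_ratio(sample, ["partition", "fields.1"]))
print(cardinality_per_dimension(sample, all_dims))
```

A ratio close to 1 means almost every raw row becomes its own star-tree record, each with its own HLL state. Dropping the highest-cardinality field from dimensionsSplitOrder, if queries do not need it, is what would let the records collapse.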