Hi everybody I have a question for you I have a table schema Apache Pinot #troubleshooting

Tommaso Peresson

06/06/2022, 10:23 AM

Hi everybody, I have a question for you. I have a table/schema configured like:

{
  "OFFLINE": {
    "tableName": "DailyUniqHll_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeType": "DAYS",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "365",
      "replication": "1",
      "timeColumnName": "partition",
      "allowNullTimeValue": false
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "enableDefaultStarTree": false,
      "starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": [
            "partition",
            "fields.1",
            "fields.2",
            "fields.3",
            "fields.4",
            "fields.5",
            "fields.6",
            "fields.7",
            "fields.8",
            "fields.9"
          ],
          "functionColumnPairs": [
            "SUM__counters.c",
            "DISTINCTCOUNTHLL__hllState"
          ],
          "maxLeafRecords": 1000
        }
      ],
      "enableDynamicStarTreeCreation": true,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false
    },
    "metadata": {},
    "ingestionConfig": {
      "batchIngestionConfig": {
        "segmentIngestionType": "APPEND",
        "segmentIngestionFrequency": "DAILY"
      },
      "complexTypeConfig": {
        "fieldsToUnnest": [
          "fields",
          "counters"
        ],
        "delimiter": ".",
        "collectionNotUnnestedToJson": "NON_PRIMITIVE"
      }
    },
    "isDimTable": false
  }
}

Schema:

{
  "schemaName": "ViewElementDailyUniqHll",
  "dimensionFieldSpecs": [
    {
      "name": "fields.1",
      "dataType": "STRING"
    },
    {
      "name": "fields.2",
      "dataType": "STRING"
    },
    {
      "name": "fields.3",
      "dataType": "STRING"
    },
    {
      "name": "fields.4",
      "dataType": "STRING"
    },
    {
      "name": "fields.5",
      "dataType": "STRING"
    },
    {
      "name": "fields.6",
      "dataType": "STRING"
    },
    {
      "name": "fields.7",
      "dataType": "STRING"
    },
    {
      "name": "fields.8",
      "dataType": "STRING"
    },
    {
      "name": "fields.9",
      "dataType": "STRING"
    },
    {
      "name": "cubeName",
      "dataType": "STRING"
    },
    {
      "name": "list",
      "dataType": "LONG",
      "singleValueField": false
    },
    {
      "name": "hllState",
      "dataType": "BYTES"
    },
    {
      "name": "counters.c",
      "dataType": "INT"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "partition",
      "dataType": "STRING",
      "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
      "granularity": "1:DAYS"
    }
  ]
}

When I ingest some data I get a ~10x size increase because of DISTINCTCOUNTHLL__hllState in the star tree index. Is this expected? Is there something misconfigured?

Mark Needham

06/06/2022, 12:45 PM

do you mean 10x more than if that field isn't included in the index?

Mayank

06/06/2022, 1:21 PM

What does hllState contain? With your config, another HLL will be created in which each hllState value is an element of the set. Is that what you intend?

Tommaso Peresson

06/06/2022, 1:22 PM

"do you mean 10x more than if that field isn't included in the index?"

Yes.

Tommaso Peresson

06/06/2022, 1:23 PM

hllState contains a byte array representing the pre-estimation state of the HLL algorithm.

Kishore G

06/06/2022, 2:20 PM

Which column does the HLL represent?

Mayank

06/06/2022, 2:25 PM

It is the hllState column, which holds a serialized HLL, from what I understand, @Kishore G.

Mayank

06/06/2022, 2:26 PM

You do split on several dimensions and have a low maxLeafRecords value; that may be contributing to some (if not all) of the size increase.
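
To try out this suggestion, one could generate a variant of the star-tree config with fewer split dimensions and a larger maxLeafRecords, then compare segment sizes. The following is only a sketch: the helper name `relaxed_star_tree_config`, the choice of which dimensions to keep, and the value 10000 are assumptions for illustration, not settings recommended in this thread.

```python
import json

# Star-tree config copied from the table config in the question.
ORIGINAL = {
    "dimensionsSplitOrder": ["partition"] + [f"fields.{i}" for i in range(1, 10)],
    "functionColumnPairs": ["SUM__counters.c", "DISTINCTCOUNTHLL__hllState"],
    "maxLeafRecords": 1000,
}


def relaxed_star_tree_config(config, keep_dimensions, max_leaf_records):
    """Return a shallow copy of config limited to keep_dimensions,
    with maxLeafRecords replaced (hypothetical helper)."""
    unknown = [d for d in keep_dimensions if d not in config["dimensionsSplitOrder"]]
    if unknown:
        raise ValueError(f"not in dimensionsSplitOrder: {unknown}")
    relaxed = dict(config)
    # Keep the original split order for the dimensions that remain.
    relaxed["dimensionsSplitOrder"] = [
        d for d in config["dimensionsSplitOrder"] if d in keep_dimensions
    ]
    relaxed["maxLeafRecords"] = max_leaf_records
    return relaxed


# Assumed choice: keep only three dimensions and raise the leaf threshold.
candidate = relaxed_star_tree_config(
    ORIGINAL, {"partition", "fields.1", "fields.2"}, 10000
)
print(json.dumps(candidate, indent=2))
```

Note that queries filtering or grouping on a dimension dropped from dimensionsSplitOrder would no longer be served by this star tree, so which dimensions to keep depends on the query patterns.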

Kishore G

06/06/2022, 2:28 PM

The size increase basically means that at least one of the fields 1 to 9 has very high cardinality

Kishore G

06/06/2022, 2:29 PM

And so there is not much aggregation happening when the star tree index is created
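
A rough way to check this before ingesting is to measure, on a sample of the input records, the cardinality of each dimension and how many distinct dimension combinations exist relative to the row count. The sketch below is plain Python over made-up sample rows; the function names and the data are hypothetical, and it only approximates what the star tree does (it ignores star nodes and maxLeafRecords).

```python
from collections import defaultdict


def dimension_cardinalities(rows, dimensions):
    """Count distinct values per dimension across the sample rows."""
    seen = defaultdict(set)
    for row in rows:
        for dim in dimensions:
            seen[dim].add(row[dim])
    return {dim: len(seen[dim]) for dim in dimensions}


def aggregation_ratio(rows, dimensions):
    """Distinct dimension combinations divided by row count.

    A value close to 1.0 means almost every row has a unique combination,
    so pre-aggregating on these dimensions merges very little.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    combos = {tuple(row[dim] for dim in dimensions) for row in rows}
    return len(combos) / len(rows)


# Made-up sample in which fields.2 is unique per row (high cardinality).
dims = ["partition", "fields.1", "fields.2"]
sample = [
    {"partition": "2022-06-01", "fields.1": "a", "fields.2": f"user-{i}"}
    for i in range(100)
]

print(dimension_cardinalities(sample, dims))
print(aggregation_ratio(sample, dims))                         # every row is its own group
print(aggregation_ratio(sample, ["partition", "fields.1"]))    # all rows collapse into one group
```

In this toy sample, including the high-cardinality dimension leaves one group per row, while dropping it collapses all rows into a single group. Since each pre-aggregated record carries its own serialized HLL, a ratio near 1.0 on real data would be consistent with the size increase described above.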
