Hey everyone, I'm trying to run below query but it...
# pinot-perf-tuning
a
Hey everyone, I'm trying to run below query but it is erroring out with
upstream request timeout
, when using single month data in filter then it is working. Not sure why error message isn't giving valid reason of failure. Can you guys suggest the right and necessary table config
Copy code
SELECT
    DISTINCTCOUNT( CASE WHEN data_source = 'web' THEN master_id ELSE 'a' END) as activity_account_count,
	DISTINCTCOUNT( CASE WHEN data_source in ('b2bn', 'b2bn_excluded') THEN master_id ELSE 'a' END) as kw_account_count,
    DISTINCTCOUNT( CASE WHEN data_source = 'fpm' THEN master_id ELSE 'a' END) as fpm_account_count,
    DISTINCTCOUNT( CASE WHEN data_source = 'tpm' THEN master_id ELSE 'a' END) as tpm_account_count,
    DISTINCTCOUNT( CASE WHEN data_source = 'crm' THEN master_id ELSE 'a' END) as crm_account_count,
    DISTINCTCOUNT( CASE WHEN data_source = 'map' THEN master_id ELSE 'a' END) as map_account_count,
    DISTINCTCOUNT( master_id) as all_account_count
FROM
    six_sense_dapm
WHERE
    dt BETWEEN '2022-01-01' AND '2022-08-30'
    AND data_source IN ('web', 'b2bn', 'b2bn_excluded', 'fpm', 'tpm', 'crm', 'map')
    AND product='__all__'
table config :
Copy code
{
  "OFFLINE": {
    "tableName": "sumologic_dapm_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "timeType": "DAYS",
      "schemaName": "sumologic_dapm",
      "replication": "2",
      "segmentPushType": "APPEND",
      "timeColumnName": "dt",
      "allowNullTimeValue": false
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [
        "data_source",
        "product",
        "source_activity_name"
      ],
      "noDictionaryColumns": [
        "external_id",
        "master_id",
        "secondary_id",
        "source_activity_desc",
        "source_activity_url",
        "source_activity_referrer_url",
        "source_activity_desc",
        "source_activity_url_r",
        "metric_value",
        "source_id"
      ],
      "rangeIndexColumns": [
        "dt"
      ],
      "optimizeDictionaryForMetrics": true,
      "enableDefaultStarTree": true,
      "enableDynamicStarTreeCreation": true,
      "aggregateMetrics": true,
      "nullHandlingEnabled": true,
      "rangeIndexVersion": 2,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [
        "data_source",
        "master_id"
      ],
      "loadMode": "MMAP"
    },
    "metadata": {},
    "isDimTable": false
  }
}
m
What’s the response you are seeing?
a
upstream request timeout
Screenshot 2022-09-15 at 12.10.11 AM.png
m
Can you run this query:
Copy code
SELECT count(*)
FROM
    six_sense_dapm
WHERE
    dt BETWEEN '2022-01-01' AND '2022-08-30'
    AND data_source IN ('web', 'b2bn', 'b2bn_excluded', 'fpm', 'tpm', 'crm', 'map')
    AND product='__all__'
The query seems to time out, probably because it is either too expensive, and/or you are under provisioned, or missing indexes.
a
yes we are able to run above query.
m
Could you share the response metadata?
a
not sure how to metadata in text.
m
Select json response type
a
Copy code
{
  "resultTable": {
    "dataSchema": {
      "columnNames": [
        "count(*)"
      ],
      "columnDataTypes": [
        "LONG"
      ]
    },
    "rows": [
      [
        28532577
      ]
    ]
  },
  "exceptions": [],
  "numServersQueried": 3,
  "numServersResponded": 3,
  "numSegmentsQueried": 48,
  "numSegmentsProcessed": 48,
  "numSegmentsMatched": 48,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 28532577,
  "numEntriesScannedInFilter": 54159993,
  "numEntriesScannedPostFilter": 0,
  "numGroupsLimitReached": false,
  "totalDocs": 78707892,
  "timeUsedMs": 453,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0,
  "numRowsResultSet": 1
}
m
"numEntriesScannedInFilter": 54159993,
Ok, setting range index on time column will help
However, the query is expensive, because you are performing 7 count distincts on 28M records.
You could also re-write this is a single count distinct with order by?
a
yes also let me apply range index as it is not reflecting in relaod status.
I am unable to see range-index on dt column (string) after reloading the data. In doc for rangeIndexColumn : Typically used for numeric columns and mostly on metrics.