# general
s
same latitude? What does your datasource look like? what's the query? Also 0.17?! 😄 26.0 is pretty cool 😉
y
Hello, my data source comes from Hadoop. When I query by hour, it returns multiple records within the same hour. This happens when I sort by date, but when I sort by hour some of the results are normal. I don't understand how the underlying query is parsed and executed.
Yes, for various reasons we are unable to directly upgrade Druid to the new version, and I am not sure whether the new version would avoid this issue anyway. Attached is a result of my query.
s
That does seem like a bug... can you share the EXPLAIN for the query?
y
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "table_name"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "2023-06-01T00:00:00.000Z/2023-06-01T00:00:00.001Z"
    ]
  },
  "virtualColumns": [],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "action_day",
        "value": "2023-06-01",
        "extractionFn": null
      },
      {
        "type": "selector",
        "dimension": "adx_name",
        "value": "xxx",
        "extractionFn": null
      },
      {
        "type": "or",
        "fields": [
          {
            "type": "bound",
            "dimension": "impression",
            "lower": "0",
            "upper": null,
            "lowerStrict": true,
            "upperStrict": false,
            "extractionFn": null,
            "ordering": {
              "type": "numeric"
            }
          },
          {
            "type": "bound",
            "dimension": "bid",
            "lower": "0",
            "upper": null,
            "lowerStrict": true,
            "upperStrict": false,
            "extractionFn": null,
            "ordering": {
              "type": "numeric"
            }
          }
        ]
      }
    ]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "action_hour",
      "outputName": "d0",
      "outputType": "LONG"
    },
    {
      "type": "default",
      "dimension": "action_day",
      "outputName": "d1",
      "outputType": "STRING"
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "a0",
      "fieldName": "impression",
      "expression": null
    },
    {
      "type": "longSum",
      "name": "a1",
      "fieldName": "bid",
      "expression": null
    }
  ],
  "postAggregations": [
    {
      "type": "expression",
      "name": "p0",
      "expression": "(\"a0\" * 1)",
      "ordering": null
    },
    {
      "type": "expression",
      "name": "p1",
      "expression": "(\"a1\" * 1)",
      "ordering": null
    }
  ],
  "having": null,
  "limitSpec": {
    "type": "default",
    "columns": [
      {
        "dimension": "d1",
        "direction": "ascending",
        "dimensionOrder": {
          "type": "lexicographic"
        }
      }
    ],
    "limit": 100
  },
  "context": {
    "sqlOuterLimit": 100,
    "sqlQueryId": "c48b6789-4ce7-4157-a2f2-f79e0a2bdfde"
  },
  "descending": false
}
I think it may be because the underlying segments were not merged? My current settings are as follows: `queryGranularity` is set to `hour` and `segmentGranularity` is set to `day`.
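For context, these two settings typically live in the `granularitySpec` section of the ingestion spec. A minimal sketch of what that section might look like (the `rollup` flag and exact layout are assumptions, not taken from this thread):

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": true
  }
}
```

With rollup enabled and `queryGranularity` of `HOUR`, rows sharing the same truncated hour and dimension values would normally be combined into one row at ingestion time, which is part of why duplicate per-hour results are surprising here.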
s
It shouldn't matter that they were not merged, the groupBy query should aggregate them into a single result per group by dimension values... I'll look for bug fixes that may already resolve this between 0.17 and 26.0 and let you know what I find. In the meantime, is this something you can consistently reproduce?
y
Sorry for not responding to the message in a timely manner. This error can be reproduced: it occurs when I query yesterday's data, but querying the day before yesterday was normal.
It's June 7th now, and when I query the data for June 6th I get many duplicate entries, like this
s
This bug says it affects 0.18 but it may also affect 0.17. Does it fit your query pattern? https://github.com/apache/druid/issues/9866
y
Thank you for your help. It seems different from mine: I don't have any joins or subqueries, just the query plan I replied with earlier. It looks like a regular `group by` query. Also, in Druid version 0.17, joins do not seem to be supported: https://druid.apache.org/docs/0.17.0/querying/joins.html
s
I’ve been searching for other issue reports that might fit… haven’t found anything specific. In the interest of figuring this out, can you try a few things separately on the date where you see the problem:
• Remove the CAST for action_hour
• Remove the limit
• Remove the order by
Also, can you share the segments view for the corresponding segments for the day that fails and for the day that doesn’t?
y
I have tried these methods separately. Removing the CAST works, while the others made no difference:
• Remove the limit
• Remove the order by
The segment distribution is like this. Regarding segments, I have noticed a phenomenon: all the other days have the same segment layout, but yesterday’s data always has five segments, and only yesterday’s data causes this problem. Does that seem like a coincidence? Meanwhile, how can I set the segments to merge, to verify that merging avoids this problem?
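On the merging question: one way to merge the segments for an interval is to submit a compaction task to the Overlord (`POST /druid/indexer/v1/task`). A minimal sketch, reusing the datasource name from the earlier query plan as a placeholder; the exact accepted fields vary by Druid version, so check the 0.17 docs before running this:

```json
{
  "type": "compact",
  "dataSource": "table_name",
  "interval": "2023-06-06/2023-06-07"
}
```

Newer versions can also enable auto-compaction on the Coordinator so this happens continuously, but for a one-off verification a manual task like the above should be enough.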