# troubleshooting
j
Are you saying you are ingesting with segment granularity of WEEK, and the segment start/end times don't match your Sun-Sat week boundary convention? I don't know of a way to change the start day-of-week for WEEK time periods ... but as far as ingestion and the way the Druid time chunks are built, that is a physical storage mechanism and should not affect your data and/or query results. If something else is going on, e.g. you think time values are changing in the data upon ingestion, then please provide details as that should not happen.
b
Yes, ingesting with segment granularity of WEEK and passing intervals as 2023-04-09/2023-04-15, with this timestampSpec:
"timestampSpec": {
  "column": "cal_wk_start_dt",
  "format": "iso"
}
so cal_wk_start_dt is being mapped to __time, but the values in __time are different from the source values coming in cal_wk_start_dt
I was using the __time column for filtering while querying; if the data is not as expected from the source, how are my filters going to work?
j
try running the ingestion without specifying intervals ... you may be forcing the __time value to fit within a Druid week boundary by doing that.
b
Tried that ... by default it takes Monday as the first day of the calendar week, but the source data has Sunday as the first day of the calendar week.
j
Can you share the granularitySpec you are using?
b
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "datasource_test",
      "timestampSpec": null,
      "dimensionsSpec": null,
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "WEEK",
        "queryGranularity": "WEEK",
        "rollup": true,
        "intervals": [
          "#dataInterval#"
        ]
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "columns": [
            "start_dt",
            "col1",
            "col12",
            "col3",
            "col4",
            "col5"
          ],
          "timestampSpec": {
            "column": "start_dt",
            "format": "iso"
          },
          "dimensionsSpec": {
            "dimensions": [
              {
                "type": "long",
                "name": "start_dt"
              },
              {
                "type": "string",
                "name": "col1"
              },
              {
                "type": "string",
                "name": "col2"
              },
              {
                "type": "string",
                "name": "col3"
              },
              {
                "type": "string",
                "name": "col4"
              },
              {
                "type": "string",
                "name": "col5"
              }
            ],
            "dimensionExclusions": []
          }
        }
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "dataGranularity": "week",
        "filePattern": ".*",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "pathFormat": "'wk_nbr='yyyyww/",
        "inputPath": "<gs://gcslocation/gcstable/>"
      },
      "metadataUpdateSpec": null,
      "segmentOutputPath": null
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": null,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 5,
        "partitionDimensions": [],
        "partitionFunction": "murmur3_32_abs",
        "maxRowsPerSegment": null
      },
      "shardSpecs": {},
      "indexSpec": {
        "bitmap": {
          "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "appendableIndexSpec": {
        "type": "onheap"
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "leaveIntermediate": false,
      "cleanupOnFailure": true,
      "overwriteFiles": false,
      "ignoreInvalidRows": false,
      "jobProperties": {
        "mapreduce.job.classloader": "true",
        "mapreduce.job.user.classpath.first": "true",
        "mapreduce.input.fileinputformat.list-status.num-threads": "8",
        "mapreduce.map.memory.mb": "5461",
        "mapreduce.reduce.memory.mb": "5461",
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.java.opts": "-Xmx4096m",
        "mapreduce.reduce.java.opts": "-Xmx4096m",
        "mapreduce.job.split.metainfo.maxsize": "-1",
        "mapreduce.task.io.sort.mb": "2047",
        "mapred.job.reuse.jvm.num.tasks": "20",
        "io.seqfile.sorter.recordlimit": "10000000",
        "mapred.output.compress": "true",
        "mapreduce.job.reduce.slowstart.completedmaps": "0.5",
        "mapreduce.reduce.shuffle.merge.percent": "0.8"
      },
      "combineText": false,
      "useCombiner": false,
      "buildV9Directly": true,
      "numBackgroundPersistThreads": 0,
      "forceExtendableShardSpecs": false,
      "useExplicitVersion": false,
      "allowedHadoopPrefix": [],
      "logParseExceptions": false,
      "maxParseExceptions": 0,
      "useYarnRMJobStatusFallback": true
    }
  },
  "hadoopDependencyCoordinates": null,
  "classpathPrefix": null,
  "context": {
    "forceTimeChunkLock": true,
    "useLineageBasedSegmentAllocation": true
  }
}
j
Okay ... your queryGranularity is set to WEEK ... that is rounding down the __time values to the week baseline (which is Druid's baseline, starting on Monday). If you need a date field to consistently be set to the start of your Sunday-based week, I'm not sure how to do that other than with a transform that uses some time functions to figure out the Sunday start-of-week date. You may be able to use the TIME_FLOOR() function with the 'origin' parameter for this ... or EXTRACT(DOW ...).
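For example, a minimal query-side sketch of the 'origin' approach (untested; the Sunday anchor 2023-04-09 00:00:00 and the datasource name are assumptions taken from earlier in this thread):
SELECT
  TIME_FLOOR(__time, 'P1W', TIMESTAMP '2023-04-09 00:00:00') AS sunday_week_start
FROM datasource_test
LIMIT 10
Because the origin falls on a Sunday, each 1-week bucket starts on a Sunday instead of Druid's default Monday.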
And just to confirm, you are doing rollups as well, so you want your data aggregated during ingestion based on the dimensions listed?
b
Actually not rolling up the data during ingestion.
Not able to figure out how to use the 'origin' parameter within the ingestion spec.
j
In your granularitySpec you have "rollup": true ... set that to 'false' if you are not rolling up your data during ingestion (note that the default is true, so simply removing the parameter would leave rollup on). Here is a simple expression to get the Sunday start-of-week date in a query:
select TIMESTAMPADD(DAY, -extract(dow from CURRENT_TIMESTAMP), CURRENT_TIMESTAMP)
For native ingestion the functions are timestamp_shift() and timestamp_extract() ... I can't get the ingestion expression to take in my demo dataset ... maybe you can get it working.
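An untested sketch of such an ingestion transform, for illustration only; it uses timestamp_floor() with an origin rather than timestamp_shift()/timestamp_extract(), and the Sunday anchor 2023-04-09 is an assumption:
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "__time",
      "expression": "timestamp_floor(__time, 'P1W', timestamp('2023-04-09'))"
    }
  ]
}
A transform named __time overwrites the row timestamp, so the stored value would land on the Sunday that starts each week; the origin date itself must be a Sunday for the buckets to line up.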
a
A quick side note: WEEK granularity can be tricky since it doesn't always align well with months or years. Consider using DAY or MONTH instead - see a recent change that advises against it: https://github.com/apache/druid/pull/14341/files?short_path=2b1d633#diff-2b1d6334204fbf5b1a3bbafb48a34b341caf65e368934447c516f15173226569
Also, if you can't move away from WEEK granularity, I think you could also do something like TIME_FLOOR(__time, 'P1W'), similar to John's suggestion above.
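For instance, a hypothetical query-time grouping (the datasource name and the Sunday origin are assumptions):
SELECT TIME_FLOOR(__time, 'P1W', TIMESTAMP '2023-04-09 00:00:00') AS wk_start, COUNT(*) AS row_count
FROM datasource_test
GROUP BY 1
This leaves the stored __time untouched and only re-buckets it per query.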
b
As per John's suggestion, the data in the __time column will still be incorrect, though it can be adjusted while querying. Sometimes that may create confusion for the end users.
j
Hi Basayya, my suggestion for timestamp_shift() and timestamp_extract() was to transform your __time value correctly to the Sunday start of the week during ingestion. Just don't use queryGranularity in the spec, because that also changes your __time value, and you don't want that (because of the Monday week base). Remember, the segment granularity can be anything you want ... that's a physical storage mechanism, it does not affect data values.
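Putting that together, a rough sketch of the adjusted granularitySpec (untested; values carried over from the spec shared above, with the Sunday alignment handled by the __time transform sketched earlier):
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "WEEK",
  "queryGranularity": "NONE",
  "rollup": false
}
Omitting queryGranularity entirely has the same effect, since it defaults to NONE.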
b
Thanks John, let me try transforming __time value and test it