# troubleshooting
j
I have done compactions with mixed segment overshadowing and did not have a problem. In your compaction spec are you only specifying the target time range to compact, not specific segments?
k
What process is adding the small amount of data? Can you make that process use appendToExisting=true?
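For reference, in a Druid native batch (index_parallel) task that flag lives in the task's ioConfig; a minimal sketch, trimmed to the relevant fields (values are illustrative, not taken from this thread):
"ioConfig": {
  "type": "index_parallel",
  "appendToExisting": true
}
Note that appendToExisting only takes effect with dynamic partitioning; with hashed or range partitioning the task overwrites the interval regardless.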
d
this was my spec
{
  "type": "compact",
  "dataSource": "dev.do.overshadow-test",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "segments",
      "segments": [
        "dev.do.overshadow-test_2023-01-01T00:00:00.000Z_2023-02-01T00:00:00.000Z_2023-05-24T13:27:31.235Z",
        "dev.do.overshadow-test_2023-01-01T00:00:00.000Z_2023-02-01T00:00:00.000Z_2023-05-24T13:15:52.256Z"
      ]
    }
  }
}
Since the older *T13:15 segment was unused, I got an error. I tried it again using a time interval:
{
  "type": "compact",
  "dataSource": "dev.do.overshadow-test",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2023-01-01/2023-01-02"
    }
  }
}
that produced a third segment but it is identical to the second one (the one with less data)
j
So your original segments had MONTH granularity, and you overshadowed it with an update with (DAY?) granularity?
Or are you just replacing one MONTH segment with another MONTH segment?
If MONTH -> MONTH, then compaction should merge the two segments. To Kyle's comment -- if you ingested the second segment with append=false, then the contents of the newer segment should replace the contents of the older segment ... and if you ingested the second segment with append=true, then the contents should be unioned together in the resulting segment.
For your second compaction spec, you specified a DAY range, not a MONTH range ... was that a typo?
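For reference, a MONTH-wide version of that interval spec might look like the sketch below. This is an illustration, not the spec actually run in this thread; the granularitySpec block is optional and is included only to make the intended segment granularity explicit:
{
  "type": "compact",
  "dataSource": "dev.do.overshadow-test",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2023-01-01/2023-02-01"
    }
  },
  "granularitySpec": {
    "segmentGranularity": "MONTH"
  }
}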
d
@Kyle Hoondert That's another thing we are trying to figure out. At first I assumed it was a message that had been bouncing around in the web of Kafka streams for multiple hours until the current segment was published, but there are also some really obscure cases that need further investigation. Indeed, the affected datasource did not use appendToExisting because it didn't use the dynamic partitioning type.
@John Kowtko the actual data has segmentGranularity=day; this is a smaller example with segmentGranularity=month.
j
appendToExisting is only relevant to the ingestion task, not the existing datasource. So it doesn't matter how the original segments got there; this parameter only affects what happens when you ingest the new data. A small caveat: if the original segments were already the result of multiple ingestions into the same time period and you haven't compacted that time period yet, then the existing segments may be marked as having been ingested in overwrite or append mode, and that will affect what queries see and what compaction will do.
Back to your original question:
I have a datasource where several days=segments with millions of rows are overshadowed by segments that were added later with only a tiny fraction of rows
"overshadowed" implies that the newer segments were ingested with appendToExisting=false ... and if all segments involved were at the DAY level, then compaction (using your second spec) should merge them.
d
Yes, it's still important to make sure we don't overshadow our data again after restoring it, but that doesn't solve the immediate problem of being able to access both the original data and the new data which overshadowed it. Are you sure about compaction merging inactive segments by default? Assuming the granularity doesn't change, it is supposed to pack multiple partitions for one interval into fewer new segments; I don't remember anything in the docs about handling multiple versions, and that's what my tests showed. Here is the simple inline ingestion I used; after it finished, I ran it again with a different
"data": "time,id,value\n2023-01-01,1.01,1.01\n2023-02-01,2.01,2.01\n2023-03-01,3.01,3.01"
to overshadow the previous segment and then ran the compactions on it.
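The full ingestion spec wasn't posted, so the following is only a sketch of what such an inline ingestion might look like, assuming CSV data with a header row and the month segmentGranularity mentioned above (all field values are illustrative):
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "inline",
        "data": "time,id,value\n2023-01-01,1.01,1.01\n2023-02-01,2.01,2.01\n2023-03-01,3.01,3.01"
      },
      "inputFormat": {
        "type": "csv",
        "findColumnsFromHeader": true
      },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "dev.do.overshadow-test",
      "timestampSpec": { "column": "time", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["id", "value"] },
      "granularitySpec": {
        "segmentGranularity": "month",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
With appendToExisting set to false, running this a second time over the same months would create new segment versions that overshadow the earlier ones, which matches the behavior described above.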
j
On my system the DRUID_SEGMENTS table (in the metadata DB) only has a "USED" field in it. The other attributes such as is_published, is_available, and is_overshadowed are therefore derived by the Coordinator when the SYS virtual tables are populated. Which also means that you may not automatically update the SYS table content by manually updating the USED flag in DRUID_SEGMENTS ... or it might take a while for the Coordinator to refresh its view of the segment states.
===
Just to confirm ingestion/compaction behavior, on my local quickstart install I just tested this on a wikipedia datasource ingested with HOUR granularity:
• Ingested a second segment of HOUR granularity with REPLACE (append=false) and it completely overshadowed the original existing segment; moments later the original segment was automatically deleted. No need for compaction in the case of full overshadowing.
• Ingested a second segment of HOUR granularity using INSERT (append=true) into the table, resulting in two active segments for that hour. After compaction these consolidated into one segment that has the union of data from both segments.
===
This last spec you posted has segment granularity at MONTH. If you ran two ingestions with different data sets over the same three-month period, then you should end up with 6 segments in total and 6+3 = 9 records of data in them. None of these should be overshadowed ... they should all be additive. And if you run compaction at MONTH granularity (or just set up default auto-compaction with a P0D offset and kick it off a few times) you should end up with 3 segments in total with those 9 records still in them. Is this not what you are seeing?
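For reference, the default auto-compaction with a P0D offset mentioned above can be configured per datasource through the Coordinator's compaction config API (POST /druid/coordinator/v1/config/compaction); a minimal sketch, with the dataSource value illustrative:
{
  "dataSource": "dev.do.overshadow-test",
  "skipOffsetFromLatest": "P0D"
}
skipOffsetFromLatest defaults to P1D, so setting it to P0D lets the Coordinator consider the most recent time chunk for compaction as well.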