I seem to have issues querying table bach ingested...
# troubleshooting
t
I seem to have issues querying table bach ingested into OFFLINE segments.. I have data for the past year, but can only query data back a few days. When I do a
select count(*)
for the whole table, it says there are 12,391,295 records, yet even in the query stats it says the total number of docs is 333,029,029. The query stats also say that only 344 of the segments match
*
yet that is not even half of the real time segments and I am assuming none of the offline segments
m
Is this a hybrid table?
t
yes
m
Total docs in metadata counts total in both components (which can have overlap). Count(*) should be the accurate one.
select *
without orderby/limit or filter will do an early bailout (as it needs to only return some 10 records.
t
So the bigger issue I think I am having is it does not seem like the offline segments are queryable. The dates for the offline segments should start at oct 6 but the earliest result I see is from oct 18
m
Do you have time overlap in the two tables?
t
I have the same table set up with about 10x less data (100Gb vs 1.6Tb) and it is able to return values from over a year ago
there should not be
m
Ideally, there should be overlap
You can query the min/max time in the two tables independently
use _OFFLINE or _REALTIME suffix to table name
t
hmm okay, I have
2022-03-31T10:00:39.299
for real time (note this has only been running for the last couple of days) and
2022-10-18T20:02:52.759
for offline
I will try an batch ingest more data so that there is guaranteed timestamp overlap in the tables. Querying the REALTIME table for the min timestamp value, I get the value I expect but querying the OFFLINE table directly I get the same unexpected result I was describing before. Will update on this
m
Side note, you probably don’t want to store time in ms granularity, unless you want to query in ms granularity. Ideally store it in the lowest granularity you care about during query
t
Update: I have created a set of data of the missing range which has doubled the amount of data in my offline table but still shows the exact same result.
Copy code
select * from uplinkpayloadevent
order by message_timestamp asc
limit 10

First result = 2022-10-18T20:02:52.759 (1666123372759)
This is the same for the uplinkpayloadevent_OFFLINE table, but not the uplinkpayloadevent_REALTIME table (which has only been online for a few days)
Copy code
select * from uplinkpayloadevent_REALTIME
order by message_timestamp asc
limit 10

First result = 2022-03-31T10:00:39.299 (1648720839299)
To ensure that this data was ETLed, I even downloaded a segment to verify the expected time range is there. I am having trouble making sense of this..