# troubleshooting
a
Hi Team, I am trying to read data from a parquet file into a Pinot table using Spark batch ingestion. I am facing an error with a date-time STRING datatype. The date ('yyyy-MM-dd') is getting loaded in EPOCH format (18234), whereas I need it in the original string format with granularity DAYS (2020-01-02). For now, I am using the derived-column method and transforming it into a string using transformConfigs. With this, I am no longer able to use functions like dateTrunc('week', sql_date_entered_str, 'DAYS').
Copy code
{
  "name": "sql_date_entered",
  "dataType": "INT",
  "format": "1:DAYS:EPOCH",
  "granularity": "1:DAYS"
},
{
  "name": "sql_date_entered_str",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
  "granularity": "1:DAYS"
}
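For reference, the derived column is produced by a transformConfig roughly like this (a sketch; it assumes the fromEpochDays and toDateTime scalar functions to turn epoch days into a yyyy-MM-dd string):
Copy code
{
  "columnName": "sql_date_entered_str",
  "transformFunction": "toDateTime(fromEpochDays(sql_date_entered), 'yyyy-MM-dd')"
}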
Another way to handle it is using query-time transformations:
Copy code
select sql_date_entered,
       DATETIMECONVERT(dateTrunc('week', sql_date_entered, 'DAYS'), '1:DAYS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd', '1:DAYS') as week
from kepler_product_pipegen
Is there any way that I can load the date in 'yyyy-MM-dd' format and still run transformations like dateTrunc on top of it? Pinot version = 0.7.1
r
have you tried using the TIMESTAMP dataType instead of STRING?
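something like this in the schema (a sketch; the TIMESTAMP dataType needs a newer Pinot release, and double-check the format string for your version):
Copy code
{
  "name": "sql_date_entered",
  "dataType": "TIMESTAMP",
  "format": "1:MILLISECONDS:TIMESTAMP",
  "granularity": "1:DAYS"
}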
a
@User I am using Pinot version 0.7.1, which doesn't support TIMESTAMP.
n
Could you elaborate a bit more on what you're trying to achieve? How is it in your source data? I see you've already defined sql_date_entered_str as yyyy-MM-dd, is that not loading the way you want? Can you share the transformConfigs in your table config?
Is the problem that you have yyyy-MM-dd and want to use dateTrunc on it but cannot?
a
1. Yes, your understanding is correct. I am not able to run the dateTrunc function on sql_date_entered_str. 2. Requirement: load the date from the source in "yyyy-MM-dd" format. In the source parquet file, it's of type date.
This is how it looks in Presto when I load the same file.
n
Looks like in the source, sql_date_entered is in format yyyy-MM-dd. But in the Pinot schema, sql_date_entered is defined as 1:DAYS:EPOCH. So I think the first correction would be to change the Pinot schema to
Copy code
{
  "name": "sql_date_entered_epoch_ms",
  "dataType": "LONG",
  "format": "1:MILLISECONDS:EPOCH",
  "granularity": "1:DAYS"
},
{
  "name": "sql_date_entered",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
  "granularity": "1:DAYS"
}
and add a transform function in the table config to go with it:
Copy code
"columnName":"sql_date_entered_epoch_ms","transformFunction":"fromDateTime(sql_date_entered, 'yyyy-MM-dd')"
a
@User Thanks a lot. Let me give it a try.
n
now you can use dateTrunc on sql_date_entered_epoch_ms as dateTrunc('week', sql_date_entered_epoch_ms)
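e.g. to get the week start back as a yyyy-MM-dd string at query time, something like this should work (a sketch; the two-argument dateTrunc assumes the value is in milliseconds):
Copy code
select sql_date_entered,
       toDateTime(dateTrunc('week', sql_date_entered_epoch_ms), 'yyyy-MM-dd') as week_start
from kepler_product_pipegen
limit 10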
a
I tried, and it's reading the date in EPOCH (18234) format but with string data type (as nulls are there) 😞. Is it because Parquet stores the date internally as an INT32 epoch value? Isn't there any way to read the source date in its original format, which is 'yyyy-MM-dd', and still apply dateTrunc on top of it?
Screenshot 2022-04-07 at 12.19.07 PM.png
n
Can you share the entire table config, Pinot schema and a sample parquet?
n
I tried ingesting the sql_date_entered column as a dimension, and it turns out it is indeed in EPOCH days format in the original source data. Not sure if Presto does something under the hood to show it in yyyy-MM-dd format, but in the parquet file it is in EPOCH DAYS.
I was able to truncate to week and then display in yyyy-MM-dd using
Copy code
select toDateTime(fromEpochDays(dateTrunc('week', sql_date_entered, 'DAYS')), 'yyyy-MM-dd')
from kepler_product_pipegen
limit 10
it might be better to do this during ingestion, so it isn't expensive at query time
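e.g. a derived column at ingestion time could look something like this (a sketch; week_start is a hypothetical column name, and it assumes dateTrunc, fromEpochDays and toDateTime are usable as ingestion transform functions in this version):
Copy code
{
  "columnName": "week_start",
  "transformFunction": "toDateTime(fromEpochDays(dateTrunc('week', sql_date_entered, 'DAYS')), 'yyyy-MM-dd')"
}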