# troubleshooting
a
Hi Team, I am trying to read data from a parquet file into a Pinot table using Spark batch ingestion. I am facing an error with a date-time STRING datatype. The date ('yyyy-MM-dd') is getting loaded in EPOCH format (18234), whereas I need it in the original string format with granularity DAYS (2020-01-02). For now, I am using the derived-column method and transforming it into a string using transformConfigs. With this, I am no longer able to use functions like dateTrunc('week', sql_date_entered_str, 'DAYS').
Copy code
{
  "name": "sql_date_entered",
  "dataType": "INT",
  "format": "1:DAYS:EPOCH",
  "granularity": "1:DAYS"
},
{
  "name": "sql_date_entered_str",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
  "granularity": "1:DAYS"
}
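For reference, the derived column is produced by a transformConfig roughly like this (a sketch; it assumes the fromEpochDays and toDateTime scalar functions to turn epoch days into a yyyy-MM-dd string):
Copy code
{
  "columnName": "sql_date_entered_str",
  "transformFunction": "toDateTime(fromEpochDays(sql_date_entered), 'yyyy-MM-dd')"
}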
Another way to handle it is using query-time transformations:
Copy code
select sql_date_entered,
       DATETIMECONVERT(dateTrunc('week', sql_date_entered, 'DAYS'), '1:DAYS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd', '1:DAYS') as week
from kepler_product_pipegen
Is there any way that I can load the date in 'yyyy-MM-dd' format and still run transformations like dateTrunc on top of it? Pinot version = 0.7.1
r
have you tried using the TIMESTAMP dataType instead of STRING?
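something like this in the schema (a sketch; the TIMESTAMP dataType needs a newer Pinot release, and double-check the format string for your version):
Copy code
{
  "name": "sql_date_entered",
  "dataType": "TIMESTAMP",
  "format": "1:MILLISECONDS:TIMESTAMP",
  "granularity": "1:DAYS"
}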
a
@User I am using Pinot version 0.7.1, which doesn't support TIMESTAMP.
n
Could you elaborate a bit more on what you're trying to achieve? How is it in your source data? I see you've already defined sql_date_entered_str as yyyy-MM-dd, is that not loading the way you want? Can you share the transformConfigs in your table config?
Is the problem that you have yyyy-MM-dd and want to use dateTrunc on it but cannot?
a
1. Yes, your understanding is correct. I am not able to run the dateTrunc function on sql_date_entered_str. 2. Requirement: load the date from the source in "yyyy-MM-dd" format. In the source parquet file, it's of type date.
This is how it looks in Presto when I load the same file.
n
Looks like in the source, sql_date_entered is in format yyyy-MM-dd. But in the Pinot schema, sql_date_entered is defined as 1:DAYS:EPOCH. So I think the first correction would be to change the Pinot schema to
Copy code
{
  "name": "sql_date_entered_epoch_ms",
  "dataType": "LONG",
  "format": "1:MILLISECONDS:EPOCH",
  "granularity": "1:DAYS"
},
{
  "name": "sql_date_entered",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
  "granularity": "1:DAYS"
}
and add a transform function in the table config to go with it:
Copy code
"columnName":"sql_date_entered_epoch_ms","transformFunction":"fromDateTime(sql_date_entered, 'yyyy-MM-dd')"
a
@User Thanks a lot. Let me give it a try.
n
now you can use dateTrunc on sql_date_entered_epoch_ms as dateTrunc('week', sql_date_entered_epoch_ms)
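e.g. to get the week start back as a yyyy-MM-dd string at query time, something like this should work (a sketch; the two-argument dateTrunc assumes the value is in milliseconds):
Copy code
select sql_date_entered,
       toDateTime(dateTrunc('week', sql_date_entered_epoch_ms), 'yyyy-MM-dd') as week_start
from kepler_product_pipegen
limit 10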
a
I tried, and it's reading the date in EPOCH (18234) format but with string data type (as nulls are there) 😞. Is it because Parquet stores the date internally as an INT32 epoch value? Isn't there any way to read the source date in its original format, which is 'yyyy-MM-dd', and still apply dateTrunc on top of it?
Screenshot 2022-04-07 at 12.19.07 PM.png
n
Can you share the entire table config, Pinot schema and a sample parquet?
n
I tried ingesting the sql_date_entered column as a dimension, and it turns out it is indeed in EPOCH days format in the original source data. Not sure if Presto does something under the hood to show it in yyyy-MM-dd format, but in the parquet file it is in EPOCH DAYS.
I was able to truncate to week and then display in yyyy-MM-dd using
Copy code
select toDateTime(fromEpochDays(dateTrunc('week', sql_date_entered, 'DAYS')), 'yyyy-MM-dd')
from kepler_product_pipegen
limit 10
it might be better to do this during ingestion, so it isn't expensive at query time
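e.g. a derived column at ingestion time could look something like this (a sketch; week_start is a hypothetical column name, and it assumes dateTrunc, fromEpochDays and toDateTime are usable as ingestion transform functions in this version):
Copy code
{
  "columnName": "week_start",
  "transformFunction": "toDateTime(fromEpochDays(dateTrunc('week', sql_date_entered, 'DAYS')), 'yyyy-MM-dd')"
}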