
Kamal Chavda

07/12/2021, 6:04 PM
Hello, I am trying to load a csv (generated from Redshift UNLOAD) command and the dates are in this format (in schema file):
"name": "created_date",
            "dataType": "STRING",
            "format" : "1:MILLISECONDS:SIMPLE_DATE_FORMAT:YYYY-MM-dd HH24:MI:SS.MS",
            "granularity": "1:MILLISECONDS"
Example from the csv file: 2020-03-01 07:31:08.792457. I keep getting a "failed to generate pinot segment" error with a Java IllegalArgumentException:
java.lang.IllegalArgumentException: Illegal pattern component: I
	at org.joda.time.format.DateTimeFormat.parsePatternTo(DateTimeFormat.java:566) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.joda.time.format.DateTimeFormat.createFormatterForPattern(DateTimeFormat.java:687) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.joda.time.format.DateTimeFormat.forPattern(DateTimeFormat.java:177) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.spi.data.DateTimeFormatPatternSpec.<init>(DateTimeFormatPatternSpec.java:57) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.spi.data.DateTimeFormatSpec.<init>(DateTimeFormatSpec.java:59) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.core.indexsegment.generator.SegmentGeneratorConfig.setTime(SegmentGeneratorConfig.java:212) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.core.indexsegment.generator.SegmentGeneratorConfig.<init>(SegmentGeneratorConfig.java:138) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:95) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:199) ~[pinot-batch-ingestion-standalone-0.7.1-shaded.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
Has anyone run into this issue? Do I need to convert the dates to EPOCH when generating CSV?

Xiang Fu

07/12/2021, 6:21 PM
what’s your created_date format?
I feel that ':' needs to be escaped

Kamal Chavda

07/12/2021, 6:23 PM
In the Redshift table it's stored as a timestamp without time zone

Jackie

07/12/2021, 6:25 PM
This is not a valid simple date format. Can you please try yyyy-MM-dd HH:mm:ss.SSS ?

Kamal Chavda

07/12/2021, 6:26 PM
Thanks Jackie, let me try this format instead.

Jackie

07/12/2021, 6:28 PM
Seems like microseconds are not supported in the simple date format. Please double-check the value parsed with this format
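A minimal Python sketch of the truncation involved here (using Python's strptime as a stand-in; Pinot itself parses with Joda-Time, where the finest simple-date-format token is milliseconds, SSS):

```python
from datetime import datetime

raw = "2020-03-01 07:31:08.792457"  # microsecond precision, as emitted by Redshift UNLOAD

# Python's %f parses up to microseconds, but a millisecond-only pattern like
# yyyy-MM-dd HH:mm:ss.SSS cannot represent the last three digits, so the
# value has to be truncated to milliseconds before (or during) ingestion.
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
millis_value = dt.strftime("%Y-%m-%d %H:%M:%S.") + f"{dt.microsecond // 1000:03d}"
```

Here `millis_value` keeps only the first three fractional digits, which is what a `.SSS` pattern can round-trip.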

Kamal Chavda

07/12/2021, 6:37 PM
That worked! Moving on to the next error. I exported the file without a header and it's giving me a "header contains duplicate name" error

Jackie

07/12/2021, 6:41 PM
If you don't have the header within the file, you need to provide a CSVRecordReaderConfig which contains the header info
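For reference, a job spec along these lines should wire that in (the `configClassName` and `configs` keys follow the Pinot CSV plugin docs as I understand them; the column names in `header` are illustrative):

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    header: 'created_date,reference_id'   # illustrative column list
    delimiter: ','
    multiValueDelimiter: ';'
```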

Kamal Chavda

07/12/2021, 6:44 PM
I was actually just going to regenerate the file with header.
I do have the following in my jobspec file though
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
Do I need to change that to CSVRecordReaderConfig instead of CSVRecordReader?
For a CSV file with string columns containing commas, can I use backslash '\' as escape character?
I'm getting NumberFormatException for a string column but am guessing it's because of possible shifting in data when Pinot is trying to read the file.

Jackie

07/13/2021, 6:06 PM
@Kamal Chavda Can you please try wrapping the string value within double quotes? Not 100% sure if pinot CSV reader can handle that though

Kamal Chavda

07/13/2021, 7:21 PM
Hi @Jackie, I tried doing that by adding quotes but got an error. It didn't like quotes around the delimiters.
I'm also seeing this line
Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to read segment, ignoring configured file format: AVRO

Jackie

07/13/2021, 7:30 PM
@Kamal Chavda I did some experimenting and the Pinot CSV record reader can handle the escaped delimiter with double quotes. One finding is that you cannot have a space after the opening double quote
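That no-space-after-the-quote behavior is standard CSV quoting, not something Pinot-specific: a quote is only treated as an escape when it immediately follows the delimiter. Python's csv module (which I'm using here purely as an illustration) behaves the same way:

```python
import csv

# Quote immediately after the delimiter: the embedded comma is escaped.
good = next(csv.reader(['a,"b,c",d']))

# A space before the opening quote makes the quote a literal character,
# so the embedded comma splits the field.
bad = next(csv.reader(['a, "b,c", d']))
```

`good` comes back as three clean fields, while `bad` splits into four fields with stray quote characters.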

Kamal Chavda

07/13/2021, 9:07 PM
Thanks @Jackie! Redshift UNLOAD actually adds double quotes automatically to columns containing the delimiter (',') in a CSV file. I tested it out by exporting a record and loading it into Pinot. One strange thing I'm noticing is I'm getting this error:
java.lang.IllegalStateException: Cannot read single-value from Object[]: [check 175, Some Name] for column: reference_id
But when I check the actual file for that value it's "check 175; Some Name". Not sure how that's happening.
Seems like the CSVRecordReader is interpreting ';' as a ','

Jackie

07/14/2021, 12:16 AM
I see what is going on here. ';' is reserved as the multi-value delimiter. E.g. a;b will be interpreted as [a, b]. One work-around would be picking an unused character as the multi-value delimiter if you don't expect multi-values within the csv file
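A simplified model of what the reader is effectively doing (the real logic lives inside Pinot's CSV record extractor; this sketch just illustrates why "check 175; Some Name" came back as a two-element array):

```python
MULTI_VALUE_DELIMITER = ";"

def extract_field(raw: str):
    """Simplified model: a field containing the multi-value delimiter
    becomes a list of values, otherwise it stays a single value."""
    if MULTI_VALUE_DELIMITER in raw:
        return raw.split(MULTI_VALUE_DELIMITER)
    return raw

value = extract_field("check 175; Some Name")  # becomes a two-element list
```

A single-value column then fails when handed that list, which matches the `Cannot read single-value from Object[]` exception above.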

Xiang Fu

07/14/2021, 12:28 AM
you can also configure the delimiter in your csvreaderconfig

Kamal Chavda

07/14/2021, 12:32 AM
Thanks guys. That's what I was actually going to ask. So I can update this line to change the multi-value delimiter, or line 28 for the field delimiter.
@Jackie, shouldn't it still load the record if it finds a ';' and thinks it's a multi-value?
The column is a string type

Jackie

07/14/2021, 12:37 AM
The problem is that the column is not defined as a multi-value column, and loading a single value from an array throws the exception

Kamal Chavda

07/14/2021, 12:40 AM
Ah, okay, got it. Didn't realize that. It's not mentioned on the DimensionFieldSpec... nvm, it's the singleValueField boolean.
I'll try a different delimiter for the multi-value. I've got some free-form text columns in my dataset so they'll have all kinds of values facepalm

Xiang Fu

07/14/2021, 6:56 AM
To config the record reader config you don't need to change code, just set delimiter and multiValueDelimiter; you can ignore the others

Kamal Chavda

07/14/2021, 1:33 PM
Thank you Xiang and Jackie for all your help! I used your advice and addressed the ';' values in columns during my csv generation. The processing is finally running! 🙂
The data is loaded; however, when I tried to filter on year for the date below, it gave me an error.
{
  "name": "created_date",
  "dataType": "STRING",
  "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss",
  "granularity": "1:SECONDS"
}
What's the best practice for defining date columns when loading from CSV? The datetime functions at https://docs.pinot.apache.org/users/user-guide-query/supported-transformations#datetime-functions only seem to accept epoch millis

Xiang Fu

07/14/2021, 5:09 PM
What does the data look like?

Kamal Chavda

07/14/2021, 5:36 PM
In the csv file I've removed the microseconds from the timestamp, so:
2020-03-01 07:31:08
I was able to use the DATETIME_CONVERT function to convert it in my query, but was wondering if there's a better way.
DATETIME_CONVERT(my_date_field, '1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss', '1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss', '1:HOURS')
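In rough Python terms, the '1:HOURS' granularity in that DATETIME_CONVERT call amounts to truncating each value to the top of its hour (sketch only; Pinot does this internally on the parsed instant):

```python
from datetime import datetime

def truncate_to_hour(ts: str) -> str:
    """Roughly what bucketing a 'yyyy-MM-dd HH:mm:ss' value
    to 1-hour granularity produces."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return dt.replace(minute=0, second=0).strftime("%Y-%m-%d %H:%M:%S")

bucketed = truncate_to_hour("2020-03-01 07:31:08")
```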

Xiang Fu

07/14/2021, 6:19 PM
I see. Then you can add more columns for your time. Also, @Jackie recently added a TIMESTAMP type, that might help as well

Jackie

07/14/2021, 6:47 PM
You can have multiple time fields and use ingestion transform to generate them: https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations
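For example, a table config fragment along these lines could derive an epoch-millis column at ingestion time (hedged sketch: the `created_date_epoch` column name is made up, and I'm assuming the `fromDateTime` scalar function described in the Pinot docs):

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "created_date_epoch",
      "transformFunction": "fromDateTime(created_date, 'yyyy-MM-dd HH:mm:ss')"
    }
  ]
}
```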
What's the query when you try to filter on year? I think select ... from table where created_date > '2020-01-01 00:00:00' should work
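The string comparison works here because yyyy-MM-dd HH:mm:ss is zero-padded with the most significant unit first, so lexicographic order matches chronological order. A quick illustration:

```python
# Zero-padded, big-endian date strings sort the same way the underlying
# instants do, so a plain string range filter behaves like a time filter.
timestamps = ["2020-03-01 07:31:08", "2019-12-31 23:59:59", "2020-01-01 00:00:00"]
in_range = [t for t in timestamps if t > "2020-01-01 00:00:00"]
```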

Kamal Chavda

07/14/2021, 7:11 PM
I was trying to do
select ... from table where year(created_date) >= 2020
What you posted above does work though, so I'll modify the query. Thanks for sharing the ingestion transform link, will use it to transform columns!

Jackie

07/14/2021, 7:16 PM
If you ingest the data as data type
TIMESTAMP
,
year()
will work

Kamal Chavda

07/16/2021, 3:04 PM
Thanks @Jackie, I totally missed your response yesterday. So you're suggesting using the ingestion-level transformations to load the source timestamp as a TIMESTAMP in the Pinot destination table? Xiang Fu mentioned there is a timestamp type; instead of SIMPLE_DATE_FORMAT I can use TIMESTAMP?

Jackie

07/16/2021, 5:31 PM
Yes, and the source format is already a valid timestamp, so you don't need to do an extra transform.
You may config the format to 1:MILLISECONDS:TIMESTAMP
FYI it is only available in the latest master. If you are on 0.7.1 you need to wait for the next release

Kamal Chavda

07/16/2021, 5:32 PM
Thanks Jackie. Yes, I'm on 0.7.1 😞
With the next release, in the schema file I generate I can do the following, correct?
"dataType": "TIMESTAMP",
"format" : "1:SECONDS:TIMESTAMP"
instead of
"dataType": "STRING",
"format" : "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss"

Jackie

07/16/2021, 11:34 PM
TIMESTAMP is always in millis, so 1:MILLISECONDS:TIMESTAMP
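For reference, a TIMESTAMP value stored as epoch millis round-trips with the source format like this (a minimal Python sketch with hypothetical helper names, assuming the values are UTC):

```python
from datetime import datetime, timezone

def to_epoch_millis(ts: str) -> int:
    """Parse 'yyyy-MM-dd HH:mm:ss' and return epoch milliseconds (assuming UTC)."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

def from_epoch_millis(millis: int) -> str:
    """Inverse: epoch milliseconds back to 'yyyy-MM-dd HH:mm:ss' (UTC)."""
    dt = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

millis = to_epoch_millis("2020-03-01 07:31:08")
```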