# troubleshooting
**Kamal:** Hello, I am trying to load a CSV (generated from a Redshift UNLOAD command) and the dates are in this format (in the schema file):

```json
{
  "name": "created_date",
  "dataType": "STRING",
  "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:YYYY-MM-dd HH24:MI:SS.MS",
  "granularity": "1:MILLISECONDS"
}
```

Example from the CSV file: `2020-03-01 07:31:08.792457`. I keep getting a "failed to generate Pinot segment" error with a Java IllegalArgumentException:
```
java.lang.IllegalArgumentException: Illegal pattern component: I
	at org.joda.time.format.DateTimeFormat.parsePatternTo(DateTimeFormat.java:566) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.joda.time.format.DateTimeFormat.createFormatterForPattern(DateTimeFormat.java:687) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.joda.time.format.DateTimeFormat.forPattern(DateTimeFormat.java:177) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.spi.data.DateTimeFormatPatternSpec.<init>(DateTimeFormatPatternSpec.java:57) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.spi.data.DateTimeFormatSpec.<init>(DateTimeFormatSpec.java:59) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.core.indexsegment.generator.SegmentGeneratorConfig.setTime(SegmentGeneratorConfig.java:212) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.core.indexsegment.generator.SegmentGeneratorConfig.<init>(SegmentGeneratorConfig.java:138) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:95) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:199) ~[pinot-batch-ingestion-standalone-0.7.1-shaded.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
```
Has anyone run into this issue? Do I need to convert the dates to EPOCH when generating the CSV?
**Xiang:** What's your created_date format? I feel that `:` needs to be escaped.
**Kamal:** In the Redshift table it's stored as a timestamp without time zone.
**Jackie:** This is not a valid simple date format. Can you please try `yyyy-MM-dd HH:mm:ss.SSS`?
**Kamal:** Thanks Jackie, let me try this format instead.
**Jackie:** Seems microseconds are not supported in simple date format. Please double-check the value parsed with this format.
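If you would rather drop the microseconds at the source, one option is to format the timestamp in the UNLOAD query itself. A minimal sketch, assuming Redshift's to_char (the table, bucket, and IAM role names here are hypothetical):

```sql
-- Truncate the timestamp to second precision before unloading,
-- so Pinot can parse it with yyyy-MM-dd HH:mm:ss.
UNLOAD ('SELECT to_char(created_date, ''YYYY-MM-DD HH24:MI:SS'') AS created_date FROM my_table')
TO 's3://my-bucket/pinot-input/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
CSV HEADER;
```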
**Kamal:** That worked! Moving on to the next error. I exported the file without a header and it's giving me a "header contains duplicate name" error.
**Jackie:** If you don't have the header within the file, you need to provide a `CSVRecordReaderConfig` which contains the header info.
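A minimal sketch of what that might look like in the batch ingestion job spec (assuming the standalone job spec format; the column names in `header` are hypothetical):

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    # Comma-separated column names, used when the file itself has no header row
    header: 'reference_id,created_date,amount'
```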
**Kamal:** I was actually just going to regenerate the file with a header. I do have the following in my job spec file, though:
```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
```
Do I need to change that to `CSVRecordReaderConfig` instead of `CSVRecordReader`?

For a CSV file with string columns containing commas, can I use backslash `\` as the escape character? I'm getting a NumberFormatException for a string column, but I'm guessing it's because the columns shift when Pinot tries to read the file.
**Jackie:** @Kamal Chavda Can you please try wrapping the string value within double quotes? Not 100% sure if the Pinot CSV reader can handle that, though.
**Kamal:** Hi @Jackie, I tried adding quotes but got an error. It didn't like quotes around the delimiters.
I'm also seeing this line:

```
Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to read segment, ignoring configured file format: AVRO
```
**Jackie:** @Kamal Chavda I did some experiments and the Pinot CSV record reader can handle the escaped delimiter with double quotes. One finding is that you cannot have a space after the double quote.
**Kamal:** Thanks @Jackie! Redshift UNLOAD actually adds double quotes automatically to columns containing the delimiter (',') in a CSV file. I tested by exporting a record and loading it into Pinot. One strange thing I'm noticing is this error:

```
java.lang.IllegalStateException: Cannot read single-value from Object[]: [check 175, Some Name] for column: reference_id
```

But when I check the actual file, that value is "check 175; Some Name". Not sure how that's happening. Seems like the CSVRecordReader is interpreting ';' as a ','.
**Jackie:** I see what is going on here. `;` is reserved as the multi-value delimiter, e.g. `a;b` will be interpreted as `[a, b]`. One workaround would be picking an unused character as the multi-value delimiter if you don't expect multi-values within the CSV file.
**Xiang:** You can also configure the delimiter in your CSVRecordReaderConfig.
**Kamal:** Thanks guys. That's what I was actually going to ask. So I can update this line to change the multi-value delimiter, or line 28 for the delimiter.

@Jackie, shouldn't it still load the record if it finds a ';' and thinks it's multi-value? The column is a string type.
**Jackie:** The problem is that the column is not defined as a multi-value column, and loading a single value from an array throws the exception.
**Kamal:** Ah, okay, got it. Didn't realize that. It's not mentioned on the DimensionFieldSpec... nvm, it's the singleValueField boolean.

I'll try a different delimiter for the multi-value. I've got some free-form text columns in my dataset so they will have all kinds of values :facepalm:
**Xiang:** To configure the record reader you don't need to change code; just set `delimiter` and `multiValueDelimiter` in the record reader config, and you can ignore the others.
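For reference, a sketch of those two settings in the job spec (same assumed standalone format as above; the '|' multi-value delimiter is just an example of an unused character):

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: ','
    # Pick a character that never appears in the data if you have no multi-value columns
    multiValueDelimiter: '|'
```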
**Kamal:** Thank you Xiang and Jackie for all your help! I used your advice and addressed the ';' values in columns during my CSV generation. The processing is finally running! 🙂

The data is loaded; however, when I tried to filter on year for the date column below, it gave me an error.
```json
{
  "name": "created_date",
  "dataType": "STRING",
  "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss",
  "granularity": "1:SECONDS"
}
```
What's the best practice for defining date columns when loading from CSV? The datetime functions at https://docs.pinot.apache.org/users/user-guide-query/supported-transformations#datetime-functions only seem to accept epoch millis.
**Xiang:** What does the data look like?
**Kamal:** In the CSV file I've removed the microseconds from the timestamp, so it's `2020-03-01 07:31:08`. I was able to use the DATETIME_CONVERT function to convert it in my query, but was wondering if there's a better way.
```sql
DATETIME_CONVERT(my_date_field, '1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss', '1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss', '1:HOURS')
```
**Xiang:** I see; then you can add more columns for your time. Also, @Jackie recently added a TIMESTAMP type, which might help as well.
**Jackie:** You can have multiple time fields and use an ingestion transform to generate them: https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations

What's the query when you try to filter on year? I think `select ... from table where created_date > '2020-01-01 00:00:00'` should work.
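A sketch of such an ingestion transform in the table config, assuming Pinot's built-in fromDateTime function (the derived column name is hypothetical):

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "created_date_millis",
      "transformFunction": "fromDateTime(created_date, 'yyyy-MM-dd HH:mm:ss')"
    }
  ]
}
```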
**Kamal:** I was trying to do `select ... from table where year(created_date) >= 2020`. What you posted above does work, though, so I will modify the query. Thanks for sharing the ingestion transform link, I will use it to transform columns!
**Jackie:** If you ingest the data as data type `TIMESTAMP`, `year()` will work.
**Kamal:** Thanks @Jackie, I totally missed your response yesterday. So you're suggesting using the ingestion-level transformations to load the source timestamp as TIMESTAMP in the Pinot destination table? Xiang Fu mentioned there is a timestamp type; instead of SIMPLE_DATE_FORMAT I can use TIMESTAMP?
**Jackie:** Yes, and since the source format is already a valid timestamp, you don't need an extra transform. You may configure the format as `1:MILLISECONDS:TIMESTAMP`.

FYI it is only available in the latest master. If you are on 0.7.1 you need to wait for the next release.
**Kamal:** Thanks Jackie. Yes, I'm on 0.7.1 😞

With the next release, in the schema file I generate I can do the following, correct?
```json
"dataType": "TIMESTAMP",
"format": "1:SECONDS:TIMESTAMP"
```
instead of

```json
"dataType": "STRING",
"format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss"
```
**Jackie:** TIMESTAMP is always in millis, so `1:MILLISECONDS:TIMESTAMP`.
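Putting that together, the field spec in the schema would presumably become (the granularity line is added here by analogy with the earlier snippets):

```json
"name": "created_date",
"dataType": "TIMESTAMP",
"format": "1:MILLISECONDS:TIMESTAMP",
"granularity": "1:MILLISECONDS"
```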
👍 1