# troubleshooting
**Kamal:** Hello, I am trying to load a CSV (generated from a Redshift UNLOAD command) and the dates are in this format (in the schema file):

```json
{
  "name": "created_date",
  "dataType": "STRING",
  "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:YYYY-MM-dd HH24:MI:SS.MS",
  "granularity": "1:MILLISECONDS"
}
```

Example from the CSV file: `2020-03-01 07:31:08.792457`. I keep getting a "failed to generate Pinot segment" error with a Java IllegalArgumentException:
```
java.lang.IllegalArgumentException: Illegal pattern component: I
	at org.joda.time.format.DateTimeFormat.parsePatternTo(DateTimeFormat.java:566) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.joda.time.format.DateTimeFormat.createFormatterForPattern(DateTimeFormat.java:687) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.joda.time.format.DateTimeFormat.forPattern(DateTimeFormat.java:177) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.spi.data.DateTimeFormatPatternSpec.<init>(DateTimeFormatPatternSpec.java:57) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.spi.data.DateTimeFormatSpec.<init>(DateTimeFormatSpec.java:59) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.core.indexsegment.generator.SegmentGeneratorConfig.setTime(SegmentGeneratorConfig.java:212) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.core.indexsegment.generator.SegmentGeneratorConfig.<init>(SegmentGeneratorConfig.java:138) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:95) ~[pinot-all-0.7.1-jar-with-dependencies.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$run$0(SegmentGenerationJobRunner.java:199) ~[pinot-batch-ingestion-standalone-0.7.1-shaded.jar:0.7.1-afa4b252ab1c424ddd6c859bb305b2aa342b66ed]
```
Has anyone run into this issue? Do I need to convert the dates to EPOCH when generating the CSV?
**Xiang:** What's your created_date format? I feel that `:` needs to be escaped.
**Kamal:** In the Redshift table it's stored as a timestamp without time zone.
**Jackie:** This is not a valid simple date format. Can you please try `yyyy-MM-dd HH:mm:ss.SSS`?
**Kamal:** Thanks Jackie, let me try this format instead.
**Jackie:** Seems microseconds are not supported in simple date format. Please double-check the value parsed with this format.
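If you would rather drop the microseconds at the source, one option is to format the timestamp in the UNLOAD query itself. A minimal sketch, assuming Redshift's to_char (the table, bucket, and IAM role names here are hypothetical):

```sql
-- Truncate the timestamp to second precision before unloading,
-- so Pinot can parse it with yyyy-MM-dd HH:mm:ss.
UNLOAD ('SELECT to_char(created_date, ''YYYY-MM-DD HH24:MI:SS'') AS created_date FROM my_table')
TO 's3://my-bucket/pinot-input/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
CSV HEADER;
```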
**Kamal:** That worked! Moving on to the next error. I exported the file without a header and it's giving me a "header contains duplicate name" error.
**Jackie:** If you don't have the header within the file, you need to provide a `CSVRecordReaderConfig` which contains the header info.
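A minimal sketch of what that might look like in the batch ingestion job spec (assuming the standalone job spec format; the column names in `header` are hypothetical):

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    # Comma-separated column names, used when the file itself has no header row
    header: 'reference_id,created_date,amount'
```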
**Kamal:** I was actually just going to regenerate the file with a header. I do have the following in my job spec file, though:
```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
```
Do I need to change that to `CSVRecordReaderConfig` instead of `CSVRecordReader`?

For a CSV file with string columns containing commas, can I use backslash `\` as the escape character? I'm getting a NumberFormatException for a string column, but I'm guessing it's because the columns shift when Pinot tries to read the file.
**Jackie:** @Kamal Chavda Can you please try wrapping the string value within double quotes? Not 100% sure if the Pinot CSV reader can handle that, though.
**Kamal:** Hi @Jackie, I tried adding quotes but got an error. It didn't like quotes around the delimiters.
I'm also seeing this line:

```
Using class: org.apache.pinot.plugin.inputformat.csv.CSVRecordReader to read segment, ignoring configured file format: AVRO
```
**Jackie:** @Kamal Chavda I did some experiments and the Pinot CSV record reader can handle the escaped delimiter with double quotes. One finding is that you cannot have a space after the double quote.
**Kamal:** Thanks @Jackie! Redshift UNLOAD actually adds double quotes automatically to columns containing the delimiter (',') in a CSV file. I tested by exporting a record and loading it into Pinot. One strange thing I'm noticing is this error:

```
java.lang.IllegalStateException: Cannot read single-value from Object[]: [check 175, Some Name] for column: reference_id
```

But when I check the actual file, that value is "check 175; Some Name". Not sure how that's happening. Seems like the CSVRecordReader is interpreting ';' as a ','.
**Jackie:** I see what is going on here. `;` is reserved as the multi-value delimiter, e.g. `a;b` will be interpreted as `[a, b]`. One workaround would be picking an unused character as the multi-value delimiter if you don't expect multi-values within the CSV file.
**Xiang:** You can also configure the delimiter in your CSVRecordReaderConfig.
**Kamal:** Thanks guys. That's what I was actually going to ask. So I can update this line to change the multi-value delimiter, or line 28 for the delimiter.

@Jackie, shouldn't it still load the record if it finds a ';' and thinks it's multi-value? The column is a string type.
**Jackie:** The problem is that the column is not defined as a multi-value column, and loading a single value from an array throws the exception.
**Kamal:** Ah, okay, got it. Didn't realize that. It's not mentioned on the DimensionFieldSpec... nvm, it's the singleValueField boolean.

I'll try a different delimiter for the multi-value. I've got some free-form text columns in my dataset so they will have all kinds of values :facepalm:
**Xiang:** To configure the record reader you don't need to change code; just set `delimiter` and `multiValueDelimiter` in the record reader config, and you can ignore the others.
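For reference, a sketch of those two settings in the job spec (same assumed standalone format as above; the '|' multi-value delimiter is just an example of an unused character):

```yaml
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: ','
    # Pick a character that never appears in the data if you have no multi-value columns
    multiValueDelimiter: '|'
```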
**Kamal:** Thank you Xiang and Jackie for all your help! I used your advice and addressed the ';' values in columns during my CSV generation. The processing is finally running! 🙂

The data is loaded; however, when I tried to filter on year for the date column below, it gave me an error.
```json
{
  "name": "created_date",
  "dataType": "STRING",
  "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss",
  "granularity": "1:SECONDS"
}
```
What's the best practice for defining date columns when loading from CSV? The datetime functions at https://docs.pinot.apache.org/users/user-guide-query/supported-transformations#datetime-functions only seem to accept epoch millis.
**Xiang:** What does the data look like?
**Kamal:** In the CSV file I've removed the microseconds from the timestamp, so it's `2020-03-01 07:31:08`. I was able to use the DATETIME_CONVERT function to convert it in my query, but was wondering if there's a better way.
```sql
DATETIME_CONVERT(my_date_field, '1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss', '1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss', '1:HOURS')
```
**Xiang:** I see; then you can add more columns for your time. Also, @Jackie recently added a TIMESTAMP type, which might help as well.
**Jackie:** You can have multiple time fields and use an ingestion transform to generate them: https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations

What's the query when you try to filter on year? I think `select ... from table where created_date > '2020-01-01 00:00:00'` should work.
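A sketch of such an ingestion transform in the table config, assuming Pinot's built-in fromDateTime function (the derived column name is hypothetical):

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "created_date_millis",
      "transformFunction": "fromDateTime(created_date, 'yyyy-MM-dd HH:mm:ss')"
    }
  ]
}
```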
**Kamal:** I was trying to do `select ... from table where year(created_date) >= 2020`. What you posted above does work, though, so I will modify the query. Thanks for sharing the ingestion transform link, I will use it to transform columns!
**Jackie:** If you ingest the data as data type `TIMESTAMP`, `year()` will work.
**Kamal:** Thanks @Jackie, I totally missed your response yesterday. So you're suggesting using the ingestion-level transformations to load the source timestamp as TIMESTAMP in the Pinot destination table? Xiang Fu mentioned there is a timestamp type; instead of SIMPLE_DATE_FORMAT I can use TIMESTAMP?
**Jackie:** Yes, and since the source format is already a valid timestamp, you don't need an extra transform. You may configure the format as `1:MILLISECONDS:TIMESTAMP`.

FYI it is only available in the latest master. If you are on 0.7.1 you need to wait for the next release.
**Kamal:** Thanks Jackie. Yes, I'm on 0.7.1 😞

With the next release, in the schema file I generate I can do the following, correct?
```json
"dataType": "TIMESTAMP",
"format": "1:SECONDS:TIMESTAMP"
```
instead of

```json
"dataType": "STRING",
"format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss"
```
**Jackie:** TIMESTAMP is always in millis, so `1:MILLISECONDS:TIMESTAMP`.
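Putting that together, the field spec in the schema would presumably become (the granularity line is added here by analogy with the earlier snippets):

```json
"name": "created_date",
"dataType": "TIMESTAMP",
"format": "1:MILLISECONDS:TIMESTAMP",
"granularity": "1:MILLISECONDS"
```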
👍 1