https://pinot.apache.org/ logo
#general
Title
# general
n

Nizar Hejazi

03/19/2022, 11:01 PM
Hey there, I want to write a transformation to convert a datetime column that has string values (e.g. ‘2022-03-19T110018.789Z’) or nulls into timestamps. Using inbuilt function (FromDateTime) throws
java.lang.NullPointerException
when the value is null:
Copy code
{
  "columnName": "updatedAt_timestamp",
  "transformFunction": "FromDateTime(updatedAt, 'yyyy-MM-dd''T''HH:mm:ss.SSS''Z''')"
}
Trying to use Groovy script as following but I see the following exception in the logs:
MissingPropertyException: No such property: DateTimeFormat for class: Script1
Copy code
{
  "columnName": "col_timestamp",
  "transformFunction": "Groovy({col == null ? null : DateTimeFormat.forPattern('yyyy-MM-dd\\'T\\'HH:mm:ss.SSS\\'Z\\'').withZone(DateTimeZone.forID(DateTimeZone.UTC.getID())).parseMillis(adminAccessGrantedOn)}, col)"
},
Do I need to import joda time classes to Groovy? Can I write a multi-line Groovy script as an ingestion transform? Any other workaround to deal w/ nulls in FromDateTime inbuilt function? (I can submit a PR to update date time functions to handle nulls). Please note that I have
"nullHandlingEnabled"
set to True.
Note: the following Groovy script is valid (tested in a Groovy compiler):
Copy code
@Grab( 'joda-time:joda-time:2.10.5' )
import org.joda.time.DateTimeZone
import org.joda.time.format.DateTimeFormat
import org.joda.time.format.DateTimeFormatter

dt = '2022-03-19T11:00:18.789Z'
print DateTimeFormat.forPattern('yyyy-MM-dd\'T\'HH:mm:ss.SSS\'Z\'').withZone(DateTimeZone.forID(DateTimeZone.UTC.getID())).parseMillis(dt)
m

Mayank

03/19/2022, 11:03 PM
Groovy script cannot call Pinot defined UDFs. But you should be able to do that transform without the Groovy script?
n

Nizar Hejazi

03/19/2022, 11:05 PM
Do you mean using inbuilt functions only? I cannot find inbuilt function for handling nulls (IsNull, IF, etc.)
I need to return null if the input column value is null, otherwise: convert the string datetime to timestamp
Is this time column? If so, you probably don’t want to convert to null
k

Kartik Khare

03/19/2022, 11:07 PM
We allow it to convert to a default value (which can be set by you), in case of nulls. However, returning a null is not supported.
n

Nizar Hejazi

03/19/2022, 11:08 PM
this is one of the three available time columns that I use for filtering (not as timeColumnName in segmentsConfig), maybe I should define it as dimensionSpec?
If I have
"nullHandlingEnabled"
set to True, why I cannot return null in an ingestion transform?
k

Kartik Khare

03/19/2022, 11:10 PM
The functions can't return nulls. When
nullHandlingEnabled
is set to true, whatever data you have uploaded, if the column is empty, we mark it as null in our index, the value is still stored as a default one.
n

Nizar Hejazi

03/19/2022, 11:11 PM
What happens if the column is empty (null) and there is an ingestion transform defined? For example: JSONPATH takes an object and return null if the input is null. I need to do the same for FromDateTime.
@User The logic is as following: • If col is null, set to whatever default value it should take and mark it as null in the index. • If col is not null, transform it from a string datetime to a timestamp. Can I achieve this one way or another?
k

Kartik Khare

03/19/2022, 11:19 PM
Can I ask you why you need to convert from string to timestamp?
n

Nizar Hejazi

03/19/2022, 11:20 PM
for perf reasons .. wanna pay the cost at ingestion time in order for queries to run faster
✔️ 1
Can I submit a PR to check for if dateTimeString is null in
FromDateTime
and
ToDateTime
functions and return null in this case? Source code: link
k

Kartik Khare

03/19/2022, 11:25 PM
@User We had a discussion on this earlier and the conclusion was to raise exceptions in this case. Can we do it some other way?
n

Nizar Hejazi

03/19/2022, 11:31 PM
suggestion: If
nullHandlingEnabled
is set to true, returns null. Otherwise, throw an exception.
k

Kartik Khare

03/19/2022, 11:32 PM
Issue is for columns with DataType other than Object, setting to null with throw NullPointerException in results.
n

Nizar Hejazi

03/19/2022, 11:49 PM
@User @User one simple solution here is as following: If the input column value is null, check
nullHandlingEnabled
setting: • If
nullHandlingEnabled
is set to true, mark the value as null in the null index, and avoid calling the transform ingestion function. • If
nullHandlingEnabled
is set to false, call the transform ingestion function.
j

Jackie

03/20/2022, 4:54 AM
Checking if input column is null might not give expected behavior for certain functions which can take null values
I think we can probably add an annotation for functions that cannot take null, and just return null when the function is annotated
n

Nizar Hejazi

03/20/2022, 9:42 PM
@User I am happy to help do the change for date functions. Should I file a github issue and send a PR?
Question: What to do if end user wrote a Groovy script as an ingestion transform that returns null while setting
nullHandlingEnabled
to false?
m

Mayank

03/21/2022, 3:32 AM
Thanks. If native null support is off, nulls should be replaced by default null value
j

Jackie

03/21/2022, 5:15 PM
nullHandlingEnabled
should be independent of the ingestion transform as it happens after the transform is done. We should preserve the
null
values during the ingestion transform
n

Nizar Hejazi

03/22/2022, 1:00 AM
@User @User @User PR to fix the issue: https://github.com/apache/pinot/pull/8382
m

Mayank

03/22/2022, 1:00 AM
Thanks @User
j

Jackie

03/23/2022, 10:10 PM
@User Let's bring the discussion offline to quickly iterate
The idea for
nullableParameters
is to annotate only the functions that can take
null
and return something meaningful (e.g.
isNull
)
For
fromDateTime
and
dateTimeConvert
, they cannot handle
null
input properly, so we should not annotate them and just let the
InbuildFunctionEvaluator
evaluates the function to
null
.
After the ingestion transform,
null
values will be filled with default values, which is the expected behavior
n

Nizar Hejazi

03/23/2022, 10:36 PM
@User agree on these points and I replied to your comment mentioning that I will only annotate
fromDateTime
and
dateTimeConvert
w/
nullableParameters = true
. Note: I annotated previously other datetime functions (e.g. ago, timeZoneHour, etc.) w/ this annotation since they could be used theoretically as ingestion transform w/ a column that has null values, but agree this is highly unlikely and hence I will remove the annotation. The probability of these functions’ input coming from a data column is almost zero.
j

Jackie

03/23/2022, 10:45 PM
I think there are some confusion here. If we annotate
fromDateTime
with
nullableParameters = true
,
null
will be passed into the function, and cause NPE since the scalar function does not handle
null
properly
Instead, we want it to return
null
when getting
null
input value, so we should not annotate it with
nullableParameters = true
Oh, I see the misunderstanding here. We should perform the
null
check when
nullableParameters
is
false
instead of
true
.
When annotating a function with
nullableParameters = true
, that means the function can take
null
values, and we should pass the arguments as is; when it is not annotated, that means the function should not take
null
values, and we should directly return
null
when there is
null
argument.
👍 1
@User ^^
n

Nizar Hejazi

03/23/2022, 10:58 PM
@User clear now .. I will update
nullableParameters = true
by default (functions can handle nulls by default) and override it for
fromDateTime
and
dateTimeConvert
(set to false) since both methods cannot handle nulls.
j

Jackie

03/23/2022, 11:00 PM
@User Don't change the default for
nullableParameters
. All existing scalar functions cannot handle
null
We keep the annotation for future functions, no need to override it for any function now
n

Nizar Hejazi

03/23/2022, 11:02 PM
but most Json scalar functions were updated by this PR to handle null input (jsonpath)
I think the right change should be to keep the default for
nullableParameters
set to false as you suggested, and only override it for JsonPathXXX functions.
j

Jackie

03/23/2022, 11:05 PM
Good catch!
👍 1
Let me go over the json functions and see which one should be annotated
We should annotate
toJsonMapStr
,
jsonFormat
,
jsonPath*
with default value argument
n

Nizar Hejazi

03/23/2022, 11:17 PM
makes sense
Note:
jsonPathArray
cannot handle null. This is why we have
jsonPathArrayDefaultEmpty
.
j

Jackie

03/24/2022, 12:46 AM
True. Only the one with default value can retry
non-null
value when input is
null
n

Nizar Hejazi

03/24/2022, 8:37 PM
@User addressed comments .. btw, I still see the msg: First-time contributors need a maintainer to approve running workflows.
j

Jackie

03/24/2022, 8:41 PM
Yeah, I'll do it. Next time it will automatically trigger for you once this one is merged
The PR looks good. Will merge once the tests pass. Thanks for the contribution @User!
n

Nizar Hejazi

03/24/2022, 8:46 PM
thanks for reviewing @User