# troubleshooting
f
Hi, I’m struggling with a JSON column stored as a string: I want to apply a JSON index on it. I’ve tried several things and I keep getting null values in the column 😞 Schema extract:
{
"name": "fulldata",
"dataType": "STRING",
"maxLength": 2147483647
}
Table config extract:
{
"columnName": "fulldata",
"transformFunction": "JSONFORMAT(meta)"
}
m
https://dev.startree.ai/docs/pinot/recipes/ingest-json-files take a look at this example and see if you can spot what's different
f
I started from those examples, but nothing I tried made a difference. I’m looking to extract a full block as a string
m
so in your table config extract you reference a field in your source data called 'meta' - does that field exist?
f
yes, of course; my JSON event schema has the meta key at the root level
m
ok. I mean the table config/schema look ok to me
at least the extracts
f
I keep getting null values in the query console, but when I filter on non-null values I get rows back from the query (with nulls); it’s strange 😄
The only thing I see is that my meta field is a complex type with a nested structure, but there’s no error anywhere 😕
m
so does it load the data correctly for some rows?
f
nope. But I’ve tried to simplify things and I may have found the right way to extract my columns
The field I’m looking for is in meta.relation. I was able to get meta as JSON, and by applying another function to meta I’m now able to access my relation value 🙂 I was trying to do it in one transformation
k
Can you share a sample input row?
f
Yes sure
{
"meta": {
"expectedValidator": [],
"relation": [
{
"id": "2",
"firstName": "Jules",
"lastName": "HUGHUE"
}
],
"expectedValidatorsForReject": [],
"isEnabled": false,
"version": "somevers",
"isLogged": true,
"language": "fr"
},
"data": {
"id": "4",
"type": "tiypes",
"attributes": {
"term": "2017-02",
"creationDate": "2017-02-01T00:00:00+0100",
"updateDate": "2017-02-01T00:00:00+0100",
"informationComments": "someRandomtext",
"closed": false,
"state": "validated",
"paid": true,
"advance": 0,
"actualExpenses": [],
"fixedExpenses": [
{
<List of elems>
}
]
}
}
}
And I was looking to extract relation from the meta block as a JSON string in one transformation
k
that should have worked
so when you do
select fulldata from T
you get all nulls?
f
yes, so I’ve added something like this to my transformConfigs:
{
"columnName": "meta_str",
"transformFunction": "JSONFORMAT(meta)"
},
{
"columnName": "relation",
"transformFunction": "JSONPATHARRAY(meta_str,'$.relation')"
}
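[Editor's note: for readers following along, the two-step chain in the snippet above can be sketched in plain Python. This only emulates the intent of the Pinot transforms with stdlib tools; `json_format` and `json_path_array` are illustrative helpers, not Pinot APIs.]

```python
import json

def json_format(value):
    # Mimics jsonFormat: serialize a nested object into a JSON string
    return json.dumps(value)

def json_path_array(json_str, path):
    # Minimal stand-in for JSONPATHARRAY, handling only a top-level
    # key path of the form '$.key' (real JsonPath is far richer)
    key = path.removeprefix("$.")
    values = json.loads(json_str).get(key, [])
    # Each array element becomes its own string value in the column
    return [json.dumps(v) for v in values]

row = {"meta": {"relation": [{"id": "2", "firstName": "Jules", "lastName": "HUGHUE"}]}}
meta_str = json_format(row["meta"])                 # transform 1: meta -> meta_str
relation = json_path_array(meta_str, "$.relation")  # transform 2: meta_str -> relation
```

Here `relation` ends up as a list with one JSON string per array element, matching the chained-transform result described above.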
n
Does it work without the second transform function? I think the relation one is incorrect
f
Steps 1 plus 2 work as I expect. But I was expecting to get 1 and 2 in a single step, and that was not the case 😕
a
I am experiencing the exact same issue @francoisa describes. JSON fields show up as null despite following the JSON indexing instructions. The records load with no errors. I had previously tested using ingestion configurations from the same doc and got that to work successfully. However, I didn't really want to flatten the array data because it inflates the row count.
m
@Aaron Buda can you share your table/schema config + how you're importing the JSON data and I'll try to reproduce
a
{
  "sourceEventId": "f95450d9-5cc3-4ec4-8fcb-2b09f60f1535",
  "tenantName": "tenant6",
  "unifiedEventId": "95eb09aa-3c27-4e0b-96f8-83cae515fe45",
  "unifiedEventType": "IE",
  "activityType": [
    "Closed"
  ],
  "browser": [
    "Chrome"
  ],
  "domain": [
    "tenant6.com"
  ],
  "service": [
    "Sites"
  ],
  "insertedTimestamp": 1646690621502,
  "source": [
    "GW"
  ],
  "updatedTimestamp": 1646690400000,
  "user": "Sam.Green@tenant6.com",
  "userName": "Sam.Green",
  "category": null,
  "country": "Benin",
  "device": [
    "Windows"
  ],
  "documentType": "model/iges",
  "host": [
    "101.164.243.20"
  ],
  "latitude": 62.33521305627646,
  "location": [
    "South Ozellaport (Trinidad and Tobago)"
  ],
  "longitude": 8.5003176578741,
  "objectType": [
    "Account Feed"
  ],
  "region": null,
  "severityName": "Minor",
  "statusName": "Invalid Password",
  "clearedBy": null,
  "clearedTimestamp": null,
  "clearedAs": null,
  "eventCount": null,
  "ioiCode": null,
  "incidentStartTime": 1646690621502,
  "incidentEndTime": 1646690621502,
  "threatScore": null,
  "accountType": "External",
  "accuracyRadius": null,
  "actionTaken": null,
  "activityCount": null,
  "allDomains": [
    "tenant6.com"
  ],
  "attachments": null,
  "blockedApps": null,
  "city": "New Bruno",
  "contentChecks": null,
  "createdTimestamp": 1646690400000,
  "deviceManagementStatus": null,
  "display": null,
  "elasticaUser": "Sam.Green@tenant6.com",
  "externalRecipients": [
    {
      "identifier": "Donny@hotmail.com",
      "ipAddress": "150.113.100.49",
      "port": 7810,
      "type": "TO",
      "external": true
    },
    {
      "identifier": "Brandt@gmail.com",
      "ipAddress": "154.45.234.165",
      "port": 2473,
      "type": "TO",
      "external": true
    }
  ],
  "eventSource": null,
  "failureType": null,
  "file": [
    {
      "fileName": "BunnyLebowski.gif",
      "fileSize": 724
    },
    {
      "fileName": "TheDude.bmp",
      "fileSize": 683
    }
  ],
  "fileName": [],
  "fileSize": [],
  "internalRecipients": [
    {
      "identifier": "Donny@tenant6.com",
      "ipAddress": "101.252.65.133",
      "port": 3161,
      "type": "TO",
      "external": false
    },
    {
      "identifier": "TheBigLebowski@tenant6.com",
      "ipAddress": "57.10.204.15",
      "port": 9823,
      "type": "TO",
      "external": false
    }
  ],
  "loginTime": 1646690400000,
  "message": null,
  "modifiedAt": 1646690621502,
  "objectName": "Donny.csv",
  "ownedBy": null,
  "ownedInternally": true,
  "parent": null,
  "platform": null,
  "policyAction": "ALERT",
  "policyType": "NULL",
  "policyViolated": null,
  "reqSize": 183,
  "reqUri": "Jhogo.hotmail.com",
  "recordedAt": 1646690621502,
  "respCode": "200",
  "respSize": 2,
  "riskTypes": [
    "PII, ContentIQ Violations"
  ],
  "role": null,
  "scope": null,
  "sender": "QarlKenning@gmail.com",
  "sharedWith": null,
  "subFeature": "Drive",
  "subject": null,
  "targetAccountType": "Internal",
  "timestamp": 1646690400000,
  "timeZone": "Asia/Novosibirsk",
  "userAgent": null,
  "userType": null,
  "dateSent": null,
  "monitorChannelType": null,
  "protocolID": null,
  "protocolType": null,
  "senderIPAddress": null,
  "senderPort": null,
  "senderIdentifier": null,
  "applicationReportName": null,
  "clientUserId": null,
  "dataType": null,
  "expectActionSack": null,
  "httpCookies": null,
  "httpMethod": null,
  "httpURL": null,
  "httpUserAgent": null,
  "networkDirection": null,
  "userIsInternal": null,
  "blockedStatus": null,
  "blockedStatusSuperseded": null,
  "detectionDate": null,
  "matchCount": null,
  "policyId": null,
  "policyName": null,
  "policyVersion": null,
  "severityi18nKey": null,
  "severityValue": null,
  "statusValue": null,
  "violationConditionId": null,
  "violationConditionName": null,
  "violationMatchCount": null,
  "blockedBytes": null,
  "blockedDownloadedBytes": null,
  "blockedSessionCount": null,
  "blockedUploadedBytes": null,
  "brr": null,
  "destination": null,
  "downloadedBytes": null,
  "duration": null,
  "networkDevice": null,
  "sessionCount": null,
  "tags": [],
  "uploadedBytes": null,
  "userDevice": null,
  "totalBytes": null,
  "totalPackets": null,
  "contentType": null,
  "createdBy": null,
  "eventService": null,
  "exposed": null,
  "exposureCountExternal": null,
  "exposureCountInternal": null,
  "exposedExternally": null,
  "exposedPublicly": null,
  "externalCollaborators": [],
  "fieldName": null,
  "format": null,
  "instance": null,
  "latency": null,
  "internalCollaborators": [],
  "newValue": null,
  "objectURL": [],
  "oldValue": null,
  "operation": null,
  "resource": null,
  "resourceId": null,
  "subscriptionId": null,
  "subscriptionName": null,
  "sysId": null,
  "tableName": null,
  "userCompany": "tenant6",
  "userCountry": "Trinidad and Tobago",
  "userDepartment": "Community-Services",
  "userDomain": "tenant6.com",
  "userFirstName": "Sam",
  "userIsActive": true,
  "userIsAdmin": false,
  "userLastName": "Green",
  "userOrgUnit": "Individual & Family Services",
  "userRawRiskScore": 6,
  "userRiskRating": "Medium",
  "userRiskScore": 80,
  "userTitle": "Hospitality Orchestrator",
  "userThreatScore": 10
}
Sorry, that's a sample doc, and then the schema/table configs I'm using
m
and you ingest using an ingestion job?
a
streaming from Kafka
m
ok cool
a
thanks!
m
ok, and just to check:
• Which fields are correctly populated?
• Which ones aren't?
a
everything except the three JSON fields (externalRecipients_json, file_json, internalRecipients_json) comes through properly
m
"transformConfigs": [
    {
      "columnName": "externalRecipients_json",
      "transformFunction": "JSONPATHARRAY(externalRecipients, '$.')"
    },
    {
      "columnName": "file_json",
      "transformFunction": "JSONPATHARRAY(file, '$.')"
    },
    {
      "columnName": "internalRecipients_json",
      "transformFunction": "JSONPATHARRAY(internalRecipients, '$.')"
    }
  ]
I'm wondering if it isn't reading it because it's an array
but I don't know the correct function to use. Maybe JSONPATHARRAY
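[Editor's note: `'$.'` on its own is not a valid JsonPath expression, which may be part of the problem here. If the goal is just to store the whole array as one string column, which is what jsonFormat does, a stdlib sketch shows the expected shape; the sample data below is trimmed from the document shared earlier in the thread.]

```python
import json

# Trimmed from the sample document above
external_recipients = [
    {"identifier": "Donny@hotmail.com", "ipAddress": "150.113.100.49",
     "port": 7810, "type": "TO", "external": True},
    {"identifier": "Brandt@gmail.com", "ipAddress": "154.45.234.165",
     "port": 2473, "type": "TO", "external": True},
]

# jsonFormat-style: the whole array becomes a single STRING column value,
# which a JSON index can then be queried against
external_recipients_json = json.dumps(external_recipients)
```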
a
I'll try that
k
@Neha Pawar ^^
a
tried JSONPATHARRAY as Mark described, same result
n
taking a look as well
@Aaron Weiss, I tried with your table config. The transformConfigs field needs to be within an ingestionConfig field. I made this change in your config, and it worked for me:
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "externalRecipients_json",
      "transformFunction": "jsonFormat(externalRecipients)"
    },
    {
      "columnName": "file_json",
      "transformFunction": "jsonFormat(file)"
    },
    {
      "columnName": "internalRecipients_json",
      "transformFunction": "jsonFormat(internalRecipients)"
    }
  ]
  }
@francoisa, your sample worked for me. This is the table config and schema I used. Can you share your whole configs? table:
{
  "REALTIME": {
    "tableName": "jsonIssue_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "schemaName": "jsonIssue",
      "replication": "1",
      "timeColumnName": "timestamp",
      "allowNullTimeValue": false,
      "replicasPerPartition": "1"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant",
      "tagOverrideConfig": {}
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [],
      "noDictionaryColumns": [],
      "rangeIndexColumns": [],
      "rangeIndexVersion": 2,
      "sortedColumn": [],
      "bloomFilterColumns": [],
      "loadMode": "MMAP",
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.topic.name": "jsonFormatIssue2",
        "stream.kafka.broker.list": "localhost:9092",
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "realtime.segment.flush.threshold.rows": "0",
        "realtime.segment.flush.threshold.time": "24h",
        "realtime.segment.flush.segment.size": "100M"
      },
      "onHeapDictionaryColumns": [],
      "enableDefaultStarTree": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "autoGeneratedInvertedIndex": false,
      "varLengthDictionaryColumns": [],
      "enableDynamicStarTreeCreation": false,
      "createInvertedIndexDuringSegmentGeneration": false
    },
    "metadata": {},
    "quota": {},
    "routing": {},
    "query": {},
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "fullData",
          "transformFunction": "jsonFormat(meta)"
        },
        {
          "columnName": "timestamp",
          "transformFunction": "now()"
        }
      ]
    },
    "isDimTable": false
  }
}
schema:
{
  "schemaName": "jsonIssue",
  "dimensionFieldSpecs": [
    {
      "name": "fullData",
      "dataType": "STRING"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestamp",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
data row in kafka
{
  "meta": {
    "expectedValidator": [],
    "relation": [
      {
        "id": "2",
        "firstName": "Jules",
        "lastName": "HUGHUE"
      }
    ],
    "expectedValidatorsForReject": [],
    "isEnabled": false,
    "version": "somevers",
    "isLogged": true,
    "language": "fr"
  },
  "data": {
    "id": "4",
    "type": "tiypes",
    "attributes": {
      "term": "2017-02",
      "creationDate": "2017-02-01T00:00:00+0100",
      "updateDate": "2017-02-01T00:00:00+0100",
      "informationComments": "someRandomtext",
      "closed": false,
      "state": "validated",
      "paid": true,
      "advance": 0,
      "actualExpenses": []
    }
  }
}
a
ugh, thanks @Neha Pawar! Sorry I missed that and took up your time; trying now
n
no worries! I wonder, did you see the transform configs in the table config JSON in the UI? Wondering how we could have conveyed this back better. The JSON objects are all configured to ignore unknown fields
a
hey, now that you mention it, I did notice earlier that I didn't see them. I did see jsonIndexColumns and noDictionaryColumns in the tableIndexConfig because those were correct
That should have clued me in earlier
Just thinking out loud here, but maybe a message somewhere that part of the config was ignored?
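[Editor's note: the warning suggested here could work roughly like the following sketch. This is plain Python, not Pinot code, and the set of known keys is an illustrative subset rather than the real parser's list.]

```python
import json

# Illustrative subset of top-level keys a table-config parser understands
KNOWN_TOP_LEVEL_KEYS = {
    "tableName", "tableType", "segmentsConfig", "tenants",
    "tableIndexConfig", "metadata", "ingestionConfig",
}

def ignored_keys(table_config_json):
    # Report keys that a parser set to ignore unknown fields would drop silently
    return sorted(set(json.loads(table_config_json)) - KNOWN_TOP_LEVEL_KEYS)

# transformConfigs placed at the top level (instead of under ingestionConfig)
# would be flagged rather than silently ignored:
misplaced = json.dumps({"tableName": "t", "tableType": "REALTIME", "transformConfigs": []})
print(ignored_keys(misplaced))  # ['transformConfigs']
```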
f
Thanks for your feedback. I figured out it was a misunderstanding on my side, as described previously. By chaining a jsonFormat and then a JSONPATHARRAY (since it's nested) I've managed to get what I expected. Thanks a lot for your time 🙂
k
there is a schema validate call and we should probably enable it by default. We explicitly set ignore-unknown-fields to true in schema/table config parsing, mostly for backward-compatibility purposes
n
A message saying what was ignored might be a good idea. @Kishore G validate always happens, but in this case it was just ignoring the transform configs because they were not in the expected place
k
I see.. that would be a good enhancement