# advice-metadata-modeling
g
Hi again, Another question related to new aspects added via the PDL files. We managed to make all the changes needed in order to make the new Glossary Term aspects added by us visible in the GUI thanks to extensive help from @echoing-airport-49548 back in December. One thing not working yet is support of attributes from these aspects in filtering. I added the needed annotations in the respective part of the PDL files like this:
  @Searchable = {
    "/*": {
      "fieldType": "KEYWORD",
      "addToFilters": true,
      "filterNameOverride": "Term Category",
      "queryByDefault": true
    }
  }
  termCategory: optional GlossaryTermCategoryEnum
In another part, I did something similar for a boolean attribute (with "fieldType": "BOOLEAN"). Nevertheless, this attribute isn't offered for filtering, even though my search finds glossary terms that have these attributes populated. I know that the general mechanism works, because the following change to the Upstream and Downstream aspects of the Dataset entity makes it possible to filter on the related attribute:
  /**
   * The type of the lineage
   */
  @Searchable = {
    "fieldType": "KEYWORD",
    "addToFilters": true,
    "filterNameOverride": "Lineage Type"
  }
  type: DatasetLineageType
Is the "addToFilters" support restricted to the Dataset entity (thus not working for Glossary Term)? If so, where (which file) do we need to modify to remove this limitation? Or is it possibly restricted to just some predefined list of supported aspects (again, where are those aspects listed in that case)? @bulky-soccer-26729, do you see some obvious issue in the outlined PDL file modification above? Or what else might be the reason?
e
Hi @gentle-portugal-21014 that annotation will lead to the filter being added to the left filter pane on Search, like this
Is that what you're trying to build for?
g
Hi @echoing-airport-49548, yes, that's exactly what I try to achieve. However, it doesn't work as expected in the case described in the first part of my post (i.e. in case of an attribute in a newly added aspect belonging to the Glossary Term entity).
e
Ah got it
I don't see an obvious issue…
Would you be able to join office hours tomorrow? Someone might be able to help you pair debug it
g
Thanks, that's a useful suggestion (I don't know much about how the "office hours" work, so I hadn't thought of this option). However, I'm afraid that I cannot join today at 9am PST due to a conflicting appointment. 😞
I'll try to join the office hours tomorrow.
e
Awesome! @bulky-soccer-26729 will be there and should be able to help 🙂
b
hey @gentle-portugal-21014! i think i might actually see the issue with your pdl above and why that annotation isn't working as expected. in your PDL you have:
@Searchable = {
     "/*": {
      ...
but the "/*" syntax is used when you're adding an annotation to a list field. Here you have a regular enum value field, so I would make your PDL look like this instead:
  @Searchable = {
    "fieldType": "KEYWORD",
    "addToFilters": true,
    "filterNameOverride": "Term Category",
    "queryByDefault": true
  }
  termCategory: optional GlossaryTermCategoryEnum
without the extra syntax for list fields
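The distinction above can be checked mechanically. Here is a minimal Python sketch, assuming the annotation has already been parsed into a dict. `check_searchable` and its `is_array` flag are hypothetical helpers made up for illustration; they are not part of DataHub.

```python
from typing import Any, Dict, List


def check_searchable(annotation: Dict[str, Any], is_array: bool) -> List[str]:
    """Hypothetical lint: flag a "/*" wrapper on a field that is not an array."""
    problems: List[str] = []
    if "/*" in annotation and not is_array:
        problems.append('"/*" wrapper used on a non-array field')
    return problems


# The annotation from the original question; termCategory is a single enum value.
original = {
    "/*": {
        "fieldType": "KEYWORD",
        "addToFilters": True,
        "filterNameOverride": "Term Category",
        "queryByDefault": True,
    }
}
# The suggested form, without the list-field wrapper.
suggested = original["/*"]

print(check_searchable(original, is_array=False))
print(check_searchable(suggested, is_array=False))
```

The first call reports the wrapper as a problem and the second returns an empty list.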
g
Indeed, thanks again for helping me with this!
e
Good catch Chris!!
g
Hi @bulky-soccer-26729, unfortunately, it turned out that yesterday's change wasn't as successful as it seemed. 😞 First of all, searching for the enum value no longer works after that change. The enum mentioned in the PDL fragment above is defined as:
enum GlossaryTermCategoryEnum {
  @stringFormat = "Document"
  DOCUMENT

  @stringFormat = "ICT"
  ICT

  @stringFormat = "Role"
  ROLE

  @stringFormat = "Institution"
  INSTITUTION

  @stringFormat = "Other"
  OTHER
}
Before that change, searching for "ICT" found glossary terms with the "ICT" value in the termCategory attribute defined above; after the change, the same search no longer works. 😞 Maybe the "/*": {} construct isn't appropriate for boolean fields (I had it there as well before the change), but according to my testing it seems to be necessary for the enum / KEYWORD attributes... Moreover, the change did not help in making "Term Category" available as a filter after a search returning terms with the termCategory attribute populated. 😞 Any further ideas? Maybe there really is some kind of limitation on filtering by attributes defined for the Glossary Term entity, and/or defined in newly added aspects?
b
hmm very interesting.. I thought for sure that the "/*" syntax was for array fields. After you made your change and rebuilt GMS, did you ingest new data to test this out on? any change to Searchable annotations on PDL files is only applied to new data and is not retroactive with existing data.
i don't believe it should matter if it's a Glossary Term entity vs. another entity in this instance
the way it works is that you specify a field to have a Searchable annotation on it, and then when you ingest new data, we look to see if there's any fields with a Searchable annotation and if so, we create an upsert into elastic search for it
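The behavior described above (annotations are read when data is written, so documents indexed earlier keep their old shape) can be illustrated with a toy model. This is a simplified sketch, not DataHub's actual indexing code. The `ingest` function, the in-memory `index` dict, the example URNs, and the `internalNote` field are all hypothetical.

```python
from typing import Any, Dict, Set

# Stands in for the Elasticsearch index: urn -> indexed document.
index: Dict[str, Dict[str, Any]] = {}


def ingest(urn: str, aspect: Dict[str, Any], searchable: Set[str]) -> None:
    """Upsert only the fields that are annotated as searchable at write time."""
    document = {name: value for name, value in aspect.items() if name in searchable}
    index.setdefault(urn, {}).update(document)


aspect = {"termCategory": "ICT", "internalNote": "never indexed"}

# Written before the annotation existed: nothing from this aspect is indexed.
ingest("urn:li:glossaryTerm:old", aspect, searchable=set())
old_before_reingest = dict(index["urn:li:glossaryTerm:old"])

# Written after the annotation was added: termCategory is indexed.
ingest("urn:li:glossaryTerm:new", aspect, searchable={"termCategory"})

# Writing the old term again (or restoring its index entry) picks up the field.
ingest("urn:li:glossaryTerm:old", aspect, searchable={"termCategory"})

print(old_before_reingest)
print(index["urn:li:glossaryTerm:old"])
```

In this model the old term only becomes searchable by termCategory once it is written again, which matches the need to re-ingest or restore indices after changing an annotation.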
g
OK, thanks, I'll recheck that. Is restoring indices supposed to allow proper searching on previously created elements?
b
yes I believe that's the case.. and that's not what you're seeing?
g
Well, I'll have to make sure first that the indices have indeed been restored during the deployment in our CI/CD pipeline. ;-)
I'll let you know.
b
sounds good!
g
Unfortunately, reindexing doesn't seem to work well either. 😞 We used the approach for docker-compose deployment ("./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices"), both without and with the additional "-a clean" parameter. While running, it complains about unknown aspects, i.e. it seems to use a ready-made / released image which doesn't know our local changes. I don't know whether these complaints are considered important or are just warnings, but it doesn't help resolve the current issue. As of now, searching for anything appearing in Glossary Terms fails (empty output, error in the log of the GMS container), whereas searching for strings appearing in dataset names is OK. Interestingly, the failure happens only if I choose to display all results of the (term-related) search; the as-you-type suggestions work correctly for glossary term names as well (though not for the termCategory enum value, regardless of whether it's a newly created record or an older one)... I'm trying to revert yesterday's change to see whether it helps (but I saved the errors from the GMS log in case somebody is able to understand the reason from them).
b
gotcha. can you share the error logs from GMS you're getting when searching?
g
Sure, will do in a minute. In the meantime - deployment of the reverted version just finished and searching for strings appearing in glossary term names works again...
The error is:
17:15:56.801 [ForkJoinPool.commonPool-worker-9] WARN  c.l.m.s.e.query.ESSearchDAO:68 - Received 400 from Elasticsearch. Returning empty search response
org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
	at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187)
	at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1892)
	at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1869)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1626)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1583)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1553)
	at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1069)
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:60)
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.search(ESSearchDAO.java:100)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:97)
	at com.linkedin.metadata.search.client.CachingEntitySearchService.getRawSearchResults(CachingEntitySearchService.java:196)
	at com.linkedin.metadata.search.client.CachingEntitySearchService.lambda$getCachedSearchResults$0(CachingEntitySearchService.java:117)
	at com.linkedin.metadata.search.cache.CacheableSearcher.getBatch(CacheableSearcher.java:103)
	at com.linkedin.metadata.search.cache.CacheableSearcher.getSearchResults(CacheableSearcher.java:55)
	at com.linkedin.metadata.search.client.CachingEntitySearchService.getCachedSearchResults(CachingEntitySearchService.java:118)
	at com.linkedin.metadata.search.client.CachingEntitySearchService.search(CachingEntitySearchService.java:54)
	at com.linkedin.metadata.search.aggregator.AllEntitiesSearchAggregator.lambda$getSearchResultsForEachEntity$2(AllEntitiesSearchAggregator.java:161)
	at com.linkedin.metadata.utils.ConcurrencyUtils.lambda$transformAndCollectAsync$0(ConcurrencyUtils.java:24)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
	Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://elasticsearch:9200], URI [/glossarytermindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request]
error=[object Object] error=[object Object] status=400
		at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:302)
		at org.elasticsearch.client.RestClient.performRequest(RestClient.java:272)
		at org.elasticsearch.client.RestClient.performRequest(RestClient.java:246)
		at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1613)
		... 21 common frames omitted
At a certain point, it complained that it got my search term (let's say "term") instead of a boolean, i.e. it was expecting true or false. However, this part of the log disappeared before I managed to download it.
Anyway - it's quite late here (18:45 local time), I need to go home. I'll try to get back later in the evening in case you come up with some idea. Moreover, I'm afraid that we'll need to find some solution for reindexing, if I understand correctly that the reindexing feature doesn't support custom aspects not available in released versions...
Hi @bulky-soccer-26729, after some additional testing, the outcomes are as follows:
1. The problem with searching for the enum value was apparently caused by the broken Elasticsearch index, combined with the fact that this search only works correctly for newly created elements and those that have been reindexed. In other words, you were completely correct that the additional "/*" syntax was not necessary. 🙂
2. Unfortunately, this doesn't solve the original issue: filtering based on those attributes is not possible despite the "addToFilters": true annotation. 😞 I still need to get this problem resolved and have no clue what to do there...
3. On top of that, it seems that reindexing the whole database using the datahub-upgrade docker image is not possible for forked repositories containing metamodel changes (extensions). 😞 As part of my testing, I could reindex the individual glossary term records using the API approach (/aspects?action=restoreIndices on GMS), but the datahub-upgrade docker image complains about unknown aspects, etc.
Please let me know if I should raise this last point in a different channel - it's kind of related to metadata-modeling, but also to other stuff, and I can imagine that other members of your team might need to be involved.
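For reference, the per-record API approach mentioned above could be scripted roughly as follows. This is only a sketch: the endpoint path comes from the message above, but the request body shape (`urn` and `aspect` keys), the Rest.li protocol header, the GMS address `http://localhost:8080`, and the URN and aspect name are all assumptions that would need to be checked against the deployed GMS version.

```python
import json
import urllib.request


def build_restore_request(gms_url: str, urn: str, aspect: str) -> urllib.request.Request:
    """Build (but do not send) a restoreIndices call for one urn and aspect."""
    # Assumed body shape; verify the parameter names against your GMS version.
    body = json.dumps({"urn": urn, "aspect": aspect}).encode("utf-8")
    return urllib.request.Request(
        url=f"{gms_url.rstrip('/')}/aspects?action=restoreIndices",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-RestLi-Protocol-Version": "2.0.0",  # assumed to be required
        },
        method="POST",
    )


# Hypothetical URN and aspect name, used only to show the call shape.
req = build_restore_request(
    "http://localhost:8080",
    "urn:li:glossaryTerm:example",
    "customTermAspect",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Looping this over the glossary term URNs would reproduce the manual per-item reindexing without depending on the datahub-upgrade image.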
b
hey! okay gotcha.
1. good news! at least we know what's going on with that piece
2. hmm interesting.. are you seeing this termCategory column in your elastic search index? and the glossary terms you're creating have that field filled out in your database?
3. okay yes this is definitely interesting and something that we should bring up as a separate issue! I would suggest posting in #advice-metadata-modeling I believe as like you said it's due to modeling changes
g
Ad 2) Yes, looking into the Elastic Search, I can see multiple records belonging to "_index": "glossarytermindex_v2_1675272639027" containing this column. Moreover, I believe that successful searching for a particular value of this attribute kind of confirms that they're indexed properly, because those glossary terms wouldn't be found otherwise if I understand it correctly.
Ad 3) OK, I'll create another post for that.
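The index check discussed above can also be done directly against Elasticsearch. The sketch below prepares two standard Elasticsearch requests: a field-mapping lookup and a terms aggregation that counts the distinct values of the field. The host and index alias are taken from the stack trace earlier in the thread. The plain field name `termCategory` is an assumption, since the actual mapping may expose the aggregatable value under a sub-field, and whether the filter pane is driven by such aggregations is also an assumption.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

ES_URL = "http://elasticsearch:9200"  # host as shown in the stack trace
INDEX = "glossarytermindex_v2"        # index alias as shown in the stack trace


def terms_aggregation(field: str, size: int = 20) -> Dict[str, Any]:
    """Search body that returns no hits, only value counts for `field`."""
    return {"size": 0, "aggs": {"values": {"terms": {"field": field, "size": size}}}}


def es_request(path: str, body: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build (but do not send) a GET without a body, or a POST with a JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{ES_URL}/{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST" if body is not None else "GET",
    )


# How is the field mapped (keyword, text, boolean, ...)?
mapping_req = es_request(f"{INDEX}/_mapping/field/termCategory")
# Can Elasticsearch aggregate on it, and which values does it see?
agg_req = es_request(f"{INDEX}/_search", terms_aggregation("termCategory"))

print(mapping_req.get_method(), mapping_req.full_url)
print(agg_req.get_method(), agg_req.full_url)
# To actually send them: urllib.request.urlopen(mapping_req), urllib.request.urlopen(agg_req)
```

If the aggregation returns buckets such as "ICT" while the UI still shows no filter, the index side is probably fine and the cause is more likely in the layer that decides which facets to request or display.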
b
hmm and just making sure - you added "addToFilters" originally when adding the searchable annotation and when you ingested data with this field?
g
Yes, I had the addToFilters there since the introduction of that new aspect. I changed the annotation based on your suggestion and reindexed all Glossary Term items using the GMS API (individually) since then. Searching for the attribute value didn't work for the particular items until I reindexed those items. I tried to create yet another item anew now and filled in the attribute accordingly just to be sure, but that made no difference either.
b
okay good to know. also I'm just seeing the error log above.. are you still getting that error? and if so, can you capture any more of it? it looks like it might be continuing on and there could be some good info after what was just posted above
g
No, that error is not there any longer. It already disappeared when I reverted the PDL annotation change yesterday. In the meantime, I re-applied that change (i.e. it's in the form proposed by you now) and reindexed glossary terms manually as described above (fortunately, it's just a testing environment with just a few items). There are other interesting lines in the logs, but those deserve their own thread, I guess. ;-)
I asked about those newly discovered interesting lines from the GMS log in https://datahubspace.slack.com/archives/C029A3M079U/p1675359319651589, but I hope that they shouldn't be directly related to the topic discussed here (but I'm not sure).
And btw, there wasn't anything useful in the GMS container log after the lines regarding the search issue yesterday. The following line was:
INFO  c.l.m.filter.RestliLoggingFilter:55 - GET /entitiesV2?ids=List(urn%3Ali%3Acorpuser%3Adatahub) - batchGet - 200 - 3ms
and there were just a few other similar lines. I understand that those "omitted frames" might have been interesting, but those weren't written to that log.
Hi @bulky-soccer-26729, sorry to bother you again, but do you have any idea how to move forward with this filtering issue? I don't even know where (which layer / which source files) is the logic deciding on the filters being offered for a particular search result... :-(