# ingestion
a
For both Snowflake and MySQL ingestions, we have noticed that some of our datasets are missing the platform instance. The other datasets from the same UI ingestion run have the platform instance. We are running the `0.10.2` server and the `0.10.2.2` CLI for the UI ingestion. We looked at the `metadata_aspect_v2` table and noticed that the `dataPlatformInstance` aspect is missing the `instance` field in the `metadata` column. We saw the following:
{"platform":"urn:li:dataPlatform:mysql"}
as opposed to
{"platform":"urn:li:dataPlatform:mysql","instance":"<OUR_PLATFORM_INSTANCE_URN>"}
We have never had such a problem before. Has anyone else seen this?
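To find which rows are affected, a check along these lines could be run over the `metadata` values exported from `metadata_aspect_v2`. This is only a minimal sketch: the class `DpiInstanceCheck` and the method `hasInstance` are hypothetical names, the two sample rows are the ones quoted above, and the regex is a shortcut rather than a real JSON parser.

```java
import java.util.List;
import java.util.regex.Pattern;

class DpiInstanceCheck {
    // Looks for an "instance" key anywhere in the JSON text.
    // Assumption: the aspect JSON is flat, as in the rows quoted in this thread.
    private static final Pattern INSTANCE_KEY = Pattern.compile("\"instance\"\\s*:");

    static boolean hasInstance(String metadataJson) {
        return INSTANCE_KEY.matcher(metadataJson).find();
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
            "{\"platform\":\"urn:li:dataPlatform:mysql\"}",
            "{\"platform\":\"urn:li:dataPlatform:mysql\",\"instance\":\"<OUR_PLATFORM_INSTANCE_URN>\"}");
        for (String row : rows) {
            // Flag rows whose dataPlatformInstance aspect lacks the instance field.
            System.out.println((hasInstance(row) ? "ok      " : "MISSING ") + row);
        }
    }
}
```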
FYI @gray-airplane-39227
l
Hey there 👋 I'm The DataHub Community Support bot. I'm here to help make sure the community can best support you with your request. Let's double check a few things first: ✅ There's a lot of good information on our docs site: www.datahubproject.io/docs. Have you searched there for a solution? ✅ It's not uncommon that someone has run into your exact problem before in the community. Have you searched Slack for similar issues? ✅ Did you find a solution to your issue? ❌ Sorry you weren't able to find a solution. I'm sending you some tips on info you can provide to help the community troubleshoot. Whenever you feel your issue is solved, please react ✅ to your original message to let us know!
a
CC: @gray-shoe-75895, this could be the Snowflake issue we've been seeing?
a
We have been investigating this on our end as well for the MySQL ingestion. We saw three client requests related to the DPI (`dataPlatformInstance`) aspect:
1. IngestProposal for the container aspect for the dataset entity
2. IngestProposal for the DPI aspect for the dataset entity
3. IngestEntity for the dataset entity itself
Each of these requests runs logic that checks whether the aspects in the entity spec for the `dataset` entity type exist and, if they don't, attempts to create them. Because all three requests were running in parallel in GMS, this is a race condition. We observed that whenever request 2 won, the DPI aspect was populated correctly with the `instance` field. Whenever request 1 or 3 won, the DPI was missing the `instance` field. The code that creates the DPI aspect when it doesn't exist seems wrong because it never sets the `instance` field, so the inserted aspect is missing it.
public static Optional<DataPlatformInstance> buildDataPlatformInstance(String entityType, RecordTemplate keyAspect) {
  try {
    // Only the platform is set on the default aspect; the instance field is never populated here.
    return Optional.ofNullable(getDefaultDataPlatform(entityType, keyAspect))
        .map(platform -> new DataPlatformInstance().setPlatform(platform));
  } catch (URISyntaxException e) {
    log.error("Failed to generate data platform instance for entity {}, keyAspect {}", entityType, keyAspect);
    return Optional.empty();
  }
}
However, this code has been there for a long time and we only started seeing this problem in `v0.10.2`, so it is still puzzling why it suddenly started happening. Given that it is a race condition, you would expect the problem to have shown up before.
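One way to picture the race described above is as a check-then-act problem. The sketch below is a simplified model, not GMS code: it assumes the default-aspect path does a separate "is it missing?" check followed by a write, and that the explicit DPI proposal can land in between. The class `DpiRaceSketch`, its method names, and the in-memory map standing in for the aspect store are all hypothetical. The interleavings are replayed by hand in a fixed order so the outcome is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DpiRaceSketch {
    static final String KEY = "dataPlatformInstance";
    // Default aspect as built by the server-side fallback: platform only.
    static final String DEFAULT_DPI = "{\"platform\":\"urn:li:dataPlatform:mysql\"}";
    // Aspect carried by the explicit DPI proposal (request 2).
    static final String FULL_DPI =
        "{\"platform\":\"urn:li:dataPlatform:mysql\",\"instance\":\"<OUR_PLATFORM_INSTANCE_URN>\"}";

    // Request 1 finishes completely before request 2 starts.
    static String sequential() {
        Map<String, String> store = new ConcurrentHashMap<>();
        if (!store.containsKey(KEY)) {   // request 1: check
            store.put(KEY, DEFAULT_DPI); // request 1: write default
        }
        store.put(KEY, FULL_DPI);        // request 2: explicit upsert overwrites the default
        return store.get(KEY);
    }

    // Request 2 lands between request 1's check and request 1's write.
    static String interleavedUnguarded() {
        Map<String, String> store = new ConcurrentHashMap<>();
        boolean missing = !store.containsKey(KEY); // request 1: check (sees nothing)
        store.put(KEY, FULL_DPI);                  // request 2: explicit upsert
        if (missing) {
            store.put(KEY, DEFAULT_DPI);           // request 1: acts on a stale check
        }
        return store.get(KEY);
    }

    // Same ordering, but the default is written with an atomic check-and-write.
    static String interleavedGuarded() {
        Map<String, String> store = new ConcurrentHashMap<>();
        store.put(KEY, FULL_DPI);            // request 2: explicit upsert
        store.putIfAbsent(KEY, DEFAULT_DPI); // request 1: no-op, the aspect already exists
        return store.get(KEY);
    }

    public static void main(String[] args) {
        System.out.println("sequential:            " + sequential());
        System.out.println("interleaved unguarded: " + interleavedUnguarded());
        System.out.println("interleaved guarded:   " + interleavedGuarded());
    }
}
```

In this model only the unguarded interleaving ends up without the `instance` field. The `putIfAbsent` variant is just one possible way to make the default write safe in the model and is not a claim about how the server should be or was fixed.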
We think the following commit triggered the problem: https://github.com/datahub-project/datahub/commit/589d354a5798d2dd3b46f68451c8f8e36561a459 When we configured the sink with `max_threads` set to 1, the problem didn't happen. The server logic pointed out above doesn't handle the `DataPlatformInstance` aspect correctly when parallel requests try to update it.
g
Thanks @able-evening-90828 - we're looking into this