# getting-started
r
Racking my brain, team. Anyone have ANY idea why this transformer won't update the MongoDB entities I have in DataHub?
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:mongo:cicd"
Driving me crazy 😞 Here is the full recipe:
source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://..."

    # Credentials
    username: prod-read
    password: ...
    authMechanism: "SCRAM-SHA-1"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300
    options:
      directConnection: true

sink:
  type: datahub-rest
  config:
    server: 'http://...'

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:mongo:cicd"
Hm, it worked with --preview in the datahub ingest command. Very confused. Now I am thinking about when DataHub actually ingests workunits...
So it looks like the equivalent of a "flush" is performed at the end of the stream for Mongo? My stream was not completing due to an inability to serialize a byte sequence to a C int (overflow) in the Python MongoDB driver. I don't remember this transaction-like behavior in BigQuery ingestion; I recall datasets appearing in real time during ingestion.
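(The exact traceback isn't shown here, so this is only illustrative: if the failing value was an oversized integer, this is how that class of overflow surfaces from pymongo's bson module.)

```python
# Illustration only: BSON integers are at most 8 bytes, so encoding a larger
# Python int raises an OverflowError from the driver. This may or may not be
# the exact failure hit during the run described above.
import bson

try:
    bson.encode({"value": 2**64})  # one bit wider than int64 allows
except OverflowError as exc:
    print(f"OverflowError: {exc}")
```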
I went into the Mongo driver code and added error handling at the failure point; that's why the above screenshot was possible. We should catch exceptions at the top of the exception chain in datahub ingest.
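(A rough sketch of that "catch at the top" idea, not DataHub's actual ingest loop; process_collection is a hypothetical stand-in for the per-collection work of sampling documents, inferring schemas, and emitting workunits.)

```python
# Rough sketch of the "catch exceptions at the top" idea -- not DataHub's
# actual code. One failing collection is logged and skipped so the rest of
# the run (and the end-of-stream flush) still completes.
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)

def ingest_all(collections: Iterable[str], process_collection: Callable[[str], None]) -> None:
    failures = []
    for name in collections:
        try:
            process_collection(name)  # hypothetical per-collection step
        except Exception:
            logger.exception("Collection %r failed; continuing with the rest", name)
            failures.append(name)
    if failures:
        logger.warning("Run finished with %d failed collections: %s", len(failures), failures)
```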
Also, I wonder if this is because the entities already exist vs. creating new nodes. This was a follow-up run, so no new entities were being created, only updated.
m
@ripe-alarm-85320 yeah, as you suspected, we apply transforms only at the end of the stream now, because that is the only time when we know that there is no more data coming for this dataset.
This is not related to whether there is new data or existing data being updated.
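(A toy sketch of what end-of-stream application means in practice, not the actual DataHub transformer API: the added tag aspects can only be emitted once the source iterator is exhausted, which is why nothing shows up mid-run and why an aborted stream never gets them.)

```python
# Toy illustration of an end-of-stream transform (not DataHub's transformer API).
# Tags for each dataset are emitted only after the source is exhausted, i.e. at
# the same point a "flush" would happen -- so an aborted stream never reaches it.
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, dict]  # simplified stand-in for (dataset urn, aspect)

def add_tags_at_end(records: Iterable[Record], tag_urns: List[str]) -> Iterator[Record]:
    seen: Dict[str, bool] = {}
    for urn, aspect in records:
        seen[urn] = True
        yield urn, aspect  # pass source records through unchanged
    # Only here do we know that no more data is coming for any dataset.
    for urn in seen:
        yield urn, {"globalTags": [{"tag": t} for t in tag_urns]}
```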