# getting-started
r
Racking my brain, team. Anyone have ANY idea why this transformer won't update the MongoDB entities I have in DataHub?
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:mongo:cicd"
Driving me crazy 😞 Here is the full recipe:
source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://..."

    # Credentials
    username: prod-read
    password: ...
    authMechanism: "SCRAM-SHA-1"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300
    options:
      directConnection: true

sink:
  type: datahub-rest
  config:
    server: 'http://...'

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:mongo:cicd"
Hm, it worked with --preview in the datahub ingest command. Very confused. Now I am thinking about when DataHub actually ingests workunits...
So it looks like the equivalent of a "flush" is performed at the end of the stream for Mongo? My stream was not completing due to an inability to serialize a byte sequence to a C int (overflow) in the Python MongoDB driver. I don't remember this transaction-like behavior in BigQuery ingestion; I recall datasets appearing in real time during ingestion.
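(The exact traceback isn't shown here, so this is only illustrative: if the failing value was an oversized integer, this is how that class of overflow surfaces from pymongo's bson module.)

```python
# Illustration only: BSON integers are at most 8 bytes, so encoding a larger
# Python int raises an OverflowError from the driver. This may or may not be
# the exact failure hit during the run described above.
import bson

try:
    bson.encode({"value": 2**64})  # one bit wider than int64 allows
except OverflowError as exc:
    print(f"OverflowError: {exc}")
```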
I went into the Mongo driver code and added error handling at the failure point; that's why the above screenshot was possible. We should catch exceptions at the top of the exception chain in datahub ingest.
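(A rough sketch of that "catch at the top" idea, not DataHub's actual ingest loop; process_collection is a hypothetical stand-in for the per-collection work of sampling documents, inferring schemas, and emitting workunits.)

```python
# Rough sketch of the "catch exceptions at the top" idea -- not DataHub's
# actual code. One failing collection is logged and skipped so the rest of
# the run (and the end-of-stream flush) still completes.
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)

def ingest_all(collections: Iterable[str], process_collection: Callable[[str], None]) -> None:
    failures = []
    for name in collections:
        try:
            process_collection(name)  # hypothetical per-collection step
        except Exception:
            logger.exception("Collection %r failed; continuing with the rest", name)
            failures.append(name)
    if failures:
        logger.warning("Run finished with %d failed collections: %s", len(failures), failures)
```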
Also, I wonder if this is because the entities already exist vs. creating new nodes. This was a follow-up run, so no new entities were being created, only updated.
m
@ripe-alarm-85320 yeah, as you suspected, we apply transforms only at the end of the stream now, because that is the only time when we know that there is no more data coming for this dataset.
This is not related to whether there is new data or existing data being updated.
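(A toy sketch of what end-of-stream application means in practice, not the actual DataHub transformer API: the added tag aspects can only be emitted once the source iterator is exhausted, which is why nothing shows up mid-run and why an aborted stream never gets them.)

```python
# Toy illustration of an end-of-stream transform (not DataHub's transformer API).
# Tags for each dataset are emitted only after the source is exhausted, i.e. at
# the same point a "flush" would happen -- so an aborted stream never reaches it.
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, dict]  # simplified stand-in for (dataset urn, aspect)

def add_tags_at_end(records: Iterable[Record], tag_urns: List[str]) -> Iterator[Record]:
    seen: Dict[str, bool] = {}
    for urn, aspect in records:
        seen[urn] = True
        yield urn, aspect  # pass source records through unchanged
    # Only here do we know that no more data is coming for any dataset.
    for urn in seen:
        yield urn, {"globalTags": [{"tag": t} for t in tag_urns]}
```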