# ingestion
w
Hi, I’ve just been playing with a script that ingests data into DataHub programmatically, following this example -> https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/programatic_pipeline.py#enroll-beta At some point I figured out that there is a method called log_ingestion_stats on the pipeline object, and I wondered whether I could use it to get some metrics about the pipeline that was run. Inside this method I saw a code block that sends some statistics using a telemetry object. It looks like this:
telemetry.telemetry_instance.ping(
    "ingest_stats",
    {
        "source_type": self.config.source.type,
        "sink_type": self.config.sink.type,
        "records_written": stats.discretize(
            self.sink.get_report().total_records_written
        ),
        "source_failures": stats.discretize(source_failures),
        "source_warnings": stats.discretize(source_warnings),
        "sink_failures": stats.discretize(sink_failures),
        "sink_warnings": stats.discretize(sink_warnings),
        "global_warnings": global_warnings,
        "failures": stats.discretize(source_failures + sink_failures),
        "warnings": stats.discretize(
            source_warnings + sink_warnings + global_warnings
        ),
    },
)
Inside the ping method, the code sends this data to an external service called Mixpanel. It seems you are collecting data about the pipeline from my machine. I don’t like this way of collecting data. Why are you collecting it?
l
Hey there 👋 I'm The DataHub Community Support bot. I'm here to help make sure the community can best support you with your request. Let's double check a few things first:
1️⃣ There's a lot of good information on our docs site: www.datahubproject.io/docs. Have you searched there for a solution?
2️⃣ It's not uncommon that someone has run into your exact problem before in the community. Have you searched Slack for similar issues?
a
Hi, this is opt-out telemetry collection. You can disable it with the instructions in this doc https://datahubproject.io/docs/deploy/telemetry/
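For a programmatic pipeline like the one in the question, here is a minimal sketch of opting out from inside the script itself. It assumes the client honours the DATAHUB_TELEMETRY_ENABLED environment variable and reads it when datahub is first imported. Please check the linked doc for the exact mechanism in your version. The helper name disable_datahub_telemetry is hypothetical, not part of DataHub.

```python
import os


def disable_datahub_telemetry(env=os.environ):
    """Hypothetical helper: mark telemetry as disabled in the given environment.

    Assumption: the DataHub client checks DATAHUB_TELEMETRY_ENABLED when it is
    imported, so this must run before any `datahub` import.
    """
    env["DATAHUB_TELEMETRY_ENABLED"] = "false"
    return env["DATAHUB_TELEMETRY_ENABLED"]


# Call this at the very top of the ingestion script.
disable_datahub_telemetry()

# Only import DataHub after the variable is set, for example:
# from datahub.ingestion.run.pipeline import Pipeline
```

Setting the variable in the shell or the deployment environment before starting the script should have the same effect, and avoids depending on import order.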