# ingestion
  • witty-keyboard-20400 (11/03/2021, 12:12 PM)
    Issue with password auth while ingesting from MongoDB: my yml file mongodb.yml has the connect_uri as follows:
    ```
    connect_uri: "mongodb://dbuser1:xyz123#$$!us@192.168.1.100:27017/mongodbName?authSource=mongodbName"
    ```
    The above-mentioned username and password are correct; I verified them with the Robo 3T client. When I try to ingest the metadata using the command
    ```
    datahub ingest -c ./mongodb.yml
    ```
    It fails with the following auth error:
    ```
    all_credentials = {'mongodbName': MongoCredential ('SCRAM-SHA-1', 'mongodbName', 'dbuser1', 'xyz123#14911!us', None, <pymongo.auth._Cache object at 0x7ff8d7b91d00>, )
    ...
    credentials = MongoCredential ('SCRAM-SHA-1', 'mongodbName', 'dbuser1', 'xyz123#14911!us', None, <pymongo.auth._Cache object at 0x7ff8d7b91d00>, )
    ...
    ...
    OperationFailure: Authentication failed., full error: {'ok': 0.0, 'errmsg': 'Authentication failed.', 'code': 18, 'codeName': 'AuthenticationFailed', 'operationTime': Timestamp(1635941323, 2), '$clusterTime': {'clusterTime': Timestamp(1635941323, 2), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}}
    ```
    Notice that the password I provided in the yml file is xyz123#$$!us, while the password shown in the log statements is xyz123#14911!us. Why is this happening? Is this a bug, or am I missing something here? @big-carpet-38439 @miniature-tiger-96062
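    A likely explanation, based only on the log above: the `$$` in the password seems to have been expanded to a process id (14911) somewhere between the recipe and the driver, and reserved URI characters such as `#`, `$`, and `!` should be percent-encoded in a connection URI anyway. A minimal recipe sketch reusing the connection details from this message, with only the encoding changed (a hypothetical fix, not a confirmed one):
    ```yaml
    source:
      type: mongodb
      config:
        # Percent-encode reserved characters in the password:
        # "#" -> %23, "$" -> %24, "!" -> %21, so "xyz123#$$!us" becomes "xyz123%23%24%24%21us"
        connect_uri: "mongodb://dbuser1:xyz123%23%24%24%21us@192.168.1.100:27017/mongodbName?authSource=mongodbName"
    ```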
  • careful-insurance-60247 (11/03/2021, 6:32 PM)
    What's the best way to run ingestion? Have a server just run multiple recipes with a cron job?
  • careful-insurance-60247 (11/03/2021, 7:35 PM)
    For the MSSQL source, are we able to set the isolation level?
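    If the mssql source behaves like the other SQLAlchemy-based sources, its `options` block is passed through to create_engine, which accepts an isolation_level argument. A hedged sketch with placeholder connection details (the pass-through behaviour is an assumption here):
    ```yaml
    source:
      type: mssql
      config:
        host_port: "mssql-host:1433"      # placeholder connection details
        database: MyDatabase
        username: datahub_reader
        password: "example-password"      # placeholder
        # Assumption: `options` is forwarded as kwargs to SQLAlchemy's create_engine(),
        # which supports isolation_level for the MSSQL dialect.
        options:
          isolation_level: "READ UNCOMMITTED"
    ```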
  • damp-minister-31834 (11/04/2021, 1:40 AM)
    Hi everyone, can DataHub get lineage in Hive automatically? I mean, I have two tables in Hive: tableA is the upstream of tableB. When I ingest the Hive source into DataHub, the two tables are not linked to each other. Is there a way to get the lineage automatically during ingestion, without manually calling lineage_emitter_kafka.py or lineage_emitter_rest.py?
  • damp-minister-31834 (11/04/2021, 6:18 AM)
    Hi everyone, can DataHub delete metadata while ingesting? The situation: I have DataHub installed with an ingested Hive source. When I dropped a table in Hive and ingested again, the dropped dataset was still in DataHub. Do I have to call the REST API to delete the dataset in DataHub, or is there an automatic way?
  • orange-flag-48535 (11/04/2021, 6:23 AM)
    Hi everyone, does DataHub provide ingestion libraries in Java? Currently I only see Python modules for that. Thanks.
  • orange-flag-48535 (11/04/2021, 11:05 AM)
    If I want to store new types of entities in DataHub, should I be extending the metadata model and creating new Entity types? Let's say I want to store schema information for the inventory in my shop (books, gadgets, etc.). The full list of existing Snapshot types is quite small: https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdl. I'm also using https://datahubproject.io/docs/metadata-modeling/extending-the-metadata-model as a reference. If so, are there any examples where people have created their own Entity types and defined the corresponding Aspects, Ownership, etc.?
  • plain-farmer-27314 (11/04/2021, 5:07 PM)
    Hey all, I'm trying to get things running locally using Kubernetes/minikube, following https://datahubproject.io/docs/deploy/kubernetes. It looks like everything in the prerequisites step comes up except for the elasticsearch-master-0 pod, which logs:
    ```
    "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elasticsearch-master-0, elasticsearch-master-1, elasticsearch-master-2] to bootstrap a cluster: have discovered
    ```
    Did I miss a step here?
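    If this is a single-node minikube, one common cause is that the bundled Elasticsearch chart defaults to three master-eligible replicas, so the lone pod keeps waiting for peers that never appear. A values-override sketch, assuming the prerequisites chart exposes the stock elasticsearch subchart keys:
    ```yaml
    # values override for a single-node minikube setup (key names are an assumption)
    elasticsearch:
      replicas: 1
      minimumMasterNodes: 1
    ```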
  • echoing-dress-35614 (11/04/2021, 6:19 PM)
    In the last community meeting, @mammoth-bear-12532 mentioned a repo of DataHub recipes and best-practice examples, based on a PR by @bland-orange-95847. I'm interested in exploring this repo in preparation for writing new ingestion plugins for ArcGIS and Tableau. Anyone have a link?
  • damp-ambulance-34232 (11/05/2021, 2:18 AM)
    Is there any way I can change the schema name when ingesting a dataset into DataHub, e.g. from public.datahub to new_name.datahub?
  • abundant-flag-19546 (11/05/2021, 6:01 AM)
    What role should I give ingestion-cron's service account to ingest from BigQuery? I gave it 'BigQuery Metadata Viewer' but got an authentication error 😞
  • polite-flower-25924 (11/05/2021, 7:52 AM)
    Hey folks, in order to catalog our Airflow DAGs in DataHub, should I follow https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow? I just wonder if there is a simple recipe with which I can ingest DAGs from Airflow without interfering with Airflow hooks and operators. As far as I know, Airflow already exposes a REST API for this metadata (e.g. /api/v1/dags, /api/v1/dags/{dag_id}). Maybe we can utilize that endpoint. What do you think?
  • orange-flag-48535 (11/05/2021, 9:36 AM)
    In a Datahub recipe that uses a file source, is it possible to list multiple source files in the same recipe file?
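    For reference, the file source as documented points at one file at a time, so the straightforward pattern is one recipe per input file. A sketch (the `filename` key and the paths below are assumptions for illustration):
    ```yaml
    source:
      type: file
      config:
        filename: "./first_mce_file.json"   # hypothetical path; repeat the recipe per input file
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"      # assumed local GMS endpoint
    ```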
  • orange-flag-48535 (11/05/2021, 10:01 AM)
    I'm trying to create a custom Dataset and store a somewhat nested JSON schema inside its SchemaMetadata aspect. I'm trying to decide between the following four SchemaMetadata types: KafkaSchema, BinaryJsonSchema, KeyValueSchema, OtherSchema. Am I on the right track regarding this?
  • rhythmic-sundown-12093 (11/08/2021, 3:16 AM)
    Hello team, I am trying to integrate OIDC authentication. Our company has its own OIDC service, but it is accessed over HTTPS. How can I add a third-party certificate in DataHub? Thanks!
  • eager-answer-71364 (11/08/2021, 6:56 AM)
    I'm facing an issue with duplicate keys (errors thrown by datahub-gms) caused by lower/upper case differences. I believe I already deleted those entries in Elasticsearch and MySQL, but the errors are still there. I've used the search API on Elasticsearch and queried MySQL, but I didn't hit the keys reported by GMS. Is the data only stored in Elasticsearch and MySQL?
  • rough-zoo-50278 (11/08/2021, 7:25 AM)
    Hey, I'm using a bare-minimum config for Postgres:
    ```
    07:21:46.830 [qtp544724190-12] ERROR c.l.m.filter.RestliLoggingFilter - java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    07:21:47.228 [qtp544724190-10] INFO  c.l.m.filter.RestliLoggingFilter - POST /entities?action=ingest - ingest - 500 - 1ms
    ```
    Is there any additional config needed to make ingestion work? EDIT: It seems to fail here:
    ```
    : Invalid URN Parameter: 'No enum constant com.linkedin.common.FabricType.dev: urn:li:dataset:(urn:li:dataPlatform:postgres,...,...)
    ```
    EDIT: It was a wrong env (one that didn't exist).
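    For anyone hitting the same error: the env value in a recipe has to match a FabricType enum constant (the log above shows a lower-case dev being rejected). A minimal sketch with placeholder connection details:
    ```yaml
    source:
      type: postgres
      config:
        host_port: "postgres-host:5432"   # placeholder connection details
        database: mydb
        username: datahub_reader
        password: "example-password"      # placeholder
        env: DEV   # must be a valid FabricType constant such as PROD or DEV; "dev" fails
    ```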
  • damp-ambulance-34232 (11/08/2021, 9:04 AM)
    How do I ingest data queries/stats?
  • freezing-teacher-87574 (11/08/2021, 10:33 AM)
    Hi! Please, could someone supply me with an example of how to ingest Feast data? I'm very confused by this: "Note: Feast ingestion requires Docker to be installed. Extracts: • List of feature tables (modeled as `MLFeatureTable`s), features (`MLFeature`s), and entities (`MLPrimaryKey`s) • Column types associated with each feature and entity. Note: this uses a separate Docker container to extract Feast's metadata into a JSON file, which is then parsed to DataHub's native objects. This was done because of a dependency conflict in the feast module."
    ```
    source:
      type: feast
      config:
        core_url: localhost:6565 # default
        env: "PROD" # Optional, default is "PROD"
        use_local_build: False # Whether to build Feast ingestion image locally, default is False
    ```
    Thanks
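    For what it's worth, a full recipe also needs a sink next to that source block; a minimal end-to-end sketch (the GMS address is an assumption):
    ```yaml
    source:
      type: feast
      config:
        core_url: "localhost:6565"
        env: "PROD"
        use_local_build: False
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"   # assumed local GMS endpoint
    ```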
  • damp-ambulance-34232 (11/08/2021, 10:40 AM)
    How do I keep pre-existing tags and owners when re-ingesting a data source into DataHub? Or how do I ignore all existing datasets from a data source when re-ingesting it daily?
  • little-france-72098 (11/08/2021, 10:43 AM)
    Hey guys, I was updating the ingestion library from 0.8.15.4 to 0.8.16.4, and suddenly some schemas can't be ingested by the Kafka source, failing with this error:
    ```
    File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 141, in run
        for wu in self.source.get_workunits():
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/kafka.py", line 84, in get_workunits
        mce = self._extract_record(t)
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/kafka.py", line 115, in _extract_record
        fields = schema_util.avro_schema_to_mce_fields(schema.schema_str)
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/extractor/schema_util.py", line 443, in avro_schema_to_mce_fields
        return list(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/extractor/schema_util.py", line 427, in to_mce_fields
        yield from converter._to_mce_fields(avro_schema)
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/extractor/schema_util.py", line 408, in _to_mce_fields
        yield from self._avro_type_to_mce_converter_map[type(avro_schema)](avro_schema)
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/extractor/schema_util.py", line 393, in _gen_from_non_field_nested_schemas
        yield from self._to_mce_fields(sub_schema)
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/extractor/schema_util.py", line 408, in _to_mce_fields
        yield from self._avro_type_to_mce_converter_map[type(avro_schema)](avro_schema)
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/extractor/schema_util.py", line 328, in _gen_nested_schema_from_field
        yield from self._to_mce_fields(sub_schema)
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/extractor/schema_util.py", line 408, in _to_mce_fields
        yield from self._avro_type_to_mce_converter_map[type(avro_schema)](avro_schema)
    KeyError: <class 'avro.schema.UUIDSchema'>
    ```
    The schemas in question indeed have fields with the logicalType uuid, and this case doesn't seem to be handled in schema_util.
  • brief-lizard-77958 (11/08/2021, 3:05 PM)
    Hey everyone. I'm having trouble ingesting properties: they don't get added even though the ingestion succeeds. For example, ingesting a dashboard with customProperties (posting the JSON I ingest in the thread). The ingestion is successful, but I can't find the properties in the UI. I'm ingesting through REST; this exact code worked on previous versions and is taken from the bootstrap_mce.json file in the latest GitHub version.
  • agreeable-hamburger-38305 (11/09/2021, 1:28 AM)
    Hi all, I have a question about the Queries tab. Since the default timeframe for ingesting usage (I am working with BigQuery) is the past day, if the database is not frequently used, all the Queries tab shows is the ingestion queries, which is not very useful. I was wondering if there is a way to modify the default timeframe (instead of hard-coding `start_time` and `end_time`) to something like 5 days. I'm also wondering if there is a way to ignore the queries run by the ingestion job.
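    For reference, those two keys can be set directly in the usage recipe; a sketch that assumes the bigquery-usage source accepts ISO-8601 timestamps for them (the project id is a placeholder):
    ```yaml
    source:
      type: bigquery-usage
      config:
        projects:
          - my-gcp-project                  # placeholder project id
        # Assumption: start_time / end_time widen the default one-day lookback,
        # here to roughly five days.
        start_time: "2021-11-04T00:00:00Z"
        end_time: "2021-11-09T00:00:00Z"
    ```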
  • orange-flag-48535 (11/09/2021, 10:27 AM)
    I'm trying to generate JSON files containing MCE events from Java code. I'm modelling the MCE structure using Java classes and using Jackson to actually generate the JSON. It would be nice to be able to re-use any classes that Datahub exposes in order to create MCE objects. Does Datahub provide any Java libraries for this? Thanks.
  • square-activity-64562 (11/09/2021, 4:53 PM)
    Question regarding adding a pipeline in DataHub. We have a pipeline which consists of AWS Kinesis, S3, Lambda, Redis channels, Python processing applications, etc. (probably a few more things). I was thinking of documenting this pipeline by adding its lineage in DataHub. I think the following would be required:
    • some way to represent Redis channels
    • some way to represent Kinesis streams
    • some way to represent generic processors (currently only Airflow seems to be shown in pipelines)
    I was thinking we could model Redis channels and Kinesis streams as datasets. Is there any limitation currently in the model to represent generic processors, like AWS Lambda or Python apps? I was thinking these could be tasks in DataHub. Has anyone tried representing processing jobs outside Airflow in DataHub? Any feedback on how that went would be helpful.
  • dazzling-appointment-34954 (11/10/2021, 10:07 AM)
    Hey everyone, I am still very new to the DataHub community, but I stumbled across the following question: we work with a couple of technologies that are not supported (yet), e.g. Talend as a pipeline tool, Databricks, or SAP R/3. Is there a generic way to ingest metadata for those into the catalog, or is the only way to build new individual connectors? Thanks in advance for some feedback 🙂
  • orange-flag-48535 (11/10/2021, 11:11 AM)
    What is the DataHub-recommended way to version metadata (let's say the SchemaMetadata aspect)? Currently I'm not able to view multiple versions in the UI, no matter what value I set for version in the MCE JSON file. But I know that multiple versions are getting stored, because I'm able to roll back using "datahub ingest rollback" and it goes back to the previous version.
  • orange-flag-48535 (11/10/2021, 11:35 AM)
    This should probably go in as a bug report, but I'm checking here once first: when providing SchemaMetadata items in an MCE file, it seems DataHub generates different graphs depending on the order of the fields. When I created a file with nested fields listed before the parent fields (as discernible from their fieldPath values), it didn't nest the inner fields within the outer fields correctly. Is this known behaviour?
  • brave-forest-5974 (11/10/2021, 3:05 PM)
    ❓ Small question on some confusing language in the BigQuery source docs. Looking at the source code, it seems that references to "schema" are what BigQuery refers to as a "dataset"; is that right? If so, I'm happy to open a PR that adds a note to this doc that schema == dataset.
  • numerous-guitar-35145 (11/10/2021, 3:09 PM)
    Hi everyone, I'm having an issue similar to Andrew S's when ingesting properties, but with a business glossary. I created a business glossary, but now I need to add some properties to each term. I tried to follow the examples in Andrew's thread, but it didn't work for me. Can you help me? This is the error: ValidationError: 1 validation error for BusinessGlossaryConfig nodes -> 0 -> terms -> 0 -> customProperties extra fields not permitted (type=value_error.extra)