# ingestion
  • a

    adventurous-scooter-52064

    08/11/2021, 6:02 AM
    Question: I realized ingestion doesn’t overwrite edits made in the UI. For instance, someone edited a table description, but we want to overwrite the table description forcefully through our daily ingestion (we manage everything through DataHub transformers). What do I have to add to my transformer to reset a specific property that was changed through the UI, for instance the table description?
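    UI edits are typically stored in separate “editable” aspects, which is why a plain re-ingestion does not overwrite them. Below is a minimal, hedged sketch of a custom transformer that force-sets the description carried in datasetProperties; it assumes the Transformer interface in datahub.ingestion.api.transform, and the class/config names (ForceTableDescription, description_map) are hypothetical, not part of DataHub. A custom transformer like this is referenced from the recipe by its fully qualified module path.

    # Hypothetical transformer: force-overwrite dataset descriptions during ingestion.
    # Assumes the Transformer/RecordEnvelope interfaces of the datahub package; adjust to your version.
    from typing import Iterable

    from datahub.ingestion.api.common import PipelineContext, RecordEnvelope
    from datahub.ingestion.api.transform import Transformer
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    class ForceTableDescription(Transformer):
        """Overwrites datasetProperties.description for selected dataset URNs."""

        def __init__(self, description_map: dict):
            # Map of dataset URN -> description to force.
            self.description_map = description_map

        @classmethod
        def create(cls, config_dict: dict, ctx: PipelineContext) -> "ForceTableDescription":
            return cls(config_dict.get("description_map", {}))

        def transform(self, record_envelopes: Iterable[RecordEnvelope]) -> Iterable[RecordEnvelope]:
            for envelope in record_envelopes:
                mce = envelope.record
                if isinstance(mce, MetadataChangeEventClass) and isinstance(
                    mce.proposedSnapshot, DatasetSnapshotClass
                ):
                    snapshot = mce.proposedSnapshot
                    forced = self.description_map.get(snapshot.urn)
                    if forced is not None:
                        for aspect in snapshot.aspects:
                            if isinstance(aspect, DatasetPropertiesClass):
                                # Overwrite whatever description the source produced.
                                aspect.description = forced
                yield envelope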
  • b

    bumpy-activity-74405

    08/11/2021, 6:07 AM
    Hi, is there an easy way to “reset” my DataHub instance? By reset I mean remove all the ingested metadata. I’ve (probably foolishly) deleted all the indices on ES and truncated the tables in MySQL, and now I get errors when I try to browse datasets/dashboards/charts/pipelines. I’m running v0.8.6.
  • b

    bland-easter-53873

    08/11/2021, 11:17 AM
    Hi, is it possible to customize the Great Expectations profiling to run some custom validations?
  • c

    careful-insurance-60247

    08/11/2021, 1:11 PM
    Trying to use the profiling feature for mssql but running into an error.
    [2021-08-10 19:33:35,758] INFO     {great_expectations.data_context.data_context:2932} -        Profiled 9 columns using 0 rows from None (2.204 sec)
    [2021-08-10 19:33:35,758] INFO     {great_expectations.data_context.data_context:2944} -
    Profiled the data asset, with 0 total rows and 9 columns in 2.20 seconds.
    Generated, evaluated, and stored 51 Expectations during profiling. Please review results using data-docs.
    [2021-08-10 19:33:35,759] INFO     {root:1140} - Sending ROLLBACK TRAN
    Traceback (most recent call last):
      File "/usr/local/bin/datahub", line 8, in <module>
        sys.exit(datahub())
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/cli/ingest_cli.py", line 58, in run
        pipeline.run()
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 108, in run
        for wu in self.source.get_workunits():
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/sql/sql_common.py", line 319, in get_workunits
        inspector, profiler, schema, sql_config
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/sql/sql_common.py", line 515, in loop_profiler
        **self.prepare_profiler_args(schema=schema, table=table),
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 118, in generate_profile
        profile = self._convert_evrs_to_profile(evrs, pretty_name=pretty_name)
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 166, in _convert_evrs_to_profile
        profile, col, evrs_for_col, pretty_name=pretty_name
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 222, in _handle_convert_column_evrs
        column_profile.nullProportion = res["unexpected_percent"] / 100
    TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
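    The profiler reports “0 total rows”, so Great Expectations returns unexpected_percent as None and the division in _handle_convert_column_evrs fails. A hedged sketch of the kind of guard that avoids the crash (illustrative only, not the actual upstream patch):

    # Illustrative guard around the failing line in ge_data_profiler.py:
    # skip the null-proportion computation when the profile saw zero rows.
    unexpected_percent = res.get("unexpected_percent")
    if unexpected_percent is not None:
        column_profile.nullProportion = unexpected_percent / 100
    # else: leave nullProportion unset for empty tables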
  • w

    wooden-sunset-90925

    08/11/2021, 1:24 PM
    <!here> I have tried to ingest the metadata from a MySQL table into DataHub and it was ingested successfully, but I’m doubtful because I am not able to see any data there. I can only see the columns and their data types (basically the header of the SQL table, but not the rows). Also, if I search for a value that exists in a row of a particular column, I get no results in the UI. I’m wondering whether DataHub supports searching the data itself, or whether it only ingests the table, columns, column data types, and descriptions. Can anyone help me?
  • c

    colossal-account-65055

    08/11/2021, 3:38 PM
    Hello DataHub team! Can anyone help me to troubleshoot an error I'm getting when running metadata ingestion unit tests? (I'd like to set up our CI/CD to run the tests automatically.) Details in thread 🧵
  • c

    careful-insurance-60247

    08/11/2021, 4:29 PM
    <!here> Has anyone used Apache Nifi with Datahub yet? Looking for ways to pull in some of our data pipelines to track lineage.
  • b

    bumpy-activity-74405

    08/12/2021, 6:30 AM
    Hi, I’m trying to ingest dashboard data from Looker and it seems like GMS chokes on UTF-8 characters in dashboard titles:
    Sink (datahub-rest) report:
    {'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'message': "javax.persistence.PersistenceException: Error[(conn=2227952) Incorrect string value: '\\xC4\\x97l pi...' for "
                                       "column 'metadata' at row 1]",
    Is this something that can be easily fixed, or are UTF-8 characters just not supported?
  • b

    better-orange-49102

    08/12/2021, 12:41 PM
    I'm planning to create remote instances of the DataHub ingestion containers and have them send the data to GMS on another network; is there a way to authenticate?
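    Depending on the DataHub version, the REST sink/emitter can pass a token or extra HTTP headers that a gateway in front of GMS could check. A hedged sketch with the Python emitter; the token argument assumes a build with metadata-service authentication enabled, and the header name is a made-up example for a custom proxy:

    # Hedged sketch: emitting to a remote GMS with credentials attached.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="https://datahub-gms.example.internal:8080",  # placeholder address
        token="<personal-access-token>",               # assumption: GMS auth is enabled
        extra_headers={"X-Api-Key": "<gateway-key>"},  # hypothetical header for a proxy in front of GMS
    )
    emitter.test_connection()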
  • n

    nice-branch-87277

    08/12/2021, 2:25 PM
    I managed to run python3 -m datahub docker quickstart and got everything running OK. I then took the Docker containers down (I ran datahub docker nuke, docker/nuke.sh, and docker system prune --all) and tried to rerun python3 -m datahub docker quickstart. I’m getting the following error:
  • h

    handsome-football-66174

    08/12/2021, 9:06 PM
    Trying to use Glue to ingest metadata. What configuration options are available for Glue (in the .yml recipe file) to connect?
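    A hedged sketch of a Glue recipe, written as the dict form that Pipeline.create accepts (the same source/sink keys go into the .yml file); the region, patterns, and sink address are placeholders, and option availability depends on the installed version:

    # Hedged sketch of a Glue ingestion recipe, run programmatically.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "glue",
                "config": {
                    "aws_region": "us-east-1",             # placeholder
                    "extract_transforms": True,            # also pull Glue jobs for lineage
                    "database_pattern": {"allow": ["^covid_19$"]},
                    "table_pattern": {"allow": ["^covid.*"]},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # placeholder GMS address
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()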
  • g

    gray-autumn-29372

    08/13/2021, 3:05 AM
    Hello, I tried to ingest metadata from Hive but got the error below for one table. It complains that a SerDe was not found:
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class com.mongodb.hadoop.hive.BSONSerDe not found
  • s

    square-activity-64562

    08/13/2021, 5:28 AM
    Regarding profiling, what would be a good way to override the behavior for getting the row count of tables? The current count(*) that is fired takes time and is not practical for large tables; for those it might be much better to use metadata tables when possible.
  • s

    square-activity-64562

    08/13/2021, 7:05 AM
    @mammoth-bear-12532 Is there a list of supported platforms? I ingested a MariaDB database using the mysql connector and it worked, but it is showing as mysql in DataHub, which I would like to correct to “mariadb”. I was thinking of adding underlying_platform as an option in the mysql source. What would be the correct value here, “mariadb” or something else?
  • s

    square-activity-64562

    08/13/2021, 8:21 AM
    The default timeout of 2 seconds added for the REST emitter is too small; the smoke tests are a bit flaky due to this change.
  • p

    polite-flower-25924

    08/13/2021, 9:08 AM
    Hello all, I’ve started to ingest data from Kafka & Hive. All dataset origins are set to “PROD”. Is it possible to adjust that in the ingestion recipes, and how can I change it after ingestion? Thank you.
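    Most sources accept an env option that sets the origin/fabric at ingestion time. Note that the environment is part of the dataset URN, so changing it after the fact effectively creates new entities; the already-ingested PROD ones would have to be removed separately. A hedged sketch for Hive, using the dict form of the recipe (host and server are placeholders):

    # Hedged sketch: set the dataset origin via the "env" option on the source.
    from datahub.ingestion.run.pipeline import Pipeline

    Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",  # placeholder
                    "env": "DEV",                      # instead of the default "PROD"
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    ).run()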
  • w

    witty-butcher-82399

    08/13/2021, 11:05 AM
    I see there are already a couple of transformers to set dataset ownership (https://datahubproject.io/docs/metadata-ingestion/transformers#change-owners). However, none of them allows specifying the role (ownership type) of the ownership. Any plans for that?
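    For reference, the existing transformer block, sketched in the dict form of a recipe; the owner URN is a placeholder, and the ownership_type key is an assumption about a newer option that may not exist in the installed version:

    # Hedged sketch of the simple_add_dataset_ownership transformer section of a recipe.
    transformers = [
        {
            "type": "simple_add_dataset_ownership",
            "config": {
                "owner_urns": ["urn:li:corpuser:analytics_team"],  # placeholder owner
                # Assumption: newer releases accept an ownership type/role here.
                "ownership_type": "DATAOWNER",
            },
        }
    ]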
  • c

    curved-jordan-15657

    08/13/2021, 3:22 PM
    Hello! We’ve deployed DataHub using k8s and AWS services, namely RDS, ES, and MSK. I’ve configured recipe.yml and used the datahub-kafka sink. Even though the bootstrap and schema registry URLs are correct (I tried to connect, produce, and consume with the Kafka CLI and it’s OK), I get an error like:
    KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"} and info {'error': KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}, 'msg': <cimpl.Message object at 0x10ea17ac0>}
    I think somehow we cannot connect to the Kafka broker on the MSK side from DataHub. Can you please tell me which k8s pods should be running after deployment?
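    A _MSG_TIMED_OUT from the producer usually means the broker was never reached (network path, security group, or listener/security-protocol mismatch), so the first thing to verify is that the ingestion pod can reach the MSK listener port. A hedged sketch of the datahub-kafka sink section; the endpoints are placeholders and the producer_config entries are standard confluent-kafka settings whose need depends on how MSK auth is configured:

    # Hedged sketch of a datahub-kafka sink pointed at MSK.
    sink = {
        "type": "datahub-kafka",
        "config": {
            "connection": {
                "bootstrap": "b-1.msk.example.amazonaws.com:9094",      # placeholder broker
                "schema_registry_url": "http://schema-registry:8081",   # placeholder registry
                "producer_config": {
                    "security.protocol": "SSL",     # assumption: MSK TLS listener
                    "message.timeout.ms": 300000,   # longer timeout while debugging connectivity
                },
            }
        },
    }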
  • c

    cool-iron-6335

    08/15/2021, 1:56 PM
    Hi, I have a problem when I ingest metadata: permissions for reading some tables or views. Sometimes there are too many tables to deny one by one, and if I don’t deny them, the ingestion aborts right away even though a lot of metadata hasn’t been ingested yet. Maybe we could ignore the problem with exception handling in Python and simply continue the process when we hit a permission error.
  • p

    polite-flower-25924

    08/16/2021, 6:53 AM
    Hey team, I’m able to ingest metadata from several data sources (Kafka, Hive, etc.). They are ingested through a pull-based approach, and I define the ingestion pod as a k8s Job. If I want to do this periodically, I can easily convert the Job to a CronJob and set a proper schedule (e.g. every day). I just wonder if we can use a push-based approach instead of pull-based? If so, how can we do that with the current architecture? I think this question is answered here. As far as I understand, there are no out-of-the-box solutions except Airflow, so if we want to ingest metadata from Hive, Kafka, Superset, or Looker we need to pull the metadata periodically? Please correct me if I’m wrong.
  • h

    handsome-football-66174

    08/16/2021, 8:52 PM
    Trying to use Glue to ingest metadata. It is able to connect and ingest, but when I specify the following:
    database_pattern:
      allow:
        - "covid_19"
    table_pattern:
      allow:
        - "covid.*"
    it is still pulling in tables whose names do not match covid.*
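    Two hedged guesses about why the filter is not applied: the pattern blocks must sit under the source’s config key with correct YAML indentation, and the allow entries are regular expressions that, depending on the source and version, may be matched against the fully qualified database.table name, so anchoring them helps. A sketch of just the pattern section, in the dict form of the recipe:

    # Hedged sketch: anchored allow patterns for the Glue source.
    # The second table entry covers the case where the pattern sees "database.table".
    glue_config = {
        "aws_region": "us-east-1",  # placeholder
        "database_pattern": {"allow": ["^covid_19$"]},
        "table_pattern": {"allow": ["^covid.*", "^covid_19\\..*"]},
    }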
  • b

    bumpy-activity-74405

    08/17/2021, 11:02 AM
    Hey, not sure if this is the right channel, but here it goes: I’ve ingested data from Hive, LookML, and Looker using the CLI tool. I’ve also prepared and ingested some custom com.linkedin.dataset.UpstreamLineage aspects for the datasets via the REST API. However, I see that some pages do not load when trying to batchLoad (I think) the upstream/downstream dependency datasets. The UI looks like this:
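    Pages that fail to batch-load lineage often point at upstream URNs that do not exactly match any ingested dataset (platform, name casing, or env differ), so it is worth diffing the URNs in the custom aspect against the ones DataHub created. For reference, a hedged sketch of emitting an UpstreamLineage aspect with the Python REST emitter, where the URNs are built explicitly (names and server are placeholders):

    # Hedged sketch: emitting a custom UpstreamLineage aspect over the REST API.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetLineageTypeClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    lineage = UpstreamLineageClass(
        upstreams=[
            UpstreamClass(
                dataset=make_dataset_urn("hive", "db.upstream_table", "PROD"),  # placeholder
                type=DatasetLineageTypeClass.TRANSFORMED,
                auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
            )
        ]
    )
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn=make_dataset_urn("looker", "db.downstream_table", "PROD"),  # placeholder
            aspects=[lineage],
        )
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)  # placeholder GMS address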
  • m

    magnificent-camera-71872

    08/18/2021, 6:31 AM
    Hi... I'm trying to ingest metadata from Redshift into DataHub. It appears that the ingestion works OK for regular tables, but tables defined in external schemas (and accessed using Redshift Spectrum) are simply skipped, and not even mentioned in the logs. Is this a known restriction?
  • m

    modern-nail-74015

    08/18/2021, 7:06 AM
    Will it inspect the MySQL databases called ruicore.app and ruicore.parsed_app?
  • w

    witty-butcher-82399

    08/18/2021, 8:30 AM
    Hi. I’m doing a simple ingestion of a couple of kafka topics as datasets, plus a dataProcess in between consuming one and producing the other. While there are no errors during the ingestion, the UI fails as shown in the second screenshot. Is that a bug of sorts, or is there something wrong in my MCE JSON file (see thread)? Thanks!
  • m

    modern-nail-74015

    08/18/2021, 8:59 AM
    Can I ingest multiple Postgres databases in a single yaml file?
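    A recipe has a single source block and the postgres source connects to one database per run, so a common workaround is one recipe (or one programmatic pipeline) per database. A hedged sketch looping over databases in Python; the connection details are placeholders:

    # Hedged sketch: one ingestion pipeline per Postgres database.
    from datahub.ingestion.run.pipeline import Pipeline

    for database in ["sales", "marketing"]:  # placeholder database names
        Pipeline.create(
            {
                "source": {
                    "type": "postgres",
                    "config": {
                        "host_port": "postgres:5432",  # placeholder
                        "database": database,
                        "username": "datahub",
                        "password": "***",
                    },
                },
                "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
            }
        ).run()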
  • w

    witty-airline-46094

    08/18/2021, 12:36 PM
    Hey folks! We are looking into adopting DataHub as the backend for our data catalog. Our data pipelines rely heavily on Kafka with Schema Registry as the transport layer. DataHub displays auto-ingested topics whose schemas follow the TopicNameStrategy (basically the schema subject name is generated from the topic name) amazingly well; however, it lacks support for the other two strategies (RecordNameStrategy, TopicRecordNameStrategy). Are there any plans to support these formats in the future, i.e. is anybody working on this or not (we might be able to help if the answer is no)? Thanks for the awesome work; so far we like the product a lot!
  • w

    wonderful-quill-11255

    08/18/2021, 2:16 PM
    Hello. I've got a question about the RBAC feature on the roadmap. Will this include a solution for preventing different teams from overwriting each other's metadata, which might otherwise happen in a highly federated metadata production landscape?
  • c

    curved-jordan-15657

    08/18/2021, 3:18 PM
    Hello! I have a question about dataset updates. If I delete a table in a source, how do I get it deleted, or at least not visible, on the UI side? I know about the “status: removed” transformer, but with a scheduled ingestion in Airflow, is there a way to apply every change automatically without manually updating the status or something else? I mean like committing code.
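    One way to hide a dropped table without a transformer is to write a Status aspect with removed set to true for that URN, for example from the same Airflow job that detects the drop. A hedged sketch with the Python emitter, assuming a client version that supports MetadataChangeProposalWrapper; the URN and server are placeholders:

    # Hedged sketch: soft-delete a dataset by emitting status.removed = true.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn("hive", "db.dropped_table", "PROD"),  # placeholder
            aspectName="status",
            aspect=StatusClass(removed=True),
        )
    )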
  • c

    colossal-furniture-76714

    08/18/2021, 3:58 PM
    Has the format for nested/structured data fields changed with the latest DataHub release? I had to upgrade to the latest version to get lineage ingested by Airflow working. Now my JSON file produces a weird schema/table entry in DataHub. I previously had to prefix the outer names of arrays and structs to get the order right; is this implemented differently now? Maybe it even supports ingesting deeply nested tables from Hive directly. Thanks for the feedback.