# ingestion
  • p

    powerful-telephone-71997

    06/28/2021, 5:27 AM
    Any way to ingest data from Tableau, Metabase and Redash and bring in the lineage? I would love to collaborate, but will need pointers to start with…thanks
  • b

    boundless-student-48844

    06/29/2021, 10:21 AM
    Hi team, I encountered the error below when ingesting from Hive. It seems to be an issue with ingesting views. Do you have any idea how to troubleshoot?
    Copy code
    Traceback (most recent call last):
      File "/home/hadoop/.pyenv/versions/3.7.2/bin/datahub", line 8, in <module>
        sys.exit(main())
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/entrypoints.py", line 93, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
        return self.main(*args, **kwargs)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1062, in main
        rv = self.invoke(ctx)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 763, in invoke
        return __callback(*args, **kwargs)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/entrypoints.py", line 81, in ingest
        pipeline.run()
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 108, in run
        for wu in self.source.get_workunits():
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/ingestion/source/sql_common.py", line 239, in get_workunits
        yield from self.loop_views(inspector, schema, sql_config)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/ingestion/source/sql_common.py", line 319, in loop_views
        view_definition = inspector.get_view_definition(view)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 338, in get_view_definition
        self.bind, view_name, schema, info_cache=self.info_cache
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/sqlalchemy/engine/interfaces.py", line 363, in get_view_definition
        raise NotImplementedError()
    NotImplementedError
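    A minimal sketch for reproducing this outside of DataHub, assuming a HiveServer2 reachable at localhost:10000 and the pyhive SQLAlchemy dialect installed; the NotImplementedError above comes from SQLAlchemy's default get_view_definition(), which the Hive dialect does not override.
    Copy code
    # Minimal sketch: call view reflection directly to confirm the dialect
    # raises NotImplementedError for get_view_definition().
    from sqlalchemy import create_engine, inspect

    engine = create_engine("hive://localhost:10000/default")  # assumed connection
    inspector = inspect(engine)

    for view in inspector.get_view_names():
        try:
            print(view, inspector.get_view_definition(view))
        except NotImplementedError:
            # SQLAlchemy's base implementation raises when the dialect
            # does not implement view definition reflection.
            print(view, "-> view definition reflection not implemented")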
  • b

    brief-lizard-77958

    06/29/2021, 12:15 PM
    In the Charts section under Sources, I want to add some ingested Charts. For example, I want to have Baz Chart 1 in the Sources for Baz Chart 2 (see picture). It is currently only possible to add datasets under Sources; if I try setting anything else, the ingestion breaks the application. What would I have to do to make this possible? Add another tab like "Chart sources" where only the charts would be listed separately?
  • b

    boundless-student-48844

    06/29/2021, 12:52 PM
    Hi team, I found that the hive plugin fails to ingest tables whose names start with an underscore (_), such as `crm._test`. Upon drilling down, it is because pyhive's `_get_table_columns()` doesn't escape such table names with backticks (`), as can be seen here: https://github.com/dropbox/PyHive/blob/master/pyhive/sqlalchemy_hive.py#L283 The DESCRIBE query for the above case in Hive should be
    Copy code
    describe `crm._test`;
    instead of
    Copy code
    describe crm._test;
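    A minimal sketch of the kind of quoting being suggested, as a hypothetical helper (not the actual pyhive code): wrap the fully qualified name in backticks so names starting with an underscore survive Hive's parser.
    Copy code
    # Hypothetical helper illustrating the suggested fix; not the actual pyhive code.
    def describe_statement(schema: str, table: str) -> str:
        full_name = f"{schema}.{table}"
        # Double any literal backtick, then quote the whole dotted name,
        # matching the form `crm._test` shown above.
        return "DESCRIBE `{}`".format(full_name.replace("`", "``"))

    print(describe_statement("crm", "_test"))  # DESCRIBE `crm._test`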
  • f

    future-waitress-970

    06/29/2021, 1:55 PM
    This is how the error shows up below it:
    Sink (datahub-rest) report:
    {'failures': [{'e': JSONDecodeError('Expecting value: line 1 column 1 (char 0)',)},
  • a

    astonishing-yak-92682

    07/01/2021, 4:30 AM
    Hello Team, I am facing a weird issue while overriding data in v0.8.1. I ingested one dataset with the urn urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) using the push-based architecture, and everything worked fine (proper entries in ES and MySQL). I then added one new searchable aspect to the same urn and ingested again. Expected: the new aspect should be present in both MySQL and ES. Actual: the new aspect is only in ES, not in MySQL. The same happens if we delete an aspect. Expected: the deleted aspect should be gone from both MySQL and ES. Actual: the removed aspect got deleted from ES but is still present in MySQL. I also verified the above steps using the quickstart and ingestion docker scripts. Is this expected, or is it an issue?
  • c

    crooked-librarian-97951

    07/01/2021, 1:51 PM
    Hi - from reading previous threads, I understand DataHub requires Kafka and that there are plans to support other pub-sub systems in the future as well. We're currently running POCs with various open source data discovery tools, and DataHub is definitely a great candidate. But... my company is using the Google Cloud Platform and wants to use the "standard" GCP components as much as possible. Our engineers are reluctant to choose a solution that needs Kafka, and would much prefer to work with Google Cloud Pub/Sub. Are there other people who are facing the same challenge? Are there plans to support Google Cloud Pub/Sub specifically? Any idea of what it would take to contribute Pub/Sub support to the project ourselves?
  • f

    future-waitress-970

    07/01/2021, 6:58 PM
    Hey, I am having a new issue. When I try to ingest, the ingestion starts, goes halfway, and then I get two errors, one being:
    Failed to establish a new connection: [Errno 111] Connection refused',))"})
    And
    datahub-gms exited with code 255
  • f

    faint-wolf-61232

    07/02/2021, 9:00 AM
    Hi Team, can we import metadata into DataHub from a Qlik Sense report instead of a Superset report? If so, what changes need to be made in the configuration? Thanks in advance! Regards,
  • c

    cool-iron-6335

    07/02/2021, 9:22 AM
    Copy code
    [2021-07-02 16:17:51,963] INFO     {datahub.entrypoints:75} - Using config: {'source': {'type': 'hive', 'config': {'host_port': 'localhost:10000', 'database': 'test'}}, 'sink': {'type': 'datahub-rest', 'config': {'server': '<http://localhost:8080>'}}}
    [2021-07-02 16:17:52,210] ERROR    {datahub.ingestion.run.pipeline:52} - failed to write record with workunit test.test.test1 with Expecting value: line 1 column 1 (char 0) and info {}
    
    Source (hive) report:
    {'failures': {},
     'filtered': [],
     'tables_scanned': 1,
     'views_scanned': 0,
     'warnings': {},
     'workunit_ids': ['test.test.test1'],
     'workunits_produced': 1}
    Sink (datahub-rest) report:
    {'failures': [{'e': JSONDecodeError('Expecting value: line 1 column 1 (char 0)')}], 'records_written': 0, 'warnings': []}
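    The JSONDecodeError from the datahub-rest sink typically means the server replied with a body that is not JSON (empty, or an HTML error page). A minimal sketch for checking what the sink actually receives, assuming GMS is the server configured above at http://localhost:8080 and that your GMS version exposes the /config endpoint; any GMS URL would do for inspecting the raw body.
    Copy code
    # Minimal sketch: look at the raw response body before assuming it is JSON.
    import requests

    resp = requests.get("http://localhost:8080/config")  # assumed endpoint
    print(resp.status_code)
    print(resp.text[:500])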
  • c

    colossal-furniture-76714

    07/02/2021, 2:02 PM
    Hello, I hope you all are well! What does the field auditHeader mean in an mce.json file? I have to create custom JSONs due to deeply nested data, and I am trying to align my files with the files I receive when I sink, for example, from Hive to a JSON file.
  • s

    square-activity-64562

    07/06/2021, 9:45 AM
    Trying ingestion from PostgreSQL to DataHub locally. The logs don't mask passwords during ingestion. Am I missing some configuration?
  • s

    square-activity-64562

    07/06/2021, 10:09 AM
    Is it possible to specify a custom name for a database? We have some old DBs that live in the postgres database itself, across multiple hosts with the same database name, and nobody knows them as postgres; they are known by business-specific names instead. I would like them to appear under those business-specific names. Looking at transformations, it might be possible: https://datahubproject.io/docs/metadata-ingestion/#transformations. Am I missing some option here?
  • w

    white-beach-27328

    07/06/2021, 6:43 PM
    FYI @early-lamp-41924, I'm trying to figure out how to integrate our Airflow jobs with DataHub's lineage, but I'm running into issues figuring out how to pass the Kafka SSL configuration. I'm getting errors like
    Copy code
    ssl.ca.location
      extra fields not permitted (type=value_error.extra)
    for the extra json configuration I’m trying to put together. I tried using a similar pattern with keys from the ingestion recipes I’m using to get Datasets, but I’m getting errors like
    Copy code
    schema_registry_url
      extra fields not permitted (type=value_error.extra)
    I'm kind of at a loss, since most of the links like the following aren't working: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/configuration/kafka.py#L56 Any tips on how to do this configuration?
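    One guess at the nesting, under the assumption that the lineage backend accepts the same connection block as the datahub-kafka sink config: librdkafka settings such as ssl.ca.location would then go under producer_config rather than at the top level, which pydantic rejects as "extra fields". The field names below are assumptions, not confirmed.
    Copy code
    # Assumed shape only: librdkafka options nested under producer_config,
    # mirroring the datahub-kafka sink's connection config, instead of being
    # placed at the top level (which triggers "extra fields not permitted").
    kafka_sink_config = {
        "connection": {
            "bootstrap": "broker1:9093",
            "schema_registry_url": "https://schema-registry:8081",
            "producer_config": {
                "security.protocol": "SSL",
                "ssl.ca.location": "/etc/ssl/certs/ca.pem",
                "ssl.certificate.location": "/etc/ssl/certs/client.pem",
                "ssl.key.location": "/etc/ssl/private/client.key",
            },
        }
    }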
  • b

    better-orange-49102

    07/07/2021, 9:19 AM
    Hadoop noob here, just wondering: was impyla considered as an alternative to pyhive? It can pull the metadata for both Kudu and Hive tables (assuming Hive Metastore integration is enabled), as compared to pyhive, which can only get Hive tables.
  • c

    calm-sunset-28996

    07/07/2021, 12:58 PM
    This is about more granular implementation and updating of aspects. Is this something that is on the roadmap? I remember some posts about this. From my understanding, whenever two independent projects push, for example, extra properties to an entity, the last one wins. So this is kind of a blocker for a distributed, push-based approach where each team could push some properties (and have ownership over that part of the data) without overwriting the other team's properties. (Or take the case where we as a central team push some properties and want to allow other teams to enhance them with whatever information they think is relevant.) Is there another approach we can take with regard to this? (Or is this an anti-pattern?)
  • c

    crooked-leather-44416

    07/07/2021, 3:06 PM
    This probably relates to the previous post. Is there a way to update a value for just one custom property without overriding the entire aspect (DatasetProperties)?
    Copy code
    curl --location --request POST 'http://localhost:8080/datasets?action=ingest' \
    --header 'X-RestLi-Protocol-Version: 2.0.0' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "snapshot": {
        "aspects": [     
          {
            "com.linkedin.dataset.DatasetProperties":  {
                "customProperties": {                
                    "ValidThroughDate": "2021-03-15T11:40:49Z"
                }
            }
          }     
        ],
        "urn": "urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)"
      }
    }'
    If I send this request, it will remove all existing custom properties, unless I stuff them in the same request.
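    A minimal sketch of the client-side workaround this implies, mirroring the curl call above: merge the existing customProperties into the payload before re-ingesting, since the whole DatasetProperties aspect is replaced on write. The "existing" values here are a hard-coded stand-in for whatever you currently have on the dataset.
    Copy code
    # Minimal sketch: merge existing custom properties client-side before re-ingesting.
    import json
    import requests

    GMS = "http://localhost:8080"
    URN = "urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)"

    existing = {"Owner": "analytics"}  # stand-in for the properties already on the dataset
    changes = {"ValidThroughDate": "2021-03-15T11:40:49Z"}

    payload = {
        "snapshot": {
            "aspects": [
                {"com.linkedin.dataset.DatasetProperties": {
                    "customProperties": {**existing, **changes}}}
            ],
            "urn": URN,
        }
    }

    resp = requests.post(
        f"{GMS}/datasets?action=ingest",
        headers={"X-RestLi-Protocol-Version": "2.0.0", "Content-Type": "application/json"},
        data=json.dumps(payload),
    )
    print(resp.status_code, resp.text[:200])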
  • t

    tall-monitor-59941

    07/08/2021, 9:05 AM
    Hello, I've got a couple of questions regarding data modelling and ingestion. How can I register a custom entity with a DataHub instance? I'm running an instance locally via Docker; is there a way to register a new entity via APIs? In my use case, I want to store the following attributes related to topic metadata: a. topicName b. clusterType c. dtoName d. packageName. One way to solve this is to define a custom entity/aspect, but that needs rebuilding and redeployment AFAIK. Is there a way to dynamically load new entities without redeployment? I would like to know the recommended way to solve such a use case.
  • b

    better-orange-49102

    07/08/2021, 9:21 AM
    I'm ingesting datasets from sources using ingest recipes, and am interested in making each source the root directory in the UI, as opposed to PROD. For instance, the browse path should be "/hive/my_dataset" instead of "PROD/hive/my_dataset". Is implementing a transformer to change the browse path the easiest way, or is there an even simpler solution?
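    For reference, the browse path lives in the BrowsePaths aspect, so a transformer along these lines would emit paths without the leading environment segment. A minimal sketch of the aspect itself; the transformer plumbing is omitted and would depend on your DataHub version.
    Copy code
    # Minimal sketch of the aspect such a transformer would emit.
    from datahub.metadata.schema_classes import BrowsePathsClass

    aspect = BrowsePathsClass(paths=["/hive/my_dataset"])  # instead of "/prod/hive/my_dataset"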
  • s

    square-activity-64562

    07/08/2021, 9:42 AM
    Tags added via the transformation `simple_add_dataset_tags` do not show up in the search autocomplete in the UI. They are present in the system and show up in search results and the UI, but not in autocomplete. Tags added via the UI do show up in search autocomplete.
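    For context, a minimal sketch of the kind of recipe in question, run programmatically; the transformer name and config keys follow the transformations doc, and the postgres connection details are placeholders.
    Copy code
    # Minimal sketch of a recipe using simple_add_dataset_tags; connection
    # details are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create({
        "source": {
            "type": "postgres",
            "config": {"host_port": "localhost:5432", "database": "mydb",
                       "username": "user", "password": "pass"},
        },
        "transformers": [
            {"type": "simple_add_dataset_tags",
             "config": {"tag_urns": ["urn:li:tag:NeedsDocumentation"]}},
        ],
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    })
    pipeline.run()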
  • w

    witty-butcher-82399

    07/08/2021, 10:42 AM
    Hi! Having a look at the doc on the transforms for dataset ownership, it mentions:
    If you’d like to add more complex logic for assigning ownership, you can use the more generic `add_dataset_ownership` transformer, which calls a user-provided function to determine the ownership of each dataset.
    Is there any example of this? I’m not sure how to set such a function in the yaml. Also, does the function need to be registered somewhere? Thanks!
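    A minimal sketch of what such a user-provided function might look like, based only on the doc quote above; the exact signature the transformer expects, and how the function is referenced from the YAML, are assumptions that depend on the DataHub version.
    Copy code
    # Assumed shape of a user-provided ownership function; the exact signature
    # expected by the generic add_dataset_ownership transformer may differ.
    from typing import List

    from datahub.metadata.schema_classes import OwnerClass, OwnershipTypeClass


    def custom_owners(dataset_snapshot) -> List[OwnerClass]:
        # Example rule: datasets whose URN mentions "finance" get the finance
        # team, everything else gets the platform team.
        if "finance" in dataset_snapshot.urn:
            owner = "urn:li:corpuser:finance-data-team"
        else:
            owner = "urn:li:corpuser:data-platform-team"
        return [OwnerClass(owner=owner, type=OwnershipTypeClass.DATAOWNER)]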
  • b

    better-orange-49102

    07/09/2021, 10:04 AM
    If we were to use Airflow or cron to schedule periodic ingestion of tables in databases, and there are no changes to the tables, MySQL will accumulate nearly identical rows of SchemaMetadata (except for the timestamp, which keeps updating). What's the best practice here: periodically deleting older versions of the aspects? Can we directly delete older versions of the aspects in the DB?
  • c

    cool-iron-6335

    07/12/2021, 9:46 AM
    Hi, is it possible to ingest data from an HDFS database? If it is, can you give me a recipe example? The metadata ingestion document doesn't seem to mention this.
  • r

    rich-policeman-92383

    07/12/2021, 12:58 PM
    Hi, please help me debug this problem. While ingesting metadata from MongoDB to the datahub-gms REST sink, I am getting the error below. The acryl-datahub CLI is 0.8.3 and DataHub is running on Docker Swarm with version 0.8.3. Also, is there a way to verify that datahub-gms is indeed running version 0.8.3?
  • r

    rich-policeman-92383

    07/13/2021, 3:26 PM
    Copy code
    File "datahub_v_0_8_6/metadata-ingestion/dhubv086/lib64/python3.6/site-packages/pyhive/hive.py", line 479, in execute
        _check_status(response)
    File "datahub_v_0_8_6/metadata-ingestion/dhubv086/lib64/python3.6/site-packages/pyhive/hive.py", line 609, in _check_status
        raise OperationalError(response)
    OperationalError: (pyhive.exc.OperationalError) TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class com.LeapSerde not found:17:16', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:400', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:238', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:274', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:337', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:439', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:405', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:257', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:503', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor:process:HadoopThriftAuthBridge.java:747', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*org.apache.hadoop.hive.metastore.api.MetaException:java.lang.ClassNotFoundException Class com.LeapSerde not found:28:12', 'org.apache.hadoop.hive.metastore.MetaStoreUtils:getDeserializer:MetaStoreUtils.java:406', 'org.apache.hadoop.hive.ql.metadata.Table:getDeserializerFromMetaStore:Table.java:274', 'org.apache.hadoop.hive.ql.metadata.Table:getDeserializer:Table.java:267', 'org.apache.hadoop.hive.ql.exec.DDLTask:describeTable:DDLTask.java:3184', 'org.apache.hadoop.hive.ql.exec.DDLTask:execute:DDLTask.java:380', 'org.apache.hadoop.hive.ql.exec.Task:executeTask:Task.java:214', 'org.apache.hadoop.hive.ql.exec.TaskRunner:runSequential:TaskRunner.java:99', 'org.apache.hadoop.hive.ql.Driver:launchTask:Driver.java:2054', 'org.apache.hadoop.hive.ql.Driver:execute:Driver.java:1750', 'org.apache.hadoop.hive.ql.Driver:runInternal:Driver.java:1503', 'org.apache.hadoop.hive.ql.Driver:run:Driver.java:1287', 'org.apache.hadoop.hive.ql.Driver:run:Driver.java:1282', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:236'], sqlState='08S01', errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class com.LeapSerde not found'), operationHandle=None)
    [SQL: DESCRIBE `default.leap_flume_prod_new`]
    Hi guys, can you please help me with this error while ingesting Hive metadata? DataHub version: v0.8.6
  • f

    faint-hair-91313

    07/14/2021, 8:22 AM
    Hi guys, I've manually ingested a chart and dashboard with the REST API, but fail to find them in the GUI. They do appear in searches and lineage. Should I do something else?
  • s

    salmon-cricket-21860

    07/15/2021, 1:00 AM
    Hi, I am trying to use the Druid ingestion library but am getting this error. (Python version 3.8 / DataHub version 0.8.6)
    Copy code
    File "/home/jovyan/conda-envs/catalog/lib/python3.8/site-packages/datahub/ingestion/source/sql_common.py", line 62, in make_sqlalchemy_uri
        40   def make_sqlalchemy_uri(
        41       scheme: str,
        42       username: Optional[str],
        43       password: Optional[str],
        44       at: Optional[str],
        45       db: Optional[str],
        46       uri_opts: Optional[Dict[str, Any]] = None,
        47   ) -> str:
     (...)
        58       if uri_opts is not None:
        59           if db is None:
        60               url += "/"
        61           params = "&".join(
    --> 62               f"{key}={quote_plus(value)}" for (key, value) in uri_opts.items() if value
        63           )
    
    AttributeError: 'DruidConfig' object has no attribute 'items'
  • s

    salmon-cricket-21860

    07/15/2021, 2:50 PM
    To register Airflow DAGs in DataHub's 'Pipelines' menu like the screenshot above, what should be done? Is just using the DataHub lineage backend in Airflow enough?
  • s

    square-activity-64562

    07/15/2021, 9:59 PM
    After running ingestion I have the following message. I wrote a single table in my recipe, so this is as expected:
    Copy code
    Sink (datahub-rest) report:
    {'failures': [], 'records_written': 1, 'warnings': []}
    But when I go to the UI there is no dataset. Is there supposed to be some delay? Should I check the errors of some service? If yes, which one?
  • s

    salmon-cricket-21860

    07/16/2021, 1:00 AM
    Hi. I am trying to update all descriptions of a dataset by re-ingesting the table from Hive, but after re-ingestion the modified descriptions don't change, even with a Status(removed=True) MCE event. How can I make it use the original table's descriptions?