# ingestion
  • p

    powerful-telephone-71997

    06/28/2021, 5:27 AM
    Any way to ingest data from Tableau, Metabase and Redash and bring in the lineage? I would love to collaborate, but will need pointers to start with…thanks
  • b

    boundless-student-48844

    06/29/2021, 10:21 AM
    Hi team, I encountered the error below when ingesting from Hive. It seems to be an issue with ingesting views. Do you have any idea how to troubleshoot?
    Copy code
    Traceback (most recent call last):
      File "/home/hadoop/.pyenv/versions/3.7.2/bin/datahub", line 8, in <module>
        sys.exit(main())
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/entrypoints.py", line 93, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
        return self.main(*args, **kwargs)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1062, in main
        rv = self.invoke(ctx)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/click/core.py", line 763, in invoke
        return __callback(*args, **kwargs)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/entrypoints.py", line 81, in ingest
        pipeline.run()
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 108, in run
        for wu in self.source.get_workunits():
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/ingestion/source/sql_common.py", line 239, in get_workunits
        yield from self.loop_views(inspector, schema, sql_config)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/datahub/ingestion/source/sql_common.py", line 319, in loop_views
        view_definition = inspector.get_view_definition(view)
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 338, in get_view_definition
        self.bind, view_name, schema, info_cache=self.info_cache
      File "/home/hadoop/.pyenv/versions/3.7.2/lib/python3.7/site-packages/sqlalchemy/engine/interfaces.py", line 363, in get_view_definition
        raise NotImplementedError()
    NotImplementedError
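    A minimal sketch for reproducing this outside of DataHub, assuming a HiveServer2 reachable at localhost:10000 and the pyhive SQLAlchemy dialect installed; the NotImplementedError above comes from SQLAlchemy's default get_view_definition(), which the Hive dialect does not override.
    Copy code
    # Minimal sketch: call view reflection directly to confirm the dialect
    # raises NotImplementedError for get_view_definition().
    from sqlalchemy import create_engine, inspect

    engine = create_engine("hive://localhost:10000/default")  # assumed connection
    inspector = inspect(engine)

    for view in inspector.get_view_names():
        try:
            print(view, inspector.get_view_definition(view))
        except NotImplementedError:
            # SQLAlchemy's base implementation raises when the dialect
            # does not implement view definition reflection.
            print(view, "-> view definition reflection not implemented")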
  • b

    brief-lizard-77958

    06/29/2021, 12:15 PM
    In the Charts section under Sources, I want to add some ingested Charts. For example, I want to have Baz Chart 1 in the Sources for Baz Chart 2 (see picture). It is currently only possible to add datasets under Sources; if I try setting anything else, the ingestion breaks the application. What would I have to do to make this possible? Add another tab like "Chart sources" where only the charts would be listed separately?
  • b

    boundless-student-48844

    06/29/2021, 12:52 PM
    Hi team, I found that the hive plugin fails to ingest tables whose names start with an underscore (_), such as `crm._test`. Upon drilling down, it is because pyhive's `_get_table_columns()` doesn't escape such table names with backticks (`), as can be seen here: https://github.com/dropbox/PyHive/blob/master/pyhive/sqlalchemy_hive.py#L283 The DESCRIBE query for the above case in Hive should be
    Copy code
    describe `crm._test`;
    instead of
    Copy code
    describe crm._test;
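    A minimal sketch of the kind of quoting being suggested, as a hypothetical helper (not the actual pyhive code): wrap the fully qualified name in backticks so names starting with an underscore survive Hive's parser.
    Copy code
    # Hypothetical helper illustrating the suggested fix; not the actual pyhive code.
    def describe_statement(schema: str, table: str) -> str:
        full_name = f"{schema}.{table}"
        # Double any literal backtick, then quote the whole dotted name,
        # matching the form `crm._test` shown above.
        return "DESCRIBE `{}`".format(full_name.replace("`", "``"))

    print(describe_statement("crm", "_test"))  # DESCRIBE `crm._test`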
  • f

    future-waitress-970

    06/29/2021, 1:55 PM
    This is how the error shows up below it:
    Sink (datahub-rest) report:
    {'failures': [{'e': JSONDecodeError('Expecting value: line 1 column 1 (char 0)',)},
  • a

    astonishing-yak-92682

    07/01/2021, 4:30 AM
    Hello Team, I am facing a weird issue while overriding data in v0.8.1. I ingested one dataset with the urn urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) using the push-based architecture, and everything worked fine (proper entries in ES and MySQL). I then added one new searchable aspect to the same urn and ingested again. Expected: the new aspect should be present in both MySQL and ES. Actual: the new aspect is only in ES, not in MySQL. The same happens if we delete an aspect. Expected: the deleted aspect should be gone from both MySQL and ES. Actual: the removed aspect got deleted from ES but is still present in MySQL. I also verified the above steps using the quickstart and ingestion docker scripts. Is this expected, or is it an issue?
  • c

    crooked-librarian-97951

    07/01/2021, 1:51 PM
    Hi - from reading previous threads, I understand DataHub requires Kafka and that there are plans to support other pub-sub systems in the future as well. We're currently running POCs with various open source data discovery tools, and DataHub is definitely a great candidate. But... my company is using the Google Cloud Platform and wants to use the "standard" GCP components as much as possible. Our engineers are reluctant to choose a solution that needs Kafka, and would much prefer to work with Google Cloud Pub/Sub. Are there other people who are facing the same challenge? Are there plans to support Google Cloud Pub/Sub specifically? Any idea of what it would take to contribute Pub/Sub support to the project ourselves?
  • f

    future-waitress-970

    07/01/2021, 6:58 PM
    Hey, I am having a new issue. When I try to ingest, the ingestion starts, goes halfway, and then I get two errors, one being:
    Failed to establish a new connection: [Errno 111] Connection refused',))"})
    And
    datahub-gms exited with code 255
  • f

    faint-wolf-61232

    07/02/2021, 9:00 AM
    Hi Team, can we import metadata into DataHub from a Qlik Sense report instead of a Superset report? If so, what changes need to be made in the configuration? Thanks in advance! Regards,
  • c

    cool-iron-6335

    07/02/2021, 9:22 AM
    Copy code
    [2021-07-02 16:17:51,963] INFO     {datahub.entrypoints:75} - Using config: {'source': {'type': 'hive', 'config': {'host_port': 'localhost:10000', 'database': 'test'}}, 'sink': {'type': 'datahub-rest', 'config': {'server': '<http://localhost:8080>'}}}
    [2021-07-02 16:17:52,210] ERROR    {datahub.ingestion.run.pipeline:52} - failed to write record with workunit test.test.test1 with Expecting value: line 1 column 1 (char 0) and info {}
    
    Source (hive) report:
    {'failures': {},
     'filtered': [],
     'tables_scanned': 1,
     'views_scanned': 0,
     'warnings': {},
     'workunit_ids': ['test.test.test1'],
     'workunits_produced': 1}
    Sink (datahub-rest) report:
    {'failures': [{'e': JSONDecodeError('Expecting value: line 1 column 1 (char 0)')}], 'records_written': 0, 'warnings': []}
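    The JSONDecodeError from the datahub-rest sink typically means the server replied with a body that is not JSON (empty, or an HTML error page). A minimal sketch for checking what the sink actually receives, assuming GMS is the server configured above at http://localhost:8080 and that your GMS version exposes the /config endpoint; any GMS URL would do for inspecting the raw body.
    Copy code
    # Minimal sketch: look at the raw response body before assuming it is JSON.
    import requests

    resp = requests.get("http://localhost:8080/config")  # assumed endpoint
    print(resp.status_code)
    print(resp.text[:500])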
  • c

    colossal-furniture-76714

    07/02/2021, 2:02 PM
    Hello, I hope you all are well! What does the field auditHeader mean in an mce.json file? I have to create custom JSONs due to deeply nested data, and I am trying to align my files with the files I receive when I sink, for example, from Hive to a JSON file.
  • s

    square-activity-64562

    07/06/2021, 9:45 AM
    Trying ingestion from PostgreSQL to DataHub locally. The logs don't mask passwords during ingestion. Am I missing some configuration?
  • s

    square-activity-64562

    07/06/2021, 10:09 AM
    Is it possible to specify a custom name for a database? We have some old DBs that live in the postgres database itself, across multiple hosts with the same database name, and nobody knows them as postgres; they are known by business-specific names instead. I would like them to appear under those business-specific names. Looking at transformations, it might be possible: https://datahubproject.io/docs/metadata-ingestion/#transformations. Am I missing some option here?
  • w

    white-beach-27328

    07/06/2021, 6:43 PM
    FYI @early-lamp-41924, I'm trying to figure out how to integrate our Airflow jobs with DataHub's lineage, but I'm running into issues figuring out how to pass the Kafka SSL configuration. I'm getting errors like
    Copy code
    ssl.ca.location
      extra fields not permitted (type=value_error.extra)
    for the extra json configuration I’m trying to put together. I tried using a similar pattern with keys from the ingestion recipes I’m using to get Datasets, but I’m getting errors like
    Copy code
    schema_registry_url
      extra fields not permitted (type=value_error.extra)
    I'm kind of at a loss, since most of the links like the following aren't working: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/configuration/kafka.py#L56 Any tips on how to do this configuration?
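    One guess at the nesting, under the assumption that the lineage backend accepts the same connection block as the datahub-kafka sink config: librdkafka settings such as ssl.ca.location would then go under producer_config rather than at the top level, which pydantic rejects as "extra fields". The field names below are assumptions, not confirmed.
    Copy code
    # Assumed shape only: librdkafka options nested under producer_config,
    # mirroring the datahub-kafka sink's connection config, instead of being
    # placed at the top level (which triggers "extra fields not permitted").
    kafka_sink_config = {
        "connection": {
            "bootstrap": "broker1:9093",
            "schema_registry_url": "https://schema-registry:8081",
            "producer_config": {
                "security.protocol": "SSL",
                "ssl.ca.location": "/etc/ssl/certs/ca.pem",
                "ssl.certificate.location": "/etc/ssl/certs/client.pem",
                "ssl.key.location": "/etc/ssl/private/client.key",
            },
        }
    }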
  • b

    better-orange-49102

    07/07/2021, 9:19 AM
    Hadoop noob here, just wondering: was impyla considered as an alternative to pyhive? It can pull the metadata for both Kudu and Hive tables (assuming Hive Metastore integration is enabled), as compared to pyhive, which can only get Hive tables.
  • c

    calm-sunset-28996

    07/07/2021, 12:58 PM
    This is about more granular implementation and updating of aspects. Is this something that is on the roadmap? I remember some posts about this. From my understanding, whenever two independent projects push, for example, extra properties to an entity, the last one wins. So this is kind of a blocker for a distributed, push-based approach where each team could push some properties (and have ownership over that part of the data) without overwriting the other team's properties. (Or take the case where we as a central team push some properties and want to allow other teams to enhance them with whatever information they think is relevant.) Is there another approach we can take with regard to this? (Or is this an anti-pattern?)
  • c

    crooked-leather-44416

    07/07/2021, 3:06 PM
    This probably relates to the previous post. Is there a way to update a value for just one custom property without overriding the entire aspect (DatasetProperties)?
    Copy code
    curl --location --request POST 'http://localhost:8080/datasets?action=ingest' \
    --header 'X-RestLi-Protocol-Version: 2.0.0' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "snapshot": {
        "aspects": [     
          {
            "com.linkedin.dataset.DatasetProperties":  {
                "customProperties": {                
                    "ValidThroughDate": "2021-03-15T11:40:49Z"
                }
            }
          }     
        ],
        "urn": "urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)"
      }
    }'
    If I send this request, it will remove all existing custom properties, unless I stuff them in the same request.
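    A minimal sketch of the client-side workaround this implies, mirroring the curl call above: merge the existing customProperties into the payload before re-ingesting, since the whole DatasetProperties aspect is replaced on write. The "existing" values here are a hard-coded stand-in for whatever you currently have on the dataset.
    Copy code
    # Minimal sketch: merge existing custom properties client-side before re-ingesting.
    import json
    import requests

    GMS = "http://localhost:8080"
    URN = "urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)"

    existing = {"Owner": "analytics"}  # stand-in for the properties already on the dataset
    changes = {"ValidThroughDate": "2021-03-15T11:40:49Z"}

    payload = {
        "snapshot": {
            "aspects": [
                {"com.linkedin.dataset.DatasetProperties": {
                    "customProperties": {**existing, **changes}}}
            ],
            "urn": URN,
        }
    }

    resp = requests.post(
        f"{GMS}/datasets?action=ingest",
        headers={"X-RestLi-Protocol-Version": "2.0.0", "Content-Type": "application/json"},
        data=json.dumps(payload),
    )
    print(resp.status_code, resp.text[:200])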
  • t

    tall-monitor-59941

    07/08/2021, 9:05 AM
    Hello, I've got a couple of questions regarding data modelling and ingestion. How can I register a custom entity with a DataHub instance? I'm running an instance locally via Docker; is there a way to register a new entity via APIs? In my use case, I want to store the following attributes related to topic metadata: a. topicName b. clusterType c. dtoName d. packageName. One way to solve this is to define a custom entity/aspect, but that needs rebuilding and redeployment AFAIK. Is there a way to dynamically load new entities without redeployment? I would like to know the recommended way to solve such a use case.
  • b

    better-orange-49102

    07/08/2021, 9:21 AM
    I'm ingesting datasets from sources using ingest recipes, and am interested in making each source the root directory in the UI, as opposed to PROD. For instance, the browse path should be "/hive/my_dataset" instead of "PROD/hive/my_dataset". Is implementing a transformer to change the browse path the easiest way, or is there an even simpler solution?
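    For reference, the browse path lives in the BrowsePaths aspect, so a transformer along these lines would emit paths without the leading environment segment. A minimal sketch of the aspect itself; the transformer plumbing is omitted and would depend on your DataHub version.
    Copy code
    # Minimal sketch of the aspect such a transformer would emit.
    from datahub.metadata.schema_classes import BrowsePathsClass

    aspect = BrowsePathsClass(paths=["/hive/my_dataset"])  # instead of "/prod/hive/my_dataset"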
  • s

    square-activity-64562

    07/08/2021, 9:42 AM
    Tags added via the transformation `simple_add_dataset_tags` do not show up in the search autocomplete in the UI. They are present in the system and show up in search results and the UI, but not in autocomplete. Tags added via the UI do show up in search autocomplete.
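    For context, a minimal sketch of the kind of recipe in question, run programmatically; the transformer name and config keys follow the transformations doc, and the postgres connection details are placeholders.
    Copy code
    # Minimal sketch of a recipe using simple_add_dataset_tags; connection
    # details are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create({
        "source": {
            "type": "postgres",
            "config": {"host_port": "localhost:5432", "database": "mydb",
                       "username": "user", "password": "pass"},
        },
        "transformers": [
            {"type": "simple_add_dataset_tags",
             "config": {"tag_urns": ["urn:li:tag:NeedsDocumentation"]}},
        ],
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    })
    pipeline.run()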
  • w

    witty-butcher-82399

    07/08/2021, 10:42 AM
    Hi! Having a look at the doc on the transforms for dataset ownership, it mentions:
    If you’d like to add more complex logic for assigning ownership, you can use the more generic `add_dataset_ownership` transformer, which calls a user-provided function to determine the ownership of each dataset.
    Is there any example of this? I’m not sure how to set such a function in the yaml. Also, does the function need to be registered somewhere? Thanks!
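    A minimal sketch of what such a user-provided function might look like, based only on the doc quote above; the exact signature the transformer expects, and how the function is referenced from the YAML, are assumptions that depend on the DataHub version.
    Copy code
    # Assumed shape of a user-provided ownership function; the exact signature
    # expected by the generic add_dataset_ownership transformer may differ.
    from typing import List

    from datahub.metadata.schema_classes import OwnerClass, OwnershipTypeClass


    def custom_owners(dataset_snapshot) -> List[OwnerClass]:
        # Example rule: datasets whose URN mentions "finance" get the finance
        # team, everything else gets the platform team.
        if "finance" in dataset_snapshot.urn:
            owner = "urn:li:corpuser:finance-data-team"
        else:
            owner = "urn:li:corpuser:data-platform-team"
        return [OwnerClass(owner=owner, type=OwnershipTypeClass.DATAOWNER)]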
  • b

    better-orange-49102

    07/09/2021, 10:04 AM
    If we were to use Airflow or cron to schedule periodic ingestion of tables in databases, and there are no changes to the tables, MySQL will accumulate nearly identical rows of SchemaMetadata (except for the timestamp, which keeps updating). What's the best practice here: periodically deleting older versions of the aspects? Can we directly delete older versions of the aspects in the DB?
  • c

    cool-iron-6335

    07/12/2021, 9:46 AM
    Hi, is it possible to ingest data from an HDFS database? If it is, can you give me a recipe example? The metadata ingestion document doesn't seem to mention this.
  • r

    rich-policeman-92383

    07/12/2021, 12:58 PM
    Hi, please help me debug this problem. While ingesting metadata from MongoDB to the datahub-gms REST sink, I am getting the error below. The acryl-datahub CLI is 0.8.3 and DataHub is running on Docker Swarm with version 0.8.3. Also, is there a way to verify that datahub-gms is indeed running version 0.8.3?
  • r

    rich-policeman-92383

    07/13/2021, 3:26 PM
    Copy code
    File "datahub_v_0_8_6/metadata-ingestion/dhubv086/lib64/python3.6/site-packages/pyhive/hive.py", line 479, in execute
        _check_status(response)
    File "datahub_v_0_8_6/metadata-ingestion/dhubv086/lib64/python3.6/site-packages/pyhive/hive.py", line 609, in _check_status
        raise OperationalError(response)
    OperationalError: (pyhive.exc.OperationalError) TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class com.LeapSerde not found:17:16', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:400', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:238', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:274', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:337', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:439', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:405', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:257', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:503', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor:process:HadoopThriftAuthBridge.java:747', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*org.apache.hadoop.hive.metastore.api.MetaException:java.lang.ClassNotFoundException Class com.LeapSerde not found:28:12', 'org.apache.hadoop.hive.metastore.MetaStoreUtils:getDeserializer:MetaStoreUtils.java:406', 'org.apache.hadoop.hive.ql.metadata.Table:getDeserializerFromMetaStore:Table.java:274', 'org.apache.hadoop.hive.ql.metadata.Table:getDeserializer:Table.java:267', 'org.apache.hadoop.hive.ql.exec.DDLTask:describeTable:DDLTask.java:3184', 'org.apache.hadoop.hive.ql.exec.DDLTask:execute:DDLTask.java:380', 'org.apache.hadoop.hive.ql.exec.Task:executeTask:Task.java:214', 'org.apache.hadoop.hive.ql.exec.TaskRunner:runSequential:TaskRunner.java:99', 'org.apache.hadoop.hive.ql.Driver:launchTask:Driver.java:2054', 'org.apache.hadoop.hive.ql.Driver:execute:Driver.java:1750', 'org.apache.hadoop.hive.ql.Driver:runInternal:Driver.java:1503', 'org.apache.hadoop.hive.ql.Driver:run:Driver.java:1287', 'org.apache.hadoop.hive.ql.Driver:run:Driver.java:1282', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:236'], sqlState='08S01', errorCode=1, errorMessage='Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class com.LeapSerde not found'), operationHandle=None)
    [SQL: DESCRIBE `default.leap_flume_prod_new`]
    Hi guys, can you please help me with this error while ingesting Hive metadata? DataHub version: v0.8.6
  • f

    faint-hair-91313

    07/14/2021, 8:22 AM
    Hi guys, I've manually ingested a chart and dashboard with the REST API, but fail to find them in the GUI. They do appear in searches and lineage. Should I do something else?
  • s

    salmon-cricket-21860

    07/15/2021, 1:00 AM
    Hi, I am trying to use the Druid ingestion library but am getting this error. (Python version 3.8 / DataHub version 0.8.6)
    Copy code
    File "/home/jovyan/conda-envs/catalog/lib/python3.8/site-packages/datahub/ingestion/source/sql_common.py", line 62, in make_sqlalchemy_uri
        40   def make_sqlalchemy_uri(
        41       scheme: str,
        42       username: Optional[str],
        43       password: Optional[str],
        44       at: Optional[str],
        45       db: Optional[str],
        46       uri_opts: Optional[Dict[str, Any]] = None,
        47   ) -> str:
     (...)
        58       if uri_opts is not None:
        59           if db is None:
        60               url += "/"
        61           params = "&".join(
    --> 62               f"{key}={quote_plus(value)}" for (key, value) in uri_opts.items() if value
        63           )
    
    AttributeError: 'DruidConfig' object has no attribute 'items'
  • s

    salmon-cricket-21860

    07/15/2021, 2:50 PM
    To register Airflow DAGs in DataHub's 'Pipelines' menu like the screenshot above, what should be done? Is just using the DataHub lineage backend in Airflow enough?
  • s

    square-activity-64562

    07/15/2021, 9:59 PM
    After running ingestion I have the following message. I wrote a single table in my recipe, so this is as expected:
    Copy code
    Sink (datahub-rest) report:
    {'failures': [], 'records_written': 1, 'warnings': []}
    But when I go to the UI there is no dataset. Is there supposed to be some delay? Should I check the errors of some service? If yes, which one?
  • s

    salmon-cricket-21860

    07/16/2021, 1:00 AM
    Hi. I am trying to update all descriptions of a dataset by re-ingesting the table from Hive, but after re-ingestion the modified descriptions don't change, even with a Status(removed=True) MCE event. How can I make it use the original table's descriptions?