# ingestion
  • a

    adventurous-scooter-52064

    08/11/2021, 6:02 AM
    Question: I realized ingestion doesn’t overwrite edits made in the UI. For instance, someone edited a table description, but we want to overwrite the table description forcefully through our daily ingestion (we manage everything through DataHub transformers). What do I have to add to my transformer to reset a specific property that was changed through the UI, for instance the table description?
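    UI edits are typically stored in separate “editable” aspects, which is why a plain re-ingestion does not overwrite them. Below is a minimal, hedged sketch of a custom transformer that force-sets the description carried in datasetProperties; it assumes the Transformer interface in datahub.ingestion.api.transform, and the class/config names (ForceTableDescription, description_map) are hypothetical, not part of DataHub. A custom transformer like this is referenced from the recipe by its fully qualified module path.

    # Hypothetical transformer: force-overwrite dataset descriptions during ingestion.
    # Assumes the Transformer/RecordEnvelope interfaces of the datahub package; adjust to your version.
    from typing import Iterable

    from datahub.ingestion.api.common import PipelineContext, RecordEnvelope
    from datahub.ingestion.api.transform import Transformer
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    class ForceTableDescription(Transformer):
        """Overwrites datasetProperties.description for selected dataset URNs."""

        def __init__(self, description_map: dict):
            # Map of dataset URN -> description to force.
            self.description_map = description_map

        @classmethod
        def create(cls, config_dict: dict, ctx: PipelineContext) -> "ForceTableDescription":
            return cls(config_dict.get("description_map", {}))

        def transform(self, record_envelopes: Iterable[RecordEnvelope]) -> Iterable[RecordEnvelope]:
            for envelope in record_envelopes:
                mce = envelope.record
                if isinstance(mce, MetadataChangeEventClass) and isinstance(
                    mce.proposedSnapshot, DatasetSnapshotClass
                ):
                    snapshot = mce.proposedSnapshot
                    forced = self.description_map.get(snapshot.urn)
                    if forced is not None:
                        for aspect in snapshot.aspects:
                            if isinstance(aspect, DatasetPropertiesClass):
                                # Overwrite whatever description the source produced.
                                aspect.description = forced
                yield envelope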
  • b

    bumpy-activity-74405

    08/11/2021, 6:07 AM
    Hi, is there an easy way to “reset” my DataHub instance? By reset I mean remove all the ingested metadata. I’ve (probably foolishly) deleted all the indices on ES and truncated the tables in MySQL, and now I get errors when I try to browse datasets/dashboards/charts/pipelines. I’m running v0.8.6.
  • b

    bland-easter-53873

    08/11/2021, 11:17 AM
    Hi, is it possible to customize the Great Expectations profiling to run some custom validations?
  • c

    careful-insurance-60247

    08/11/2021, 1:11 PM
    Trying to use the profiling feature for mssql but running into an error.
    [2021-08-10 19:33:35,758] INFO     {great_expectations.data_context.data_context:2932} -        Profiled 9 columns using 0 rows from None (2.204 sec)
    [2021-08-10 19:33:35,758] INFO     {great_expectations.data_context.data_context:2944} -
    Profiled the data asset, with 0 total rows and 9 columns in 2.20 seconds.
    Generated, evaluated, and stored 51 Expectations during profiling. Please review results using data-docs.
    [2021-08-10 19:33:35,759] INFO     {root:1140} - Sending ROLLBACK TRAN
    Traceback (most recent call last):
      File "/usr/local/bin/datahub", line 8, in <module>
        sys.exit(datahub())
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/cli/ingest_cli.py", line 58, in run
        pipeline.run()
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 108, in run
        for wu in self.source.get_workunits():
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/sql/sql_common.py", line 319, in get_workunits
        inspector, profiler, schema, sql_config
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/sql/sql_common.py", line 515, in loop_profiler
        **self.prepare_profiler_args(schema=schema, table=table),
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 118, in generate_profile
        profile = self._convert_evrs_to_profile(evrs, pretty_name=pretty_name)
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 166, in _convert_evrs_to_profile
        profile, col, evrs_for_col, pretty_name=pretty_name
      File "/home/ec2-user/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 222, in _handle_convert_column_evrs
        column_profile.nullProportion = res["unexpected_percent"] / 100
    TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
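    The profiler reports “0 total rows”, so Great Expectations returns unexpected_percent as None and the division in _handle_convert_column_evrs fails. A hedged sketch of the kind of guard that avoids the crash (illustrative only, not the actual upstream patch):

    # Illustrative guard around the failing line in ge_data_profiler.py:
    # skip the null-proportion computation when the profile saw zero rows.
    unexpected_percent = res.get("unexpected_percent")
    if unexpected_percent is not None:
        column_profile.nullProportion = unexpected_percent / 100
    # else: leave nullProportion unset for empty tables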
  • w

    wooden-sunset-90925

    08/11/2021, 1:24 PM
    <!here> I have tried to ingest the metadata from a MySQL table into DataHub and it was ingested successfully, but I’m doubtful because I am not able to see any data there. I can only see the columns and their data types (basically the header of the SQL table, but not the rows). Also, if I search for a value that exists in a row of a particular column, I get no results in the UI. I’m wondering whether DataHub supports searching the data itself, or whether it only ingests the table, columns, column data types, and descriptions. Can anyone help me?
  • c

    colossal-account-65055

    08/11/2021, 3:38 PM
    Hello DataHub team! Can anyone help me to troubleshoot an error I'm getting when running metadata ingestion unit tests? (I'd like to set up our CI/CD to run the tests automatically.) Details in thread 🧵
  • c

    careful-insurance-60247

    08/11/2021, 4:29 PM
    <!here> Has anyone used Apache Nifi with Datahub yet? Looking for ways to pull in some of our data pipelines to track lineage.
  • b

    bumpy-activity-74405

    08/12/2021, 6:30 AM
    Hi, I’m trying to ingest dashboard data from Looker and it seems like GMS chokes on UTF-8 characters in dashboard titles:
    Sink (datahub-rest) report:
    {'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'message': "javax.persistence.PersistenceException: Error[(conn=2227952) Incorrect string value: '\\xC4\\x97l pi...' for "
                                       "column 'metadata' at row 1]",
    Is this something that can be easily fixed, or are UTF-8 characters just not supported?
  • b

    better-orange-49102

    08/12/2021, 12:41 PM
    I'm planning to create remote instances of the DataHub ingestion containers and have them send the data to GMS on another network; is there a way to authenticate?
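    Depending on the DataHub version, the REST sink/emitter can pass a token or extra HTTP headers that a gateway in front of GMS could check. A hedged sketch with the Python emitter; the token argument assumes a build with metadata-service authentication enabled, and the header name is a made-up example for a custom proxy:

    # Hedged sketch: emitting to a remote GMS with credentials attached.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="https://datahub-gms.example.internal:8080",  # placeholder address
        token="<personal-access-token>",               # assumption: GMS auth is enabled
        extra_headers={"X-Api-Key": "<gateway-key>"},  # hypothetical header for a proxy in front of GMS
    )
    emitter.test_connection()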
  • n

    nice-branch-87277

    08/12/2021, 2:25 PM
    I managed to run python3 -m datahub docker quickstart and got everything running OK. I then took the Docker containers down (I ran datahub docker nuke, docker/nuke.sh, and docker system prune --all) and tried to rerun python3 -m datahub docker quickstart. I’m getting the following error:
  • h

    handsome-football-66174

    08/12/2021, 9:06 PM
    Trying to use Glue to ingest metadata. What configuration options are available for Glue (in the .yml recipe file) to connect?
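    A hedged sketch of a Glue recipe, written as the dict form that Pipeline.create accepts (the same source/sink keys go into the .yml file); the region, patterns, and sink address are placeholders, and option availability depends on the installed version:

    # Hedged sketch of a Glue ingestion recipe, run programmatically.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "glue",
                "config": {
                    "aws_region": "us-east-1",             # placeholder
                    "extract_transforms": True,            # also pull Glue jobs for lineage
                    "database_pattern": {"allow": ["^covid_19$"]},
                    "table_pattern": {"allow": ["^covid.*"]},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # placeholder GMS address
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()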
  • g

    gray-autumn-29372

    08/13/2021, 3:05 AM
    Hello, I tried to ingest metadata from Hive but got the error below for one table. It complains that a SerDe was not found:
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class com.mongodb.hadoop.hive.BSONSerDe not found
  • s

    square-activity-64562

    08/13/2021, 5:28 AM
    Regarding profiling, what would be a good way to override the behavior for getting the row count of tables? The current count(*) that is fired takes time and is not practical for large tables; for those it might be much better to use metadata tables when possible.
  • s

    square-activity-64562

    08/13/2021, 7:05 AM
    @mammoth-bear-12532 Is there a list of supported platforms? I ingested a MariaDB database using the mysql connector and it worked, but it is showing as mysql in DataHub, which I would like to correct to “mariadb”. I was thinking of adding underlying_platform as an option in the mysql source. What would be the correct value here, “mariadb” or something else?
  • s

    square-activity-64562

    08/13/2021, 8:21 AM
    The default timeout of 2 seconds added for the REST emitter is too small; the smoke tests are a bit flaky due to this change.
  • p

    polite-flower-25924

    08/13/2021, 9:08 AM
    Hello all, I’ve started to ingest data from Kafka & Hive. All dataset origins are set to “PROD”. Is it possible to adjust that in the ingestion recipes, and how can I change it after ingestion? Thank you.
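    Most sources accept an env option that sets the origin/fabric at ingestion time. Note that the environment is part of the dataset URN, so changing it after the fact effectively creates new entities; the already-ingested PROD ones would have to be removed separately. A hedged sketch for Hive, using the dict form of the recipe (host and server are placeholders):

    # Hedged sketch: set the dataset origin via the "env" option on the source.
    from datahub.ingestion.run.pipeline import Pipeline

    Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",  # placeholder
                    "env": "DEV",                      # instead of the default "PROD"
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    ).run()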
  • w

    witty-butcher-82399

    08/13/2021, 11:05 AM
    I see there are already a couple of transformers to set dataset ownership (https://datahubproject.io/docs/metadata-ingestion/transformers#change-owners). However, none of them allows specifying the role (ownership type) of the ownership. Any plans for that?
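    For reference, the existing transformer block, sketched in the dict form of a recipe; the owner URN is a placeholder, and the ownership_type key is an assumption about a newer option that may not exist in the installed version:

    # Hedged sketch of the simple_add_dataset_ownership transformer section of a recipe.
    transformers = [
        {
            "type": "simple_add_dataset_ownership",
            "config": {
                "owner_urns": ["urn:li:corpuser:analytics_team"],  # placeholder owner
                # Assumption: newer releases accept an ownership type/role here.
                "ownership_type": "DATAOWNER",
            },
        }
    ]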
  • c

    curved-jordan-15657

    08/13/2021, 3:22 PM
    Hello! We’ve deployed DataHub using k8s and AWS services, namely RDS, ES, and MSK. I’ve configured recipe.yml and used the datahub-kafka sink. Even though the bootstrap and schema registry URLs are correct (I tried to connect, produce, and consume with the Kafka CLI and it’s OK), I get an error like:
    KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"} and info {'error': KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}, 'msg': <cimpl.Message object at 0x10ea17ac0>}
    I think somehow we cannot connect to the Kafka broker on the MSK side from DataHub. Can you please tell me which k8s pods should be running after deployment?
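    A _MSG_TIMED_OUT from the producer usually means the broker was never reached (network path, security group, or listener/security-protocol mismatch), so the first thing to verify is that the ingestion pod can reach the MSK listener port. A hedged sketch of the datahub-kafka sink section; the endpoints are placeholders and the producer_config entries are standard confluent-kafka settings whose need depends on how MSK auth is configured:

    # Hedged sketch of a datahub-kafka sink pointed at MSK.
    sink = {
        "type": "datahub-kafka",
        "config": {
            "connection": {
                "bootstrap": "b-1.msk.example.amazonaws.com:9094",      # placeholder broker
                "schema_registry_url": "http://schema-registry:8081",   # placeholder registry
                "producer_config": {
                    "security.protocol": "SSL",     # assumption: MSK TLS listener
                    "message.timeout.ms": 300000,   # longer timeout while debugging connectivity
                },
            }
        },
    }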
  • c

    cool-iron-6335

    08/15/2021, 1:56 PM
    Hi, I have a problem when I ingest metadata: permissions for reading some tables or views. Sometimes there are too many tables to deny one by one, and if I don’t deny them, the ingestion aborts right away even though a lot of metadata hasn’t been ingested yet. Maybe we could ignore the problem with exception handling in Python and simply continue the process when we hit a permission error.
  • p

    polite-flower-25924

    08/16/2021, 6:53 AM
    Hey team, I’m able to ingest metadata from several data sources (Kafka, Hive, etc.). They are ingested through a pull-based approach, and I define the ingestion pod as a k8s Job. If I want to do this periodically, I can easily convert the Job to a CronJob and set a proper schedule (e.g. every day). I just wonder if we can use a push-based approach instead of pull-based? If so, how can we do that with the current architecture? I think this question is answered here. As far as I understand, there are no out-of-the-box solutions except Airflow, so if we want to ingest metadata from Hive, Kafka, Superset, or Looker we need to pull the metadata periodically? Please correct me if I’m wrong.
  • h

    handsome-football-66174

    08/16/2021, 8:52 PM
    Trying to use Glue to ingest metadata. It is able to connect and ingest, but when I specify the following:
    database_pattern:
      allow:
        - "covid_19"
    table_pattern:
      allow:
        - "covid.*"
    it is still pulling in tables whose names do not match covid.*
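    Two hedged guesses about why the filter is not applied: the pattern blocks must sit under the source’s config key with correct YAML indentation, and the allow entries are regular expressions that, depending on the source and version, may be matched against the fully qualified database.table name, so anchoring them helps. A sketch of just the pattern section, in the dict form of the recipe:

    # Hedged sketch: anchored allow patterns for the Glue source.
    # The second table entry covers the case where the pattern sees "database.table".
    glue_config = {
        "aws_region": "us-east-1",  # placeholder
        "database_pattern": {"allow": ["^covid_19$"]},
        "table_pattern": {"allow": ["^covid.*", "^covid_19\\..*"]},
    }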
  • b

    bumpy-activity-74405

    08/17/2021, 11:02 AM
    Hey, not sure if this is the right channel, but here it goes: I’ve ingested data from Hive, LookML, and Looker using the CLI tool. I’ve also prepared and ingested some custom com.linkedin.dataset.UpstreamLineage aspects for the datasets via the REST API. However, I see that some pages do not load when trying to batchLoad (I think) the upstream/downstream dependency datasets. The UI looks like this:
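    Pages that fail to batch-load lineage often point at upstream URNs that do not exactly match any ingested dataset (platform, name casing, or env differ), so it is worth diffing the URNs in the custom aspect against the ones DataHub created. For reference, a hedged sketch of emitting an UpstreamLineage aspect with the Python REST emitter, where the URNs are built explicitly (names and server are placeholders):

    # Hedged sketch: emitting a custom UpstreamLineage aspect over the REST API.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetLineageTypeClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    lineage = UpstreamLineageClass(
        upstreams=[
            UpstreamClass(
                dataset=make_dataset_urn("hive", "db.upstream_table", "PROD"),  # placeholder
                type=DatasetLineageTypeClass.TRANSFORMED,
                auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
            )
        ]
    )
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn=make_dataset_urn("looker", "db.downstream_table", "PROD"),  # placeholder
            aspects=[lineage],
        )
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)  # placeholder GMS address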
  • m

    magnificent-camera-71872

    08/18/2021, 6:31 AM
    Hi... I'm trying to ingest metadata from Redshift into DataHub. It appears that the ingestion works OK for regular tables, but tables defined in external schemas (and accessed using Redshift Spectrum) are simply skipped, and not even mentioned in the logs. Is this a known restriction?
  • m

    modern-nail-74015

    08/18/2021, 7:06 AM
    Will it inspect the MySQL databases called ruicore.app and ruicore.parsed_app?
  • w

    witty-butcher-82399

    08/18/2021, 8:30 AM
    Hi. I’m doing a simple ingestion of a couple of kafka topics as datasets, plus a dataProcess in between consuming one and producing the other. While there are no errors during the ingestion, the UI fails as shown in the second screenshot. Is that a bug of sorts, or is there something wrong in my MCE JSON file (see thread)? Thanks!
  • m

    modern-nail-74015

    08/18/2021, 8:59 AM
    Can I ingest multiple Postgres databases in a single yaml file?
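    A recipe has a single source block and the postgres source connects to one database per run, so a common workaround is one recipe (or one programmatic pipeline) per database. A hedged sketch looping over databases in Python; the connection details are placeholders:

    # Hedged sketch: one ingestion pipeline per Postgres database.
    from datahub.ingestion.run.pipeline import Pipeline

    for database in ["sales", "marketing"]:  # placeholder database names
        Pipeline.create(
            {
                "source": {
                    "type": "postgres",
                    "config": {
                        "host_port": "postgres:5432",  # placeholder
                        "database": database,
                        "username": "datahub",
                        "password": "***",
                    },
                },
                "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
            }
        ).run()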
  • w

    witty-airline-46094

    08/18/2021, 12:36 PM
    Hey folks! We are looking into adopting DataHub as the backend for our data catalog. Our data pipelines rely heavily on Kafka with Schema Registry as the transport layer. DataHub displays auto-ingested topics whose schemas follow the TopicNameStrategy (basically the schema subject name is generated from the topic name) amazingly well; however, it lacks support for the other two strategies (RecordNameStrategy, TopicRecordNameStrategy). Are there any plans to support these formats in the future, i.e. is anybody working on this or not (we might be able to help if the answer is no)? Thanks for the awesome work; so far we like the product a lot!
  • w

    wonderful-quill-11255

    08/18/2021, 2:16 PM
    Hello. I've got a question about the RBAC feature on the roadmap. Will this include a solution for preventing different teams from overwriting each other's metadata, which might otherwise happen in a highly federated metadata production landscape?
  • c

    curved-jordan-15657

    08/18/2021, 3:18 PM
    Hello! I have a question about dataset updates. If I delete a table in a source, how do I get it deleted, or at least not visible, on the UI side? I know about the “status: removed” transformer, but with a scheduled ingestion in Airflow, is there a way to apply every change automatically without manually updating the status or something else? I mean like committing code.
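    One way to hide a dropped table without a transformer is to write a Status aspect with removed set to true for that URN, for example from the same Airflow job that detects the drop. A hedged sketch with the Python emitter, assuming a client version that supports MetadataChangeProposalWrapper; the URN and server are placeholders:

    # Hedged sketch: soft-delete a dataset by emitting status.removed = true.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn("hive", "db.dropped_table", "PROD"),  # placeholder
            aspectName="status",
            aspect=StatusClass(removed=True),
        )
    )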
  • c

    colossal-furniture-76714

    08/18/2021, 3:58 PM
    Has the format for nested/structured data fields changed with the latest DataHub release? I had to upgrade to the latest version to get lineage ingested by Airflow working. Now my JSON file produces a weird schema/table entry in DataHub. I previously had to prefix the outer names of arrays and structs to get the order right; is this implemented differently now? Maybe it even supports ingesting deeply nested tables from Hive directly. Thanks for the feedback.