# ingestion
  • freezing-london-39671

    12/08/2022, 5:10 PM
    Hi guys! Is there a ready-made recipe for ingesting Postgres with table lineage? I have read a lot of messages in this chat, and everyone was advised to write something of their own... Perhaps there is something ready somewhere? (See the sketch below.)
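    A minimal sketch of a Postgres recipe, in case it helps as a starting point; the host, database, credentials and sink address below are placeholders. As far as I know the postgres source itself did not emit table-level lineage out of the box at this version, so lineage was typically added on top via dbt, view definitions or a file-based lineage source.
    Copy code
    source:
      type: postgres
      config:
        host_port: 'my-postgres-host:5432'   # placeholder
        database: mydb                       # placeholder
        username: datahub_reader             # placeholder
        password: '${POSTGRES_PASSWORD}'
        include_tables: true
        include_views: true
        profiling:
          enabled: false
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'      # placeholder GMS address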
  • cuddly-state-92920

    12/08/2022, 5:39 PM
    Hello everyone, I am new to DataHub. I just created a server because we are evaluating its applicability in our company. Right now I am discovering an Oracle database; to keep things quick we created a connection with only one table. The discovery works fine, but in the Stats tab some information is “unknown”, such as Min, Max and Median. Could someone tell me why, and how to solve this issue? Here are my settings:
    Copy code
    source:
      type: oracle
      config:
        host_port: '192.168.0.xxx:1521'
        username: myuser
        password: mypass
        service_name: mdm
        table_pattern:
          allow:
            - 'owner.my_table*'
        profiling:
          enabled: true
          limit: 1000
          report_dropped_profiles: false
          turn_off_expensive_profiling_metrics: false
          profile_table_level_only: false
          include_field_null_count: true
          include_field_distinct_count: true
          include_field_min_value: true
          include_field_max_value: true
          include_field_mean_value: true
          include_field_median_value: true
          include_field_stddev_value: true
          include_field_quantiles: true
          include_field_distinct_value_frequencies: false
          include_field_histogram: false
          include_field_sample_values: true
          max_number_of_fields_to_profile: 10000
          profile_if_updated_since_days: 1
          profile_table_size_limit: 5
          profile_table_row_limit: 5000000
          max_workers: 10
          query_combiner_enabled: true
          catch_exceptions: true
          partition_profiling_enabled: true
    pipeline_name: 'urn:li:dataHubIngestionSource:ef782f8f-ee0e-42c9-8673-c2e2c6266ecd'
    Regards, Amanda Lima
  • lively-dusk-19162

    12/08/2022, 6:21 PM
    Hello team, I ingested column-level lineage into DataHub using the DataHub REST emitter and got the expected output; thanks everyone for your support. Now I am trying to ingest the metadata into DataHub using Kafka, but the data is not showing up in the DataHub UI. Has anyone tried this with Kafka? Can anyone please help me out? (See the sketch below.)
    ✅ 1
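    If the goal is to route ingestion through Kafka rather than the REST endpoint, a minimal sketch of a datahub-kafka sink is below; the broker and schema-registry addresses are placeholders. If events are produced but nothing shows up in the UI, it is worth checking that the metadata-change consumers (standalone or embedded in GMS) are running and reading from the same topics and schema registry.
    Copy code
    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: 'broker:9092'                            # placeholder Kafka broker
          schema_registry_url: 'http://schema-registry:8081'  # placeholder schema registry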
  • full-chef-85630

    12/09/2022, 8:59 AM
    Hi all, does the Java SDK support querying data via GraphQL, e.g. to get dataset info?
    ✅ 1
    👀 1
  • lemon-cat-72045

    12/09/2022, 10:20 AM
    Hi all, I am using the lookml source to extract column lineage between BigQuery tables and Looker views. I found that when the dimension name in Looker differs from the field name in BigQuery, the two cannot be linked. Here are the column-lineage screenshot and the dimension definition.
    👀 1
    ✅ 1
  • alert-fall-82501

    12/09/2022, 12:56 PM
    Hi Team - does anyone know about integrating DataHub with Aiven? I would appreciate any help with this.
    👀 1
    ✅ 1
  • silly-boots-14314

    12/09/2022, 1:00 PM
    Does the Spark agent work with a Glue job? https://datahubproject.io/docs/metadata-integration/java/spark-lineage/#environments-tested-with
    ✅ 1
  • cuddly-state-92920

    12/09/2022, 1:27 PM
    Hi everyone, could someone tell me where the default DataHub directory is? I am asking because some manuals tell us to edit certain files, but they do not say where those files are located.
    👀 1
    ✅ 1
  • worried-chef-87127

    12/09/2022, 4:04 PM
    I am trying to run the Looker ingestion through the UI. Whether I execute it manually or via a schedule, I get the timeout error shown below. Is there some way to increase the timeout duration?
    Copy code
    packages/looker_sdk/rtl/api_methods.py", line 87, in _return\n'
               '    raise error.SDKError(response.value.decode(encoding=encoding))\n'
               "looker_sdk.error.SDKError: HTTPSConnectionPool(host='looker.#####.com', port=443): Read timed out. (read timeout=120)\n"
               '[2022-12-09 06:02:03,395] ERROR    {datahub.entrypoints:195} - Command failed: \n'
               "\tHTTPSConnectionPool(host='looker.#####.com', port=443): Read timed out. (read timeout=120).\n"
               '\tRun with --debug to get full stacktrace.\n'
               "\te.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/dd7bf998-80b7-433c-add2-ec77e3103e8d/recipe.yml --report-to "
               "/tmp/datahub/ingest/dd7bf998-80b7-433c-add2-ec77e3103e8d/ingestion_report.json'\n",
               "2022-12-09 06:02:03.643273 [exec_id=dd7bf998-80b7-433c-add2-ec77e3103e8d] INFO: Failed to execute 'datahub ingest'",
               '2022-12-09 06:02:03.643439 [exec_id=dd7bf998-80b7-433c-add2-ec77e3103e8d] INFO: Caught exception EXECUTING '
               'task_id=dd7bf998-80b7-433c-add2-ec77e3103e8d, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
  • cuddly-state-92920

    12/09/2022, 8:24 PM
    Is it possible to enable Automated PII Classification for an Oracle data source? I tried to add the lines below to my YAML, but I am receiving an error.
    Copy code
    classification:
      enabled: true
      classifiers:
        - type: datahub
    The error:
    Copy code
    [2022-12-09 20:23:17,531] ERROR {datahub.entrypoints:183} - Failed to configure source (oracle): 1 validation error for OracleConfig
    classification
      extra fields not permitted (type=value_error.extra)
    2022-12-09 20:23:18.573273 [exec_id=4c3cb13f-d814-462e-a88e-e25ecbb55ac4] INFO: Failed to execute 'datahub ingest'
    2022-12-09 20:23:18.573557 [exec_id=4c3cb13f-d814-462e-a88e-e25ecbb55ac4] INFO: Caught exception EXECUTING
  • gifted-knife-16120

    12/10/2022, 8:37 PM
    Hi guys, I tried to ingest an S3 data lake, and here is my path spec. Is anything wrong here? I got an error. include: 's3://sit-date-data/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.json' (See the sketch below.)
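    For reference, a sketch of how a path spec usually sits inside an s3 source recipe; the region is a placeholder and the include string is copied from the question above. One thing to double-check is that the quotes around the include value are plain ASCII quotes, since the message above shows curly quotes, which would end up inside the path string.
    Copy code
    source:
      type: s3
      config:
        path_specs:
          - include: 's3://sit-date-data/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.json'
        aws_config:
          aws_region: us-east-1   # placeholder region
        env: PROD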
  • important-night-50346

    12/11/2022, 5:25 PM
    Hi. I'm trying to apply a Domain to entities during Redshift ingestion. It looks like this allow pattern only applies correctly to the tables, but not to the schema itself. Could you please advise which pattern I should use to apply a Domain to a schema? See the relevant part of the recipe below:
    Copy code
    "domain": {
        "urn:li:domain:test_domain": {
            "allow": [
                "test_database.test_schema.*",
            ]
        }
    }
    I tried to achieve the same with a transformer, and it seems that it also does not work at the container level. Could you please clarify this? My intention was to set the GlobalTags, Ownership and Domain aspects on both containers and datasets… (See the sketch below.)
    👀 1
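    For the dataset side, a sketch of a pattern-based domain transformer is below; the regex and domain URN are illustrative. As far as I understand, both the source-level domain config and the dataset transformers only write the domain aspect on datasets, which would explain why the schema container is not picked up.
    Copy code
    transformers:
      - type: pattern_add_dataset_domain
        config:
          domain_pattern:
            rules:
              '.*test_database\.test_schema\..*': ['urn:li:domain:test_domain']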
  • colossal-sandwich-50049

    12/11/2022, 9:09 PM
    Hello, w.r.t. the Java REST emitter: is there a plan to move from Apache HttpClient 4.x to 5.x? I ask because 5.x has quite a lot of beneficial features, such as being able to configure a retry strategy: https://hc.apache.org/httpcomponents-client-5.2.x/index.html cc: @great-toddler-2251
  • microscopic-machine-90437

    12/12/2022, 11:58 AM
    Hello everyone, I'm trying to ingest Tableau metadata into DataHub, but I am getting the error below:
    Copy code
    ' \'failures\': {\'tableau-login\': ["Unable to login (check your Tableau connection and credentials): Invalid version: \'Unknown\'"]},
    Attached is the complete error stack trace. (A recipe sketch follows below.)
    exec-urn_li_dataHubExecutionRequest_1fd03016-3d27-4fde-b1d9-3f8492e498b0.log
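    For comparison, a minimal sketch of a Tableau recipe; the connect_uri, site, credentials and project filter are placeholders. Since the failure is reported under tableau-login, it may be worth confirming that connect_uri points at the Tableau server root and that it is reachable from wherever the ingestion actually runs.
    Copy code
    source:
      type: tableau
      config:
        connect_uri: 'https://tableau.mycompany.com'   # placeholder server URL
        site: my_site                                  # placeholder site name
        username: '${TABLEAU_USER}'                    # placeholder credentials
        password: '${TABLEAU_PASSWORD}'
        projects: ['default']                          # placeholder project filter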
  • few-tent-75240

    12/12/2022, 1:30 PM
    Hi everyone, is it possible to show lineage (object relationships) when ingesting Salesforce metadata? Thanks
    ✅ 1
  • cuddly-state-92920

    12/12/2022, 2:25 PM
    Hi everyone, I'm trying to install the DataHub Classifier. In the documentation the install command is:
    python3 -m pip install --upgrade datahub-classify
    But after executing this command it returns: ERROR: Could not find a version that satisfies the requirement datahub-classify (from versions: none) ERROR: No matching distribution found for datahub-classify. My DataHub version is: datahub --version acryl-datahub, version 0.9.3.2. Does anyone know what is going on in this scenario?
    ✅ 1
  • rhythmic-church-10210

    12/12/2022, 6:12 PM
    Hey everyone. Is it possible to prepopulate documentation from an underlying system? For example, Google BQ has decent documentation at the column level. Ideally we would want to have the system reuse that documentation so that Data Engineers can benefit from best practices in documentation...
    👀 1
    ✅ 1
  • proud-ice-24189

    12/12/2022, 8:28 PM
    Hi all, I used the file-based lineage upload, which was more or less successful. However, instead of associating the lineage with the existing tables in DataHub, the ingestion created new tables at the upper level (prod/mysql). I tried adding a path to the table names in my file (schema.table) and the ingestion seemed to recognize the hierarchy. Nonetheless, it still created a new table under the schema, but with the full name "schema.table" - so again, not the intended table. Does anyone have an idea what I am doing wrong here? Thanks. My version: DataHub CLI version: 0.9.3.2 --> I resolved it on my own 🙂 The solution is to use "platform_instance" in the YAML file to specify the MySQL schema!
    🎉 1
    ✅ 2
  • happy-twilight-91685

    12/13/2022, 1:30 AM
    @little-megabyte-1074 / @gray-shoe-75895 do we have a fix for the reported issue?
    thankyou 1
    ✅ 1
  • silly-butcher-31834

    12/13/2022, 4:35 AM
    Hi everyone, is there any way to specify the manifest_path, catalog_path, sources_path and test_results_path locations for ingesting dbt metadata into DataHub when the artifacts live in Google Cloud Storage (or elsewhere in the Google Cloud environment)? I ask because the DataHub instance and the dbt instance run on different machines. Thanks. (See the sketch below.)
    ✅ 1
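    A sketch of where those paths go in a dbt recipe; the file locations and target platform below are placeholders. s3:// URIs had dedicated support via an aws_connection block, but I am not sure GCS URIs did at this version, so one workaround under that assumption is to copy the artifacts onto the machine (or into the container) that runs the ingestion, e.g. with gsutil cp, before the run.
    Copy code
    source:
      type: dbt
      config:
        manifest_path: '/tmp/dbt_artifacts/manifest.json'         # placeholder local path
        catalog_path: '/tmp/dbt_artifacts/catalog.json'           # placeholder local path
        sources_path: '/tmp/dbt_artifacts/sources.json'           # placeholder local path
        test_results_path: '/tmp/dbt_artifacts/run_results.json'  # placeholder local path
        target_platform: bigquery                                 # placeholder target platform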
  • brave-lunch-64773

    12/13/2022, 4:42 AM
    What access is needed for the database user to import metadata from Oracle, MySQL, etc.? Is read-only access enough, and to which schemas?
    ✅ 1
  • late-ability-59580

    12/13/2022, 9:10 AM
    Hi everyone, I ingest dbt with Snowflake as the target_platform. While incremental dbt models (for example) are mapped perfectly to their underlying Snowflake tables, dbt sources are left separate. Any idea how to overcome this?
  • brave-pencil-21289

    12/13/2022, 10:09 AM
    Can we have a platform instance name like DataManagement(DMA)? When we use this name, the recipe does not accept the parentheses. How can we make it work, or is there any documentation on naming standards?
    ✅ 1
  • magnificent-lock-58916

    12/13/2022, 11:03 AM
    Hello! We are having trouble setting up stateful ingestion for our ClickHouse integration in DataHub. Specifically, entities that were deleted in the ClickHouse database are not deleted in DataHub after ingestion. Here is the relevant piece of the configuration file:
    Copy code
    stateful_ingestion:
                enabled: true
                remove_stale_metadata: true
    Frankly, we didn't use ignore_old_state and ignore_new_state because we have trouble understanding what these options actually do; it would be really nice if you could help us understand them too. But the main question is: how do we set up the configuration so that our ingestion is actually stateful? Currently it is not working as desired. (See the sketch below.)
    ✅ 1
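    A sketch of the pieces stateful ingestion usually needs, with placeholder names. In particular, a stable pipeline_name at the top of the recipe is what lets consecutive runs find the previous run's state so that remove_stale_metadata can soft-delete entities that disappeared. As I understand them, ignore_old_state makes a run ignore any previously saved state and ignore_new_state stops the run from committing its own state; both normally stay false.
    Copy code
    pipeline_name: clickhouse_prod_ingestion   # placeholder; must stay the same across runs
    source:
      type: clickhouse
      config:
        host_port: 'clickhouse-host:8123'      # placeholder
        username: datahub_reader               # placeholder
        password: '${CLICKHOUSE_PASSWORD}'
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-gms:8080'      # placeholder GMS address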
  • rapid-city-92351

    12/13/2022, 11:18 AM
    Hi everyone, we used DataHub 0.9.1 for our PoC, and now we want to start a pilot phase with some users. I set up a clean new DataHub instance on Kubernetes with version 0.9.3, using the same ingestion jobs that worked on 0.9.1. Sadly, I now get an error message. I looked through the GitHub issues, and this PR could be related to my error: https://github.com/datahub-project/datahub/pull/6404 - it was merged for 0.9.3, which makes me wonder why the ingestion was working on 0.9.1 in the first place. This is the error message; we are trying to ingest data from Snowflake with lineage:
    Copy code
    packages/datahub/ingestion/source/snowflake/snowflake_lineage.py", '
               'line 503, in _populate_view_downstream_lineage\n'
               '    json.loads(db_row["DOWNSTREAM_TABLE_COLUMNS"]),\n'
               '  File "/usr/local/lib/python3.10/json/__init__.py", line 339, in loads\n'
               "    raise TypeError(f'the JSON object must be str, bytes or bytearray, '\n"
               'TypeError: the JSON object must be str, bytes or bytearray, not NoneType\n'
               '[2022-12-13 11:04:07,215] ERROR    {datahub.entrypoints:195} - Command failed: \n'
               '\tthe JSON object must be str, bytes or bytearray, not NoneType.\n'
    I then moved back to 0.9.1, and now it is not working there anymore either, which does not make sense to me at all. Does someone have an idea and could help? Thanks
  • tall-father-13753

    12/13/2022, 2:34 PM
    Hi, I have a question regarding ingestion of Hive. I’ve tried to run it, but it fails, with messages like:
    Copy code
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 770, in loop_tables
        yield from self._process_table(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 812, in _process_table
        description, properties, location_urn = self.get_table_properties(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 910, in get_table_properties
        table_info: dict = inspector.get_table_comment(table, schema)  # type: ignore
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 558, in get_table_comment
        return self.dialect.get_table_comment(
      File "/usr/local/lib/python3.10/site-packages/pyhive/sqlalchemy_hive.py", line 376, in get_table_comment
        rows = self._get_table_columns(connection, table_name, schema, extended=True)
      File "/usr/local/lib/python3.10/site-packages/pyhive/sqlalchemy_hive.py", line 290, in _get_table_columns
        rows = connection.execute('DESCRIBE{} {}'.format(extended, full_table)).fetchall()
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1003, in execute
        return self._execute_text(object_, multiparams, params)
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1172, in _execute_text
        ret = self._execute_context(
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
        self._handle_dbapi_exception(
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
        util.raise_(
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
        raise exception
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
        self.dialect.do_execute(
      File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
      File "/usr/local/lib/python3.10/site-packages/pyhive/hive.py", line 479, in execute
        _check_status(response)
      File "/usr/local/lib/python3.10/site-packages/pyhive/hive.py", line 609, in _check_status
        raise OperationalError(response)
    sqlalchemy.exc.OperationalError: (pyhive.exc.OperationalError) TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while compiling statement: FAILED: SemanticException [Error 10072]: Database does not exist: `test`:28:27', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:380', 'org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:206', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:290', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:320', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:530', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:506', 'sun.reflect.GeneratedMethodAccessor66:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1729', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy36:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:280', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:531', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*org.apache.hadoop.hive.ql.parse.SemanticException:Database does not exist: `test`:34:7', 'org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer:validateDatabase:DDLSemanticAnalyzer.java:1954', 'org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer:analyzeDescribeTable:DDLSemanticAnalyzer.java:2013', 'org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer:analyzeInternal:DDLSemanticAnalyzer.java:343', 'org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer:analyze:BaseSemanticAnalyzer.java:258', 'org.apache.hadoop.hive.ql.Driver:compile:Driver.java:512', 'org.apache.hadoop.hive.ql.Driver:compileInternal:Driver.java:1317', 'org.apache.hadoop.hive.ql.Driver:compileAndRespond:Driver.java:1295', 'org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:204', '*org.apache.hadoop.hive.ql.parse.SemanticException:Database does not exist: `test`:34:0', 'org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer:validateDatabase:DDLSemanticAnalyzer.java:1951'], sqlState='42000', 
errorCode=10072, errorMessage='Error while compiling statement: FAILED: SemanticException [Error 10072]: Database does not exist: `test`'), operationHandle=None)
    [SQL: DESCRIBE FORMATTED `test`.`test_es_entity`]
    (Background on this error at: <http://sqlalche.me/e/13/e3q8>)
    After some digging, it turned out that the problem is with... the backticks used in the DESCRIBE query. In our setup we have disabled support for quoted identifiers (https://issues.apache.org/jira/browse/HIVE-6013). So, is it possible to configure the ingestion source so that it does not use backticks in queries?
    ✅ 1
  • proud-memory-42381

    12/13/2022, 3:47 PM
    Hi! Should the dbt ingestion module be able to read from a private GitHub repository, and if so, how does one make that work? Might it be necessary to add token support for reading from private repositories? Thanks in advance!
    ✅ 1
  • best-umbrella-88325

    12/13/2022, 4:11 PM
    Hey Community! We're trying to ingest metadata from IBM DB2 into DataHub. Since this source is not supported directly, we are planning to use the generic SQLAlchemy source. The docs at https://datahubproject.io/docs/generated/ingestion/sources/sqlalchemy/ say: "In order to use this, you must pip install the required dialect packages yourself." We are curious where these need to be installed: if we install them on the machine that runs the CLI, does that mean we cannot ingest via SQLAlchemy from the UI? Please let us know if we are on the right track. Any help would be appreciated. Thanks in advance. (See the sketch below.)
    ✅ 1
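    A sketch of what the generic SQLAlchemy source could look like for DB2, assuming the ibm-db-sa dialect; host, port, database and credentials are placeholders. The dialect package has to be installed in whatever Python environment actually executes the ingestion, so for UI-based runs that means the environment of the ingestion executor (the datahub-actions container in a default deployment) rather than the machine where you run the CLI.
    Copy code
    # prerequisite in the environment that runs the ingestion, e.g.:
    #   pip install 'acryl-datahub[sqlalchemy]' ibm-db-sa
    source:
      type: sqlalchemy
      config:
        platform: db2                                                  # placeholder platform name shown in DataHub
        connect_uri: 'db2+ibm_db://user:password@db2-host:50000/MYDB'  # placeholder connection string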
  • cuddly-dinner-641

    12/13/2022, 7:26 PM
    Is there a place in the metadata model to store unique constraints/indexes from source platforms? We have some Snowflake tables with a unique constraint on fields that are not part of the primary key, but I am not seeing that information show up anywhere after ingestion.
    👀 1
    ✅ 1
  • witty-motorcycle-52108

    12/13/2022, 7:59 PM
    Hi all! Is there any way to have two ingestion sources contribute to one entity? For example, using the Glue ingestion source to pull schemas and S3 lineage, but then using the Athena source for additional features like profiling or table/column lineage?
    👀 1
    ✅ 1