# ingestion
  • prehistoric-yak-75049
    08/26/2021, 8:59 PM
    Hi, please share a sample JSON for Business Glossary and Dataset Relation. Surprisingly, just last week this requirement came up in my internal presentation, and now we have it in 0.8.11 😃
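    A minimal sketch (not from the thread) of relating a dataset to an existing glossary term via the Python emitter; the server address, dataset URN, and term URN are placeholders.
    Copy code
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetSnapshotClass,
        GlossaryTermAssociationClass,
        GlossaryTermsClass,
        MetadataChangeEventClass,
    )

    # Attach an existing glossary term to a dataset by emitting a GlossaryTerms aspect.
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.mytable,PROD)"
    terms = GlossaryTermsClass(
        terms=[GlossaryTermAssociationClass(urn="urn:li:glossaryTerm:Classification.Sensitive")],
        auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
    )
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(urn=dataset_urn, aspects=[terms])
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)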
  • wonderful-quill-11255
    08/27/2021, 6:09 AM
    Hi. Regarding the number of partitions of the Kafka topics, are there any requirements documented somewhere? If global message ordering is required, then a topic can only have a single partition (with the drawback of reduced throughput, of course). The normal metadata topics are probably low-throughput enough that one partition is fine, but what about this new timeseries topic, for example?
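    For illustration only (not from the thread): if strict ordering matters more than throughput, a topic can be created with a single partition. The broker address, topic name, and replication factor below are assumptions.
    Copy code
    from confluent_kafka.admin import AdminClient, NewTopic

    # Create a metadata topic with one partition so messages keep a global order.
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    futures = admin.create_topics(
        [NewTopic("MetadataChangeEvent_v4", num_partitions=1, replication_factor=1)]
    )
    for topic, future in futures.items():
        future.result()  # raises if topic creation failed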
  • elegant-toddler-36093
    08/27/2021, 5:38 PM
    Hi datahub team! Currently I'm working on ingesting LookML into DataHub. But when I have a view defined as a "derived table" (because it is created with a join of different tables), the data lineage is not shown. On the other hand, all the views defined as a direct relation to only one table do show their lineage. Could you help me understand how to map this correctly so that all the relations are shown?
  • wonderful-quill-11255
    08/27/2021, 6:12 PM
    Is there any feature for automatic housekeeping of the RDBMS to avoid bloat? Something like "retain only the last X versions"?
  • bright-window-42671
    08/27/2021, 7:06 PM
    Hey everyone! Quick question regarding dbt ingestion: does DataHub pull in anything from the meta variable in schema.yml? Right now it isn't even pulling in the column descriptions from this file. I know they are written correctly because they show up in the dbt docs, but not in the metadata that DataHub pulls in (code in thread).
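    For reference, a programmatic recipe sketch for the dbt source (paths, platform, and server are placeholders); catalog.json is the file produced by `dbt docs generate`, and the options assumed here are manifest_path, catalog_path, and target_platform.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    # Minimal dbt -> DataHub ingestion pipeline, defined in code instead of a YAML recipe.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "./target/manifest.json",
                    "catalog_path": "./target/catalog.json",
                    "target_platform": "redshift",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()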
  • clever-ocean-10572
    08/27/2021, 9:16 PM
    Since passwords need to go into these configuration files, how are people securing the passwords to ensure they are not leaked?
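    One common pattern (a sketch, not a recommendation from the thread) is to keep the secret out of the recipe file entirely and inject it at run time, for example from an environment variable populated by your secret manager; the source type and variable name here are placeholders.
    Copy code
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    # The password never appears in a checked-in file; it is read from the environment.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "myserver.com:3306",
                    "username": "my.user",
                    "password": os.environ["MYSQL_PASSWORD"],
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()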
  • salmon-cricket-21860
    08/28/2021, 2:52 AM
    Hi, I am testing the Redash data source which was recently added. 1) How can I set the owner of Redash charts and dashboards automatically while ingesting them? (Our Redash users log in using LDAP too.) 2) If a fix is required for the datahub[redash] pip package, is it possible to use a customized version? For example, customize the DataHub ingestion library (Python files) locally and use it.
  • careful-insurance-60247
    08/28/2021, 5:09 PM
    Does the Kafka Connect source create lineage for sinks using JDBC?
  • high-hospital-85984
    08/29/2021, 1:23 PM
    I made a small PR to make the kafka connector source a bit more resilient: https://github.com/linkedin/datahub/pull/3148. A review would be appreciated 🙏
  • bumpy-activity-74405
    08/30/2021, 6:40 AM
    Hey, there’s a bug with the looker ingestion source - I’ve made a small PR to resolve the issue - https://github.com/linkedin/datahub/pull/3158
  • colossal-furniture-76714
    08/31/2021, 3:24 PM
    Has the idea already popped up to ingest ES indices and Kibana dashboards as well? I could not find anything on that. We have quite a few ES indices and kibana dashboards, so this would be nice to have.
  • lemon-lion-66467
    08/31/2021, 7:02 PM
    Hi all, I am experimenting with running Datahub as the internal data catalog for my company. Our data sets are all in Trino, but the sqlalchemy trino source doesn't cut it for us. We have structured fields in trino. Has there been any attempt to create a Trino source which is not SQLalchemy based?
  • silly-dress-39732
    09/01/2021, 8:09 AM
    Hello, I am trying to configure lineage data. Executing this command: "airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'" but receiving the following error.
    Copy code
    [2021-09-01 15:32:29,340] {cli_action_loggers.py:105} WARNING - Failed to log action with (sqlite3.OperationalError) no such table: log
    [SQL: INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) VALUES (?, ?, ?, ?, ?, ?, ?)]
    [parameters: ('2021-09-01 07:32:29.337103', None, None, 'cli_connections_add', None, 'hadoop', '{"host_name": "localhost", "full_command": "[\'/home/hadoop/.local/bin/airflow\', \'connections\', \'add\', \'--conn-type\', \'datahub_rest\', \'datahub_rest_default\', \'--conn-host\', \'http://localhost:8080\']"}')]
    (Background on this error at: http://sqlalche.me/e/13/e3q8)
    Traceback (most recent call last):
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
        cursor, statement, parameters, context
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
    sqlite3.OperationalError: no such table: connection

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/hadoop/.local/bin/airflow", line 8, in <module>
        sys.exit(main())
      File "/home/hadoop/.local/lib/python3.6/site-packages/airflow/__main__.py", line 40, in main
        args.func(args)
      File "/home/hadoop/.local/lib/python3.6/site-packages/airflow/cli/cli_parser.py", line 48, in command
        return func(*args, **kwargs)
      File "/home/hadoop/.local/lib/python3.6/site-packages/airflow/utils/cli.py", line 91, in wrapper
        return f(*args, **kwargs)
      File "/home/hadoop/.local/lib/python3.6/site-packages/airflow/cli/commands/connection_command.py", line 196, in connections_add
        if not session.query(Connection).filter(Connection.conn_id == new_conn.conn_id).first():
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/orm/query.py", line 3429, in first
        ret = list(self[0:1])
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/orm/query.py", line 3203, in __getitem__
        return list(res)
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
        return self._execute_and_instances(context)
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/orm/query.py", line 3560, in _execute_and_instances
        result = conn.execute(querycontext.statement, self._params)
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
        return meth(self, multiparams, params)
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
        return connection._execute_clauseelement(self, multiparams, params)
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1130, in _execute_clauseelement
        distilled_params,
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
        e, statement, parameters, cursor, context
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
        sqlalchemy_exception, with_traceback=exc_info[2], from_=e
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
        raise exception
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
        cursor, statement, parameters, context
      File "/home/hadoop/.local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
    sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: connection
    [SQL: SELECT connection.password AS connection_password, connection.extra AS connection_extra, connection.id AS connection_id, connection.conn_id AS connection_conn_id, connection.conn_type AS connection_conn_type, connection.description AS connection_description, connection.host AS connection_host, connection.schema AS connection_schema, connection.login AS connection_login, connection.port AS connection_port, connection.is_encrypted AS connection_is_encrypted, connection.is_extra_encrypted AS connection_is_extra_encrypted FROM connection WHERE connection.conn_id = ? LIMIT ? OFFSET ?]
    [parameters: ('datahub_rest_default', 1, 0)]
    (Background on this error at: http://sqlalche.me/e/13/e3q8)
  • elegant-toddler-36093
    09/01/2021, 4:43 PM
    Hi guys! I'm working on ingesting CorpUser data into DataHub via a JSON file. Everything works without problems, but all the data related to CorpUserEditableInfo is not showing up in DataHub. Here is my JSON file and how it shows in DataHub. Can you tell me what I am doing wrong?
    Copy code
    {
            "auditHeader": null,
            "proposedSnapshot": {
                "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot": {
                    "urn": "urn:li:corpuser:carlos.guevara",
                    "aspects": [
                        {
                            "com.linkedin.pegasus2avro.identity.CorpUserInfo": {
                                "active": true,
                                "countryCode": "MX",
                                "departmentId": null,
                                "departmentName": null,
                                "displayName": {
                                    "string": "Carlos Guevara"
                                },
                                "email": "<mailto:cg@kavak.com|cg@kavak.com>",
                                "firstName": "Carlos",
                                "fullName": "Carlos Guevara",
                                "lastName": "Guevara",
                                "managerUrn": "urn:li:corpuser:milan.sahu",
                                "title": {
                                    "string": "Data Engineer"
                                }
                            },
                            "com.linkedin.pegasus2avro.identity.CorpUserEditableInfo": {
                                "pictureLink": "<https://github.com/gabe-lyons.png>",
                                "skills": ["superset"]
                            }
                        }
                    ]
                }
            },
            "proposedDelta": null
        },
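    A hedged guess at the problem (not confirmed in the thread): in the MCE file format each aspect should be its own element of the aspects array, whereas here CorpUserInfo and CorpUserEditableInfo are keys of a single object. An equivalent sketch using the Python classes, with a subset of the same values:
    Copy code
    from datahub.metadata.schema_classes import (
        CorpUserEditableInfoClass,
        CorpUserInfoClass,
        CorpUserSnapshotClass,
        MetadataChangeEventClass,
    )

    # Each aspect is a separate entry in the snapshot's aspects list.
    snapshot = CorpUserSnapshotClass(
        urn="urn:li:corpuser:carlos.guevara",
        aspects=[
            CorpUserInfoClass(
                active=True,
                displayName="Carlos Guevara",
                email="cg@kavak.com",
                title="Data Engineer",
            ),
            CorpUserEditableInfoClass(
                pictureLink="https://github.com/gabe-lyons.png",
                skills=["superset"],
            ),
        ],
    )
    mce = MetadataChangeEventClass(proposedSnapshot=snapshot)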
  • adventurous-scooter-52064
    09/02/2021, 3:09 AM
    If I use AWS Glue as my main source and I want to try out SQL Profiles, I assume that I need to use AWS Athena, but it will show two folders on the UI with the same database and table name. Can I use AWS Glue while using Athena to get all the SQL Profiles and store it back to the AWS Glue source? Hopefully this is not that confusing…
  • some-microphone-33485
    09/02/2021, 4:15 AM
    Hello, a question about ingestion rollback. We have an instance running in EKS. We do not have the run ID available from when we ingested the metadata. How can we extract all the ingestion run IDs from the DataHub instance? Thank you.
  • gentle-optician-51037
    09/02/2021, 9:11 AM
    Hi guys, I am new to DataHub and have some doubts while learning. I'm still working through the official documents, but I'm really curious about the answers to these simple questions: 1. How does DataHub know lineage? For example, I have two tables, tbA and tbB; all the data in tbB comes from tbA, and the tables are rebuilt every hour. When ingesting data, can DataHub analyze that relationship, or is lineage just a description that has to come from somewhere else? 2. I have built DataHub and ingested my Hive data. When I create a new table, I can't find it in DataHub. From the documentation, I had the impression that data changes could be picked up automatically. Is something wrong, or do I need to configure anything? 3. Our Hive tables change frequently. Hundreds of tasks generate new tables every hour and then clear these tables regularly. How does DataHub handle them? I would appreciate it if you could help me resolve my doubts, and I will keep following, learning, and using DataHub.
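    On question 1, a sketch (not from the thread) of how lineage is usually pushed rather than inferred: emit an UpstreamLineage aspect that declares tbB as derived from tbA. The platform, database names, and server address below are assumptions.
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    # Declare that mydb.tbB is transformed from mydb.tbA.
    upstream = UpstreamClass(
        dataset=make_dataset_urn("hive", "mydb.tbA", "PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn=make_dataset_urn("hive", "mydb.tbB", "PROD"),
            aspects=[UpstreamLineageClass(upstreams=[upstream])],
        )
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)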
  • handsome-belgium-11927
    09/02/2021, 11:56 AM
    Hello again! Is there a way to ingest Lineage of a Dashboard via python ingestion framework? DatasetSnapshotClass has got UpstreamLineageClass in its aspects, but DashboardSnapshotClass has no upstreams in aspects.
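    A sketch of one possible workaround, under the assumption that dashboard-to-chart links are expressed through the DashboardInfo aspect (its charts field) rather than an upstream aspect; the URNs, titles, and timestamp below are placeholders.
    Copy code
    import time

    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeAuditStampsClass,
        DashboardInfoClass,
        DashboardSnapshotClass,
        MetadataChangeEventClass,
    )

    now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")
    dashboard_info = DashboardInfoClass(
        title="My dashboard",
        description="Example dashboard",
        lastModified=ChangeAuditStampsClass(created=now, lastModified=now),
        # The dashboard's relationships point at the charts it contains.
        charts=["urn:li:chart:(looker,my_chart_id)"],
    )
    mce = MetadataChangeEventClass(
        proposedSnapshot=DashboardSnapshotClass(
            urn="urn:li:dashboard:(looker,my_dashboard_id)",
            aspects=[dashboard_info],
        )
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)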
  • mammoth-bear-12532
    09/02/2021, 5:50 PM
    @here Hi folks! We just released acryl-datahub==0.8.11.1, which includes the new business glossary source (https://datahubproject.io/docs/metadata-ingestion/source_docs/business_glossary) and the new Azure AD source (https://datahubproject.io/docs/metadata-ingestion/source_docs/azure-ad). Please try them out and give us feedback!
  • numerous-guitar-35145
    09/02/2021, 6:55 PM
    Hi guys, I'm new here. I'm trying to create a recipe for Oracle and I'm in doubt about how to do this: Oracle uses wallets for authentication rather than a user and password. Can anyone give me an example of how I can do this using a recipe or SQLAlchemy?
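    Not a confirmed answer, but one possible sketch: if the Oracle source accepts a full SQLAlchemy URI (an assumption, as is the option name sqlalchemy_uri), cx_Oracle can pick up a wallet via TNS_ADMIN and a TNS alias, so no username or password appears in the recipe. The alias and paths are placeholders.
    Copy code
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    # TNS_ADMIN points at the directory containing the wallet plus tnsnames.ora/sqlnet.ora.
    os.environ["TNS_ADMIN"] = "/path/to/wallet"

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "oracle",
                # Hypothetical: connect via a TNS alias resolved through the wallet.
                "config": {"sqlalchemy_uri": "oracle+cx_oracle://@my_tns_alias"},
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()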
  • best-toddler-40650
    09/02/2021, 10:13 PM
    Hi, I wrote a mysql recipe to ingest mydatabase schema into datahub. It is working except for the fact that it is bringing the schemas of all databases in myserver, instead of just the ones present in mydatabase. Any idea what is happening?
    Copy code
    ---
    # see https://datahubproject.io/docs/metadata-ingestion/source_docs/mysql for complete documentation
    source:
      type: "mysql"
      config:
        username: my.user
        password: 12345678
        host_port: myserver.com:21286
        database: mydatabase
    
    # see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
  • stale-jewelry-2440
    09/03/2021, 9:38 AM
    Hi! I'm trying to profile a MSSQL database at ingestion. The ingestion starts fine, then many warnings are raised, like: [2021-09-03 09:34:06,970] WARNING {great_expectations.dataset.sqlalchemy_dataset:2023} - Regex is not supported for dialect <sqlalchemy_pytds.dialect.MSDialect_pytds object at 0x7f5b33225df0> And at the end I just see a bare 'Killed', without any recap of the ingestion. Maybe there are too many warnings and that makes the procedure blow up? And how can I resolve those warnings? Thank you 🙂
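    A bare 'Killed' usually means the process was stopped by the operating system's out-of-memory killer rather than by the warnings themselves. A sketch of narrowing the profiling scope to reduce memory use, assuming the profiling and profile_pattern options of the SQL sources (connection details and table names are placeholders):
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mssql",
                "config": {
                    "host_port": "myserver:1433",
                    "database": "mydb",
                    "username": "user",
                    "password": "pass",
                    "profiling": {"enabled": True},
                    # Profile only a handful of tables instead of the whole database.
                    "profile_pattern": {"allow": ["mydb\\.dbo\\.important_table"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()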
  • orange-airplane-6566
    09/03/2021, 2:22 PM
    Good morning from Chicago 👋🏻 We've been hitting an issue recently where a failed MCE message can't be written to the FailedMetadataEventChange_v4 topic. We have compaction enabled on that topic, but it seems DataHub is trying to write messages there without a key. (more details in thread)
  • average-holiday-92911
    09/03/2021, 2:49 PM
    Hi, I was trying to ingest a Glue table but got the below error:
  • powerful-telephone-71997
    09/06/2021, 7:45 AM
    Folks, Any pointers for me to check this issue:
    Copy code
    'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: ERROR :: '
                                          '/value/com.linkedin.metadata.snapshot.DatasetSnapshot/urn :: "Provided urn '
                                          'urn:li:dataset:(urn:li:dataPlatform:redshift,<>.<>.<>.<>),PROD)" is invalid\n'
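    The quoted URN appears to contain an extra ")" after the name segment. For comparison, a sketch of building a well-formed Redshift dataset URN with the helper from the Python library (the table name is a placeholder):
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn

    # Produces: urn:li:dataset:(urn:li:dataPlatform:redshift,mydb.myschema.mytable,PROD)
    urn = make_dataset_urn(platform="redshift", name="mydb.myschema.mytable", env="PROD")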
  • bumpy-activity-74405
    09/07/2021, 6:13 AM
    hey, can someone maybe look at this PR?
  • chilly-barista-6524
    09/07/2021, 8:21 AM
    Hey, we are using this ingestion script https://github.com/linkedin/datahub/tree/v0.6.0/metadata-ingestion/mce-cli (yeah, we will upgrade to the latest very soon 😅🙏) to ingest data into our DataHub deployment. While this works great for inserting data, I wanted to know whether it can also be used to upsert datasets. I see a proposedDelta field in the bootstrap_mce.dat file but I'm not sure how to provide input for it, since there is no example of upserting.
  • rapid-sundown-8805
    09/07/2021, 1:40 PM
    Hi community! Anyone know how to specify the MCE (or I guess MCP?) topic name when using no-code ingestion + Kafka?
  • clever-australia-61035
    09/08/2021, 8:41 AM
    Hi, I’ve enabled the LDAP plugin and configured the AD server, username, password, and base-dn details. The ingest process was successful, but when I log into the UI using those credentials, it fails with the message “Failed to log in! Invalid Credentials.” My credentials are correct and the user exists in AD. Is there anything else to be configured in DataHub after the ingestion?
  • happy-magazine-52755
    09/08/2021, 8:37 PM
    Hi all! I am pretty new to DataHub and maybe one of you could help me 🙂 I am trying to ingest from the dbt source but the following error pops up:
    Copy code
    dbtNode.columns = get_columns(catalog[dbtNode.dbt_name])
    KeyError: 'seed.redshift_dbt.xxxxxxxxxx'
    As I understand it, this object does not exist in the catalog file but is present in the manifest, so I would like to exclude it from the ingestion process. I saw that in the config (recipe) it is possible to add
    Copy code
    node_type_pattern:
          deny:
    but it only applies to the whole seed node type, like "^seed.*"; otherwise it doesn’t work. Is it possible to exclude this particular node (seed.redshift_dbt.xxxxxxxxxx) from ingestion?