# ingestion
  • gorgeous-glass-57878 (08/18/2021, 4:58 PM)
    Hi guys, I am trying to ingest Snowflake metadata into DataHub but am getting this error:
    pkg_resources.ContextualVersionConflict: (certifi 2021.5.30 (/usr/local/lib/python3.8/site-packages), Requirement.parse('certifi<2021.0.0'), {'snowflake-connector-python'})
    My Snowflake recipe structure looks like this:
    source:
      type: "snowflake"
      config:
        username: user
        password: "password"
        host_port: port
        database: database
        warehouse: warehouse
        table_pattern:
          allow:
            - "DATABASE.SCHEMA.TABLE_NAME"
    sink:
      type: "datahub-rest"
      config:
        server: 'server_url'

  • clever-ocean-10572 (08/18/2021, 8:24 PM)
    When using the MSSQL ingestion, is there a way to have it ingest all databases or do you need to specify the database?

  • modern-nail-74015 (08/19/2021, 3:06 AM)
    Say I add a column to a table: how can I see both the current schema and an older version of it?

  • magnificent-camera-71872 (08/19/2021, 5:44 AM)
    Hi all..... I've got a strange problem which I hope someone can help with. I can successfully ingest a table using a recipe and the CLI and can view this table on the UI. However, when I try to ingest the same table using an Airflow DAG, I get the following error trying to view it on the UI:

  • bumpy-activity-74405 (08/19/2021, 7:14 AM)
    Hey, is this something that you would consider merging?

  • able-activity-25706 (08/19/2021, 7:58 AM)
    my recipe is like:
    source:
      type: mssql
      config:
        host_port: "localhost:16000"
        database: demo_db
        username: datahub
        password: datahub
    sink:
      type: "datahub-rest"
      config:
        server: "http://127.0.0.1:18080"

  • witty-butcher-82399 (08/19/2021, 12:36 PM)
    Hi! I’ve got the following error while running the redshift connector with profiling enabled:
    File "/usr/local/lib/python3.8/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 222, in _handle_convert_column_evrs
        column_profile.nullProportion = res["unexpected_percent"] / 100

    TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
    It seems like a bug, so I’m sharing it here. I can create an issue on GitHub if you prefer that.
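    For reference, a minimal guard that would avoid this crash, written against the snippet above (a hypothetical sketch, not the actual upstream fix):
    # Hypothetical guard: the profiler result can carry None for unexpected_percent,
    # so only compute nullProportion when a value is actually present.
    unexpected_percent = res.get("unexpected_percent")
    if unexpected_percent is not None:
        column_profile.nullProportion = unexpected_percent / 100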

  • clever-ocean-10572 (08/19/2021, 2:01 PM)
    While ingesting, is it possible to have the database tagged or noted in some way with the instance the metadata came from? For instance, if my MSSQL instance is named BillyBob, I'd like to know which datasets are specifically in BillyBob.

  • clever-ocean-10572 (08/19/2021, 3:47 PM)
    I feel like I'm missing something obvious. I'm getting the error "uri_args is not supported when ODBC is disabled". How do you enable it?

  • best-balloon-56 (08/20/2021, 6:07 AM)
    Is there a plan to support additional message queues apart from Kafka, like Google Pubsub?

  • handsome-football-66174 (08/20/2021, 5:28 PM)
    General question: how do we connect an existing Airflow environment to the DataHub Docker setup? Also, in the documentation https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow , where should the airflow.cfg be located (new to Airflow as well)?

  • handsome-football-66174 (08/20/2021, 7:17 PM)
    Have the following questions: 1. How do we use a different database for the GMS service? Is there any documentation we can follow? 2. To run DataHub, are both a graph index and a search index required? What are used for the graph index and search index by default? (Referring to the architecture diagram shared.)

  • clever-ocean-10572 (08/20/2021, 9:04 PM)
    Could someone please explain to my small brain how to delete from this? I ingested under the env of DEV when I wanted to ingest it as PROD. I really don't understand the documentation on how to delete.

  • gorgeous-fountain-3070 (08/22/2021, 1:47 PM)
    Hi, do we have a way to ingest data from Microsoft Azure SQL Server yet?

  • clever-ocean-10572 (08/23/2021, 5:50 PM)
    So when using the data ingestion profiling, I get an error:
    TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

  • handsome-football-66174 (08/23/2021, 7:26 PM)
    Hi - wanted to understand how we can deploy DataHub on AWS. Does it have to run on EKS as per the documentation, or can it also be run on an ECS cluster?

  • high-house-54354 (08/23/2021, 8:20 PM)
    Hi, I'm getting an error when setting "env" in a recipe to anything other than "PROD", like "TESTE" (Portuguese for "test"). Any tips on what I'm doing wrong? I feel that maybe I'm missing something. The error is a long Java stack trace that ends in:
    ... 86 more
    Caused by: java.net.URISyntaxException: Invalid URN Parameter: 'No enum constant com.linkedin.common.FabricType.TESTE: urn:li:dataset:(urn:li:dataPlatform:mssql,db_XXX.dbo.tb_ZZZZ,TESTE)
    Using Pipeline.create() as:
    from datahub.ingestion.run.pipeline import Pipeline

    # conn_odbc and database are defined elsewhere in the surrounding code.
    pipeline = Pipeline.create(
        # This configuration is analogous to a recipe configuration.
        {
            "source": {
                "type": "mssql",
                "config": {
                    "username": conn_odbc.login,
                    "password": conn_odbc.password,
                    "database": database,
                    "host_port": conn_odbc.host,
                    "use_odbc": "True",
                    "env": "TESTE",
                },
            },
        }
    )
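    For context, here is a small sketch that lists the env values the installed Python client was generated with (it assumes the generated FabricTypeClass in schema_classes, which mirrors com.linkedin.common.FabricType; names may differ by version):
    # Print the fabric/env constants known to the client; the server rejects any
    # value outside com.linkedin.common.FabricType, hence the URN error above.
    from datahub.metadata.schema_classes import FabricTypeClass

    print([name for name in vars(FabricTypeClass) if not name.startswith("_")])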

  • bored-advantage-45185 (08/23/2021, 9:14 PM)
    When running data ingestion over REST with datahub_docker.sh ingest -c postges_to_datahub.yml, a new screen pops up and it closes when the execution is done. Is there a way to see the log or keep the window open to see the status of the ingestion? Currently I don't see the data getting loaded into DataHub yet.

  • magnificent-camera-71872 (08/24/2021, 12:19 AM)
    Hi.... Is there an equivalent to the DataFlowSnapshotClass for ML flows (like Kubeflow pipelines)? I looked through the schema_classes.py definitions but nothing stood out.... Thanks....
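    One quick check is to list the ML-related classes that the client generates, since schema_classes.py mirrors the PDL models (a small inspection sketch):
    # List everything ML-related in the generated models (e.g. MLModelSnapshotClass)
    # to see what could play the role of a DataFlow snapshot for ML pipelines.
    import datahub.metadata.schema_classes as models

    print(sorted(name for name in dir(models) if "ML" in name))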

  • melodic-helmet-78607 (08/24/2021, 3:39 AM)
    Hi, are there currently any documented examples for generating com.linkedin.metadata.delta.Delta (https://datahubproject.io/docs/what/delta/)?

  • gorgeous-fountain-3070 (08/24/2021, 8:21 AM)
    Hello, I am trying to ingest data from Microsoft Azure SQL Server and am receiving the following error:
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\datahub\entrypoints.py", line 91, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\click\core.py", line 1137, in __call__
        return self.main(*args, **kwargs)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\click\core.py", line 1062, in main
        rv = self.invoke(ctx)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\click\core.py", line 1668, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\click\core.py", line 1668, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\click\core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\click\core.py", line 763, in invoke
        return __callback(*args, **kwargs)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\datahub\cli\ingest_cli.py", line 48, in run
        pipeline_config = load_config_file(config_file)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\datahub\configuration\config_loader.py", line 35, in load_config_file
        config = config_mech.load_config(config_fp)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\datahub\configuration\yaml.py", line 12, in load_config
        config = yaml.safe_load(config_fp)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\__init__.py", line 162, in safe_load
        return load(stream, SafeLoader)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\__init__.py", line 114, in load
        return loader.get_single_data()
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\constructor.py", line 49, in get_single_data
        node = self.get_single_node()
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\composer.py", line 36, in get_single_node
        document = self.compose_document()
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\composer.py", line 55, in compose_document
        node = self.compose_node(None, None)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\composer.py", line 84, in compose_node
        node = self.compose_mapping_node(anchor)
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\composer.py", line 127, in compose_mapping_node
        while not self.check_event(MappingEndEvent):
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\parser.py", line 98, in check_event
        self.current_event = self.state()
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\parser.py", line 428, in parse_block_mapping_key
        if self.check_token(KeyToken):
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\scanner.py", line 116, in check_token
        self.fetch_more_tokens()
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\scanner.py", line 223, in fetch_more_tokens
        return self.fetch_value()
    File "C:\Users\v-akshaya\AppData\Local\Programs\Python\Python39\lib\site-packages\yaml\scanner.py", line 577, in fetch_value
        raise ScannerError(None, None,
    ScannerError: mapping values are not allowed here
      in "<file>", line 3, column 9
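    Since the failure happens while yaml.safe_load parses the recipe, the file can be checked locally before re-running the ingest command (a minimal sketch; the recipe path is a placeholder):
    import yaml

    # Reproduce the parse outside of `datahub ingest`; a ScannerError here points
    # at the offending spot in the recipe (line 3, column 9 in the error above).
    with open("recipe.yml") as f:  # hypothetical recipe path
        yaml.safe_load(f)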

  • square-activity-64562 (08/24/2021, 12:59 PM)
    How do I ingest S3 data so that the icon shows S3, like https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:s3,datahubpro[…]-pipelines.entity_aspect_splits.all_entities,PROD)/schema?
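    For illustration, the icon comes from the data platform encoded in the dataset URN in that link; a tiny sketch of building such a URN (the dataset name below is made up):
    # Datasets rendered with the S3 logo use the "s3" data platform in their URN.
    name = "my-bucket/path/to/table"  # hypothetical dataset name
    env = "PROD"
    urn = f"urn:li:dataset:(urn:li:dataPlatform:s3,{name},{env})"
    print(urn)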

  • bored-advantage-45185 (08/24/2021, 10:08 PM)
    In the documentation there is a mention that an airflow.cfg needs to be created, but it's not clear where it needs to be created and placed, or how it is then used in the command. Also, can lineage be set at the attribute level or per transform task, or only at the table/entity level?

  • magnificent-camera-71872 (08/24/2021, 11:49 PM)
    Hi... Is there a simple way to get the dependencies of an object? I have a big-picture scheme where I'd like to pick up MCEs from Kafka and query DataHub to find dependencies on the changed object. If a dependency turns out to be an executable (such as an Airflow DAG, dashboard, chart, ML model, etc.), then I'll trigger a downstream process to refresh those objects... Does this seem reasonable/doable? As always, greatly appreciate your support....
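    A rough sketch of that loop (the GMS endpoint name and query parameters below are assumptions and may differ by DataHub version, so treat this as an outline rather than a confirmed API):
    import requests

    GMS = "http://localhost:8080"  # hypothetical GMS address
    changed_urn = "urn:li:dataset:(urn:li:dataPlatform:mssql,db.schema.table,PROD)"  # made-up URN

    # Ask GMS which entities point at the changed dataset, then decide whether each
    # related entity is "executable" (DAG, dashboard, chart, model) and refresh it.
    resp = requests.get(
        f"{GMS}/relationships",
        params={"direction": "INCOMING", "types": "DownstreamOf", "urn": changed_urn},
    )
    for rel in resp.json().get("relationships", []):
        print("candidate for refresh:", rel)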

  • mammoth-sugar-1353 (08/25/2021, 1:20 PM)
    Hi all 👋, one of my customers has just started to import datasets into their brand new lakehouse. They'd like to prioritise their ingestion roadmap by request/popularity. The current plan is to connect DataHub up to SwaggerHub and ingest "tables" from the API GET methods, rather than ingest the operational database schemas. Long term we plan to stream operational data in via Kafka rather than clone the DBs each day, depending on the designed API/topic schemas rather than the ORM-controlled ones, so this feels like the best place to start. The aim is to give the data community visibility of available data, without having to ingest everything first. Has anyone else done something similar? I spotted a few Swagger comments, but it wasn't clear that it was quite the same thing.

  • rhythmic-london-44496 (08/25/2021, 2:50 PM)
    Hey, I've noticed that the LookML ingestion plugin doesn't parse includes recursively, and because of that, some views might not be ingested. Should I create a GitHub issue for it?

  • blue-holiday-20644 (08/25/2021, 3:01 PM)
    Hi, did anyone manage to resolve the AWS MSK managed Kafka timeout issue when ingesting? I'm getting the same error.

  • colossal-account-65055 (08/25/2021, 6:26 PM)
    Hello! I'm noticing that when ingesting data from BigQuery, if the active service account doesn't actually have the necessary permissions on the GCP project being ingested, then there will be no error message, just a "Pipeline finished successfully" message but no data ingested. Is that expected behavior?

  • adamant-furniture-37835 (08/26/2021, 12:05 PM)
    Hello! We have started to explore DataHub to evaluate whether it meets our requirements. So far it looks impressive and promising to us 🙂 One of the requirements we have is to ingest metadata from Qlik Sense, Informatica and Teradata. We wonder if these sources are in scope for plugin development on upcoming roadmaps, or does someone else have similar needs? Thanks, Mahesh

  • thousands-tailor-5575 (08/26/2021, 2:47 PM)
    Hi guys, calling anyone who has experience spreading self-service DataHub usage throughout their organisation. I would appreciate any tips on how to make the adoption easier. (E.g. are you using the available connectors to extract metadata from databases and allowing it to be used by anyone, or are you building abstracted APIs on top to have more control over what users can do?)