# ingestion
  • a

    ancient-policeman-73437

    08/30/2022, 10:07 AM
    Dear DataHub team, I cannot work out what to provide as base_folder for LookML. We have a fairly classic Looker structure: ProjectName/models and ProjectName/views. What does the API expect me to provide? Many thanks in advance.
    g
    • 2
    • 10
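    For context on base_folder: it is the local filesystem path where the LookML project is checked out, i.e. the directory that contains the project's models/ and views/ folders. A minimal, hypothetical sketch of a programmatic recipe is below; the repo path, connection name, platform, and server address are all placeholders, and the connection mapping shown is only one of the options the LookML source accepts.
    # Hypothetical sketch of a LookML recipe run programmatically.
    # Paths, connection names, and the server address are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "lookml",
                "config": {
                    # Root of the checked-out LookML repo: the folder that
                    # contains the project's models/ and views/ subfolders.
                    "base_folder": "/opt/looker/ProjectName",
                    # Placeholder mapping of LookML connection name -> platform;
                    # see the LookML source docs for the exact options.
                    "connection_to_platform_map": {"my_connection": "snowflake"},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()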
  • s

    silly-finland-62382

    08/30/2022, 12:19 PM
    Hey, I am using this code
    Copy code
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .enableHiveSupport() \
        .getOrCreate();
    df = spark.sql("select * from parkdatabricks_table_test1")
    but I am seeing the upstream as hdfs, not hive. Can you suggest how to show the upstream in DataHub as Hive for the same dataset using Spark lineage?
  • m

    millions-sundown-65420

    08/30/2022, 12:35 PM
    Hi. I am planning to integrate DataHub with Spark code that reads data from a MySQL database and writes the transformed data to a Mongo collection. Is there any simple end-to-end code example for integrating DataHub with Spark that I could take a look at? Thanks.
    g
    • 2
    • 4
  • m

    modern-monitor-68945

    08/30/2022, 1:07 PM
    Hi! A question regarding the Airflow integration again. There is a link on the pipeline screen (screenshot in the thread) which should take a user to the DAG in question. How can we pass the correct address for each Airflow instance (we have several)?
    d
    • 2
    • 3
  • n

    narrow-toothbrush-13209

    08/30/2022, 1:11 PM
    Hi! A question: pushing data from an Airflow task throws an error during metadata ingestion. Error:
    Copy code
    Task exited with return code Negsignal.SIGSEGV
    d
    g
    b
    • 4
    • 10
  • s

    sparse-advantage-78335

    08/30/2022, 1:30 PM
    Hello community, I'm just starting with DataHub and would like to understand ingestion of file-based data sources. Basically, one of my data sources is PDFs generated overnight. Can I monitor the arrival of the files with DataHub? Or can I only create a 'file source' like here: https://datahubproject.io/docs/generated/ingestion/sources/file to have my PDF source visible in DataHub?
    g
    • 2
    • 2
  • b

    big-barista-70811

    08/30/2022, 1:44 PM
    Hello guys, good morning. I can't find the following in the DataHub docs: what permissions do I have to grant to the user I created in the Oracle database for DataHub to work properly?
    b
    • 2
    • 2
  • l

    lemon-engine-23512

    08/30/2022, 3:55 PM
    Hello all, can I test my custom transformer code and print its output before I put it in my recipe? How do I pass the MCE event class to it?
    g
    • 2
    • 4
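    A rough, hypothetical harness for exercising a custom transformer outside a recipe: build a minimal MCE with the SDK classes, wrap it in a RecordEnvelope, and call transform() directly. MyTransformer, its module, and the dataset URN below are placeholders for your own code.
    # Hypothetical test harness for a custom transformer (outside a recipe).
    # MyTransformer and its config dict are placeholders for your own class.
    import datahub.emitter.mce_builder as builder
    from datahub.ingestion.api.common import PipelineContext, RecordEnvelope
    from datahub.metadata.schema_classes import (
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        StatusClass,
    )

    from my_transformers import MyTransformer  # placeholder import

    # Build a minimal MCE to feed through the transformer.
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn=builder.make_dataset_urn("hive", "db.table", "PROD"),
            aspects=[StatusClass(removed=False)],
        )
    )

    ctx = PipelineContext(run_id="transformer-test")
    transformer = MyTransformer.create({}, ctx)  # pass your config dict here

    # Transformers consume and produce streams of RecordEnvelopes.
    for envelope in transformer.transform([RecordEnvelope(record=mce, metadata={})]):
        print(envelope.record)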
  • s

    silly-finland-62382

    08/30/2022, 5:41 PM
    Hey, I am using this code
    Copy code
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .enableHiveSupport() \
        .getOrCreate();
    df = spark.sql("select * from parkdatabricks_table_test1")
    but I am seeing the upstream as hdfs, not hive. Can you suggest how to show the upstream in DataHub as Hive for the same dataset using Spark lineage? (edited)
  • b

    brave-nail-85388

    08/30/2022, 8:13 PM
    How do I ingest from Snowflake with SSO login via a snowflake-usage YAML recipe?
    l
    • 2
    • 1
  • b

    brave-nail-85388

    08/30/2022, 8:13 PM
    Hi team, I need your help on how to ingest from Snowflake with SSO login via a snowflake-usage YAML recipe.
  • c

    cool-actor-73767

    08/30/2022, 3:36 PM
    Hi everyone! Is there any way to pass more than one database name in the YAML for Athena ingestion from the UI at the same time? I need to load some specific databases from Athena - database_A, database_B, ... If I remove the database parameter from the YAML, all databases are loaded; if I set the database parameter, only one database at a time is loaded.
    g
    h
    • 3
    • 2
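    One possible approach, assuming the Athena source honours the generic schema_pattern allow/deny filter that the SQLAlchemy-based sources share (worth confirming against the Athena source docs): drop the single database parameter and allow-list the databases you want. Sketched below as a programmatic recipe; the same keys apply in a YAML recipe, and the database names are placeholders.
    # Hypothetical sketch: ingest several specific Athena databases by
    # allow-listing them instead of using the single `database` parameter.
    # Assumes schema_pattern is honoured by the Athena source.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "athena",
                "config": {
                    "aws_region": "us-east-1",  # placeholder
                    "work_group": "primary",    # placeholder
                    # ...plus your query result location / credentials as in
                    # your existing recipe...
                    # Omit `database` and allow-list databases instead:
                    "schema_pattern": {
                        "allow": ["^database_A$", "^database_B$"],
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()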
  • f

    few-carpenter-93837

    08/31/2022, 5:35 AM
    Hey, a question regarding issue #5295: does anyone have an idea whether this can be fixed from our side? The issue is with _get_column_info in vertica.py. For timestamptz & timestamp there is an argument called precision, but the current import (sqlalchemy.sql import sqltypes) does not accept a precision argument in class TIMESTAMP(DateTime), only timezone. A dirty fix on our side was to just add precision=None and self.precision = precision to the class, which now gives correct output, but as you might expect, patching our custom code over the sqlalchemy dependency in the CLI codebase isn't a perfect solution. Any ideas or directions on how to tackle this would be appreciated.
    g
    • 2
    • 4
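    A rough sketch of applying that same workaround as a runtime monkeypatch from your own code (before the ingestion runs), instead of editing the installed sqlalchemy package; the attribute names simply mirror the fix described above and are not an official API.
    # Rough sketch: accept and store the `precision` argument on
    # sqlalchemy's TIMESTAMP at runtime, mirroring the fix described above.
    # Apply this before running the Vertica ingestion.
    from sqlalchemy.sql import sqltypes

    _original_init = sqltypes.TIMESTAMP.__init__

    def _patched_init(self, timezone=False, precision=None):
        # Keep sqlalchemy's behaviour, but tolerate the extra argument
        # that vertica.py passes for timestamp / timestamptz columns.
        _original_init(self, timezone=timezone)
        self.precision = precision

    sqltypes.TIMESTAMP.__init__ = _patched_init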
  • f

    flaky-soccer-57765

    08/31/2022, 8:15 AM
    Hey all, morning. I am trying to ingest data from an MS SQL Server running on another network server. DataHub throws an error stating the connection was refused; however, I am able to connect to that SQL Server through SSMS. Attaching logs below. Can you suggest anything, please?
    c
    • 2
    • 1
  • b

    better-orange-49102

    08/31/2022, 10:13 AM
    I am trying to programmatically create a pipeline to GMS where I will: 1. receive a JSON file containing the DataHub-compliant metadata and do some custom logic on it; 2. read the JSON file and call obj.validate on the JSON objects; 3. create a pipeline to ingest the JSON file to GMS. Can I check whether there is a way to verify that (3) will succeed without actually ingesting into GMS? Does the dry-run flag in the Pipeline class work here? My end goal is to either succeed completely (all JSON objects ingested) or fail if even one object fails in step 2 or 3. I do not want a half-ingested state where some aspects get rejected by GMS. As to why I am doing this wacky flow, refer to https://datahubspace.slack.com/archives/C02FKQAGRG9/p1661327597785869?thread_ts=1661191206.692289&cid=C02FKQAGRG9
    g
    g
    • 3
    • 35
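    A rough sketch of the two-phase idea described above, assuming the file holds a list of MCE objects and using the obj.validate() call mentioned in the message: deserialize and validate everything first, and only build the pipeline if every object passes. Note this does not by itself make step 3 atomic on the GMS side; the file path and server address are placeholders.
    # Sketch: validate every object up front, then ingest the file only if
    # all of them pass. File path and server address are placeholders.
    import json

    from datahub.ingestion.run.pipeline import Pipeline
    from datahub.metadata.schema_classes import MetadataChangeEventClass

    MCE_FILE = "metadata.json"  # placeholder

    with open(MCE_FILE) as f:
        raw_objects = json.load(f)

    # Phase 1: fail fast if any object does not deserialize/validate.
    for i, obj in enumerate(raw_objects):
        try:
            MetadataChangeEventClass.from_obj(obj).validate()
        except Exception as e:
            raise SystemExit(f"object {i} failed validation, aborting: {e}")

    # Phase 2: only now hand the file to a pipeline that writes to GMS.
    pipeline = Pipeline.create(
        {
            # The file source config key may be `path` in newer CLI versions.
            "source": {"type": "file", "config": {"filename": MCE_FILE}},
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()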
  • b

    bright-receptionist-94235

    08/31/2022, 11:02 AM
    Hi all! We are starting to test Vertica ingestion. The process is very slow since it queries Vertica for each table separately, e.g.: "SELECT column_name, data_type, column_default, is_nullable FROM v_catalog.columns WHERE lower(table_name) = 'XXX' AND lower(table_schema) = 'analysts' UNION ALL SELECT column_name, data_type, '' as column_default, true as is_nullable FROM v_catalog.view_columns WHERE lower(table_name) = 'XXX' AND lower(table_schema) = 'analysts'". Vertica metadata ingestion is not as fast as MySQL. Why not get all the information in a single query and iterate at the cursor level in the code? It would be much faster and the correct way to work with Vertica.
    g
    • 2
    • 2
  • s

    silly-finland-62382

    08/31/2022, 4:12 PM
    Hey, I am using this code
    Copy code
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .enableHiveSupport() \
        .getOrCreate();
    df = spark.sql("select * from parkdatabricks_table_test1")
    but I am seeing the upstream as hdfs, not hive. Can you suggest how to show the upstream in DataHub as Hive for the same dataset using Spark lineage? (edited)
    w
    • 2
    • 4
  • k

    kind-whale-32412

    08/31/2022, 8:01 PM
    I am having this problem while writing my custom ingestor in Java: https://datahubspace.slack.com/archives/C029A3M079U/p1661976075687699
  • l

    lemon-engine-23512

    08/31/2022, 9:25 PM
    Can we have transformers with mcp instead of mce?
    g
    • 2
    • 4
  • f

    full-chef-85630

    08/31/2022, 9:07 AM
    Hi all, for tasks performed using Airflow, how do we manage the "properties" and the "View in Airflow" link?
    g
    • 2
    • 1
  • j

    jolly-traffic-67085

    09/01/2022, 7:34 AM
    Hi team, I have a question I would like to ask: I would like to ingest a certificated database into DataHub. Can this be done?
    h
    • 2
    • 3
  • h

    hallowed-kilobyte-916

    09/01/2022, 12:54 PM
    I have a recipe for Glue as follows:
    Copy code
    source:
      type: glue
      config:
        aws_region: ${aws_region}
        aws_access_key_id: ${aws_access_key_id}
        aws_secret_access_key: ${aws_secret_access_key}
    I created a .env file where I defined the environment variables like aws_region. How do I reference the .env file? I can't seem to find any documentation on this.
    h
    • 2
    • 2
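    The ${...} placeholders in a recipe are expanded from the process environment, so the .env file is not read automatically. One hypothetical option is to load it yourself (e.g. with python-dotenv) and run the recipe programmatically, as sketched below with the same variable names as the recipe above; when using the CLI directly, exporting the variables in the shell before running datahub ingest achieves the same thing.
    # Hypothetical sketch: load a .env file into the environment, then build
    # the Glue recipe from those variables (names mirror the recipe above).
    import os

    from dotenv import load_dotenv  # pip install python-dotenv
    from datahub.ingestion.run.pipeline import Pipeline

    load_dotenv(".env")  # makes aws_region etc. available via os.environ

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "glue",
                "config": {
                    "aws_region": os.environ["aws_region"],
                    "aws_access_key_id": os.environ["aws_access_key_id"],
                    "aws_secret_access_key": os.environ["aws_secret_access_key"],
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()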
  • m

    millions-sundown-65420

    09/01/2022, 1:39 PM
    Hi. Is there some example code for integrating Spark code with DataHub to emit metadata events? Thanks. I just have this at the moment, but I'm not sure how DataHub listens to changes in my destination database and emits metadata.
    Copy code
    sparkSession = SparkSession.builder \
              .appName("Events to Datahub") \
              .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23") \
              .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
              .config("spark.datahub.rest.server", "http://localhost:9002") \
              .enableHiveSupport() \
              .getOrCreate()
    h
    d
    • 3
    • 15
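    A rough end-to-end sketch building on the configuration above: with the listener registered, lineage is captured from the reads and writes the job performs, so no extra emitting code is needed in the job itself. The JDBC URL, table, and Mongo settings below are placeholders, and the Mongo write assumes the MongoDB Spark connector (v10-style options) is on the classpath.
    # Rough end-to-end sketch based on the configuration above. Connection
    # details are placeholders; the DataHub listener captures lineage from
    # the reads and writes this job performs.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("Events to Datahub")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:9002")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read from MySQL over JDBC (placeholder connection details).
    src = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql-host:3306/mydb")
        .option("dbtable", "orders")
        .option("user", "user")
        .option("password", "password")
        .load()
    )

    transformed = src.filter("amount > 0")  # your transformation logic here

    # Write to MongoDB (format/option names assume the v10 mongo-spark
    # connector; older connectors use format "mongo" and spark.mongodb.* conf).
    (
        transformed.write.format("mongodb")
        .option("connection.uri", "mongodb://mongo-host:27017")
        .option("database", "analytics")
        .option("collection", "orders_clean")
        .mode("append")
        .save()
    )

    spark.stop()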
  • l

    little-spring-72943

    09/01/2022, 4:00 PM
    How can I delete all lineage without deleting the objects (URNs) themselves?
    h
    • 2
    • 1
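    One possible (hedged) approach: overwrite the upstreamLineage aspect of each affected dataset with an empty upstream list via the REST emitter, which removes the lineage edges while leaving the entities themselves in place. The URN list and server address below are placeholders; other lineage aspects (e.g. dataJob inputs/outputs) would need the same treatment.
    # Sketch: clear dataset-to-dataset lineage by overwriting the
    # upstreamLineage aspect with an empty list of upstreams.
    # URNs and server address are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, UpstreamLineageClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    dataset_urns = [
        "urn:li:dataset:(urn:li:dataPlatform:hive,db.downstream_table,PROD)",  # placeholder
    ]

    for urn in dataset_urns:
        mcp = MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=urn,
            aspectName="upstreamLineage",
            aspect=UpstreamLineageClass(upstreams=[]),
        )
        emitter.emit_mcp(mcp)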
  • r

    rapid-fall-7147

    09/01/2022, 4:34 PM
    Hello everyone, is there a way in DataHub to ingest field-level descriptions for DB sources like Delta Lake or Redshift through YAML, using a transformer? Any ideas?
    h
    • 2
    • 4
  • l

    late-truck-7887

    09/01/2022, 5:14 PM
    Hi DataHub team, new to DataHub here. I want to store a dataset name that is distinct from the URN; i.e., in my use case the URN is a path on S3, but the goal is to have the dataset name in DataHub be a nice human-readable name. What I have tried so far (in Python):
    Copy code
    aspects = [
            DatasetPropertiesClass(
                name=nice_human_readable_name,
                customProperties=properties,
                description=description,
                externalUrl=url
            ),
        ]
    or alternatively:
    Copy code
    aspects = [
            DatasetPropertiesClass(
                qualifiedName=nice_human_readable_name,
                customProperties=properties,
                description=description,
                externalUrl=url
            ),
        ]
    The MCPs that are generated (which I pass to connection.submit_change_proposals) have the proper format:
    Copy code
    [MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='ownership', aspect=OwnershipClass({'owners': [OwnerClass({'owner': 'urn:li:corpuser:etl', 'type': 'DATAOWNER', 'source': OwnershipSourceClass({'type': 'SERVICE', 'url': None})})], 'lastModified': AuditStampClass({'time': 1661399154, 'actor': 'urn:li:corpuser:etl', 'impersonator': None, 'message': None})}), systemMetadata=None), MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='datasetProperties', aspect=DatasetPropertiesClass({'customProperties': {'here3567c322-fd92-4417-98f0-90a66e32101b': 'are some fake properties', 'that_are': 'used_for_testing'}, 'externalUrl': None, 'name': 'test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b', 'qualifiedName': None, 'description': 'This is a fake description of a dataset', 'uri': None, 'tags': []}), systemMetadata=None), MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='institutionalMemory', aspect=InstitutionalMemoryClass({'elements': [InstitutionalMemoryMetadataClass({'url': '<https://www.google.com/>', 'description': 'link3567c322-fd92-4417-98f0-90a66e32101b', 'createStamp': AuditStampClass({'time': 1661399154, 'actor': 'urn:li:corpuser:etl', 'impersonator': None, 'message': None})})]}), systemMetadata=None), MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='globalTags', aspect=GlobalTagsClass({'tags': [TagAssociationClass({'tag': 'urn:li:tag:tag13567c322-fd92-4417-98f0-90a66e32101b', 'context': None}), TagAssociationClass({'tag': 'urn:li:tag:tag_23567c322-fd92-4417-98f0-90a66e32101b', 'context': None})]}), systemMetadata=None)]
    but then I get this rather cryptic error message (see attached screenshot). Any advice appreciated! Thanks!
    g
    • 2
    • 2
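    For comparison, a minimal standalone sketch that emits only the datasetProperties aspect with a human-readable name, which can help isolate whether that aspect (rather than the others in the batch) is what GMS rejects; the URN, name, and server address are placeholders.
    # Minimal sketch: emit just datasetProperties with a display name.
    # URN, name, and server address are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:s3,my/s3/path,PROD)",  # placeholder
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(
            name="Nice Human Readable Name",
            description="placeholder description",
        ),
    )
    emitter.emit_mcp(mcp)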
  • c

    clever-garden-23538

    09/01/2022, 6:15 PM
    Can the File source be used with MCPs, or just MCEs? I just discovered that MCEs are deprecated; it isn't mentioned anywhere on the File source page.
    g
    • 2
    • 1
  • c

    creamy-tent-10151

    09/01/2022, 11:30 PM
    Hi all, a member of my team was wondering whether it is possible to ingest files from an S3 data lake through IAM rather than through an Access Key ID / Secret Access Key. Thanks.
    g
    • 2
    • 2
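    The AWS-based sources use boto3 under the hood, so if the access key fields are simply omitted, ingestion typically falls back to boto3's default credential chain (IAM role / instance profile, environment variables, ~/.aws). A hedged sketch of the aws_config section is below; the region and role ARN are placeholders, and the rest of the S3 source config (path specs, profiling) is elided.
    # Hedged sketch of S3 data lake ingestion relying on IAM credentials
    # instead of explicit keys. Region/role values are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "aws_config": {
                        "aws_region": "us-east-1",  # placeholder
                        # No aws_access_key_id / aws_secret_access_key here:
                        # boto3 falls back to its default credential chain.
                        # Optionally assume a role instead:
                        # "aws_role": "arn:aws:iam::123456789012:role/datahub-ingestion",
                    },
                    # ...your path specs / profiling settings as usual...
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()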
  • a

    alert-fall-82501

    09/02/2022, 4:53 AM
    Hi team - I am ingesting data from AWS Redshift to DataHub. I have an env password for Redshift, but I am facing the below error while ingesting the data: File "/usr/local/lib/python3.7/site-packages/expandvars.py", line 122, in getenv raise UnboundVariable(var) expandvars.UnboundVariable: 'XXXX_PASSWORD: unbound variable' ... Please advise on this.
    d
    a
    • 3
    • 6
  • s

    steep-laptop-41463

    09/02/2022, 7:19 AM
    Hello! Please help me with inlets in Airflow.
    d
    • 2
    • 6
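    For reference, a sketch of the documented inlets/outlets pattern, where DataHub Dataset entities are attached to an operator so the lineage backend can emit dataset-level lineage; the DAG id, task, platforms, and table names are placeholders, and it assumes the datahub_provider package that ships with acryl-datahub[airflow].
    # Hedged sketch: attach DataHub inlets/outlets to an Airflow task so the
    # lineage backend can emit dataset-level lineage. Names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from datahub_provider.entities import Dataset

    with DAG(
        dag_id="datahub_inlets_example",
        start_date=datetime(2022, 9, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        transform = BashOperator(
            task_id="transform",
            bash_command="echo transforming...",
            # Upstream datasets this task reads from.
            inlets=[Dataset("mysql", "mydb.schema.source_table")],
            # Downstream datasets this task writes to.
            outlets=[Dataset("hive", "mydb.target_table")],
        )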