# ingestion
  • a

    ancient-policeman-73437

    08/30/2022, 10:07 AM
    Dear DataHub team, I cannot work out what to provide as base_folder for LookML. We have a fairly classic Looker structure: ProjectName/models and ProjectName/views. What does the API expect me to provide? Many thanks in advance.
    g
    • 2
    • 10
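    For context on base_folder: it is the local filesystem path where the LookML project is checked out, i.e. the directory that contains the project's models/ and views/ folders. A minimal, hypothetical sketch of a programmatic recipe is below; the repo path, connection name, platform, and server address are all placeholders, and the connection mapping shown is only one of the options the LookML source accepts.
    # Hypothetical sketch of a LookML recipe run programmatically.
    # Paths, connection names, and the server address are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "lookml",
                "config": {
                    # Root of the checked-out LookML repo: the folder that
                    # contains the project's models/ and views/ subfolders.
                    "base_folder": "/opt/looker/ProjectName",
                    # Placeholder mapping of LookML connection name -> platform;
                    # see the LookML source docs for the exact options.
                    "connection_to_platform_map": {"my_connection": "snowflake"},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()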
  • s

    silly-finland-62382

    08/30/2022, 12:19 PM
    Hey, I am using this code
    Copy code
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .enableHiveSupport() \
        .getOrCreate();
    df = spark.sql("select * from parkdatabricks_table_test1")
    but I am seeing the upstream as hdfs, not hive. Can you suggest how to show the upstream in DataHub as Hive for the same dataset using Spark lineage?
  • m

    millions-sundown-65420

    08/30/2022, 12:35 PM
    Hi. I am planning to integrate DataHub with Spark code that reads data from a MySQL database and writes the transformed data to a Mongo collection. Is there any simple end-to-end code example for integrating DataHub with Spark that I could take a look at? Thanks.
    g
    • 2
    • 4
  • m

    modern-monitor-68945

    08/30/2022, 1:07 PM
    Hi! A question regarding the Airflow integration again. There is a link on the pipeline screen (screenshot in the thread) which should take a user to the DAG in question. How can we pass the correct address for each Airflow instance (we have several)?
    d
    • 2
    • 3
  • n

    narrow-toothbrush-13209

    08/30/2022, 1:11 PM
    Hi! A question: pushing data from an Airflow task throws an error during metadata ingestion. Error:
    Copy code
    Task exited with return code Negsignal.SIGSEGV
    d
    g
    b
    • 4
    • 10
  • s

    sparse-advantage-78335

    08/30/2022, 1:30 PM
    Hello community, I'm just starting with DataHub and would like to understand ingestion of file-based data sources. Basically, one of my data sources is PDFs generated overnight. Can I monitor the arrival of the files with DataHub? Or can I only create a 'file source' like here: https://datahubproject.io/docs/generated/ingestion/sources/file to have my PDF source visible in DataHub?
    g
    • 2
    • 2
  • b

    big-barista-70811

    08/30/2022, 1:44 PM
    Hello guys, good morning. I can't find the following in the DataHub docs: what permissions do I have to grant to the user I created in the Oracle database for DataHub to work properly?
    b
    • 2
    • 2
  • l

    lemon-engine-23512

    08/30/2022, 3:55 PM
    Hello all, can I test my custom transformer code and print its output before I put it in my recipe? How do I pass the MCE event class to it?
    g
    • 2
    • 4
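    A rough, hypothetical harness for exercising a custom transformer outside a recipe: build a minimal MCE with the SDK classes, wrap it in a RecordEnvelope, and call transform() directly. MyTransformer, its module, and the dataset URN below are placeholders for your own code.
    # Hypothetical test harness for a custom transformer (outside a recipe).
    # MyTransformer and its config dict are placeholders for your own class.
    import datahub.emitter.mce_builder as builder
    from datahub.ingestion.api.common import PipelineContext, RecordEnvelope
    from datahub.metadata.schema_classes import (
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        StatusClass,
    )

    from my_transformers import MyTransformer  # placeholder import

    # Build a minimal MCE to feed through the transformer.
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn=builder.make_dataset_urn("hive", "db.table", "PROD"),
            aspects=[StatusClass(removed=False)],
        )
    )

    ctx = PipelineContext(run_id="transformer-test")
    transformer = MyTransformer.create({}, ctx)  # pass your config dict here

    # Transformers consume and produce streams of RecordEnvelopes.
    for envelope in transformer.transform([RecordEnvelope(record=mce, metadata={})]):
        print(envelope.record)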
  • s

    silly-finland-62382

    08/30/2022, 5:41 PM
    Hey, I am using this code
    Copy code
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .enableHiveSupport() \
        .getOrCreate();
    df = spark.sql("select * from parkdatabricks_table_test1")
    but I am seeing the upstream as hdfs, not hive. Can you suggest how to show the upstream in DataHub as Hive for the same dataset using Spark lineage? (edited)
  • b

    brave-nail-85388

    08/30/2022, 8:13 PM
    How do I ingest from Snowflake with SSO login via a snowflake-usage YAML recipe?
    l
    • 2
    • 1
  • b

    brave-nail-85388

    08/30/2022, 8:13 PM
    Hi team, I need your help on how to ingest from Snowflake with SSO login via a snowflake-usage YAML recipe.
  • c

    cool-actor-73767

    08/30/2022, 3:36 PM
    Hi everyone! Is there any way to pass more than one database name in the YAML for Athena ingestion from the UI at the same time? I need to load some specific databases from Athena - database_A, database_B, ... If I remove the database parameter from the YAML, all databases are loaded; if I set the database parameter, only one database at a time is loaded.
    g
    h
    • 3
    • 2
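    One possible approach, assuming the Athena source honours the generic schema_pattern allow/deny filter that the SQLAlchemy-based sources share (worth confirming against the Athena source docs): drop the single database parameter and allow-list the databases you want. Sketched below as a programmatic recipe; the same keys apply in a YAML recipe, and the database names are placeholders.
    # Hypothetical sketch: ingest several specific Athena databases by
    # allow-listing them instead of using the single `database` parameter.
    # Assumes schema_pattern is honoured by the Athena source.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "athena",
                "config": {
                    "aws_region": "us-east-1",  # placeholder
                    "work_group": "primary",    # placeholder
                    # ...plus your query result location / credentials as in
                    # your existing recipe...
                    # Omit `database` and allow-list databases instead:
                    "schema_pattern": {
                        "allow": ["^database_A$", "^database_B$"],
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()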
  • f

    few-carpenter-93837

    08/31/2022, 5:35 AM
    Hey, a question regarding issue #5295: does anyone have an idea whether this can be fixed from our side? The issue is with _get_column_info in vertica.py. For timestamptz & timestamp there is an argument called precision, but the current import (sqlalchemy.sql import sqltypes) does not accept a precision argument in class TIMESTAMP(DateTime), only timezone. A dirty fix on our side was to just add precision=None and self.precision = precision to the class, which now gives correct output, but as you might expect, patching our custom code over the sqlalchemy dependency in the CLI codebase isn't a perfect solution. Any ideas or directions on how to tackle this would be appreciated.
    g
    • 2
    • 4
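    A rough sketch of applying that same workaround as a runtime monkeypatch from your own code (before the ingestion runs), instead of editing the installed sqlalchemy package; the attribute names simply mirror the fix described above and are not an official API.
    # Rough sketch: accept and store the `precision` argument on
    # sqlalchemy's TIMESTAMP at runtime, mirroring the fix described above.
    # Apply this before running the Vertica ingestion.
    from sqlalchemy.sql import sqltypes

    _original_init = sqltypes.TIMESTAMP.__init__

    def _patched_init(self, timezone=False, precision=None):
        # Keep sqlalchemy's behaviour, but tolerate the extra argument
        # that vertica.py passes for timestamp / timestamptz columns.
        _original_init(self, timezone=timezone)
        self.precision = precision

    sqltypes.TIMESTAMP.__init__ = _patched_init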
  • f

    flaky-soccer-57765

    08/31/2022, 8:15 AM
    Hey all, morning. I am trying to ingest data from an MS SQL Server running on another network server. DataHub throws an error stating the connection was refused; however, I am able to connect to that SQL Server through SSMS. Attaching logs below. Can you suggest anything, please?
    c
    • 2
    • 1
  • b

    better-orange-49102

    08/31/2022, 10:13 AM
    I am trying to programmatically create a pipeline to GMS where I will: 1. receive a JSON file containing the DataHub-compliant metadata and do some custom logic on it; 2. read the JSON file and call obj.validate on the JSON objects; 3. create a pipeline to ingest the JSON file to GMS. Can I check whether there is a way to verify that (3) will succeed without actually ingesting into GMS? Does the dry-run flag in the Pipeline class work here? My end goal is to either succeed completely (all JSON objects ingested) or fail if even one object fails in step 2 or 3. I do not want a half-ingested state where some aspects get rejected by GMS. As to why I am doing this wacky flow, refer to https://datahubspace.slack.com/archives/C02FKQAGRG9/p1661327597785869?thread_ts=1661191206.692289&cid=C02FKQAGRG9
    g
    g
    • 3
    • 35
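    A rough sketch of the two-phase idea described above, assuming the file holds a list of MCE objects and using the obj.validate() call mentioned in the message: deserialize and validate everything first, and only build the pipeline if every object passes. Note this does not by itself make step 3 atomic on the GMS side; the file path and server address are placeholders.
    # Sketch: validate every object up front, then ingest the file only if
    # all of them pass. File path and server address are placeholders.
    import json

    from datahub.ingestion.run.pipeline import Pipeline
    from datahub.metadata.schema_classes import MetadataChangeEventClass

    MCE_FILE = "metadata.json"  # placeholder

    with open(MCE_FILE) as f:
        raw_objects = json.load(f)

    # Phase 1: fail fast if any object does not deserialize/validate.
    for i, obj in enumerate(raw_objects):
        try:
            MetadataChangeEventClass.from_obj(obj).validate()
        except Exception as e:
            raise SystemExit(f"object {i} failed validation, aborting: {e}")

    # Phase 2: only now hand the file to a pipeline that writes to GMS.
    pipeline = Pipeline.create(
        {
            # The file source config key may be `path` in newer CLI versions.
            "source": {"type": "file", "config": {"filename": MCE_FILE}},
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()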
  • b

    bright-receptionist-94235

    08/31/2022, 11:02 AM
    Hi all! We are starting to test Vertica ingestion. The process is very slow since it queries Vertica for each table separately, e.g.: "SELECT column_name, data_type, column_default, is_nullable FROM v_catalog.columns WHERE lower(table_name) = 'XXX' AND lower(table_schema) = 'analysts' UNION ALL SELECT column_name, data_type, '' as column_default, true as is_nullable FROM v_catalog.view_columns WHERE lower(table_name) = 'XXX' AND lower(table_schema) = 'analysts'". Vertica metadata ingestion is not as fast as MySQL. Why not get all the information in a single query and iterate at the cursor level in the code? It would be much faster and the correct way to work with Vertica.
    g
    • 2
    • 2
  • s

    silly-finland-62382

    08/31/2022, 4:12 PM
    Hey, I am using this code
    Copy code
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .config("spark.datahub.metadata.dataset.platformInstance", "dataset") \
        .enableHiveSupport() \
        .getOrCreate();
    df = spark.sql("select * from parkdatabricks_table_test1")
    but I am seeing the upstream as hdfs, not hive. Can you suggest how to show the upstream in DataHub as Hive for the same dataset using Spark lineage? (edited)
    w
    • 2
    • 4
  • k

    kind-whale-32412

    08/31/2022, 8:01 PM
    I am having this problem while writing my custom ingestor in Java: https://datahubspace.slack.com/archives/C029A3M079U/p1661976075687699
  • l

    lemon-engine-23512

    08/31/2022, 9:25 PM
    Can we have transformers with mcp instead of mce?
    g
    • 2
    • 4
  • f

    full-chef-85630

    08/31/2022, 9:07 AM
    Hi all, for tasks performed using Airflow, how do we manage the "properties" and the "View in Airflow" link?
    g
    • 2
    • 1
  • j

    jolly-traffic-67085

    09/01/2022, 7:34 AM
    Hi team, I have a question I would like to ask: I would like to ingest a certificated database into DataHub. Can this be done?
    h
    • 2
    • 3
  • h

    hallowed-kilobyte-916

    09/01/2022, 12:54 PM
    I have a recipe for Glue as follows:
    Copy code
    source:
      type: glue
      config:
        aws_region: ${aws_region}
        aws_access_key_id: ${aws_access_key_id}
        aws_secret_access_key: ${aws_secret_access_key}
    I created a .env file where I defined the environment variables like aws_region. How do I reference the .env file? I can't seem to find any documentation on this.
    h
    • 2
    • 2
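    The ${...} placeholders in a recipe are expanded from the process environment, so the .env file is not read automatically. One hypothetical option is to load it yourself (e.g. with python-dotenv) and run the recipe programmatically, as sketched below with the same variable names as the recipe above; when using the CLI directly, exporting the variables in the shell before running datahub ingest achieves the same thing.
    # Hypothetical sketch: load a .env file into the environment, then build
    # the Glue recipe from those variables (names mirror the recipe above).
    import os

    from dotenv import load_dotenv  # pip install python-dotenv
    from datahub.ingestion.run.pipeline import Pipeline

    load_dotenv(".env")  # makes aws_region etc. available via os.environ

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "glue",
                "config": {
                    "aws_region": os.environ["aws_region"],
                    "aws_access_key_id": os.environ["aws_access_key_id"],
                    "aws_secret_access_key": os.environ["aws_secret_access_key"],
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()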
  • m

    millions-sundown-65420

    09/01/2022, 1:39 PM
    Hi. Is there some example code for integrating Spark code with DataHub to emit metadata events? Thanks. I just have this at the moment, but I'm not sure how DataHub listens to changes in my destination database and emits metadata.
    Copy code
    sparkSession = SparkSession.builder \
              .appName("Events to Datahub") \
              .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23") \
              .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
              .config("spark.datahub.rest.server", "http://localhost:9002") \
              .enableHiveSupport() \
              .getOrCreate()
    h
    d
    • 3
    • 15
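    A rough end-to-end sketch building on the configuration above: with the listener registered, lineage is captured from the reads and writes the job performs, so no extra emitting code is needed in the job itself. The JDBC URL, table, and Mongo settings below are placeholders, and the Mongo write assumes the MongoDB Spark connector (v10-style options) is on the classpath.
    # Rough end-to-end sketch based on the configuration above. Connection
    # details are placeholders; the DataHub listener captures lineage from
    # the reads and writes this job performs.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("Events to Datahub")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:9002")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read from MySQL over JDBC (placeholder connection details).
    src = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql-host:3306/mydb")
        .option("dbtable", "orders")
        .option("user", "user")
        .option("password", "password")
        .load()
    )

    transformed = src.filter("amount > 0")  # your transformation logic here

    # Write to MongoDB (format/option names assume the v10 mongo-spark
    # connector; older connectors use format "mongo" and spark.mongodb.* conf).
    (
        transformed.write.format("mongodb")
        .option("connection.uri", "mongodb://mongo-host:27017")
        .option("database", "analytics")
        .option("collection", "orders_clean")
        .mode("append")
        .save()
    )

    spark.stop()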
  • l

    little-spring-72943

    09/01/2022, 4:00 PM
    How can I delete all lineage without deleting the objects (URNs) themselves?
    h
    • 2
    • 1
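    One possible (hedged) approach: overwrite the upstreamLineage aspect of each affected dataset with an empty upstream list via the REST emitter, which removes the lineage edges while leaving the entities themselves in place. The URN list and server address below are placeholders; other lineage aspects (e.g. dataJob inputs/outputs) would need the same treatment.
    # Sketch: clear dataset-to-dataset lineage by overwriting the
    # upstreamLineage aspect with an empty list of upstreams.
    # URNs and server address are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, UpstreamLineageClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    dataset_urns = [
        "urn:li:dataset:(urn:li:dataPlatform:hive,db.downstream_table,PROD)",  # placeholder
    ]

    for urn in dataset_urns:
        mcp = MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=urn,
            aspectName="upstreamLineage",
            aspect=UpstreamLineageClass(upstreams=[]),
        )
        emitter.emit_mcp(mcp)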
  • r

    rapid-fall-7147

    09/01/2022, 4:34 PM
    Hello everyone, is there a way in DataHub to ingest field-level descriptions for DB sources like Delta Lake or Redshift through YAML, using a transformer? Any ideas?
    h
    • 2
    • 4
  • l

    late-truck-7887

    09/01/2022, 5:14 PM
    Hi DataHub team, new to DataHub here. I want to store a dataset name that is distinct from the URN; i.e., in my use case the URN is a path on S3, but the goal is to have the dataset name in DataHub be a nice human-readable name. What I have tried so far (in Python):
    Copy code
    aspects = [
            DatasetPropertiesClass(
                name=nice_human_readable_name,
                customProperties=properties,
                description=description,
                externalUrl=url
            ),
        ]
    or alternatively:
    Copy code
    aspects = [
            DatasetPropertiesClass(
                qualifiedName=nice_human_readable_name,
                customProperties=properties,
                description=description,
                externalUrl=url
            ),
        ]
    The MCPs that are generated (which I pass to connection.submit_change_proposals) have the proper format:
    Copy code
    [MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='ownership', aspect=OwnershipClass({'owners': [OwnerClass({'owner': 'urn:li:corpuser:etl', 'type': 'DATAOWNER', 'source': OwnershipSourceClass({'type': 'SERVICE', 'url': None})})], 'lastModified': AuditStampClass({'time': 1661399154, 'actor': 'urn:li:corpuser:etl', 'impersonator': None, 'message': None})}), systemMetadata=None), MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='datasetProperties', aspect=DatasetPropertiesClass({'customProperties': {'here3567c322-fd92-4417-98f0-90a66e32101b': 'are some fake properties', 'that_are': 'used_for_testing'}, 'externalUrl': None, 'name': 'test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b', 'qualifiedName': None, 'description': 'This is a fake description of a dataset', 'uri': None, 'tags': []}), systemMetadata=None), MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='institutionalMemory', aspect=InstitutionalMemoryClass({'elements': [InstitutionalMemoryMetadataClass({'url': '<https://www.google.com/>', 'description': 'link3567c322-fd92-4417-98f0-90a66e32101b', 'createStamp': AuditStampClass({'time': 1661399154, 'actor': 'urn:li:corpuser:etl', 'impersonator': None, 'message': None})})]}), systemMetadata=None), MetadataChangeProposalWrapper(entityType='dataset', changeType='UPSERT', entityUrn='urn:li:dataset:(urn:li:dataPlatform:s3,test_s3_dataset3567c322-fd92-4417-98f0-90a66e32101b,PROD)', entityKeyAspect=None, auditHeader=None, aspectName='globalTags', aspect=GlobalTagsClass({'tags': [TagAssociationClass({'tag': 'urn:li:tag:tag13567c322-fd92-4417-98f0-90a66e32101b', 'context': None}), TagAssociationClass({'tag': 'urn:li:tag:tag_23567c322-fd92-4417-98f0-90a66e32101b', 'context': None})]}), systemMetadata=None)]
    but then I get this rather cryptic error message (see attached screenshot). Any advice appreciated! Thanks!
    g
    • 2
    • 2
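    For comparison, a minimal standalone sketch that emits only the datasetProperties aspect with a human-readable name, which can help isolate whether that aspect (rather than the others in the batch) is what GMS rejects; the URN, name, and server address are placeholders.
    # Minimal sketch: emit just datasetProperties with a display name.
    # URN, name, and server address are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:s3,my/s3/path,PROD)",  # placeholder
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(
            name="Nice Human Readable Name",
            description="placeholder description",
        ),
    )
    emitter.emit_mcp(mcp)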
  • c

    clever-garden-23538

    09/01/2022, 6:15 PM
    Can the File source be used with MCPs, or just MCEs? I just discovered that MCEs are deprecated; it isn't mentioned anywhere on the File source page.
    g
    • 2
    • 1
  • c

    creamy-tent-10151

    09/01/2022, 11:30 PM
    Hi all, a member of my team was wondering whether it is possible to ingest files from an S3 data lake through IAM rather than through an Access Key ID / Secret Access Key. Thanks.
    g
    • 2
    • 2
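    The AWS-based sources use boto3 under the hood, so if the access key fields are simply omitted, ingestion typically falls back to boto3's default credential chain (IAM role / instance profile, environment variables, ~/.aws). A hedged sketch of the aws_config section is below; the region and role ARN are placeholders, and the rest of the S3 source config (path specs, profiling) is elided.
    # Hedged sketch of S3 data lake ingestion relying on IAM credentials
    # instead of explicit keys. Region/role values are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "aws_config": {
                        "aws_region": "us-east-1",  # placeholder
                        # No aws_access_key_id / aws_secret_access_key here:
                        # boto3 falls back to its default credential chain.
                        # Optionally assume a role instead:
                        # "aws_role": "arn:aws:iam::123456789012:role/datahub-ingestion",
                    },
                    # ...your path specs / profiling settings as usual...
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()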
  • a

    alert-fall-82501

    09/02/2022, 4:53 AM
    Hi team - I am ingesting data from AWS Redshift to DataHub. I have an env password for Redshift, but I am facing the below error while ingesting the data: File "/usr/local/lib/python3.7/site-packages/expandvars.py", line 122, in getenv raise UnboundVariable(var) expandvars.UnboundVariable: 'XXXX_PASSWORD: unbound variable' ... Please advise on this.
    d
    a
    • 3
    • 6
  • s

    steep-laptop-41463

    09/02/2022, 7:19 AM
    Hello! Please help me with inlets in Airflow.
    d
    • 2
    • 6
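    For reference, a sketch of the documented inlets/outlets pattern, where DataHub Dataset entities are attached to an operator so the lineage backend can emit dataset-level lineage; the DAG id, task, platforms, and table names are placeholders, and it assumes the datahub_provider package that ships with acryl-datahub[airflow].
    # Hedged sketch: attach DataHub inlets/outlets to an Airflow task so the
    # lineage backend can emit dataset-level lineage. Names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from datahub_provider.entities import Dataset

    with DAG(
        dag_id="datahub_inlets_example",
        start_date=datetime(2022, 9, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        transform = BashOperator(
            task_id="transform",
            bash_command="echo transforming...",
            # Upstream datasets this task reads from.
            inlets=[Dataset("mysql", "mydb.schema.source_table")],
            # Downstream datasets this task writes to.
            outlets=[Dataset("hive", "mydb.target_table")],
        )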