• c

    chilly-potato-57465

    2 days ago
    Hello! In our case we have huge datasets (images) stored in regular file systems and in HDFS. As far as I could see from previous questions and the source documentation, there are no plugins to ingest metadata from regular file systems or HDFS (attributes such as created/modified, ownership, size, access rights, etc., and the folder structure). Is this still the case? Additionally, I wonder how to ingest metadata (column names) from CSV files; I see that this is possible with the S3 source, is it also possible from regular file systems? Thank you!!
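    A minimal sketch of one possible workaround, assuming there is indeed no built-in file-system/HDFS source in the version in use: walk a directory tree with os.stat() and push the file attributes as custom dataset properties through the DataHub Python emitter. The platform name "file", the GMS URL and the root path are placeholders, not values from the question above.

    import os
    import stat
    from datetime import datetime, timezone

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter("http://datahub-gms:8080")  # placeholder GMS endpoint

    for root, _dirs, files in os.walk("/data/images"):  # placeholder root folder
        for file_name in files:
            path = os.path.join(root, file_name)
            st = os.stat(path)
            # Capture the basic file-system attributes as free-form custom properties.
            properties = DatasetPropertiesClass(
                name=file_name,
                customProperties={
                    "path": path,
                    "size_bytes": str(st.st_size),
                    "modified": datetime.fromtimestamp(st.st_mtime, timezone.utc).isoformat(),
                    "access_rights": stat.filemode(st.st_mode),
                    "owner_uid": str(st.st_uid),
                },
            )
            emitter.emit(
                MetadataChangeProposalWrapper(
                    entityType="dataset",
                    changeType=ChangeTypeClass.UPSERT,
                    entityUrn=make_dataset_urn(platform="file", name=path, env="PROD"),
                    aspectName="datasetProperties",
                    aspect=properties,
                )
            )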
  • l

    lemon-engine-23512

    2 days ago
    Hi team, I am trying to schedule ingestion with Airflow (managed Apache Airflow on AWS, MWAA). Here we upload the DAG and YAML files to an S3 location, but when I run the schedule in Airflow I get the error datahub.configuration.common.ConfigurationError: cannot open config file <s3 path to yaml>. Is there any way to resolve this? Thank you
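    A minimal sketch of one workaround, assuming the recipe YAML sits in the same S3 bucket MWAA already reads and that the -c flag cannot resolve s3:// paths: download the recipe with boto3 inside the DAG task and run it through DataHub's Python Pipeline API, so no local config file path is needed. Bucket, key and schedule below are placeholders.

    from datetime import datetime

    import boto3
    import yaml
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from datahub.ingestion.run.pipeline import Pipeline


    def run_recipe_from_s3():
        # Fetch the recipe that was uploaded to S3 and load it as a dict.
        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket="my-mwaa-bucket", Key="recipes/my_recipe.yml")
        config = yaml.safe_load(obj["Body"].read())
        pipeline = Pipeline.create(config)
        pipeline.run()
        pipeline.raise_from_status()


    with DAG(
        dag_id="datahub_ingestion_from_s3_recipe",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="run_ingestion", python_callable=run_recipe_from_s3)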
    16 replies
  • g

    glamorous-wire-83850

    2 days ago
    Hello, I am currently trying to ingest a multi-project GCP setup, but it only ingests “second_project”, which is set as storage_project_id. I want to ingest all the projects in GCP. What should I do?
    source:
        type: bigquery
        config:
            project_id: service_acc_project
            storage_project_id: second_project
            credential:
                project_id: service_acc_project
                private_key_id: '${BQ_PRIVATE_KEY_ID2}'
                client_email: abc-abc@service.iam.gserviceaccount.com
                private_key: '${BQ_PRIVATE_KEY2}'
                client_id: '11111111'
            include_tables: true
            include_views: true
            include_table_lineage: true
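    A minimal sketch of one workaround, assuming the bigquery source in this version only accepts a single project_id per run: loop over the projects and run the same recipe once per project through the Python Pipeline API. The project list, credentials and sink URL are placeholders; newer releases also expose options for listing or pattern-matching projects directly in the source config, so check the bigquery source docs for your version first.

    import os

    from datahub.ingestion.run.pipeline import Pipeline

    projects_to_ingest = ["service_acc_project", "second_project"]  # placeholder list

    for project in projects_to_ingest:
        recipe = {
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": project,
                    "credential": {
                        "project_id": "service_acc_project",
                        "private_key_id": os.environ["BQ_PRIVATE_KEY_ID2"],
                        "private_key": os.environ["BQ_PRIVATE_KEY2"],
                        "client_email": "abc-abc@service.iam.gserviceaccount.com",
                        "client_id": "11111111",
                    },
                    "include_tables": True,
                    "include_views": True,
                    "include_table_lineage": True,
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
        Pipeline.create(recipe).run()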
    1 reply
  • m

    microscopic-mechanic-13766

    1 week ago
    Good morning, so I was trying to ingest metadata from Kafka using the following recipe:
    source:
        type: kafka
        config:
            platform_instance: <platform_instance>
            connection:
                consumer_config:
                    security.protocol: SASL_PLAINTEXT
                    sasl.username: <user>
                    sasl.mechanism: PLAIN
                    sasl.password: <password>
                bootstrap: 'broker1:9092'
                schema_registry_url: 'http://schema-registry:8081'
    and got the following error:
    File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 98, in _read_output_lines
        line_bytes = await ingest_process.stdout.readline()
      File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline
        raise ValueError(e.args[0])
    ValueError: Separator is not found, and chunk exceed the limit
    I should mention that this recipe worked in previous versions (the current version is v0.8.44). Thanks in advance!
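    A minimal debugging sketch, assuming the ValueError comes from the UI executor's line reader hitting a very long log line rather than from the recipe itself: run the same recipe outside the executor with the Python Pipeline API (or the datahub CLI) so the underlying source error, if any, is fully visible. Credentials and the sink URL are placeholders.

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "kafka",
                "config": {
                    "platform_instance": "<platform_instance>",
                    "connection": {
                        "bootstrap": "broker1:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                        "consumer_config": {
                            "security.protocol": "SASL_PLAINTEXT",
                            "sasl.mechanism": "PLAIN",
                            "sasl.username": "<user>",
                            "sasl.password": "<password>",
                        },
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()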
    16 replies
  • f

    fresh-nest-42426

    2 days ago
    Hi all and happy Friday! We are doing a POC with DataHub, specifically ingesting table lineage from Redshift, and realized that it also automatically ingests upstream S3 COPY lineage, which is very interesting & useful. However, we have many event-level tables that receive many small batches of S3 files loaded via Kinesis Firehose, so the S3 lineage looks extremely verbose and sometimes even seems to completely break ingestion:
    File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 96, in _read_output_lines
        line_bytes = await ingest_process.stdout.readline()
      File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline
        raise ValueError(e.args[0])
    ValueError: Separator is found, but chunk is longer than limit
    I see a related thread here: https://datahubspace.slack.com/archives/CUMUWQU66/p1663143783318239. Is there a way to exclude upstream S3 lineage collection for certain Redshift tables? So far I have had to exclude such tables with extensive S3 upstreams, otherwise ingestion doesn't work. I'm using v0.8.44 and datahub actions 0.0.7. Thanks!
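    A minimal sketch of the closest existing knobs, assuming there is no per-table switch for COPY lineage and that your version exposes include_copy_lineage and table_pattern (verify both against the Redshift source docs for your release): either drop the S3 COPY lineage globally, or keep it and deny-list the Firehose-fed event tables. Connection details and patterns below are placeholders.

    from datahub.ingestion.run.pipeline import Pipeline

    recipe = {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439",
                "database": "analytics",
                "username": "datahub",
                "password": "<password>",
                "include_table_lineage": True,
                # Global toggle: removes the upstream S3 COPY edges everywhere.
                "include_copy_lineage": False,
                # Alternative: keep COPY lineage but skip the noisy event tables.
                "table_pattern": {"deny": ["public\\.events_firehose_.*"]},
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
    }

    Pipeline.create(recipe).run()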
  • g

    green-lion-58215

    2 days ago
    Quick question on dbt ingestion. I am ingesting dbt test results with run_results.json, and I can see the test cases in the DataHub UI, but it does not show any passes/failures for them; everything comes up as “no evaluations found”. For context, I am using DataHub 0.8.41 and dbt 0.21.1. Any help is appreciated.
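    A minimal sketch of the dbt source config this question describes, assuming the standard dbt artifact layout (paths, target platform and sink URL are placeholders): test_results_path is what feeds run_results.json into assertion run results, and whether pass/fail then renders in the UI also depends on the DataHub version.

    from datahub.ingestion.run.pipeline import Pipeline

    recipe = {
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": "/dbt/target/manifest.json",
                "catalog_path": "/dbt/target/catalog.json",
                "test_results_path": "/dbt/target/run_results.json",  # dbt test results
                "target_platform": "snowflake",  # placeholder warehouse platform
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
    }

    Pipeline.create(recipe).run()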
    1 reply
  • l

    limited-forest-73733

    3 days ago
    Hey team! What's the plan for the next DataHub release, i.e. 0.8.45?
    2 replies
  • g

    gray-cpu-75769

    2 days ago
    @hundreds-photographer-13496 do you have any idea about this?
    3 replies
  • l

    little-spring-72943

    1 day ago
    We are trying to build an Assertion from our DQ toolset (the Great Expectations equivalent is expect_column_sum_to_be_between) and set the following values:
    scope = DatasetAssertionScope.DATASET_COLUMN
    operator = AssertionStdOperator.BETWEEN
    aggregation = AssertionStdAggregation.SUM
    The UI shows: “Column Amount values are between 0 and 1”. The Sum (aggregation) is missing. How can we fix this, or have custom text here?
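    A minimal sketch of how such an assertion can be emitted, assuming the schema classes in datahub.metadata.schema_classes and placeholder dataset/assertion URNs; the UI text is rendered from the scope, fields, aggregation, operator and parameters shown here, so this is the full set of values the renderer has to work with.

    from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AssertionInfoClass,
        AssertionStdAggregationClass,
        AssertionStdOperatorClass,
        AssertionStdParameterClass,
        AssertionStdParametersClass,
        AssertionStdParameterTypeClass,
        AssertionTypeClass,
        ChangeTypeClass,
        DatasetAssertionInfoClass,
        DatasetAssertionScopeClass,
    )

    # Placeholder dataset; replace with the real platform/name.
    dataset_urn = make_dataset_urn(platform="postgres", name="db.schema.orders", env="PROD")

    assertion_info = AssertionInfoClass(
        type=AssertionTypeClass.DATASET,
        datasetAssertion=DatasetAssertionInfoClass(
            dataset=dataset_urn,
            scope=DatasetAssertionScopeClass.DATASET_COLUMN,
            fields=[make_schema_field_urn(dataset_urn, "Amount")],
            aggregation=AssertionStdAggregationClass.SUM,
            operator=AssertionStdOperatorClass.BETWEEN,
            parameters=AssertionStdParametersClass(
                minValue=AssertionStdParameterClass(
                    value="0", type=AssertionStdParameterTypeClass.NUMBER
                ),
                maxValue=AssertionStdParameterClass(
                    value="1", type=AssertionStdParameterTypeClass.NUMBER
                ),
            ),
            nativeType="expect_column_sum_to_be_between",
        ),
    )

    DatahubRestEmitter("http://datahub-gms:8080").emit(
        MetadataChangeProposalWrapper(
            entityType="assertion",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn="urn:li:assertion:expect_column_sum_to_be_between-orders-amount",
            aspectName="assertionInfo",
            aspect=assertion_info,
        )
    )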
    1 reply
  • c

    clean-tomato-22549

    5 days ago
    Hi team, does the profiling.partition_datetime parameter work for Presto on Hive? According to the doc https://datahubproject.io/docs/generated/ingestion/sources/presto-on-hive, “Only Bigquery supports this.” Is there a plan to support this parameter for Presto on Hive?
    3 replies