# ingestion
  • w

    wonderful-hair-89448

    12/16/2022, 4:35 PM
    oh okay, will try this way. Thank you @acceptable-account-83031.
    👍🏼 1
  • a

    astonishing-answer-96712

    12/21/2022, 7:29 PM
    Hi @cuddly-dinner-641, thanks for the check-in. This video may be helpful to show the scope of ways that you can enrich metadata via ingestion and the UI:

    https://www.youtube.com/watch?v=xzYJ2lMJraY&t=2s

  • a

    aloof-energy-17918

    12/23/2022, 3:30 AM
    I tried it and got the Fields tab in the chart to show, but when I click on it the page just turns white. Here is the code I used.
    # Imports and emitter setup (added so the snippet runs on its own);
    # the GMS URL is the usual local default and may differ in your setup.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        InputFieldClass,
        InputFieldsClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")

    # Build the inputFields aspect pointing the chart at one upstream schema field.
    input_fields = []
    input_fields.append(
        InputFieldClass(
            builder.make_schema_field_urn(
                builder.make_dataset_urn(platform='mssql', name='db_name.dbo.table_name', env='PROD'),
                'Data_day'
            )
        )
    )
    input_fields_aspect = InputFieldsClass(
        fields=input_fields
    )
    chart_inputfield_mcp = MetadataChangeProposalWrapper(
        entityType="chart",
        entityUrn=builder.make_chart_urn(platform='looker', name='Daily-Revenue'),
        changeType=ChangeTypeClass.UPSERT,
        aspectName="inputFields",
        aspect=input_fields_aspect,
    )
    emitter.emit_mcp(chart_inputfield_mcp)
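    A rough sketch (not from the thread): one guess at the blank page is that InputFieldClass also has an optional schemaField, and the chart's Fields tab may need it populated, not just the URN. The same emission with that field filled in; the field type, native type, and GMS URL below are assumptions.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DateTypeClass,
        InputFieldClass,
        InputFieldsClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # assumed local GMS

    input_fields_aspect = InputFieldsClass(
        fields=[
            InputFieldClass(
                schemaFieldUrn=builder.make_schema_field_urn(
                    builder.make_dataset_urn(platform='mssql', name='db_name.dbo.table_name', env='PROD'),
                    'Data_day'
                ),
                # Assumption: 'Data_day' is a date column; adjust type/nativeDataType
                # to match the dataset's actual schema.
                schemaField=SchemaFieldClass(
                    fieldPath='Data_day',
                    type=SchemaFieldDataTypeClass(type=DateTypeClass()),
                    nativeDataType='date',
                ),
            )
        ]
    )
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityUrn=builder.make_chart_urn(platform='looker', name='Daily-Revenue'),
            aspect=input_fields_aspect,  # entityType/aspectName are inferred from the aspect
        )
    )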
  • t

    thankful-fireman-70616

    12/25/2022, 5:50 PM
    Has anyone tried ingestion from a local Spark / Delta Lake set-up?
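    A rough sketch (not an answer from the thread): a local Delta Lake table can be pulled in with the delta-lake source via the Python pipeline API, assuming `pip install 'acryl-datahub[delta-lake]'`; the table path and GMS URL are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "delta-lake",
                "config": {
                    # Placeholder: directory of the local Delta table (the parent of _delta_log).
                    "base_path": "/data/delta/my_table",
                    "env": "PROD",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()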
  • t

    thankful-fireman-70616

    12/27/2022, 7:05 AM
    I'm not sure if you are aware of Palantir's ontology - they have such capabilities. You can refer to the image.
  • l

    lively-dusk-19162

    01/03/2023, 10:51 PM
    Even though Airflow is installed in the venv, when I try to run a DAG it shows ModuleNotFoundError: No module named airflow.
  • l

    lively-dusk-19162

    01/03/2023, 10:53 PM
    I didn't properly understand how to use Airflow exactly, even after going through the documentation. My query is: I have already written a Python SDK script that emits fine-grained lineage without Airflow. Now I need to make it work with Airflow first, and then with data validation (Great Expectations).
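    A rough sketch (not from the thread): since inlets/outlets only capture table-level lineage, your existing fine-grained emitter script can simply be called from an Airflow task; the DAG below sketches the inlets/outlets route instead, assuming the acryl-datahub Airflow lineage backend/plugin is installed and configured. The DAG id, datasets, and command are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from datahub_provider.entities import Dataset  # ships with acryl-datahub[airflow]

    with DAG(
        dag_id="lineage_example",          # placeholder
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # The DataHub lineage backend reads inlets/outlets and emits table-level lineage.
        transform = BashOperator(
            task_id="transform",
            bash_command="echo transform",
            inlets=[Dataset("mssql", "db_name.dbo.source_table")],   # placeholder upstream
            outlets=[Dataset("mssql", "db_name.dbo.target_table")],  # placeholder downstream
        )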
  • l

    lively-dusk-19162

    01/03/2023, 10:56 PM
    And also, when I try to install apache-airflow, it gives me an HTTPSConnectionPool error. Could you please help me with that, @dazzling-judge-80093? Thank you so much in advance.
  • p

    polite-actor-701

    01/04/2023, 12:19 AM
    Actually I created a new source. I have an Oracle DB that stores metadata, and I select data from that DB. I yielded SqlWorkUnits by creating dataset snapshots, referring to the sql_common.py file. When using datahub-kafka for the sink, it took about 45 minutes to ingest 80k records, but indexing is taking a long time. If that is normal, can you tell me how long it takes to complete indexing when ingesting 80k records?
  • l

    lively-dusk-19162

    01/04/2023, 8:02 PM
    Could anyone please help me with the above issue?
  • l

    lively-dusk-19162

    01/05/2023, 9:19 PM
    No, there is no error. I have Airflow pipelines which I am trying to delete. When I do so, it says there are no URNs to delete.
  • l

    lively-dusk-19162

    01/05/2023, 9:22 PM
    datahub delete --env PROD --entity_type "datajob"
    datahub delete --env PROD --entity_type "dataflow"
    These are the two commands I used to delete the Airflow pipelines and tasks.
  • a

    abundant-television-56673

    01/07/2023, 10:11 PM
    Necro-bumping this a bit… but I've had the same issue. I've found that stopping the GMS pod, truncating the MySQL database (I'm running it in RDS) and deleting all indices in ES (I'm using AWS OpenSearch) seems to cause chaos, and the GMS pod won't restart. I did find that then running the datahub-mysql-setup docker image and the Elasticsearch setup one (with the correct env vars) in my cluster seemed to set them back up correctly, and I can start the GMS pod. This half works… It seems some users etc. sometimes hang around in the settings, but I can't figure out why yet (could be something to do with my Azure AD ingestion). It would be nice if there were a docker file or something that could be run to reset DataHub to factory settings (I'm currently putting this together myself, but I imagine it'll be very bespoke to my setup in k8s).
  • a

    astonishing-cartoon-6079

    01/11/2023, 8:55 AM
    image.png
  • s

    salmon-psychiatrist-4013

    01/13/2023, 7:42 AM
    Hello Team, I was able to figure this one out myself and the solution is working perfectly. Here is what the source section in the recipe looks like:
    source:
      type: vertica
      config:
        # Coordinates
        host_port: host:5433
        database: DB
    
        # Credentials
        username: "user"
        password: "password"
    
        options:
          connect_args:
            "ssl":
              "ssl_ca": "ca.pem"
              "ssl_cert": "client-cert.pem"
              "ssl_key": "client-key.pem"
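    A note for anyone reusing this (not from the thread): in the SQLAlchemy-based sources, whatever sits under `options` is passed through to SQLAlchemy's create_engine, so the recipe above amounts to roughly the following; the engine URL is a placeholder.
    from sqlalchemy import create_engine

    # Rough equivalent of the `options` block above, for illustration only.
    engine = create_engine(
        "vertica+vertica_python://user:password@host:5433/DB",  # placeholder URL
        connect_args={
            "ssl": {
                "ssl_ca": "ca.pem",
                "ssl_cert": "client-cert.pem",
                "ssl_key": "client-key.pem",
            }
        },
    )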
  • s

    salmon-psychiatrist-4013

    01/13/2023, 7:44 AM
    Wondering if this info could be propagated to a wider audience out there who might want to do something similar for other sources too!
  • p

    polite-actor-701

    01/17/2023, 12:54 AM
    It wasn't Tableau's problem. There was another problem during Oracle ingestion. I don't know if the attached picture will look good, but there was a log saying to reduce max.poll.interval.ms, so I added consumerProps.setMaxPollRecords as in the second picture. Is it right that I handled it well? When I set it to 200, I get the same error, so I set it to 50 and am trying to test again. Please advise if you have any other opinions.
  • c

    creamy-machine-95935

    01/18/2023, 6:07 PM
    Hi @astonishing-answer-96712, what I mean is that I am trying to create a LookML recipe and it only has options for Looker repositories on GitHub (image below).
  • p

    polite-actor-701

    01/19/2023, 6:27 AM
    The attached file is a log file related to DataHub ingestion. It's not the log from the ingestion shown in the previously posted image, but even when this log was created, not a single piece of data was indexed. I haven't been able to determine which data is not being indexed. And as for the GMS and Elasticsearch logs, too many logs had piled up afterwards, so I couldn't extract the logs from that point in time separately.
    ingest.txt
  • w

    white-pillow-1041

    01/27/2023, 7:39 PM
    Related to this topic, for evaluation purposes I have installed DataHub and am running it locally, and I have ingested data sources (test environments). I'm interested in whether the metadata content is stored locally on my machine, or whether it could be stored outside our organization. For example, the following content that displays the field names and sample values under Stats: is that stored locally or not?
  • w

    white-pillow-1041

    01/27/2023, 10:47 PM
    When reviewing the documentation, the main purpose of the Actions Framework appears to be getting data out of DataHub itself. I'm not finding specifics on how to get data from changes to the actual underlying source, for example to objects within Snowflake or PostgreSQL databases. Is there that capability? If so, could you point me in the right direction? Much appreciated!
  • t

    thousands-bird-50049

    01/29/2023, 11:40 AM
    could you possibly tell me, or point to where in the code, what governs which dependencies those virtual environments install? I see each ingestion source type has its own venv, but I have not found the reference requirements.txt or equivalent. Is it base-requirements.txt?
  • v

    victorious-evening-88418

    01/30/2023, 2:49 PM
    Hi all, I have the same issue with "mssql" ingestion and the following profiling settings:
    profiling:
      enabled: true
      profile_table_level_only: false
    Below is the error:
    return list(query.columns)
    AttributeError: 'Insert' object has no attribute 'columns'
    [2023-01-30 15:39:03,405] ERROR {datahub.utilities.sqlalchemy_query_combiner:250} - Failed to execute query normally, using fallback: CREATE TABLE "#ge_temp_5e01c147" ( condition INTEGER NOT NULL )
    Traceback (most recent call last):
      File "/home/dani/.local/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 246, in _sa_execute_fake
        handled, result = self._handle_execute(conn, query, args, kwargs)
      File "/home/dani/.local/lib/python3.9/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 211, in _handle_execute
    Thanks in advance for your help!
  • f

    fierce-monkey-46092

    02/03/2023, 5:19 AM
    hello @hundreds-photographer-13496, my ingestion recipe looks like the one below, but I'm getting:
    1 validation error for PipelineConfig
    profiling
      extra fields not permitted (type=value_error.extra)
    source:
      type: oracle
      config:
        host_port: DB-IP-ADDRESS:1521
        platform_instance: cec
        # Creds
        username: DB_USER
        password: "PASSWORD"
        # Options
        service_name: DB_SCHEMA
        schema_pattern:
          allow:
            - "TABLE_NAME"
            - "TABLE_NAME"
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    transformers:
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:cec"
    profiling:
      enabled: True
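    A rough sketch (not from the thread): the PipelineConfig validation error usually means `profiling` is sitting at the recipe root, where it is not a recognized key; it belongs under source.config. The same pipeline with that change, written against the Python pipeline API and reusing the placeholders from the recipe above:
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "oracle",
                "config": {
                    "host_port": "DB-IP-ADDRESS:1521",
                    "platform_instance": "cec",
                    "username": "DB_USER",
                    "password": "PASSWORD",
                    "service_name": "DB_SCHEMA",
                    "schema_pattern": {"allow": ["TABLE_NAME"]},
                    "profiling": {"enabled": True},  # moved under source.config
                },
            },
            "transformers": [
                {
                    "type": "simple_add_dataset_tags",
                    "config": {"tag_urns": ["urn:li:tag:cec"]},
                }
            ],
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()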
  • v

    victorious-evening-88418

    02/09/2023, 5:42 PM
    Hi Mayuri N, sorry for the delay. The SQL server is a SQL Managed Instance on Azure, and the problem was solved by the following recipe:
    $ cat sqlmi_#############################_dev.yaml
    source:
      type: mssql
      config:
        host_port: sql#############################
        database: sql#############################
        username: #############################
        include_views: true
        include_tables: true
        env: 'DEV'
        domain:
          "#############################":
            allow:
              - ".*"
        profiling:
          enabled: true
          profile_table_level_only: false
        stateful_ingestion:
          enabled: true
        password: '#############################'
        # Options
        use_odbc: "True"
        uri_args:
          driver: "ODBC Driver 18 for SQL Server"
          Encrypt: "yes"
          TrustServerCertificate: "Yes"
          ssl: "True"
    pipeline_name: sql#############################
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'
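    A small side note (not from the thread): a saved recipe like this can also be run programmatically, which mirrors what `datahub ingest -c <file>` does; a minimal sketch, keeping the masked filename above as the placeholder.
    import yaml
    from datahub.ingestion.run.pipeline import Pipeline

    # Load the recipe file and run it as a pipeline.
    with open("sqlmi_#############################_dev.yaml") as f:  # placeholder filename
        recipe = yaml.safe_load(f)

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()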
  • v

    victorious-evening-88418

    02/10/2023, 1:17 PM
    Thanks to you! 🙂
  • v

    victorious-evening-88418

    02/13/2023, 2:05 PM
    Hello Rama Teja Reddy, below is my recipe that works fine with DataHub version 0.10.0 ('gms_version': 'v0.9.6'):
    source:
      type: unity-catalog
      config:
        workspace_url: 'https://#################.azuredatabricks.net'
        include_table_lineage: true
        include_column_lineage: true
        stateful_ingestion:
          enabled: true
        token: #################################
        env: 'DEV'
    pipeline_name: 'dbrk_unity_dev'
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'
    You could try to insert "pipeline_name: 'pipeline_name'" in your recipe or upgrade the DataHub version.
  • v

    victorious-evening-88418

    02/14/2023, 3:34 PM
    Dear all, is there someone who can help me with the ingestion of CSV and Parquet files from an Azure Storage Account? Thanks in advance for your support.
  • v

    victorious-evening-88418

    02/14/2023, 4:11 PM
    Hello, thanks for the prompt answer. We are storing a lot of information in our Azure storage account, basically CSV files in the raw layer and Parquet in the bronze and silver layers. This is the reason why it is not so easy to extract and store them in a file system to start the ingestion process.
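    A rough sketch (not from the thread): if no built-in Azure Blob/ADLS connector fits your version, one pragmatic stopgap is to read each file's schema straight from the storage account (pandas + adlfs over abfs://) and emit it with the Python SDK. The account, container, paths, credentials, and the platform name below are placeholders/assumptions.
    import pandas as pd

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    # Requires `pip install adlfs`; storage_options go through fsspec to the storage account.
    df = pd.read_csv(
        "abfs://raw/some_folder/some_file.csv",                                 # placeholder path
        storage_options={"account_name": "myaccount", "account_key": "***"},    # placeholders
        nrows=100,  # a small sample is enough to infer the columns
    )

    fields = [
        SchemaFieldClass(
            fieldPath=str(col),
            # Simplification: every column is emitted as a string type; map pandas dtypes properly in real use.
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType=str(dtype),
        )
        for col, dtype in df.dtypes.items()
    ]

    dataset_urn = builder.make_dataset_urn("adlsGen2", "raw/some_folder/some_file.csv", "PROD")
    schema_aspect = SchemaMetadataClass(
        schemaName="some_file.csv",                            # placeholder
        platform=builder.make_data_platform_urn("adlsGen2"),   # assumed platform id
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=fields,
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema_aspect)
    )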
  • v

    victorious-evening-88418

    02/14/2023, 4:39 PM
    Ok, I'll do that. Thank you for the suggestion.