# ingestion
  • s

    sparse-barista-40860

    07/19/2022, 1:29 AM
    For big data storage: Cassandra, MongoDB, Hadoop.
    c
    • 2
    • 1
  • s

    sparse-barista-40860

    07/19/2022, 1:29 AM
    Which one is currently supported for exposing metadata?
  • l

    lemon-terabyte-66903

    07/19/2022, 2:57 AM
    https://datahubspace.slack.com/archives/CV2UVAPPG/p1658199463049969
    m
    • 2
    • 1
  • m

    melodic-ability-49840

    07/19/2022, 3:49 AM
    Hello, all. I have an HTTP issue when deleting metadata. I want to delete one of the schemas ingested from my MySQL DB, so I ran the CLI like this:
    Copy code
    datahub delete --query "information_schema" --hard --include-removed
    But it doesn’t work; I get this error:
    Copy code
    HTTPError: 401 Client Error: Unauthorized for url: http://localhost:8080/entities?action=search
    [2022-07-19 03:22:34,201] INFO     {datahub.entrypoints:177} - DataHub CLI version: 0.8.40 at /home/ec2-user/datahub/venv/lib64/python3.7/site-packages/datahub/__init__.py
    [2022-07-19 03:22:34,201] INFO     {datahub.entrypoints:180} - Python version: 3.7.10 (default, Jun  3 2021, 00:02:01)
    [GCC 7.3.1 20180712 (Red Hat 7.3.1-13)] at /home/ec2-user/datahub/venv/bin/python3 on Linux-5.10.109-104.500.amzn2.x86_64-x86_64-with-glibc2.2.5
    [2022-07-19 03:22:34,201] INFO     {datahub.entrypoints:182} - GMS config {}
    In my opinion, this error comes from how I started DataHub on my EC2 instance. When I started DataHub, I modified docker-compose-without-neo4j.quickstart.yml to enable Personal Access Tokens. I referred to the content below.
    Copy code
    you have to create your own yml file (copy the one that says docker-compose-without-neo4j.quickstart.yml under the docker/quickstart folder and edit it with METADATA_SERVICE_AUTH_ENABLED=true under both datahub-frontend-react and datahub-gms), then run the quickstart command with the docker file you created: datahub docker quickstart --quickstart-compose-file ./yourdockerfile.yml
    https://datahubspace.slack.com/archives/C029A3M079U/p1655101380628779 In this situation, what should I do to solve this problem? I’d appreciate your help.
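    Since metadata service authentication is now enabled, the CLI also needs a Personal Access Token (generated in the UI under Settings > Access Tokens). A minimal sketch of the CLI-side configuration, assuming the ~/.datahubenv layout that datahub init writes (the token value is a placeholder):
    Copy code
    gms:
      server: http://localhost:8080
      token: <your-personal-access-token>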
    b
    • 2
    • 7
  • h

    hallowed-machine-2603

    07/19/2022, 4:56 AM
    Hi teams, I want to insert tags/terms/descriptions in bulk using a .csv or .xlsx file. Is it possible to insert tags/terms/descriptions in bulk for each table?
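    If it helps, one way this is often done is via the CSV enricher; a minimal recipe sketch, assuming the csv-enricher source is available in your CLI version (the file name is a placeholder, and the CSV column layout follows the csv-enricher docs, e.g. a resource URN plus glossary_terms, tags, description):
    Copy code
    source:
      type: csv-enricher
      config:
        filename: ./bulk_annotations.csv   # placeholder path
        write_semantics: PATCH             # merge with existing metadata rather than overwrite
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'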
    c
    b
    • 3
    • 2
  • a

    alert-football-80212

    07/19/2022, 7:42 AM
    Hi all, what exactly is the difference between Snowflake metadata ingestion and Snowflake usage ingestion?
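    For context, a sketch assuming the two standard source types: the snowflake source pulls schema metadata (databases, tables, columns, lineage, optional profiling), while the snowflake-usage source pulls query history and usage statistics that feed the Queries/Stats tabs. They are typically run as two separate recipes:
    Copy code
    # Recipe 1: schema metadata, lineage, profiling
    source:
      type: snowflake
      config: {}          # connection details omitted
    ---
    # Recipe 2 (run separately): query history and usage statistics
    source:
      type: snowflake-usage
      config: {}          # connection details omitted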
    b
    • 2
    • 6
  • s

    steep-vr-39297

    07/19/2022, 1:31 PM
    Hello. I have a question. Containers get the same DB name. Can I have just one DB name?
    c
    • 2
    • 5
  • b

    billions-twilight-48559

    07/19/2022, 2:08 PM
    Hi, I think there is a problem ingesting table column comments using the Databricks Hive metastore and the Hive ingestor. If I use the Delta Lake ingestion it works, but the comments come through in JSON format; I’m not sure whether this is due to the Delta Lake ingestion or a Databricks standard. Does anyone know whether changing a parameter in the recipe would make it work with the Hive ingestion?
    m
    c
    • 3
    • 21
  • b

    brave-tomato-16287

    07/19/2022, 5:16 PM
    Hello all! Could you help me understand why the tests from
    run_results.json
    were not parsed? It contains:
    Copy code
        {
          "status": "pass",
          "timing": [
            {
              "name": "compile",
              "started_at": "2022-07-19T05:24:01.030208Z",
              "completed_at": "2022-07-19T05:24:01.076270Z"
            },
            {
              "name": "execute",
              "started_at": "2022-07-19T05:24:01.092355Z",
              "completed_at": "2022-07-19T05:24:01.309409Z"
            }
          ],
          "thread_id": "Thread-3 (worker)",
          "execution_time": 0.3308274745941162,
          "adapter_response": {},
          "message": null,
          "failures": 0,
          "unique_id": "test.dwh.not_null_dim_poses_pos_id.bbf0d646e2"
        },
        {
          "status": "pass",
          "timing": [
            {
              "name": "compile",
              "started_at": "2022-07-19T05:24:01.763525Z",
              "completed_at": "2022-07-19T05:24:01.783930Z"
            },
            {
              "name": "execute",
              "started_at": "2022-07-19T05:24:01.784195Z",
              "completed_at": "2022-07-19T05:24:01.974451Z"
            }
          ],
          "thread_id": "Thread-2 (worker)",
          "execution_time": 0.21937918663024902,
          "adapter_response": {},
          "message": null,
          "failures": 0,
          "unique_id": "test.dwh.row_count_dim_poses_1__pos_id.a5451597d7"
        },
        {
          "status": "pass",
          "timing": [
            {
              "name": "compile",
              "started_at": "2022-07-19T05:26:12.219929Z",
              "completed_at": "2022-07-19T05:26:12.239354Z"
            },
            {
              "name": "execute",
              "started_at": "2022-07-19T05:26:12.239617Z",
              "completed_at": "2022-07-19T05:26:12.325147Z"
            }
          ],
          "thread_id": "Thread-9 (worker)",
          "execution_time": 0.10956311225891113,
          "adapter_response": {},
          "message": null,
          "failures": 0,
          "unique_id": "test.dwh.unique_dim_poses_pos_id.e6dc11e045"
        },
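    One thing worth checking, assuming your CLI version supports it: the dbt source only parses test outcomes when the recipe points at the run results file, e.g. via a test_results_path entry alongside the manifest and catalog (treat the exact option name as an assumption to verify against the dbt source docs):
    Copy code
    source:
      type: dbt
      config:
        manifest_path: ./target/manifest.json
        catalog_path: ./target/catalog.json
        test_results_path: ./target/run_results.json   # enables parsing of test results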
    m
    • 2
    • 4
  • r

    refined-energy-76018

    07/20/2022, 1:08 AM
    Is there a way to mount a recipe file so that it will show up in UI ingestion? Or more generally, what are some ways I can manage my UI ingestion sources through source control?
    m
    • 2
    • 5
  • h

    hallowed-machine-2603

    07/20/2022, 5:42 AM
    Hi teams, is there a custom option to add or change the browse path? If a certain database’s metadata is ingested into DataHub using a recipe, can I set a separate path for each table? Assumption: two tables are in the same database, one is Finance_table, the other is Analysis_table. Presently, I use the transformers option below: transformers: type: set_dataset_browse_path config: path_templates: 'Test/Test_1/Test_2/DATASET_PARTS'. In this case, all datasets are located at the same path, but I want a separate path for each table. For example: Test/Test_1/Test_2/Finance_table and Test/Check_1/Check_2/Analysis_table
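    For readability, the same transformer section written out as recipe YAML (note the key is path_templates, and DATASET_PARTS is the built-in variable):
    Copy code
    transformers:
      - type: set_dataset_browse_path
        config:
          path_templates:
            - 'Test/Test_1/Test_2/DATASET_PARTS'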
    c
    • 2
    • 5
  • s

    steep-vr-39297

    07/20/2022, 8:11 AM
    Hello, I have a question. Data was imported using a Hive recipe. When I run
    show databases
    in Hive there is no
    none
    DB, but one was created in DataHub. This is the Hive recipe:
    Copy code
    source:
        type: hive
        config:
          host_port: ip:port
          username: id
          password: pw
          env: DEV
          platform: hive
          schema_pattern:
            allow:
              - 'dev_db_.*'
          domain:
            test:
              allow:
                - '.*'
          options:
            connect_args:
              auth: LDAP

    sink:
        type: datahub-rest
        config:
          server: "http://localhost:8080"
    c
    d
    f
    • 4
    • 5
  • g

    gorgeous-library-38151

    07/20/2022, 9:58 AM
    Can I specify the name of the dataset in the YAML recipe (instead of using the file's name) while ingesting S3 files into DataHub?
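    A sketch of one common approach, assuming the S3 source's path spec support (bucket layout and names are placeholders; depending on your version the option may be path_spec or path_specs): the {table} placeholder in the include path becomes the dataset name instead of the individual file name.
    Copy code
    source:
      type: s3
      config:
        path_specs:
          - include: 's3://my-bucket/data/{table}/*.parquet'   # each {table} folder becomes one dataset
        env: DEV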
    c
    • 2
    • 1
  • s

    salmon-angle-92685

    07/20/2022, 12:28 PM
    Hello guys, in my Redshift recipe I add some Glossary Terms based on table name patterns. However, some tables match two different patterns, and when this happens the second pattern doesn't add its glossary terms. How can I correct this? For example:
    Copy code
    # Linking all the tables having '_table_' on their name to the glossary1 Glossary Term.
    '.*\._table_.*': ["urn:li:glossaryTerm:topic.glossary1"]
    
    # Linking all the tables having '_table_example_' on their name to the glossary2 Glossary Term.
    '.*\._table_example_.*': ["urn:li:glossaryTerm:topic.glossary2"]
    Since all the tables matched by the second pattern are also matched by the first one, the second pattern doesn't add glossary2 to those tables. Thank you!
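    If the transformer applies only the first matching pattern, one workaround (assuming the keys are ordinary Python regexes) is to make the broader pattern exclude the more specific one with a negative lookahead, so each table matches exactly one rule; alternatively, list both term URNs under the more specific pattern if both terms should apply.
    Copy code
    # Broader pattern, now skipping names that contain '_table_example_'
    '.*\._table_(?!example_).*': ["urn:li:glossaryTerm:topic.glossary1"]

    # More specific pattern, unchanged
    '.*\._table_example_.*': ["urn:li:glossaryTerm:topic.glossary2"]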
  • t

    thankful-umbrella-77469

    07/20/2022, 1:29 PM
    Hi, I would appreciate some assistance regarding DataHub configuration with my cloud services. I’m launching the Helm chart and facing the following error in the pod of
    acryl-datahub-actions
    Copy code
    %6|1658138371.437|FAIL|rdkafka#consumer-1| [thrd:<BOOTSTRAP_SERVER>:9092/bootstrap]: <BOOTSTRAP_SERVER>:9092/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 3ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
    and this error keeps repeating itself. I’m connecting my DataHub resources (Kafka, Elasticsearch and MySQL) to my cloud services: Elasticsearch and MySQL -> AWS, Kafka -> Confluent. I followed the doc here but still didn’t overcome the above error. Did someone face it as well? Thanks.
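    Since the error points at security.protocol, the usual suspect is the Kafka client protocol override in the chart values; a sketch of the relevant block, assuming the Helm chart's springKafkaConfigurationOverrides section (exact keys and the secret wiring for the Confluent API key/secret are assumptions to verify against the chart's values.yaml and the Confluent Cloud deployment guide):
    Copy code
    global:
      springKafkaConfigurationOverrides:
        security.protocol: SASL_SSL
        sasl.mechanism: PLAIN
        # the sasl.jaas.config carrying the Confluent API key/secret is typically injected via Kubernetes secrets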
  • b

    better-orange-49102

    07/20/2022, 2:00 PM
    With the new glossary term UI creation logic, the business-glossary CLI ingestion logic might be due for a change. The old approach of using the term name as the basis for the URN could be replaced by a UUID (or at least let people specify a UUID so that they can reliably edit a specific term without being locked in by the name). However, I can foresee legacy issues for people who have an existing glossary term flow. Not sure what the best approach would be here.
    m
    • 2
    • 2
  • c

    colossal-sandwich-50049

    07/20/2022, 4:08 PM
    Hello, I am playing around with Lineage and getting an error when trying to set downstream lineage. Can someone advise?
    Copy code
    /** FS Downstream **/
            Downstream downstream = new Downstream()
                    .setDataset(new DatasetUrn(new DataPlatformUrn("delta"), "somedataset", FabricType.DEV))
                    .setType(DatasetLineageType.TRANSFORMED);
    
            DownstreamArray downstreamArray = new DownstreamArray(downstream);
            DownstreamLineage downstreamLineage = new DownstreamLineage().setDownstreams(downstreamArray);
    
            MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                    .entityType("dataset")
                    .entityUrn(new DatasetUrn(new DataPlatformUrn("delta"), "someotherdataset", FabricType.DEV))
                    .upsert()
                    .aspect(downstreamLineage)
                    .build();
    
    Trying to build the mcpw gives the following error: Caused by: java.lang.NullPointerException: aspectName could not be inferred from provided aspect and was not explicitly provided as an override
    
    Note: the exact same code works if I switch it to set upstream lineage instead of downstream.
    c
    • 2
    • 1
  • b

    billowy-monitor-84300

    07/20/2022, 5:11 PM
    Does Looker ingestion require your credentials to be tied to an admin-level service account on the Looker side? Or can this be a developer account?
    m
    • 2
    • 2
  • c

    chilly-carpet-99599

    07/20/2022, 5:23 PM
    Hey folks, regarding profiling datasets on Snowflake: what is the purpose of the query below? Does it relate to the sample values shown in the Stats section? The query always hangs on our DB for a very large table; does DataHub allow us to customise or tune profiling queries? Thanks so much.
    Copy code
    select column from table where column is not null order by column limit 2 offset 500000000;
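    Not a definitive answer, but a LIMIT 2 OFFSET <half the row count> pattern is typically the median calculation from profiling, and there are profiling options that can switch the heavier per-column metrics off; a sketch, with option names assumed from the profiling config (verify against your version's docs):
    Copy code
    profiling:
      enabled: true
      include_field_median_value: false          # the ORDER BY ... LIMIT 2 OFFSET ... query is usually the median computation
      include_field_quantiles: false
      turn_off_expensive_profiling_metrics: true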
    l
    c
    • 3
    • 2
  • b

    bland-morning-36590

    07/20/2022, 6:09 PM
    Hi team, I am trying to ingest metadata from an S3 bucket. From the logs I can see I am able to make a connection, but I am hitting “KeyError: ‘ContentRange’”. Any idea why this could be happening?
    • 1
    • 1
  • c

    colossal-sandwich-50049

    07/20/2022, 10:06 PM
    Is there any documentation showing how to create an
    MLModel
    entity using the java (or python) emitter?
    c
    b
    • 3
    • 12
  • n

    nice-country-99675

    07/21/2022, 2:05 AM
    👋 Hello team! Just a quick question here... is there a way to access an aspect in a transformer transforming another aspect?
    plus1 1
    b
    • 2
    • 2
  • p

    purple-ghost-69116

    07/21/2022, 3:05 AM
    Hi team, a quick question regarding data ingestion from Redshift or Glue. How do we ingest metadata like owners, tags, domains and queries into DataHub? Does a user need to populate all of this manually?
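    If the source system does not carry this metadata, one common approach is to attach it at ingestion time with transformers in the recipe rather than editing each table in the UI; a sketch, assuming the standard transformer names (the owner and tag URNs are placeholders). Domains can similarly be assigned via the source-level domain config, as in the Hive recipe earlier in this channel.
    Copy code
    transformers:
      - type: simple_add_dataset_ownership
        config:
          owner_urns:
            - 'urn:li:corpuser:data_engineering'   # placeholder owner
      - type: simple_add_dataset_tags
        config:
          tag_urns:
            - 'urn:li:tag:Ingested'                # placeholder tag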
    b
    • 2
    • 4
  • l

    lemon-zoo-63387

    07/21/2022, 3:56 AM
    Hello everyone, I am trying to ingest metadata from several MSSQL servers; two of them succeeded, but one raised the error below. How can I solve it? Thanks in advance for your help.
    Copy code
    'Error: Client does not have encryption enabled but it is required by server, enable encryption and try connecting again\n'
    Configure Recipe:
    Copy code
    source:
        type: mssql
        config:
            env: QA
            host_port: '10.xxxx:50920'
            database: OA
            username: xxx
            password: 'oaxxxxn'
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'
    logs:
    Copy code
    'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line '
               '114, in connect\n'
               '    return dialect.connect(*cargs, **cparams)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, '
               'in connect\n'
               '    return self.dbapi.connect(*cargs, **cparams)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 1345, in '
               'connect\n'
               '    conn._open(sock=sock)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 372, in _open\n'
               '    self._try_open(timeout=retry_time, sock=sock)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 354, in '
               '_try_open\n'
               '    self._connect(host=host, port=port, instance=instance, timeout=timeout, sock=sock)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 304, in '
               '_connect\n'
               '    route = conn.login(login, sock, self._tzinfo_factory)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/tds.py", line 1703, in login\n'
               '    self._main_session.process_prelogin(login)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/tds.py", line 1294, in '
               'process_prelogin\n'
               '    self.parse_prelogin(octets=p, login=login)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/tds.py", line 1343, in '
               'parse_prelogin\n'
               "    raise tds_base.Error('Client does not have encryption enabled but it is required by server, '\n"
               '\n'
               'Error: Client does not have encryption enabled but it is required by server, enable encryption and try connecting again\n'
               '\n'
               'The above exception was the direct cause of the following exception:\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 106, in '
               'run\n'
               '    88   def run(\n'
               '    89       ctx: click.Context,\n'
               '    90       config: str,\n'
               '    91       dry_run: bool,\n'
               '    92       preview: bool,\n'
               '    93       strict_warnings: bool,\n'
               '    94       preview_workunits: int,\n'
               '    95       suppress_error_logs: bool,\n'
               '    96   ) -> None:\n'
               ' (...)\n'
               '    102      pipeline_config = load_config_file(config_file)\n'
               '    103  \n'
               '    104      try:\n'
               '    105          logger.debug(f"Using config: {pipeline_config}")\n'
               '--> 106          pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)\n'
               '    107      except ValidationError as e:\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line '
               '204, in create\n'
               '    196  def create(\n'
               '    197      cls,\n'
               '    198      config_dict: dict,\n'
               '    199      dry_run: bool = False,\n'
               '    200      preview_mode: bool = False,\n'
               '    201      preview_workunits: int = 10,\n'
               '    202  ) -> "Pipeline":\n'
               '    203      config = PipelineConfig.parse_obj(config_dict)\n'
               '--> 204      return cls(\n'
               '    205          config,\n'
               '\n'
    c
    • 2
    • 3
  • w

    wonderful-egg-79350

    07/21/2022, 5:19 AM
    Is it possible for me to ingest metadata using a JSON or CSV file? How can I ingest a table schema using JSON?
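    One option (a sketch, assuming the generic file source): table schemas and other aspects can be written as MCE/MCP JSON, for example produced by the file sink of another ingestion run or built by hand against the dataset schema model, and then ingested with a recipe like this (the path is a placeholder):
    Copy code
    source:
      type: file
      config:
        filename: ./my_metadata.json   # a file of MetadataChangeEvents / MetadataChangeProposals
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'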
    c
    • 2
    • 3
  • w

    witty-painting-90923

    07/21/2022, 8:51 AM
    Hello everyone! We had added an Elasticsearch cluster to DataHub, but that cluster has now been deleted. As a result the ingestion doesn't work, and the old metadata is just hanging there. Is there currently a way to remove it from DataHub? Thank you!
    c
    s
    • 3
    • 8
  • c

    cool-vr-73109

    07/21/2022, 10:19 AM
    Hi team, I tried S3 data ingestion using the custom ingestion source option. Ingestion failed with profiling enabled (true), with the error below. JAVA_HOME and SPARK_VERSION are set. We have deployed DataHub on an EC2 instance with Docker. Please help.
    c
    w
    • 3
    • 34
  • b

    best-leather-7441

    07/21/2022, 3:05 PM
    Hi! My team and I would like to implement a GitLab CI/CD job to ingest dbt data into DataHub. Does anyone have tips on how to do this?
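    A minimal sketch of such a job, assuming the dbt artifacts are produced earlier in the pipeline and a DataHub recipe file lives in the repo (the stage, image, branch rule and file names are placeholders):
    Copy code
    ingest_dbt_metadata:
      stage: deploy
      image: python:3.9
      script:
        - pip install 'acryl-datahub[dbt]'
        - datahub ingest -c dbt_datahub_recipe.yml
      rules:
        - if: '$CI_COMMIT_BRANCH == "main"'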
    m
    • 2
    • 2
  • a

    average-rocket-98592

    07/21/2022, 3:15 PM
    Hi! I’m trying to ingest data from Oracle with the profiling option enabled with default values, but it returns the following error: limit\n value is not a valid integer
    c
    f
    • 3
    • 7
  • n

    nice-solstice-89964

    07/21/2022, 3:51 PM
    Hi, community. We have ingestion set up for Looker and dbt/Snowflake on DataHub. But when trying to show lineage from Looker, we can only see the published tables behind the Looker charts, and cannot see further back to the source tables of those published tables. The only way to see that lineage (source tables -> published tables) is to check the dbt section on DataHub. Is there a way to integrate those two sources together, so that when we check the lineage from Looker we can also trace back to the very source tables on dbt? Thanks very much for any hint! 🙇
    c
    • 2
    • 1