# ingestion
i
When using the ingestion framework, is it expected that, when a database is specified in the crawling config, the crawler still works through all databases but prefixes the DatasetURN of each entity with the configured database?
m
can you give an example?
g
It is expected because of these lines in the code, which were added to maintain SQL server compatibility. If you want to restrict to a certain database, the allow/deny lists are probably a better approach for now
Once again, we could probably improve the docs here
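As a hedged sketch of the allow/deny approach (keys mirror the Hive config shown later in this thread; the `None.` database prefix is the behavior being discussed, and the exact match-string format may vary per source), a deny-based recipe might look like:

```yaml
# Sketch only: skip the default database via a deny pattern
# instead of using the database property.
source:
  type: hive
  config:
    host_port: <host>
    table_pattern:
      deny:
        # patterns are matched against "<database>.<table>"
        - "None.default"
```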
i
Understood, Harshal. Perhaps the lists approach is the better one to document and use, while removing the database property altogether
g
Yep I think so as well - will check with @mammoth-bear-12532 since he's the most familiar with SQL server ingestion in particular
@incalculable-ocean-74010 this PR should make the behavior a bit more clear https://github.com/linkedin/datahub/pull/2161 - let me know if this isn't aligned with what you were imagining
i
Hey @gray-shoe-75895, about allow/deny lists: can you share how to ignore specific databases?
Say you have a SQL engine with the following databases: `A`, `B`, `default`. Is there a way to tell the crawler not to crawl the `default` database?
Turns out I have to specify the database as `None.<database_name>`. Is this intended?
```yaml
---
source:
  type: hive
  config:
    host_port: <host>
    table_pattern:
      allow:
        - "None.dev"
      #deny:
      # - "default"
    options:
      connect_args:
        auth: KERBEROS
        thrift_transport_protocol: http
        kerberos_service_name: hive
```
☝️ as a concrete example. Is the prefix required for some specific reason? If so, that may be something else worth documenting
Noticed something else. Even if I put the `default` database in the `table_pattern` deny list:
```yaml
---
source:
  type: hive
  config:
    host_port: <>
    table_pattern:
      allow:
        - "None.bi"
        - "None.events"
      deny:
        - "None.default"
    options:
      connect_args:
        auth: KERBEROS
        thrift_transport_protocol: http
        kerberos_service_name: hive
```
The crawler still tries to query the `default` database, but the user running the process does not have permission to query `default` (for internal company reasons):
```
ExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [datahub-dev] does not have [USE] privilege on [default]
```
Is this a bug? What is the ingestion framework trying to do?
m
Yeah, we didn’t consider permissions as a reason for allow/deny, but it makes sense. The current code post-filters the list. We should probably do a combination of a pre-check, to prevent unnecessary queries, and a post-check, to apply the final filters.
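The pre-check/post-check idea above can be sketched roughly as follows. This is only an illustration under assumed names (`matches_pattern`, `crawl`), not the actual DataHub ingestion API:

```python
import re


def matches_pattern(name: str, allow: list[str], deny: list[str]) -> bool:
    """Return True if name matches some allow pattern and no deny pattern."""
    if any(re.fullmatch(p, name) for p in deny):
        return False
    return any(re.fullmatch(p, name) for p in allow)


def crawl(databases: list[str], allow: list[str], deny: list[str]) -> list[str]:
    crawled = []
    for db in databases:
        # Pre-check: skip denied databases entirely, so the crawler
        # never issues a query against them (avoids permission errors
        # like the HiveAccessControlException above).
        if not matches_pattern(db, allow, deny):
            continue
        # ... query the database here; a post-check would then filter
        # individual tables against the same patterns ...
        crawled.append(db)
    return crawled
```

For example, `crawl(["A", "B", "default"], allow=[".*"], deny=["default"])` returns `["A", "B"]` without ever touching `default`.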
i
If you agree I’m happy to try and make the necessary changes but need some guidance in the ingestion codebase
m
Yeah, that would be great. Either @gray-shoe-75895 or I can guide you, maybe in the next hour or so.
i
Whenever you're free, let me know. I think I found the root cause
g
Nice - what did you find?
i
Could we have a call? It's probably easier to explain
g
sure thing - I'll call you on Slack?