# ingestion
i
When using the ingestion framework, is it expected that, when a database is specified in the crawling config, the crawler still works through all databases but prefixes the DatasetURN of each entity with the configured database?
m
can you give an example?
g
It is expected because of these lines in the code, which were added to maintain SQL server compatibility. If you want to restrict to a certain database, the allow/deny lists are probably a better approach for now
Once again, we could probably improve the docs here
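As a hedged sketch of the allow/deny approach (keys mirror the Hive config shown later in this thread; the `None.` database prefix is the behavior being discussed, and the exact match-string format may vary per source), a deny-based recipe might look like:

```yaml
# Sketch only: skip the default database via a deny pattern
# instead of using the database property.
source:
  type: hive
  config:
    host_port: <host>
    table_pattern:
      deny:
        # patterns are matched against "<database>.<table>"
        - "None.default"
```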
i
Understood, Harshal. Perhaps the lists approach is the better one to document and use, while removing the database property altogether
g
Yep I think so as well - will check with @mammoth-bear-12532 since he's the most familiar with SQL server ingestion in particular
@incalculable-ocean-74010 this PR should make the behavior a bit more clear https://github.com/linkedin/datahub/pull/2161 - let me know if this isn't aligned with what you were imagining
i
Hey @gray-shoe-75895, about allow/deny lists: can you share how to ignore specific databases?
Say you have a SQL engine with the following databases: `A`, `B`, `default`. Is there a way to tell the crawler not to crawl the `default` database?
Turns out I have to specify the database as `None.<database_name>`. Is this intended?
```yaml
---
source:
  type: hive
  config:
    host_port: <host>
    table_pattern:
      allow:
        - "None.dev"
      #deny:
      # - "default"
    options:
      connect_args:
        auth: KERBEROS
        thrift_transport_protocol: http
        kerberos_service_name: hive
```
☝️ as a concrete example. Is the prefix required for some specific reason? If so, that may be something else worth documenting
Noticed something else. Even if I put the `default` database in the `table_pattern` deny list:
```yaml
---
source:
  type: hive
  config:
    host_port: <>
    table_pattern:
      allow:
        - "None.bi"
        - "None.events"
      deny:
        - "None.default"
    options:
      connect_args:
        auth: KERBEROS
        thrift_transport_protocol: http
        kerberos_service_name: hive
```
The crawler still tries to query the `default` database, but the user running the process does not have permission to query `default` (for internal company reasons):
```
ExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [datahub-dev] does not have [USE] privilege on [default]
```
Is this a bug? What is the ingestion framework trying to do?
m
Yeah, we didn’t consider permissions as a reason for allow/deny, but it makes sense. The current code post-filters the list. We should probably do a combination of a pre-check, to prevent unnecessary queries, and a post-check, to apply the final filters.
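The pre-check/post-check idea above can be sketched roughly as follows. This is only an illustration under assumed names (`matches_pattern`, `crawl`), not the actual DataHub ingestion API:

```python
import re


def matches_pattern(name: str, allow: list[str], deny: list[str]) -> bool:
    """Return True if name matches some allow pattern and no deny pattern."""
    if any(re.fullmatch(p, name) for p in deny):
        return False
    return any(re.fullmatch(p, name) for p in allow)


def crawl(databases: list[str], allow: list[str], deny: list[str]) -> list[str]:
    crawled = []
    for db in databases:
        # Pre-check: skip denied databases entirely, so the crawler
        # never issues a query against them (avoids permission errors
        # like the HiveAccessControlException above).
        if not matches_pattern(db, allow, deny):
            continue
        # ... query the database here; a post-check would then filter
        # individual tables against the same patterns ...
        crawled.append(db)
    return crawled
```

For example, `crawl(["A", "B", "default"], allow=[".*"], deny=["default"])` returns `["A", "B"]` without ever touching `default`.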
i
If you agree I’m happy to try and make the necessary changes but need some guidance in the ingestion codebase
m
Yeah, that would be great. Either @gray-shoe-75895 or I can guide you, maybe in the next hour or so.
i
Whenever you're free, let me know. I think I found the root cause
g
Nice - what did you find?
i
Could we have a call? It's probably easier to explain
g
sure thing - I'll call you on Slack?