Hi All, I am trying to ingest metadata from a mssq...
# ingestion
c
Hi All, I am trying to ingest metadata from a mssql database. So the command I am using is
datahub ingest -c ./mssql-recipe_._yml
. And my recipe file looks as below,
Copy code
source:
  type: mssql
  config:
    username: <user_name>
    password: <password>
    host_port: <ip>:1433
    database: <db_name>
    table_pattern:
      #deny:
      #  - "^.*\\.sys_.*" # deny all tables that start with sys_
      allow:
        - "schema_name_1.*"

sink:
  type: "datahub-rest"
  config:
    server: "<http://localhost:8080>"
But when I try, I get the below error,
Copy code
sqlalchemy.exc.OperationalError: (pytds.tds_base.OperationalError) Database '<organization_name>\<dba_name>' does not exist. Make sure that the name is entered correctly.
[SQL: use [<organization_name>\<dba_name>]]
• Is there a way to print or get the underlying queries submitted to the database (so I can figure out why it tries to query an object with the dba's name, probably a previously deleted item / user account ect etc...) ? • Is there a list of permissions that needs to be assigned to the db user that we use for the crawler / ingestion. Current user that I am trying has some additional server level access and was wondering whether it is providing additional metadata of the server that is not expected (eg: dba user names etc etc...)
g
Adding the echo option will make it print each sql query issued. I'd also recommend running datahub CLI with the debug flag enabled, e.g.
datahub --debug ingest ...
Copy code
source:
  type: mssql
  # normal stuff
  options:
    echo: True
The list of permissions varies by DB system, so I'm not super sure what the precise list is for mssql - in general, the user should be able to list databases, schemas, tables, and column-level metadata
c
Thanks @gray-shoe-75895. Let me try with above options and see whether I can narrow it down šŸ‘
šŸ‘ 1
@gray-shoe-75895 I get the below error (extra fields not permitted) when I add the
echo option
Copy code
1 validation error for PipelineConfig
source -> options
  extra fields not permitted (type=value_error.extra)
This is how my yml file looks now
Copy code
source:
  type: mssql
  options:
    echo: true
  config:
    username: <user_name>
    password: <password>
    host_port: <ip>:1433
    database: <db_name>
    table_pattern:
      #deny:
      #  - "^.*\\.sys_.*" # deny all tables that start with sys_
      allow:
        - "schema_name.*"

sink:
  type: "datahub-rest"
  config:
    server: "<http://localhost:8080>"
g
ah sorry, I messed up the indentation in my previous message - options should be under
config
šŸ‘ 1
c
By any chance have you see the below error before,
Copy code
Traceback (most recent call last):
  File "/usr/local/bin/datahub", line 8, in <module>
    sys.exit(datahub())
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/datahub/entrypoints.py", line 74, in ingest
    pipeline.run()
  File "/usr/local/lib/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 87, in run
    for wu in self.source.get_workunits():
  File "/usr/local/lib/python3.6/site-packages/datahub/ingestion/source/sql_common.py", line 209, in get_workunits
    if not sql_config.table_pattern.allowed(dataset_name):
  File "/usr/local/lib/python3.6/site-packages/datahub/configuration/common.py", line 69, in allowed
    if re.match(deny_pattern, string):
  File "/usr/lib64/python3.6/re.py", line 172, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib64/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib64/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib64/python3.6/sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib64/python3.6/sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "/usr/lib64/python3.6/sre_parse.py", line 616, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 0
g
I haven't seen that before - seems like an issue with the table_pattern regexes?
c
ā˜ļø can it be due to that I am trying to deny all tables in a specific schema ?
g
is it the same config as before?
perhaps
btw, we also support
schema_pattern
c
Ah ok, I didn't know that. Then I guess I can deny a schema straight away like this ->
Copy code
schema_pattern:
      deny:
        - "schema_name"
Do you know where I might be able to find all supported configs / options (eg: schema_pattern, table_pattern, echo etc etc...) for a given source (if available) ? šŸ™‚
@gray-shoe-75895 worked like a charm with the
schema_pattern
. Thanks for the help mate šŸŽ‰ Just a one small question, when we are using
datahub ingest
command, can we change the path of the metadata dataset that we create. I assume by default, the path is like
Dataset/prod/<data_source_type>/<data_source_name>/<table_name>
, can I change it to something like
Dataset/nonprod/...
?
šŸŽ‰ 1
g
Most of the docs live here https://datahubproject.io/docs/metadata-ingestion/, but they're not quite comprehensive with all the config options available
for the
prod
thing, there's an option to set
env
under config in the recipe
šŸ‘ 1