# ingestion
i
Hello, does the ingestion framework use PEP 249 (https://www.python.org/dev/peps/pep-0249/) to crawl metadata via sqlalchemy?
I'm trying to extend the framework to support Druid, but I think Druid works a little differently than the framework expects. After adding a Druid source:
```python
# This import verifies that the dependencies are available.
import pydruid # noqa: F401

from .sql_common import BasicSQLAlchemyConfig, SQLAlchemySource

class DruidConfig(BasicSQLAlchemyConfig):
    # Default to the "druid" scheme so SQLAlchemy resolves pydruid's dialect.
    scheme = "druid"

    def get_sql_alchemy_url(self):
        # Druid serves SQL from the /druid/v2/sql/ endpoint on the broker.
        return f"{super().get_sql_alchemy_url()}/druid/v2/sql/"


class DruidSource(SQLAlchemySource):
    def __init__(self, config, ctx):
        # "druid" becomes the platform name in the emitted metadata.
        super().__init__(config, ctx, "druid")

    @classmethod
    def create(cls, config_dict, ctx):
        config = DruidConfig.parse_obj(config_dict)
        return cls(config, ctx)
```
After registering this new source class, I get the following when crawling:
```
# datahub ingest -c druid_to_console.yml 
[2021-03-12 15:09:04,996] DEBUG    {datahub.entrypoints:64} - Using config: {'source': {'type': 'druid', 'config': {'host_port': '<omitted url>'}}, 'sink': {'type': 'file', 'config': {'filename': './druid.json'}}}
[2021-03-12 15:09:04,996] DEBUG    {datahub.ingestion.run.pipeline:63} - Source type:druid,<class 'datahub.ingestion.source.druid.DruidSource'> configured
[2021-03-12 15:09:04,996] INFO     {datahub.ingestion.sink.file:27} - Will write to druid.json
[2021-03-12 15:09:04,996] DEBUG    {datahub.ingestion.run.pipeline:69} - Sink type:file,<class 'datahub.ingestion.sink.file.FileSink'> configured
[2021-03-12 15:09:04,996] DEBUG    {datahub.ingestion.source.sql_common:172} - sql_alchemy_url=<omitted url>
[2021-03-12 15:09:05,466] DEBUG    {datahub.ingestion.run.pipeline:38} - sink called success callback
[2021-03-12 15:09:05,560] DEBUG    {datahub.ingestion.run.pipeline:38} - sink called success callback
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1215, in _fetchone_impl
    return self.cursor.fetchone()
AttributeError: 'NoneType' object has no attribute 'fetchone'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/datahub", line 33, in <module>
    sys.exit(load_entry_point('datahub', 'console_scripts', 'datahub')())
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
```
@gray-shoe-75895 do you have any pointers?
g
The ingestion framework uses SQLAlchemy, and internally it relies on the DB-API (PEP 249) via the driver.
My best guess is that it’s an issue with pydruid, but I’m not really sure where to go from there
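Roughly, the layering looks like this (a minimal sketch, not code from the ingestion framework; it assumes pydruid's SQLAlchemy dialect is installed, and `localhost:8082` is just a placeholder broker address):
```python
from sqlalchemy import create_engine, inspect

# The "druid" scheme resolves to pydruid's SQLAlchemy dialect, matching
# the URL that DruidConfig.get_sql_alchemy_url builds above.
engine = create_engine("druid://localhost:8082/druid/v2/sql/")

# Metadata crawling goes through SQLAlchemy's inspector API...
inspector = inspect(engine)
print(inspector.get_table_names())

# ...but underneath it all is a plain PEP 249 (DB-API) connection and
# cursor supplied by the driver (pydruid here). A cursor that ends up
# None is exactly what the AttributeError in the traceback above shows.
raw = engine.raw_connection()
cursor = raw.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchone())
```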
i
Has the ingestion framework been tested against a database that has a table schema but no rows?
g
interesting - I don't think so
Actually yes we do test that in the MySQL integration test
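For reference, schema reflection shouldn't need any rows at all; here's a quick sketch of crawling an empty table (using in-memory SQLite as a stand-in rather than the actual MySQL integration test):
```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, inspect,
)

engine = create_engine("sqlite://")  # throwaway in-memory database
metadata = MetaData()
Table("empty_table", metadata, Column("id", Integer), Column("name", String))
metadata.create_all(engine)  # the table exists but holds zero rows

# Column reflection only reads the schema, so an empty table is fine:
print(inspect(engine).get_columns("empty_table"))
```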
i
This seems like a peculiarity of my metadata database for Druid, which has become inconsistent. I will test further, but the code snippet in the first threaded message works as a Druid crawler if you want to add it to the metadata ingestion 🙂
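For context, "registering" here just means mapping the `type: druid` string from the recipe to the source class; below is a purely hypothetical stand-in for the framework's actual registry (whose real API may differ):
```python
# Hypothetical sketch only -- not the framework's real registry API.
# The "type" string from the recipe is looked up to find the source
# class, and its create() classmethod builds it from the config dict.
SOURCE_REGISTRY = {"druid": DruidSource}  # DruidSource from the snippet above

def build_source(source_spec, ctx):
    source_cls = SOURCE_REGISTRY[source_spec["type"]]
    return source_cls.create(source_spec.get("config", {}), ctx)
```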
g
Would love it if you opened a PR for this!
i