# ingestion
i
Hello, does the ingestion framework use PEP 249 (https://www.python.org/dev/peps/pep-0249/) to crawl metadata via sqlalchemy?
I'm trying to extend the framework to support Druid, but I think Druid works a little differently than the framework expects. After adding a Druid source:
```python
# This import verifies that the dependencies are available.
import pydruid # noqa: F401

from .sql_common import BasicSQLAlchemyConfig, SQLAlchemySource

class DruidConfig(BasicSQLAlchemyConfig):
    # Default to the "druid" scheme so SQLAlchemy resolves pydruid's dialect.
    scheme = "druid"

    def get_sql_alchemy_url(self):
        # Druid serves SQL from the /druid/v2/sql/ endpoint on the broker.
        return f"{super().get_sql_alchemy_url()}/druid/v2/sql/"


class DruidSource(SQLAlchemySource):
    def __init__(self, config, ctx):
        # "druid" becomes the platform name in the emitted metadata.
        super().__init__(config, ctx, "druid")

    @classmethod
    def create(cls, config_dict, ctx):
        config = DruidConfig.parse_obj(config_dict)
        return cls(config, ctx)
```
After registering this new source class, I get the following when crawling:
```
# datahub ingest -c druid_to_console.yml 
[2021-03-12 15:09:04,996] DEBUG    {datahub.entrypoints:64} - Using config: {'source': {'type': 'druid', 'config': {'host_port': '<omitted url>'}}, 'sink': {'type': 'file', 'config': {'filename': './druid.json'}}}
[2021-03-12 15:09:04,996] DEBUG    {datahub.ingestion.run.pipeline:63} - Source type:druid,<class 'datahub.ingestion.source.druid.DruidSource'> configured
[2021-03-12 15:09:04,996] INFO     {datahub.ingestion.sink.file:27} - Will write to druid.json
[2021-03-12 15:09:04,996] DEBUG    {datahub.ingestion.run.pipeline:69} - Sink type:file,<class 'datahub.ingestion.sink.file.FileSink'> configured
[2021-03-12 15:09:04,996] DEBUG    {datahub.ingestion.source.sql_common:172} - sql_alchemy_url=<omitted url>
[2021-03-12 15:09:05,466] DEBUG    {datahub.ingestion.run.pipeline:38} - sink called success callback
[2021-03-12 15:09:05,560] DEBUG    {datahub.ingestion.run.pipeline:38} - sink called success callback
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/result.py", line 1215, in _fetchone_impl
    return self.cursor.fetchone()
AttributeError: 'NoneType' object has no attribute 'fetchone'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/datahub", line 33, in <module>
    sys.exit(load_entry_point('datahub', 'console_scripts', 'datahub')())
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
```
@gray-shoe-75895 do you have any pointers?
g
The ingestion framework uses SQLAlchemy, and internally it relies on the DB-API (PEP 249) via the driver.
My best guess is that it’s an issue with pydruid, but I’m not really sure where to go from there
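Roughly, the layering looks like this (a minimal sketch, not code from the ingestion framework; it assumes pydruid's SQLAlchemy dialect is installed, and `localhost:8082` is just a placeholder broker address):
```python
from sqlalchemy import create_engine, inspect

# The "druid" scheme resolves to pydruid's SQLAlchemy dialect, matching
# the URL that DruidConfig.get_sql_alchemy_url builds above.
engine = create_engine("druid://localhost:8082/druid/v2/sql/")

# Metadata crawling goes through SQLAlchemy's inspector API...
inspector = inspect(engine)
print(inspector.get_table_names())

# ...but underneath it all is a plain PEP 249 (DB-API) connection and
# cursor supplied by the driver (pydruid here). A cursor that ends up
# None is exactly what the AttributeError in the traceback above shows.
raw = engine.raw_connection()
cursor = raw.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchone())
```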
i
Has the ingestion framework been tested against a database that has a table schema but no rows?
g
interesting - I don't think so
Actually yes we do test that in the MySQL integration test
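For reference, schema reflection shouldn't need any rows at all; here's a quick sketch of crawling an empty table (using in-memory SQLite as a stand-in rather than the actual MySQL integration test):
```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, inspect,
)

engine = create_engine("sqlite://")  # throwaway in-memory database
metadata = MetaData()
Table("empty_table", metadata, Column("id", Integer), Column("name", String))
metadata.create_all(engine)  # the table exists but holds zero rows

# Column reflection only reads the schema, so an empty table is fine:
print(inspect(engine).get_columns("empty_table"))
```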
i
This seems like a peculiarity of my metadata database for Druid, which has become inconsistent. I will test further, but the code snippet in the first threaded message works as a Druid crawler if you want to add it to the metadata ingestion 🙂
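For context, "registering" here just means mapping the `type: druid` string from the recipe to the source class; below is a purely hypothetical stand-in for the framework's actual registry (whose real API may differ):
```python
# Hypothetical sketch only -- not the framework's real registry API.
# The "type" string from the recipe is looked up to find the source
# class, and its create() classmethod builds it from the config dict.
SOURCE_REGISTRY = {"druid": DruidSource}  # DruidSource from the snippet above

def build_source(source_spec, ctx):
    source_cls = SOURCE_REGISTRY[source_spec["type"]]
    return source_cls.create(source_spec.get("config", {}), ctx)
```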
g
Would love it if you opened a PR for this!
i