# ingestion
g
Good day, I'm trying to do a simple ingestion from a PostgreSQL database but I'm facing some error messages that I struggle to understand. (DataHub is running locally via "datahub docker quickstart".) My YAML file:
```yaml
source:
  type: postgres
  config:
    # Coordinates
    host_port: URL:5432
    database: DATABASENAME
    # Credentials
    username: user
    password: password
    # Options
    include_tables: True
    include_views: True
sink:
  type: "datahub-rest"
  config:
    server: "<http://localhost:9002/api/gms>"   #this path is what UI ingestion tool sugests, I also tried default <http://localhost:8080>" with same result
Both the postgres and datahub-rest plugins look enabled. Upd: error log moved into the thread.
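For reference, a quick way to tell which address GMS actually answers on is to probe the candidate endpoints before running the recipe. A minimal sketch in Python, assuming the quickstart defaults from this thread (the /config health path is an assumption based on what the CLI itself checks):
```python
# Probe the candidate GMS addresses from the recipe above; a sketch,
# not part of the datahub CLI. The /config path is an assumption.
import requests

for server in ("http://localhost:8080", "http://localhost:9002/api/gms"):
    try:
        resp = requests.get(f"{server}/config", timeout=5)
        print(server, "->", resp.status_code)
    except requests.RequestException as exc:
        print(server, "-> unreachable:", exc)
```
Whichever address returns a 200 here is the one to put in the sink's server field.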
c
Same error; please ping me when it's solved.
@glamorous-house-64036 This is working for me:
```bash
# Assumes the DataHub repo is cloned locally.
./metadata-ingestion/scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml
```
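The same kind of recipe can also be run programmatically, which sometimes surfaces configuration errors more directly than the shell wrapper. A minimal sketch, assuming the Pipeline API visible in the traceback below (host, database, and credentials are placeholders):
```python
# Run an equivalent recipe via the Python API; a sketch based on the
# Pipeline.create signature that appears in the traceback below.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "URL:5432",     # placeholder
                "database": "DATABASENAME",  # placeholder
                "username": "user",          # placeholder
                "password": "password",      # placeholder
                "include_tables": True,
                "include_views": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # assumption: raises if the run reported failures
```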
i
Hello Dmytro, what version of DataHub are you using?
l
Hi @glamorous-house-64036! Gentle reminder to please post large blocks of code/logs in threads; it's a HUGE help in making sure we can keep track of all unanswered questions across channels!
g
Moving the log to the thread, sorry. The ingestion errors look like this:
```
[2022-02-10 11:12:54,369] ERROR    {datahub.entrypoints:119} - File "/usr/local/lib/python3.8/dist-packages/datahub/cli/ingest_cli.py", line 77, in run
    67   def run(config: str, dry_run: bool, preview: bool, strict_warnings: bool) -> None:
 (...)
    73       pipeline_config = load_config_file(config_file)
    74   
    75       try:
    76           logger.debug(f"Using config: {pipeline_config}")
--> 77           pipeline = Pipeline.create(pipeline_config, dry_run, preview)
    78       except ValidationError as e:

File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 175, in create
    171  def create(
    172      cls, config_dict: dict, dry_run: bool = False, preview_mode: bool = False
    173  ) -> "Pipeline":
    174      config = PipelineConfig.parse_obj(config_dict)
--> 175      return cls(config, dry_run=dry_run, preview_mode=preview_mode)

File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 111, in __init__
    105  def __init__(
    106      self, config: PipelineConfig, dry_run: bool = False, preview_mode: bool = False
    107  ):
    108      self.config = config
    109      self.dry_run = dry_run
    110      self.preview_mode = preview_mode
--> 111      self.ctx = PipelineContext(
    112          run_id=self.config.run_id,

File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/api/common.py", line 41, in __init__
    32   def __init__(
    33       self,
    34       run_id: str,
    35       datahub_api: Optional[DatahubClientConfig] = None,
    36       pipeline_name: Optional[str] = None,
    37       dry_run: bool = False,
    38       preview_mode: bool = False,
    39   ) -> None:
    40       self.run_id = run_id
--> 41       self.graph = DataHubGraph(datahub_api) if datahub_api is not None else None
    42       self.pipeline_name = pipeline_name

File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/graph/client.py", line 39, in __init__
    37   def __init__(self, config: DatahubClientConfig) -> None:
    38       self.config = config
--> 39       super().__init__(
    40           gms_server=self.config.server,

File "/usr/local/lib/python3.8/dist-packages/datahub/emitter/rest_emitter.py", line 117, in __init__
    65   def __init__(
    66       self,
    67       gms_server: str,
    68       token: Optional[str] = None,
    69       connect_timeout_sec: Optional[float] = None,
    70       read_timeout_sec: Optional[float] = None,
    71       retry_status_codes: Optional[List[int]] = None,
    72       retry_methods: Optional[List[str]] = None,
    73       retry_max_times: Optional[int] = None,
    74       extra_headers: Optional[Dict[str, str]] = None,
    75       ca_certificate_path: Optional[str] = None,
    76   ):
 (...)
    113  
    114      if retry_max_times:
    115          self._retry_max_times = retry_max_times
    116  
--> 117      retry_strategy = Retry(
    118          total=self._retry_max_times,

---- (full traceback above) ----
File "/usr/local/lib/python3.8/dist-packages/datahub/cli/ingest_cli.py", line 77, in run
    pipeline = Pipeline.create(pipeline_config, dry_run, preview)
File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 175, in create
    return cls(config, dry_run=dry_run, preview_mode=preview_mode)
File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 111, in __init__
    self.ctx = PipelineContext(
File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/api/common.py", line 41, in __init__
    self.graph = DataHubGraph(datahub_api) if datahub_api is not None else None
File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/graph/client.py", line 39, in __init__
    super().__init__(
File "/usr/local/lib/python3.8/dist-packages/datahub/emitter/rest_emitter.py", line 117, in __init__
    retry_strategy = Retry(

TypeError: __init__() got an unexpected keyword argument 'allowed_methods'
```
The DataHub version is 0.8.26.1.
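The TypeError itself points at the urllib3 version rather than at DataHub: Retry(allowed_methods=...) only exists since urllib3 1.26, where it replaced the deprecated method_whitelist, so an older urllib3 rejects the keyword exactly as in the log above. A minimal repro:
```python
# Reproduces the failure mode in isolation: urllib3 < 1.26 has no
# `allowed_methods` keyword on Retry (it was still `method_whitelist`).
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=4,  # arbitrary value for the repro
    allowed_methods=["HEAD", "GET", "POST"],
)
# On urllib3 1.25.x the call above raises:
#   TypeError: __init__() got an unexpected keyword argument 'allowed_methods'
print("urllib3 accepted allowed_methods; the installed version is >= 1.26")
```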
i
This is very strange… Let me get back to you on this @glamorous-house-64036!
g
I got the same error when I tried to ingest the example from datahub/metadata-ingestion/examples/recipes/example_to_datahub_rest.yml, in both cases: when I use the default endpoint for the REST sink (8080) and when I use the one the UI shows (:9002/api/gms).
h
Hi @glamorous-house-64036, could you tell me what version of urllib3 you have on your system?
```bash
pip freeze | grep urllib3
urllib3==1.26.7
```
g
urllib3==1.25.8
h
That's probably the issue! Could you try running `pip install urllib3 --upgrade`?
g
OK, after the upgrade I have urllib3==1.26.8. Unfortunately, exactly the same issue with ingestion 😞
I forgot to mention: datahub docker ingest-sample-data produces the same error. Out of interest, I tried it on Ubuntu 18.04 and 20.04 under WSL2 on a different laptop and saw the same problem.
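When pip reports an upgraded urllib3 but the error persists, one common cause is that the interpreter running the datahub CLI is not the one pip upgraded. A small check, run with the same Python that executes datahub:
```python
# Confirm which urllib3 the running interpreter actually sees;
# `pip freeze` may describe a different environment.
import sys
import urllib3

print(sys.executable)       # which Python this is
print(urllib3.__version__)  # needs >= 1.26 for Retry(allowed_methods=...)
```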
OK, problem solved:
• update pyspark
• use server: "http://localhost:8080" in the sink config instead of whatever the UI suggests.
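To confirm the corrected sink address before re-running the recipe, the REST emitter can be exercised directly. A sketch, assuming the DatahubRestEmitter API from datahub 0.8.x that appears in the traceback (test_connection is an assumption for this version):
```python
# Verify the sink endpoint from the resolution above; a sketch assuming
# the datahub 0.8.x emitter API (test_connection may differ by version).
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.test_connection()  # raises if GMS is not reachable at this address
print("GMS reachable at http://localhost:8080")
```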