# ingestion
m
Hi, I am evaluating DataHub for implementation at our company. I am trying to ingest a sample file that you provide in the docs, but I am not sure where I should drop the file. I just dropped it in a folder like so: /home/miquelp/datahub/file_onboarding/test_containers.json. However, I am getting an error while executing the recipe; it seems like it doesn't find the file (the error message could be improved?). I have looked for a similar error but couldn't find anything. Which Linux user is the one that executes the ingestion? Can you give us a hand? Thanks
```
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '06b7698c-048e-470e-bf2c-1ff4fca75bd0',
 'infos': ['2023-05-29 10:18:33.415813 INFO: Starting execution for task with name=RUN_INGEST',
           "2023-05-29 10:18:37.476974 INFO: Failed to execute 'datahub ingest'",
           '2023-05-29 10:18:37.477118 INFO: Caught exception EXECUTING task_id=06b7698c-048e-470e-bf2c-1ff4fca75bd0, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Report ~~~~
{
  "cli": {
    "cli_version": "0.10.0.7",
    "cli_entry_location": "/usr/local/lib/python3.10/site-packages/datahub/__init__.py",
    "py_version": "3.10.10 (main, Mar 14 2023, 02:37:11) [GCC 10.2.1 20210110]",
    "py_exec_path": "/usr/local/bin/python",
    "os_details": "Linux-5.15.0-72-generic-x86_64-with-glibc2.31",
    "peak_memory_usage": "57.82 MB",
    "mem_info": "57.82 MB"
  },
  "source": {
    "type": "file",
    "report": {
      "events_produced": 0,
      "events_produced_per_sec": 0,
      "entities": {},
      "aspects": {},
      "warnings": {},
      "failures": {},
      "total_num_files": 0,
      "num_files_completed": 0,
      "files_completed": [],
      "percentage_completion": "0%",
      "estimated_time_to_completion_in_minutes": -1,
      "total_bytes_read_completed_files": 0,
      "total_parse_time_in_seconds": 0,
      "total_count_time_in_seconds": 0,
      "total_deserialize_time_in_seconds": 0,
      "aspect_counts": {},
      "entity_type_counts": {},
      "start_time": "2023-05-29 10:18:35.206188 (now)",
      "running_time": "0 seconds"
    }
  },
  "sink": {
    "type": "datahub-rest",
    "report": {
      "total_records_written": 0,
      "records_written_per_second": 0,
      "warnings": [],
      "failures": [],
      "start_time": "2023-05-29 10:18:35.161225 (now)",
      "current_time": "2023-05-29 10:18:35.208860 (now)",
      "total_duration_in_seconds": 0.05,
      "gms_version": "v0.10.3",
      "pending_requests": 0
    }
  }
}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv setup time = 0
This version of datahub supports report-to functionality
datahub  ingest run -c /tmp/datahub/ingest/06b7698c-048e-470e-bf2c-1ff4fca75bd0/recipe.yml --report-to /tmp/datahub/ingest/06b7698c-048e-470e-bf2c-1ff4fca75bd0/ingestion_report.json
[2023-05-29 10:18:35,113] INFO     {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.0.7
No ~/.datahubenv file found, generating one for you...
[2023-05-29 10:18:35,164] INFO     {datahub.ingestion.run.pipeline:184} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
[2023-05-29 10:18:35,206] INFO     {datahub.ingestion.run.pipeline:201} - Source configured successfully.
[2023-05-29 10:18:35,207] INFO     {datahub.cli.ingest_cli:129} - Starting metadata ingestion
[2023-05-29 10:18:35,209] INFO     {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/06b7698c-048e-470e-bf2c-1ff4fca75bd0/ingestion_report.json' mode='w' encoding='UTF-8'>
[2023-05-29 10:18:35,209] INFO     {datahub.cli.ingest_cli:134} - Source (file) report:
{'events_produced': 0,
 'events_produced_per_sec': 0,
 'entities': {},
 'aspects': {},
 'warnings': {},
 'failures': {},
 'total_num_files': 0,
 'num_files_completed': 0,
 'files_completed': [],
 'percentage_completion': '0%',
 'estimated_time_to_completion_in_minutes': -1,
 'total_bytes_read_completed_files': 0,
 'total_parse_time_in_seconds': 0,
 'total_count_time_in_seconds': 0,
 'total_deserialize_time_in_seconds': 0,
 'aspect_counts': {},
 'entity_type_counts': {},
 'start_time': '2023-05-29 10:18:35.206188 (now)',
 'running_time': '0 seconds'}
[2023-05-29 10:18:35,210] INFO     {datahub.cli.ingest_cli:137} - Sink (datahub-rest) report:
{'total_records_written': 0,
 'records_written_per_second': 0,
 'warnings': [],
 'failures': [],
 'start_time': '2023-05-29 10:18:35.161225 (now)',
 'current_time': '2023-05-29 10:18:35.210294 (now)',
 'total_duration_in_seconds': 0.05,
 'gms_version': 'v0.10.3',
 'pending_requests': 0}
[2023-05-29 10:18:35,809] ERROR    {datahub.entrypoints:188} - Command failed: Failed to process /home/miquelp/datahub/file_onboarding/test_containers.json
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/datahub/entrypoints.py", line 175, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
    raise e
  File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
    return func(ctx, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
    loop.run_until_complete(run_func_check_upgrade(pipeline))
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 158, in run_func_check_upgrade
    ret = await the_one_future
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 149, in run_pipeline_async
    return await loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_to_completion
    raise e
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 132, in run_pipeline_to_completion
    pipeline.run()
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
    for wu in itertools.islice(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/file.py", line 196, in get_workunits
    for f in self.get_filenames():
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/file.py", line 193, in get_filenames
    raise Exception(f"Failed to process {self.config.path}")
Exception: Failed to process /home/miquelp/datahub/file_onboarding/test_containers.json
```
b
What's your recipe config? It seems like there was an error reading the file.
m
```yaml
source:
    type: file
    config:
        filename: /home/miquelp/datahub/file_onboarding/test_containers.json
```
I am executing it from the frontend (UI).
b
Oh, but your file is on localhost and not in the container, right?
You should try to ingest via the CLI, because UI ingestion can't read from localhost: the UI executor runs inside the DataHub container, so it only sees that container's filesystem. From a machine that can access the file, `datahub ingest run -c recipe.yml` (the same command shown in the logs above) should work.
m
Ok, I tried executing from the CLI on the server but got a 403, which I assume might be due to our proxy being in the middle or something. It worked after trying from a PC on the network. Is it planned to support uploading a file via the frontend or something similar? That might be interesting. Regardless of this, is there reference documentation on how the metadata must be described in the ingested files? It is not clear to me how to define a schema.table.column database-like object with descriptions at each level. Is this possible currently? Thanks! Pablo M.
b
There is an option to read files from URLs, so you could put the file in GitLab/GitHub. If you want to generate your own metadata file, you probably want to explore the Python SDK and use it instead (and write the output to a file).
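For example, a minimal sketch with the SDK, assuming the GMS address from the logs above; the platform, dataset name, and description are placeholders:

```python
# Minimal sketch: emit a table-level description with the DataHub Python SDK.
# Assumes `pip install acryl-datahub` and a reachable GMS (address taken from
# the logs above); the platform and dataset name are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")

mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="postgres", name="my_db.my_schema.my_table", env="PROD"),
    aspect=DatasetPropertiesClass(description="Table-level description"),
)
emitter.emit(mcp)
```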
g
@many-rocket-80549 From your logs it looks like you're using a slightly older version of the DataHub CLI (0.10.0.7). We have some explanation of how to ingest dataset schemas here: https://datahubproject.io/docs/generated/metamodel/entities/dataset/#schemas. As for generating the hierarchy with descriptions at each level: the levels are modeled as “containers” in DataHub, where each container can have a description. Most entity types (including both datasets and containers) can have parent containers, which you can use to build the hierarchy. Like xL said, I'd recommend using the Python SDK to generate these instead of writing the JSON by hand.
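For example, a hedged sketch of that hierarchy with a description at every level; all names, guids, and descriptions below are placeholders (container guids are normally derived from platform/database/schema keys, but plain strings keep the sketch short):

```python
# Hedged sketch: a database > schema > table hierarchy with a description at
# each level. The guids and names below are placeholders, not a real model.
from datahub.emitter.mce_builder import (
    make_container_urn,
    make_data_platform_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ContainerClass,
    ContainerPropertiesClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")

db_urn = make_container_urn("my-db-guid")          # "database" level
schema_urn = make_container_urn("my-schema-guid")  # "schema" level

# Each container carries its own description.
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=db_urn,
    aspect=ContainerPropertiesClass(name="my_db", description="Database-level description"),
))
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=schema_urn,
    aspect=ContainerPropertiesClass(name="my_schema", description="Schema-level description"),
))
# Nest the schema container inside the database container.
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=schema_urn,
    aspect=ContainerClass(container=db_urn),
))

# The table sits inside the schema container...
dataset_urn = make_dataset_urn(platform="postgres", name="my_db.my_schema.my_table", env="PROD")
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=ContainerClass(container=schema_urn),
))
# ...and column-level descriptions go on its schema fields.
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=SchemaMetadataClass(
        schemaName="my_db.my_schema.my_table",
        platform=make_data_platform_urn("postgres"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR",
                description="Column-level description",
            )
        ],
    ),
))
```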