# troubleshoot
l
Hey everyone! A question about AWS S3 (it's my first day trying out DataHub). Is it possible to ingest files that don't have an extension? The ingestion is reported as "successful", but 0 events are created. I suspect it's because I specify something like `s3://bucket/pref/pref/*/*` in `source.config.path_specs.include`. I understand it expects something like `s3://.../*.*`, but that won't match the pattern of my files. Am I missing something?
d
In your path_spec you can specify `default_extension`; files without an extension will then be treated as the specified file type. If it is not set, files without extensions are skipped.
For example:
path_specs:
      - include: "s3://mypath/{table}/{partition_key[0]}/{partition_key[1]}/{partition_key[2]}/*"
        default_extension: csv
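For context, here is a rough sketch of how that `path_specs` entry might sit inside a complete recipe. The include path is the one from the question and the sink address is the one that appears in the run logs in this thread; the `aws_region` value is a placeholder assumption, `csv` is only an example file type, and credentials are assumed to come from the environment (for example an instance role) rather than from the recipe itself.

```yaml
# Sketch of an S3 ingestion recipe; values marked "assumption" are placeholders.
source:
  type: s3
  config:
    path_specs:
      # Matches every object two levels below the prefix, with or without an extension.
      - include: "s3://bucket/pref/pref/*/*"
        # Files without an extension are treated as this type instead of being skipped.
        default_extension: csv
    aws_config:
      aws_region: us-east-1  # assumption: replace with the bucket's region
sink:
  type: datahub-rest
  config:
    server: "http://datahub-datahub-gms:8080"  # GMS address taken from the run logs
```

Saved as `recipe.yml` (a hypothetical file name), this could be run from the CLI with `datahub --debug ingest run -c recipe.yml` to get debug output for the run.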
l
Thanks! That did solve the error, but now it's just running "successfully" without creating any new records.
d
Can you share the logs with me? You can also try running with debug logs to see what is happening:
datahub --debug ingest ...
l
I'm doing this in the UI; here are the logs from the run:
~~~~ Execution Summary ~~~~

RUN_INGEST - {'errors': [],
 'exec_id': '554972bc-e60d-4cc2-8210-0efff4115a3c',
 'infos': ['2022-11-20 08:04:09.274310 [exec_id=554972bc-e60d-4cc2-8210-0efff4115a3c] INFO: Starting execution for task with name=RUN_INGEST',
           '2022-11-20 08:04:13.342947 [exec_id=554972bc-e60d-4cc2-8210-0efff4115a3c] INFO: stdout=venv setup time = 0\n'
           'This version of datahub supports report-to functionality\n'
           'datahub  ingest run -c /tmp/datahub/ingest/554972bc-e60d-4cc2-8210-0efff4115a3c/recipe.yml --report-to '
           '/tmp/datahub/ingest/554972bc-e60d-4cc2-8210-0efff4115a3c/ingestion_report.json\n'
           '[2022-11-20 08:04:11,178] INFO     {datahub.cli.ingest_cli:177} - DataHub CLI version: 0.8.43.5\n'
           '[2022-11-20 08:04:11,208] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
           'to talk to <http://datahub-datahub-gms:8080>\n'
           '[2022-11-20 08:04:11,492] ERROR    {logger:26} - Please set env variable SPARK_VERSION\n'
           '[2022-11-20 08:04:12,194] INFO     {datahub.cli.ingest_cli:127} - Starting metadata ingestion\n'
           '[2022-11-20 08:04:12,196] INFO     {datahub.ingestion.reporting.file_reporter:54} - Wrote SUCCESS report successfully to '
           "<_io.TextIOWrapper name='/tmp/datahub/ingest/554972bc-e60d-4cc2-8210-0efff4115a3c/ingestion_report.json' mode='w' encoding='UTF-8'>\n"
           '[2022-11-20 08:04:12,196] INFO     {datahub.cli.ingest_cli:145} - Finished metadata ingestion\n'
           '\n'
           'Cli report:\n'
           "{'cli_version': '0.8.43.5',\n"
           " 'cli_entry_location': '/usr/local/lib/python3.10/site-packages/datahub/__init__.py',\n"
           " 'py_version': '3.10.7 (main, Sep 13 2022, 14:31:33) [GCC 10.2.1 20210110]',\n"
           " 'py_exec_path': '/usr/local/bin/python',\n"
           " 'os_details': 'Linux-5.4.181-99.354.amzn2.x86_64-x86_64-with-glibc2.31'}\n"
           'Source (s3) report:\n'
           "{'events_produced': '0',\n"
           " 'events_produced_per_sec': '0',\n"
           " 'event_ids': [],\n"
           " 'warnings': {},\n"
           " 'failures': {},\n"
           " 'filtered': [],\n"
           " 'start_time': '2022-11-20 08:04:11.853852',\n"
           " 'running_time_in_seconds': '0',\n"
           " 'read_rate': '0'}\n"
           'Sink (datahub-rest) report:\n'
           "{'total_records_written': '0',\n"
           " 'records_written_per_second': '0',\n"
           " 'warnings': [],\n"
           " 'failures': [],\n"
           " 'start_time': '2022-11-20 08:04:10.574181',\n"
           " 'current_time': '2022-11-20 08:04:12.262083',\n"
           " 'total_duration_in_seconds': '1.69',\n"
           " 'gms_version': 'v0.9.2',\n"
           " 'pending_requests': '0'}\n"
           '\n'
           ' Pipeline finished successfully ; produced 0 events\n',
           "2022-11-20 08:04:13.343114 [exec_id=554972bc-e60d-4cc2-8210-0efff4115a3c] INFO: Successfully executed 'datahub ingest'"],
 'structured_report': '{"source": {"type": "s3", "report": {"events_produced": "0", "events_produced_per_sec": "0", "event_ids": [], "warnings": {}, '
                      '"failures": {}, "filtered": [], "start_time": "2022-11-20 08:04:11.853852", "running_time_in_seconds": "0", "read_rate": '
                      '"0"}}, "sink": {"type": "datahub-rest", "report": {"total_records_written": "0", "records_written_per_second": "0", '
                      '"warnings": [], "failures": [], "start_time": "2022-11-20 08:04:10.574181", "current_time": "2022-11-20 08:04:12.195627", '
                      '"total_duration_in_seconds": "1.62", "gms_version": "v0.9.2", "pending_requests": "0"}}}'}
Execution finished successfully!
Well, now it works! I previously tried from the UI, but I took your suggestion of running with debug from the CLI, and it works. Thanks!