# troubleshoot
l
Trying to test out an S3 data lake with a local Docker deployment and am getting the error:
'[2022-04-29 02:44:40,288] ERROR    {logger:26} - Please set env variable SPARK_VERSION\n'
I am just having trouble figuring out where this env variable is or how to change it. Thanks
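For what it's worth, one way to pass this variable through, assuming the quickstart docker-compose deployment and that UI-based ingestion runs in the datahub-actions service (both the service name and the 3.0 value are assumptions here), is a compose override like the sketch below; for CLI runs, exporting SPARK_VERSION in the shell before running datahub ingest would be the equivalent.
# docker-compose.override.yml - a minimal sketch, not a verified fix
# (the service name and the SPARK_VERSION value are assumptions)
version: '3.8'
services:
  datahub-actions:
    environment:
      - SPARK_VERSION=3.0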
e
Hey @limited-agent-54038, sorry to hear that. Where exactly are you getting this error?
is this when you’re running ingestion?
l
correct ^
e
would you mind running the command
datahub check plugins
and telling me what the output of that is?
l
Sources:
[2022-04-29 09:17:51,592] ERROR  {logger:26} - Please set env variable SPARK_VERSION
athena     (disabled)
azure-ad
bigquery    (disabled)
bigquery-usage (disabled)
clickhouse   (disabled)
clickhouse-usage (disabled)
data-lake
datahub-business-glossary
datahub-lineage-file
dbt
druid     (disabled)
elasticsearch (disabled)
feast
file
glue
hive      (disabled)
kafka     (disabled)
kafka-connect (disabled)
ldap      (disabled)
looker     (disabled)
lookml     (disabled)
mariadb    (disabled)
metabase    (disabled)
mode      (disabled)
mongodb    (disabled)
mssql     (disabled)
mysql     (disabled)
nifi      (disabled)
okta      (disabled)
openapi
oracle     (disabled)
postgres    (disabled)
powerbi    (disabled)
redash     (disabled)
redshift    (disabled)
redshift-usage (disabled)
sagemaker
snowflake   (disabled)
snowflake-usage (disabled)
sqlalchemy   (disabled)
starburst-trino-usage (disabled)
superset    (disabled)
tableau    (disabled)
trino     (disabled)

Sinks:
console
datahub-kafka (disabled)
datahub-rest
file

Transformers:
add_dataset_ownership
add_dataset_properties
add_dataset_tags
add_dataset_terms
mark_dataset_status
pattern_add_dataset_ownership
pattern_add_dataset_tags
pattern_add_dataset_terms
set_dataset_browse_path
simple_add_dataset_ownership
simple_add_dataset_properties
simple_add_dataset_tags
simple_add_dataset_terms
simple_remove_dataset_ownership
e
So you're using the data-lake connector, right? I think that error is likely a red herring
To confirm, what does your stack trace look like when you run ingestion?
l
This was a custom connector; let me just refresh the Docker containers and see if anything changes
e
Sounds good, please let me know!
l
Still getting this error:
'[2022-04-29 17:10:00,004] ERROR    {logger:26} - Please set env variable SPARK_VERSION\n'
I am using a custom connector (I don't see anything for data lakes, i.e. AWS S3 data lake files)
source:
  type: s3
  config:
    path_spec:
      include: 's3://***/SDATA-1511/test_scenario_03/*'
    aws_config:
      aws_access_key_id: **
      aws_secret_access_key: *
      aws_region: us-west-2
    env: DEV
    profiling:
      enabled: false
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
e
We do actually have an S3 data lake connector
It might be easier for you to try that out?
l
should that be showing up here? do I need to update/refresh the instance once I install a new ingestion source?
e
Ah, so the sources we show in UI-based ingestion are a subset of all the sources that are available
This is a source that you would have to use the datahub CLI tool to run
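As a rough sketch (not a recipe from this thread), an s3 recipe run with the CLI via datahub ingest -c recipe.yml might look like the following; the bucket, prefix, and credentials are placeholders:
# recipe.yml - a sketch only; bucket, prefix, and credentials are placeholders
source:
  type: s3
  config:
    path_spec:
      include: 's3://<your-bucket>/<prefix>/*.*'
    aws_config:
      aws_access_key_id: '<access key id>'
      aws_secret_access_key: '<secret key>'
      aws_region: us-west-2
    env: DEV
    profiling:
      enabled: false    # profiling pulls in Spark/Java, so it is left off in this sketch
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'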
l
alright - I will try that out; the s3 ingestion does not show up when I run check plugins either
@echoing-airport-49548 just to confirm, I can only do this ingestion from the CLI and it will show up in the UI once it's done? do I need to make a dev instance of it, or can I just do this using the quickstart version?
e
that’s right! you should be able to do it using the quickstart version
l
kk thanks, I will try to fix those bugs and focus on that side
e
sounds good, let me know!
l
I am getting the entrypoints error, which I think means there is something with authorization that I need to figure out
I feel like I am so close and once I get one I will be set haha
sorry @echoing-airport-49548, getting frustrated because I have no idea what I am doing wrong - I'm getting this error:
ERROR  {datahub.entrypoints:152} - File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/datahub/entrypoints.py", line 138, in main
and there are no clear instructions or guides on how to fix this
e
no worries Jake, would you mind sharing your entire stack trace?
l
Here is the yaml I am trying:
source:
  type: "s3"
  config:
    platform: s3
        path_spec:
            include: 's3://***********/SDATA-1511/test_scenario_03/*.*'
        aws_config:
            aws_access_key_id: **********
            aws_secret_access_key: *******
            aws_region: us-west-2


sink:
    type: datahub-rest
    config:
        server: 'http://localhost:9002/api/gms'
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/file for complete documentation
from what I can tell, it looks like some authentication issue with DataHub...? but it doesn't make sense to me and I can't find any guide on how to fix it
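If metadata service authentication really were enabled on the GMS (it is off by default in the quickstart, so it is probably not the culprit here), the datahub-rest sink can also be given a token; a minimal sketch with placeholder values:
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: '<personal access token>'    # placeholder; only needed when metadata service auth is enabled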
e
ah
is your gms at localhost:8080?
l
yes
e
your config needs to look like the following
l
datahub-gms container is localhost:8080
e
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
instead of 9002
l
alright - still the same error message, no changes
e
is it the exact same? the error message looks like an issue with your yaml
l
yep, the exact same error message and full thread
e
Oh interesting, did changing the sink like so help?
l
well my sink is localhost:8080 based on what I see in Docker (this is all locally deployed). This did not help, but I didn't know what the auth API layer was or if that is the issue
Do you know of any super basic config test that I can do with public data so I can try it out?
e
Can you try ingesting sample data following that guide?
l
yep, that works perfectly fine
Even if I try to ingest a file, I am getting the error, so it has something to do with authenticating to the API...?
I changed it from REST sink to a console sink and I am getting the same error, so this must be a bigger issue. I am on v0.8.30.0
e
could you try upgrading to v0.8.33?
l
thanks for the help earlier, I was able to track down some of the bugs/errors, but I'm still getting this:
'[2022-04-29 23:53:10,704] INFO     {datahub.entrypoints:161} - DataHub CLI version: 0.8.32.1 at '
           '/tmp/datahub/ingest/venv-1d251598-bc7e-4509-9629-67c3faae0601/lib/python3.9/site-packages/datahub/__init__.py\n'
           '[2022-04-29 23:53:10,705] INFO     {datahub.entrypoints:164} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) \n'
           '[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-1d251598-bc7e-4509-9629-67c3faae0601/bin/python3 on '
           'Linux-5.10.76-linuxkit-x86_64-with-glibc2.31\n'
           "[2022-04-29 23:53:10,705] INFO     {datahub.entrypoints:167} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': "
           "'v0.8.33', 'commit': 'c34a1ba73520a9f646b21540b046d1a38441b2a2'}}, 'managedIngestion': {'defaultCliVersion': '0.8.32.1', 'enabled': "
           "True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, "
           "'datasetUrnNameCasing': False, 'retention': 'true', 'noCode': 'true'}\n",

           
           "2022-04-29 23:53:11.390812 [exec_id=1d251598-bc7e-4509-9629-67c3faae0601] INFO: Failed to execute 'datahub ingest'",
           '2022-04-29 23:53:11.392091 [exec_id=1d251598-bc7e-4509-9629-67c3faae0601] INFO: Caught exception EXECUTING '
           'task_id=1d251598-bc7e-4509-9629-67c3faae0601, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
           '    self.event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
           '    return f.result()\n'
           '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
           '    raise self._exception\n'
           '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
           '    result = coro.send(None)\n'
           '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Execution finished with errors.
The GMS config comes back and I can hit that server easily. When I run it locally to output to the console it works, but it doesn't when I try to export it to the server
e
Hey Jake, just to confirm, did you test it out once with
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-gms:8080'
I know you can hit your GMS at localhost:8080 from your machine, but inside a container localhost refers to the container itself, so I just want to eliminate any variables here!
l
I get different errors with the two options. The only one that actually returns a GMS value is 'http://datahub-gms:8080' - here is the error message:
'---- (full traceback above) ----\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 95, in '
           'run\n'
           '    pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line '
           '184, in create\n'
           '    return cls(\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line '
           '132, in __init__\n'
           '    self.source: Source = source_class.create(\n'
           'File '
           '"/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/datahub/ingestion/source/data_lake/__init__.py", '
           'line 252, in create\n'
           '    return cls(config, ctx)\n'
           'File '
           '"/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/datahub/ingestion/source/data_lake/__init__.py", '
           'line 176, in __init__\n'
           '    self.init_spark()\n'
           'File '
           '"/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/datahub/ingestion/source/data_lake/__init__.py", '
           'line 246, in init_spark\n'
           '    self.spark = SparkSession.builder.config(conf=conf).getOrCreate()\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/pyspark/sql/session.py", line 186, in '
           'getOrCreate\n'
           '    sc = SparkContext.getOrCreate(sparkConf)\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/pyspark/context.py", line 378, in '
           'getOrCreate\n'
           '    SparkContext(conf=conf or SparkConf())\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/pyspark/context.py", line 133, in '
           '__init__\n'
           '    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/pyspark/context.py", line 327, in '
           '_ensure_initialized\n'
           '    SparkContext._gateway = gateway or launch_gateway(conf)\n'
           'File "/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/pyspark/java_gateway.py", line 105, in '
           'launch_gateway\n'
           '    raise Exception("Java gateway process exited before sending its port number")\n'
           '\n'
           'Exception: Java gateway process exited before sending its port number\n'
           '[2022-05-04 03:37:52,868] INFO     {datahub.entrypoints:161} - DataHub CLI version: 0.8.32.1 at '
           '/tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/lib/python3.9/site-packages/datahub/__init__.py\n'
           '[2022-05-04 03:37:52,868] INFO     {datahub.entrypoints:164} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) \n'
           '[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-f5791ce2-9915-424a-8798-3860928ab87a/bin/python3 on '
           'Linux-5.10.76-linuxkit-x86_64-with-glibc2.31\n'
           "[2022-05-04 03:37:52,868] INFO     {datahub.entrypoints:167} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': "
           "'v0.8.33', 'commit': 'c34a1ba73520a9f646b21540b046d1a38441b2a2'}}, 'managedIngestion': {'defaultCliVersion': '0.8.32.1', 'enabled': "
           "True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, "
           "'datasetUrnNameCasing': False, 'retention': 'true', 'noCode': 'true'}\n",
           "2022-05-04 03:37:53.532891 [exec_id=f5791ce2-9915-424a-8798-3860928ab87a] INFO: Failed to execute 'datahub ingest'",
           '2022-05-04 03:37:53.536504 [exec_id=f5791ce2-9915-424a-8798-3860928ab87a] INFO: Caught exception EXECUTING '
           'task_id=f5791ce2-9915-424a-8798-3860928ab87a, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
           '    self.event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
           '    return f.result()\n'
           '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
           '    raise self._exception\n'
           '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
           '    result = coro.send(None)\n'
           '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Execution finished with errors.
g
Hi @limited-agent-54038, I'm also trying the same thing and facing a similar issue. Just wanted to know whether you were able to find a fix for it
l
Hi Tejas - I did not pursue this further after my messages, got too much work on my plate
g
Thanks Jake for the response
l
I spent too long trying to figure it out haha. If you get something working, that would be huge 🙂 good luck tho