# ingestion
c
Hello! I am trying to ingest a .csv from a bucket on a local S3 with DataHub deployed on K8s. I am running the ingestion from the CLI (v0.8.43.4) and can ping the S3 server from the machine which runs the CLI. I have a very simple recipe:
Copy code
source:
    type: s3
    config:
        path_specs:
            - include: "{s3-server}/{bucket-name}/*.csv"
        env: "PROD"
        profiling:
            enabled: false

# sink configs
sink:
    type: "datahub-rest"
    config:
        server: "http://localhost:8080"
The pipeline finishes successfully but generates no events, and the source and sink reports from the ingest are empty except for the timestamps. The only error I see is ERROR {logger:26} - Please set env variable SPARK_VERSION when I run with --debug, even though profiling is disabled in my recipe. Any advice on how to resolve this would be greatly appreciated! Thank you in advance!
h
Hi @chilly-potato-57465, what do you mean by local S3? Can you share the absolute S3 path of one of the CSV files that you are expecting to get ingested? Also, can you enable debug-level logs? That would definitely help in understanding what's happening.
Hey, can you also share the absolute S3 path of one of the CSV files that you are expecting to get ingested? My guess is that path_specs.include is not correctly set here. You'll need to replace
- include: "{s3-server}/{bucket-name}/*.csv"
with the actual values of your S3 server and bucket name.
c
Hi! Here it is
- include: "s3-test/DataHubDemo/sample-data.csv"
I thought it was suspicious not to specify a protocol or anything more than the server name, but I read that s3:// is only for AWS S3?
h
Can you try prefixing it with s3://? If you don't specify s3://, the source looks for a local folder named s3-test - which is not what we want.
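For reference, a sketch of the form the path spec eventually takes later in this thread: with s3://, the first path component is the bucket name, while the server itself ends up configured separately via aws_config (see below).
Copy code
path_specs:
    # the bucket name follows s3:// directly; the server endpoint is set in aws_config
    - include: "s3://DataHubDemo/sample-data.csv"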
c
I see. When I prefixed it I got: ValueError: aws_config not set. Cannot browse s3
h
Right, can you set aws_config? The AWS region at the least:
Copy code
config:
    ...other source configs...
    aws_config:
        aws_region: us-east-1
c
Only setting aws_config.aws_region leads to NoCredentialsError: Unable to locate credentials.
When setting:
Copy code
aws_region: ""
aws_access_key_id: "somesecret"
aws_secret_access_key: "longersecret"
the error is: ValueError: Invalid endpoint: https://s3..amazonaws.com
h
Please set aws_endpoint_url as well, equal to your S3 server URL.
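Putting these pieces together, a minimal sketch of the source config at this stage. us-east-1 is an arbitrary non-empty placeholder (the empty aws_region above is what produced the malformed https://s3..amazonaws.com endpoint, since boto builds the default endpoint as https://s3.<region>.amazonaws.com), and the endpoint URL is a hypothetical stand-in for the local S3 server:
Copy code
source:
    type: s3
    config:
        path_specs:
            - include: "s3://DataHubDemo/sample-data.csv"
        aws_config:
            aws_region: us-east-1                    # any non-empty region
            aws_access_key_id: "somesecret"
            aws_secret_access_key: "longersecret"
            aws_endpoint_url: "https://s3-test.example.local"  # hypothetical server URL
        profiling:
            enabled: false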
c
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)
When I have all the aws_config params we set above and change the include to only the bucket name and file:
- include: "DataHubDemo/sample-data.csv"
I am back to
Pipeline finished successfully ; produced 0 events
h
Hey - you'll have to specify s3:// if it's an S3-compatible source.
c
if I do that and have
- include: "<s3://DataHubDemo/sample-data.csv>"
I get:
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:997)
h
Just came across this - https://github.com/datahub-project/datahub/issues/5307 - it looks like the s3 source does not support self-signed certificates yet.
c
OK, I see. The last step in the proposed solution from 6 days ago is:
Copy code
To fix the error, a cafile would have to be given when creating the S3 client.
I am not sure which client this is - the DataHub S3 plugin?
h
Can you try setting the AWS_CA_BUNDLE environment variable, as described here - https://stackoverflow.com/questions/32946050/ssl-certificate-verify-failed-in-aws-cli ? The datahub s3 source uses boto under the hood, and boto supports the AWS_CA_BUNDLE env var for specifying custom certificates - https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
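A sketch of what that might look like when running the ingestion; the certificate path and recipe filename are hypothetical placeholders:
Copy code
# hypothetical path to the self-signed server CA certificate
export AWS_CA_BUNDLE=/path/to/selfsigned-ca.pem
datahub ingest -c recipe.yaml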
c
Will do, will get back to you when done! Huge thanks for the help!
h
Hey @chilly-potato-57465, did it work for you?
c
Hello @hundreds-photographer-13496, I have not followed up on this yet. I will let you know when I have the opportunity to try it out. Thanks!
Hi @hundreds-photographer-13496, I set the AWS_CA_BUNDLE variable on the machine which runs the DataHub CLI but still receive the same error. Should I set that variable somewhere else? Thanks!
h
It should have worked with the variable set to point to the server CA certificate on the machine which runs the DataHub CLI.
c
Hm, it didn't. Is there something else I can try in this case?
h
I would simply recommend double-checking whether the AWS_CA_BUNDLE variable is pointing to the correct certificate. I don't have anything else as of now, but will get back to you if anything changes 🙂
c
Hi @hundreds-photographer-13496, it is still not working, unfortunately. AWS_CA_BUNDLE points to the correct certificate. I have also installed the awscli to test, as a colleague had exactly the same issue (self-signed certificate). There I passed the certificate in various ways - setting the environment variable, using the --ca-bundle option when executing the command, and pointing to the certificate from the aws/config file - but no success 😞 I even installed the certificate in the Windows Trusted Root Certification Authorities store, but no success. I am now reaching out to our IT support to check the proxy settings. Other suggestions on how to proceed would be greatly appreciated!