# all-things-deployment
h
Question - Deployed Datahub via Helm charts on EKS Cluster - Do we need to expose gms also via ingress ? How do we do Ingestion ?
e
If you are trying to ingest from outside the cluster, you need to add ingress to gms as well
h
@early-lamp-41924 -I set up the Ingress for gms - but getting error -
```
{"exceptionClass":"com.linkedin.restli.server.RestLiServiceException","stackTrace":"com.linkedin.restli.server.RestLiServiceException [HTTP Status:404]\n\tat com.linkedin.restli.server.RestLiServiceException.fromThrowable(RestLiServiceException.java:315)\n\tat com.linkedin.restli.server.BaseRestLiServer.buildPreRoutingError(BaseRestLiServer.java:158)\n\tat com.linkedin.restli.server.RestRestLiServer.buildPreRoutingRestException(RestRestLiServer.java:203)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:177)\n\tat com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:164)\n\tat com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:120)\n\tat com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132)\n\tat com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)\n\tat com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)\n\tat com.linkedin.restli.server.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n\tat
```
e
what’s the request you are sending?
is this through ingest recipe?
are you able to curl to the gms endpoint?
h
This is via Airflow -
```
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='<ip>', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f85dc1dc810>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2021-10-05 18:47:37,345] {taskinstance.py:1551} INFO - Marking task as UP_FOR_RETRY. dag_id=datahub_mysql_ingest, task_id=ingest_from_mysql, execution_date=20211005T184735, start_date=20211005T184737, end_date=20211005T184737
[2021-10-05 18:47:37,448] {local_task_job.py:149} INFO - Task exited with return code 1
```
Able to curl:
```
curl -v https://<hostname>
*   Trying <ip>...
* TCP_NODELAY set
* Connected to <hostname> (<ip>) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
    CApath: none
```
e
^ @mammoth-bear-12532
Seems like Airflow is having trouble talking to gms
@handsome-football-66174 Can you try an actual curl like
```
curl --location --request POST 'http://<host-name>/entities?action=search' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
    "input": "*",
    "entity": "dataset",
    "start": 0,
    "count": 10
}'
```
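A rough Python equivalent of this curl check (hostname is a placeholder; `requests` is assumed to be installed) could look like:

```python
# Sketch of the same Rest.li search call from Python; GMS below is a
# placeholder for your gms ingress hostname, not a real endpoint.
GMS = "https://<gms-host>"

HEADERS = {
    "X-RestLi-Protocol-Version": "2.0.0",
    "Content-Type": "application/json",
}
PAYLOAD = {"input": "*", "entity": "dataset", "start": 0, "count": 10}


def build_search_request(base_url):
    """Return (url, headers, json body) for the /entities?action=search call."""
    return f"{base_url}/entities?action=search", HEADERS, PAYLOAD


if __name__ == "__main__":
    import requests  # third-party; only needed to actually send the request

    url, headers, body = build_search_request(GMS)
    resp = requests.post(url, headers=headers, json=body)
    print(resp.status_code, resp.json())
```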
h
Curl output:
```
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
```
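Rather than disabling verification (curl `-k`, or `verify=False` in Python), a client can trust the private CA explicitly. A minimal stdlib sketch, assuming a hypothetical bundle path and placeholder gms URL:

```python
import ssl
import urllib.request

# Hypothetical path to a PEM bundle containing the private root CA that
# signed the gms certificate; adjust for your environment.
CA_BUNDLE = "/etc/ssl/private-ca.pem"


def make_tls_context(cafile=None):
    """TLS context verifying against the given CA bundle (system CAs when None)."""
    return ssl.create_default_context(cafile=cafile)


def fetch_gms_config(gms_url, cafile=CA_BUNDLE):
    """GET <gms_url>/config, verifying the server cert against the private CA."""
    ctx = make_tls_context(cafile)
    with urllib.request.urlopen(f"{gms_url}/config", context=ctx) as resp:
        return resp.read()
```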
e
seems like this is the issue
h
Any ideas on how to resolve this ?
e
you are using acm to setup certs right?
can you confirm that it correctly covers this domain?
h
Yes dexter, using AWS Certificate Manager. The domain matches correctly.
a
add -kv to your curl and paste output please
curl -kv ...
h
```
curl -kv --location --request POST 'https://<hostname>/entities?action=search' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{ "input": "*", "entity": "dataset", "start": 0, "count": 10 }'
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 10.224.129.163...
* TCP_NODELAY set
* Connected to <hostname> (<ip>) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
    CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=<cn>
*  start date: Sep 22 16:20:11 2021 GMT
*  expire date: Oct 22 17:20:11 2022 GMT
*  issuer: C=US; O=<company>; OU=IHDP; ST=<state>; CN=<cn>; L=<location>
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7f9dd1008200)
> POST /entities?action=search HTTP/2
> Host: <hostname>
> User-Agent: curl/7.64.1
> Accept: */*
> X-RestLi-Protocol-Version: 2.0.0
> Content-Type: application/json
> Content-Length: 78
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
* We are completely uploaded and fine
< HTTP/2 200
< date: Wed, 06 Oct 2021 14:39:53 GMT
< content-type: application/json
< content-length: 114
< x-restli-protocol-version: 2.0.0
< server: Jetty(9.4.20.v20190813)
<
* Connection #0 to host <hostname> left intact
{"value":{"numEntities":0,"pageSize":10,"metadata":{"urns":[],"searchResultMetadatas":[]},"from":0,"entities":[]}}
* Closing connection 0
```
a
That looks good. From your Airflow output it appears you are hitting the Jetty port, requesting the /config endpoint on :8080. Are you pointing Airflow to the ingress ^?
h
Yes pointing to the ingress for gms
a
The ingress would be at https://host/config if you're calling it, no? From the logs it looks like it's hitting http://host:8080/config
e
config endpoint returns details that the ingestion library needs.
seems like curl is working ^
it's returning the correct result
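Since /config is the first thing the ingestion client hits, a quick sanity check on the sink URL catches both mistakes that show up in this thread (a malformed scheme and a stray :8080 when going through the ingress). A hypothetical helper, not part of DataHub:

```python
from urllib.parse import urlparse


def check_gms_url(url):
    """Return a list of likely problems with a datahub-rest server URL.

    Heuristic sketch: behind an HTTPS ingress the client should use
    https://<host> with no explicit port; :8080 is the in-cluster gms port.
    """
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("missing or malformed scheme (e.g. 'http//host')")
    if parsed.port == 8080:
        problems.append("explicit :8080 -- drop it when pointing at the ingress")
    return problems
```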
@handsome-football-66174 can you share the recipe you are using? Obfuscate just the domain if needed
h
Airflow connection: datahub_rest_default datahub_rest https://<gms-hostname> False False
Recipe:
```python
"""MySQL DataHub Ingest DAG

This example demonstrates how to ingest metadata from MySQL into DataHub
from within an Airflow DAG. Note that the DB connection configuration is
embedded within the code.
"""
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from datahub.ingestion.run.pipeline import Pipeline

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email": ["<myemail>"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=120),
}


def ingest_from_mysql():
    pipeline = Pipeline.create(
        # This configuration is analogous to a recipe configuration.
        {
            "source": {
                "type": "mysql",
                "config": {
                    "username": "<username>",
                    "password": "<>",
                    "database": "<dbname>",
                    "host_port": "<mysqlhost>:3306",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://<gms-host>:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()


with DAG(
    "datahub_mysql_ingest",
    default_args=default_args,
    description="An example DAG which ingests metadata from MySQL to DataHub",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    catchup=False,
) as dag:
    ingest_task = PythonOperator(
        task_id="ingest_from_mysql",
        python_callable=ingest_from_mysql,
    )
```
e
ah
can you remove
:8080
"config": {"server": "http://<gms-host>"},
h
😔 Missed it here !
let me update and try it out.
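For reference, the corrected sink block from this thread, assuming TLS terminates at the ingress (hostname is a placeholder):

```python
# Corrected datahub-rest sink: point at the HTTPS ingress hostname with
# no explicit :8080 -- that port is only reachable inside the cluster.
sink = {
    "type": "datahub-rest",
    "config": {"server": "https://<gms-hostname>"},
}
```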