brave-pencil-21289
06/14/2022, 7:00 AM
mysterious-nail-70388
06/14/2022, 7:26 AM
wonderful-quill-11255
06/14/2022, 7:29 AM
mysterious-nail-70388
06/14/2022, 7:35 AM
bright-cpu-56427
06/14/2022, 7:48 AM
rhythmic-flag-69887
06/14/2022, 8:45 AM
ERROR {datahub.entrypoints:165} - You seem to have connected to the frontend instead of the GMS endpoint. The rest emitter should connect to DataHub GMS (usually <datahub-gms-host>:8080) or Frontend GMS API (usually <frontend>:9002/api/gms)
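[Editor's note] For reference, the REST sink should point at the GMS endpoint named in the error, not the frontend. A minimal sketch (the hostname is a placeholder; quickstart exposes GMS on port 8080):

```yaml
sink:
  type: datahub-rest
  config:
    # GMS endpoint, not the frontend on :9002
    server: "http://datahub-gms-host:8080"
```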
Also, what am I to expect if I get dbt working? Will I see the lineage in DataHub similar to what dbt shows?
sparse-monitor-9160
06/14/2022, 12:32 PM
[2022-06-14 08:27:04,506] INFO {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38
[2022-06-14 08:27:10,903] ERROR {datahub.entrypoints:167} - Stackprinter failed while formatting <FrameInfo /usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py, line 270, scope SQLAlchemyConfig>:
File "/usr/local/lib/python3.9/site-packages/stackprinter/frame_formatting.py", line 225, in select_scope
raise Exception("Picked an invalid source context: %s" % info)
Exception: Picked an invalid source context: [270], [219], dict_keys([219, 220])
So here is your original traceback at least:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 106, in run
pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 202, in create
return cls(
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 149, in __init__
source_class = source_registry.get(source_type)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 126, in get
tp = self._ensure_not_lazy(key)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 84, in _ensure_not_lazy
plugin_class = import_path(path)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path
item = importlib.import_module(module_name)
File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/snowflake.py", line 29, in <module>
from datahub.ingestion.source.sql.sql_common import (
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 236, in <module>
class SQLAlchemyConfig(StatefulIngestionConfigBase):
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 270, in SQLAlchemyConfig
from datahub.ingestion.source.ge_data_profiler import GEProfilingConfig
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 12, in <module>
from great_expectations import __version__ as ge_version
File "/usr/local/lib/python3.9/site-packages/great_expectations/__init__.py", line 7, in <module>
from great_expectations.data_context import DataContext
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/__init__.py", line 1, in <module>
from great_expectations.data_context.data_context import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/data_context/__init__.py", line 1, in <module>
from great_expectations.data_context.data_context.base_data_context import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 20, in <module>
from great_expectations.core.config_peer import ConfigPeer
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/__init__.py", line 3, in <module>
from .expectation_suite import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/expectation_suite.py", line 10, in <module>
from great_expectations.core.evaluation_parameters import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/evaluation_parameters.py", line 27, in <module>
from great_expectations.core.util import convert_to_json_serializable
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/util.py", line 22, in <module>
from great_expectations.types import SerializableDictDot
File "/usr/local/lib/python3.9/site-packages/great_expectations/types/__init__.py", line 15, in <module>
import pyspark
File "/usr/local/lib/python3.9/site-packages/pyspark/__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 31, in <module>
from pyspark import accumulators
File "/usr/local/lib/python3.9/site-packages/pyspark/accumulators.py", line 97, in <module>
from pyspark.serializers import read_int, PickleSerializer
File "/usr/local/lib/python3.9/site-packages/pyspark/serializers.py", line 72, in <module>
from pyspark import cloudpickle
File "/usr/local/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 145, in <module>
_cell_set_template_code = _make_cell_set_template_code()
File "/usr/local/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: an integer is required (got type bytes)
[2022-06-14 08:27:10,904] INFO {datahub.entrypoints:176} - DataHub CLI version: 0.8.38 at /usr/local/lib/python3.9/site-packages/datahub/__init__.py
[2022-06-14 08:27:10,904] INFO {datahub.entrypoints:179} - Python version: 3.9.7 (default, Sep 3 2021, 12:36:14)
[Clang 11.0.0 (clang-1100.0.33.17)] at /usr/local/opt/python@3.9/bin/python3.9 on macOS-10.14.6-x86_64-i386-64bit
[2022-06-14 08:27:10,904] INFO {datahub.entrypoints:182} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.38', 'commit': '38718b59b358fc6c564ee982752bf2023533b224'}}, 'managedIngestion': {'defaultCliVersion': '0.8.38', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
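[Editor's note] The `TypeError: an integer is required (got type bytes)` in the traceback above is a known incompatibility: pyspark 2.x bundles an old cloudpickle that calls `types.CodeType` with the pre-3.8 argument order, which breaks on Python 3.8+. Upgrading pyspark to 3.x (or upgrading `acryl-datahub`/`great_expectations` so a newer pyspark is pulled in) typically resolves it. A minimal sketch of the version constraint; `pyspark_supports_python` is an illustrative helper, not a DataHub API:

```python
# Simplified compatibility rule behind the TypeError above:
# pyspark 2.x only works on Python < 3.8; pyspark 3.x supports 3.8+.
def pyspark_supports_python(pyspark_major: int, python_version: tuple) -> bool:
    return pyspark_major >= 3 or python_version < (3, 8)

print(pyspark_supports_python(2, (3, 9)))  # False: the combination in this log
print(pyspark_supports_python(3, (3, 9)))  # True
```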
Here is my YML (with sensitive data replaced by my_):
source:
  type: "snowflake"
  config:
    account_id: "my_account.us-east-1"
    warehouse: "sor_wh"
    username: "my_username"
    password: "my_password"
    role: "my_role"
    include_views: false
    include_table_lineage: false
    table_pattern:
      allow:
        - "temp_1"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
Here is my environment:
$ python -c "import platform; print(platform.platform())"
Darwin-18.7.0-x86_64-i386-64bit
$ python -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
2.7.16 (default, Jan 27 2020, 04:46:15)
[GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)]
/usr/bin/python
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named datahub
$ python3 -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
3.9.7 (default, Sep 3 2021, 12:36:14)
[Clang 11.0.0 (clang-1100.0.33.17)]
/usr/local/opt/python@3.9/bin/python3.9
/usr/local/lib/python3.9/site-packages/datahub/__init__.py
0.8.38
wooden-jackal-88380
06/14/2022, 1:09 PM
hundreds-pillow-5032
06/14/2022, 3:17 PM
modern-laptop-12942
06/14/2022, 4:32 PM
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1307, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 150, in execute
return_value = self.execute_callable()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 161, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/Test_ingestion_dag.py", line 34, in datahub_recipe
pipeline = Pipeline.create(config)
File "/home/airflow/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 150, in create
return cls(config, dry_run=dry_run, preview_mode=preview_mode)
File "/home/airflow/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 116, in __init__
self.source: Source = source_class.create(
File "/home/airflow/.local/lib/python3.9/site-packages/datahub/ingestion/source/sql/snowflake.py", line 182, in create
config = SnowflakeConfig.parse_obj(config_dict)
File "pydantic/main.py", line 511, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 4 validation errors for SnowflakeConfig
host_port
field required (type=value_error.missing)
account_id
extra fields not permitted (type=value_error.extra)
include_view_lineage
extra fields not permitted (type=value_error.extra)
upstream_lineage_in_report
extra fields not permitted (type=value_error.extra)
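[Editor's note] These validation errors suggest the Airflow worker runs an older `acryl-datahub` whose `SnowflakeConfig` predates `account_id` and the other rejected fields, while the CLI that succeeds is newer. Upgrading `acryl-datahub` inside the Airflow environment to match the CLI version is the cleaner fix; alternatively, on those older versions the account is supplied as `host_port`. A hedged sketch reusing the placeholder values from the recipe above (field names inferred from the errors, not verified against that exact release):

```yaml
source:
  type: snowflake
  config:
    # Older SnowflakeConfig takes the account here instead of account_id
    host_port: "my_account.us-east-1"
    warehouse: "sor_wh"
    username: "my_username"
    password: "my_password"
    role: "my_role"
```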
I use source.type: snowflake, and I can successfully ingest using the CLI with this recipe.
lemon-zoo-63387
06/15/2022, 1:21 AM
source:
  type: oracle
  config:
    host_port: '10.xxx.xx.xx4:1521'
    database: Qxxx
    username: dxxxxv
    password: Dxxxxm
sink:
  type: databub-rest
  config:
    server: 'http://localhost:8080'
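[Editor's note] The sink type in the recipe above is misspelled: `databub-rest` is not a registered sink, which would make the pipeline fail before reaching Oracle. The REST sink is registered as `datahub-rest`:

```yaml
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
```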
dry-doctor-17275
06/15/2022, 1:32 AM
adamant-sugar-28445
06/15/2022, 2:41 AM
datahub --help
Usage: datahub [OPTIONS] COMMAND [ARGS]...
Options:
--debug / --no-debug
--version Show the version and exit.
--help Show this message and exit.
Commands:
check Helper commands for checking various aspects of DataHub.
docker Helper commands for setting up and interacting with a local DataHub instance using Docker.
ingest Ingest metadata into DataHub.
version Print version number and exit.
bitter-toddler-42943
06/15/2022, 5:31 AM
high-family-71209
06/15/2022, 7:27 AM
adamant-sugar-28445
06/15/2022, 9:07 AM
type: sqlalchemy
config:
  platform: hive
  connect_uri: "XXXX"
  include_views: False
  table_pattern:
    allow:
      - "<databaseName>.<table>"
  schema_pattern:
    allow:
      - "<databaseName>"
    deny:
      - "<otherSchemas>"
  options:
    connect_args:
      auth: '<AuthStrategy>'
sink:
  type: "datahub-rest"
  config:
    server: "<serverURI>"
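[Editor's note] The `allow`/`deny` entries in `table_pattern` and `schema_pattern` are regular expressions, not globs. A simplified model of the filtering (assumption: this mirrors the semantics of DataHub's `AllowDenyPattern`; `passes` is an illustrative helper, not a DataHub API):

```python
import re

# A name is kept if it matches at least one allow regex and no deny regex.
def passes(name: str, allow, deny=()) -> bool:
    return any(re.match(p, name) for p in allow) and not any(
        re.match(p, name) for p in deny
    )

print(passes("sales.orders", allow=[r"sales\..*"], deny=[r".*\.tmp_.*"]))      # True
print(passes("sales.tmp_orders", allow=[r"sales\..*"], deny=[r".*\.tmp_.*"]))  # False
```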
astonishing-dusk-99990
06/15/2022, 9:33 AM
brave-pencil-21289
06/15/2022, 11:53 AM
some-kangaroo-13734
06/15/2022, 12:18 PM
source:
  type: bigquery
  config:
    project_id: A
    use_exported_bigquery_audit_metadata: false
    profiling:
      enabled: false
    credential:
      project_id: B
      private_key_id: '${GCP_PRIVATE_KEY_ID}'
      private_key: '${GCP_PRIVATE_KEY}'
      client_id: '${GCP_CLIENT_ID}'
      client_email: '${GCP_CLIENT_EMAIL}'
    domain:
      foo:
        allow:
          - 'A\..*'
sink:
  type: datahub-rest
  config:
    server: 'https://xxx/api/gms'
    token: '${GMS_TOKEN}'
Error:
'Forbidden: 403 POST https://bigquery.googleapis.com/bigquery/v2/projects/A/jobs?prettyPrint=false: Access Denied: Project '
'A: User does not have bigquery.jobs.create permission in project A.\n'
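[Editor's note] The 403 means the service account from project B lacks permission to create query jobs in project A. One likely fix (assuming `gcloud` access to project A; `SA_EMAIL` is a placeholder for the credential's `client_email`) is granting `roles/bigquery.jobUser`, the predefined role containing `bigquery.jobs.create`:

```shell
gcloud projects add-iam-policy-binding A \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/bigquery.jobUser"
```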
cuddly-arm-8412
06/15/2022, 4:29 PM
salmon-area-51650
06/15/2022, 5:09 PM
dbt ingestion. The cronjob which is executing the integration was OK (no errors), but I cannot see the dbt platform in the UI.
This is the configuration of the job:
source:
  type: "dbt"
  config:
    # Coordinates
    manifest_path: "s3://bucket/manifest.json"
    catalog_path: "s3://bucket/catalog.json"
    sources_path: "s3://bucket/sources.json"
    aws_connection:
      aws_region: "eu-west-2"
    # Options
    target_platform: "snowflake"
    load_schemas: True # note: if this is disabled
    env: STG
sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-datahub-gms:8080"
And this is the output of the job:
'failures': {},
'cli_version': '0.0.0+docker.b4bf1d4',
'cli_entry_location': '/usr/local/lib/python3.8/site-packages/datahub/__init__.py',
'py_version': '3.8.13 (default, May 28 2022, 14:23:53) \n[GCC 10.2.1 20210110]',
'py_exec_path': '/usr/local/bin/python',
'os_details': '...-glibc2.2.5',
'soft_deleted_stale_entities': []}
Sink (datahub-rest) report:
{'records_written': 2048,
'warnings': [],
'failures': [],
'downstream_start_time': datetime.datetime(2022, 6, 15, 16, 28, 19, 669215),
'downstream_end_time': datetime.datetime(2022, 6, 15, 16, 29, 23, 420550),
'downstream_total_latency_in_seconds': 63.751335,
'gms_version': 'v0.8.38'}
Pipeline finished with 1061 warnings in source producing 2048 workunits
And there is no `dbt` platform in the UI.
Any idea?
Thanks in advance!
mysterious-lamp-91034
06/15/2022, 6:25 PM
Which db is the datasetProfile aspect physically in?
I am seeing the dataset stats in the UI. I don't see them in the db:
mysql> select * from metadata_aspect_v2 where aspect='datasetProfile'\G
Empty set (0.00 sec)
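[Editor's note] `datasetProfile` is a timeseries aspect, and in standard DataHub deployments timeseries aspects are stored in Elasticsearch rather than in MySQL's `metadata_aspect_v2` table, which is why the query above returns an empty set. A sketch of checking Elasticsearch directly (the index name follows the usual naming scheme but is an assumption, as is quickstart's Elasticsearch on localhost:9200):

```shell
curl 'http://localhost:9200/dataset_datasetprofileaspect_v1/_search?size=1&pretty'
```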
numerous-bird-27004
06/15/2022, 8:00 PM
dry-zoo-35797
06/15/2022, 11:09 PM
lemon-zoo-63387
06/16/2022, 1:20 AM
python3 -m datahub docker nuke
For example, only delete Oracle in the dataset
wonderful-egg-79350
06/16/2022, 5:55 AM
source:
  type: bigquery
  config:
    project_id: "my-project-id"
    options:
      credentials_path: "./gcp-credential.json"
    table_pattern:
      # Allow only one table
      allow:
        - "my_dataset.my_table"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
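[Editor's note] Rather than pointing at a credentials file on disk, secrets can also be injected through environment variables, as in the earlier BigQuery recipe's `${GCP_PRIVATE_KEY_ID}` and `${GMS_TOKEN}` references: DataHub expands such references when loading the recipe. A minimal model of that expansion (the env var value here is hypothetical; the loader's behavior is comparable to `os.path.expandvars`):

```python
import os

# Hypothetical secret value, set only for illustration.
os.environ["GMS_TOKEN"] = "example-token"

# Recipes reference it as ${GMS_TOKEN}; expansion happens before parsing.
line = "token: '${GMS_TOKEN}'"
print(os.path.expandvars(line))  # token: 'example-token'
```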
square-lawyer-36076
06/16/2022, 3:00 PM
steep-midnight-37232
06/16/2022, 3:52 PM
bulky-jackal-3422
06/16/2022, 4:54 PM
billions-morning-53195
06/16/2022, 5:43 PM
WARNING: AWS Glue Schema Registry DOES NOT have a python SDK. As such, python-based libraries like ingestion or datahub-actions (UI ingestion) are not supported when using AWS Glue Schema Registry