# ingestion

  • hallowed-analyst-96384

    02/14/2022, 7:52 PM
    Hi friends! Sorry, I'm a real newbie to data catalogs and especially DataHub, but I really need your help: we have a project that downloads and collects files from FTP/SFTP servers, moves them to GCS after some transformations, and finally sends them to HDFS. The whole process is also recorded in a Postgres database. I managed to ingest metadata from Postgres into our DataHub on Kubernetes, but I think it's not the right architecture. Here is the end state I need: to see in Lineage how data moved from the FTP/SFTP servers to GCS and later to HDFS. The problem is that I still don't understand how exactly lineage is created, whether it happens after ingestion, during it, or automatically. I have seen examples of lineage code, but I still can't quite understand how/where to implement it in our project.
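    A note on mechanics: lineage for sources like FTP/SFTP, GCS, and HDFS is not inferred automatically; it is typically pushed from your own pipeline code with the Python emitter, after (or as part of) each run. A minimal sketch, assuming a reachable GMS at localhost:8080 and purely illustrative dataset names:

    ```python
    # Sketch: emit FTP/SFTP -> GCS -> HDFS dataset lineage from the pipeline itself.
    # Platform names, paths, and the GMS address are illustrative assumptions.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    ftp_urn = builder.make_dataset_urn("file", "sftp.example.com/incoming/orders.csv", "PROD")
    gcs_urn = builder.make_dataset_urn("gcs", "my-bucket/staging/orders", "PROD")
    hdfs_urn = builder.make_dataset_urn("hdfs", "/warehouse/orders", "PROD")

    # One lineage edge per hop: FTP/SFTP -> GCS, then GCS -> HDFS.
    emitter.emit_mce(builder.make_lineage_mce([ftp_urn], gcs_urn))
    emitter.emit_mce(builder.make_lineage_mce([gcs_urn], hdfs_urn))
    ```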
  • adorable-flower-19656

    02/15/2022, 3:08 AM
    Hello, I'd like to apply DataHub to my BigQuery, which has about 7000 tables. When ingestion succeeds, it takes about 1-2 hours, but sometimes it fails with "NoSuchTableError". It seems to have something to do with a table being dropped while the ingestion is running, but I'm not sure; tables are created and deleted frequently. My questions are: 1. If a table is deleted during ingestion, does the ingestion fail? 2. If so, can I ignore the deleted tables and complete the ingestion?
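    If the short-lived tables follow a naming convention, one workaround is to exclude them from the run so that a table dropped mid-ingestion cannot fail it. A sketch using the programmatic Pipeline API; the project id and the `tmp_` prefix are assumptions:

    ```python
    # Sketch: BigQuery ingestion that skips tables likely to be dropped mid-run.
    # The project id and the tmp_ naming convention are assumptions.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": "my-project",  # hypothetical
                    "table_pattern": {"deny": [".*\\.tmp_.*"]},  # exclude volatile tables
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
    ```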
  • stocky-midnight-78204

    02/15/2022, 5:03 AM
    I tried to integrate DataHub with Spark. I am able to get the task lineage, but the Spark job gets stuck and keeps running.
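    For reference, the Spark integration works by attaching the DataHub listener to the session, and the application only ends once the session is stopped explicitly. A PySpark sketch, where the package version and GMS URL are assumptions:

    ```python
    # Sketch: attach the DataHub lineage listener to a Spark session.
    # The package version and the GMS address are assumptions; adjust for your setup.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("my-job")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.26")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        .getOrCreate()
    )

    # ... run the job as usual ...

    spark.stop()  # without this the application keeps running after the job finishes
    ```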
  • mysterious-nail-70388

    02/15/2022, 6:34 AM
    Hello, I found that the DataHub version installed with pip was 0.8.24. The metadata I retrieved with it was different from what I got from the same database after downloading the source code and building ingestion myself. How did this happen? Metadata retrieved via the pip installation is missing some pieces. The first image is the metadata obtained with the pip install, and the second and third images are the metadata obtained with the 0.8.26 version I built myself.
  • mysterious-nail-70388

    02/15/2022, 7:31 AM
    Hello, how can I calculate the storage size of a Hive partitioned table's data? Is this available through profiling?
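    Worth noting: profiling on the SQL-based sources collects row counts and column statistics rather than on-disk size, so storage capacity may still have to come from Hive itself. A sketch of enabling profiling on the Hive source, with connection details as assumptions:

    ```python
    # Sketch: enable profiling for the Hive source (row counts / column stats per table).
    # The host/port value is a placeholder.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",  # hypothetical
                    "profiling": {"enabled": True},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    ```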
  • few-air-56117

    02/15/2022, 9:21 AM
    Hi guys, is it possible to ingest data via the UI without specifying the GMS IP? The reason I ask is that I run DataHub on k8s and I don't want to expose the GMS IP publicly.
  • ambitious-guitar-89068

    02/15/2022, 9:23 AM
    Hello, is there a NiFi -> DataHub ingestion expert here who can help with this issue (the message says this processor type is not supported, but is there a way out?):
    Copy code
    Dropping Nifi Processor of type org.apache.nifi.processors.slack.PutSlack, id 017718bd-4edc-1e55-534e-ca304519ef4b, name PutSlack from lineage view. This is likely an Ingress or Egress node which may be reading to/writing from external datasets. However not currently supported in datahub
  • billions-receptionist-60247

    02/15/2022, 10:05 AM
    Hi, any idea why I'm getting this error?
    Copy code
    {'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'message': 'java.lang.RuntimeException: Unknown aspect container for entity dataset',
  • hallowed-gpu-49827

    02/15/2022, 10:36 AM
    Hello folks, in this article it's stated that there's an endpoint on the frontend that redirects to GMS and can be used to ingest data with authentication, but I can't find the endpoint address anywhere…
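    For context, recent versions expose a proxy on the frontend (commonly /api/gms on port 9002) that forwards requests to GMS together with a personal access token; the exact path is worth verifying for your version. A sketch under that assumption:

    ```python
    # Sketch: emit through the frontend proxy instead of hitting GMS directly.
    # The /api/gms path, port 9002, hostname, and token are assumptions for illustration.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="http://datahub-frontend.example.com:9002/api/gms",  # hypothetical host
        token="<personal-access-token>",
    )
    emitter.test_connection()
    ```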
  • strong-kite-83354

    02/15/2022, 2:56 PM
    Hello - I'm trying out the new (and in beta) data-lake ingestion source on a file on my Windows laptop. As far as I can tell it very nearly works, but the push to the DataHub GMS fails with "Unable to emit metadata to DataHub GMS". Looking down the sizeable info message, the core of the problem seems to be: 'Caused by: com.linkedin.data.template.TemplateOutputCastException: Invalid URN syntax: Invalid URN Parameter: No enum constant com.linkedin.common.FabricType.prod: urn:li:dataset:(urn:li:dataPlatform:local-data-lake,Indices-2021-03,prod)'
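    The root cause here is the environment value: FabricType only accepts the upper-case names, so a lower-case "prod" in the URN fails. A quick check of the valid constants and a correctly built URN (reusing the dataset name from the error):

    ```python
    # The env/fabric part of a dataset URN must be an upper-case FabricType constant
    # (PROD, DEV, QA, ...); lower-case "prod" triggers the enum error above.
    import datahub.emitter.mce_builder as builder
    from datahub.metadata.schema_classes import FabricTypeClass

    print(FabricTypeClass.PROD, FabricTypeClass.DEV, FabricTypeClass.QA)

    urn = builder.make_dataset_urn("local-data-lake", "Indices-2021-03", env="PROD")
    print(urn)  # urn:li:dataset:(urn:li:dataPlatform:local-data-lake,Indices-2021-03,PROD)
    ```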
  • nutritious-egg-28432

    02/15/2022, 4:07 PM
    Hello guys, can we connect DataHub with DB2?
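    There is no dedicated DB2 source, but the generic sqlalchemy source can usually be pointed at any database that has a SQLAlchemy dialect installed. A sketch assuming the ibm_db_sa driver; the connection URI is an assumption:

    ```python
    # Sketch: ingest DB2 via DataHub's generic "sqlalchemy" source.
    # Requires a DB2 SQLAlchemy dialect (e.g. ibm_db_sa) in the same environment;
    # the connection URI below is a placeholder.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "sqlalchemy",
                "config": {
                    "platform": "db2",
                    "connect_uri": "db2+ibm_db://user:password@db2-host:50000/MYDB",  # hypothetical
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    ```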
  • red-napkin-59945

    02/15/2022, 5:35 PM
    Hey team, I am wondering how we should model the relationship between Looker View A and Looker View B if View B includes View A?
  • gentle-optician-51037

    02/16/2022, 6:49 AM
    Hi guys, I am wondering where metadata is stored after ingesting data from Hive or MySQL, and in what form it is stored to facilitate page display. JSON or XML? Thanks for the answer ~🙂
  • adorable-flower-19656

    02/16/2022, 9:22 AM
    Hi, I'm using DataHub 0.8.26 via datahub-helm and UI ingestion for my BigQuery, but 'data container' entities are not ingested at all. In the UI ingestion, advanced -> CLI version is 0.8.19.1 (the default). Is this related? Or do I need some options in the recipe?
  • plain-lion-38626

    02/16/2022, 1:31 PM
    Hi everybody. I'm using a custom Python emitter to add lineage to BigQuery objects that live under different projects. The UPSERT option seems to overwrite the upstream lineage when switching between projects. E.g.: table `project1.Dataset1.table1` has `project1.Dataset2.table2` as an upstream, but it also has `project2.Dataset2.table2` as another upstream. When using the custom emitter (with the UPSERT option), the second project seems to overwrite the first one. Is this a bug, or do I need to query all projects and add the upstream lineage in one go afterwards?
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import ChangeTypeClass
    import datahub.emitter.mce_builder as builder

    # platform, fq_table_name, env, and upstream_lineage come from earlier in the script
    lineage_mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn(platform, fq_table_name, env),
        aspectName="upstreamLineage",
        aspect=upstream_lineage,
    )
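    On the overwrite itself: UPSERT replaces the whole upstreamLineage aspect, so the last emit wins. One approach is to collect the upstreams from all projects first and emit them in a single aspect; a sketch using the example tables from the question:

    ```python
    # Sketch: put upstreams from both projects into one upstreamLineage aspect,
    # since each UPSERT replaces the aspect as a whole.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    downstream = builder.make_dataset_urn("bigquery", "project1.Dataset1.table1", "PROD")
    upstreams = [
        UpstreamClass(
            dataset=builder.make_dataset_urn("bigquery", name, "PROD"),
            type=DatasetLineageTypeClass.TRANSFORMED,
        )
        for name in ("project1.Dataset2.table2", "project2.Dataset2.table2")
    ]

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=downstream,
        aspectName="upstreamLineage",
        aspect=UpstreamLineageClass(upstreams=upstreams),
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)
    ```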
  • few-air-56117

    02/16/2022, 2:22 PM
    Hi guys, where can I find statistics/usage information in DataHub's MySQL?
  • cuddly-apple-7818

    02/16/2022, 3:32 PM
    Hi all, so apparently datahub can automatically ingest bigquery column and table descriptions. My understanding is that the reverse isn’t true. That is, if we edit a column description on datahub, there’s no way for us to sync back to bigquery, correct?
  • freezing-farmer-89710

    02/16/2022, 11:45 PM
    Hi, I'm trying to ingest from Glue. With version acryl_datahub-0.8.26.4 it generates the following error, even though I previously installed the glue plugin (python3 -m pip install 'acryl-datahub[glue]'). However, when I install version 0.8.23.1, it ingests correctly. Error:
    Copy code
    [2022-02-16 23:04:00,956] ERROR    {datahub.entrypoints:125} - File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/registry.py", line 84, in _ensure_not_lazy
        80   def _ensure_not_lazy(self, key: str) -> Union[Type[T], Exception]:
        81       path = self._mapping[key]
        82       if isinstance(path, str):
        83           try:
    --> 84               plugin_class = import_path(path)
        85               self.register(key, plugin_class, override=True)
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path
        18   def import_path(path: str) -> Any:
     (...)
        28           module_name, object_name = path.rsplit(":", 1)
        29       else:
        30           module_name, object_name = path.rsplit(".", 1)
        31   
    --> 32       item = importlib.import_module(module_name)
        33       for attr in object_name.split("."):
    File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
        109  def import_module(name, package=None):
     (...)
        123          for character in name:
        124              if character != '.':
        125                  break
        126              level += 1
    --> 127      return _bootstrap._gcd_import(name[level:], package, level)
    File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
    File "<frozen importlib._bootstrap>", line 983, in _find_and_load
    File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
    File "<frozen importlib._bootstrap_external>", line 728, in exec_module
    File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/aws/glue.py", line 27, in <module>
        23   from datahub.ingestion.api.source import Source, SourceReport
        24   from datahub.ingestion.api.workunit import MetadataWorkUnit
        25   from datahub.ingestion.source.aws.aws_common import AwsSourceConfig
        26   from datahub.ingestion.source.aws.s3_util import make_s3_urn
    --> 27   from datahub.ingestion.source.sql.sql_common import SqlContainerSubTypes
        28   from datahub.metadata.com.linkedin.pegasus2avro.common import Status
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/sql/sql_common.py", line 22, in <module>
        18   )
        19   from urllib.parse import quote_plus
        20   
        21   import pydantic
    --> 22   from sqlalchemy import create_engine, inspect
        23   from sqlalchemy.engine.reflection import Inspector
    ---- (full traceback above) ----
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/registry.py", line 84, in _ensure_not_lazy
        plugin_class = import_path(path)
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path
        item = importlib.import_module(module_name)
    File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
    File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
    File "<frozen importlib._bootstrap>", line 983, in _find_and_load
    File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
    File "<frozen importlib._bootstrap_external>", line 728, in exec_module
    File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/aws/glue.py", line 27, in <module>
        from datahub.ingestion.source.sql.sql_common import SqlContainerSubTypes
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/source/sql/sql_common.py", line 22, in <module>
        from sqlalchemy import create_engine, inspect
    ModuleNotFoundError: No module named 'sqlalchemy'
    The above exception was the direct cause of the following exception:
    File "/usr/local/lib/python3.7/site-packages/datahub/cli/ingest_cli.py", line 77, in run
        67   def run(config: str, dry_run: bool, preview: bool, strict_warnings: bool) -> None:
     (...)
        73       pipeline_config = load_config_file(config_file)
        74   
        75       try:
        76           logger.debug(f"Using config: {pipeline_config}")
    --> 77           pipeline = Pipeline.create(pipeline_config, dry_run, preview)
        78       except ValidationError as e:
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 175, in create
        171  def create(
        172      cls, config_dict: dict, dry_run: bool = False, preview_mode: bool = False
        173  ) -> "Pipeline":
        174      config = PipelineConfig.parse_obj(config_dict)
    --> 175      return cls(config, dry_run=dry_run, preview_mode=preview_mode)
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 120, in __init__
        105  def __init__(
        106      self, config: PipelineConfig, dry_run: bool = False, preview_mode: bool = False
        107  ):
     (...)
        116          preview_mode=preview_mode,
        117      )
        118  
        119      source_type = self.config.source.type
    --> 120      source_class = source_registry.get(source_type)
        121      self.source: Source = source_class.create(
    File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/registry.py", line 130, in get
        115  def get(self, key: str) -> Type[T]:
     (...)
        126      tp = self._ensure_not_lazy(key)
        127      if isinstance(tp, ModuleNotFoundError):
        128          raise ConfigurationError(
        129              f"{key} is disabled; try running: pip install '{__package_name__}[{key}]'"
    --> 130          ) from tp
        131      elif isinstance(tp, Exception):
    ConfigurationError: glue is disabled; try running: pip install 'acryl-datahub[glue]'
  • freezing-farmer-89710

    02/17/2022, 4:03 AM
    Hello, I hope everyone is well and that you can help me. The situation is this: I would like to retrieve all run-ids and persist them. When I execute `datahub ingest list-runs` it only returns some of them, and I want access to all of them so I can roll back a specific run if necessary. Is there a programmatic way to fetch all existing run-ids, or a query I can run against the database that stores the metadata to list them? Thank you.
  • rich-policeman-92383

    02/17/2022, 4:44 AM
    Hi, the REST API payload for group creation is incorrect: corpUser should be corpuser. Using the existing payload results in an error.
    Copy code
    Caused by: com.linkedin.data.template.TemplateOutputCastException: Invalid URN syntax: Urn entity type should be 'corpuser'.: urn:li:corpUser:datahub
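    For what it's worth, URN entity types are case-sensitive, and the builder helpers produce the accepted spellings:

    ```python
    # User URNs use the lower-case "corpuser" entity type; group URNs use "corpGroup".
    import datahub.emitter.mce_builder as builder

    print(builder.make_user_urn("datahub"))       # urn:li:corpuser:datahub
    print(builder.make_group_urn("engineering"))  # urn:li:corpGroup:engineering
    ```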
  • flaky-airplane-82352

    02/17/2022, 2:52 PM
    Hi, is there a way to rename a Domain after it has been created? I've created a domain with a long name, but it gets broken on the main page of DataHub, and I'd like to rename it to something shorter. Thanks.
  • billowy-flag-4217

    02/17/2022, 8:54 PM
    Hello, I'm fairly new to DataHub and I am attempting to ingest metadata via the LookML plugin. All is well apart from an error where a particular Looker project imports files from another project using include:. This results in the error ['cannot resolve include: '//path/to.view']. Does anyone know how to resolve this?
  • wide-army-23885

    02/17/2022, 10:42 PM
    Hello, I'm trying to ingest ML Models into DataHub. Since there's no recipe, what's the best way to do it? I was looking for the Demo code, which contains 2 ML Models, but I couldn't find it.
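    Since there is no source/recipe for this, ML models are usually pushed with the Python emitter. A minimal sketch where the platform, model name, and description are placeholders:

    ```python
    # Sketch: register an ML model by emitting its properties aspect directly.
    # The platform/model names and the description are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, MLModelPropertiesClass

    model_urn = "urn:li:mlModel:(urn:li:dataPlatform:sagemaker,churn-predictor,PROD)"

    mcp = MetadataChangeProposalWrapper(
        entityType="mlModel",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=model_urn,
        aspectName="mlModelProperties",
        aspect=MLModelPropertiesClass(description="Churn prediction model (placeholder)."),
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)
    ```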
  • stocky-midnight-78204

    02/18/2022, 2:32 AM
    Is there any timeline for supporting Apache Hudi ingestion?
  • witty-painting-90923

    02/18/2022, 9:33 AM
    Hello! I am trying to ingest MongoDB metadata into DataHub with Airflow. 3 databases out of 8 are ingested, but at some point I get the error "array type is not backed by a DataList". Does anyone know what this might be? Maybe array fields are not supported? Thank you!
    Copy code
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                    'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                             'message': "Parameters of method 'ingest' failed validation with error 'ERROR :: "
                                        '/entity/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.schema.SchemaMetadata/fields/5/type/type/com.linkedin.schema.ArrayType/nestedType '
                                        ':: array type is not backed by a DataList\n'
                                        'ERROR :: '
                                        '/entity/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.schema.SchemaMetadata/fields/33/type/type/com.linkedin.schema.ArrayType/nestedType '
                                        ':: array type is not backed by a DataList\n'
                                        'ERROR :: '
                                        '/entity/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.schema.SchemaMetadata/fields/37/type/type/com.linkedin.schema.UnionType/nestedTypes '
  • stale-printer-44316

    02/18/2022, 2:36 PM
    Hi - in the UI ingestion tab in 0.8.25, when I run the ingestion it doesn't show me the status, duration, or details. Could you please help?
  • broad-battery-31188

    02/18/2022, 3:11 PM
    Hello team, why would MariaDB ingestion require access to the user table?
    Copy code
    OperationalError: (pymysql.err.OperationalError) (1142, "SHOW VIEW command denied to user 'datahub'@'<ip address>' for table 'user'")
  • lemon-hydrogen-83671

    02/18/2022, 5:55 PM
    Anyone know if there's something akin to an add_documentation transformer out there? I was thinking of adding one that would populate the documentation tab with a template or something for URNs gathered in a recipe.
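    I'm not aware of a built-in transformer for this; in the meantime one option is a small script that writes the editable description aspect for the URNs gathered from a recipe run. A sketch where the URN list and the template text are placeholders:

    ```python
    # Sketch: populate the Documentation tab by emitting editableDatasetProperties
    # for a list of dataset URNs. The URNs and template text are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, EditableDatasetPropertiesClass

    TEMPLATE = "## Overview\n\n_TODO: describe this dataset._"
    urns = [
        "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.orders,PROD)",  # placeholder
    ]

    emitter = DatahubRestEmitter("http://localhost:8080")
    for urn in urns:
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=urn,
                aspectName="editableDatasetProperties",
                aspect=EditableDatasetPropertiesClass(description=TEMPLATE),
            )
        )
    ```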
  • silly-beach-19296

    02/18/2022, 6:02 PM
    Hello everyone, I have deployed DataHub on EKS and now I want to delete the results of an Athena ingestion. Should I connect to a node to delete them, or go directly to the GMS console?
  • bland-barista-59197

    02/19/2022, 1:19 AM
    Hi team, is it possible to fix the navigation links in the Rest.li documentation at http://localhost:8080/restli/docs? E.g. clicking Home takes me to http://localhost:8080/restli/restli/docs.