# ingestion
  • s

    sparse-barista-40860

    07/19/2022, 1:29 AM
    For big data storage: Cassandra, MongoDB, Hadoop.
    c
    • 2
    • 1
  • s

    sparse-barista-40860

    07/19/2022, 1:29 AM
    Which one is currently supported for exposing metadata?
  • l

    lemon-terabyte-66903

    07/19/2022, 2:57 AM
    https://datahubspace.slack.com/archives/CV2UVAPPG/p1658199463049969
    m
    • 2
    • 1
  • m

    melodic-ability-49840

    07/19/2022, 3:49 AM
    Hello, all. I have an HTTP issue when deleting metadata. I want to delete one of the schemas ingested from my MySQL DB, so I ran the CLI like this:
    Copy code
    datahub delete --query "information_schema" --hard --include-removed
    But it doesn’t work; I get this error:
    Copy code
    HTTPError: 401 Client Error: Unauthorized for url: http://localhost:8080/entities?action=search
    [2022-07-19 03:22:34,201] INFO     {datahub.entrypoints:177} - DataHub CLI version: 0.8.40 at /home/ec2-user/datahub/venv/lib64/python3.7/site-packages/datahub/__init__.py
    [2022-07-19 03:22:34,201] INFO     {datahub.entrypoints:180} - Python version: 3.7.10 (default, Jun  3 2021, 00:02:01)
    [GCC 7.3.1 20180712 (Red Hat 7.3.1-13)] at /home/ec2-user/datahub/venv/bin/python3 on Linux-5.10.109-104.500.amzn2.x86_64-x86_64-with-glibc2.2.5
    [2022-07-19 03:22:34,201] INFO     {datahub.entrypoints:182} - GMS config {}
    In my opinion, this error comes from how I started DataHub on my EC2 instance. When I started DataHub, I modified docker-compose-without-neo4j.quickstart.yml to enable Personal Access Tokens. I referred to the content below.
    Copy code
    you have to create your own yml file (copy the one that says docker-compose-without-neo4j.quickstart.yml under the docker/quickstart folder and edit it with METADATA_SERVICE_AUTH_ENABLED=true under both datahub-frontend-react and datahub-gms), then run the quickstart command with the docker file you created: datahub docker quickstart --quickstart-compose-file ./yourdockerfile.yml
    https://datahubspace.slack.com/archives/C029A3M079U/p1655101380628779 In this situation, what should I do to solve this problem? I’d appreciate your help.
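    Since metadata service authentication is now enabled, the CLI also needs a Personal Access Token (generated in the UI under Settings > Access Tokens). A minimal sketch of the CLI-side configuration, assuming the ~/.datahubenv layout that datahub init writes (the token value is a placeholder):
    Copy code
    gms:
      server: http://localhost:8080
      token: <your-personal-access-token>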
    b
    • 2
    • 7
  • h

    hallowed-machine-2603

    07/19/2022, 4:56 AM
    Hi teams, I want to insert tags/terms/descriptions in bulk using a .csv or .xlsx file. Is it possible to insert tags/terms/descriptions in bulk for each table?
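    If it helps, one way this is often done is via the CSV enricher; a minimal recipe sketch, assuming the csv-enricher source is available in your CLI version (the file name is a placeholder, and the CSV column layout follows the csv-enricher docs, e.g. a resource URN plus glossary_terms, tags, description):
    Copy code
    source:
      type: csv-enricher
      config:
        filename: ./bulk_annotations.csv   # placeholder path
        write_semantics: PATCH             # merge with existing metadata rather than overwrite
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'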
    c
    b
    • 3
    • 2
  • a

    alert-football-80212

    07/19/2022, 7:42 AM
    Hi all, what exactly is the difference between Snowflake metadata ingestion and Snowflake usage ingestion?
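    For context, a sketch assuming the two standard source types: the snowflake source pulls schema metadata (databases, tables, columns, lineage, optional profiling), while the snowflake-usage source pulls query history and usage statistics that feed the Queries/Stats tabs. They are typically run as two separate recipes:
    Copy code
    # Recipe 1: schema metadata, lineage, profiling
    source:
      type: snowflake
      config: {}          # connection details omitted
    ---
    # Recipe 2 (run separately): query history and usage statistics
    source:
      type: snowflake-usage
      config: {}          # connection details omitted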
    b
    • 2
    • 6
  • s

    steep-vr-39297

    07/19/2022, 1:31 PM
    Hello. I have a question. Containers get the same DB name. Can I have just one DB name?
    c
    • 2
    • 5
  • b

    billions-twilight-48559

    07/19/2022, 2:08 PM
    Hi, I think there is a problem ingesting table column comments using the Databricks Hive metastore and the Hive ingestor. If I use the Delta Lake ingestion it works, but the comments come through in JSON format; I’m not sure whether this is due to the Delta Lake ingestion or a Databricks standard. Does anyone know whether changing a parameter in the recipe would make it work with the Hive ingestion?
    m
    c
    • 3
    • 21
  • b

    brave-tomato-16287

    07/19/2022, 5:16 PM
    Hello all! Could you help me understand why the tests from
    run_results.json
    were not parsed? It contains:
    Copy code
        {
          "status": "pass",
          "timing": [
            {
              "name": "compile",
              "started_at": "2022-07-19T05:24:01.030208Z",
              "completed_at": "2022-07-19T05:24:01.076270Z"
            },
            {
              "name": "execute",
              "started_at": "2022-07-19T05:24:01.092355Z",
              "completed_at": "2022-07-19T05:24:01.309409Z"
            }
          ],
          "thread_id": "Thread-3 (worker)",
          "execution_time": 0.3308274745941162,
          "adapter_response": {},
          "message": null,
          "failures": 0,
          "unique_id": "test.dwh.not_null_dim_poses_pos_id.bbf0d646e2"
        },
        {
          "status": "pass",
          "timing": [
            {
              "name": "compile",
              "started_at": "2022-07-19T05:24:01.763525Z",
              "completed_at": "2022-07-19T05:24:01.783930Z"
            },
            {
              "name": "execute",
              "started_at": "2022-07-19T05:24:01.784195Z",
              "completed_at": "2022-07-19T05:24:01.974451Z"
            }
          ],
          "thread_id": "Thread-2 (worker)",
          "execution_time": 0.21937918663024902,
          "adapter_response": {},
          "message": null,
          "failures": 0,
          "unique_id": "test.dwh.row_count_dim_poses_1__pos_id.a5451597d7"
        },
        {
          "status": "pass",
          "timing": [
            {
              "name": "compile",
              "started_at": "2022-07-19T05:26:12.219929Z",
              "completed_at": "2022-07-19T05:26:12.239354Z"
            },
            {
              "name": "execute",
              "started_at": "2022-07-19T05:26:12.239617Z",
              "completed_at": "2022-07-19T05:26:12.325147Z"
            }
          ],
          "thread_id": "Thread-9 (worker)",
          "execution_time": 0.10956311225891113,
          "adapter_response": {},
          "message": null,
          "failures": 0,
          "unique_id": "test.dwh.unique_dim_poses_pos_id.e6dc11e045"
        },
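    One thing worth checking, assuming your CLI version supports it: the dbt source only parses test outcomes when the recipe points at the run results file, e.g. via a test_results_path entry alongside the manifest and catalog (treat the exact option name as an assumption to verify against the dbt source docs):
    Copy code
    source:
      type: dbt
      config:
        manifest_path: ./target/manifest.json
        catalog_path: ./target/catalog.json
        test_results_path: ./target/run_results.json   # enables parsing of test results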
    m
    • 2
    • 4
  • r

    refined-energy-76018

    07/20/2022, 1:08 AM
    Is there a way to mount a recipe file so that it will show up in UI ingestion? Or more generally, what are some ways I can manage my UI ingestion sources through source control?
    m
    • 2
    • 5
  • h

    hallowed-machine-2603

    07/20/2022, 5:42 AM
    Hi teams, is there a custom option to add or change the browse path? If a certain database’s metadata is ingested into DataHub using a recipe, can I set a separate path for each table? Assumption: two tables are in the same database, one is Finance_table, the other is Analysis_table. Presently, I use the transformers option below: transformers: type: set_dataset_browse_path config: path_templates: 'Test/Test_1/Test_2/DATASET_PARTS'. In this case, all datasets are located at the same path, but I want a separate path for each table. For example: Test/Test_1/Test_2/Finance_table and Test/Check_1/Check_2/Analysis_table
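    For readability, the same transformer section written out as recipe YAML (note the key is path_templates, and DATASET_PARTS is the built-in variable):
    Copy code
    transformers:
      - type: set_dataset_browse_path
        config:
          path_templates:
            - 'Test/Test_1/Test_2/DATASET_PARTS'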
    c
    • 2
    • 5
  • s

    steep-vr-39297

    07/20/2022, 8:11 AM
    Hello, I have a question. Data was imported using a Hive recipe. When I run
    show databases
    in Hive there is no
    none
    DB, but one was created in DataHub. This is the Hive recipe:
    Copy code
    source:
        type: hive
        config:
          host_port: ip:port
          username: id
          password: pw
          env: DEV
          platform: hive
          schema_pattern:
            allow:
              - 'dev_db_.*'
          domain:
            test:
              allow:
                - '.*'
          options:
            connect_args:
              auth: LDAP

    sink:
        type: datahub-rest
        config:
          server: "http://localhost:8080"
    c
    d
    f
    • 4
    • 5
  • g

    gorgeous-library-38151

    07/20/2022, 9:58 AM
    Can I specify the name of the dataset in the YAML recipe (instead of using the file's name) while ingesting S3 files into DataHub?
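    A sketch of one common approach, assuming the S3 source's path spec support (bucket layout and names are placeholders; depending on your version the option may be path_spec or path_specs): the {table} placeholder in the include path becomes the dataset name instead of the individual file name.
    Copy code
    source:
      type: s3
      config:
        path_specs:
          - include: 's3://my-bucket/data/{table}/*.parquet'   # each {table} folder becomes one dataset
        env: DEV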
    c
    • 2
    • 1
  • s

    salmon-angle-92685

    07/20/2022, 12:28 PM
    Hello guys, in my Redshift recipe I add some Glossary Terms based on table name patterns. However, some tables match two different patterns, and when this happens the second pattern doesn't add its glossary terms. How can I correct this? For example:
    Copy code
    # Linking all the tables having '_table_' on their name to the glossary1 Glossary Term.
    '.*\._table_.*': ["urn:li:glossaryTerm:topic.glossary1"]
    
    # Linking all the tables having '_table_example_' on their name to the glossary2 Glossary Term.
    '.*\._table_example_.*': ["urn:li:glossaryTerm:topic.glossary2"]
    Since all the tables matched by the second pattern are also matched by the first one, the second pattern doesn't add glossary2 to those tables. Thank you!
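    If the transformer applies only the first matching pattern, one workaround (assuming the keys are ordinary Python regexes) is to make the broader pattern exclude the more specific one with a negative lookahead, so each table matches exactly one rule; alternatively, list both term URNs under the more specific pattern if both terms should apply.
    Copy code
    # Broader pattern, now skipping names that contain '_table_example_'
    '.*\._table_(?!example_).*': ["urn:li:glossaryTerm:topic.glossary1"]

    # More specific pattern, unchanged
    '.*\._table_example_.*': ["urn:li:glossaryTerm:topic.glossary2"]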
  • t

    thankful-umbrella-77469

    07/20/2022, 1:29 PM
    Hi, I would appreciate some assistance regarding DataHub configuration with my cloud services. I’m launching the Helm chart and facing the following error in the pod of
    acryl-datahub-actions
    Copy code
    %6|1658138371.437|FAIL|rdkafka#consumer-1| [thrd:<BOOTSTRAP_SERVER>:9092/bootstrap]: <BOOTSTRAP_SERVER>:9092/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 3ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
    and this error keeps repeating itself. I’m connecting my DataHub resources (Kafka, Elasticsearch and MySQL) to my cloud services: Elasticsearch and MySQL -> AWS, Kafka -> Confluent. I followed the doc here but still didn’t overcome the above error. Did someone face it as well? Thanks.
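    Since the error points at security.protocol, the usual suspect is the Kafka client protocol override in the chart values; a sketch of the relevant block, assuming the Helm chart's springKafkaConfigurationOverrides section (exact keys and the secret wiring for the Confluent API key/secret are assumptions to verify against the chart's values.yaml and the Confluent Cloud deployment guide):
    Copy code
    global:
      springKafkaConfigurationOverrides:
        security.protocol: SASL_SSL
        sasl.mechanism: PLAIN
        # the sasl.jaas.config carrying the Confluent API key/secret is typically injected via Kubernetes secrets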
  • b

    better-orange-49102

    07/20/2022, 2:00 PM
    With the new glossary term UI creation logic, the business-glossary CLI ingestion logic might be due for a change. The old approach of using the term name as the basis for the URN could be replaced by a UUID (or at least let people specify a UUID so that they can reliably edit a specific term without being locked in by the name). However, I can foresee legacy issues for people who have an existing glossary term flow. Not sure what the best approach would be here.
    m
    • 2
    • 2
  • c

    colossal-sandwich-50049

    07/20/2022, 4:08 PM
    Hello, I am playing around with Lineage and getting an error when trying to set downstream lineage. Can someone advise?
    Copy code
    /** FS Downstream **/
            Downstream downstream = new Downstream()
                    .setDataset(new DatasetUrn(new DataPlatformUrn("delta"), "somedataset", FabricType.DEV))
                    .setType(DatasetLineageType.TRANSFORMED);
    
            DownstreamArray downstreamArray = new DownstreamArray(downstream);
            DownstreamLineage downstreamLineage = new DownstreamLineage().setDownstreams(downstreamArray);
    
            MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                    .entityType("dataset")
                    .entityUrn(new DatasetUrn(new DataPlatformUrn("delta"), "someotherdataset", FabricType.DEV))
                    .upsert()
                    .aspect(downstreamLineage)
                    .build();
    
    Trying to build the mcpw gives the following error: Caused by: java.lang.NullPointerException: aspectName could not be inferred from provided aspect and was not explicitly provided as an override
    
    Note: the exact same code works if I switch it to set upstream lineage instead of downstream.
    c
    • 2
    • 1
  • b

    billowy-monitor-84300

    07/20/2022, 5:11 PM
    Does Looker ingestion require your credentials to be tied to an admin-level service account on the Looker side? Or can this be a developer account?
    m
    • 2
    • 2
  • c

    chilly-carpet-99599

    07/20/2022, 5:23 PM
    Hey folks, regarding profiling datasets on Snowflake: what is the purpose of the query below? Does it relate to the sample values shown in the Stats section? The query always hangs on our DB for a very large table; does DataHub allow us to customise or tune profiling queries? Thanks so much.
    Copy code
    select column from table where column is not null order by column limit 2 offset 500000000;
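    Not a definitive answer, but a LIMIT 2 OFFSET <half the row count> pattern is typically the median calculation from profiling, and there are profiling options that can switch the heavier per-column metrics off; a sketch, with option names assumed from the profiling config (verify against your version's docs):
    Copy code
    profiling:
      enabled: true
      include_field_median_value: false          # the ORDER BY ... LIMIT 2 OFFSET ... query is usually the median computation
      include_field_quantiles: false
      turn_off_expensive_profiling_metrics: true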
    l
    c
    • 3
    • 2
  • b

    bland-morning-36590

    07/20/2022, 6:09 PM
    Hi team, I am trying to ingest metadata from an S3 bucket. From the logs I can see I am able to make a connection, but I am hitting “KeyError: ‘ContentRange’”. Any idea why this could be happening?
    • 1
    • 1
  • c

    colossal-sandwich-50049

    07/20/2022, 10:06 PM
    Is there any documentation showing how to create an
    MLModel
    entity using the java (or python) emitter?
    c
    b
    • 3
    • 12
  • n

    nice-country-99675

    07/21/2022, 2:05 AM
    👋 Hello team! Just a quick question here... is there a way to access an aspect in a transformer transforming another aspect?
    plus1 1
    b
    • 2
    • 2
  • p

    purple-ghost-69116

    07/21/2022, 3:05 AM
    Hi team, a quick question regarding data ingestion from Redshift or Glue. How do we ingest metadata like owners, tags, domains and queries into DataHub? Does a user need to populate all of this manually?
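    If the source system does not carry this metadata, one common approach is to attach it at ingestion time with transformers in the recipe rather than editing each table in the UI; a sketch, assuming the standard transformer names (the owner and tag URNs are placeholders). Domains can similarly be assigned via the source-level domain config, as in the Hive recipe earlier in this channel.
    Copy code
    transformers:
      - type: simple_add_dataset_ownership
        config:
          owner_urns:
            - 'urn:li:corpuser:data_engineering'   # placeholder owner
      - type: simple_add_dataset_tags
        config:
          tag_urns:
            - 'urn:li:tag:Ingested'                # placeholder tag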
    b
    • 2
    • 4
  • l

    lemon-zoo-63387

    07/21/2022, 3:56 AM
    Hello everyone, I am trying to ingest metadata from several MSSQL servers; two of them succeeded, but one raised the error below. How can I solve it? Thanks in advance for your help.
    Copy code
    'Error: Client does not have encryption enabled but it is required by server, enable encryption and try connecting again\n'
    Configure Recipe:
    Copy code
    source:
        type: mssql
        config:
            env: QA
            host_port: '10.xxxx:50920'
            database: OA
            username: xxx
            password: 'oaxxxxn'
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'
    logs:
    Copy code
    'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line '
               '114, in connect\n'
               '    return dialect.connect(*cargs, **cparams)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, '
               'in connect\n'
               '    return self.dbapi.connect(*cargs, **cparams)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 1345, in '
               'connect\n'
               '    conn._open(sock=sock)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 372, in _open\n'
               '    self._try_open(timeout=retry_time, sock=sock)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 354, in '
               '_try_open\n'
               '    self._connect(host=host, port=port, instance=instance, timeout=timeout, sock=sock)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/__init__.py", line 304, in '
               '_connect\n'
               '    route = conn.login(login, sock, self._tzinfo_factory)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/tds.py", line 1703, in login\n'
               '    self._main_session.process_prelogin(login)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/tds.py", line 1294, in '
               'process_prelogin\n'
               '    self.parse_prelogin(octets=p, login=login)\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/pytds/tds.py", line 1343, in '
               'parse_prelogin\n'
               "    raise tds_base.Error('Client does not have encryption enabled but it is required by server, '\n"
               '\n'
               'Error: Client does not have encryption enabled but it is required by server, enable encryption and try connecting again\n'
               '\n'
               'The above exception was the direct cause of the following exception:\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 106, in '
               'run\n'
               '    88   def run(\n'
               '    89       ctx: click.Context,\n'
               '    90       config: str,\n'
               '    91       dry_run: bool,\n'
               '    92       preview: bool,\n'
               '    93       strict_warnings: bool,\n'
               '    94       preview_workunits: int,\n'
               '    95       suppress_error_logs: bool,\n'
               '    96   ) -> None:\n'
               ' (...)\n'
               '    102      pipeline_config = load_config_file(config_file)\n'
               '    103  \n'
               '    104      try:\n'
               '    105          logger.debug(f"Using config: {pipeline_config}")\n'
               '--> 106          pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)\n'
               '    107      except ValidationError as e:\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-91f9ab72-fe31-48a7-93d1-378e08f9ea5f/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line '
               '204, in create\n'
               '    196  def create(\n'
               '    197      cls,\n'
               '    198      config_dict: dict,\n'
               '    199      dry_run: bool = False,\n'
               '    200      preview_mode: bool = False,\n'
               '    201      preview_workunits: int = 10,\n'
               '    202  ) -> "Pipeline":\n'
               '    203      config = PipelineConfig.parse_obj(config_dict)\n'
               '--> 204      return cls(\n'
               '    205          config,\n'
               '\n'
    c
    • 2
    • 3
  • w

    wonderful-egg-79350

    07/21/2022, 5:19 AM
    Is it possible for me to ingest metadata using a JSON or CSV file? How can I ingest a table schema using JSON?
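    One option (a sketch, assuming the generic file source): table schemas and other aspects can be written as MCE/MCP JSON, for example produced by the file sink of another ingestion run or built by hand against the dataset schema model, and then ingested with a recipe like this (the path is a placeholder):
    Copy code
    source:
      type: file
      config:
        filename: ./my_metadata.json   # a file of MetadataChangeEvents / MetadataChangeProposals
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'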
    c
    • 2
    • 3
  • w

    witty-painting-90923

    07/21/2022, 8:51 AM
    Hello everyone! We had added an Elasticsearch cluster to DataHub, but that cluster has now been deleted. As a result the ingestion doesn't work, and the old metadata is just hanging there. Is there currently a way to remove it from DataHub? Thank you!
    c
    s
    • 3
    • 8
  • c

    cool-vr-73109

    07/21/2022, 10:19 AM
    Hi team, I tried S3 data ingestion using the custom ingestion source option. Ingestion failed with profiling enabled (true), with the error below. JAVA_HOME and SPARK_VERSION are set. We have deployed DataHub on an EC2 instance with Docker. Please help.
    c
    w
    • 3
    • 34
  • b

    best-leather-7441

    07/21/2022, 3:05 PM
    Hi! My team and I would like to implement a GitLab CI/CD job to ingest dbt data into DataHub. Does anyone have tips on how to do this?
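    A minimal sketch of such a job, assuming the dbt artifacts are produced earlier in the pipeline and a DataHub recipe file lives in the repo (the stage, image, branch rule and file names are placeholders):
    Copy code
    ingest_dbt_metadata:
      stage: deploy
      image: python:3.9
      script:
        - pip install 'acryl-datahub[dbt]'
        - datahub ingest -c dbt_datahub_recipe.yml
      rules:
        - if: '$CI_COMMIT_BRANCH == "main"'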
    m
    • 2
    • 2
  • a

    average-rocket-98592

    07/21/2022, 3:15 PM
    Hi! I’m trying to ingest data from Oracle with the profiling option enabled with default values, but it returns the following error: limit\n value is not a valid integer
    c
    f
    • 3
    • 7
  • n

    nice-solstice-89964

    07/21/2022, 3:51 PM
    Hi, community. We have ingestion set up for Looker and dbt/Snowflake on DataHub. But when trying to show lineage from Looker, we can only see the published tables behind the Looker charts, and cannot see further back to the source tables of those published tables. The only way to see that lineage (source tables -> published tables) is to check the dbt section on DataHub. Is there a way to integrate those two sources together, so that when we check the lineage from Looker we can also trace back to the very source tables on dbt? Thanks very much for any hint! 🙇
    c
    • 2
    • 1