# ingestion
  • red-pizza-28006
    11/10/2021, 3:57 PM
    What would be the best way to ingest KPIs that I have in this format on a Confluence page?
  • plain-farmer-27314
    11/10/2021, 9:22 PM
    Hey, I'm wondering why the dbt ingestor has this extra value in its constructor: https://github.com/linkedin/datahub/blob/b2f59e8745c8fb40ae7aa0f4d5b4d0aac8b97968/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L538 Where does that value come from?
  • wooden-gpu-7761
    11/11/2021, 6:33 AM
    Hi everyone, wondering if anyone’s had issues ingesting lineage from BigQuery due to 503s when querying GCP logs. The logs are:
    ```
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 1083, in emit
        msg = self.format(record)
      File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 927, in format
        return fmt.format(record)
      File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 663, in format
        record.message = record.getMessage()
      File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 367, in getMessage
        msg = msg % self.args
    TypeError: not all arguments converted during string formatting
    Call stack:
      File "/Users/hyunmin/datahub-recipes/env/bin/datahub", line 8, in <module>
        sys.exit(main())
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/datahub/entrypoints.py", line 93, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 58, in run
        pipeline.run()
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 141, in run
        for wu in self.source.get_workunits():
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/datahub/ingestion/source/sql/bigquery.py", line 207, in get_workunits
        self._compute_big_query_lineage()
      File "/Users/hyunmin/datahub-recipes/env/lib/python3.9/site-packages/datahub/ingestion/source/sql/bigquery.py", line 121, in _compute_big_query_lineage
        logger.error(
    Message: 'Error computing lineage information using GCP logs.'
    Arguments: (ServiceUnavailable('POST https://logging.googleapis.com/v2/entries:list?prettyPrint=false: The service is currently unavailable.'),)
    ```
    I’ve tried to relax the start_time, end_time, and max_query_duration constraints (down to almost 10-second intervals) but unfortunately still haven’t seen good results. It seems the project DataHub is querying against is too big in terms of log size, and GCP’s API seems to time out when returning the logs internally (this was confirmed by GCP’s support team). Are there any options I could tweak, or anything I’m missing? FYI, I’ve tried calling the API manually via curl with smaller page sizes of about 10 and have seen better results, but it seems DataHub’s BigQuery ingestion module uses a fixed page size of 1000. Any ideas would be much appreciated!
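    (For anyone hitting the same thing: the page size isn't a recipe option here, but you can test the API behavior outside DataHub. A minimal sketch using the google-cloud-logging client, with an illustrative project and filter rather than DataHub's exact query:)
    ```
    # Sketch: read BigQuery audit logs directly with a small page size to see
    # whether smaller pages avoid the 503s. Values below are illustrative.
    from google.cloud import logging

    client = logging.Client(project="my-project")  # uses application-default credentials
    log_filter = 'protoPayload.serviceName="bigquery.googleapis.com"'
    for entry in client.list_entries(filter_=log_filter, page_size=10):
        print(entry.timestamp, entry.log_name)
    ```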
  • rhythmic-sundown-12093
    11/11/2021, 7:23 AM
    Hi team, when I integrate MongoDB I get this error: `OperationFailure: Unrecognized expression '$bsonSize'`. MongoDB version: 3.6.23
  • orange-flag-48535
    11/11/2021, 8:00 AM
    This should probably count as a bug report: if a JSON key in an MCE file has dots (periods) in it, they get parsed as nested fields (or something strange). It turns out one of the JSON schemas I'm importing has such key names (because, I guess, humans are creating the key names). Note that the key name is enclosed in quotes, so it's a little surprising to see it interpreted like that; the JSON standard (RFC 8259) says a key name can be any string. The jq utility and IntelliJ handle that case without issues. The offending line is probably this one in translateFieldPath.tsx: https://github.com/linkedin/datahub/blob/master/datahub-web-react/src/app/entity/dataset/profile/schema/utils/translateFieldPath.tsx#L8
  • nice-planet-17111
    11/11/2021, 9:44 AM
    Hello team, is it possible to ingest multiple MySQL databases in one YAML file? 🙂
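    (A single recipe file takes one source, so one YAML generally means one database; a workaround is to drive several pipelines from one Python script instead. A minimal sketch, with illustrative connection values:)
    ```
    # Sketch: run one ingestion pipeline per MySQL database from a single script.
    from datahub.ingestion.run.pipeline import Pipeline

    for database in ["db_one", "db_two"]:
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "mysql",
                    "config": {
                        "host_port": "localhost:3306",
                        "database": database,
                        "username": "datahub",
                        "password": "datahub",
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()
    ```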
  • red-pizza-28006
    11/11/2021, 9:57 AM
    I am trying to ingest Kafka and am getting this error:
    ```
    KafkaException: KafkaError{code=_INVALID_ARG,val=-186,str="Failed to create consumer: Invalid sasl.kerberos.kinit.cmd value: Property not available: "sasl.kerberos.keytab""}
    ```
    Has anyone faced this before? For context, we use Confluent Kafka.
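    (The error reads like librdkafka is missing the Kerberos client properties. The kafka source passes connection.consumer_config straight through to the consumer, so a sketch along these lines may help; the exact properties depend on your cluster's SASL setup, and all values below are illustrative:)
    ```
    source:
      type: kafka
      config:
        connection:
          bootstrap: "broker:9092"
          consumer_config:
            security.protocol: SASL_SSL
            sasl.mechanism: GSSAPI
            sasl.kerberos.service.name: kafka
            sasl.kerberos.keytab: /path/to/client.keytab
            sasl.kerberos.principal: datahub@EXAMPLE.COM
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    ```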
  • nice-planet-17111
    11/12/2021, 5:41 AM
    Hello team, I'm trying to ingest Groups via curl, but I'm facing an error... Can anybody help me? 😞
    • Ingesting a user works fine under the same conditions
    • When I try the code below, it says `No root resource defined for path '/corpGroups'`
    ```
    curl 'http://localhost:8080/corpGroups?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '{
      "snapshot": {
        "aspects": [
          {
            "com.linkedin.identity.CorpGroupInfo": {
              "email": "",
              "admins": ["urn:li:corpUser:test_user1"],
              "members": ["urn:li:corpUser:test_user1", "urn:li:corpUser:test_user2"],
              "groups": []
            }
          }
        ],
        "urn": "urn:li:corpGroup:dev"
      }
    }'
    ```
    When I changed `/corpGroups?action=ingest` to `/entities?action=ingest`, it says "Cannot parse request entity".
  • creamy-library-6587
    11/12/2021, 7:40 AM
    Hi team, is it possible to add a tag to a DB table or field while ingesting a schema through the API?
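    (One way to do this programmatically is to emit a globalTags aspect with the Python emitter; a minimal sketch, where the URN and tag name are illustrative. Field-level tags go into the schemaMetadata aspect's per-field globalTags instead:)
    ```
    # Sketch: attach a dataset-level tag via the REST emitter.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # your GMS endpoint
    dataset_urn = make_dataset_urn(platform="mysql", name="mydb.mytable", env="PROD")
    tag_aspect = GlobalTagsClass(tags=[TagAssociationClass(tag="urn:li:tag:pii")])

    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="globalTags",
            aspect=tag_aspect,
        )
    )
    ```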
  • rough-tent-62538
    11/12/2021, 7:50 AM
    Hi, all! I'm setting up lineage collection from Airflow. Is it possible to pass some usage data (like the number of records processed) along with the data flow?
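    (The lineage backend itself doesn't carry record counts, but as a sketch you could emit a datasetProfile aspect from the task once the count is known; the URN and count below are illustrative:)
    ```
    # Sketch: report a row count against a dataset as a profile data point.
    import time

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetProfileClass

    records_processed = 12345  # computed by your task

    emitter = DatahubRestEmitter("http://localhost:8080")
    profile = DatasetProfileClass(
        timestampMillis=int(time.time() * 1000),
        rowCount=records_processed,
    )
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn="urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.mytable,PROD)",
            aspectName="datasetProfile",
            aspect=profile,
        )
    )
    ```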
  • handsome-belgium-11927
    11/12/2021, 10:56 AM
    Does anybody know how to nuke everything if I started DataHub with a docker-compose file? When I run `datahub docker quickstart` it is easily dropped by `datahub docker nuke`, but how do I get the same result if I started it with `docker-compose up -d`?
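    (Roughly the docker-compose equivalent of a nuke; a sketch, assuming you pass the same -f/-p flags you used for `up`. Note that `-v` deletes the named volumes, i.e. all your data:)
    ```
    # stop and remove containers, networks, and named volumes
    docker-compose down -v
    # optionally also remove the pulled images
    docker-compose down -v --rmi all
    ```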
  • lively-jackal-83760
    11/12/2021, 11:19 AM
    Hi guys, is it possible to set env to a custom string, not one of [PROD, DEV, CORP, EI]? We have several departments, each with its own prod/dev Kafka, and sometimes topics have the same names. I want to ingest all envs but divide them somehow.
  • brief-lizard-77958
    11/15/2021, 12:58 PM
    Slightly related to the post above: I wonder if it's possible (or will be possible after the update mentioned in the thread above) to ingest into Datasets without defining the env in the .yml config. I want to have a structure in Datasets without dividing between prod, dev, ...
  • brief-toothbrush-55766
    11/15/2021, 3:07 PM
    Is there anybody who has a custom Docker image for metadata-ingestion? Since I need to be able to point to recipe files in locations other than the current WorkingDir, I can't use the out-of-the-box container/image. If I have to build a custom Docker image, can I just use a python3.6+ base image and then include the following (assuming other dependencies are OK)?
    ```
    RUN python3 -m pip install --upgrade pip wheel setuptools
    RUN python3 -m pip install --upgrade acryl-datahub
    RUN datahub version
    ```
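    (Something like that should work as a starting point; a fuller sketch with an illustrative base image and mount path, so recipes can live anywhere on the host:)
    ```
    FROM python:3.9-slim
    RUN python3 -m pip install --upgrade pip wheel setuptools
    RUN python3 -m pip install --upgrade acryl-datahub
    RUN datahub version
    # Mount recipes at runtime, e.g.:
    #   docker run -v /host/recipes:/recipes my-ingestion ingest -c /recipes/recipe.yml
    ENTRYPOINT ["datahub"]
    ```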
  • brief-toothbrush-55766
    11/15/2021, 10:10 PM
    I am executing ingestion via a Docker container (linkedin/datahub-ingestion), but am running into this issue:
    ```
    'error': 'Unable to emit metadata to DataHub GMS',
    'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
             'message': "No root resource defined for path '/datasets'",
             'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:404]: No root resource defined for path '
                           "'/datasets'\n"
    ```
  • billions-tent-29367
    11/15/2021, 10:34 PM
    I have a problem building metadata-ingestion: I've recently started getting this error from testQuick:
    ```
    =========================== short test summary info ============================
    FAILED tests/unit/test_airflow.py::test_lineage_backend[airflow-1-10-x-decl]
    FAILED tests/unit/test_airflow.py::test_lineage_backend[airflow-2-x-decl] - a...
    ========== 2 failed, 141 passed, 14 deselected, 2 warnings in 14.86s ===========
    ```
  • better-orange-49102
    11/16/2021, 3:43 AM
    When I ingest aspects using the REST endpoint, I noticed that the MySQL field "createdby" is always populated with urn:li:principal:UNKNOWN. Is there an attribute I can add to the MCE to change this? For instance, I would like to know who toggled status.removed to True.
  • polite-flower-25924
    11/16/2021, 6:11 AM
    Hey team, I just wonder how I can delete datasets that were ingested from a particular job/pod without the DataHub CLI? I’ve read the https://datahubproject.io/docs/how/delete-metadata/ page, however I’m not using the DataHub CLI. What should I do in a k8s environment in order to roll back a particular run id?
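    (The CLI's rollback is itself a call against GMS, so one option without the CLI is to hit that endpoint directly, e.g. from a curl inside the cluster; a sketch, assuming your GMS version exposes the /runs resource the CLI uses, with placeholder host and run id:)
    ```
    curl 'http://<gms-host>:8080/runs?action=rollback' -X POST \
      -H 'X-RestLi-Protocol-Version: 2.0.0' \
      -H 'Content-Type: application/json' \
      --data '{"runId": "<your-run-id>", "dryRun": false}'
    ```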
  • victorious-dream-46349
    11/16/2021, 11:03 AM
    Hi... Has anyone tried to extend the DataHub entities? We are able to easily extend the GMS by simply adding PDL files, but for graphql-core we are kind of confused about where to start adding custom code. Some input would really help, thanks.
  • red-pizza-28006
    11/16/2021, 1:47 PM
    How can I get the DataHub REST connection configured in MWAA? In my local environment I do see that option, but I don't see it in MWAA (all the DataHub plugins are already installed).
  • nice-planet-17111
    11/17/2021, 5:27 AM
    Hello team 🙂 Does anyone know what specific privileges are needed at the schema level to ingest data from MySQL? (more in thread)
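    (For a plain metadata ingest, read access to the schema is generally enough; a sketch of a dedicated read-only account, with illustrative names:)
    ```
    CREATE USER 'datahub'@'%' IDENTIFIED BY '<password>';
    -- SELECT covers tables; SHOW VIEW is needed to read view definitions
    GRANT SELECT, SHOW VIEW ON my_schema.* TO 'datahub'@'%';
    ```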
  • full-area-6720
    11/17/2021, 6:10 AM
    Hi, is it possible to create clickable links in DataHub so that one can jump between a primary key and a foreign key?
  • rhythmic-sundown-12093
    11/17/2021, 6:16 AM
    Hi team, when I set the env parameter, an error is reported. Recipe:
    ```
    source:
      type: mysql
      config:
        env: "Stage"
        host_port: xxx
        database: yyy

        # Credentials
        username: zzzz
        password: 'xxxxxx'

        schema_pattern:
          allow: ["AAAA"]

    sink:
      type: "datahub-rest"
      config:
        server: 'http://localhost:9003'
    ```
    Log:
    ```
    Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at com.linkedin.metadata.dao.utils.RecordUtils.invokeProtectedMethod(RecordUtils.java:370)
        at com.linkedin.metadata.dao.utils.RecordUtils.getRecordTemplateField(RecordUtils.java:289)
        at com.linkedin.metadata.dao.utils.ModelUtils.getUrnFromSnapshot(ModelUtils.java:128)
        at com.linkedin.metadata.entity.EntityService.ingestSnapshotUnion(EntityService.java:377)
        at com.linkedin.metadata.entity.EntityService.ingestEntity(EntityService.java:312)
        at com.linkedin.metadata.resources.entity.EntityResource.lambda$ingest$4(EntityResource.java:183)
        at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
        ... 81 more
    Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.linkedin.metadata.dao.utils.RecordUtils.invokeProtectedMethod(RecordUtils.java:368)
        ... 87 more
    Caused by: com.linkedin.data.template.TemplateOutputCastException: Invalid URN syntax: Invalid URN Parameter: 'No enum constant com.linkedin.common.FabricType.Stage: urn:li:dataset:(urn:li:dataPlatform:mysql,AAAA.BBBBBB,Stage)
        at com.linkedin.common.urn.DatasetUrn$1.coerceOutput(DatasetUrn.java:78)
        at com.linkedin.common.urn.DatasetUrn$1.coerceOutput(DatasetUrn.java:69)
        at com.linkedin.data.template.DataTemplateUtil.coerceOutput(DataTemplateUtil.java:954)
        at com.linkedin.data.template.RecordTemplate.obtainCustomType(RecordTemplate.java:365)
        ... 91 more
    Caused by: java.net.URISyntaxException: Invalid URN Parameter: 'No enum constant com.linkedin.common.FabricType.Stage: urn:li:dataset:(urn:li:dataPlatform:mysql,AAAA.BBBBBB,Stage)
        at com.linkedin.common.urn.DatasetUrn.createFromUrn(DatasetUrn.java:55)
        at com.linkedin.common.urn.DatasetUrn.createFromString(DatasetUrn.java:38)
        at com.linkedin.common.urn.DatasetUrn$1.coerceOutput(DatasetUrn.java:76)
        ... 94 more
    ```
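    (The last `Caused by` is the clue: env is coerced into the FabricType enum, so it must be one of that enum's values, e.g. PROD or DEV, rather than an arbitrary string like "Stage":)
    ```
    source:
      type: mysql
      config:
        env: "DEV"  # must match a FabricType enum value; "Stage" is not one
    ```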
  • nice-planet-17111
    11/17/2021, 6:22 AM
    Hello again 🙂 Is it possible to customize the folder structure of entities in the web browser when ingesting from MySQL? For example, currently entities are saved under `prod/mysql/{schema_name}/{schema_name}.{table_name}`, but this has some problems:
    1. It does not represent the instance name, which makes it hard to search.
    2. When I ingest from multiple instances, all the schemas end up under the same mysql folder.
    Is there a way to set a path like `prod/mysql/{instance_name}/{schema_name}/{table_name}` or something like that?
  • orange-flag-48535
    11/17/2021, 7:28 AM
    Hi, what is the format for specifying an Enum's values in the MCE JSON file?
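    (For reference, an enum-typed schema field is usually expressed with the EnumType union member; a sketch of a single field entry with illustrative names. EnumType itself carries no parameters, so the allowed values typically go into nativeDataType or the description:)
    ```
    {
      "fieldPath": "status",
      "nativeDataType": "enum<ACTIVE, INACTIVE>",
      "type": {"type": {"com.linkedin.schema.EnumType": {}}},
      "description": "Allowed values: ACTIVE, INACTIVE"
    }
    ```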
  • creamy-library-6587
    11/17/2021, 9:52 AM
    Hi, I want to start the mce-consumer-job, but there is always an error with `import com.linkedin.mxe.MetadataChangeEvent`. It seems it can't resolve MetadataChangeEvent.pdl. What can I do about this problem?
  • boundless-scientist-520
    11/17/2021, 10:46 AM
    Hi! I've executed the following Superset recipe:
    ```
    source:
      type: "superset"
      config:
        username: xxxx
        password: xxxx
        provider: db
        connect_uri: http://supersetxxxxxx.com
        env: "DEV"
    sink:
      type: "datahub-rest"
      config:
        server: "http://datahub-datahub-gms.datahub.svc.cluster.local:8080"
    ```
    The ingestion was successful. I can see the charts, dashboards, and lineage in DataHub, but in the "Datasets" category the datasets do not appear, although I can see them in the lineage (attached image). How can I see these datasets? Do I need any configuration in the recipe? Thanks for any help.
  • better-orange-49102
    11/18/2021, 2:02 AM
    So John mentioned that I should use the Status aspect to remove a dataset from the UI listing, and not the Deprecated aspect. However, I was testing out Deprecated for a dataset and saw that once the dataset is deprecated (and vanishes from the UI), I cannot undo the aspect and get the dataset back into the listing (i.e., setting the decommission time to a future timestamp or setting the deprecated flag to false). Is this the intended behavior?
  • nutritious-train-7865
    11/18/2021, 2:19 PM
    Hi team, when I was trying to execute the following Trino recipe for testing:
    ```
    source:
      type: trino
      config:
        # Coordinates
        host_port: xxxx
        database: xxxx

        # Credentials
        username: xxxx
        password: xxxx

    sink:
      type: "file"
      config:
        filename: "./example_output_mces.json"
    ```
    I was getting this error:
    ```
    DBAPIError: (trino.exceptions.FailedToObtainAddedPrepareHeader)
    [SQL: SELECT "table_name"
    FROM "information_schema"."tables"
    WHERE "table_schema" = ? and "table_type" != 'VIEW']
    ```
    Can anybody help me with it?
  • clean-crayon-15379
    11/18/2021, 6:13 PM
    Hi team, a short question on dataset statistics: can I access them via a GraphQL API query once they have been created? The idea is to programmatically get info about dataset statistics (last update, number of NaNs, ...). Additionally, is there a concise way of reporting the number of datasets with statistics attached? Thank you for any insights!
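    (A sketch of the kind of GraphQL query this would be, assuming your DataHub version exposes profiles on the Dataset type; field names may differ between releases:)
    ```
    {
      dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.mytable,PROD)") {
        datasetProfiles(limit: 1) {
          timestampMillis
          rowCount
          columnCount
        }
      }
    }
    ```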