# ingestion
  • incalculable-ocean-74010

    02/16/2021, 12:53 PM
    It is being referenced in a GMS client implementation class:
    Copy code
    @RestLiCollection(name = "streams", namespace = "com.linkedin.stream", keyName = "stream")
    public final class Streams extends BaseBrowsableEntityResource<
            // @formatter:off
            ComplexResourceKey<StreamKey, EmptyRecord>,
            Stream,
            StreamUrn,
            StreamSnapshot,
            StreamAspect,
            StreamDocument> {
    
        @Inject
        @Named("streamBrowseDao")
        private BaseBrowseDAO _browseDAO;
    
        @Inject
        @Named("streamSearchDao")
        private BaseSearchDAO _searchDAO;
    
        @Inject
        @Named("streamDao")
        private BaseLocalDAO _localDAO;
    ...

  • orange-night-91387

    02/16/2021, 9:58 PM
    Hi! I'm running into an unsafe type-cast error while trying to convert an in-memory MetadataChangeEvent to a GenericRecord to be sent to Kafka using EventUtils (pegasusToAvroMCE). The MetadataChangeEvent has DataMap values for some of the fields (e.g. URN fields). This code in DataTranslator.java from the Rest.li project is where the issue surfaces:
    Copy code
    // DataTranslator.java, lines 531-533
    case STRING:
        result = new Utf8((String) value);
        break;
    The value in this case is a DataMap representing a DatasetUrn. The Avro schema defines that field as a String, expecting something like "urn:li:dataset:...", but since the value is in the form "{ platform: {...}, origin: ..., name: ... }", this case results in a ClassCastException. Is there a different way I can generate the GenericRecord with the format I have? NOTE: This is NOT master; this is a separate development branch that I'm working on in a forked repo. Not a bug report, just looking for advice 🙂

  • curved-magazine-23582

    02/17/2021, 3:26 AM
    Is there a way to add a new custom data platform?
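    A hedged sketch (not from this thread) of one way to do it: emit a DataPlatformSnapshot through the Python REST emitter. The class names come from datahub.metadata.schema_classes; the platform name "mycustomdb" and the GMS address are placeholders, and this assumes the rest emitter is available in your version of the metadata-ingestion package.

    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DataPlatformInfoClass,
        DataPlatformSnapshotClass,
        MetadataChangeEventClass,
        PlatformTypeClass,
    )

    # Describe the new platform; "mycustomdb" is a placeholder name.
    platform_info = DataPlatformInfoClass(
        name="mycustomdb",
        type=PlatformTypeClass.OTHERS,
        datasetNameDelimiter=".",
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DataPlatformSnapshotClass(
            urn="urn:li:dataPlatform:mycustomdb",
            aspects=[platform_info],
        )
    )

    # Assumes GMS is reachable at the quickstart address.
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)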

  • mammoth-bear-12532

    02/18/2021, 8:04 AM
    Ingestion enthusiasts: wanted to let you know that we've landed some big improvements in the Python ingestion suite for DataHub (including support for Airflow-based ingestion scheduling). Check it out here (https://github.com/linkedin/datahub/tree/master/metadata-ingestion) and let us know how it can be improved further! We'll do a tour of this in the town hall on Friday, so do attend if you are curious about it!
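    A minimal sketch of one way to schedule ingestion from Airflow: a DAG that shells out to the CLI against a recipe file. The DAG id, schedule, and recipe path are placeholders, not the shipped integration.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="datahub_ingest_example",
        start_date=datetime(2021, 2, 1),
        schedule_interval=timedelta(days=1),
        catchup=False,
    ) as dag:
        # Runs the DataHub CLI against a recipe file available on the worker.
        ingest = BashOperator(
            task_id="run_datahub_ingest",
            bash_command="datahub ingest -c /opt/recipes/mysql_to_datahub_rest.yml",
        )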

  • powerful-egg-69769

    02/22/2021, 4:11 PM
    Is it possible to ingest metadata with a regular HTTP request to the DataHub REST API?
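    For reference, a hedged sketch of what a raw HTTP call to GMS looks like: a rest.li "ingest" action on the datasets resource. The snapshot payload, URN, and localhost address are illustrative placeholders.

    import requests

    # Minimal dataset snapshot; the URN and description are made up for illustration.
    payload = {
        "snapshot": {
            "urn": "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleTopic,PROD)",
            "aspects": [
                {
                    "com.linkedin.dataset.DatasetProperties": {
                        "description": "Dataset ingested via a plain HTTP request"
                    }
                }
            ],
        }
    }

    resp = requests.post(
        "http://localhost:8080/datasets?action=ingest",
        json=payload,
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
    )
    resp.raise_for_status()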

  • acoustic-printer-83045

    02/25/2021, 9:45 PM
    Hi everyone! I’m using DataHub for a data catalog hackathon project at InvisionApp. I was able to adapt the data ingest scripts to pull from Redshift, and I have the contents of my warehouse listed as datasets in DataHub. I’m now trying to use the manifest file from dbt (getdbt.com / data build tool) to assign lineage to at least a subset of my data. DataHub is working great; however, I’m struggling a bit with the MCE definitions for adding upstream lineage. I’ve modified the metadata-ingestion componentry to append lineage based on my dbt data; right now it’s just hardcoded while I figure out how to make it all work. The upstream lineage object I’ve appended to ‘aspects’, as seen on send to the DataHub REST endpoint, is:
    Copy code
    {'upstreams': [
        {'auditStamp': {'time': 0, 'actor': '', 'impersonator': None},
         'dataset': 'urn:li:dataset:(urn:li:dataPlatform:redshift,events.analytics_dev_garylucas.carr_quarterly,PROD)',
         'type': 'TRANSFORMED'}
    ]}
    I don’t see an error from that but when I go to load lineage I get the following error in the back end (+ a UI error on the front end)
    Copy code
    datahub-frontend        | 21:36:25 [application-akka.actor.default-dispatcher-313] ERROR application - Fetch Dataset upstreams error
    datahub-frontend        | com.linkedin.data.template.TemplateOutputCastException: Invalid URN syntax: Urn doesn't start with 'urn:'. Urn:  at index 0:
    datahub-frontend        | 	at com.linkedin.common.urn.UrnCoercer.coerceOutput(UrnCoercer.java:25)
    datahub-frontend        | 	at com.linkedin.common.urn.UrnCoercer.coerceOutput(UrnCoercer.java:11)
    datahub-frontend        | 	at com.linkedin.data.template.DataTemplateUtil.coerceOutput(DataTemplateUtil.java:954)
    datahub-frontend        | 	at com.linkedin.data.template.RecordTemplate.obtainCustomType(RecordTemplate.java:365)
    datahub-frontend        | 	at com.linkedin.common.AuditStamp.getActor(AuditStamp.java:159)
    datahub-frontend        | 	at com.linkedin.datahub.util.DatasetUtil.toLineageView(DatasetUtil.java:97)
    datahub-frontend        | 	at com.linkedin.datahub.dao.table.LineageDao.lambda$getUpstreamLineage$1(LineageDao.java:39)
    datahub-frontend        | 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    datahub-frontend        | 	at java.util.Iterator.forEachRemaining(Iterator.java:116)
    datahub-frontend        | 	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    datahub-frontend        | 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    datahub-frontend        | 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    datahub-frontend        | 	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    datahub-frontend        | 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    datahub-frontend        | 	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    datahub-frontend        | 	at com.linkedin.datahub.dao.table.LineageDao.getUpstreamLineage(LineageDao.java:40)
    datahub-frontend        | 	at controllers.api.v2.Dataset.getDatasetUpstreams(Dataset.java:250)
    datahub-frontend        | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$28$$anonfun$apply$28.apply(Routes.scala:910)
    datahub-frontend        | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$28$$anonfun$apply$28.apply(Routes.scala:910)
    datahub-frontend        | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:134)
    datahub-frontend        | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:133)
    datahub-frontend        | 	at play.core.routing.HandlerInvokerFactory$JavaActionInvokerFactory$$anon$8$$anon$2$$anon$1.invocation(HandlerInvoker.scala:108)
    I’m pretty sure I’ve misconfigured my upstream lineage object; however, it passes validation on the way in. Any suggestions on how to troubleshoot this further? Thanks in advance, and I appreciate any insight.
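    The stack trace points at UrnCoercer failing to coerce the empty actor string in the auditStamp, so the aspect shape looks like the culprit rather than the dataset URN. A hedged sketch of a well-formed entry using the generated classes (the corpuser actor URN is an assumption; any valid URN works):

    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream_lineage = UpstreamLineageClass(
        upstreams=[
            UpstreamClass(
                # actor must be a valid URN; an empty string fails UrnCoercer on read.
                auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:etl"),
                dataset=(
                    "urn:li:dataset:(urn:li:dataPlatform:redshift,"
                    "events.analytics_dev_garylucas.carr_quarterly,PROD)"
                ),
                type=DatasetLineageTypeClass.TRANSFORMED,
            )
        ]
    )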

  • incalculable-ocean-74010

    03/01/2021, 10:25 AM
    Also, I had to install thrift (pip install thrift) in my Python environment to get this far.

  • incalculable-ocean-74010

    03/01/2021, 5:34 PM
    Namely, LDAP or Kerberos?

  • white-chef-85966

    03/02/2021, 8:38 AM
    Hi there, can anyone please tell me how to manually update the relationships (upstream/downstream) of datasets? I know there are APIs/Kafka messages that can help, but I was hoping there would be pages in the UI to do so.

  • incalculable-ocean-74010

    03/02/2021, 12:11 PM
    Hello, when running the hive crawler is it normal to have the following warnings?
    • unable to map type DATE to metadata schema
    • unable to map type TIMESTAMP to metadata schema
    • unable to map type DECIMAL to metadata schema

  • calm-sunset-28996

    03/02/2021, 2:28 PM
    I have a few questions regarding this recently created file: https://github.com/linkedin/datahub/blob/master/docker/datahub-ingestion/Dockerfile Is this meant to be a standalone deployment or part of the compose ecosystem? I’m doing something similar at work, so I’m trying to use this component; however, I’m hitting some issues when building the image. (One example is that Gradle is not pre-installed in the openjdk8 image, so I switched it out for the Gradle one.) Not sure if this is the intended usage or if I’m just doing something wrong here. 🙂

  • incalculable-ocean-74010

    03/02/2021, 5:43 PM
    When using the ingestion framework and specifying a database, is it expected that the crawler works through all databases but prefixes the DatasetURN of each entity with the database defined in the crawling config?

  • calm-sunset-28996

    03/04/2021, 3:23 PM
    Can we delete/modify this pydantic check? https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/configuration/kafka.py#L21 Our bootstrap server name has dots in it, so it doesn’t pass the check and we patched it. Everything else works fine, so nice work!

  • brief-toothbrush-55766

    03/05/2021, 12:37 PM
    Copy code
    pip install -e .
    Obtaining file:///home/gama/SDAP/datahub/metadata-ingestion
        ERROR: Command errored out with exit status 1:
         command: /home/gama/SDAP/datahub/metadata-ingestion/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/gama/SDAP/datahub/metadata-ingestion/setup.py'"'"'; __file__='"'"'/home/gama/SDAP/datahub/metadata-ingestion/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-fs5vr9kr
             cwd: /home/gama/SDAP/datahub/metadata-ingestion/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/home/gama/SDAP/datahub/metadata-ingestion/setup.py", line 57, in <module>
            packages=setuptools.find_namespace_packages(where="./src"),
        AttributeError: module 'setuptools' has no attribute 'find_namespace_packages'
        ----------------------------------------
    WARNING: Discarding file:///home/gama/SDAP/datahub/metadata-ingestion. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

  • incalculable-ocean-74010

    03/05/2021, 6:02 PM
    Hello, is anyone in the community working on creating a helm chart for the metadata-ingestion module?

  • incalculable-ocean-74010

    03/05/2021, 6:02 PM
    As a follow-up, how mature are the helm charts for datahub?

  • brief-toothbrush-55766

    03/06/2021, 12:31 PM
    Is DataHub ingestion able to handle sources with spatial columns, i.e. geometry? Seems like it's not. Got the following error while ingesting metadata from a Postgres (PostGIS) source with a 'geom' column of 'geometry' type:

  • brief-toothbrush-55766

    03/08/2021, 8:22 PM
    Running into this error:
    ImportError: cannot import name 'TagSnapshotClass'
    while trying to ingest a dataset with source: postgres -> sink: datahub-rest. Again, this worked before; then I did a git pull, started the venv (also installed GeoAlchemy2), and tried to ingest as before. Anything I am missing?

  • breezy-glass-7892

    03/09/2021, 9:20 AM
    Hi team, I’ve just deployed the app and ran datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml; I don’t see the dataset in http://localhost:9001. I also loaded the data from BigQuery:
    Copy code
    source:
      type: bigquery
      config:
        project_id: data-sandbox-123
        # options:
          # credentials_path: "/service_account_key.json"
    sink:
      type: "datahub-rest"
      config:
        server: 'http://localhost:8080'
    Is there something I might be missing here?

  • calm-sunset-28996

    03/09/2021, 7:32 PM
    Got a question: how are you all handling secrets? Because we can’t really commit these recipes to git with a password in plain text 😄 So I patched the yaml config for now to fetch from SSM (we use AWS) whenever a value is prefixed with “ssm://”. Not sure if anybody has a better way or idea? It seemed a bit cleaner than rewriting the recipes on the fly (as the ingest entrypoint expects a file and not a loaded config object).
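    A rough sketch of the approach described above, resolving ssm:// values before the recipe is used; the prefix, recipe path, and parameter names are illustrative, and it assumes boto3 credentials/region are already configured. The resolved dict can then be fed to the ingestion pipeline programmatically instead of going through a file on disk.

    import boto3
    import yaml

    ssm = boto3.client("ssm")


    def resolve_secrets(node):
        """Recursively replace "ssm://<name>" strings with decrypted Parameter Store values."""
        if isinstance(node, str) and node.startswith("ssm://"):
            name = node[len("ssm://"):]
            return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
        if isinstance(node, dict):
            return {key: resolve_secrets(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve_secrets(value) for value in node]
        return node


    with open("recipe.yml") as f:
        recipe = resolve_secrets(yaml.safe_load(f))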

  • gentle-exabyte-43102

    03/11/2021, 7:52 PM
    Good Morning! Anyone seen thrift errors like this before?
    thrift.transport.TTransport.TTransportException: Bad status: 78 (b'5.7.22-log')

  • incalculable-ocean-74010

    03/12/2021, 3:10 PM
    Hello, does the ingestion framework use https://www.python.org/dev/peps/pep-0249/ to crawl metadata using sqlalchemy?

  • curved-crayon-1929

    03/16/2021, 5:16 AM
    Hi all, could someone please confirm whether ingestion from MongoDB is supported? If yes, please help me with the respective YAML file. Could someone help me, as this is important for us to proceed further? Thanks.
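    Assuming the mongodb source is available in your version of the metadata-ingestion package, a minimal sketch is below (the connection URI and server address are placeholders); the same source/sink blocks can also be written as a YAML recipe for datahub ingest -c.

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mongodb",
                # Placeholder connection string; add auth options as needed.
                "config": {"connect_uri": "mongodb://localhost:27017"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()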

  • calm-lawyer-777

    03/17/2021, 11:19 AM
    Hi guys, quick question: we are successfully importing Hive (Kerberized) metadata. Now we want to update the datasets inside DataHub with lineage information. How do we do that? Currently we extract the upstream and downstream information from the Hive SQL history.
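    One hedged sketch of the emit side (dataset names, actor URN, and GMS address are placeholders): wrap an UpstreamLineage aspect in a DatasetSnapshot MCE and push it through the REST emitter, building the upstream list from whatever the Hive SQL history yields.

    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetLineageTypeClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    downstream = "urn:li:dataset:(urn:li:dataPlatform:hive,db.report_table,PROD)"
    upstreams = ["urn:li:dataset:(urn:li:dataPlatform:hive,db.source_table,PROD)"]

    lineage = UpstreamLineageClass(
        upstreams=[
            UpstreamClass(
                auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:etl"),
                dataset=urn,
                type=DatasetLineageTypeClass.TRANSFORMED,
            )
            for urn in upstreams
        ]
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(urn=downstream, aspects=[lineage])
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)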

  • incalculable-ocean-74010

    03/26/2021, 4:51 PM
    Hello, is anyone working on a way to persist manually entered field descriptions when the underlying databases do not have them in the table definitions?

  • wonderful-quill-11255

    03/28/2021, 1:45 PM
    Hi. Is the ingestion library published to PyPi? If not, is there a plan for doing that?

  • calm-lawyer-777

    03/30/2021, 10:30 AM
    Hi team, I want to ask: does DataHub maintain schema versioning?

  • able-jelly-81126

    03/30/2021, 2:36 PM
    hey! 👋 we’ve been adding support for AWS Glue over the last day and are getting ready to open a PR in the near future. Are there any guides on what documentation we need to add and how/where to add it?

  • high-hospital-85984

    03/31/2021, 10:00 AM
    Just checking to make sure I’ve understood this correctly: we can’t create tags via MCEs because the builder is not listed here: https://github.com/linkedin/datahub/blob/master/metadata-dao-impl/restli-dao/src/main/java/com/linkedin/metadata/dao/RequestBuilders.java

  • brave-appointment-76997

    03/31/2021, 11:21 AM
    Hi there, my use case is to capture data lineage from Spark jobs which run using the KubernetesPodOperator in Airflow. Is this integration with Airflow supported in DataHub? I am a newbie to DataHub. Any help is appreciated! Thanks