# ingestion
  • incalculable-ocean-74010

    02/16/2021, 12:53 PM
    It is being referenced in a GMS client implementation class:
    Copy code
    @RestLiCollection(name = "streams", namespace = "com.linkedin.stream", keyName = "stream")
    public final class Streams extends BaseBrowsableEntityResource<
            // @formatter:off
            ComplexResourceKey<StreamKey, EmptyRecord>,
            Stream,
            StreamUrn,
            StreamSnapshot,
            StreamAspect,
            StreamDocument> {
    
        @Inject
        @Named("streamBrowseDao")
        private BaseBrowseDAO _browseDAO;
    
        @Inject
        @Named("streamSearchDao")
        private BaseSearchDAO _searchDAO;
    
        @Inject
        @Named("streamDao")
        private BaseLocalDAO _localDAO;
    ...

  • orange-night-91387

    02/16/2021, 9:58 PM
    Hi! I'm running into an unsafe type-cast error while trying to convert an in-memory MetadataChangeEvent to a GenericRecord to be sent to Kafka using EventUtils (pegasusToAvroMCE). The MetadataChangeEvent has DataMap values for some of the fields (e.g. URN fields). This code in DataTranslator.java from the Rest.li project is where the issue surfaces:
    Copy code
    // DataTranslator.java, lines 531-533
    case STRING:
        result = new Utf8((String) value);
        break;
    The value in this case is a DataMap representing a DatasetUrn. The Avro schema defines that field as a String, expecting something like "urn:li:dataset:...", but since the value is in the form "{ platform: {...}, origin: ..., name: ... }", this case results in a ClassCastException. Is there a different way I can generate the GenericRecord with the format I have? NOTE: This is NOT master; this is a separate development branch that I'm working on in a forked repo. Not a bug report, just looking for advice 🙂

  • curved-magazine-23582

    02/17/2021, 3:26 AM
    Is there a way to add a new custom data platform?
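    A hedged sketch (not from this thread) of one way to do it: emit a DataPlatformSnapshot through the Python REST emitter. The class names come from datahub.metadata.schema_classes; the platform name "mycustomdb" and the GMS address are placeholders, and this assumes the rest emitter is available in your version of the metadata-ingestion package.

    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DataPlatformInfoClass,
        DataPlatformSnapshotClass,
        MetadataChangeEventClass,
        PlatformTypeClass,
    )

    # Describe the new platform; "mycustomdb" is a placeholder name.
    platform_info = DataPlatformInfoClass(
        name="mycustomdb",
        type=PlatformTypeClass.OTHERS,
        datasetNameDelimiter=".",
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DataPlatformSnapshotClass(
            urn="urn:li:dataPlatform:mycustomdb",
            aspects=[platform_info],
        )
    )

    # Assumes GMS is reachable at the quickstart address.
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)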

  • mammoth-bear-12532

    02/18/2021, 8:04 AM
    Ingestion enthusiasts: wanted to let you know that we've landed some big improvements in the Python ingestion suite for DataHub (including support for Airflow-based ingestion scheduling). Check it out here (https://github.com/linkedin/datahub/tree/master/metadata-ingestion) and let us know how it can be improved further! We'll do a tour of this in the town hall on Friday, so do attend if you are curious about it!
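    A minimal sketch of one way to schedule ingestion from Airflow: a DAG that shells out to the CLI against a recipe file. The DAG id, schedule, and recipe path are placeholders, not the shipped integration.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="datahub_ingest_example",
        start_date=datetime(2021, 2, 1),
        schedule_interval=timedelta(days=1),
        catchup=False,
    ) as dag:
        # Runs the DataHub CLI against a recipe file available on the worker.
        ingest = BashOperator(
            task_id="run_datahub_ingest",
            bash_command="datahub ingest -c /opt/recipes/mysql_to_datahub_rest.yml",
        )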

  • powerful-egg-69769

    02/22/2021, 4:11 PM
    Is it possible to ingest metadata with a regular HTTP request to the DataHub REST API?
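    For reference, a hedged sketch of what a raw HTTP call to GMS looks like: a rest.li "ingest" action on the datasets resource. The snapshot payload, URN, and localhost address are illustrative placeholders.

    import requests

    # Minimal dataset snapshot; the URN and description are made up for illustration.
    payload = {
        "snapshot": {
            "urn": "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleTopic,PROD)",
            "aspects": [
                {
                    "com.linkedin.dataset.DatasetProperties": {
                        "description": "Dataset ingested via a plain HTTP request"
                    }
                }
            ],
        }
    }

    resp = requests.post(
        "http://localhost:8080/datasets?action=ingest",
        json=payload,
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
    )
    resp.raise_for_status()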

  • acoustic-printer-83045

    02/25/2021, 9:45 PM
    Hi everyone! I’m using DataHub for a data catalog hackathon project at InvisionApp. I was able to adapt the data ingest scripts to pull from Redshift, and I have the contents of my warehouse listed as datasets in DataHub. I’m now trying to use the manifest file from dbt (getdbt.com / data build tool) to assign lineage to at least a subset of my data. DataHub is working great; however, I’m struggling a bit with the MCE definitions for adding upstream lineage. I’ve modified the metadata-ingestion componentry to append lineage based on my dbt data; right now it’s just hardcoded while I figure out how to make it all work. The upstream lineage object I’ve appended to ‘aspects’, as seen on send to the DataHub REST endpoint, is:
    Copy code
    {'upstreams': [
        {'auditStamp': {'time': 0, 'actor': '', 'impersonator': None},
         'dataset': 'urn:li:dataset:(urn:li:dataPlatform:redshift,events.analytics_dev_garylucas.carr_quarterly,PROD)',
         'type': 'TRANSFORMED'}
    ]}
    I don’t see an error from that but when I go to load lineage I get the following error in the back end (+ a UI error on the front end)
    Copy code
    datahub-frontend        | 21:36:25 [application-akka.actor.default-dispatcher-313] ERROR application - Fetch Dataset upstreams error
    datahub-frontend        | com.linkedin.data.template.TemplateOutputCastException: Invalid URN syntax: Urn doesn't start with 'urn:'. Urn:  at index 0:
    datahub-frontend        | 	at com.linkedin.common.urn.UrnCoercer.coerceOutput(UrnCoercer.java:25)
    datahub-frontend        | 	at com.linkedin.common.urn.UrnCoercer.coerceOutput(UrnCoercer.java:11)
    datahub-frontend        | 	at com.linkedin.data.template.DataTemplateUtil.coerceOutput(DataTemplateUtil.java:954)
    datahub-frontend        | 	at com.linkedin.data.template.RecordTemplate.obtainCustomType(RecordTemplate.java:365)
    datahub-frontend        | 	at com.linkedin.common.AuditStamp.getActor(AuditStamp.java:159)
    datahub-frontend        | 	at com.linkedin.datahub.util.DatasetUtil.toLineageView(DatasetUtil.java:97)
    datahub-frontend        | 	at com.linkedin.datahub.dao.table.LineageDao.lambda$getUpstreamLineage$1(LineageDao.java:39)
    datahub-frontend        | 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    datahub-frontend        | 	at java.util.Iterator.forEachRemaining(Iterator.java:116)
    datahub-frontend        | 	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    datahub-frontend        | 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    datahub-frontend        | 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    datahub-frontend        | 	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    datahub-frontend        | 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    datahub-frontend        | 	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    datahub-frontend        | 	at com.linkedin.datahub.dao.table.LineageDao.getUpstreamLineage(LineageDao.java:40)
    datahub-frontend        | 	at controllers.api.v2.Dataset.getDatasetUpstreams(Dataset.java:250)
    datahub-frontend        | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$28$$anonfun$apply$28.apply(Routes.scala:910)
    datahub-frontend        | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$28$$anonfun$apply$28.apply(Routes.scala:910)
    datahub-frontend        | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:134)
    datahub-frontend        | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:133)
    datahub-frontend        | 	at play.core.routing.HandlerInvokerFactory$JavaActionInvokerFactory$$anon$8$$anon$2$$anon$1.invocation(HandlerInvoker.scala:108)
    I’m pretty sure I’ve misconfigured my upstream lineage object; however, it passes validation on the way in. Any suggestions on how to troubleshoot this further? Thanks in advance, and I appreciate any insight.
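    The stack trace points at UrnCoercer failing to coerce the empty actor string in the auditStamp, so the aspect shape looks like the culprit rather than the dataset URN. A hedged sketch of a well-formed entry using the generated classes (the corpuser actor URN is an assumption; any valid URN works):

    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream_lineage = UpstreamLineageClass(
        upstreams=[
            UpstreamClass(
                # actor must be a valid URN; an empty string fails UrnCoercer on read.
                auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:etl"),
                dataset=(
                    "urn:li:dataset:(urn:li:dataPlatform:redshift,"
                    "events.analytics_dev_garylucas.carr_quarterly,PROD)"
                ),
                type=DatasetLineageTypeClass.TRANSFORMED,
            )
        ]
    )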

  • incalculable-ocean-74010

    03/01/2021, 10:25 AM
    Also, I had to install thrift (pip install thrift) in my Python environment to get this far.

  • incalculable-ocean-74010

    03/01/2021, 5:34 PM
    Namely, LDAP or Kerberos?

  • white-chef-85966

    03/02/2021, 8:38 AM
    Hi there, can anyone please tell me how to manually update the relationships (upstream/downstream) of datasets? I know there are APIs/Kafka messages that can help, but I was hoping there would be pages in the UI to do so.

  • incalculable-ocean-74010

    03/02/2021, 12:11 PM
    Hello, when running the hive crawler is it normal to have the following warnings?
    • unable to map type DATE to metadata schema
    • unable to map type TIMESTAMP to metadata schema
    • unable to map type DECIMAL to metadata schema

  • calm-sunset-28996

    03/02/2021, 2:28 PM
    I have a few questions regarding this recently created file: https://github.com/linkedin/datahub/blob/master/docker/datahub-ingestion/Dockerfile Is this meant to be a standalone deployment or part of the compose ecosystem? I’m doing something similar at work, so I’m trying to use this component; however, I’m hitting some issues when building the image. (One example is that Gradle is not pre-installed in the openjdk8 image, so I switched it out for the Gradle one.) Not sure if this is the intended usage or if I’m just doing something wrong here. 🙂

  • incalculable-ocean-74010

    03/02/2021, 5:43 PM
    When using the ingestion framework and specifying a database, is it expected that the crawler works through all databases but prefixes the DatasetURN of each entity with the database defined in the crawling config?

  • calm-sunset-28996

    03/04/2021, 3:23 PM
    Can we delete/modify this pydantic check? https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/configuration/kafka.py#L21 Our bootstrap server name has dots in it, so it doesn’t pass the check and we patched it. Everything else works fine, so nice work!

  • brief-toothbrush-55766

    03/05/2021, 12:37 PM
    Copy code
    pip install -e .
    Obtaining file:///home/gama/SDAP/datahub/metadata-ingestion
        ERROR: Command errored out with exit status 1:
         command: /home/gama/SDAP/datahub/metadata-ingestion/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/gama/SDAP/datahub/metadata-ingestion/setup.py'"'"'; __file__='"'"'/home/gama/SDAP/datahub/metadata-ingestion/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-fs5vr9kr
             cwd: /home/gama/SDAP/datahub/metadata-ingestion/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/home/gama/SDAP/datahub/metadata-ingestion/setup.py", line 57, in <module>
            packages=setuptools.find_namespace_packages(where="./src"),
        AttributeError: module 'setuptools' has no attribute 'find_namespace_packages'
        ----------------------------------------
    WARNING: Discarding file:///home/gama/SDAP/datahub/metadata-ingestion. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

  • incalculable-ocean-74010

    03/05/2021, 6:02 PM
    Hello, is anyone in the community working on creating a helm chart for the metadata-ingestion module?

  • incalculable-ocean-74010

    03/05/2021, 6:02 PM
    As a follow-up, how mature are the helm charts for datahub?

  • brief-toothbrush-55766

    03/06/2021, 12:31 PM
    Is DataHub ingestion able to handle sources with spatial columns, i.e. geometry? Seems like it's not. Got the following error while ingesting metadata from a Postgres (PostGIS) source with a 'geom' column of 'geometry' type:

  • brief-toothbrush-55766

    03/08/2021, 8:22 PM
    Running into this error:
    ImportError: cannot import name 'TagSnapshotClass'
    while trying to ingest a dataset with source: postgres -> sink: datahub-rest. Again, this worked before; then I did a git pull, started the venv (also installed GeoAlchemy2), and tried to ingest as before. Anything I am missing?

  • breezy-glass-7892

    03/09/2021, 9:20 AM
    Hi team, I’ve just deployed the app and ran datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml; I don’t see the dataset in http://localhost:9001. I also loaded the data from BigQuery:
    Copy code
    source:
      type: bigquery
      config:
        project_id: data-sandbox-123
        # options:
          # credentials_path: "/service_account_key.json"
    sink:
      type: "datahub-rest"
      config:
        server: 'http://localhost:8080'
    Is there something I might be missing here?

  • calm-sunset-28996

    03/09/2021, 7:32 PM
    Got a question: how are you all handling secrets? Because we can’t really commit these recipes to git with a password in plain text 😄 So I patched the yaml config for now to fetch from SSM (we use AWS) whenever a value is prefixed with “ssm://”. Not sure if anybody has a better way or idea? It seemed a bit cleaner than rewriting the recipes on the fly (as the ingest entrypoint expects a file and not a loaded config object).
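    A rough sketch of the approach described above, resolving ssm:// values before the recipe is used; the prefix, recipe path, and parameter names are illustrative, and it assumes boto3 credentials/region are already configured. The resolved dict can then be fed to the ingestion pipeline programmatically instead of going through a file on disk.

    import boto3
    import yaml

    ssm = boto3.client("ssm")


    def resolve_secrets(node):
        """Recursively replace "ssm://<name>" strings with decrypted Parameter Store values."""
        if isinstance(node, str) and node.startswith("ssm://"):
            name = node[len("ssm://"):]
            return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
        if isinstance(node, dict):
            return {key: resolve_secrets(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve_secrets(value) for value in node]
        return node


    with open("recipe.yml") as f:
        recipe = resolve_secrets(yaml.safe_load(f))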

  • gentle-exabyte-43102

    03/11/2021, 7:52 PM
    Good Morning! Anyone seen thrift errors like this before?
    thrift.transport.TTransport.TTransportException: Bad status: 78 (b'5.7.22-log')

  • incalculable-ocean-74010

    03/12/2021, 3:10 PM
    Hello, does the ingestion framework use https://www.python.org/dev/peps/pep-0249/ to crawl metadata using sqlalchemy?

  • curved-crayon-1929

    03/16/2021, 5:16 AM
    Hi all, could someone please confirm whether ingestion from MongoDB is supported? If yes, please help me with the respective YAML file. Could someone help me, as this is important for us to proceed further? Thanks.
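    Assuming the mongodb source is available in your version of the metadata-ingestion package, a minimal sketch is below (the connection URI and server address are placeholders); the same source/sink blocks can also be written as a YAML recipe for datahub ingest -c.

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mongodb",
                # Placeholder connection string; add auth options as needed.
                "config": {"connect_uri": "mongodb://localhost:27017"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()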

  • calm-lawyer-777

    03/17/2021, 11:19 AM
    Hi guys, quick question: we are successfully importing Hive (Kerberized) metadata. Now we want to update the datasets inside DataHub with lineage information. How do we do that? Currently we extract the upstream and downstream information from the Hive SQL history.
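    One hedged sketch of the emit side (dataset names, actor URN, and GMS address are placeholders): wrap an UpstreamLineage aspect in a DatasetSnapshot MCE and push it through the REST emitter, building the upstream list from whatever the Hive SQL history yields.

    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetLineageTypeClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    downstream = "urn:li:dataset:(urn:li:dataPlatform:hive,db.report_table,PROD)"
    upstreams = ["urn:li:dataset:(urn:li:dataPlatform:hive,db.source_table,PROD)"]

    lineage = UpstreamLineageClass(
        upstreams=[
            UpstreamClass(
                auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:etl"),
                dataset=urn,
                type=DatasetLineageTypeClass.TRANSFORMED,
            )
            for urn in upstreams
        ]
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(urn=downstream, aspects=[lineage])
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)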

  • incalculable-ocean-74010

    03/26/2021, 4:51 PM
    Hello, is anyone working on a way to persist manually entered field descriptions when the underlying databases do not have them in the table definitions?

  • wonderful-quill-11255

    03/28/2021, 1:45 PM
    Hi. Is the ingestion library published to PyPi? If not, is there a plan for doing that?

  • calm-lawyer-777

    03/30/2021, 10:30 AM
    Hi team, I want to ask: does DataHub maintain schema versioning?

  • able-jelly-81126

    03/30/2021, 2:36 PM
    hey! 👋 we’ve been adding support for AWS Glue over the last day and are getting ready to open a PR in the near future. Are there any guides on what documentation we need to add and how/where to add it?

  • high-hospital-85984

    03/31/2021, 10:00 AM
    Just checking to make sure I’ve understood this correctly: we can’t create tags via MCEs because the builder is not listed here: https://github.com/linkedin/datahub/blob/master/metadata-dao-impl/restli-dao/src/main/java/com/linkedin/metadata/dao/RequestBuilders.java

  • brave-appointment-76997

    03/31/2021, 11:21 AM
    Hi there, my use case is to capture data lineage from Spark jobs which run using the KubernetesPodOperator in Airflow. Is this integration with Airflow supported in DataHub? I am a newbie to DataHub. Any help is appreciated! Thanks