# ingestion
  • s

    steep-airplane-62865

    03/10/2020, 6:24 PM
@crooked-vegetable-89645 Do you observe any errors in the datahub-mce-consumer container logs? If not, could you check the DB to see whether the record for the schema aspect is already there?
  • b

    bumpy-keyboard-50565

    03/12/2020, 2:36 PM
@most-tent-28381
1. Can we ingest our data as well, or does DataHub only support metadata ingestion? Are you asking "can I call the rest.li API to ingest data directly instead of using a Kafka event"? If so, the answer is yes.
2. How is the metadata updated in DataHub? Is it real-time? It's real-time if you're updating via the rest.li API. It's near real-time in the case of Kafka events, due to the async nature of Kafka processing.
3. How is the lineage generated? Do we have to manually specify parent and child relationships? Yes. We have internal integration with systems like Gobblin which automatically emit events for lineage. However, if such integration isn't available you'll need to do your own processing to derive the lineage.
4. Is there any future plan to include statistics about the data present in the databases? Yes. We're actively working on that internally and will open source it once it's stabilized.
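Relating to answer 1 above, here is a minimal sketch of ingesting a dataset snapshot via the GMS rest.li API instead of producing a Kafka MCE. The endpoint path, the X-RestLi-Protocol-Version header, the payload shape, and the example urn/owner values are assumptions for illustration, not a verified contract.
Copy code
# Hypothetical sketch: ingest a dataset snapshot through the GMS rest.li API
# instead of emitting a Kafka MetadataChangeEvent. Endpoint, header, and payload
# shape are assumptions for this illustration.
import json
import requests

GMS_URL = "http://localhost:8080"  # assumed local GMS address

snapshot = {
    "snapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,example_db.example_table,PROD)",
        "aspects": [
            {
                "com.linkedin.common.Ownership": {
                    "owners": [{"owner": "urn:li:corpuser:datahub", "type": "DATAOWNER"}],
                    "lastModified": {"time": 0, "actor": "urn:li:corpuser:datahub"},
                }
            }
        ],
    }
}

response = requests.post(
    f"{GMS_URL}/datasets?action=ingest",
    headers={"X-RestLi-Protocol-Version": "2.0.0"},
    data=json.dumps(snapshot),
)
response.raise_for_status()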
  • b

    bumpy-keyboard-50565

    03/16/2020, 1:27 PM
@eager-wall-19224 it's a bit difficult to be a human Avro parser 🙂 Would you mind trying to build a simpler message first (e.g. with only one aspect)? You can use this script to quickly debug your message: https://github.com/linkedin/datahub/tree/master/metadata-ingestion/mce-cli
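For reference, a sketch of what a "simpler message with only one aspect" could look like for the mce-cli script linked above. The union-as-tuple encoding and the fully qualified pegasus2avro record names mirror the example ETL snippets quoted elsewhere on this page; treat the exact names and fields as assumptions.
Copy code
# Hypothetical minimal MCE with a single Ownership aspect. Unions are encoded as
# (fully-qualified-name, value) tuples, as in the example ingestion scripts.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,example_db.example_table,PROD)"

ownership = (
    "com.linkedin.pegasus2avro.common.Ownership",  # record name assumed for this sketch
    {
        "owners": [{"owner": "urn:li:corpuser:datahub", "type": "DATAOWNER"}],
        "lastModified": {"time": 0, "actor": "urn:li:corpuser:datahub"},
    },
)

mce = {
    "auditHeader": None,
    "proposedSnapshot": (
        "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot",
        {"urn": dataset_urn, "aspects": [ownership]},
    ),
    "proposedDelta": None,
}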
  • e

    eager-wall-19224

    03/17/2020, 2:28 AM
@bumpy-keyboard-50565 thanks, I changed the description. I'll try to use mce_cli.py. Actually, I used some other Avro schema validator and it told me something useful, too.
  • a

    ancient-animal-17306

    03/24/2020, 9:03 AM
Are the ETL scripts enough for data ingestion? Is that how LinkedIn uses it internally? Or is there some heavyweight framework like Gobblin that can also proactively scan the datasets and push the metadata to DataHub?
  • m

    most-tent-28381

    03/25/2020, 6:14 AM
Do we have to specify the schema before we can actually ingest the data into DataHub? I am using the rdbms_etl.py script and am getting a serialization error when trying to publish data from a Postgres database.
  • a

    agreeable-boots-73250

    03/26/2020, 2:39 PM
    able to resolve this
  • a

    agreeable-boots-73250

    03/27/2020, 10:09 AM
@bumpy-keyboard-50565 I have started the DataHub services and they look fine.
  • h

    handsome-grass-34789

    04/02/2020, 6:28 AM
Hi, I am getting the following error when I try to push my own data following the instructions from the metadata-ingestion README. Is there any other step I'm missing apart from the ones given?
    error.txt
  • f

    fast-exabyte-18411

    04/16/2020, 9:16 PM
Hi all! I'm working on implementing Redshift ingestion and running into some problems getting the schema to appear in the UI. It seems like this is connected to the platform_schema part of the MCE: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/mysql-etl/mysql_etl.py#L30 Is there anywhere I can go for logs associated with this? I don't see anything popping up in the docker-compose logs and the entities are being registered, so from my perspective this is a silent failure at the moment.
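Since the message above points at the platform_schema piece of the MCE, here is a small illustration of that field in isolation. The MySqlDDL record name and its tableSchema field are taken from the linked example script; whether Redshift should use a different DDL record type is not established here, and the DDL string is a placeholder.
Copy code
# Hypothetical platformSchema value inside a SchemaMetadata aspect, using the
# same union-as-tuple encoding as the example ETL scripts. Names are assumptions.
platform_schema = (
    "com.linkedin.pegasus2avro.schema.MySqlDDL",
    {"tableSchema": "CREATE TABLE example_table (id INT, name VARCHAR(255))"},
)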
  • a

    agreeable-boots-73250

    04/24/2020, 4:06 PM
Hi, if anyone has tried connecting to Azure SQL Server please let me know; I am getting the following error.
  • a

    agreeable-boots-73250

    05/04/2020, 1:39 PM
<!channel> I am trying to develop the MSSQL pipeline for DataHub and I changed some things. Code is attached for reference, but I am getting the following error.
  • a

    agreeable-boots-73250

    05/04/2020, 1:41 PM
MSSQL (Microsoft SQL Server) pipeline for ingestion, referring to the MySQL ingestion Python script.
  • a

    agreeable-boots-73250

    05/06/2020, 3:32 PM
<!here> I have tweaked mysql-pipeline.py to ingest the metadata for MSSQL Server, but it does not insert the values, as shown in the screenshot.
    mssql_pipeline.py
  • m

    miniature-ability-75189

    07/01/2020, 8:46 AM
I have the same question as @strong-analyst-47204: columns don't appear after MySQL metadata was ingested, although they do appear in the produced message.
  • b

    billowy-eye-48149

    07/06/2020, 2:28 PM
<!here> I am trying to ingest data into DataHub with a special character in the dataset urn, like urn:li:dataPlatform:abc/xyz. The schema platform is also given the same name. The dataset is created after ingestion, but when I go inside the dataset I get the error below. Could someone help me understand the reason for the error?
  • b

    billowy-eye-48149

    07/27/2020, 9:15 PM
Hi team, I am trying to ingest the tags field for a dataset. The tags are not reflected in the frontend after the ingestion. Am I missing something here? ("com.linkedin.pegasus2avro.dataset.DatasetProperties", {"description":"Test table for Kafka","tags":["Label1","Label2"]})
  • s

    silly-cat-58453

    08/01/2020, 1:56 PM
Hi all, I am not using Docker (docker commands) to ingest data, so I have produced the data to the "MetadataChangeEvent" topic, which is persisted to MySQL through GMS. Can anyone suggest how to use the "MetadataAuditEvent"?
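On the MetadataAuditEvent question above: GMS emits an MAE after each successful metadata change, and downstream jobs such as the mae-consumer read it to update Elasticsearch and neo4j. One way to inspect those events yourself is a schema-registry-aware Kafka consumer; the broker, schema registry, group id, and topic name below are placeholders for this sketch.
Copy code
# Hypothetical sketch: read MetadataAuditEvent records with confluent-kafka's
# Avro consumer. Connection settings and the topic name are placeholders.
from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "schema.registry.url": "http://localhost:8081",
    "group.id": "mae-debug",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["MetadataAuditEvent"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # msg.value() holds the decoded MAE, including old and new aspect values.
        print(msg.value())
finally:
    consumer.close()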
  • a

    aloof-window-16847

    09/29/2020, 11:33 PM
Hello! I have a question about updating an existing Dataset, for example its list of owners. Is there a way to add a new Owner to the "owners" property of the Ownership aspect without re-sending the entire list of owners during ingestion?
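On the question above: in the ingestion examples shown on this page, aspects are emitted as whole values, so the usual approach is to rebuild the complete owners list and re-emit the full Ownership aspect rather than sending a delta. A minimal sketch, assuming the caller already knows the current owners:
Copy code
# Hedged sketch: re-emit the full Ownership aspect with one owner appended.
# existing_owners is assumed to come from whatever system drives the ingestion.
existing_owners = ["urn:li:corpuser:alice", "urn:li:corpuser:bob"]
new_owner = "urn:li:corpuser:carol"

ownership_aspect = {
    "owners": [
        {"owner": urn, "type": "DATAOWNER"}
        for urn in existing_owners + [new_owner]
    ],
    "lastModified": {"time": 0, "actor": "urn:li:corpuser:datahub"},
}
# ownership_aspect can then be wrapped into an MCE or a rest.li snapshot as in
# the earlier sketches on this page.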
  • m

    microscopic-receptionist-23548

    10/01/2020, 5:57 PM
    https://github.com/linkedin/datahub/tree/master/metadata-ingestion#prerequisites
  • a

    average-city-12965

    11/30/2020, 2:26 PM
    I've been trying out Datahub, and I can't quite understand some parts of the metadata change events. For example, the field version in SchemaMetadata is mandatory in the MCE Avro schema, but at the same time the GMS is also responsible for automatically incrementing the dataset version for new changes. Are these versions related? Or is the version field used for some domain-specific style of versioning that is not related to Datahub?
  • c

    clever-journalist-89046

    12/15/2020, 11:07 AM
However, bq_demo is a dataset under the project, but not a project that has been specified in bigquery_etl.py.
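A rough sketch relating to the bq_demo point above, assuming bigquery_etl.py follows the same run(URL, OPTIONS, PLATFORM) pattern as the other example scripts and uses a SQLAlchemy BigQuery dialect whose URL can name either a project or a project/dataset pair. The project id below is a placeholder.
Copy code
# Hypothetical: scope the BigQuery ETL to the bq_demo dataset rather than a
# whole project. The URL form and the run() helper are assumptions here.
from common import run  # helper used by the metadata-ingestion example scripts

URL = "bigquery://my-gcp-project/bq_demo"  # placeholder project id
OPTIONS = {}
PLATFORM = "bigquery"
run(URL, OPTIONS, PLATFORM)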
  • g

    gentle-plumber-6625

    12/16/2020, 12:28 PM
for mssql:
Copy code
cat mssql_etl.py
from common import run
# See https://github.com/m32/sqlalchemy-tds for more details
URL = 'mssql+pytds://test_user:test123@127.0.0.1:1433/testdb'
OPTIONS = {}
PLATFORM = 'mssql'
run(URL, OPTIONS, PLATFORM)
  • n

    narrow-ocean-33634

    01/11/2021, 8:02 AM
Hey everyone, I have just set up DataHub using Docker and I am getting this error:
Copy code
datahub-mae-consumer    | 2021/01/11 08:01:11 Problem with dial: dial tcp: address b-1.amazonaws.com:9092,b-2.amazonaws.com:9092,b-3.amazonaws.com:9092: too many colons in address. Sleeping 1s
My setting is:
Copy code
KAFKA_BOOTSTRAP_SERVER=b-1.amazonaws.com:9092,b-2.amazonaws.com:9092,b-3.amazonaws.com:9092
Could you help with this?
  • b

    bright-zebra-85377

    01/25/2021, 10:21 AM
Hi, can AWS S3 be supported as a dataset platform?
  • b

    bright-zebra-85377

    01/27/2021, 2:57 PM
Hi, is there a way for DataHub to ingest LDAP data from OpenLDAP?
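On the OpenLDAP question above: the metadata-ingestion folder of this era also shipped an ldap-etl example worth checking, but as a very rough sketch, user entries can be pulled with the ldap3 library and shaped into CorpUser-style records before wrapping them into MCEs. The LDAP host, bind credentials, search base, and the record field names below are all assumptions for illustration.
Copy code
# Hypothetical sketch: read people from OpenLDAP with ldap3 and build simple
# CorpUser-style records. All connection details and field names are placeholders.
from ldap3 import ALL, Connection, Server

server = Server("ldap://openldap.example.com:389", get_info=ALL)
conn = Connection(server, "cn=admin,dc=example,dc=org", "admin-password", auto_bind=True)
conn.search(
    "ou=people,dc=example,dc=org",
    "(objectClass=inetOrgPerson)",
    attributes=["uid", "cn", "mail"],
)

corp_users = []
for entry in conn.entries:
    corp_users.append({
        "urn": f"urn:li:corpuser:{entry.uid}",
        "info": {"fullName": str(entry.cn), "email": str(entry.mail), "active": True},
    })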
  • c

    curved-magazine-23582

    01/29/2021, 3:52 PM
Hello, I am completely new to DataHub. I am trying to run the MSSQL ingestion ETL with
URL = 'mssql+pytds://user:password@sqlhost/database:1433'
but I am getting the error below. Is my MSSQL URL in a bad format? Any help is appreciated.
    Copy code
    sqlalchemy.exc.ArgumentError: Could not parse rfc1738 URL from string ''
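On the URL format question above: in SQLAlchemy's rfc1738 form the port belongs to the host part, not the database name, so the sketch below moves :1433 in front of /database, matching the mssql_etl.py example quoted earlier on this page. Credentials and host are placeholders, and this does not by itself explain why the error shows an empty string.
Copy code
# Hedged sketch of the corrected MSSQL URL, following the earlier mssql_etl.py
# example. user/password/sqlhost/database are placeholders.
from common import run  # helper used by the metadata-ingestion example scripts

URL = "mssql+pytds://user:password@sqlhost:1433/database"
OPTIONS = {}
PLATFORM = "mssql"
run(URL, OPTIONS, PLATFORM)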
  • c

    curved-magazine-23582

    02/04/2021, 4:23 AM
Hello, newbie question again. I see there is a GMA API for adding/updating datasets, users, etc. I assume this is equivalent to sending the corresponding MCE data to the Kafka topic, i.e. Elasticsearch and neo4j will eventually be populated with either of the ingestion methods?
  • l

    lemon-analyst-37781

    02/09/2021, 1:57 AM
    I ran
    pip3 install --user -r common.txt -r snowflake.txt --no-cache-dir
    before this
  • i

    incalculable-ocean-74010

    02/15/2021, 4:04 PM
    Hello, I've followed the entity onboarding process. When trying to search something in the UI I get the following stack trace which seems unrelated:
    Copy code
    datahub-gms                    | 16:01:54.314 [qtp626202354-22] INFO  c.l.parseq.TaskDescriptorFactory - No provider found for TaskDescriptor, falling back to DefaultTaskDescriptor
    datahub-frontend               | 16:02:00 [application-akka.actor.default-dispatcher-33] ERROR application - Fail to get data platforms
    datahub-frontend               | java.lang.NullPointerException: null
    datahub-frontend               | 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    datahub-frontend               | 	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    datahub-frontend               | 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    datahub-frontend               | 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    datahub-frontend               | 	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    datahub-frontend               | 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    datahub-frontend               | 	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    datahub-frontend               | 	at com.linkedin.datahub.dao.table.DataPlatformsDao.getAllPlatforms(DataPlatformsDao.java:23)
    datahub-frontend               | 	at controllers.api.v2.Dataset.getDataPlatforms(Dataset.java:280)
    datahub-frontend               | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$29$$anonfun$apply$29.apply(Routes.scala:916)
    datahub-frontend               | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$29$$anonfun$apply$29.apply(Routes.scala:916)
    datahub-frontend               | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:134)
    datahub-frontend               | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:133)
    Does this ring a bell to anyone?