# ingestion
  • s

    steep-airplane-62865

    03/10/2020, 6:24 PM
@crooked-vegetable-89645 Do you observe any errors in the datahub-mce-consumer container logs? If not, could you check the DB to see whether the record for the schema aspect is already there?
  • b

    bumpy-keyboard-50565

    03/12/2020, 2:36 PM
@most-tent-28381
1. Can we ingest our data as well, or does DataHub only support metadata ingestion? Are you asking "can I call the rest.li API to ingest data directly instead of using a Kafka event"? If so, the answer is yes.
2. How is the metadata updated in DataHub? Is it real-time? It's real-time if you're updating via the rest.li API. It's near real-time in the case of Kafka events, due to the async nature of Kafka processing.
3. How is the lineage generated? Do we have to manually specify parent and child relationships? Yes. We have internal integration with systems like Gobblin which automatically emit events for lineage. However, if such integration isn't available you'll need to do your own processing to derive the lineage.
4. Is there any future plan to include statistics about the data present in the databases? Yes. We're actively working on that internally and will open source it once it's stabilized.
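Relating to answer 1 above, here is a minimal sketch of ingesting a dataset snapshot via the GMS rest.li API instead of producing a Kafka MCE. The endpoint path, the X-RestLi-Protocol-Version header, the payload shape, and the example urn/owner values are assumptions for illustration, not a verified contract.
Copy code
# Hypothetical sketch: ingest a dataset snapshot through the GMS rest.li API
# instead of emitting a Kafka MetadataChangeEvent. Endpoint, header, and payload
# shape are assumptions for this illustration.
import json
import requests

GMS_URL = "http://localhost:8080"  # assumed local GMS address

snapshot = {
    "snapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,example_db.example_table,PROD)",
        "aspects": [
            {
                "com.linkedin.common.Ownership": {
                    "owners": [{"owner": "urn:li:corpuser:datahub", "type": "DATAOWNER"}],
                    "lastModified": {"time": 0, "actor": "urn:li:corpuser:datahub"},
                }
            }
        ],
    }
}

response = requests.post(
    f"{GMS_URL}/datasets?action=ingest",
    headers={"X-RestLi-Protocol-Version": "2.0.0"},
    data=json.dumps(snapshot),
)
response.raise_for_status()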
  • b

    bumpy-keyboard-50565

    03/16/2020, 1:27 PM
@eager-wall-19224 it's a bit difficult to be a human Avro parser 🙂 Would you mind trying to build a simpler message first (e.g. with only one aspect)? You can use this script to quickly debug your message: https://github.com/linkedin/datahub/tree/master/metadata-ingestion/mce-cli
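For reference, a sketch of what a "simpler message with only one aspect" could look like for the mce-cli script linked above. The union-as-tuple encoding and the fully qualified pegasus2avro record names mirror the example ETL snippets quoted elsewhere on this page; treat the exact names and fields as assumptions.
Copy code
# Hypothetical minimal MCE with a single Ownership aspect. Unions are encoded as
# (fully-qualified-name, value) tuples, as in the example ingestion scripts.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,example_db.example_table,PROD)"

ownership = (
    "com.linkedin.pegasus2avro.common.Ownership",  # record name assumed for this sketch
    {
        "owners": [{"owner": "urn:li:corpuser:datahub", "type": "DATAOWNER"}],
        "lastModified": {"time": 0, "actor": "urn:li:corpuser:datahub"},
    },
)

mce = {
    "auditHeader": None,
    "proposedSnapshot": (
        "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot",
        {"urn": dataset_urn, "aspects": [ownership]},
    ),
    "proposedDelta": None,
}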
  • e

    eager-wall-19224

    03/17/2020, 2:28 AM
@bumpy-keyboard-50565 thanks, I changed the description. I'll try to use mce_cli.py. Actually, I used some other Avro schema validator and it told me something useful, too.
  • a

    ancient-animal-17306

    03/24/2020, 9:03 AM
Are the ETL scripts enough for data ingestion? Is that how LinkedIn uses it internally? Or is there some heavyweight framework like Gobblin that can also proactively scan the datasets and push the metadata to DataHub?
  • m

    most-tent-28381

    03/25/2020, 6:14 AM
Do we have to specify the schema before we can actually ingest the data into DataHub? I am using the rdbms_etl.py script and am getting a serialization error when trying to publish data from a Postgres database.
  • a

    agreeable-boots-73250

    03/26/2020, 2:39 PM
    able to resolve this
  • a

    agreeable-boots-73250

    03/27/2020, 10:09 AM
@bumpy-keyboard-50565 I have started the DataHub services and they look fine.
  • h

    handsome-grass-34789

    04/02/2020, 6:28 AM
Hi, I am getting the following error when I try to push my own data following the instructions from the metadata-ingestion README. Is there any other step I'm missing apart from the ones given?
    error.txt
  • f

    fast-exabyte-18411

    04/16/2020, 9:16 PM
Hi all! I'm working on implementing Redshift ingestion and running into some problems getting the schema to appear in the UI. It seems like this is connected to the platform_schema part of the MCE: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/mysql-etl/mysql_etl.py#L30 Is there anywhere I can go for logs associated with this? I don't see anything popping up in the docker-compose logs and the entities are being registered, so from my perspective this is a silent failure at the moment.
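Since the message above points at the platform_schema piece of the MCE, here is a small illustration of that field in isolation. The MySqlDDL record name and its tableSchema field are taken from the linked example script; whether Redshift should use a different DDL record type is not established here, and the DDL string is a placeholder.
Copy code
# Hypothetical platformSchema value inside a SchemaMetadata aspect, using the
# same union-as-tuple encoding as the example ETL scripts. Names are assumptions.
platform_schema = (
    "com.linkedin.pegasus2avro.schema.MySqlDDL",
    {"tableSchema": "CREATE TABLE example_table (id INT, name VARCHAR(255))"},
)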
  • a

    agreeable-boots-73250

    04/24/2020, 4:06 PM
Hi, if anyone has tried connecting to Azure SQL Server please let me know; I am getting the following error.
  • a

    agreeable-boots-73250

    05/04/2020, 1:39 PM
<!channel> I am trying to develop the MSSQL pipeline for DataHub and I changed some things. Code is attached for reference, but I am getting the following error.
  • a

    agreeable-boots-73250

    05/04/2020, 1:41 PM
MSSQL (Microsoft SQL Server) pipeline for ingestion, referring to the MySQL ingestion Python script.
  • a

    agreeable-boots-73250

    05/06/2020, 3:32 PM
<!here> I have tweaked mysql-pipeline.py to ingest the metadata for MSSQL Server, but it does not insert the values, as shown in the screenshot.
    mssql_pipeline.py
  • m

    miniature-ability-75189

    07/01/2020, 8:46 AM
I have the same question as @strong-analyst-47204: columns don't appear after MySQL metadata was ingested, although they do appear in the produced message.
  • b

    billowy-eye-48149

    07/06/2020, 2:28 PM
<!here> I am trying to ingest data into DataHub with a special character in the dataset urn, like urn:li:dataPlatform:abc/xyz. The schema platform is also given the same name. The dataset is created after ingestion, but when I go inside the dataset I get the error below. Could someone help me understand the reason for the error?
  • b

    billowy-eye-48149

    07/27/2020, 9:15 PM
Hi team, I am trying to ingest the tags field for a dataset. The tags are not reflected in the frontend after the ingestion. Am I missing something here? ("com.linkedin.pegasus2avro.dataset.DatasetProperties", {"description":"Test table for Kafka","tags":["Label1","Label2"]})
  • s

    silly-cat-58453

    08/01/2020, 1:56 PM
Hi all, I am not using Docker (docker commands) to ingest data, so I have produced the data to the "MetadataChangeEvent" topic, which is persisted to MySQL through GMS. Can anyone suggest how to use the "MetadataAuditEvent"?
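On the MetadataAuditEvent question above: GMS emits an MAE after each successful metadata change, and downstream jobs such as the mae-consumer read it to update Elasticsearch and neo4j. One way to inspect those events yourself is a schema-registry-aware Kafka consumer; the broker, schema registry, group id, and topic name below are placeholders for this sketch.
Copy code
# Hypothetical sketch: read MetadataAuditEvent records with confluent-kafka's
# Avro consumer. Connection settings and the topic name are placeholders.
from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "schema.registry.url": "http://localhost:8081",
    "group.id": "mae-debug",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["MetadataAuditEvent"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # msg.value() holds the decoded MAE, including old and new aspect values.
        print(msg.value())
finally:
    consumer.close()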
  • a

    aloof-window-16847

    09/29/2020, 11:33 PM
Hello! I have a question about updating an existing Dataset, for example its list of owners. Is there a way to add a new Owner to the "owners" property of the Ownership aspect without re-sending the entire list of owners during ingestion?
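On the question above: in the ingestion examples shown on this page, aspects are emitted as whole values, so the usual approach is to rebuild the complete owners list and re-emit the full Ownership aspect rather than sending a delta. A minimal sketch, assuming the caller already knows the current owners:
Copy code
# Hedged sketch: re-emit the full Ownership aspect with one owner appended.
# existing_owners is assumed to come from whatever system drives the ingestion.
existing_owners = ["urn:li:corpuser:alice", "urn:li:corpuser:bob"]
new_owner = "urn:li:corpuser:carol"

ownership_aspect = {
    "owners": [
        {"owner": urn, "type": "DATAOWNER"}
        for urn in existing_owners + [new_owner]
    ],
    "lastModified": {"time": 0, "actor": "urn:li:corpuser:datahub"},
}
# ownership_aspect can then be wrapped into an MCE or a rest.li snapshot as in
# the earlier sketches on this page.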
  • m

    microscopic-receptionist-23548

    10/01/2020, 5:57 PM
    https://github.com/linkedin/datahub/tree/master/metadata-ingestion#prerequisites
  • a

    average-city-12965

    11/30/2020, 2:26 PM
    I've been trying out Datahub, and I can't quite understand some parts of the metadata change events. For example, the field version in SchemaMetadata is mandatory in the MCE Avro schema, but at the same time the GMS is also responsible for automatically incrementing the dataset version for new changes. Are these versions related? Or is the version field used for some domain-specific style of versioning that is not related to Datahub?
  • c

    clever-journalist-89046

    12/15/2020, 11:07 AM
However, bq_demo is a dataset under the project, but not a project that has been specified in bigquery_etl.py.
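A rough sketch relating to the bq_demo point above, assuming bigquery_etl.py follows the same run(URL, OPTIONS, PLATFORM) pattern as the other example scripts and uses a SQLAlchemy BigQuery dialect whose URL can name either a project or a project/dataset pair. The project id below is a placeholder.
Copy code
# Hypothetical: scope the BigQuery ETL to the bq_demo dataset rather than a
# whole project. The URL form and the run() helper are assumptions here.
from common import run  # helper used by the metadata-ingestion example scripts

URL = "bigquery://my-gcp-project/bq_demo"  # placeholder project id
OPTIONS = {}
PLATFORM = "bigquery"
run(URL, OPTIONS, PLATFORM)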
  • g

    gentle-plumber-6625

    12/16/2020, 12:28 PM
for mssql:
Copy code
cat mssql_etl.py
from common import run
# See https://github.com/m32/sqlalchemy-tds for more details
URL = 'mssql+pytds://test_user:test123@127.0.0.1:1433/testdb'
OPTIONS = {}
PLATFORM = 'mssql'
run(URL, OPTIONS, PLATFORM)
  • n

    narrow-ocean-33634

    01/11/2021, 8:02 AM
Hey everyone, I have just set up DataHub using Docker and I am getting this error:
Copy code
datahub-mae-consumer    | 2021/01/11 08:01:11 Problem with dial: dial tcp: address b-1.amazonaws.com:9092,b-2.amazonaws.com:9092,b-3.amazonaws.com:9092: too many colons in address. Sleeping 1s
My setting is:
Copy code
KAFKA_BOOTSTRAP_SERVER=b-1.amazonaws.com:9092,b-2.amazonaws.com:9092,b-3.amazonaws.com:9092
Could you help with this?
  • b

    bright-zebra-85377

    01/25/2021, 10:21 AM
Hi, can AWS S3 be supported as a dataset platform?
  • b

    bright-zebra-85377

    01/27/2021, 2:57 PM
Hi, is there a way for DataHub to ingest LDAP data from OpenLDAP?
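On the OpenLDAP question above: the metadata-ingestion folder of this era also shipped an ldap-etl example worth checking, but as a very rough sketch, user entries can be pulled with the ldap3 library and shaped into CorpUser-style records before wrapping them into MCEs. The LDAP host, bind credentials, search base, and the record field names below are all assumptions for illustration.
Copy code
# Hypothetical sketch: read people from OpenLDAP with ldap3 and build simple
# CorpUser-style records. All connection details and field names are placeholders.
from ldap3 import ALL, Connection, Server

server = Server("ldap://openldap.example.com:389", get_info=ALL)
conn = Connection(server, "cn=admin,dc=example,dc=org", "admin-password", auto_bind=True)
conn.search(
    "ou=people,dc=example,dc=org",
    "(objectClass=inetOrgPerson)",
    attributes=["uid", "cn", "mail"],
)

corp_users = []
for entry in conn.entries:
    corp_users.append({
        "urn": f"urn:li:corpuser:{entry.uid}",
        "info": {"fullName": str(entry.cn), "email": str(entry.mail), "active": True},
    })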
  • c

    curved-magazine-23582

    01/29/2021, 3:52 PM
Hello, I am completely new to DataHub. I am trying to run the MSSQL ingestion ETL with
URL = 'mssql+pytds://user:password@sqlhost/database:1433'
but I am getting the error below. Is my MSSQL URL in a bad format? Any help is appreciated.
    Copy code
    sqlalchemy.exc.ArgumentError: Could not parse rfc1738 URL from string ''
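On the URL format question above: in SQLAlchemy's rfc1738 form the port belongs to the host part, not the database name, so the sketch below moves :1433 in front of /database, matching the mssql_etl.py example quoted earlier on this page. Credentials and host are placeholders, and this does not by itself explain why the error shows an empty string.
Copy code
# Hedged sketch of the corrected MSSQL URL, following the earlier mssql_etl.py
# example. user/password/sqlhost/database are placeholders.
from common import run  # helper used by the metadata-ingestion example scripts

URL = "mssql+pytds://user:password@sqlhost:1433/database"
OPTIONS = {}
PLATFORM = "mssql"
run(URL, OPTIONS, PLATFORM)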
  • c

    curved-magazine-23582

    02/04/2021, 4:23 AM
Hello, newbie question again. I see there is a GMA API for adding/updating datasets, users, etc. I assume this is equivalent to sending the corresponding MCE data to the Kafka topic, i.e. Elasticsearch and neo4j will eventually be populated with either of the ingestion methods?
  • l

    lemon-analyst-37781

    02/09/2021, 1:57 AM
    I ran
    pip3 install --user -r common.txt -r snowflake.txt --no-cache-dir
    before this
  • i

    incalculable-ocean-74010

    02/15/2021, 4:04 PM
    Hello, I've followed the entity onboarding process. When trying to search something in the UI I get the following stack trace which seems unrelated:
    Copy code
    datahub-gms                    | 16:01:54.314 [qtp626202354-22] INFO  c.l.parseq.TaskDescriptorFactory - No provider found for TaskDescriptor, falling back to DefaultTaskDescriptor
    datahub-frontend               | 16:02:00 [application-akka.actor.default-dispatcher-33] ERROR application - Fail to get data platforms
    datahub-frontend               | java.lang.NullPointerException: null
    datahub-frontend               | 	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    datahub-frontend               | 	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    datahub-frontend               | 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    datahub-frontend               | 	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    datahub-frontend               | 	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    datahub-frontend               | 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    datahub-frontend               | 	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    datahub-frontend               | 	at com.linkedin.datahub.dao.table.DataPlatformsDao.getAllPlatforms(DataPlatformsDao.java:23)
    datahub-frontend               | 	at controllers.api.v2.Dataset.getDataPlatforms(Dataset.java:280)
    datahub-frontend               | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$29$$anonfun$apply$29.apply(Routes.scala:916)
    datahub-frontend               | 	at router.Routes$$anonfun$routes$1$$anonfun$applyOrElse$29$$anonfun$apply$29.apply(Routes.scala:916)
    datahub-frontend               | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:134)
    datahub-frontend               | 	at play.core.routing.HandlerInvokerFactory$$anon$3.resultCall(HandlerInvoker.scala:133)
    Does this ring a bell to anyone?