# ingestion

    white-beach-27328

    06/10/2021, 7:03 PM
    I'm noticing some odd behavior. I ingested some data via the hive ingestion recipe (using acryl-datahub[hive, datahub-kafka]==0.8.1.1), which created a DatasetSnapshot through the MCE consumer. I can retrieve it from GMS with a request to the /datasets?action=getSnapshot endpoint using the urn I see in the Kafka message. However, when I look in the DataHub frontend, I can't find the dataset anywhere, and it doesn't come back when I search for the dataset's name. I'm confused as to what the problem could be. Any ideas?
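    A symptom like this usually means the snapshot reached the primary store but was never indexed into Elasticsearch, which is what the frontend's search runs against; the datahub-mae-consumer is responsible for that indexing. A minimal check against the search index, assuming a local GMS on port 8080, the v0.8 /entities?action=search action, and a hypothetical table name:

    import requests

    # If the dataset is retrievable by URN but absent from these results, the
    # search index was never updated; check the datahub-mae-consumer logs.
    resp = requests.post(
        "http://localhost:8080/entities?action=search",
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
        json={"input": "my_hive_table", "entity": "dataset", "start": 0, "count": 10},
    )
    resp.raise_for_status()
    print(resp.json())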

    steep-pizza-15641

    06/10/2021, 7:14 PM
    Apologies for what might be a philosophical question. Let's say we use Airflow to perform ML operations. A number of those operations are a good fit for DataHub, for example any data engineering that creates tabular data as input to model training, or any tabular data output as part of model inference. DataHub gives me a great way of visualizing the pipeline and data. There are aspects of the pipeline that DataHub does not let me visualize, however:
    1. Let's say I download zipped data from an FTP site using Airflow. We do not seem to have an emitter for FTP sites with raw files, or for raw files sitting in S3.
    2. Once my data is engineered, I might use it to train an ML model. It would be good to be able to visualize the output ML model in my dependencies as well.
    Thoughts?
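    On the first point, there is no dedicated connector for loose files, but nothing prevents emitting them as ordinary datasets so they show up as upstream nodes in lineage; ML models likewise have first-class MLModel entities in the metadata model. A minimal sketch using the Python REST emitter, assuming a local GMS and with all names hypothetical:

    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetPropertiesClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
    )

    # Model a raw file landed in S3 as an ordinary dataset so it can appear
    # as an upstream node in lineage.
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn="urn:li:dataset:(urn:li:dataPlatform:s3,raw/ftp_drop/shipments.zip,PROD)",
            aspects=[
                DatasetPropertiesClass(
                    description="Zipped extract downloaded from the vendor FTP site"
                )
            ],
        )
    )

    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)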

    curved-magazine-23582

    06/14/2021, 2:20 AM
    Hello team. I am trying to ingest a dataJob through the GMS API, but I'm getting a validation error complaining about dataJobInfo.type. I tried setting that field to values like "SQL" or "HIVE", as indicated by the sample dataJob data at the link below, but it still fails:
    https://github.com/linkedin/datahub/blob/0b75b4a96a801a91fe434c87f6c737d24d63eb14/metadata-ingestion/examples/mce_files/bootstrap_mce.json#L996
    Validation errors from the GMS API:
    [HTTP Status:400]: Parameters of method 'ingest' failed validation with error 'ERROR :: /snapshot/aspects/1/com.linkedin.datajob.DataJobInfo/type :: union type is not backed by a DataMap or null'
        at com.linkedin.restli.server.RestLiServiceException.fromThrowable(RestLiServiceException.java:315)
        at com.linkedin.restli.server.BaseRestLiServer.buildPreRoutingError(BaseRestLiServer.java:158)
        at com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:198)

    curved-magazine-23582

    06/14/2021, 2:23 AM
    Example JSON passed to the GMS API:
    {
      "snapshot": {
        "urn": "urn:li:dataJob:(urn:li:dataFlow:(glue,logistics-load,PROD),logistics-load)",
        "aspects": [
          {
            "com.linkedin.common.Ownership": {
              "owners": [
                {
                  "owner": "urn:li:corpuser:dataservices",
                  "type": "DATAOWNER"
                }
              ],
              "lastModified": {
                "time": 1581407189000,
                "actor": "urn:li:corpuser:dataservices"
              }
            }
          },
          {
            "com.linkedin.datajob.DataJobInfo": {
              "name": "logistics-load",
              "description": "Tranform and load logistics data into Redshift",
              "type": "SQL"
            }
          },
          {
            "com.linkedin.datajob.DataJobInputOutput": {
              "inputDatasets": [
                "urn:li:dataset:(urn:li:dataPlatform:s3,logistics_raw.shipment,PROD)"
              ],
              "outputDatasets": [
                "urn:li:dataset:(urn:li:dataPlatform:redshift,redshift_edw_production.edw_logistics_box,PROD)"
              ]
            }
          }
        ]
      }
    }
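    The validation error is about rest.li's JSON encoding of unions: the Kafka/Avro path behind bootstrap_mce.json accepts the bare string, but when posting straight to GMS the union value must be wrapped in a single-key map naming the member. A hedged sketch of the likely fix, assuming DataJobInfo.type is backed by the AzkabanJobType enum as in the linked schema:

    # The bare "SQL" string is wrapped with the fully qualified union member
    # name; the member name below is an assumption based on the linked schema.
    data_job_info = {
        "com.linkedin.datajob.DataJobInfo": {
            "name": "logistics-load",
            "description": "Transform and load logistics data into Redshift",
            "type": {"com.linkedin.datajob.azkaban.AzkabanJobType": "SQL"},
        }
    }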

    brief-lizard-77958

    06/14/2021, 12:23 PM
    Is it possible to reverse an ingestion once it's been successfully ingested? I've had a whole section of the application stop working after ingesting an improperly formatted JSON, and I had to nuke everything before getting it to work again. Are rollbacks of any sort possible?
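    For later readers: newer releases of the acryl-datahub CLI track each ingestion run and can undo one; datahub ingest list-runs shows recent run ids, and datahub ingest rollback --run-id <run_id> removes what that run wrote. Whether these commands are available depends on your CLI and server versions.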

    gifted-student-48095

    06/15/2021, 9:28 AM
    Does DataHub allow ingestion of metadata from Salesforce, Power BI, and Talend (ETL)?

    handsome-airplane-62628

    06/15/2021, 2:41 PM
    Hello! We are working on a way of backing up the data that we've manually entered via the DataHub UI (particularly tags and column descriptions that are not present in our data warehouse). To date we've been taking backups of the MySQL db and restoring a backup whenever we need to migrate DataHub (e.g. from local to cloud). However, when restoring with this method, Elasticsearch needs to be re-indexed: the tags/descriptions show up, but we can no longer search for a particular tag.
    1. Is there a way to re-index Elasticsearch?
    2. Is there a different way to back up this metadata that was entered manually through the UI (rather than ingested), so that we don't lose it if our DataHub instance ever goes down?
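    A hedged note: the datahub-upgrade container ships a RestoreIndices job that replays the rows in the relational store back into the search and graph indexes, which is the usual answer for re-indexing after a database-level restore; the exact invocation depends on your deployment.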

    wonderful-quill-11255

    06/16/2021, 6:34 AM
    Hello. We are planning to evaluate DataHub entities (mainly datasets) against company compliance rules, with the results reported back to the dataset owners. When I think about it, that seems very similar to the ingestion pipeline but reversed: the source would be DataHub, the transformers would be the compliance rules, and the output a database, console, API, etc. Has anyone else done something like this already? Would it make sense to file a feature request?
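    In the absence of a built-in "DataHub as a source" connector, a reversed pipeline can be approximated with plain REST calls against GMS. A rough sketch, assuming the v0.8 /entities endpoints and a purely hypothetical "must have an owner" rule (response shapes may differ across versions):

    import urllib.parse

    import requests

    GMS = "http://localhost:8080"
    HEADERS = {"X-RestLi-Protocol-Version": "2.0.0"}

    # Page through datasets via search, then apply a compliance rule to each
    # entity's snapshot.
    search = requests.post(
        f"{GMS}/entities?action=search",
        headers=HEADERS,
        json={"input": "*", "entity": "dataset", "start": 0, "count": 100},
    ).json()

    for hit in search["value"]["entities"]:
        urn = hit["entity"]
        entity = requests.get(
            f"{GMS}/entities/{urllib.parse.quote(urn, safe='')}", headers=HEADERS
        ).json()
        aspects = entity["value"]["com.linkedin.metadata.snapshot.DatasetSnapshot"]["aspects"]
        if not any("com.linkedin.common.Ownership" in aspect for aspect in aspects):
            print(f"non-compliant (no owner): {urn}")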

    average-autumn-35845

    06/16/2021, 12:41 PM
    Hello! I have a question about data lineage: besides visualizing data lineage in the UI, is there any way to use that lineage for detecting job failures and preventing data from flowing from an impaired upstream source?
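    Not out of the box, but the lineage graph is queryable over REST, so a job could look up its upstreams and run its own health checks before executing. A rough sketch against the GMS relationships endpoint; the parameter syntax has varied across versions, and the URN and health check are hypothetical:

    import requests

    # Follow the DownstreamOf edges outward from this dataset to find its
    # upstreams; what freshness/health check to apply is left to the caller.
    urn = "urn:li:dataset:(urn:li:dataPlatform:redshift,edw.orders,PROD)"
    resp = requests.get(
        "http://localhost:8080/relationships",
        params={"direction": "OUTGOING", "urn": urn, "types": "DownstreamOf"},
    )
    print(resp.json())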

    faint-hair-91313

    06/16/2021, 1:51 PM
    Hi, are there any plans to ingest views with the Oracle connector? At this stage it's only picking up tables.

    straight-noon-75819

    06/16/2021, 4:37 PM
    Hello team, I would like to ask whether DataHub currently supports lineage for Airflow's DockerOperator. I've been looking for an example but couldn't find any.

    straight-noon-75819

    06/16/2021, 6:09 PM
    @gray-shoe-75895, yeah, I confirmed DockerOperator works; it just needs inlets/outlets as you said. What happened to me is a bit weird though: I had to run nuke.sh to clear everything and then run quickstart.sh again. Has anyone seen weird behavior like this?
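    For reference, a rough sketch of the manual-annotation approach, assuming the DataHub lineage backend is configured in airflow.cfg; the import path of Dataset has moved between releases, and all task, image, and dataset names below are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.docker.operators.docker import DockerOperator
    from datahub_provider.entities import Dataset  # import path varies by release

    with DAG(
        "docker_lineage_demo",
        start_date=datetime(2021, 6, 1),
        schedule_interval=None,
    ) as dag:
        # DockerOperator is opaque to automatic lineage extraction, so its
        # inputs/outputs are declared by hand for the lineage backend to pick up.
        transform = DockerOperator(
            task_id="transform_shipments",
            image="mycompany/transform:latest",
            inlets=[Dataset("s3", "logistics_raw.shipment")],
            outlets=[Dataset("redshift", "edw.logistics_box")],
        )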

    millions-jelly-76272

    06/17/2021, 8:32 AM
    Hi, how's it going? I was scouring your documentation for an example of ingesting Airflow DAG metadata (not lineage), but was unsuccessful (there's a high chance I overlooked something). Inspired by your demo (https://demo.datahubproject.io/browse/pipelines/airflow/prod), I would love to know how to see Airflow DAG and task metadata in DataHub. Any guidance would be appreciated. Thank you in advance.

    gifted-bird-57147

    06/17/2021, 11:32 AM
    In the table_pattern or schema_pattern variables in the ingestion recipes, is it possible to use regular expressions? There's a bunch of 'system' tables in my schema that I want to exclude from ingestion. Their naming convention is i<sequencenr>, so it would be easiest to exclude them with a regex pattern...
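    The allow/deny entries in those patterns are ordinary regexes. A minimal sketch of a programmatic recipe, assuming a Hive source (connection details hypothetical) and that table_pattern is matched against the fully qualified database.table name:

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "localhost:10000",
                    # Drop the i<sequencenr> system tables with one deny rule.
                    "table_pattern": {"deny": [".*\\.i[0-9]+$"]},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()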

    cuddly-lunch-28022

    06/17/2021, 12:39 PM
    Hello! I have the test JSON /opt/datahub/metadata-ingestion/examples/mce_files/bootstrap_mce.json with 1000 rows. How can I use a huge file like this? I would like to add a table.
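    For what it's worth, a file like that can be replayed through the regular ingestion framework with the file source, which streams each MCE to the sink instead of loading everything by hand. A minimal sketch:

    from datahub.ingestion.run.pipeline import Pipeline

    # The "file" source reads a JSON array of MCEs (like bootstrap_mce.json)
    # and emits each event to the configured sink.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "file",
                "config": {
                    "filename": "/opt/datahub/metadata-ingestion/examples/mce_files/bootstrap_mce.json"
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()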

    miniature-airport-96424

    06/17/2021, 1:31 PM
    From the log I see:
    2021-06-17 13:31:21.912:WARN:oejs.HttpChannel:qtp544724190-9: /health
    datahub-datahub-gms-748884b4db-69cg2 datahub-gms 2021-06-17T13:31:21.918950727Z javax.servlet.ServletException: javax.servlet.UnavailableException: Servlet Not Initialized
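    A hedged reading: an UnavailableException: Servlet Not Initialized on /health usually just means the GMS servlet hasn't finished starting (or failed to start), so the earlier GMS startup logs are the place to look for the root cause.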

    glamorous-kite-95510

    06/18/2021, 2:24 AM
    Why did I get this error? I had started DataHub successfully and could access and browse the UI, but I got this error when I tried to ingest metadata. I've never gotten this error before.

    cuddly-lunch-28022

    06/18/2021, 7:31 AM
    Hello! To delete a dataset I tried "aspects": [ { "com.linkedin.pegasus2avro.common.Status": { "removed": null } } ]. Please help me delete a schema.
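    A hedged note: the Status aspect's removed flag has to be set to true for a soft delete; null leaves the dataset visible. A minimal sketch via the Python emitter, with a hypothetical URN:

    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        StatusClass,
    )

    # Soft delete: removed=True hides the dataset from search and browse.
    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn="urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
            aspects=[StatusClass(removed=True)],
        )
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)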

    icy-holiday-55016

    06/18/2021, 2:09 PM
    Hi folks, when do you trigger an ingestion from a data source? For example, if I want to get the metadata about a Kafka broker (https://datahubproject.io/docs/metadata-ingestion/#kafka-metadata-kafka), I can run the 'datahub' CLI command manually and get that data. Is it expected that this would be set up to run on a schedule? Or perhaps it could be triggered as part of a DAG?
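    Both patterns are common: running the CLI on a schedule, or embedding the pipeline in an orchestrator. A minimal sketch of the DAG-driven variant, assuming Airflow 2 and a hypothetical broker address:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from datahub.ingestion.run.pipeline import Pipeline

    def ingest_kafka_metadata():
        # The same work the datahub CLI does, run as an in-process pipeline.
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "kafka",
                    "config": {"connection": {"bootstrap": "broker:9092"}},
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},
                },
            }
        )
        pipeline.run()

    with DAG(
        "kafka_metadata_ingestion",
        start_date=datetime(2021, 6, 1),
        schedule_interval="@daily",
    ) as dag:
        PythonOperator(task_id="ingest", python_callable=ingest_kafka_metadata)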

    better-orange-49102

    06/19/2021, 5:13 PM
    Was wondering if anyone could shed some light on this. I'm trying to programmatically create datasets using the example in /datahub/metadata-ingestion/examples/library/lineage_emitter_rest.py instead of the CLI method. I could create institutional memory and dataset properties for a new dataset by modifying the lineage example, and the dataset could be seen in the browser, no different from the quickstart datasets. But there seems to be an extra step that I missed when creating a SchemaMetadataClass: the REST API accepted my request, but I can't browse the newly created dataset. The error message in the browser reads:
    Exception while fetching data (/browse) : java.lang.RuntimeException: Failed to execute browse: entity type DATASET, path [prod, goonrtpe], filters: null, start: 0, count: 10
    The mae-consumer logs say:
    datahub-mae-consumer      | org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'public void com.linkedin.metadata.kafka.DataHubUsageEventsProcessor.consume(org.apache.kafka.clients.consumer.ConsumerRecord<java.lang.String, java.lang.String>)' threw exception; nested exception is java.lang.ClassCastException: com.linkedin.metadata.key.CorpUserKey cannot be cast to com.linkedin.identity.CorpUserInfo; nested exception is java.lang.ClassCastException: com.linkedin.metadata.key.CorpUserKey cannot be cast to com.linkedin.identity.CorpUserInfo
    I've compared the data stored in MySQL for the programmatically created datasets and the pipeline-created ones and don't see a difference. I'm using a slightly older version of DataHub, v0.8.1.

    glamorous-kite-95510

    06/20/2021, 6:46 AM
    Hi, can I ask where my metadata is stored on my local machine? How can I back up my metadata in case something bad happens? If it's okay, can you give me the directory? Once I start Docker everything runs perfectly, but I have no clue where the data lives.
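    A hedged note for the quickstart setup: the source of truth lives in the MySQL container's named Docker volume (with rebuildable search and graph indexes in Elasticsearch and Neo4j), so docker volume ls will show what to back up; dumping the MySQL database is the usual backup route.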

    glamorous-kite-95510

    06/21/2021, 2:05 AM
    Hi, can I ask what the default mechanism for metadata ingestion is, Kafka or REST?
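    For what it's worth, the ingestion framework itself is agnostic: the recipe's sink decides. datahub-rest posts synchronously to GMS (so failures surface immediately), while datahub-kafka emits events for the MCE consumer to apply asynchronously.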

    cuddly-lunch-28022

    06/21/2021, 11:30 AM
    Hello! Could you please help me figure out numValuesFieldName (https://github.com/linkedin/datahub/search?q=numValuesFieldName)? Basically, I would like to create a DataProcess with these inputs and outputs: https://github.com/linkedin/datahub/blob/97e966003710aba18f7a2ecf5af0686504359da5/[…]s/src/main/pegasus/com/linkedin/dataprocess/DataProcessInfo.pdl

    adorable-hairdresser-61775

    06/21/2021, 8:34 PM
    Hello everyone. Is there any way to add authentication to the GMA REST API?

    brief-lizard-77958

    06/22/2021, 7:03 AM
    [Fixed] Hello. When ingesting data for charts, I'm trying to define multiple data sources but only one gets ingested. Is there a way I can fix this? The picture shows the JSON I'm ingesting and the final result (it only ingests the source I defined last).
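    The "only the last source survives" symptom matches a duplicated JSON key: standard JSON parsing keeps only the last occurrence, and the ChartInfo aspect expects a single inputs array anyway. A small illustration, with hypothetical URNs:

    import json

    # Duplicate keys are silently collapsed by json.loads ...
    doc = (
        '{"inputs": ["urn:li:dataset:(urn:li:dataPlatform:hive,a,PROD)"],'
        ' "inputs": ["urn:li:dataset:(urn:li:dataPlatform:hive,b,PROD)"]}'
    )
    print(json.loads(doc))  # only the "b" entry remains

    # ... so all sources belong in one array:
    chart_info_inputs = [
        "urn:li:dataset:(urn:li:dataPlatform:hive,a,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:hive,b,PROD)",
    ]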

    brief-lizard-77958

    06/22/2021, 8:49 AM
    Will UTF-8 character encoding be supported at any point? Currently, non-English characters get garbled into random symbols.

    steep-pizza-15641

    06/22/2021, 12:42 PM
    Hi, the Airflow integration in DataHub is very cool. I wonder if there is much interest in/demand for a similar integration with Lyft's Flyte?

    chilly-holiday-80781

    06/23/2021, 11:33 PM
    <!here> A couple of people have been asking for tips on ingesting metadata from AWS S3, so we've put together a guide on using AWS Glue to crawl S3 buckets, which can then be ingested into DataHub. This setup avoids having to crawl large S3 buckets directly and also leverages Glue's powerful built-in classifiers. We now also support ingesting jobs and pipelines from Glue by default, so you'll be able to view the complete flow of information in DataHub. Feel free to message me with any questions!

    fancy-helmet-32669

    06/24/2021, 7:36 PM
    Traceback (most recent call last):
    ........
      File "/usr/local/lib/python3.8/site-packages/avrogen/avrojson.py", line 272, in <listcomp>
        return [self._generic_from_json(x, writers_schema.items, readers_schema.items)
      File "/usr/local/lib/python3.8/site-packages/avrogen/avrojson.py", line 248, in _generic_from_json
        result = self._union_from_json(json_obj, writers_schema, readers_schema)
      File "/usr/local/lib/python3.8/site-packages/avrogen/avrojson.py", line 304, in _union_from_json
        raise schema.AvroException('Datum union type not in schema: %s', value_type)
    avro.schema.AvroException: ('Datum union type not in schema: %s', 'com.linkedin.pegasus2avro.common.BrowsePaths')
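    A hedged reading of this one: "Datum union type not in schema" usually means the emitting client and the receiving side disagree on the MCE schema, for example a newer acryl-datahub emitting aspects such as BrowsePaths that an older server's Avro schema doesn't know about, so aligning CLI and server versions is the first thing to check.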

    clever-smartphone-69649

    06/25/2021, 6:51 PM
    Integration Survey: What orchestrator are you running?