# ingestion
f
Hi all! Working on implementing redshift ingestion and running into some problems getting the schema to appear in the UI. It seems like this is connected to the platform_schema part of the mce. https://github.com/linkedin/datahub/blob/master/metadata-ingestion/mysql-etl/mysql_etl.py#L30 Is there anywhere I can go for logs associated with this? I don't see anything popping up in the docker-compose logs and the entities are being registered, so from my perspective this is a silent failure atm.
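For context, those ETL scripts build the MCE as a plain dict and hand it to confluent_kafka's AvroProducer, roughly like the sketch below. The topic name, schema path, and broker/registry addresses here are assumptions for illustration, not anything confirmed in this thread.

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Assumed local copy of the MetadataChangeEvent Avro schema
value_schema = avro.load("MetadataChangeEvent.avsc")

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",            # assumed quickstart broker
        "schema.registry.url": "http://localhost:8081",   # assumed schema registry
    },
    default_value_schema=value_schema,
)

def emit_mce(mce: dict) -> None:
    """Avro-encode one MCE dict and send it to the (assumed) MetadataChangeEvent topic."""
    producer.produce(topic="MetadataChangeEvent", value=mce)
    producer.flush()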
b
Do you see any error in
docker logs datahub-mce-consumer
f
Nope, just an info log:
datahub-mae-consumer | 01:20:04.509 [mae-consumer-job-client-StreamThread-1] INFO c.l.m.k.MetadataAuditEventsProcessor - {com.linkedin.metadata.snapshot.DatasetSnapshot={urn=urn:li:dataset:(urn:li:dataPlatform:redshift,<Omitted>,PROD), aspects=[{com.linkedin.common.Ownership={owners=[{owner=urn:li:corpuser:datahub, type=DATAOWNER}], lastModified={actor=urn:li:corpuser:datahub, time=0}}}]}}
datahub-gms | 01:20:04.513 [kafka-producer-network-thread | producer-1] INFO c.l.m.d.p.KafkaProducerCallback - Kafka producer callback:
datahub-gms | 01:20:04.513 [kafka-producer-network-thread | producer-1] INFO c.l.m.d.p.KafkaProducerCallback - Metadata: MetadataAuditEvent-0@587
datahub-gms | 01:20:04.513 [kafka-producer-network-thread | producer-1] INFO c.l.m.d.p.KafkaProducerCallback - Exception: null
datahub-mae-consumer | 01:20:04.578 [mae-consumer-job-client-StreamThread-1] INFO c.l.m.k.MetadataAuditEventsProcessor - {com.linkedin.metadata.snapshot.DatasetSnapshot={urn=urn:li:dataset:(urn:li:dataPlatform:redshift,<Omitted>,PROD), aspects=[{com.linkedin.dataset.DatasetProperties={tags=[], customProperties={}}}]}}
oh I see, that's the mae consumer
I don't see any logs from mce-consumer after the startup logs
Restarted everything from scratch. The datasets themselves continue to show up in the UI, but I don't see any logs from the mce consumer
b
I see. Could you share an example MCE you're sending that includes the schema metadata?
f
Got debug logging turned on and it definitely seems that the mce consumer is processing the messages
I'll get an mce for you
{'auditHeader': None,
 'proposedSnapshot': ('com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot', {
     'urn': 'urn:li:dataset:(urn:li:dataPlatform:redshift,test_schema.test_table,PROD)',
     'aspects': [
         {'lastModified': {'actor': 'urn:li:corpuser:datahub', 'time': 0},
          'owners': [{'owner': 'urn:li:corpuser:datahub', 'type': 'DATAOWNER'}]},
         {'platform': 'urn:li:dataPlatform:redshift',
          'version': '1',
          'hash': '',
          'fields': [
              {'fieldPath': 'test_column_one',
               'type': {'type': {'com.linkedin.pegasus2avro.schema.StringType': {}}},
               'nativeDataType': 'integer'},
              {'fieldPath': 'test_column_two',
               'type': {'type': {'com.linkedin.pegasus2avro.schema.StringType': {}}},
               'nativeDataType': 'character varying'},
              {'fieldPath': 'test_column_3',
               'type': {'type': {'com.linkedin.pegasus2avro.schema.StringType': {}}},
               'nativeDataType': 'character varying'}],
          'created': {'actor': 'urn:li:corpuser:datahub', 'time': 1587136066},
          'lastModified': {'actor': 'urn:li:corpuser:datahub', 'time': 1587136066},
          'schemaName': 'test_schema.test_table',
          'platformSchema': {'OtherSchema': "{'namespace': 'com.linkedin.dataset', 'type': 'record', 'name': 'RedshiftSchema', 'fields': [{'type': 'integer', 'name': 'test_column_one'}, {'type': 'character varying', 'name': 'test_column_two'}, {'type': 'character varying', 'name': 'test_column_3'}], 'doc': 'Test Redshift Data'}"}}]}),
 'proposedDelta': None}
apologies for these walls of text
o
Does it show up in MySQL or in ES when manually querying those sources? You can get a shell in the MySQL container using docker exec and use the configured creds to look into it. ES exposes a REST API on port 9200 on localhost; by default with quickstart it has no access control. If you modify the docker-compose command you can also enable debug mode for GMS and see if the request to persist is coming through by setting a breakpoint in BaseVersionedAspectResource or EbeanLocalDAO. Hopefully one of those gives you some more information to help debug 🙂
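For reference, a quick way to poke at ES without Kibana is its REST API. A minimal Python sketch, assuming the quickstart setup (localhost:9200, no auth); it lists the indices first so you don't have to guess index names, which is also what you'd plug into a Kibana index pattern.

import requests

ES = "http://localhost:9200"

# List every index so you don't have to guess names (also handy for Kibana index patterns)
print(requests.get(f"{ES}/_cat/indices?v").text)

# Free-text search across all indices for the test table name
resp = requests.get(f"{ES}/_search", params={"q": "test_schema.test_table"})
print(resp.json())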
f
Thanks! I'm trying to use kibana to look at some of the data in ES. Do you happen to know what index pattern I should use?
The message I'm seeing is: "In order to use Kibana you must configure at least one index pattern. Index patterns are used to identify the Elasticsearch index to run search and analytics against. They are also used to configure fields."
Ok I found the records in mysql
So everything's definitely getting processed. It doesn't look like my schema or my fields are showing up in the metadata columns.
Seems like I'm definitely formatting something wrong.
ah ok, so no com.linkedin.schema.SchemaMetadata aspects are showing up in mysql
b
Could you try to emit an MCE that contains only the schema metadata to see if it makes a difference?
f
sure! let me give that a spin
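For reference, a schema-only MCE for that test would look roughly like the minimal sketch below. It reuses the second aspect from the full event pasted above (so any formatting problem in it is preserved too), and the values are illustrative rather than taken from a working run.

schema_only_mce = {
    'auditHeader': None,
    'proposedSnapshot': ('com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot', {
        'urn': 'urn:li:dataset:(urn:li:dataPlatform:redshift,test_schema.test_table,PROD)',
        'aspects': [
            # Single aspect: the schema metadata only, no ownership or properties
            {'schemaName': 'test_schema.test_table',
             'platform': 'urn:li:dataPlatform:redshift',
             # If the Avro schema expects a long here, a string is exactly the kind of
             # mismatch the producer can drop silently (see the end of the thread).
             'version': '1',
             'hash': '',
             'created': {'actor': 'urn:li:corpuser:datahub', 'time': 1587136066},
             'lastModified': {'actor': 'urn:li:corpuser:datahub', 'time': 1587136066},
             'fields': [
                 {'fieldPath': 'test_column_one',
                  'type': {'type': {'com.linkedin.pegasus2avro.schema.StringType': {}}},
                  'nativeDataType': 'integer'}],
             'platformSchema': {'OtherSchema': "{'doc': 'Test Redshift Data'}"}}]}),
    'proposedDelta': None,
}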
Ok with just the schema I get a log from the consumer saying Got MCE, but nothing shows up in mysql at all
I also get a log from the mae saying this:
datahub-mae-consumer | 17:54:22.294 [mae-consumer-job-client-StreamThread-1] INFO c.l.m.k.MetadataAuditEventsProcessor - {com.linkedin.metadata.snapshot.DatasetSnapshot={urn=urn:li:dataset:(urn:li:dataPlatform:redshift,test_schema.test_table,PROD), aspects=[{com.linkedin.dataset.DatasetProperties={tags=[], customProperties={}}}]}}
basically it's not seeing the schema as an aspect
So to be clear, should I be using the keyword otherSchema, as opposed to tableSchema or documentSchema??
Seems like the generated avro file implies that's the catch-all for datasets that aren't yet supported
b
OtherSchema should be fine. It sounds like the MCE consumer somehow isn't ingesting the event. Could you create an issue on GitHub with this information and a sample MCE so we can look into it further? Thanks!
f
Yah no problem. Thanks for the help!
w
Would you be able to check if this can fix your issue? https://github.com/linkedin/datahub/issues/1606
f
I just got things working by copying data from bootstrap_mce.dat and parametrizing things until they looked right. Notably this leaves the timestamps fixed, which is enough for my POC in the short term. So the "bug" you linked might be the issue.
I realize now that the schema aspects were never even making it into the kafka messages (I probably should have figured that out from the logs). For one reason or another the confluent_kafka python producer appears to silently drop items from arrays when they don't match the avro schema. That's pretty unfortunate behavior, and to my mind it's the real underlying issue here.
If I end up going back and making the timestamps variable again I'll check back here; might save somebody some time one day. Thanks for all the help everyone!
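For anyone hitting the same silent-drop behavior, one cheap guard is to validate each record against the Avro schema before handing it to the producer, so mismatches surface as exceptions instead of quietly dropped fields. A minimal sketch, assuming fastavro is available; the schema path is illustrative.

import json

from fastavro import parse_schema
from fastavro.validation import validate

def validate_mce(mce: dict, schema_path: str = "MetadataChangeEvent.avsc") -> None:
    """Raise a fastavro ValidationError if the MCE doesn't match the Avro schema."""
    with open(schema_path) as f:
        schema = parse_schema(json.load(f))
    validate(mce, schema, raise_errors=True)

Calling validate_mce(record) right before producing would turn a silently dropped schema aspect into a loud error.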