# ingestion
b
Hi, did anyone manage to resolve the AWS MSK managed Kafka timeout issue when ingesting? I'm getting the same error.
m
@blue-holiday-20644: this one is still open... we haven't gotten to the bottom of it yet.
e
cc @curved-jordan-15657 @adventurous-scooter-52064 After experimentation, we realized that the replication factor was the issue. If the replication factor for the MCE topic is 1, ingestion times out, while if we set it to 2 or more it doesn’t. Created a quick PR to change the replication factor during kafka-setup, but modifying it for an existing topic is not trivial: https://docs.confluent.io/platform/current/kafka/post-deployment.html#increasing-replication-factor
You can also try deleting the topics and rerunning kafka-setup with replication set to a higher number
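For reference, a rough sketch of what increasing the replication factor on an existing topic looks like with the stock Kafka tooling, following the Confluent doc linked above. The topic name, partition list, broker IDs and bootstrap address are placeholders, and older Kafka releases take --zookeeper instead of --bootstrap-server for the reassignment tool:
```
# 1. Check the current partition -> replica assignment
kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe --topic MetadataChangeEvent_v4

# 2. Write a reassignment plan listing the desired replicas per partition
cat > increase-replication.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "MetadataChangeEvent_v4", "partition": 0, "replicas": [1, 2] }
  ]
}
EOF

# 3. Apply the plan, then verify it completed
kafka-reassign-partitions.sh --bootstrap-server "$BOOTSTRAP" \
  --reassignment-json-file increase-replication.json --execute
kafka-reassign-partitions.sh --bootstrap-server "$BOOTSTRAP" \
  --reassignment-json-file increase-replication.json --verify
```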
c
Hi @early-lamp-41924! I see, but according to the AWS docs (https://docs.aws.amazon.com/msk/latest/developerguide/msk-default-configuration.html), the default.replication.factor property is “3 for 3-AZ clusters, 2 for 2-AZ clusters”. In my case, I have 3 brokers across 3 AZs, so my default.replication.factor should be 3. I didn’t set this in a configuration file; I’m using the default MSK configuration. But you were talking about the MCE topic… Do I need to change it manually for that specific topic?
b
Thanks for the update. I'm running a 2-zone MSK cluster with the default replication config, but I'll look into what I can adjust in that area.
Also, is it possible to configure the AWS Glue schema registry in the dockerised DataHub, similar to the Helm version?
I managed to get my 2-node MSK cluster running with the dockerised DataHub using these MSK settings. Setting them to 2 didn't resolve the timeout. I also had my kafka-setup.sh set to : ${PARTITIONS:=2} and : ${REPLICATION_FACTOR:=2}, which I might need to tune down again.
I'll have to see if I can increase replication for production, but just getting ingests to run is progress.
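In case it helps anyone, the : ${PARTITIONS:=2} lines are just bash default-value assignments, so the values can also be overridden from the environment instead of editing kafka-setup.sh. A minimal sketch, assuming the stock script and a docker-compose service for kafka-setup (names may differ in your deployment):
```
# kafka-setup.sh only applies these as defaults, so an existing env var wins:
#   : ${PARTITIONS:=1}
#   : ${REPLICATION_FACTOR:=1}
# e.g. in the docker-compose service for kafka-setup:
#   environment:
#     - PARTITIONS=2
#     - REPLICATION_FACTOR=2
# or when running the script directly:
PARTITIONS=2 REPLICATION_FACTOR=2 ./kafka-setup.sh
```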
m
@blue-holiday-20644: I’m not able to fully follow the end state. Were you able to run ingestion successfully? Were you able to increase the replication factor to a number greater than 1?
@curved-jordan-15657: yes, you have to set this manually for the existing Kafka topics created by the setup script, since they have already been set up with a replication factor of 1. It seems like there is some interaction between that setting and the client not being able to produce to these topics.
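A quick way to check what the setup script actually created is to describe the topics; the bootstrap address and topic name below are just examples:
```
# Look at the ReplicationFactor / Replicas columns in the output
kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --describe --topic MetadataChangeEvent_v4
# If it shows ReplicationFactor: 1, the topic needs to be reassigned (see the
# sketch above) or deleted and recreated by kafka-setup with a higher factor.
```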
b
@mammoth-bear-12532 Yes, I managed to get ingestion recipes using the Kafka sink to run without the timeout issue. I initially tried setting my replication settings to 2 in the MSK configs, which didn't resolve the timeout. For a 2-node MSK topology it seems to work by setting them to 1 as above; maybe this is an N-1 situation for replicating across N nodes?
MSK configurations seem to be static and not derived from the size of your cluster, so you have to create and apply specific configs to change the default values, which may resolve issues for cluster sizes other than the default of 3 nodes.
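If anyone needs to do that, here is a rough sketch with the AWS CLI; the configuration name, property values, cluster ARN and version are placeholders, so double-check against the MSK docs:
```
# 1. Put the broker properties you want in a plain text file
cat > msk-2node.properties <<'EOF'
default.replication.factor=2
min.insync.replicas=1
EOF

# 2. Create an MSK configuration from it
aws kafka create-configuration \
  --name "two-node-defaults" \
  --server-properties fileb://msk-2node.properties

# 3. Apply it to the cluster (needs the configuration ARN from step 2,
#    the cluster ARN and the cluster's current version)
aws kafka update-cluster-configuration \
  --cluster-arn "<cluster-arn>" \
  --configuration-info Arn=<configuration-arn>,Revision=1 \
  --current-version "<current-cluster-version>"
```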
m
Oh interesting. For our three node cluster, replication factor 2 and 3 both seem to work.
b
I changed it, cleared out all the topics and it still gave me the timeout.
e
On the Glue side, yes you can, but unfortunately you can’t use it for Kafka-based ingestion 😞 it doesn’t support the Python Kafka APIs.
Did you delete the topic and recreate it with the new replication factor?
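For reference, deleting the topics so kafka-setup can recreate them looks roughly like this; the topic names and bootstrap address are examples only, and delete.topic.enable must be enabled on the brokers:
```
# List the existing topics first, then delete the DataHub ones
kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --list
kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --delete --topic MetadataChangeEvent_v4
kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --delete --topic MetadataAuditEvent_v4
# Re-run kafka-setup with the higher REPLICATION_FACTOR so they are recreated correctly
```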
b
I manually deleted the topics and kafka-setup.sh recreated them with 2 partitions and a replication factor of 2. I'm just testing now with both reverted back to 1.
e
So that needs to be 2 or more in a regular MSK setup
b
: ${PARTITIONS:=1} and : ${REPLICATION_FACTOR:=1} - just tested with these original kafka-setup settings and it still worked. So I guess my MSK configs were causing the issues...?
e
Interesting, hmm
b
My 2-node MSK config looks like this currently
I wonder if min.insync.replicas set to 2 would never succeed for a 2-node setup?
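For what it's worth, min.insync.replicas=2 on a 2-broker cluster can work while both brokers are in sync, but any produce with acks=all will fail as soon as one broker drops out of the ISR, which can surface as timeouts. A sketch of checking and, if needed, relaxing it per topic; the topic name and bootstrap address are placeholders:
```
# Show the topic-level overrides currently in effect
kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --entity-type topics \
  --entity-name MetadataChangeEvent_v4 --describe
# Relax the override for this topic if it is set too high for the cluster size
kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --entity-type topics \
  --entity-name MetadataChangeEvent_v4 --alter --add-config min.insync.replicas=1
```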