Can someone point me to a resource to help build a flink sou Apache Flink #random

# random

Scott Fauerbach

09/18/2023, 4:06 PM

Can someone point me to a resource to help build a flink source?

Nathanael England

09/18/2023, 4:28 PM

https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/sourcessinks/ would be a good place to start

Nathanael England

09/18/2023, 4:29 PM

And https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/sources/ for datastream API

Scott Fauerbach

09/18/2023, 4:30 PM

Thank you, been there. Not much help. Currently looking at this article https://medium.com/@SelimAbidin/how-flink-sources-work-and-how-to-implement-one-70b52fcfeb29, the Kafka source implementation, and the Flink connector project code

Scott Fauerbach

09/18/2023, 4:32 PM

If I have to I'll go to legacy RichSourceFunction, but I don't want to

Martijn Visser

09/18/2023, 4:37 PM

What type of source is it? Bounded or unbounded?

Scott Fauerbach

09/18/2023, 4:37 PM

unbounded from polling source

Scott Fauerbach

09/18/2023, 4:38 PM

I'm trying to build a source for NATS, I'm the lead Java developer for the NATS Java client

Scott Fauerbach

09/18/2023, 4:39 PM

I'm close, but had been going around in circles yesterday

Scott Fauerbach

09/18/2023, 4:40 PM

actually unbounded from an async source, so I'll have to manage when multiple messages come in

Scott Fauerbach

09/18/2023, 4:41 PM

to queue them up or something
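
A minimal sketch of the "queue them up" idea, in plain Java with no NATS or Flink dependency. The names `MessageBuffer`, `onMessage`, and `poll` are hypothetical: the assumption is that the async client callback would call `onMessage`, and whatever the source's fetch path ends up being would call `poll`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical hand-off buffer between an async callback thread and a polling thread. */
final class MessageBuffer {
    private final BlockingQueue<String> queue;

    MessageBuffer(int capacity) {
        // Bounded, so a fast publisher cannot grow memory without limit.
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Async side. Returns false if the buffer stayed full for the whole timeout. */
    boolean onMessage(String payload, long timeoutMillis) {
        try {
            return queue.offer(payload, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        }
    }

    /** Polling side. Waits up to timeoutMillis for a first message, then drains up to maxBatch. */
    List<String> poll(int maxBatch, long timeoutMillis) {
        List<String> batch = new ArrayList<>();
        if (maxBatch <= 0) {
            return batch;
        }
        try {
            String first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (first != null) {
                batch.add(first);
                queue.drainTo(batch, maxBatch - 1); // take whatever else is already waiting
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return batch;
    }
}
```

The bounded queue is the design choice here: when the consumer falls behind, `onMessage` starts returning false instead of buffering forever. Whether the callback may block or must drop depends on the client's dispatch model, which this sketch does not assume.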

Martijn Visser

09/18/2023, 4:41 PM

I’m going to give a talk at Current next week about the Flink connectors and ecosystem, so I’m curious what you’re running into. I’m not too familiar with NATS though.

Scott Fauerbach

09/18/2023, 4:42 PM

so NATS is simple messages. Pub/sub, request/reply and streaming. I'm working just on subscribe right now, I've already got a sink working

Scott Fauerbach

09/18/2023, 4:43 PM

I have to connect to a server and listen. For instance not sure where I should make my connection. In the reader?

Martijn Visser

09/18/2023, 4:43 PM

Any luck so far with the Splits/SplitEnumerator?

Scott Fauerbach

09/18/2023, 4:44 PM

yeah, I've got State, I've got a split, which is just the subject I'm listening to

Scott Fauerbach

09/18/2023, 4:44 PM

My state has assigned and unassigned subjects, working on getting the next one for handleSplitRequest

Scott Fauerbach

09/18/2023, 4:45 PM

so I'll pull one out of unassigned
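
A rough sketch of that bookkeeping in plain Java, with no Flink types. `SubjectAssignments` and its methods are hypothetical names; the assumption is that a split is only a subject string and that the split-request handler wants "the next unassigned subject, if any".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical enumerator state: every split is just a subject name. */
final class SubjectAssignments {
    private final Deque<String> unassigned = new ArrayDeque<>();
    private final Map<String, Integer> assigned = new HashMap<>(); // subject -> reader id

    SubjectAssignments(List<String> subjects) {
        unassigned.addAll(subjects);
    }

    /** Moves the next unassigned subject to the requesting reader, if one is left. */
    Optional<String> assignNext(int readerId) {
        String subject = unassigned.poll();
        if (subject == null) {
            return Optional.empty(); // nothing left to hand out
        }
        assigned.put(subject, readerId);
        return Optional.of(subject);
    }

    int unassignedCount() {
        return unassigned.size();
    }

    Optional<Integer> ownerOf(String subject) {
        return Optional.ofNullable(assigned.get(subject));
    }
}
```

In a real enumerator the result of `assignNext` would presumably be passed on through `SplitEnumeratorContext.assignSplit` from inside `handleSplitRequest(int subtaskId, String requesterHostname)`; that wiring is left out here.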

Scott Fauerbach

09/18/2023, 4:46 PM

just working through the enumerator interface

Martijn Visser

09/18/2023, 4:46 PM

Gotcha. Yeah we really lack a good blog on how to write a proper source and sink

Martijn Visser

09/18/2023, 4:46 PM

Or docs

Scott Fauerbach

09/18/2023, 4:47 PM

been there. My docs aren't awesome either.

Scott Fauerbach

09/18/2023, 4:48 PM

that one article is helping and looking at the kafka code helps, but it's more complicated than pub sub. I will write an article when I'm done

Scott Fauerbach

09/18/2023, 4:48 PM

We are also discussing contributing this to the Flink project. A customer is paying for it but the contract says it's OSS

Scott Fauerbach

09/18/2023, 4:48 PM

plus we know it's needed

Martijn Visser

09/18/2023, 4:48 PM

I can see the benefits for the Flink community too

Martijn Visser

09/18/2023, 4:50 PM

Have you looked at other sources that use the source interface, like Pulsar? There’s also new work being done on porting the Kinesis source to the new interface

Martijn Visser

09/18/2023, 4:50 PM

Perhaps @Hong Teoh can share some of his thoughts on the latter

👀 1

Scott Fauerbach

09/18/2023, 4:51 PM

Thanks for responding. I'll try to keep working on this and ask for help as little as possible. I did not look at those. Are those code bases near the Kafka connector? If so, I'll find them

Hong Teoh

09/18/2023, 4:59 PM

Interesting to find out about NATS!

“I have to connect to a server and listen. For instance not sure where I should make my connection. In the reader?”

This needs to be done in the SplitReader. The threading model here will be interesting. Is there a client for NATS that handles polling records for a given “partition”?

Scott Fauerbach

09/18/2023, 5:01 PM

currently I can start a background thread that just gets messages for a subject

👍 1

Scott Fauerbach

09/18/2023, 5:01 PM

I needed to handle when many come across fast.

Hong Teoh

09/18/2023, 5:02 PM

IIUC, in the Kafka source, most of the heavy lifting is actually delegated to the KafkaConsumer. SplitReader.fetch() is actually called via a SingleThreadFetcherManager that polls a single KafkaConsumer instance (which has its own thread pool). This means any connection can be sync/blocking, because the thread pool actually making the connection is separate from the Flink job thread. Sounds like you’re implementing something similar to the Kafka source then

Hong Teoh

09/18/2023, 5:05 PM

For the Kinesis source we do something slightly less efficient 👀. Most of the heavy lifting is actually done in the Flink job itself. SplitReader.fetch() makes a call to the Kinesis endpoint to retrieve records (a single-threaded fetcher manager is used as well). But we do this because we have another mode (EFOConsumer) that has the same model as the KafkaConsumer (manages its own thread pool)

Hong Teoh

09/18/2023, 5:10 PM

An initial simple implementation for NATS could be like the Kinesis one, then iteratively improved on. I made a simple diagram when trying to understand the source framework that you might find helpful! 😄

Key points:
• Green bits are the bits you have to implement for each Source
• SplitEnumerator -> Discovers the smallest unit (partition, shard etc)
• Split -> The actual unit (partition, shard)
• SplitReader -> Reads from the assigned shard
• SplitFetcherManager -> Spins up threads to run the SplitReaders. Both Kafka and Kinesis use single threaded FetcherManagers.
• RecordEmitter -> Outputs records from Source to the actual Flink job graph. Key point here is to ensure the Source state is only updated as “read” after emitting the records to the job graph. Otherwise exactly-once semantics are violated.
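
The RecordEmitter point (mark as “read” only after emitting) can be sketched in plain Java. Everything here is hypothetical: `SequencedMessage`, `SubjectSplitState`, and the `position` field are stand-ins, and a `List` plays the role of the downstream output. The sketch also assumes the source has a replayable position at all, which may only apply to a stream-backed subject.

```java
import java.util.List;

/** Hypothetical message carrying a replayable position. */
final class SequencedMessage {
    final long position;
    final String payload;

    SequencedMessage(long position, String payload) {
        this.position = position;
        this.payload = payload;
    }
}

/** Hypothetical per-split state: the value a checkpoint would capture. */
final class SubjectSplitState {
    final String subject;
    long lastEmittedPosition = -1L; // -1 means nothing emitted yet

    SubjectSplitState(String subject) {
        this.subject = subject;
    }
}

final class EmitThenAdvance {
    private EmitThenAdvance() {}

    /** Emits first, then advances the state, so a failed emit never counts as read. */
    static void emit(SequencedMessage message, List<String> downstream, SubjectSplitState state) {
        downstream.add(message.payload);               // 1. hand the record downstream
        state.lastEmittedPosition = message.position;  // 2. only now mark it as read
    }
}
```

If step 1 throws, step 2 never runs, so a restore from the last checkpoint would re-read that message instead of skipping it. In Flink the corresponding hook is `RecordEmitter.emitRecord(element, output, splitState)`, which receives the split state for exactly this kind of update.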

Hong Teoh

09/18/2023, 5:12 PM

Sorry for the splurge 😆 Happy to answer any questions that you have! Especially around exactly once semantics / state handling

Scott Fauerbach

09/18/2023, 5:12 PM

ok I'm in a meeting will look at this soon

Scott Fauerbach

09/18/2023, 7:15 PM

So eventually maybe more like the Kafka source when I'm talking to a stream. But this isn't really helping much either. I have no state except my connection and the subject I'm listening to. Looking in the Kinesis code: my splits basically cover a single subject (topic). I don't get this in handleSplitRequest:

// Do nothing, since we assign splits eagerly

What happens when splits are added back? Something else is running that reassigns a split?

Scott Fauerbach

09/18/2023, 7:17 PM

And then in start(), there are a couple of context.callAsync calls. I mean, I feel like a complete idiot, I'm just not seeing it. I just want to have something start a reader. If that reader fails, fine, let it be pulled, record somewhere that that subject/topic isn't being read, and start a new one.
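
A plain-Java sketch of the "reader fails, subject goes back, a new reader picks it up" flow under eager assignment. `EagerAssigner` and its method names are hypothetical; they only mirror the roles of `addReader(int subtaskId)` and `addSplitsBack(List<SplitT> splits, int subtaskId)` on Flink's `SplitEnumerator`. Handing every pending subject to whichever reader registers next is a simplification, not how any existing connector balances load.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical eager assigner: subjects are pushed to readers as they register. */
final class EagerAssigner {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Map<Integer, List<String>> byReader = new HashMap<>();

    EagerAssigner(List<String> subjects) {
        pending.addAll(subjects);
    }

    /** Role of addReader: a reader (re)registered, so give it everything pending. */
    List<String> readerRegistered(int readerId) {
        List<String> given = new ArrayList<>(pending);
        pending.clear();
        byReader.computeIfAbsent(readerId, id -> new ArrayList<>()).addAll(given);
        return given;
    }

    /** Role of addSplitsBack: a reader failed, so its subjects become pending again. */
    void splitsReturned(int readerId, List<String> subjects) {
        List<String> owned = byReader.get(readerId);
        if (owned != null) {
            owned.removeAll(subjects);
        }
        pending.addAll(subjects); // this is the record that nobody is reading them
    }

    /** Role of handleSplitRequest: nothing to do, since assignment is eager. */
    void splitRequested(int readerId) {
        // Do nothing, since we assign splits eagerly
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> ownedBy(int readerId) {
        return new ArrayList<>(byReader.getOrDefault(readerId, Collections.emptyList()));
    }
}
```

Read this way, nothing else needs to be running in the background: the returned subjects wait in `pending` until the next registration event assigns them again.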

Scott Fauerbach

09/18/2023, 7:19 PM

Do I even need split state? I have none except the split itself
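
A small sketch of how little there is to persist when the split is only the subject. `SubjectSplitCodec` is a hypothetical name; it follows the shape of Flink's `SimpleVersionedSerializer` (a version number plus serialize/deserialize) without depending on it, and it assumes a split really carries nothing beyond the subject string.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical codec for a split that is nothing more than a subject name. */
final class SubjectSplitCodec {
    static final int VERSION = 1;

    private SubjectSplitCodec() {}

    /** The checkpointed form of the split is just the subject as UTF-8 bytes. */
    static byte[] serialize(String subject) {
        return subject.getBytes(StandardCharsets.UTF_8);
    }

    /** Rejects versions this sketch does not know how to read. */
    static String deserialize(int version, byte[] serialized) {
        if (version != VERSION) {
            throw new IllegalArgumentException("Unknown split version: " + version);
        }
        return new String(serialized, StandardCharsets.UTF_8);
    }
}
```

Under that assumption the split can double as its own state, and the serializer is the only extra piece the framework asks for.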

Scott Fauerbach

09/18/2023, 7:21 PM

I can make my reader polling or async too, not sure which way to go. Polling is easier, but I don't want messages to get backed up

Scott Fauerbach

09/18/2023, 7:21 PM

async I could add a queue to hold messages I'm given and then hand them out when asked

Scott Fauerbach

09/18/2023, 7:22 PM

I think Kafka might be closer to async as described above.

Hong Teoh

09/19/2023, 3:30 PM

https://github.com/apache/flink-connector-aws/blob/main/flink-connector-aws/flink-[…]/flink/connector/kinesis/source/examples/SourceFromKinesis.java
