Pulsar functions should be able to return none and multiple values #19657

KIC · 2019-06-10T07:07:59Z

First of all, I love Pulsar Functions! They are almost exactly what I have always wanted to build, but as a solo hobby project I never really finished it. So I am really happy someone else did! However, I am missing one important feature.

Is your feature request related to a problem? Please describe.
As I understand it, the current implementation of Pulsar Functions is a 1:1 relationship: one event in -> one event out. This is very limiting, as one cannot even write a function to filter out events. There are also use cases where you get a batched message and need to "unpack" it into single events, or where you need to interpolate values from a previous event (which is held in the state).

Describe the solution you'd like
I propose that the interface should be something along the lines of Function<I, ? extends Collection<O>>. This way you can return nothing (an empty list), a single element, or a collection of, for example, interpolated values.
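
To make the proposal concrete, here is a minimal sketch of what such a collection-returning function could look like for the "unpack a batch" case. Nothing here is an existing Pulsar API: the class name UnpackBatch, the comma-separated batch format, and the runLocally driver (a local stand-in for the runtime publishing each element) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical function: one batched message in, zero or more events out.
class UnpackBatch implements Function<String, Collection<String>> {
    @Override
    public Collection<String> apply(String batch) {
        // An empty collection means "emit nothing for this input".
        if (batch == null || batch.trim().isEmpty()) {
            return Collections.emptyList();
        }
        List<String> events = new ArrayList<>();
        for (String part : batch.split(",")) {
            String event = part.trim();
            if (!event.isEmpty()) {
                events.add(event);
            }
        }
        return events;
    }

    // Hypothetical stand-in for the runtime: publishes every returned element in order.
    static <I, O> List<O> runLocally(Function<I, ? extends Collection<O>> fn, List<I> inputs) {
        List<O> published = new ArrayList<>();
        for (I input : inputs) {
            published.addAll(fn.apply(input));
        }
        return published;
    }

    public static void main(String[] args) {
        System.out.println(runLocally(new UnpackBatch(), Arrays.asList("a,b", "", "c")));
    }
}
```

With this shape, filtering, 1:1 mapping, and fan-out are all the same code path for the caller: it simply publishes whatever the collection contains.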

Describe alternatives you've considered
If you consider a PublishFunction, then I see the following problem: the moment you also need to store state via the Context, you get a timing issue. What if you stored the state but then for some reason are not able to send to the topic? Or even worse, what if you could send n of m messages and then the network fails? It would be cleaner and easier if Pulsar handled all these cases outside of the function implementation.

Additional context
One does not necessarily need atomic transactions across different storage solutions for this use case. Functions just need to be deterministic. During startup (or retry) you then only need to know what is required to reproduce the failed "state" (the nacked message) and which message was the last one sent to the target topic. You then store the state and send only the messages that come after the last one already sent.

sijie (Collaborator) · 2019-06-13T08:46:26Z

@KIC sounds like a good proposal.

jerrypeng (Collaborator) · 2019-06-13T17:20:41Z

@KIC thank you for taking the time to think about how to improve Pulsar Functions!

Just to clarify some things:

> As I understand it, the current implementation of Pulsar Functions is a 1:1 relationship: one event in -> one event out. This is very limiting, as one cannot even write a function to filter out events. There are also use cases where you get a batched message and need to "unpack" it into single events, or where you need to interpolate values from a previous event (which is held in the state).

You are able to return "null" in a function for filtering purposes.
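
As a minimal sketch of that filtering pattern: the function below is written against plain java.util.function.Function, which Pulsar Functions accept as the native Java interface. The class name EvenFilter and the runLocally driver are hypothetical; the driver only imitates the behaviour described here, where a null result produces no output message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical example function: forwards even numbers and drops odd ones.
class EvenFilter implements Function<Integer, Integer> {
    @Override
    public Integer apply(Integer input) {
        // Returning null means "no output for this input".
        return (input % 2 == 0) ? input : null;
    }

    // Hypothetical local stand-in for the runtime: publishes only non-null results.
    static List<Integer> runLocally(List<Integer> inputs) {
        EvenFilter fn = new EvenFilter();
        List<Integer> published = new ArrayList<>();
        for (Integer input : inputs) {
            Integer output = fn.apply(input);
            if (output != null) {
                published.add(output);
            }
        }
        return published;
    }

    public static void main(String[] args) {
        System.out.println(runLocally(List.of(1, 2, 3, 4)));
    }
}
```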

> I propose that the interface should be something along the lines of Function<I, ? extends Collection<O>>.

We can have such an interface.

> If you consider a PublishFunction, then I see the following problem: the moment you also need to store state via the Context, you get a timing issue. What if you stored the state but then for some reason are not able to send to the topic? Or even worse, what if you could send n of m messages and then the network fails? It would be cleaner and easier if Pulsar handled all these cases outside of the function implementation.

To send a message to any topic from a function:

context.newOutputMessage(publishTopic, Schema.STRING).value(output).sendAsync();

This method returns a CompletableFuture. You can always wait for the CompletableFuture to complete before updating the state. If there is a send failure, throw an exception; under the EFFECTIVELY_ONCE guarantee, the function instance will restart itself and replay the last message. Thus, your state doesn't get updated for a message that didn't get sent out. Of course, updating the state could still fail in theory, in which case you would have sent a message but not updated the state.
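
The ordering described above (send, wait, then update state) can be sketched without a Pulsar dependency. In this sketch the Sender interface is a hypothetical stand-in for context.newOutputMessage(...).sendAsync(), and the in-memory map is a hypothetical stand-in for the function's state store; only the CompletableFuture behaviour is real.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class SendThenStore {
    // Hypothetical stand-in for context.newOutputMessage(...).sendAsync().
    interface Sender {
        CompletableFuture<Void> sendAsync(String value);
    }

    // Hypothetical stand-in for the function's state store.
    final Map<String, Long> state = new HashMap<>();

    void process(String input, Sender sender) {
        // Block until the send completes; join() throws CompletionException if it failed.
        sender.sendAsync(input.toUpperCase()).join();
        // Reached only after a successful send, so state never runs ahead of the output.
        state.merge("processed", 1L, Long::sum);
    }

    // Helper that builds an already-failed future to simulate a send failure.
    static CompletableFuture<Void> failed(Throwable cause) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.completeExceptionally(cause);
        return future;
    }

    public static void main(String[] args) {
        SendThenStore fn = new SendThenStore();
        fn.process("ok", value -> CompletableFuture.completedFuture(null));
        try {
            fn.process("boom", value -> failed(new RuntimeException("network down")));
        } catch (CompletionException e) {
            // In a real function the exception would propagate and trigger the replay.
            System.out.println("send failed, state left at: " + fn.state);
        }
    }
}
```

The reverse failure (send succeeds, state update fails) is not covered by this ordering, which is the gap mentioned at the end of the paragraph above.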

Alternatively, you can also use another Pulsar topic as a K/V state store and publish state updates to that topic. By using message sequence IDs and idempotent producing, you can achieve exactly-once state updates. This solution will take more implementation work on the user's part.
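
The idea behind idempotent producing can be illustrated with a toy model: a log that remembers the highest sequence ID accepted per producer and drops anything at or below it, so a replayed state update is not applied twice. This is only an in-memory sketch under that assumption; DedupLog, its fields, and the producer name "fn-0" are hypothetical and not Pulsar classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a topic with sequence-ID based deduplication.
class DedupLog {
    final List<String> log = new ArrayList<>();
    // Highest sequence ID accepted so far, per producer name.
    final Map<String, Long> lastSeq = new HashMap<>();

    // Returns true if the message was appended, false if it was dropped as a duplicate.
    boolean publish(String producer, long sequenceId, String value) {
        long last = lastSeq.getOrDefault(producer, -1L);
        if (sequenceId <= last) {
            return false; // already seen: a replay after restart lands here
        }
        log.add(value);
        lastSeq.put(producer, sequenceId);
        return true;
    }

    public static void main(String[] args) {
        DedupLog stateTopic = new DedupLog();
        stateTopic.publish("fn-0", 0, "count=1");
        stateTopic.publish("fn-0", 0, "count=1"); // replayed update, dropped
        stateTopic.publish("fn-0", 1, "count=2");
        System.out.println(stateTopic.log);
    }
}
```

For this to work, the sequence ID has to be derived deterministically from the input message, so that a replay produces the same ID as the original attempt.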

@sijie is adding transaction support in Pulsar, so we can also see if we can consume, update function state, and publish message(s) all in a single transaction.

KIC (Author) · 2019-06-14T06:08:57Z

@jerrypeng thanks for the clarification, I did not realize that returning null is an option :-)

Regarding returning multiple events, I am not saying it is impossible, but it is a bit trickier than that:

> This method returns a CompletableFuture. You can always wait for the CompletableFuture to complete before updating the state. If there is a send failure, throw an exception; under the EFFECTIVELY_ONCE guarantee, the function instance will restart itself and replay the last message.

1. Imagine you have 10 futures: 5 completed, 5 failed.
2. You did not store the state (you would only do so if all futures completed).
3. You throw an exception.
4. The function restarts (EFFECTIVELY_ONCE).

Now you have the same state as the one you had computed but not stored (the function needs to be deterministic). But you have already sent 5 of 10 events, so you need to know that you only have to send messages 6 - 10 and then finally store the state. This could be done by querying the last message/event on the target topic. It would even hold if only the state store failed: in that case you throw an exception at stateStore(), and in the redo call of the function you see that 10 of 10 messages were sent, so you just store the state.
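
A rough sketch of this resume logic, to show what the function author would otherwise have to write by hand. Everything here is a hypothetical in-memory stand-in: the list plays the target topic, storedState plays the state store, failAfter simulates the network failure, and it assumes a single writer whose messages land on the topic in order.

```java
import java.util.ArrayList;
import java.util.List;

class ResumePartialSend {
    // Hypothetical in-memory stand-in for the target topic.
    final List<String> topic = new ArrayList<>();
    // Hypothetical stand-in for the stored state; null means "not stored yet".
    String storedState = null;
    // Failure injection: fail after this many sends in one attempt (negative = never fail).
    int failAfter = -1;

    // Deterministic: the same input always yields the same 10 outputs in the same order.
    static List<String> compute(String input) {
        List<String> outputs = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            outputs.add(input + "-" + i);
        }
        return outputs;
    }

    void process(String input) {
        List<String> outputs = compute(input);
        // "Query the last message on the target topic" to find where the previous attempt stopped.
        int alreadySent = topic.isEmpty() ? 0 : outputs.indexOf(topic.get(topic.size() - 1)) + 1;
        int sentThisAttempt = 0;
        for (int i = alreadySent; i < outputs.size(); i++) {
            if (failAfter >= 0 && sentThisAttempt == failAfter) {
                throw new IllegalStateException("simulated network failure");
            }
            topic.add(outputs.get(i));
            sentThisAttempt++;
        }
        // Store the state only once every output is on the topic.
        storedState = input;
    }

    public static void main(String[] args) {
        ResumePartialSend fn = new ResumePartialSend();
        fn.failAfter = 5;
        try {
            fn.process("e");
        } catch (IllegalStateException e) {
            System.out.println("first attempt sent " + fn.topic.size() + " messages");
        }
        fn.failAfter = -1; // the replay succeeds
        fn.process("e");
        System.out.println("after replay: " + fn.topic.size() + " messages, state=" + fn.storedState);
    }
}
```

On a real topic, identifying the last message that belongs to a given input would need something like a message property or sequence ID, which this sketch does not model.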

However, I just think it would be much more user-friendly if all of this were handled by the function caller, which could then be optimized as the project evolves.

tisonkun (Collaborator) · 2023-02-28T04:02:37Z

Converted to Discussions since no one seems to be actively working on this topic and there is no technical design yet.
