Using Akka Streamlets
An Akka-based streamlet has the following responsibilities:
Capturing your stream processing logic.
Publishing metadata which will be used by the
sbt-cloudflowplugin to verify that a blueprint is correct. This metadata consists of the shape of the streamlet (
StreamletShape) defined by the inlets and outlets of the streamlet. Connecting streamlets need to match on inlets and outlets to make a valid cloudflow topology, as mentioned previously in Composing streamlets using blueprints. Cloudflow automatically extracts the shapes from the streamlets to verify this match.
For the Cloudflow runtime, it needs to provide metadata so that it can be configured, scaled, and run as part of an application.
The inlets and outlets of a
Streamlethave two functions:
To specify to Cloudflow that the streamlet needs certain data streams to exist at runtime, which it will read from and write to, and
To provide handles inside the stream processing logic to connect to the data streams that Cloudflow provides. The
StreamletLogicprovides methods that take an inlet or outlet argument to read from or write to. These will be the specific data streams that Cloudflow has set up for you.
The next sections will go into the details of defining an Akka Streamlet:
defining inlets and outlets
creating a streamlet shape from the inlets and outlets
creating a streamlet logic that uses the inlets and outlets to read and write data
A streamlet can have one or more inlets and outlets. Cloudflow offers classes for
Outlets based on the
codec of the data they manipulate. Currently Cloudflow supports Avro and hence the classes are named
Each outlet also allows the user to define a partitioning function that will be used to partition the data.
StreamletShape captures the connectivity and compatibility details of an Akka-based streamlet.
It captures which—and how many—inlets and outlets the streamlet has and provides it to the runtime as metadata.
This metadata is used to verify that the blueprint connects the streamlets correctly.
To define the
StreamletShape, we first create one or more inlets and outlets.
Each inlet and outlet can be defined separately with specific names.
Outlets use a partitioning function to ensure effective data partitioning logic.
If not specified, the partitioning function defaults to the
RoundRobinPartitioner, which evenly distributes data between all partitions.
When you build your own Akka streamlet, you need to define the shape by listing all defined inlets and outlets in a
StreamletShape instance. You will learn how to do this in Building an Akka streamlet.
The next section describes how the
StreamletLogic captures stream processing logic.
The stream processing logic is captured in a
StreamletLogic, which is an abstract class. It provides:
ActorSystemto run Akka components in, a
Materializerfor Akka Stream graphs and an
ExecutionContextfor asynchronous calls (in Scala these are in implicit scope inside the
Configloaded from the classpath, from which can be read configuration settings. It is accessed using the
A subset of the Typesafe
Configthat is reserved for the streamlet, which can be used for deployment-time, user-configurable settings. It is accessed through the
Hooks to access data streams to read from inlets and write to outlets, for when you want to build your own custom streamlets.
StreamletLogic will often setup an Akka Stream graph and run it. The
RunnableGraphStreamletLogic, which extends
StreamletLogic, makes this very easy, you only need to provide the
RunnableGraph that you want to run.
You can access the items described above, like actor system and config, directly through methods on the
StreamletLogic, or through implicit scope.
StreamletLogic is only constructed once the streamlet is run, so it is safe to put instance values and variables in it that you would like to
use when the streamlet is running.
When you build your own Akka streamlet, you define both a
StreamletShape and a subclass of
StreamletLogic. More on that later.
In general terms, when you publish and deploy a Cloudflow application, each streamlet definition becomes a physical deployment as one or more Kubernetes artifacts. In the case of Akka-based streamlets, each streamlet is submitted as a Deployment resource. Upon deployment, it becomes a set of running pods depending in the scale factor for that streamlet.
A so-called runner runs inside every pod. The runner connects the streamlet to data streams, at which point the streamlet starts consuming from inlets. The Streamlet advances through the data streams that are provided on inlets and writes data to outlets.
If you make any changes to the streamlets and deploy the application again, existing pods will be stopped, new pods will be started to reflect the changes. It could be that pods are restarted in other ways, for instance by administrators.
This means that streamlets can get stopped, started or restarted at any moment in time. The next section about message delivery semantics explains the options that are available for how to continue processing data after a restart.
If you only want to process data that is arriving in real-time—ignoring data from the past—an Akka streamlet can simply start from the most recently arrived record. In this case, an at-most-once semantics is used (see Message delivery semantics). The streamlet will process data as it arrives. Data that arrives while the streamlet is not running will not be processed.
At-most-once message delivery semantics are applied to Akka streamlets that use standard Akka Stream graphs like
Akka Streamlets also supports at-least-once semantics for cases where the above mentioned options are not viable. In this case, offsets of incoming records are tracked, so that the streamlet can start from a specific offset, per inlet, on restart. Once records are written to outlets, the associated incoming record offsets are considered.
Akka streamlets that want to support at-least-once semantics, must use an Akka Stream
FlowWithContext that propagates
Committables (both offsets and batches of offsets) through the flow.
Cloudflow automatically propagates
Committables (offset details) through the flow and commits these in batches, acknowledging processed messages. Committed batches are not reread after a restart.
FlowWithContext provides a constrained set of operators compared to the Akka Stream
Flow. The subset of operators process records in-order.
There are some side effects to this approach. For instance, if you filter records, only offsets associated to records that are not filtered will be committed. This means that on restart, more re-processing—of previously filtered records—can occur than you might expect.
Next, Building an Akka Streamlet provides more details.