Understanding Streaming Data Applications with Cloudflow

Stream processing is the discipline and related set of techniques used to extract information from unbounded data. Streaming applications apply these techniques to build systems able to extract actionable insights from data as it arrives in the processing system.

The growing popularity of this technology is driven by the increasing availability of data from many sources and by enterprises' need to react to that data faster.

We characterize streaming applications as a connected graph of stream-processing components, where each component specializes in a particular task, following the 'right tool for the job' premise.

Fig. 1 - An Abstract Streaming Application

Figure 1 illustrates, in generic terms, an application that processes data events.

At the left, we recognize an initial stage where we capture or accept data. It could be an HTTP endpoint that accepts data from remote clients, a connection to a Kafka topic, or a connection to an internal enterprise system. The next component represents a processing phase that applies some logic to the data, such as business rules, statistical data analysis, or a machine learning model, and implements the business aspect of the application. This component may enrich the event with additional information and send it on as valid data to an external system, or flag the data as invalid and report it. We show these two output paths as the green and red stages to the right.

Each of these stages has different application requirements, scalability concerns, and, usually, different deployment strategies on Kubernetes.

The Cloudflow Approach to Creating Streaming Applications

Cloudflow offers a novel approach to creating, deploying, and managing streaming applications on Kubernetes. For the developer, it offers a toolbox that facilitates application creation. On the deployment side, it comes with a set of extensions that make Cloudflow streaming applications native to Kubernetes.

Let us break down the typical develop-test-package-deploy lifecycle of an application and see how Cloudflow helps you at each stage to accelerate the end-to-end process.

Develop with Streamlets: Focus on the Code that Matters

When creating streaming applications, a large portion of the development effort goes into the mechanics of streaming: configuring and setting up connections to the messaging system, serializing and deserializing messages, setting up fault-recovery storage, and so on, before we can finally add our business logic.

Fig 2. - Business Logic in a Streaming Application

Cloudflow introduces a new component model, called Streamlets. The Streamlet model offers an abstraction to:

  • Identify the connections of a component as inlets and outlets

  • Attach the schema of the data for each connection, and

  • Provide an entry point for the business logic of the component.

Cloudflow offers Streamlet backend implementations in Akka Streams, Apache Spark Structured Streaming, and Apache Flink.

For each supported Streamlet implementation, users can write their business logic in the native API of the backend, using either Java or Scala.

Apache Spark Structured Streaming is currently offered only in Scala.

Let us illustrate this important point with an example:

Spark Streamlet Example
// Imports follow the Cloudflow Spark API as published at the time of writing;
// exact package names may vary across Cloudflow versions.
import cloudflow.streamlets.StreamletShape
import cloudflow.streamlets.avro._                              // AvroInlet / AvroOutlet
import cloudflow.spark.{ SparkStreamlet, SparkStreamletLogic }
import cloudflow.spark.sql.SQLImplicits._                       // $-interpolator and encoders
import org.apache.spark.sql.functions._                         // window, avg
import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.streaming.OutputMode

class MovingAverageSparklet extends SparkStreamlet { (1)

  val in = AvroInlet[Data]("in")
  val out = AvroOutlet[Agg]("out", _.src)
  val shape = StreamletShape(in, out) (2)

  override def createLogic() = new SparkStreamletLogic {
    override def buildStreamingQueries = {  (3)

      val groupedData = readStream(in)  (4)
        .withColumn("ts", $"timestamp".cast(TimestampType))
        .withWatermark("ts", "1 minutes")
        .groupBy(window($"ts", "1 minute", "30 seconds"), $"src", $"gauge").agg(avg($"value") as "avg")
      val query = groupedData.select($"src", $"gauge", $"avg" as "value").as[Agg]

      writeStream(query, out, OutputMode.Append).toQueryExecution
    }
  }
}
1 SparkStreamlet is the base class that defines a Streamlet for the Apache Spark backend.
2 The StreamletShape defines the inlet(s) and outlet(s) of the Streamlet; each inlet/outlet is declared with its corresponding data type.
3 buildStreamingQueries is the entry point for the Streamlet logic.
4 The logic is written in plain Spark Structured Streaming, minus the boilerplate of creating sessions, connections, checkpoints, etc.

Akka Streams and Flink-based Streamlets follow the same pattern. Additionally, Cloudflow can be extended with new streaming backends.
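
To make that concrete, here is a minimal sketch of an Akka Streams-based Streamlet, reusing the Data type from the example above. The class and method names (AkkaStreamlet, RunnableGraphStreamletLogic, plainSource) follow the Cloudflow Akka Streams API as it existed at the time of writing and should be treated as illustrative rather than definitive.

Akka Streamlet Example (sketch)
import akka.stream.scaladsl.Sink

import cloudflow.akkastream._
import cloudflow.akkastream.scaladsl._
import cloudflow.streamlets.StreamletShape
import cloudflow.streamlets.avro._

// Illustrative egress: consumes Data records and prints them to the console.
class DataConsoleEgress extends AkkaStreamlet {

  val in    = AvroInlet[Data]("in")          // single inlet carrying Data records
  val shape = StreamletShape.withInlets(in)  // one inlet, no outlets

  override def createLogic = new RunnableGraphStreamletLogic() {
    // plainSource consumes from the inlet; the business logic is a simple Sink.
    def runnableGraph = plainSource(in).to(Sink.foreach(data => println(data)))
  }
}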

Composition and Testing: Sandbox Execution of a Complete Application

Once we have developed the different components of our application, we compose the end-to-end flow by creating a blueprint. The blueprint declares the Streamlet instances that belong to an application and how each Streamlet instance connects to the others. The following example shows a simple blueprint definition:

Blueprint Example
blueprint {
  streamlets { (1)
    ingress = sensors.SparkRandomGenDataIngress
    process = sensors.MovingAverageSparklet
    egress = sensors.SparkConsoleEgress
  }
  connections { (2)
    ingress.out = [process.in]
    process.out = [egress.in]
  }
}
1 streamlets section: declares instances of the Streamlets available in the application (or its dependencies)
2 connections section: declares how the outlets of each streamlet instance connect to the inlets of others.

It is interesting to note that declaring instances in the streamlets section lets us use multiple instances of the same Streamlet, potentially with different configurations, enabling component reuse.
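
To illustrate, a hypothetical variation of the blueprint above could declare two instances of MovingAverageSparklet under different names, with configuration applied per instance at deployment time (the instance names and the two-egress layout here are made up for illustration):

blueprint {
  streamlets {
    ingress     = sensors.SparkRandomGenDataIngress
    average-1m  = sensors.MovingAverageSparklet
    average-5m  = sensors.MovingAverageSparklet
    egress-1m   = sensors.SparkConsoleEgress
    egress-5m   = sensors.SparkConsoleEgress
  }
  connections {
    ingress.out     = [average-1m.in, average-5m.in]
    average-1m.out  = [egress-1m.in]
    average-5m.out  = [egress-5m.in]
  }
}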

With the blueprint file in place, we can verify whether our application is properly connected. For that, we use the verifyBlueprint task provided by the sbt-cloudflow plugin.

$ sbt verifyBlueprint
[info] Loading settings for project global-plugins from plugins.sbt ...
[info] Loading global plugins from /home/maasg/.sbt/1.0/plugins
[info] Loading settings for project spark-sensors-build from cloudflow-plugins.sbt,plugins.sbt ...
[info] Loading project definition from cloudflow/examples/spark-sensors/project
[info] Loading settings for project sparkSensors from build.sbt ...
[info] Set current project to spark-sensors (in build file:cloudflow/examples/spark-sensors/)
[info] Streamlet 'sensors.MovingAverageSparklet' found
[info] Streamlet 'sensors.SparkConsoleEgress' found
[info] Streamlet 'sensors.SparkRandomGenDataIngress' found
[success] /cloudflow/examples/spark-sensors/src/main/blueprint/blueprint.conf verified.

The blueprint verification checks that all the connections between Streamlets are compatible, using the schema information provided by the Streamlet.

Once the blueprint verification succeeds, we know that the components of our streaming application can talk to each other. We are now ready to run the complete application.

Enter the Sandbox

Cloudflow comes with a local execution mode called Sandbox. The Sandbox instantiates all Streamlets of an application’s blueprint with their connections in a single, local JVM.
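
The Sandbox is started from sbt; in the Cloudflow releases available at the time of writing, the sbt-cloudflow plugin exposes it as the runLocal task (treat the task name as indicative, since it may change across versions):

$ sbt runLocal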

We can see it in action in the following screencast.

Fig 3. - Running a Cloudflow App Locally

The Sandbox provides you with a minimalistic operational version of the complete application. You can use it to exercise the functionality of the application end-to-end and verify that it behaves as expected.

The Sandbox gives you a blazing-fast feedback loop for the functionality you are developing, removing the need to go through a full package, deploy, and launch cycle on a remote cluster.

Packaging: Build-generated Artifacts

Once we are confident that the application functions as we expect, we can build a package. Cloudflow applications are packaged as a single Docker image that contains the necessary dependencies to run the different Streamlets on their respective backends. That image gets published to a Docker registry of your choice.
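
As an illustration, in the Cloudflow releases available when this was written, packaging and publishing the image is a single sbt task (the task name may change across versions):

$ sbt buildAndPublish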

Deployment: kubectl Extensions for a YAML-less experience

At this stage, we are ready to deploy our application to a Cloudflow-enabled Kubernetes cluster. In contrast with the YAML-full experience that typical K8s deployments require, Cloudflow uses the blueprint information and the Streamlet definitions to auto-generate an application deployment.

Cloudflow also comes with a kubectl plugin that augments your local kubectl installation to work with Cloudflow applications. You use your usual kubectl commands to authenticate against the target cluster. Then, with the kubectl cloudflow plugin, you can deploy and manage a Cloudflow application as a single logical unit.

$ kubectl cloudflow deploy docker-registry/app-image:version
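
Other lifecycle operations follow the same single-logical-unit pattern. For example, with my-app as a placeholder application name, and command names that are indicative of the plugin's command set rather than an exhaustive or version-exact list:

$ kubectl cloudflow list
$ kubectl cloudflow status my-app
$ kubectl cloudflow undeploy my-app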

This method is not only developer-friendly but also compatible with typical CI/CD deployment pipelines, allowing you to take the application from development to production in a controlled way.

Conclusion

As a developer, Cloudflow gives you a set of powerful tools to accelerate the application development process:

  • The Streamlet API lets you focus on business value and use your knowledge of widely popular streaming runtimes, like Akka Streams, Apache Spark Structured Streaming, and Apache Flink, to create full-fledged streaming applications.

  • The blueprint lets you easily compose your application with the peace of mind that a verification phase, informed by schema definitions, provides.

  • The Sandbox lets you exercise the complete application in seconds, giving you a real-time feedback loop that speeds up the debugging and validation phases.

And with a fully developed application, the kubectl cloudflow plugin gives you the ability to deploy and control the lifecycle of your application on a Cloudflow-enabled K8s cluster.

Cloudflow takes away the pain of creating and deploying distributed applications on Kubernetes, speeds up your development process, and gives you full control over the operational deployment.

In a nutshell, it gives you distributed application development super-powers on Kubernetes.