Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Note
Apache Flink originated from the Stratosphere project and is currently moving to the Apache Incubator. The first Apache release is under preparation - the latest stable release of the system is from the Stratosphere project.

Project Overview

Flink features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.

Concise and Expressive APIs

Flink allows you to express algorithms in a concise fashion in the programming languages Java and Scala. Programs may freely compose many operations to long pipelines and mix and match built-in operations and user-defined functions.

Flink programs have support for highly efficient iterative algorithms, allowing the system to model complex tasks efficiently.

DataSet<String> input = env.readTextFile(inputPath);

input.flatMap(new FlatMapFunction() {
   public void flatMap(String value, Collector out) {
       for (String s : value.split(" ")) {
           out.collect(new Tuple2<String, Long>(s, 1L);
       }
   }
})
.groupBy(0)
.sum(1)
.writeAsText(outputPath);

System Stack

The Apache Flink stack consists of

  • Programming APIs for different languages (Java, Scala) and paradigms (record-oriented, graph-oriented).
  • A program optimizer that decides how to execute the program for good performance. It decides among other things about data movement and caching strategies.
  • A distributed runtime that executes programs in parallel distributed over many machines.

Flink runs independently from Hadoop, but integrates seamlessly with YARN (Hadoop's next-generation scheduler). Various file systems (including the Hadoop Distributed File System) can act as data sources.

Flink Stack