Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Recent Blog Posts

Stratosphere version 0.5 available (31 May 2014)
Stratosphere accepted as Apache Incubator Project (16 Apr 2014)
Stratosphere got accepted for Google Summer of Code 2014 (24 Feb 2014)
Use Stratosphere with Amazon Elastic MapReduce (18 Feb 2014)
Accessing Data Stored in MongoDB with Stratosphere (28 Jan 2014)
Optimizer Plan Visualization Tool (26 Jan 2014)

Note

Apache Flink originated from the Stratosphere project and is currently moving to the Apache Incubator. The first Apache release is under preparation; until it is available, the latest stable release of the system is the one from the Stratosphere project.

Project Overview

Flink features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
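To give a rough idea of what an incremental iteration means, here is a minimal plain-Java sketch. It is not the Flink API, and every name in it (IncrementalIterationSketch, connectedComponents, workset) is hypothetical. It computes connected components by keeping a solution set and a workset of changed vertices, so each step only touches the elements that changed in the previous step. Unlike a distributed runtime, it processes one vertex at a time on a single machine.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of an incremental (workset-based) iteration.
// This is NOT the Flink API; all names here are hypothetical.
class IncrementalIterationSketch {

    // Assigns every vertex the smallest vertex id in its connected component.
    // Each edges[i] = {u, v} is treated as an undirected edge.
    static Map<Integer, Integer> connectedComponents(int[][] edges) {
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        for (int[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }

        // Solution set: current component id per vertex (initially its own id).
        Map<Integer, Integer> component = new HashMap<>();
        // Workset: vertices whose component id changed and must be propagated.
        Deque<Integer> workset = new ArrayDeque<>();
        for (Integer v : neighbors.keySet()) {
            component.put(v, v);
            workset.add(v);
        }

        // Only vertices in the workset are processed; parts of the solution
        // that have stopped changing are not revisited.
        while (!workset.isEmpty()) {
            int v = workset.poll();
            int id = component.get(v);
            for (int n : neighbors.get(v)) {
                if (id < component.get(n)) {
                    component.put(n, id); // update the solution set
                    workset.add(n);       // schedule the changed vertex
                }
            }
        }
        return component;
    }

    public static void main(String[] args) {
        int[][] edges = { {1, 2}, {2, 3}, {4, 5} };
        // Vertices 1, 2, 3 end up in component 1; vertices 4, 5 in component 4.
        System.out.println(connectedComponents(edges));
    }
}
```

The loop ends as soon as the workset is empty, which is the point of the incremental style: the work per step shrinks as the solution converges, instead of recomputing the full result in every iteration.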

Concise and Expressive APIs

Flink lets you express algorithms concisely in Java and Scala. Programs can freely compose many operations into long pipelines and mix built-in operations with user-defined functions.

Flink programs also support iterative algorithms efficiently, which makes it practical to model complex tasks in the system.

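// Read the input file as a data set of lines.
DataSet<String> input = env.readTextFile(inputPath);

input.flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {
   // Emit a (word, 1) pair for every space-separated word in the line.
   public void flatMap(String value, Collector<Tuple2<String, Long>> out) {
       for (String s : value.split(" ")) {
           out.collect(new Tuple2<String, Long>(s, 1L));
       }
   }
})
.groupBy(0)   // group by the word (tuple field 0)
.sum(1)       // sum the counts (tuple field 1)
.writeAsText(outputPath);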
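For readers who want to check what the pipeline above computes without a cluster, the following plain-Java sketch produces the same word counts on a small in-memory list. It uses only java.util.stream, not Flink, and the class and method names (WordCountReference, countWords) are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java reference for the result of the word count pipeline; not Flink code.
class WordCountReference {

    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // Corresponds to the flatMap step: one element per space-separated word.
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // Corresponds to groupBy(0) followed by sum(1): count occurrences per word.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // "to" and "be" occur twice, "or" and "not" once.
        System.out.println(countWords(Arrays.asList("to be or", "not to be")));
    }
}
```

The sketch runs sequentially in one JVM, whereas the Flink program describes the same computation in a form that the system can execute in parallel.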

System Stack

The Apache Flink stack consists of:

Programming APIs for different languages (Java, Scala) and paradigms (record-oriented, graph-oriented).
A program optimizer that decides how to execute the program for good performance. Among other things, it chooses data movement and caching strategies.
A distributed runtime that executes programs in parallel across many machines.
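As an illustration of the kind of data movement decision an optimizer can make, here is a toy cost model for a join. It is a sketch under simplifying assumptions and does not describe Flink's actual optimizer: the names (ShipStrategySketch, choose) and the cost formulas are hypothetical, and only network volume is considered.

```java
// Toy cost model for choosing a data movement strategy for a join.
// Hypothetical names and formulas; this is not Flink's optimizer.
class ShipStrategySketch {

    enum Strategy { BROADCAST_SMALLER_INPUT, REPARTITION_BOTH_INPUTS }

    // Sizes are in bytes; parallelism is the number of parallel join instances.
    static Strategy choose(long leftBytes, long rightBytes, int parallelism) {
        // Broadcasting sends a full copy of the smaller input to every instance.
        long broadcastCost = Math.min(leftBytes, rightBytes) * parallelism;
        // Repartitioning sends each record of both inputs over the network at most once.
        long repartitionCost = leftBytes + rightBytes;
        return broadcastCost < repartitionCost
                ? Strategy.BROADCAST_SMALLER_INPUT
                : Strategy.REPARTITION_BOTH_INPUTS;
    }

    public static void main(String[] args) {
        // A tiny input joined with a large one: copying the tiny input everywhere is cheaper.
        System.out.println(choose(1_000L, 10_000_000L, 8));
        // Two inputs of similar size: repartitioning both is cheaper.
        System.out.println(choose(5_000_000L, 10_000_000L, 8));
    }
}
```

A real optimizer would weigh more factors, such as existing partitioning, sort orders, and memory for caching, but the principle of comparing estimated costs of alternative execution plans is the same.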

Flink runs independently of Hadoop but integrates seamlessly with YARN (Hadoop's next-generation scheduler). Various file systems, including the Hadoop Distributed File System (HDFS), can act as data sources.