Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Flink features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
Flink allows you to express algorithms concisely in Java and Scala. Programs may freely compose many operations into long pipelines and mix and match built-in operations with user-defined functions.
Flink's built-in support for highly efficient iterative algorithms allows complex analysis tasks to be modeled and executed efficiently. As a first example, the following WordCount program reads a text file, counts how often each word occurs, and writes the result back to a file:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);

input.flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Long>> out) {
            // emit a (word, 1) pair for every whitespace-separated token
            for (String s : value.split(" ")) {
                out.collect(new Tuple2<String, Long>(s, 1L));
            }
        }
    })
    .groupBy(0)               // group by the word (tuple field 0)
    .sum(1)                   // sum the counts (tuple field 1)
    .writeAsText(outputPath); // write (word, count) pairs as text
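To make the iteration support concrete, here is a minimal sketch of a bulk iteration with the Java DataSet API. It estimates Pi by random sampling; the sample count of 10,000 and the variable names are illustrative choices, not taken from the text above.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Start an iteration that runs 10,000 supersteps, beginning with the value 0.
IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);

// In each superstep, sample one random point and add 1 if it falls inside the unit circle.
DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer i) {
        double x = Math.random();
        double y = Math.random();
        return i + ((x * x + y * y < 1) ? 1 : 0);
    }
});

// Close the loop: the result of each superstep becomes the input of the next.
DataSet<Integer> count = initial.closeWith(iteration);

// Scale the hit count into an estimate of Pi and print it.
count.map(new MapFunction<Integer, Double>() {
    @Override
    public Double map(Integer c) {
        return c / 10000.0 * 4;
    }
}).print();

The closeWith call defines the feedback edge of the loop, so the entire iteration runs inside a single Flink job rather than as a sequence of separately submitted jobs.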
The Apache Flink stack consists of the programming APIs, the program optimizer, and the distributed runtime.
Flink runs independently of Hadoop, but integrates seamlessly with YARN (Hadoop's next-generation resource manager). Various file systems, including the Hadoop Distributed File System (HDFS), can act as data sources.
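For example, reading input directly from HDFS only requires pointing a source at an hdfs:// URI. This is a sketch assuming the Hadoop file system dependencies are on the classpath; the namenode host, port, and path below are placeholders, not values from this document.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Placeholder HDFS location: replace host, port, and path with your own.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> lines = env.readTextFile("hdfs://namenode:9000/path/to/input");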