Features

More Operators

Stratosphere extends the well-known MapReduce model with new operators. These operators represent many common data analysis tasks more naturally and efficiently. All operators will start working in memory and gracefully go out of core under memory pressure.

Advanced Data Flow Graphs

Stratosphere allows to model analysis programs as advanced data flow graphs. For many applications, this is a more natural fit than the constrained MapReduce interface (strictly Map followed by Reduce). Furthermore, data pipelining and in-memory data transfers increase performance by drastically reducing disk and network I/O.

See how Hadoop does complex data flows

Complex Data Flows in Hadoop

Executing the plan shown on the left using MapReduce leads to a composition of multiple MapReduce jobs. Intermediate results are stored in HDFS after each job. This causes a lot of network and disk I/O. Remember also that a just the setup of a MapReduce job itself takes some time. This example shows that many real world applications do not fit the MapReduce model. Also, the implementation of complex data flows using MapReduce is very time-consuming.

Stratosphere is able to natively execute the job. Everything is processed in-memory. Only if the data does not fit into the memory anymore, it starts using the local hard disks.

Powerful Programming Interfaces

You can write data analysis programs for Stratosphere in Java or Scala. Both APIs provide a powerful yet easy-to-use abstraction to compose data analysis programs by applying customizable transformations such as map, filter, reduce, and join on data sets. Stratosphere's high-level APIs hide the complexities of parallel programming and efficient data processing from the user. Behind the scenes, the Stratosphere optimizer compiles such programs into efficient, parallel data flows which are executed on a cluster or a local machine.

See our Quickstart guides for Scala and Java

Word count in Stratosphere using Scala:

val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
  .groupBy { word => word }
  .count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))

Support for Iterative Algorithms

Data Mining, Machine Learning and Graph processing algorithms often require to loop over the working data multiple times. Stratosphere supports iterative algorithms in its core. (The runtime allows for very fast iteration times and the optimizer deals with caching loop-invariant data.) The advanced incremental iterations support algorithms that focus on the "hot part" of the evolving solution and may converge even faster.

Iterative algorithms with Hadoop

Iterations with Hadoop

Iterative algorithms are implemented in Hadoop MapReduce using a central driver that spawns MapReduce jobs until the result has been computed. This approach has many disadvantages:

Each MapReduce Job needs at least 30 seconds for setup
Data is transferred only between jobs using HDFS (in-memory would be much faster)
Everything has to be read over and over again, even if it has already finished
Very time consuming to implement

Stratosphere natively executes iterative algorithms. The result of the last operator is fed back to the input of the first operator (in-memory). It is not required to start a new job on each iteration. Stratosphere detects which parts of the data need processing for further iterations. Only those are loaded.

High-Performance Execution Runtime

Stratosphere features an own high-performance, massively-parallel execution runtime which has been built from ground up leveraging processing techniques of parallel database systems. The engine supports low-latency processing concepts such as pipelined execution, in-memory processing, and push-based data shipping as well as sort- and hash-based processing algorithms which go gracefully out-of-core if main memory is not sufficient.

Built-In Optimizer

Stratosphere comes with an optimizer that is independent of the actual programming interface. It chooses a fitting execution strategy depending on the inputs and operations. For example the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge-join or a hybrid hash join algorithm.

Cost-based optimizer choice of operators and shipping strategies.
In-memory pipelining of operators
Reduction of shipped and written data volume
Global memory distribution
Input Sampling to determine cardinalities

Focus on your application logic rather than parallel execution.

System Stack

Stratosphere seamlessly integrates into existing Hadoop setups and runs side-by-side with Hadoop's TaskTrackers and DataNodes. Stratosphere can read data from Hadoop sources, but comes with its own efficient runtime. Similar to Hadoop, Stratosphere scales by adding more machines to the cluster.
Stratosphere runs also on Hadoop 2.2 (YARN), so you do not need to change your infrastructure.
The Local execution mode allows to debug and analyze your application right from your favorite IDE, without having Stratosphere installed.

Open Source Community and Support

Stratosphere is an active, community driven open-source project. It is licensed under the Apache License.
Our friendly community is always open to new users and developers. Join us and shape the future of Big Data.

Mailing List (Dev)

Open an Issue on GitHub

Next Steps

Download and try Stratosphere. Our Quickstart scripts make it easy for developers to create an empty Stratosphere program skeleton to start from. Dependencies are seamlessly handled by Maven without any installation. Testing and debugging is possible directly inside the IDE with Stratosphere's embedded mode. Ready-made binaries for cluster setups are available as well.

Visit bigdataclass.org for Stratosphere programming exercises.

Stratosphere

Stratosphere is the next generation Big Data Analytics Platform.

Easy to Install

Easy to Use

Advanced Analytics

Run in the Cloud

Performance

Empowering Data Scientists