Answer how Stratosphere compares to Apache Spark #36

@rmetzger

This message from our mailing list, posted by @fhueske, might be a good skeleton:

Similar to Spark, Stratosphere is a complete data processing system, i.e., it has a programming API, a program compiler (optimizer), and its own execution runtime.
It is also an alternative to Hadoop MapReduce and is quite similar to Spark in several design points:

  • Programs are executed as DAGs
  • Higher-level programming primitives (compared to Hadoop MR)
  • APIs in Scala and Java
  • Reads data from external data stores (has no data storage of its own), e.g., HDFS, S3, RDBMS, ...
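
To illustrate the first point, here is a minimal sketch of what "executed as a DAG" means: each operator runs once all of its upstream operators have produced their output. This is plain Python and not Stratosphere's or Spark's API; the operator names (`orders`, `users`, `join`, `count`) and the `run_dag` helper are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dataflow: two sources feed a join, whose output is counted.
# Each entry maps an operator name to (function, list of upstream operators).
operators = {
    "orders": (lambda: [(1, "book"), (2, "pen")], []),
    "users": (lambda: [(1, "ann"), (2, "bob")], []),
    "join": (
        lambda orders, users: [
            (uid, name, item)
            for uid, item in orders
            for other_uid, name in users
            if uid == other_uid
        ],
        ["orders", "users"],
    ),
    "count": (lambda rows: len(rows), ["join"]),
}


def run_dag(ops):
    """Run each operator after all of its upstream operators have finished."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    sorter = TopologicalSorter({name: deps for name, (_, deps) in ops.items()})
    results = {}
    for name in sorter.static_order():
        func, deps = ops[name]
        # Pass the upstream results in the order the operator declares them.
        results[name] = func(*(results[dep] for dep in deps))
    return results


results = run_dag(operators)
```

A real engine would run independent operators in parallel and stream records between them; the sketch only shows the dependency ordering.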

However, Stratosphere is also different in some aspects:

  • Database-inspired processing using pipelining, gradually spilling to disk if memory is not sufficient (hybrid hash joins, external sorts)
  • Sophisticated cost-based optimizer choosing execution strategies (broadcasting vs. partitioning, sort vs. hash joins, ...)
  • Implemented in Java (in contrast to Spark, which uses Scala)
  • No in-memory materialization of intermediate results yet (this is on the roadmap)
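
As a rough illustration of the optimizer point above, the sketch below picks between broadcasting and repartitioning for a join by comparing estimated network cost. The cost model and the function name `choose_join_strategy` are assumptions made for this example; they are not taken from Stratosphere's actual optimizer.

```python
def choose_join_strategy(left_bytes, right_bytes, parallelism):
    """Pick the shipping strategy with the lower estimated network cost.

    Toy cost model (an assumption, not Stratosphere's real optimizer):
      - broadcast: the smaller input is copied to every parallel worker
      - repartition: both inputs are hash-partitioned and shipped once
    """
    broadcast_cost = min(left_bytes, right_bytes) * parallelism
    repartition_cost = left_bytes + right_bytes
    if broadcast_cost < repartition_cost:
        return "broadcast", broadcast_cost
    return "repartition", repartition_cost


# A tiny lookup table joined with a large fact table favors broadcasting.
small_vs_large = choose_join_strategy(10, 10_000, parallelism=8)

# Two inputs of similar size favor repartitioning both sides.
similar_sizes = choose_join_strategy(5_000, 6_000, parallelism=8)
```

A real cost-based optimizer would also weigh factors such as existing partitioning, sort orders, and available memory when choosing between sort-based and hash-based joins.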

Stratosphere and Spark are best seen as alternatives to each other.
We do not build on any of Spark's components, as we have our own programming API and execution engine.
