Spark

Apache Spark is a multi-language engine for running data engineering, data science, and machine learning workloads on single-node machines or clusters. It is a data processing framework built to do two things: process very large data sets quickly, and distribute that processing across multiple computers, either on its own or in tandem with other distributed computing tools. Both qualities matter for big data and machine learning, which need massive computing power to crunch through large data stores. Spark also takes some of the programming burden off developers with an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing.

Why Spark?

  • Supports an array of programming languages like Java, Python, Scala, SQL, and R. 
  • Offers SQL analytics: fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. For these workloads it can run faster than most data warehouses.
  • Spark’s in-memory data engine can perform tasks up to one hundred times faster than MapReduce in certain situations, particularly in multi-stage jobs where MapReduce must write state back out to disk between stages.
  • The Apache Spark API is developer-friendly: it hides much of the complexity of a distributed processing engine behind simple method calls.
  • Supports authentication through a shared secret. The spark.authenticate configuration parameter controls whether authentication is enabled.
  • The web UI can be secured with HTTPS/SSL settings and with javax servlet filters configured through the spark.ui.filters setting.
  • Spark supports SASL encryption and SSL for HTTP protocols. It also supports AES-based encryption for RPC connections.
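
The security settings in the list above can be gathered in one place. The following is a minimal sketch in plain Python (it does not require Spark to run) that collects the relevant properties and renders them as lines for a spark-defaults.conf file. The helper names build_security_conf and to_properties and the filter class com.example.AuthFilter are hypothetical, chosen only for illustration. The property keys are standard Spark configuration names, but a real deployment needs more than this: SSL, for instance, also requires keystore settings, and the secret should come from a secure source rather than a literal string.

```python
def build_security_conf(shared_secret, ui_filter_classes=()):
    """Return a dict of security-related Spark properties (illustrative helper)."""
    conf = {
        # Shared-secret authentication between Spark processes.
        "spark.authenticate": "true",
        "spark.authenticate.secret": shared_secret,
        # AES-based encryption for RPC connections.
        "spark.network.crypto.enabled": "true",
        # SSL for HTTP endpoints; keystore settings are omitted in this sketch.
        "spark.ssl.enabled": "true",
    }
    if ui_filter_classes:
        # Comma-separated javax servlet filter class names for the web UI.
        conf["spark.ui.filters"] = ",".join(ui_filter_classes)
    return conf


def to_properties(conf):
    """Render the dict as 'key value' lines, the layout spark-defaults.conf accepts."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


# Example usage with a placeholder secret and a hypothetical filter class.
example_conf = build_security_conf("change-me", ["com.example.AuthFilter"])
print(to_properties(example_conf))
```

The same dictionary could instead be passed as --conf key=value pairs to spark-submit, although putting the secret on a command line exposes it to other users on the machine, so a protected configuration file is usually the safer choice.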