Getting Started with Apache Spark

Getting Started with Apache Spark

Apache Spark is a high performance general engine used to process large scale data. It is an open source framework used for cluster computing. The aim of this framework is to make data analytics faster ? both in terms of development and execution. This article will discuss Apache Spark and the various aspects of this framework.

Introduction

Apache Spark is an open source framework for cluster computing. It is built on top of the Hadoop Distributed File System (HDFS). It doesn’t use the two stage Map Reduce paradigm, but it does promise up to 100 times faster performance for certain applications. Spark also provides the initial leads for cluster computing within the memory. This enables the application programs to load the data into the memory of a cluster so that it can be queried repeatedly. This in-memory computation makes Spark one of the most important components in the big data computation world.

Features

Now let’s briefly discuss the features included in Apache Spark:

  • APIs based on Java, Scala and Python.
  • Scalability ranging from 80 to 100 nodes.
  • Ability to cache dataset within the memory for interactive data set. For example, extract a working set, cache it and query it repeatedly.
  • Efficient library for stream processing.
  • Efficient library for machine learning and graph processing.

When discussing Spark in the context of data science, note that Spark has the ability to maintain the resident data in the memory. This approach enhances the performance as compared with Map Reduce. Looking from the top, Spark contains a driver program that runs the main method of the client and executes various operations in parallel mode on a clustered environment.

Spark provides resilient distributed dataset (RDD) that is a collection of elements that are distributed across the different nodes of cluster, so that they can be executed in parallel. Spark has the ability to store an RDD in the memory, thus allowing it to be reused efficiently across parallel execution. RDDs can also automatically recover in case of the node failure.

Spark also provides shared variables that are used in parallel operations. When Spark runs in parallel as a set of tasks on different nodes, it transfers a copy of each variable to every task. These variables are also shared across different tasks. In Spark we have two types of shared variables:

  • broadcast variables – used to cache a value in memory
  • accumulators – used in case of counters and sums

Configuring Spark

Spark provides three main areas for configuration:

  • Spark properties – This controls most of the application and can be set by either using the SparkConf object or with the help of Java system properties.
  • Environment Variables – These can be used to configure machine based settings such as ip address with the help of conf/spark-env.sh script on every single node.
  • Logging – This can be configured using the standard log4j properties.

Spark Properties: Spark properties control most of the application settings and should be configured separately for separate applications. These properties can be set using the SparkConf object and is passed to the SparkContext. SparkConf allows us to configure most of the common properties to initialize. Using the set () method of SparkConf class we can also set key value pairs. A sample code using the set () method is shown below:

Listing 1: Sample showing the Set method

val conf = new SparkConf ()             . setMaster( "aws" )              . setAppName( "My Sample SPARK application" )             . set( "spark.executor.memory" , "1g" )val sc = new SparkContext (conf)

Some of the common properties are:

  • spark.executor.memory – This indicates the amount of memory to be used per executor. ?
  • spark.serializer – Class used to serialize objects that will be sent over the network. Since the default java serialization is quite slow, it is recommended to use the org.apache.spark.serializer.JavaSerializer class to get a better performance.
  • spark.kryo.registrator – Class used to register the custom classes if we use the Kyro serialization
  • spark.local.dir – locations that Spark uses as scratch space to store the map output files.
  • spark.cores.max – Used in standalone mode to specify the maximum amount of CPU cores to request.

Environment Variables: Some of the Spark settings can be configured using the environment variables that are defined in the conf/spark-env.sh script file. These are machine specific settings, such as library search path, java path, etc. Some of the commonly used environment variables are:

  • JAVA_HOME – Location where JAVA is installed on your system.
  • PYSPARK_PYTHON – The python library used for PYSPARK.
  • SPARK_LOCAL_IP – IP address of the machine that is to be bound.
  • SPARK_CLASSPATH – Used to add the libraries that are used at runtime to execute.
  • SPARK_JAVA_OPTS – Used to add the JVM options

Logging: Spark uses the standard Log4j API for logging that can be configured using the log4j. properties file.

Initializing Spark

To start with a Spark program, the first thing is to create a JavaSparkContext object that tells Spark to access the cluster. To create a Spark context we first create Spark conf object as shown below:

Listing 2: Initializing the Spark context object

SparkConf config = new SparkConf().setAppName(applicationName).setMaster(master);JavaSparkContext conext = new JavaSparkContext(config);

The parameter applicationName is the name of our application that is shown on the cluster UI. The parameter master is the cluster URL or a local string used to run in local mode.

Resilient Distributed Datasets (RDDs)

Spark is based on the concept of resilient distributed dataset or RDD. RDD is a fault-tolerant collection of elements that can be operated in parallel. RDD can be created using either of the following two methods:

  • By Parallelizing an existing collection – Parallelized collections are created by calling the parallelize method of the JavaSparkContext class in the driver program. Elements of the collection are copied from an existing collection that can be operated in parallel.
  • By Referencing the dataset on an external storage system – Spark has the ability to create distributed datasets from any Hadoop supported storage space, HDFS, Cassandra, Hbase, etc.

RDD Operations

RDD supports two types of operations:

  • Transformations – Used to create new datasets from an existing one.
  • Actions – This returns a value to the driver program after executing the code on the dataset.

In RDD the transformations are lazy. Transformations do not compute their results right away. Rather they just remember the transformations that are applied to the base datasets.

Summary

Hopefully you are now more familiar with the different aspects of the Apache Spark framework and its implementation. The performance of Spark over a common MapReduce job is also one of the most important aspects we should understand clearly.

?

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, java/j2ee, Open source and big data technologies. You can find more of his work at www.techalpine.com and you can email him here.

devx-admin

devx-admin

Share the Post:
Software Development

Top Software Development Companies

Looking for the best in software development? Our list of Top Software Development Companies is your gateway to finding the right tech partner. Dive in

India Web Development

Top Web Development Companies in India

In the digital race, the right web development partner is your winning edge. Dive into our curated list of top web development companies in India,

USA Web Development

Top Web Development Companies in USA

Looking for the best web development companies in the USA? We’ve got you covered! Check out our top 10 picks to find the right partner

Clean Energy Adoption

Inside Michigan’s Clean Energy Revolution

Democratic state legislators in Michigan continue to discuss and debate clean energy legislation in the hopes of establishing a comprehensive clean energy strategy for the

Chips Act Revolution

European Chips Act: What is it?

In response to the intensifying worldwide technology competition, Europe has unveiled the long-awaited European Chips Act. This daring legislative proposal aims to fortify Europe’s semiconductor

Revolutionized Low-Code

You Should Use Low-Code Platforms for Apps

As the demand for rapid software development increases, low-code platforms have emerged as a popular choice among developers for their ability to build applications with

Software Development

Top Software Development Companies

Looking for the best in software development? Our list of Top Software Development Companies is your gateway to finding the right tech partner. Dive in and explore the leaders in

India Web Development

Top Web Development Companies in India

In the digital race, the right web development partner is your winning edge. Dive into our curated list of top web development companies in India, and kickstart your journey to

USA Web Development

Top Web Development Companies in USA

Looking for the best web development companies in the USA? We’ve got you covered! Check out our top 10 picks to find the right partner for your online project. Your

Clean Energy Adoption

Inside Michigan’s Clean Energy Revolution

Democratic state legislators in Michigan continue to discuss and debate clean energy legislation in the hopes of establishing a comprehensive clean energy strategy for the state. A Senate committee meeting

Chips Act Revolution

European Chips Act: What is it?

In response to the intensifying worldwide technology competition, Europe has unveiled the long-awaited European Chips Act. This daring legislative proposal aims to fortify Europe’s semiconductor supply chain and enhance its

Revolutionized Low-Code

You Should Use Low-Code Platforms for Apps

As the demand for rapid software development increases, low-code platforms have emerged as a popular choice among developers for their ability to build applications with minimal coding. These platforms not

Cybersecurity Strategy

Five Powerful Strategies to Bolster Your Cybersecurity

In today’s increasingly digital landscape, businesses of all sizes must prioritize cyber security measures to defend against potential dangers. Cyber security professionals suggest five simple technological strategies to help companies

Global Layoffs

Tech Layoffs Are Getting Worse Globally

Since the start of 2023, the global technology sector has experienced a significant rise in layoffs, with over 236,000 workers being let go by 1,019 tech firms, as per data

Huawei Electric Dazzle

Huawei Dazzles with Electric Vehicles and Wireless Earbuds

During a prominent unveiling event, Huawei, the Chinese telecommunications powerhouse, kept quiet about its enigmatic new 5G phone and alleged cutting-edge chip development. Instead, Huawei astounded the audience by presenting

Cybersecurity Banking Revolution

Digital Banking Needs Cybersecurity

The banking, financial, and insurance (BFSI) sectors are pioneers in digital transformation, using web applications and application programming interfaces (APIs) to provide seamless services to customers around the world. Rising

FinTech Leadership

Terry Clune’s Fintech Empire

Over the past 30 years, Terry Clune has built a remarkable business empire, with CluneTech at the helm. The CEO and Founder has successfully created eight fintech firms, attracting renowned

The Role Of AI Within A Web Design Agency?

In the digital age, the role of Artificial Intelligence (AI) in web design is rapidly evolving, transitioning from a futuristic concept to practical tools used in design, coding, content writing

Generative AI Revolution

Is Generative AI the Next Internet?

The increasing demand for Generative AI models has led to a surge in its adoption across diverse sectors, with healthcare, automotive, and financial services being among the top beneficiaries. These

Microsoft Laptop

The New Surface Laptop Studio 2 Is Nuts

The Surface Laptop Studio 2 is a dynamic and robust all-in-one laptop designed for creators and professionals alike. It features a 14.4″ touchscreen and a cutting-edge design that is over

5G Innovations

GPU-Accelerated 5G in Japan

NTT DOCOMO, a global telecommunications giant, is set to break new ground in the industry as it prepares to launch a GPU-accelerated 5G network in Japan. This innovative approach will

AI Ethics

AI Journalism: Balancing Integrity and Innovation

An op-ed, produced using Microsoft’s Bing Chat AI software, recently appeared in the St. Louis Post-Dispatch, discussing the potential concerns surrounding the employment of artificial intelligence (AI) in journalism. These

Savings Extravaganza

Big Deal Days Extravaganza

The highly awaited Big Deal Days event for October 2023 is nearly here, scheduled for the 10th and 11th. Similar to the previous year, this autumn sale has already created

Cisco Splunk Deal

Cisco Splunk Deal Sparks Tech Acquisition Frenzy

Cisco’s recent massive purchase of Splunk, an AI-powered cybersecurity firm, for $28 billion signals a potential boost in tech deals after a year of subdued mergers and acquisitions in the

Iran Drone Expansion

Iran’s Jet-Propelled Drone Reshapes Power Balance

Iran has recently unveiled a jet-propelled variant of its Shahed series drone, marking a significant advancement in the nation’s drone technology. The new drone is poised to reshape the regional

Solar Geoengineering

Did the Overshoot Commission Shoot Down Geoengineering?

The Overshoot Commission has recently released a comprehensive report that discusses the controversial topic of Solar Geoengineering, also known as Solar Radiation Modification (SRM). The Commission’s primary objective is to

Remote Learning

Revolutionizing Remote Learning for Success

School districts are preparing to reveal a substantial technological upgrade designed to significantly improve remote learning experiences for both educators and students amid the ongoing pandemic. This major investment, which

Revolutionary SABERS Transforming

SABERS Batteries Transforming Industries

Scientists John Connell and Yi Lin from NASA’s Solid-state Architecture Batteries for Enhanced Rechargeability and Safety (SABERS) project are working on experimental solid-state battery packs that could dramatically change the

Build a Website

How Much Does It Cost to Build a Website?

Are you wondering how much it costs to build a website? The approximated cost is based on several factors, including which add-ons and platforms you choose. For example, a self-hosted

Battery Investments

Battery Startups Attract Billion-Dollar Investments

In recent times, battery startups have experienced a significant boost in investments, with three businesses obtaining over $1 billion in funding within the last month. French company Verkor amassed $2.1