Chapter 05: Loading and Saving Your Data
Learning Spark
by Holden Karau et al.
Overview: Loading and saving your data
 Motivation
 File Formats
 Text files
 JSON
 CSV and TSV
 SequenceFiles
 Object Files
 Hadoop Input and Output Formats
 File Compression
 Filesystems
 Local/"Regular" FS
 Amazon S3
 HDFS
Overview: Loading and saving your data
 Structured Data with Spark SQL
 Apache Hive
 JSON
 Databases
 Java Database Connectivity
 Cassandra
 HBase
 Elasticsearch
edX and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
5.1 Motivation
 Spark supports a wide range of input and output
sources.
 Spark can access data through the InputFormat and
OutputFormat interfaces used by Hadoop
MapReduce, which are available for many common
file formats and storage systems (e.g., S3, HDFS,
Cassandra, HBase).
 You will want to use higher-level APIs built on top of
these raw interfaces.
5.2 File Formats
5.2.1 Text Files
Example 5-1. Loading a text file in Python
input = sc.textFile("file:///home/holden/repos/spark/README.md")
Example 5-2. Loading a text file in Scala
val input = sc.textFile("file:///home/holden/repos/spark/README.md")
Example 5-3. Loading a text file in Java
JavaRDD<String> input = sc.textFile("file:///home/holden/repos/spark/README.md")
5.2.1 Text Files
Example 5-4. Average value per file in Scala
val input = sc.wholeTextFiles("file:///home/holden/salesFiles")
val result = input.mapValues{y =>
  val nums = y.split(" ").map(x => x.toDouble)
  nums.sum / nums.size.toDouble
}
Example 5-5. Saving as a text file in Python
result.saveAsTextFile(outputFile)
5.2.2 JSON
Example 5-6. Loading unstructured JSON in Python
import json
data = input.map(lambda x: json.loads(x))
5.2.2 JSON
Example 5-7. Loading JSON in Scala
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature
...
case class Person(name: String, lovesPandas: Boolean) // Must be a top-level class
...
// Parse it into a specific case class. We use flatMap to handle errors
// by returning an empty list (None) if we encounter an issue and a
// list with one element if everything is ok (Some(_)).
val result = input.flatMap(record => {
  try {
    Some(mapper.readValue(record, classOf[Person]))
  } catch {
    case e: Exception => None
  }
})
5.2.2 JSON
Example 5-8. Loading JSON in Java
class ParseJson implements FlatMapFunction<Iterator<String>, Person> {
  public Iterable<Person> call(Iterator<String> lines) throws Exception {
    ArrayList<Person> people = new ArrayList<Person>();
    ObjectMapper mapper = new ObjectMapper();
    while (lines.hasNext()) {
      String line = lines.next();
      try {
        people.add(mapper.readValue(line, Person.class));
      } catch (Exception e) {
        // skip records on failure
      }
    }
    return people;
  }
}
JavaRDD<String> input = sc.textFile("file.json");
JavaRDD<Person> result = input.mapPartitions(new ParseJson());
5.2.2 JSON
Example 5-9. Saving JSON in Python
(data.filter(lambda x: x['lovesPandas'])
     .map(lambda x: json.dumps(x))
     .saveAsTextFile(outputFile))
Example 5-10. Saving JSON in Scala
result.filter(p => p.lovesPandas).map(mapper.writeValueAsString(_))
  .saveAsTextFile(outputFile)
5.2.2 JSON
Example 5-11. Saving JSON in Java
class WriteJson implements FlatMapFunction<Iterator<Person>, String> {
  public Iterable<String> call(Iterator<Person> people) throws Exception {
    ArrayList<String> text = new ArrayList<String>();
    ObjectMapper mapper = new ObjectMapper();
    while (people.hasNext()) {
      Person person = people.next();
      text.add(mapper.writeValueAsString(person));
    }
    return text;
  }
}
JavaRDD<Person> result = input.mapPartitions(new ParseJson()).filter(new LikesPandas());
JavaRDD<String> formatted = result.mapPartitions(new WriteJson());
formatted.saveAsTextFile(outfile);
5.2.3 CSV/TSV
Example 5-12. Loading CSV with textFile() in Python
import csv
import StringIO
...
def loadRecord(line):
    """Parse a CSV line"""
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    return reader.next()
input = sc.textFile(inputFile).map(loadRecord)
5.2.3 CSV/TSV
Example 5-13. Loading CSV with textFile() in Scala
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader
...
val input = sc.textFile(inputFile)
val result = input.map{ line =>
  val reader = new CSVReader(new StringReader(line));
  reader.readNext();
}
5.2.3 CSV/TSV
Example 5-14. Loading CSV with textFile() in Java
import au.com.bytecode.opencsv.CSVReader;
import java.io.StringReader;
...
public static class ParseLine implements Function<String, String[]> {
  public String[] call(String line) throws Exception {
    CSVReader reader = new CSVReader(new StringReader(line));
    return reader.readNext();
  }
}
JavaRDD<String> csvFile1 = sc.textFile(inputFile);
JavaRDD<String[]> csvData = csvFile1.map(new ParseLine());
5.2.3 CSV/TSV
Example 5-15. Loading CSV in full in Python
def loadRecords(fileNameContents):
    """Load all the records in a given file"""
    input = StringIO.StringIO(fileNameContents[1])
    reader = csv.DictReader(input, fieldnames=["name", "favoriteAnimal"])
    return reader
fullFileData = sc.wholeTextFiles(inputFile).flatMap(loadRecords)
5.2.3 CSV/TSV
Example 5-16. Loading CSV in full in Scala
case class Person(name: String, favoriteAnimal: String)
val input = sc.wholeTextFiles(inputFile)
val result = input.flatMap{ case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt));
  reader.readAll().map(x => Person(x(0), x(1)))
}
5.2.3 CSV/TSV
Example 5-17. Loading CSV in full in Java
public static class ParseLine
    implements FlatMapFunction<Tuple2<String, String>, String[]> {
  public Iterable<String[]> call(Tuple2<String, String> file) throws Exception {
    CSVReader reader = new CSVReader(new StringReader(file._2()));
    return reader.readAll();
  }
}
JavaPairRDD<String, String> csvData = sc.wholeTextFiles(inputFile);
JavaRDD<String[]> keyedRDD = csvData.flatMap(new ParseLine());
5.2.3 CSV/TSV
Example 5-18. Writing CSV in Python
def writeRecords(records):
    """Write out CSV lines"""
    output = StringIO.StringIO()
    writer = csv.DictWriter(output, fieldnames=["name", "favoriteAnimal"])
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]
pandaLovers.mapPartitions(writeRecords).saveAsTextFile(outputFile)
5.2.3 CSV/TSV
Example 5-19. Writing CSV in Scala
pandaLovers.map(person => List(person.name, person.favoriteAnimal).toArray)
  .mapPartitions{people =>
    val stringWriter = new StringWriter();
    val csvWriter = new CSVWriter(stringWriter);
    csvWriter.writeAll(people.toList)
    Iterator(stringWriter.toString)
  }.saveAsTextFile(outFile)
5.2.4 SequenceFiles
 SequenceFiles are a popular Hadoop format
composed of flat files with key/value pairs.
 SequenceFiles are a common input/output format
for Hadoop MapReduce jobs.
 SequenceFiles consist of elements that implement
Hadoop’s Writable interface, as Hadoop uses a
custom serialization framework.
5.2.4 SequenceFiles
Example 5-20. Loading a SequenceFile in Python
data = sc.sequenceFile(inFile,
  "org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable")
Example 5-21. Loading a SequenceFile in Scala
val data = sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
  map{case (x, y) => (x.toString, y.get())}
5.2.4 SequenceFiles
Example 5-22. Loading a SequenceFile in Java
public static class ConvertToNativeTypes implements
    PairFunction<Tuple2<Text, IntWritable>, String, Integer> {
  public Tuple2<String, Integer> call(Tuple2<Text, IntWritable> record) {
    return new Tuple2(record._1.toString(), record._2.get());
  }
}
JavaPairRDD<Text, IntWritable> input = sc.sequenceFile(fileName, Text.class, IntWritable.class);
JavaPairRDD<String, Integer> result = input.mapToPair(new ConvertToNativeTypes());
5.2.4 SequenceFiles
Example 5-23. Saving a SequenceFile in Scala
val data = sc.parallelize(List(("Panda", 3), ("Kay", 6), ("Snail", 2)))
data.saveAsSequenceFile(outputFile)
5.2.5 Object Files
 Object files are a deceptively simple wrapper around SequenceFiles that allows us to save RDDs containing just values (rather than key/value pairs). Unlike with SequenceFiles, the values in object files are written out using Java Serialization.
 Using Java Serialization has a number of implications. Unlike with normal SequenceFiles, the output will differ from what Hadoop would produce for the same objects. Unlike the other formats, object files are mostly intended for Spark jobs communicating with other Spark jobs. Java Serialization can also be quite slow.
5.2.5 Object Files
 Saving an object file is as simple as calling saveAsObjectFile on an RDD. Reading one back is also simple: the objectFile() function on the SparkContext takes in a path and returns an RDD.
 Object files are not available in Python, but Python RDDs and the SparkContext support saveAsPickleFile() and pickleFile() instead, which use Python's pickle serialization library.
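 A minimal round-trip sketch in Scala (the output path is a placeholder):
val data = sc.parallelize(List(("Panda", 3), ("Kay", 6)))
data.saveAsObjectFile("objectFileDir") // values are serialized with Java Serialization
val loaded = sc.objectFile[(String, Int)]("objectFileDir") // read back as an RDD of pairs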
5.2.6 Hadoop Input and Output Formats
Example 5-24. Loading KeyValueTextInputFormat() with old-style API in Scala
val input = sc.hadoopFile[Text, Text, KeyValueTextInputFormat](inputFile).map{
  case (x, y) => (x.toString, y.toString)
}
5.2.6 Hadoop Input and Output Formats
Example 5-25. Loading LZO-compressed JSON with Elephant Bird in Scala
val input = sc.newAPIHadoopFile(inputFile, classOf[LzoJsonInputFormat],
  classOf[LongWritable], classOf[MapWritable], conf)
// Each MapWritable in "input" represents a JSON object
5.2.6 Hadoop Input and Output Formats
Example 5-26. Saving a SequenceFile in Java
public static class ConvertToWritableTypes implements
    PairFunction<Tuple2<String, Integer>, Text, IntWritable> {
  public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> record) {
    return new Tuple2(new Text(record._1), new IntWritable(record._2));
  }
}
JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(input);
JavaPairRDD<Text, IntWritable> result = rdd.mapToPair(new ConvertToWritableTypes());
result.saveAsHadoopFile(fileName, Text.class, IntWritable.class, SequenceFileOutputFormat.class);
5.2.7 File Compression
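 Spark's text output methods accept a Hadoop compression codec class; a minimal Scala sketch, assuming rdd and outputFile are defined as in the earlier examples:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsTextFile(outputFile, classOf[GzipCodec]) // writes gzip-compressed part files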
5.3 File Systems
 Spark supports a large number of filesystems to read from and write to, and we can use them with any of the file formats we want.
5.3.1 Local/"Regular" FS
 While Spark supports loading files from the local
filesystem, it requires that the files are available at
the same path on all nodes in your cluster.
Example 5-29. Loading a compressed text file from the local filesystem in Scala
val rdd = sc.textFile("file:///home/holden/happypandas.gz")
5.3.2 Amazon S3
 To access S3 in Spark, you should first set the
AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY environment
variables to your S3 credentials.
 Then pass a path starting with s3n:// to Spark's file input methods, of the form s3n://bucket/path-within-bucket.
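 A minimal sketch in Scala (bucket and path are placeholders), assuming the environment variables above are already set:
val rdd = sc.textFile("s3n://my-bucket/path-within-bucket")
// A whole directory or a wildcard such as s3n://my-bucket/logs/*.log also works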
5.3.3 HDFS
 Using Spark with HDFS is as simple as specifying
hdfs://master:port/path for your input and output.
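 For example (a sketch; master, port, and path are placeholders for your cluster):
val rdd = sc.textFile("hdfs://master:port/path")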
5.4 Structured data with Spark SQL
 Spark SQL supports multiple structured data sources as input, and because it understands their schema, it can efficiently read only the fields required by a query.
 We give Spark SQL a SQL query to run on the data
source (selecting some fields or a function of the
fields), and we get back an RDD of Row objects, one
per record.
5.4.1 Apache Hive
 To connect Spark SQL to an existing Hive installation, you need to provide a Hive configuration. You do so by copying your hive-site.xml file to Spark's ./conf/ directory.
5.4.1 Apache Hive
Example 5-30. Creating a HiveContext and
selecting data in Python
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")
firstRow = rows.first()
print firstRow.name
5.4.1 Apache Hive
Example 5-31. Creating a HiveContext and
selecting data in Scala
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // Field 0 is the name
5.4.1 Apache Hive
Example 5-32. Creating a HiveContext and
selecting data in Java
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;
HiveContext hiveCtx = new HiveContext(sc);
SchemaRDD rows = hiveCtx.sql("SELECT name, age FROM users");
Row firstRow = rows.first();
System.out.println(firstRow.getString(0)); // Field 0 is the name
5.4.2 JSON
Example 5-33. Sample tweets in JSON
{"user": {" name": "Holden", "location": "San
Francisco"}, "text": "Nice day out today"} {"user": {"
name": "Matei", "location": "Berkeley"}, "text": "Even
nicer here :)"}
5.4.2 JSON
Example 5-34. JSON loading with Spark SQL in
Python
tweets = hiveCtx.jsonFile("tweets.json")
tweets.registerTempTable("tweets")
results = hiveCtx.sql("SELECT user.name, text FROM tweets")
5.4.2 JSON
Example 5-35. JSON loading with Spark SQL in
Scala
val tweets = hiveCtx.jsonFile("tweets.json")
tweets.registerTempTable("tweets")
val results = hiveCtx.sql("SELECT user.name, text FROM tweets")
Example 5-36. JSON loading with Spark SQL in Java
SchemaRDD tweets = hiveCtx.jsonFile(jsonFile);
tweets.registerTempTable("tweets");
SchemaRDD results = hiveCtx.sql("SELECT user.name, text FROM tweets");
5.5 Databases
 Spark can access several popular databases using
either their Hadoop connectors or custom Spark
connectors.
5.5.1 JDBC
Example 5-37. JdbcRDD in Scala
def createConnection() = {
  Class.forName("com.mysql.jdbc.Driver").newInstance();
  DriverManager.getConnection("jdbc:mysql://localhost/test?user=holden");
}
def extractValues(r: ResultSet) = {
  (r.getInt(1), r.getString(2))
}
val data = new JdbcRDD(sc,
  createConnection, "SELECT * FROM panda WHERE ? <= id AND id <= ?",
  lowerBound = 1, upperBound = 3, numPartitions = 2, mapRow = extractValues)
println(data.collect().toList)
5.5.2 Cassandra
Example 5-38. sbt requirements for Cassandra connector
"com.datastax.spark" %% "spark-cassandra-connector" % "1.0.0-rc5",
"com.datastax.spark" %% "spark-cassandra-connector-java" % "1.0.0-rc5"
Example 5-39. Maven requirements for Cassandra connector
<dependency> <!-- Cassandra -->
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector</artifactId>
  <version>1.0.0-rc5</version>
</dependency>
<dependency> <!-- Cassandra -->
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector-java</artifactId>
  <version>1.0.0-rc5</version>
</dependency>
5.5.2 Cassandra
Example 5-40. Setting the Cassandra property in
Scala
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "hostname")
val sc = new SparkContext(conf)
Example 5-41. Setting the Cassandra property in Java
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", cassandraHost);
JavaSparkContext sc = new JavaSparkContext(
sparkMaster, "basicquerycassandra", conf);
5.5.2 Cassandra
Example 5-42. Loading the entire table as an RDD
with key/value data in Scala
// Implicits that add functions to the SparkContext & RDDs.
import com.datastax.spark.connector._
// Read entire table as an RDD. Assumes your table test was created as
// CREATE TABLE test.kv(key text PRIMARY KEY, value int);
val data = sc.cassandraTable("test", "kv")
// Print some basic stats on the value field.
data.map(row => row.getInt("value")).stats()
5.5.2 Cassandra
Example 5-43. Loading the entire table as an RDD with key/value
data in Java
import com.datastax.spark.connector.CassandraRow;
import static com.datastax.spark.connector.CassandraJavaUtil.javaFunctions;
// Read entire table as an RDD. Assumes your table test was created as
// CREATE TABLE test.kv(key text PRIMARY KEY, value int);
JavaRDD<CassandraRow> data = javaFunctions(sc).cassandraTable("test", "kv");
// Print some basic stats.
System.out.println(data.mapToDouble(new DoubleFunction<CassandraRow>() {
  public double call(CassandraRow row) {
    return row.getInt("value");
  }
}).stats());
5.5.2 Cassandra
Example 5-44. Saving to Cassandra in Scala
val rdd = sc.parallelize(List(Seq("moremagic", 1)))
rdd.saveToCassandra("test", "kv", SomeColumns("key", "value"))
5.5.3 HBase
Example 5-45. Scala example of reading from HBase
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tablename") // which table to scan
val rdd = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
5.5.4 Elasticsearch
Example 5-46. Elasticsearch output in Scala
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.output.format.class", "org.elasticsearch.hadoop.mr.EsOutputFormat")
jobConf.setOutputCommitter(classOf[FileOutputCommitter])
jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweets")
jobConf.set(ConfigurationOptions.ES_NODES, "localhost")
FileOutputFormat.setOutputPath(jobConf, new Path("-"))
output.saveAsHadoopDataset(jobConf)
5.5.4 Elasticsearch
Example 5-47. Elasticsearch input in Scala
def mapWritableToInput(in: MapWritable): Map[String, String] = {
  in.map{case (k, v) => (k.toString, v.toString)}.toMap
}
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set(ConfigurationOptions.ES_RESOURCE_READ, args(1))
jobConf.set(ConfigurationOptions.ES_NODES, args(2))
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]], classOf[Object],
  classOf[MapWritable]) // Extract only the map
// Convert the MapWritable[Text, Text] to Map[String, String]
val tweets = currentTweets.map{ case (key, value) => mapWritableToInput(value) }
edX and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
Editor's Notes
  • #7: Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. Second, the costs associated with running the stack are minimized, because instead of running 5–10 independent software systems, an organization needs to run only one. Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models.
  • #8: Multipart inputs in the form of a directory containing all of the parts can be handled in two ways: we can pass the directory to the same textFile method and it will load all of the parts into our RDD (see the sketch below), or we can use SparkContext.wholeTextFiles(), described in the next note.
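A sketch of the directory approach in Scala (paths are hypothetical):
val allParts = sc.textFile("salesFiles/") // loads every part file in the directory
val someParts = sc.textFile("salesFiles/part-0000*") // a wildcard selects a subset of the parts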
  • #9: Sometimes it’s important to know which file a given piece of input came from (such as time data with the key in the file), or we need to process an entire file at a time. If our files are small enough, then we can use the SparkContext.wholeTextFiles() method and get back a pair RDD where the key is the name of the input file. wholeTextFiles() can be very useful when each file represents a certain time period’s data.
  • #10: Loading the data as a text file and then parsing the JSON data is an approach that we can use in all of the supported languages. This works assuming that you have one JSON record per row; if you have multiline JSON files, you will instead have to load the whole file and then parse each file. 
  • #12: Handling incorrectly formatted records can be a big problem, especially with semistructured data like JSON. With small datasets it can be acceptable to stop the world (i.e., fail the program) on malformed input, but often with large datasets malformed input is simply a part of life. If you do choose to skip incorrectly formatted data, you may wish to use accumulators to keep track of the number of errors, as in the sketch below.
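A sketch of that accumulator pattern in Scala (reusing the mapper and Person case class from Example 5-7; names are illustrative):
val malformed = sc.accumulator(0)
val result = input.flatMap { record =>
  try {
    Some(mapper.readValue(record, classOf[Person]))
  } catch {
    case e: Exception =>
      malformed += 1 // count the malformed record instead of failing the job
      None
  }
}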
  • #15: Comma-separated value (CSV) files are supposed to contain a fixed number of fields per line, and the fields are separated by a comma (or a tab in the case of tab-separated value, or TSV, files). Records are often stored one per line, but this is not always the case as records can sometimes span lines. CSV and TSV files can sometimes be inconsistent, most frequently with respect to handling newlines, escaping, and rendering non-ASCII characters, or noninteger numbers. CSVs cannot handle nested field types natively, so we have to unpack and pack to specific fields manually. Loading CSV/TSV data is similar to loading JSON data in that we can first load it as text and then process it. The lack of standardization of format leads to different versions of the same library sometimes handling input in different ways.
  • #18: If there are embedded newlines in fields, we will need to load each file in full and parse the entire segment, as shown in Examples 5-15 through 5-17. This is unfortunate because if each file is large it can introduce bottlenecks in loading and parsing.
  • #30: One of the simplest Hadoop input formats is the KeyValueTextInputFormat, which can be used for reading in key/value data from text files (see Example 5-24). Each line is processed individually, with the key and value separated by a tab character. This format ships with Hadoop so we don’t have to add any extra dependencies to our project to use it. 
  • #31: Twitter’s Elephant Bird package supports a large number of data formats, including JSON, Lucene, Protocol Buffer–related formats, and others. The package also works with both the new and old Hadoop file APIs. To illustrate how to work with the new-style Hadoop APIs from Spark, we’ll look at loading LZO-compressed JSON data with LzoJsonInputFormat in Example 5-25.