Cloudera CCA175 Exam
CCA Spark and Hadoop Developer Exam

Updated On: 19-Jan-2026

Problem Scenario 15: You have been given the following MySQL database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.

1. In the MySQL departments table, insert the following record:
Insert into departments values(9999, '"Data Science"');
2. Now there is a downstream system which will process dumps of this file. However, the system is designed so that it can process files only if fields are enclosed in single quotes ('), the field separator is a dash (-), and each line is terminated by a colon (:).
3. If the data itself contains a double quote ("), it should be escaped by \.
4. Please import the departments table into a directory called departments_enclosedby such that the files can be processed by the downstream system.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Connect to the MySQL database.
mysql --user=retail_dba --password=cloudera
show databases; use retail_db; show tables;
Insert the record:
insert into departments values(9999, '"Data Science"');
select * from departments;
Step 2: Import data as per requirement.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/departments_enclosedby \
--enclosed-by \' --escaped-by \\ --fields-terminated-by '-' --lines-terminated-by :
Step 3: Check the result.
hdfs dfs -cat /user/cloudera/departments_enclosedby/part*
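For reference, here is a minimal Python sketch of how a downstream consumer might parse this format (the local file name departments_export.txt is hypothetical, and the sketch does not handle separators or colons embedded inside quoted fields):

# Illustrative parser for the export format: records terminated by ':',
# fields separated by '-', each field enclosed in single quotes,
# special characters escaped by '\' (per the sqoop options above).
def parse_records(text):
    records = []
    for record in text.split(":"):  # ':' terminates each record
        record = record.strip()
        if not record:
            continue
        fields = [f.strip().strip("'") for f in record.split("-")]
        # Undo the '\' escaping for quote characters
        fields = [f.replace('\\"', '"').replace("\\'", "'") for f in fields]
        records.append(fields)
    return records

with open("departments_export.txt") as fh:  # hypothetical local copy of a part file
    for rec in parse_records(fh.read()):
        print(rec)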



Problem Scenario 70: Write a Spark application using Python which reads a file "Content.txt" (on HDFS) with the following content, does a word count, and saves the result in a directory called "problem85" (on HDFS).
Content.txt
Hello this is ABCTECH.com
This is XYZTECH.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Create an application with the following code and store it in problem85.py
# Import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf

# Create configuration object and set the application name
conf = SparkConf().setAppName("CCA 175 Problem 85")
sc = SparkContext(conf=conf)

# Load data from HDFS
contentRDD = sc.textFile("Content.txt")

# Keep only non-empty lines
nonempty_lines = contentRDD.filter(lambda x: len(x) > 0)

# Split each line on spaces
words = nonempty_lines.flatMap(lambda x: x.split(" "))

# Do the word count, then flip to (count, word) and sort by count descending
wordcounts = words.map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])).sortByKey(False)

for word in wordcounts.collect():
    print(word)

# Save the final data
wordcounts.saveAsTextFile("problem85")

Step 2: Submit this application
spark-submit --master yarn problem85.py
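As a quick way to see what the job should produce, here is a small standalone Python check of the same counting logic on the five lines above (illustrative only; it runs without Spark):

# Local, Spark-free check of the word-count logic (illustrative only)
from collections import Counter

lines = [
    "Hello this is ABCTECH.com",
    "This is XYZTECH.com",
    "Apache Spark Training",
    "This is Spark Learning Session",
    "Spark is faster than MapReduce",
]
counts = Counter(word for line in lines if len(line) > 0 for word in line.split(" "))
# Emit (count, word) tuples in descending order, mirroring sortByKey(False)
for count, word in sorted(((c, w) for w, c in counts.items()), reverse=True):
    print((count, word))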



Problem Scenario 77: You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of orders table: (order_id, order_date, order_customer_id, order_status)
Columns of order_items table: (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Please accomplish the following activities.

1. Copy "retail_db.orders" and "retail_db.order_items" table to hdfs in respective directory
p92_orders and p92 order items .
2. Join these data using orderid in Spark and Python
3. Calculate total revenue perday and per order
4. Calculate total and average revenue for each date. - combineByKey
-aggregateByKey

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Import each table individually.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p92_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p92_order_items -m 1
Note: Make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2: Read the data from one of the partitions created by the above commands.
hadoop fs -cat p92_orders/part-m-00000
hadoop fs -cat p92_order_items/part-m-00000
Step 3: Load the above two directories as RDDs using Spark and Python (open a pyspark shell and do the following).
orders = sc.textFile("p92_orders")
orderItems = sc.textFile("p92_order_items")
Step 4: Convert each RDD into key-value form, with order_id as the key and the whole line as the value.
# First column is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
# Second column is order_item_order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))
Step 5: Join both RDDs using order_id.
joinedData = orderItemsKeyValue.join(ordersKeyValue)
# Print the joined data
for line in joinedData.collect(): print(line)
The format of joinedData is as below:
(order_id, (all columns from the order_items row, all columns from the orders row))
Step 6: Now fetch the selected values: order_date, order_id, and the amount collected for the order.
# Returned rows will contain ((order_date, order_id), amount_collected)
revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(",")[1], row[0]), float(row[1][0].split(",")[4])))
# Print the result
for line in revenuePerDayPerOrder.collect(): print(line)
Step 7: Now calculate the total revenue per day and per order.
A. Using reduceByKey
totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda runningSum, value: runningSum + value)
for line in totalRevenuePerDayPerOrder.sortByKey().collect(): print(line)
# Generate data as (date, amount_collected), ignoring order_id
dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1]))
for line in dateAndRevenueTuple.sortByKey().collect(): print(line)
Step 8: Calculate the total amount collected for each day, along with the number of records per day.
# Generate output as (date, (total_revenue_for_date, total_number_of_records))
# Lambda 1: creates the combiner tuple (revenue, 1) from the first value for a key
# Lambda 2: adds each further revenue to the running sum and increments the counter
# Lambda 3: final function to merge the combiners from different partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey(
lambda revenue: (revenue, 1),
lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1] + 1),
lambda tuple1, tuple2: (round(tuple1[0] + tuple2[0], 2), tuple1[1] + tuple2[1]))
for line in totalRevenueAndTotalCount.collect(): print(line)
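To make the three combineByKey arguments concrete, here is a toy run you can paste into the same pyspark shell (the dates and amounts are made up for illustration):

# Toy combineByKey example: build (sum, count) per key
# createCombiner: first value for a key becomes (value, 1)
# mergeValue: fold another value from the same partition into (sum, count)
# mergeCombiners: merge (sum, count) tuples from different partitions
data = sc.parallelize([("2014-01-01", 10.0), ("2014-01-01", 20.0), ("2014-01-02", 5.0)])
sumAndCount = data.combineByKey(
lambda v: (v, 1),
lambda acc, v: (acc[0] + v, acc[1] + 1),
lambda a, b: (a[0] + b[0], a[1] + b[1]))
for line in sumAndCount.collect(): print(line)
# Expected: ('2014-01-01', (30.0, 2)) and ('2014-01-02', (5.0, 1))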
Step 9: Now calculate the average for each date.
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
Step 10: Using aggregateByKey
# Argument 1: the zero value (initializes both revenue and count)
# Lambda 1: within a partition, adds each revenue to the running (revenue, count) tuple
# Lambda 2: merges the (revenue, count) tuples from all partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey(
(0, 0),
lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1),
lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount: (tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0], tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]))
for line in totalRevenueAndTotalCount.collect(): print(line)
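The difference from combineByKey is that aggregateByKey takes an explicit zero value, (0, 0) here, instead of a createCombiner function. The same toy data from the sketch above gives the same result (values are again illustrative):

# Toy aggregateByKey example: the (0.0, 0) zero value replaces createCombiner
data = sc.parallelize([("2014-01-01", 10.0), ("2014-01-01", 20.0), ("2014-01-02", 5.0)])
sumAndCount = data.aggregateByKey((0.0, 0),
lambda acc, v: (acc[0] + v, acc[1] + 1),
lambda a, b: (a[0] + b[0], a[1] + b[1]))
for line in sumAndCount.collect(): print(line)
# Expected: ('2014-01-01', (30.0, 2)) and ('2014-01-02', (5.0, 1))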
Step 11: Calculate the average revenue per date.
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))
for line in averageRevenuePerDate.collect(): print(line)



Problem Scenario 52: You have been given the below code snippet.
val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1))
Operation_xyz
Write a correct code snippet for Operation_xyz which will produce the below output.
scala.collection.Map[Int, Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
b.countByValue
countByValue
Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)
Listing Variants
def countByValue(): Map[T, Long]
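For comparison, PySpark exposes the same action; a minimal sketch (run in a pyspark shell where sc is available):

# PySpark equivalent: countByValue returns a dict of value -> occurrence count
b = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1])
print(b.countByValue())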



Problem Scenario 31: You have been given the following two files.

1. Content.txt: a huge text file containing space-separated words.
2. Remove.txt: ignore/filter all the words given in this file (comma-separated).

Write a Spark program which reads the Content.txt file and loads it as an RDD, then removes all the words listed in a broadcast variable (loaded from the words in Remove.txt). Finally, count the occurrences of each word and save the result as a text file in HDFS.
Content.txt

Hello this is ABCTech.com
This is TechABY.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
Remove.txt
Hello, is, this, the

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Create both files in HDFS in a directory called spark2 (we will do this using Hue). However, you can first create them in the local filesystem and then upload them to HDFS.
Step 2: Load the Content.txt file
val content = sc.textFile("spark2/Content.txt") //Load the text file
Step 3: Load the Remove.txt file
val remove = sc.textFile("spark2/Remove.txt") //Load the text file
Step 4: Create an RDD from remove. However, each word could have trailing spaces; remove those whitespaces as well. We use three operations here: flatMap, map, and trim.
val removeRDD = remove.flatMap(x => x.split(",")).map(word => word.trim) // Create an RDD of words
Step 5: Broadcast the variable holding the words you want to ignore.
val bRemove = sc.broadcast(removeRDD.collect().toList) // It should be a list of Strings
Step 6: Split the content RDD so we have an RDD of individual words.
val words = content.flatMap(line => line.split(" "))
Step 7: Filter the RDD so it keeps only the words not present in the broadcast variable.
val filtered = words.filter{ case (word) => !bRemove.value.contains(word) }
Step 8: Create a PairRDD, so we have (word, 1) tuples.
val pairRDD = filtered.map(word => (word, 1))
Step 9: Now do the word count on the PairRDD.
val wordCount = pairRDD.reduceByKey(_ + _)
Step 10: Save the output as a Text file.
wordCount.saveAsTextFile("spark2/result.txt")
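The same broadcast-and-filter pattern can be written in PySpark; a minimal sketch (same HDFS paths as above; the output directory spark2/result_py is hypothetical):

# PySpark version of the same broadcast-and-filter word count (sketch)
content = sc.textFile("spark2/Content.txt")
remove = sc.textFile("spark2/Remove.txt")
# Build the small ignore-list on the driver and broadcast it to all executors
removeList = remove.flatMap(lambda x: x.split(",")).map(lambda w: w.strip()).collect()
bRemove = sc.broadcast(removeList)
words = content.flatMap(lambda line: line.split(" "))
filtered = words.filter(lambda word: word not in bRemove.value)
wordCount = filtered.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
wordCount.saveAsTextFile("spark2/result_py")  # hypothetical output directory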





