Cloudera CCA175 Exam
CCA Spark and Hadoop Developer Exam (Page 2)

Updated On: 19-Jan-2026

CORRECT TEXT
Problem Scenario 28 : You need to implement a near real time solution for collecting information as it is submitted in files, with the below data:
echo "IBM, 100, 20160104" >> /tmp/spooldir2/.bb.txt
echo "IBM, 103, 20160105" >> /tmp/spooldir2/.bb.txt
mv /tmp/spooldir2/.bb.txt /tmp/spooldir2/bb.txt
After a few minutes
echo "IBM, 100.2, 20160104" >> /tmp/spooldir2/.dr.txt
echo "IBM, 103.1, 20160105" >> /tmp/spooldir2/.dr.txt
mv /tmp/spooldir2/.dr.txt /tmp/spooldir2/dr.txt
You have been given the below directory location (if not available then create it): /tmp/spooldir2.
As soon as a file is committed in this directory, it needs to be available in HDFS in /tmp/flume/primary as well as in the /tmp/flume/secondary location.
However, note that /tmp/flume/secondary is optional; if a transaction that writes to this directory fails, it need not be rolled back.
Write a Flume configuration file named flume8.conf and use it to load the data into HDFS with the following additional properties:

1. Spool the /tmp/spooldir2 directory.
2. The file prefix in HDFS should be events.
3. The file suffix should be .log.
4. If a file is not yet committed and still in use, it should have _ as a prefix.
5. Data should be written as text to HDFS.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Create the directory: mkdir /tmp/spooldir2
Step 2: Create the Flume configuration file with the below configuration for source, sinks and channels, and save it as flume8.conf.
agent1.sources = source1
agent1.sinks = sink1a sink1b
agent1.channels = channel1a channel1b
agent1.sources.source1.channels = channel1a channel1b
agent1.sources.source1.selector.type = replicating
agent1.sources.source1.selector.optional = channel1b
agent1.sinks.sink1a.channel = channel1a
agent1.sinks.sink1b.channel = channel1b
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir2
agent1.sinks.sink1a.type = hdfs
agent1.sinks.sink1a.hdfs.path = /tmp/flume/primary
agent1.sinks.sink1a.hdfs.filePrefix = events
agent1.sinks.sink1a.hdfs.fileSuffix = .log
# prefix for files that are still being written (requirement 4)
agent1.sinks.sink1a.hdfs.inUsePrefix = _
agent1.sinks.sink1a.hdfs.fileType = DataStream
agent1.sinks.sink1b.type = hdfs
agent1.sinks.sink1b.hdfs.path = /tmp/flume/secondary
agent1.sinks.sink1b.hdfs.filePrefix = events
agent1.sinks.sink1b.hdfs.fileSuffix = .log
agent1.sinks.sink1b.hdfs.inUsePrefix = _
agent1.sinks.sink1b.hdfs.fileType = DataStream
agent1.channels.channel1a.type = file
agent1.channels.channel1b.type = memory
Step 3: Run the below command, which will use this configuration file and append data in HDFS.
Start the Flume agent:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume8.conf --name agent1
Step 4: Open another terminal and create the files in /tmp/spooldir2/
echo "IBM, 100, 20160104" >> /tmp/spooldir2/.bb.txt
echo "IBM, 103, 20160105" >> /tmp/spooldir2/.bb.txt
mv /tmp/spooldir2/.bb.txt /tmp/spooldir2/bb.txt
After a few minutes:
echo "IBM, 100.2, 20160104" >> /tmp/spooldir2/.dr.txt
echo "IBM, 103.1, 20160105" >> /tmp/spooldir2/.dr.txt
mv /tmp/spooldir2/.dr.txt /tmp/spooldir2/dr.txt



Problem Scenario 87 : You have been given the below three files.
product.csv (Create this file in hdfs)
productID, productCode, name, quantity, price, supplierid
1001, PEN, Pen Red, 5000, 1.23, 501
1002, PEN, Pen Blue, 8000, 1.25, 501
1003, PEN, Pen Black, 2000, 1.25, 501
1004, PEC, Pencil 2B, 10000, 0.48, 502
1005, PEC, Pencil 2H, 8000, 0.49, 502
1006, PEC, Pencil HB, 0, 9999.99, 502
2001, PEC, Pencil 3B, 500, 0.52, 501
2002, PEC, Pencil 4B, 200, 0.62, 501
2003, PEC, Pencil 5B, 100, 0.73, 501
2004, PEC, Pencil 6B, 500, 0.47, 502
supplier.csv
supplierid, name, phone
501, ABC Traders, 88881111
502, XYZ Company, 88882222
503, QQ Corp, 88883333
products_suppliers.csv
productID, supplierID
2001, 501
2002, 501
2003, 501
2004, 502
2001, 503
Now accomplish all the queries given in the solution.
Select the product, its price, and its supplier name where the product price is less than 0.6, using SparkSQL.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :

Step 1:
hdfs dfs -mkdir sparksql2
hdfs dfs -put product.csv sparksql2/
hdfs dfs -put supplier.csv sparksql2/
hdfs dfs -put products_suppliers.csv sparksql2/
Step 2: Now in spark shell
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// load the data into a new RDD
val products = sc.textFile("sparksql2/product.csv")
val supplier = sc.textFile("sparksql2/supplier.csv")
val prdsup = sc.textFile("sparksql2/products_suppliers.csv")
// Return the first element in each RDD
products.first()
supplier.first()
prdsup.first()
// define the schema using case classes
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price: Float, supplierid: Integer)
case class Suplier(supplierid: Integer, name: String, phone: String)
case class PRDSUP(productid: Integer, supplierid: Integer)
// create RDDs of the case class objects (assumes the data files do not contain a header row)
val prdRDD = products.map(_.split(", ")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt))
val supRDD = supplier.map(_.split(", ")).map(p => Suplier(p(0).toInt, p(1), p(2)))
val prdsupRDD = prdsup.map(_.split(", ")).map(p => PRDSUP(p(0).toInt, p(1).toInt))
prdRDD.first()
prdRDD.count()
supRDD.first()
supRDD.count()
prdsupRDD.first()
prdsupRDD.count()
// change RDD of Product objects to a DataFrame
val prdDF = prdRDD.toDF()
val supDF = supRDD.toDF()
val prdsupDF = prdsupRDD.toDF()
// register the DataFrames as temp tables
prdDF.registerTempTable("products")
supDF.registerTempTable("suppliers")
prdsupDF.registerTempTable("productssuppliers")
// Select product, its price and its supplier name where the product price is less than 0.6
val results = sqlContext.sql("""SELECT products.name, price, suppliers.name AS sup_name FROM products JOIN suppliers ON products.supplierid = suppliers.supplierid WHERE price < 0.6""")
results.show()
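For reference, the same result can be produced with the DataFrame API instead of a SQL string. This is only an alternative sketch built on the prdDF and supDF DataFrames defined above, not part of the required answer:
// join on supplierid, filter on price, and project the requested columns
val resultsDF = prdDF.join(supDF, prdDF("supplierid") === supDF("supplierid"))
  .where(prdDF("price") < 0.6)
  .select(prdDF("name"), prdDF("price"), supDF("name").as("sup_name"))
resultsDF.show()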



Problem Scenario 35 : You have been given a file named spark7/EmployeeName.csv (id, name).
EmployeeName.csv
E01, Lokesh
E02, Bhupesh
E03, Amit
E04, Ratan
E05, Dinesh
E06, Pavan
E07, Tejas
E08, Sheela
E09, Kumar
E10, Venkat

1. Load this file from HDFS, sort it by name, and save it back as (id,name) in the results directory. However, make sure that while saving, the output is written to a single file.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution:
Step 1: Create the file in HDFS (we will do this using Hue). However, you can also create it in the local filesystem first and then upload it to HDFS.
Step 2: Load the EmployeeName.csv file from HDFS and create a pair RDD.
val name = sc.textFile("spark7/EmployeeName.csv")
val namePairRDD = name.map(x => (x.split(", ")(0), x.split(", ")(1)))
Step 3: Now swap the keys and values of namePairRDD.
val swapped = namePairRDD.map(item => item.swap)

Step 4: Now sort the rdd by key.
val sortedOutput = swapped.sortByKey()
Step 5: Now swap the result back.
val swappedBack = sortedOutput.map(item => item.swap)
Step 6: Save the output as a text file; the output must be written to a single file.
swappedBack.repartition(1).saveAsTextFile("spark7/result.txt")
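An equivalent, slightly shorter approach (shown here only as an alternative sketch, not the graded answer) is to sort the pair RDD by its value directly with sortBy, which avoids the two swap steps; the output path below is hypothetical:
// sort by the name field (the second element of each pair), then write a single output file
val sortedByName = namePairRDD.sortBy(_._2)
sortedByName.repartition(1).saveAsTextFile("spark7/result_sortby")   // hypothetical output path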



Problem Scenario 23 : You have been given a log generating service as below.
Start_logs (It will generate continuous logs)
Tail_logs (You can check what logs are being generated)
Stop_logs (It will stop the log service)
Path where logs are generated using the above service: /opt/gen_logs/logs/access.log
Now write a Flume configuration file named flume3.conf, and using that configuration file dump the logs into the HDFS file system in a directory called flume3/%Y/%m/%d/%H/%M
(This means that every minute a new directory should be created.) Please use interceptors to provide timestamp information if the message header does not already contain it.
Also note that you have to preserve the existing timestamp if the message contains one. The Flume channel should have the following properties as well: after every 100 messages it should be committed, it should use a non-durable/faster channel, and it should be able to hold a maximum of 1000 events.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Create flume configuration file, with below configuration for source, sink and channel.
# Define source, sink, channel and agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /opt/gen_logs/logs/access.log
# Define interceptors
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp
agent1.sources.source1.interceptors.i1.preserveExisting = true
# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = flume3/%Y/%m/%d/%H/%M
agent1.sinks.sink1.hdfs.fileType = DataStream
# Now we need to define channel1 property.
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000
agent1.channels.channel1.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
Step 2: Run the below command, which will use this configuration file and append data in HDFS.
Start the log service using: start_logs
Start the Flume agent:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume3.conf -Dflume.root.logger=DEBUG,INFO,console --name agent1
Wait for a few minutes and then stop the log service:
stop_logs
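As an optional sanity check (a sketch only; it assumes the relative flume3 path above resolves under your HDFS home directory), you could read a sample of the collected events back in the Spark shell:
// read whatever has landed so far across the year/month/day/hour/minute directories
val flumeOut = sc.textFile("flume3/*/*/*/*/*")
flumeOut.take(5).foreach(println)   // a few access.log lines should appear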



Problem Scenario 41 : You have been given the below code snippet.
val au1 = sc.parallelize(List(("a", Array(1, 2)), ("b", Array(1, 2))))
val au2 = sc.parallelize(List(("a", Array(3)), ("b", Array(2))))
Apply the Spark method which will generate the below output.
Array[(String, Array[Int])] = Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2)))

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution:
au1.union(au2)
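For completeness, a minimal sketch of the full exchange in the spark-shell (the collect() call is added here only to materialise the output shown in the question):
val au1 = sc.parallelize(List(("a", Array(1, 2)), ("b", Array(1, 2))))
val au2 = sc.parallelize(List(("a", Array(3)), ("b", Array(2))))
// union simply concatenates the two RDDs without combining values for matching keys
val combined = au1.union(au2)
combined.collect()   // Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2)))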





