Cloudera CCA175 Exam
CCA Spark and Hadoop Developer Exam (Page 7)

Updated On: 19-Jan-2026

Problem Scenario 61: You have been given the below code snippet.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, (String, Option[String]))] = Array((6, (salmon, Some(salmon))), (6, (salmon, Some(rabbit))),
(6, (salmon, Some(turkey))), (6, (salmon, Some(salmon))), (6, (salmon, Some(rabbit))), (6, (salmon, Some(turkey))), (3, (dog, Some(dog))), (3, (dog, Some(cat))), (3, (dog, Some(gnu))), (3, (dog, Some(bee))), (3, (rat, Some(dog))), (3, (rat, Some(cat))), (3, (rat, Some(gnu))), (3, (rat, Some(bee))), (8, (elephant, None)))

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution:
b.leftOuterJoin(d).collect
leftOuterJoin [Pair]: Performs a left outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work.
keyBy: Constructs two-component tuples (key-value pairs) by applying a function to each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuples.
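The complete snippet below can be pasted into spark-shell (where sc is the predefined SparkContext) to reproduce the output shown in the question. Note that elephant (length 8) has no matching key in d, so it is paired with None, while wolf and bear (length 4) appear only on the right-hand side and are therefore dropped by the left outer join.

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)   // (3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant)
val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
val d = c.keyBy(_.length)
// Keep every pair from b; attach Some(value) where d has the same key, None otherwise.
b.leftOuterJoin(d).collect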



Problem Scenario 73: You have been given data in json format as below.
{"first_name":"Ankit", "last_name":"Jain"}
{"first_name":"Amir", "last_name":"Khan"}
{"first_name":"Rajesh", "last_name":"Khanna"}
{"first_name":"Priynka", "last_name":"Chopra"}
{"first_name":"Kareena", "last_name":"Kapoor"}
{"first_name":"Lokesh", "last_name":"Yadav"}
Do the following activities.

1. Create employee.json file locally.
2. Load this file onto hdfs.
3. Register this data as a temp table in Spark using Python.
4. Write a select query and print this data.
5. Now save this selected data back in json format.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution:
Step 1: Create employee.json file locally.
vi employee.json (press i to insert, then paste the content)
Step 2: Upload this file to hdfs (default location).
hadoop fs -put employee.json
Step 3: Write spark script
# Import SQLContext
from pyspark.sql import SQLContext
# Create an instance of SQLContext
sqlContext = SQLContext(sc)
# Load the json file
employee = sqlContext.jsonFile("employee.json")
# Register the data as a temp table
employee.registerTempTable("EmployeeTab")
# Select data from the EmployeeTab table
employeeInfo = sqlContext.sql("select * from EmployeeTab")
# Iterate over the rows and print them
for row in employeeInfo.collect():
    print(row)
Step 4: Save the selected data back in json format.
employeeInfo.toJSON().saveAsTextFile("employeeJson1")

Step 5: Check whether the data has been created.
hadoop fs -cat employeeJson1/part*



Problem Scenario 43: You have been given the following code snippet.
val grouped = sc.parallelize(Seq(((1, "two"), List((3, 4), (5, 6)))))
val flattened = grouped.flatMap {A =>
groupValues.map { value => B }
}
You need to generate the following output; hence replace A and B.
Array((1, two, 3, 4), (1, two, 5, 6))

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution:
A: case (key, groupValues)
B: (key._1, key._2, value._1, value._2)
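Substituting A and B gives the complete, runnable snippet:

val grouped = sc.parallelize(Seq(((1, "two"), List((3, 4), (5, 6)))))
val flattened = grouped.flatMap { case (key, groupValues) =>
  // The pattern match binds key = (1, "two") and groupValues = List((3,4), (5,6));
  // each inner pair is combined with the key's components into a single 4-tuple.
  groupValues.map { value => (key._1, key._2, value._1, value._2) }
}
flattened.collect   // Array((1,two,3,4), (1,two,5,6))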



Problem Scenario 22: You have been given the below comma-separated employee information.
name, salary, sex, age
alok, 100000, male, 29
jatin, 105000, male, 32
yogesh, 134000, male, 39
ragini, 112000, female, 35
jyotsana, 129000, female, 39
valmiki, 123000, male, 29
Use the netcat service on port 44444 and send the above data line by line with nc. Please do the following activities.

1. Create a flume conf file using the fastest channel, which writes data into the hive warehouse directory, in a table called flumeemployee (create the hive table as well for the given data).
2. Write a hive query to read the average salary of all employees.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution:
Step 1: Create the hive table flumeemployee.
CREATE TABLE flumeemployee
(
name string,
salary int,
sex string,
age int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Step 2: Create a flume configuration file with the below configuration for source, sink and channel, and save it as flume2.conf.
# Define source, sink, channel and agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe/configure source1
agent1.sources.source1.type = netcat
agent1.sources.source1.bind = 127.0.0.1
agent1.sources.source1.port = 44444
# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/hive/warehouse/flumeemployee
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.fileType = DataStream
# Define channel1 properties
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000
agent1.channels.channel1.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
Step 3: Run the below command, which uses this configuration file and appends data to hdfs.
Start the flume service:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume2.conf --name agent1
Step 4: Open another terminal and use the netcat service.
nc localhost 44444
Step 5: Enter data line by line.
alok, 100000, male, 29
jatin, 105000, male, 32
yogesh, 134000, male, 39
ragini, 112000, female, 35
jyotsana, 129000, female, 39
valmiki, 123000, male, 29
Step 6: Open Hue and check whether the data is available in the hive table.

Step 7: Stop the flume service by pressing ctrl+c.
Step 8: Calculate the average salary on the hive table using the below query. You can use either the hive command line tool or Hue.
select avg(salary) from flumeemployee;
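For the sample data above, the expected result is (100000 + 105000 + 134000 + 112000 + 129000 + 123000) / 6 = 703000 / 6 ≈ 117166.67.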



Problem Scenario 4: You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
Import the single table categories (subset data) into a hive managed table, where category_id is between 1 and 22.

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution:
Step 1: Import a single table (subset of data).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table categories \
--where "\`category_id\` between 1 and 22" \
--hive-import \
-m 1
Note: the ` character around category_id is the backtick, which you find on the ~ key.
This command will create a managed table, and its content will be stored in the following directory.
/user/hive/warehouse/categories
Step 2: Check whether the table has been created (in Hive).
show tables;
select * from categories;


