Free CCA175 Exam Braindumps (page: 9)


Problem Scenario 40 : You have been given sample data as below in a file called spark15/file1.txt
3070811, 1963, 1096, , "US", "CA", , 1,
3022811, 1963, 1096, , "US", "CA", , 1, 56
3033811, 1963, 1096, , "US", "CA", , 1, 23

Below is the code snippet to process this file.
val field = sc.textFile("spark15/file1.txt")
val mapper = field.map(x=> A)
mapper.map(x => x.map(x=> {B})).collect
Please fill in A and B so that the code generates the final output below.
Array(Array(3070811, 1963, 1096, 0, "US", "CA", 0, 1, 0)
, Array(3022811, 1963, 1096, 0, "US", "CA", 0, 1, 56)
, Array(3033811, 1963, 1096, 0, "US", "CA", 0, 1, 23)
)

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
A) x.split(", ", -1)
B) if (x.isEmpty) 0 else x
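
Putting A and B together, a minimal sketch of the completed snippet (assuming the fields are separated by ", " exactly as shown in the sample file) could look like this:

val field = sc.textFile("spark15/file1.txt")
// A: split on ", " with limit -1 so that empty columns are preserved
val mapper = field.map(x => x.split(", ", -1))
// B: replace every empty column with 0 and keep all other values as-is
mapper.map(x => x.map(x => if (x.isEmpty) 0 else x)).collect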



Problem Scenario 42 : You have been given a file (spark10/sales.txt) with the content given below.
spark10/sales.txt
Department, Designation, costToCompany, State
Sales, Trainee, 12000, UP
Sales, Lead, 32000, AP
Sales, Lead, 32000, LA
Sales, Lead, 32000, TN
Sales, Lead, 32000, AP
Sales, Lead, 32000, TN
Sales, Lead, 32000, LA
Sales, Lead, 32000, LA
Marketing, Associate, 18000, TN
Marketing, Associate, 18000, TN
HR, Manager, 58000, TN
Produce the output as a CSV, grouped by Department, Designation, and State, with additional columns for sum(costToCompany) and TotalEmployeeCount.
The result should look like:
Dept, Desg, state, empCount, totalCost
Sales, Lead, AP, 2, 64000
Sales, Lead, LA, 3, 96000
Sales, Lead, TN, 2, 64000

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Create a file first using Hue in hdfs.
Step 2: Load the file as an RDD.
val rawlines = sc.textFile("spark10/sales.txt")
Step 3: Create a case class that represents the columns.
case class Employee(dep: String, des: String, cost: Double, state: String)
Step 4: Split the data and create an RDD of Employee objects.
val employees = rawlines.map(_.split(", ")).map(row => Employee(row(0), row(1), row(2).toDouble, row(3)))

Step 5: Create the pairs we need: all group-by fields as the key, and a (count, cost) tuple per employee as the value.
val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))
Step 6: Aggregate the records with reduceByKey, since we want the sum of both the employee count and the total cost per key. For example, the key ("Sales", "Lead", "LA") appears three times in the sample data, so it reduces to a count of 3 and a cost of 96000.
val results = keyVals.reduceByKey { (a, b) => (a._1 + b._1, a._2 + b._2) } // (count + count, cost + cost)
Step 7: Save the results in a text file as below.
results.repartition(1).saveAsTextFile("spark10/group.txt")
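
Since the scenario asks for CSV-style rows (Dept, Desg, state, empCount, totalCost), an optional variation on Step 7, sketched here with the same variable names as above, is to format each key/value pair as a comma-separated line before saving:

val csvLines = results.map { case ((dep, des, state), (count, cost)) =>
  s"$dep, $des, $state, $count, ${cost.toLong}"
}
csvLines.repartition(1).saveAsTextFile("spark10/group.txt")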



Problem Scenario 37 : ABCTECH.com has run a survey on its Exam Products feedback using a web-based form, with the following free-text fields as input in the web UI.
Name: String
Subscription Date: String
Rating : String
The survey data has been saved in a file called spark9/feedback.txt:
Christopher|Jan 11, 2015|5
Kapil|11 Jan, 2015|5
Thomas|6/17/2014|5
John|22-08-2013|5
Mithun|2013|5
Jitendra||5
Write a Spark program using regular expressions that filters the records with valid dates and saves the records in two separate files (good records and bad records).

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Create a file first using Hue in hdfs.
Step 2: Write the regular expressions that check whether a record has a valid date.
val reg1 = """(\d+)\s(\w{3})(,)\s(\d{4})""".r // 11 Jan, 2015
val reg2 = """(\d+)(/)(\d+)(/)(\d{4})""".r // 6/17/2014
val reg3 = """(\d+)(-)(\d+)(-)(\d{4})""".r // 22-08-2013
val reg4 = """(\w{3})\s(\d+)(,)\s(\d{4})""".r // Jan 11, 2015
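
As a quick sanity check (the literal strings here are just sample values taken from the feedback file, not part of the solution), each pattern can be exercised in the spark-shell:

reg1.pattern.matcher("11 Jan, 2015").matches // true
reg2.pattern.matcher("6/17/2014").matches // true
reg3.pattern.matcher("22-08-2013").matches // true
reg4.pattern.matcher("Jan 11, 2015").matches // true
reg1.pattern.matcher("2013").matches // false, so "Mithun|2013|5" becomes a bad record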
Step 3: Load the file as an RDD.
val feedbackRDD = sc.textFile("spark9/feedback.txt")
Step 4: The data is pipe separated, so split each line on the pipe.
val feedbackSplit = feedbackRDD.map(line => line.split('|'))
Step 5: Now separate the valid records from the bad records.
val validRecords = feedbackSplit.filter(x =>
(reg1.pattern.matcher(x(1).trim).matches | reg2.pattern.matcher(x(1).trim).matches | reg3.pattern.matcher(x(1).trim).matches | reg4.pattern.matcher(x(1).trim).matches))
val badRecords = feedbackSplit.filter(x =>
!(reg1.pattern.matcher(x(1).trim).matches | reg2.pattern.matcher(x(1).trim).matches | reg3.pattern.matcher(x(1).trim).matches | reg4.pattern.matcher(x(1).trim).matches))
Step 6: Now convert each Array into a tuple of Strings.
val valid = validRecords.map(e => (e(0), e(1), e(2)))
val bad = badRecords.map(e => (e(0), e(1), e(2)))
Step 7: Save the output as text files; each output must be written to a single file.
valid.repartition(1).saveAsTextFile("spark9/good.txt")
bad.repartition(1).saveAsTextFile("spark9/bad.txt")
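
As an optional tidy-up (not part of the original answer), the four pattern checks from Step 5 can be wrapped in a small helper so the filter conditions stay readable; isValidDate is a hypothetical name used only for this sketch:

// Hypothetical helper wrapping the four regex checks from Step 5
def isValidDate(s: String): Boolean =
  reg1.pattern.matcher(s).matches || reg2.pattern.matcher(s).matches ||
  reg3.pattern.matcher(s).matches || reg4.pattern.matcher(s).matches

val validRecords = feedbackSplit.filter(x => isValidDate(x(1).trim))
val badRecords = feedbackSplit.filter(x => !isValidDate(x(1).trim))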



Problem Scenario 25 : You have been given the comma-separated employee information below, which needs to be appended to the file /home/cloudera/flumetest/in.txt (to be used with a tail source).
sex, name, city
1, alok, mumbai
1, jatin, chennai
1, yogesh, kolkata
2, ragini, delhi
2, jyotsana, pune
1, valmiki, banglore
Create a Flume conf file using the fastest non-durable channel, which writes data into the Hive warehouse directory, into two separate tables called flumemaleemployee1 and flumefemaleemployee1
(create the Hive tables for the given data as well). Please use a tail source with the /home/cloudera/flumetest/in.txt file.
flumemaleemployee1 : will contain only male employees' data
flumefemaleemployee1 : will contain only female employees' data

  1. See the explanation for Step by Step Solution and configuration.

Answer(s): A

Explanation:

Solution :
Step 1: Create the Hive tables flumemaleemployee1 and flumefemaleemployee1.
CREATE TABLE flumemaleemployee1
(
sex_type int, name string, city string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ', ';
CREATE TABLE flumefemaleemployee1
(
sex_type int, name string, city string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ', ';
Step 2: Create the directory below.
mkdir /home/cloudera/flumetest/
cd /home/cloudera/flumetest/
Step 3: Create a Flume configuration file with the source, sink and channel configuration below, and save it as flume5.conf.
agent.sources = tailsrc
agent.channels = mem1 mem2
agent.sinks = std1 std2
agent.sources.tailsrc.type = exec
agent.sources.tailsrc.command = tail -F /home/cloudera/flumetest/in.txt
agent.sources.tailsrc.batchSize = 1
agent.sources.tailsrc.interceptors = i1
agent.sources.tailsrc.interceptors.i1.type = regex_extractor
agent.sources.tailsrc.interceptors.i1.regex = ^(\\d)
agent.sources.tailsrc.interceptors.i1.serializers = t1
agent.sources.tailsrc.interceptors.i1.serializers.t1.name = type
agent.sources.tailsrc.selector.type = multiplexing
agent.sources.tailsrc.selector.header = type
agent.sources.tailsrc.selector.mapping.1 = mem1
agent.sources.tailsrc.selector.mapping.2 = mem2
agent.sinks.std1.type = hdfs
agent.sinks.std1.channel = mem1
agent.sinks.std1.hdfs.batchSize = 1
agent.sinks.std1.hdfs.path = /user/hive/warehouse/flumemaleemployee1
agent.sinks.std1.hdfs.rollInterval = 0
agent.sinks.std1.hdfs.fileType = DataStream
agent.sinks.std2.type = hdfs
agent.sinks.std2.channel = mem2
agent.sinks.std2.hdfs.batchSize = 1
agent.sinks.std2.hdfs.path = /user/hive/warehouse/flumefemaleemployee1
agent.sinks.std2.hdfs.rollInterval = 0
agent.sinks.std2.hdfs.fileType = DataStream
agent.channels.mem1.type = memory
agent.channels.mem1.capacity = 100
agent.channels.mem2.type = memory
agent.channels.mem2.capacity = 100
agent.sources.tailsrc.channels = mem1 mem2
Step 4: Run the command below, which uses this configuration file and appends data to HDFS.
Start flume service:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume5.conf --name agent
Step 5: Open another terminal and create a file at /home/cloudera/flumetest/in.txt.
Step 6: Enter the data below into the file and save it.
1, alok, mumbai
1, jatin, chennai
1, yogesh, kolkata
2, ragini, delhi
2, jyotsana, pune
1, valmiki, banglore
Step 7: Open Hue and check whether the data is available in the Hive tables.
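For instance, a quick check in the Hive query editor (the expected row counts refer to the sample data above) could be:
SELECT COUNT(*) FROM flumemaleemployee1;   -- expected: 4 (rows with sex_type = 1)
SELECT COUNT(*) FROM flumefemaleemployee1; -- expected: 2 (rows with sex_type = 2)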
Step 8: Stop the Flume service by pressing Ctrl+C.





