Amazon AWS Certified Data Engineer - Associate DEA-C01 Exam Questions
AWS Certified Data Engineer - Associate DEA-C01 (Page 4)

Updated On: 24-Mar-2026

A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour.
Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)

  A. Configure AWS Glue triggers to run the ETL jobs every hour.
  B. Use AWS Glue DataBrew to clean and prepare the data for analytics.
  C. Use AWS Lambda functions to schedule and run the ETL jobs every hour.
  D. Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.
  E. Use the Redshift Data API to load transformed data into Amazon Redshift.

Answer(s): A,D

Explanation:

Hourly triggers in AWS Glue provide automated, serverless ETL execution that matches the requirement with the least operational overhead. A) Glue triggers can schedule ETL jobs on an hourly cadence without manual intervention or an external scheduler. D) Glue connections provide secure, managed connectivity to Amazon RDS, MongoDB, and Amazon Redshift from within Glue's managed environment, so no custom networking or driver setup is needed. B) DataBrew is a visual data preparation tool, not a full ETL workflow from multiple data stores into Redshift. C) Scheduling with Lambda adds orchestration and state-management overhead that Glue triggers handle natively. E) The Redshift Data API is for issuing SQL from applications; it does not orchestrate ETL pipelines or replace Glue's managed load into Redshift.
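The hourly schedule in option A can be sketched as a Glue trigger definition. This is a minimal sketch: the trigger and job names are hypothetical, and the dict mirrors the payload that boto3's `glue.create_trigger` accepts.

```python
# Sketch of option A: a time-based Glue trigger that fires at the top of
# every hour. Names are hypothetical placeholders.
import json

trigger_definition = {
    "Name": "hourly-etl-trigger",           # hypothetical trigger name
    "Type": "SCHEDULED",                    # time-based (not on-demand/conditional)
    "Schedule": "cron(0 * * * ? *)",        # AWS cron syntax: minute 0 of every hour
    "StartOnCreation": True,
    "Actions": [
        {"JobName": "rds-mongo-to-redshift"}  # hypothetical Glue ETL job name
    ],
}

# In a real account this payload would be sent with:
#   import boto3
#   boto3.client("glue").create_trigger(**trigger_definition)
print(json.dumps(trigger_definition, indent=2))
```

Because the trigger lives inside Glue, there is no Lambda function or external scheduler to maintain, which is the "least operational overhead" point the explanation makes.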



A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling.
Which solution will meet this requirement?

  A. Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups.
  B. Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.
  C. Turn on concurrency scaling in the settings during the creation of any new Redshift cluster.
  D. Turn on concurrency scaling for the daily usage quota for the Redshift cluster.

Answer(s): B

Explanation:

Concurrency scaling in Amazon Redshift is enabled at the WLM queue level of a provisioned cluster. When it is turned on for a queue, Redshift automatically adds transient capacity to handle bursts of read and write queries without user intervention. A) is incorrect because Redshift Serverless scales automatically on its own; the concurrency scaling setting applies to provisioned clusters, not Serverless workgroups. B) is correct. C) is incorrect because concurrency scaling is not a global toggle at cluster creation; it is configured per WLM queue. D) is incorrect because usage quotas only cap how much concurrency scaling is used; they do not turn the feature on.
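The queue-level setting in option B lives in the cluster parameter group's `wlm_json_configuration`. A minimal sketch, with hypothetical queue/user-group names; the key piece is `"concurrency_scaling": "auto"` on the queue that should scale.

```python
# Sketch of option B: a WLM configuration with concurrency scaling turned on
# for one manual queue. Group names and concurrency values are hypothetical.
import json

wlm_config = [
    {
        "queue_type": "manual",
        "user_group": ["bi_users"],        # hypothetical user group
        "query_group": [],
        "query_concurrency": 5,
        "concurrency_scaling": "auto",     # enable concurrency scaling here
    },
    {
        "queue_type": "manual",
        "user_group": [],
        "query_group": [],
        "query_concurrency": 5,
        "concurrency_scaling": "off",      # default queue left unscaled
    },
]

# Applied to the cluster's parameter group, e.g.:
#   boto3.client("redshift").modify_cluster_parameter_group(
#       ParameterGroupName="my-wlm-group",   # hypothetical
#       Parameters=[{"ParameterName": "wlm_json_configuration",
#                    "ParameterValue": json.dumps(wlm_config)}])
print(json.dumps(wlm_config))
```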



A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes.
Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)

  A. Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
  B. Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state that periodically checks whether the Athena query has finished, using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.
  C. Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
  D. Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.
  E. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.

Answer(s): A,B

Explanation:

Because each query can run longer than 15 minutes, the orchestration must not wait for completion inside a single Lambda invocation (Lambda's maximum timeout is 15 minutes). The cost-effective pattern is to submit each query from a short-lived Lambda function and let Step Functions handle the waiting.
A) Lambda with the start_query_execution API call submits each query in seconds, so the function finishes long before the query does and incurs almost no cost; no servers are provisioned for the daily runs.
B) A Step Functions Wait state combined with a get_query_execution check polls for completion without any compute running between polls, then sequences the next query reliably.
C) A Glue Python shell job bills for the whole time it runs, making it a more expensive way to submit queries than Lambda.
D) A sleep-timer loop in a Glue Python shell job keeps the job, and its billing, running for the full duration of every query.
E) MWAA adds an always-on managed Airflow environment (plus AWS Batch) for what is a simple sequential workflow, which is not cost-optimal.
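The A + B pattern can be sketched as an Amazon States Language (ASL) definition built in Python. The Lambda ARNs, state names, and the `$.status` field are hypothetical placeholders; the point is the Wait/Choice loop that costs nothing while the Athena query runs.

```python
# Sketch of the Step Functions polling loop from options A and B.
# ARNs and state names are hypothetical.
import json

state_machine = {
    "Comment": "Run an Athena query, wait, poll until it finishes",
    "StartAt": "StartQuery",
    "States": {
        "StartQuery": {            # Lambda calls athena start_query_execution
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:start-query",
            "Next": "WaitForQuery",
        },
        "WaitForQuery": {          # no compute is billed during the wait
            "Type": "Wait",
            "Seconds": 60,
            "Next": "CheckQuery",
        },
        "CheckQuery": {            # Lambda calls athena get_query_execution
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:check-query",
            "Next": "IsQueryDone",
        },
        "IsQueryDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "SUCCEEDED",
                 "Next": "Done"}
            ],
            "Default": "WaitForQuery",   # loop back and keep polling
        },
        "Done": {"Type": "Succeed"},
    },
}
print(json.dumps(state_machine, indent=2))
```

To chain multiple daily queries, the "Done" state would hand off to the next StartQuery-style task instead of succeeding.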



A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options.
The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache Hbase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS.
Which extract, transform, and load (ETL) service will meet these requirements?

  A. AWS Glue
  B. Amazon EMR
  C. AWS Lambda
  D. Amazon Redshift

Answer(s): B

Explanation:

Amazon EMR is the right fit because it provides managed clusters for the full big data stack the company already uses: Hadoop, Spark, HBase, Flink, Pig, and Oozie. The existing workloads can migrate with minimal rework while keeping petabyte-scale performance, and EMR reduces operational overhead versus self-managed clusters, with EMR Serverless available for the serverless options the company wants to explore.
A) AWS Glue is serverless but is built around Spark-based ETL; it does not natively run Pig, Oozie, HBase, or Flink workloads.
C) AWS Lambda is serverless compute with a 15-minute limit; it is not suitable for long-running, petabyte-scale big data pipelines.
D) Amazon Redshift is a data warehouse, not an ETL service, and does not run Pig/Oozie workflows or HBase/Flink processing.
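The migration described above can be sketched as an EMR cluster request carrying the same frameworks. Cluster name, release label, instance types, and counts are hypothetical; the `Applications` list is what lets the existing Pig/Oozie/Spark/HBase/Flink workloads move over largely unchanged.

```python
# Sketch of option B: an EMR cluster request (the shape accepted by
# boto3's emr.run_job_flow). All names and sizes are hypothetical.
emr_request = {
    "Name": "migrated-etl-cluster",
    "ReleaseLabel": "emr-6.15.0",           # hypothetical EMR release
    "Applications": [                        # the on-prem framework stack
        {"Name": "Spark"}, {"Name": "HBase"}, {"Name": "Flink"},
        {"Name": "Pig"}, {"Name": "Oozie"},
    ],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "r5.4xlarge",
             "InstanceCount": 20},            # sized for petabyte-scale work
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Would be submitted with boto3.client("emr").run_job_flow(**emr_request)
app_names = {a["Name"] for a in emr_request["Applications"]}
print(sorted(app_names))
```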



A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII.
Which solution will meet this requirement with the LEAST operational effort?

  A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.
  B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
  C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
  D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.

Answer(s): B

Explanation:

The Detect PII transform in AWS Glue Studio provides built-in profiling and PII detection with minimal setup. Pairing it with Glue Studio's obfuscation of the detected values and a Step Functions state machine for ingestion yields a serverless pipeline that profiles and masks the data before it lands in S3, with the least operational effort.
A) Requires writing and maintaining a custom Lambda transform and SDK-based masking logic, increasing operational overhead.
C) AWS Glue Data Quality evaluates rules against data; it does not obfuscate values, so this option adds a tool that cannot perform the masking step.
D) Routes the data through DynamoDB and custom Lambda code for detection, obfuscation, and ingestion, adding complexity and latency.
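For intuition about the obfuscation step, here is an illustration only: this is not the Glue Studio transform itself, just a minimal regex sketch of masking two common PII shapes (email addresses and US SSNs) the way a detected field would be replaced before landing in the data lake.

```python
# Illustrative PII masking, NOT the actual Glue Detect PII transform.
# The regex patterns cover only simple email and SSN shapes.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def obfuscate(text: str) -> str:
    """Replace detected PII with fixed mask tokens."""
    text = EMAIL.sub("####@####", text)
    return SSN.sub("###-##-####", text)

masked = obfuscate("Contact jane.doe@example.com, SSN 123-45-6789.")
print(masked)  # Contact ####@####, SSN ###-##-####.
```

In Glue Studio, the Detect PII transform identifies such columns automatically and can redact or replace them without hand-written patterns, which is why option B has the least operational effort.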



A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data.
The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort.
Which solution will meet these requirements with the LEAST operational overhead?

  A. AWS Glue workflows
  B. AWS Step Functions tasks
  C. AWS Lambda functions
  D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows

Answer(s): B

Explanation:

AWS Step Functions provides automated orchestration with minimal manual effort: it can coordinate Glue jobs, EMR steps, and other AWS services in serverless workflows with built-in retries, error handling, and visual monitoring.
A) AWS Glue workflows orchestrate Glue jobs and crawlers well but offer limited coordination of non-Glue services such as EMR.
C) AWS Lambda functions would require custom orchestration logic and cannot wait on long-running tasks, increasing operational effort.
D) Amazon MWAA provides Airflow-based orchestration but requires managing an Airflow environment, which is heavier than serverless Step Functions workflows.



A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.
A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.
The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.
Which solution will meet these requirements in the MOST cost-effective way?

  A. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
  B. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
  C. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
  D. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

Answer(s): B

Explanation:

Transitioning to S3 Standard-IA after 6 months keeps objects replicated across multiple Availability Zones, preserving high availability while cutting cost for data accessed only once or twice a month. Moving to S3 Glacier Flexible Retrieval after 2 years lowers cost further while still supporting the once- or twice-yearly retrievals. This matches the tiered access pattern: frequent early, then infrequent, then archival, without sacrificing availability.
A) S3 One Zone-IA stores data in a single Availability Zone, so it does not meet the high-availability requirement.
C) S3 Glacier Deep Archive has the lowest storage price, but retrievals take hours, a poor fit for data that is still retrieved once or twice a year.
D) Combines the single-AZ drawback of One Zone-IA with Deep Archive's long retrieval times.
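The rule in option B can be sketched as a lifecycle configuration payload. The rule ID is hypothetical, and 180 and 730 days approximate the 6-month and 2-year boundaries; `GLACIER` is the storage-class value for Glacier Flexible Retrieval.

```python
# Sketch of option B as the payload shape accepted by boto3's
# s3.put_bucket_lifecycle_configuration. Rule ID is hypothetical.
lifecycle = {
    "Rules": [
        {
            "ID": "tiered-retention",        # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},        # apply to the whole bucket
            "Transitions": [
                {"Days": 180, "StorageClass": "STANDARD_IA"},  # ~6 months
                {"Days": 730, "StorageClass": "GLACIER"},      # ~2 years
            ],
        }
    ]
}

# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-bucket",              # hypothetical bucket
#     LifecycleConfiguration=lifecycle)
days = [t["Days"] for t in lifecycle["Rules"][0]["Transitions"]]
print(days)  # [180, 730]
```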



A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.
The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster.
The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.
Which solution will meet these requirements?

  A. Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
  B. Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
  C. Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
  D. Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.

Answer(s): A

Explanation:

Redshift data sharing lets a consumer cluster (the sales BI cluster) query live data on the producer cluster (the ETL cluster) without copying it, and consumer queries run on the consumer's own compute.
A) Correct. Data sharing gives the sales team live, cross-cluster read access that they can join with their own BI data; the consumer cluster supplies the compute, so the ETL cluster's critical analysis tasks are not interrupted.
B) Incorrect. Materialized views require refreshes on the ETL cluster, and granting direct access puts the sales team's queries on the ETL cluster's compute, risking contention.
C) Incorrect. Database views do not span clusters, and direct access again consumes ETL cluster resources.
D) Incorrect. Weekly UNLOAD to S3 duplicates data, leaves it up to a week stale, and adds an export pipeline to maintain.
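The producer/consumer setup behind option A can be sketched as SQL statements, held here as Python strings. The share name, schema name, and namespace GUIDs are hypothetical placeholders.

```python
# Sketch of option A: Redshift data sharing DDL. Share, schema, and
# namespace values are hypothetical placeholders.
producer_sql = [  # run on the ETL (producer) cluster
    "CREATE DATASHARE etl_share;",
    "ALTER DATASHARE etl_share ADD SCHEMA etl_schema;",
    "ALTER DATASHARE etl_share ADD ALL TABLES IN SCHEMA etl_schema;",
    # Grant to the BI cluster's namespace (placeholder GUID):
    "GRANT USAGE ON DATASHARE etl_share TO NAMESPACE "
    "'00000000-0000-0000-0000-000000000000';",
]

consumer_sql = [  # run on the sales BI (consumer) cluster
    # Expose the shared data as a local database (placeholder GUID):
    "CREATE DATABASE etl_live FROM DATASHARE etl_share OF NAMESPACE "
    "'11111111-1111-1111-1111-111111111111';",
]

for stmt in producer_sql + consumer_sql:
    print(stmt)
```

After the consumer-side `CREATE DATABASE`, the sales team queries `etl_live` tables and joins them with local BI tables, all on the consumer cluster's compute.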


