Amazon AWS Certified Data Engineer - Associate DEA-C01 Exam Questions
AWS Certified Data Engineer - Associate DEA-C01 (Page 11)

Updated On: 24-Mar-2026

A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.
Which solution will meet this requirement MOST cost-effectively?

  1. Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.
  2. Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.
  3. Use Amazon Athena Federated Query to join the data from all data sources.
  4. Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.

Answer(s): C

Explanation:

The correct answer is C.
A) An EMR cluster incurs provisioning and ongoing compute costs; for a one-time analysis it is not the most cost-effective option compared with managed federated querying.
B) Copying the data from DynamoDB, Amazon RDS, and Amazon Redshift into S3 adds ETL and storage costs and time, increasing the total cost of a one-time analysis.
C) Athena Federated Query provides on-demand, serverless access to multiple data sources (DynamoDB, RDS, Redshift, S3) with pay-per-query pricing, minimizing setup effort and cost for a one-off analysis.
D) Redshift Spectrum queries external data in S3, but it does not federate across DynamoDB and RDS; Athena Federated Query covers all four sources in a single query.
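As a sketch of option C, the query below joins tables across four catalogs. It assumes the Lambda-based data source connectors have already been deployed and registered under the hypothetical catalog names `dynamodb`, `rds_mysql`, and `redshift`; the table and column names are placeholders, and the S3 data is reached through the default `AwsDataCatalog`.

```python
QUERY = """
SELECT o.order_id, c.name, p.price
FROM "dynamodb"."default"."orders" o
JOIN "rds_mysql"."shop"."customers" c ON o.customer_id = c.id
JOIN "redshift"."public"."products" p ON o.product_id = p.product_id
JOIN "AwsDataCatalog"."weblogs"."sessions" s ON o.session_id = s.session_id
"""

def run_federated_query(output_s3: str) -> str:
    """Submit the federated query; Athena bills per query for data scanned."""
    import boto3  # local import so the sketch loads without the SDK installed
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=QUERY,
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Because nothing is provisioned up front, the only cost of the one-time analysis is the Athena scan plus the connectors' short-lived Lambda invocations.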



A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance.
Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)

  1. Use Hadoop Distributed File System (HDFS) as a persistent data store.
  2. Use Amazon S3 as a persistent data store.
  3. Use x86-based instances for core nodes and task nodes.
  4. Use Graviton instances for core nodes and task nodes.
  5. Use Spot Instances for all primary nodes.

Answer(s): B,D

Explanation:

A robust, cost-effective EMR setup uses S3 for persistent storage and Graviton-based core/task nodes for efficiency and price performance.
A) HDFS as persistent storage is discouraged for long-running, cost-optimized EMR workloads because S3 provides durable, scalable object storage with lower management overhead.
B) S3 as persistent data store is correct due to durability, lifecycle management, and lower maintenance for long-running Spark jobs.
C) x86-based instances for core/task is not as cost-efficient as Graviton2/3 for many EMR workloads.
D) Graviton instances offer better price/performance for Spark workloads on EMR, improving TCO.
E) Spot Instances for primary nodes risk interruption; if the primary node is reclaimed, the entire cluster fails, which is unacceptable for continuous, high-reliability workloads.
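A minimal cluster-definition sketch reflecting the two correct choices: Graviton (m6g) instance types for core and task nodes, with logs and data persisted in S3 via EMRFS. Bucket names, instance counts, and the release label are illustrative placeholders; the config dict matches the shape accepted by boto3's `run_job_flow`.

```python
CLUSTER_CONFIG = {
    "Name": "spark-analytics",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "LogUri": "s3://example-bucket/emr-logs/",  # persistent store is S3, not HDFS
    "Instances": {
        "InstanceGroups": [
            # Keep the primary node On-Demand: losing it terminates the cluster.
            {"InstanceRole": "MASTER", "InstanceType": "m6g.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m6g.2xlarge",
             "InstanceCount": 3, "Market": "ON_DEMAND"},
            # Task nodes hold no HDFS data, so Spot is a common cost lever here.
            {"InstanceRole": "TASK", "InstanceType": "m6g.2xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
    },
}

def uses_graviton(config: dict) -> bool:
    """Check that every instance group uses a Graviton family type."""
    groups = config["Instances"]["InstanceGroups"]
    return all(g["InstanceType"].startswith(("m6g", "c6g", "r6g")) for g in groups)
```

One would pass this as `boto3.client("emr").run_job_flow(**CLUSTER_CONFIG)`; the Spot-for-task-nodes choice is an extra cost optimization beyond the question, safe only because task nodes store no persistent data.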



A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools.
Which solution will meet these requirements with the LEAST operational overhead?

  1. Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
  2. Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
  3. Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
  4. Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

Answer(s): C

Explanation:

C) Correct: Amazon Redshift streaming ingestion lets you create an external schema that maps to a Kinesis data stream and define a materialized view over it with auto refresh, delivering near-real-time insights with minimal operational overhead. It avoids manual ETL and staging, and existing BI tools can query the materialized view directly.
A) Incorrect: Staging in S3 and loading with COPY introduces batching and latency; the data is not immediately available for real-time analysis.
B) Incorrect: A materialized view cannot be created directly on a stream without first mapping the stream through an external schema; refreshing the views manually also adds operational overhead.
D) Incorrect: Firehose plus S3 staging and COPY adds extra hops, latency, and components to operate compared with streaming ingestion into an auto-refreshing materialized view.
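The two statements behind option C can be sketched as follows; the schema, view, role ARN, and stream name are placeholders, and the DDL follows the documented Redshift streaming ingestion syntax (`CREATE EXTERNAL SCHEMA ... FROM KINESIS`, then a materialized view with `AUTO REFRESH YES`).

```python
EXTERNAL_SCHEMA_DDL = """
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole';
"""

MATERIALIZED_VIEW_DDL = """
CREATE MATERIALIZED VIEW mv_clicks AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kds."click-stream";
"""
```

BI tools then query `mv_clicks` like any table; Redshift refreshes it from the stream automatically, so there is no pipeline to operate.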



A company uses an Amazon QuickSight dashboard to monitor usage of one of the company's applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day.
A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs.
Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)

  1. Partition the data that is in the S3 bucket. Organize the data by year, month, and day.
  2. Increase the AWS Glue instance size by scaling up the worker type.
  3. Convert the AWS Glue schema to the DynamicFrame schema class.
  4. Adjust AWS Glue job scheduling frequency so the jobs run half as many times each day.
  5. Modify the IAM role that grants access to AWS glue to grant access to all S3 features.

Answer(s): A,B

Explanation:

A) Correct: Partitioning by year, month, and day reduces the data each run must scan, speeding up both the Glue ETL jobs and the downstream QuickSight queries.
B) Correct: A larger worker type increases parallelism and throughput, reducing job runtimes.
C) Incorrect: DynamicFrame versus DataFrame is a transformation API choice; it does not address the growing scan volume that slows the jobs.
D) Incorrect: Running the jobs half as often delays dashboard updates and does not make any individual run faster.
E) Incorrect: Broader IAM permissions have no effect on ETL performance.
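A small sketch of the partition layout from option A: Hive-style `year=/month=/day=` prefixes, which Glue crawlers and Athena recognize for partition pruning. The bucket and prefix names are placeholders.

```python
from datetime import date

def partition_prefix(base: str, d: date) -> str:
    """Build a Hive-style partitioned S3 prefix for one day's data."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

# partition_prefix("s3://app-usage/events", date(2024, 3, 7))
# -> "s3://app-usage/events/year=2024/month=03/day=07/"
```

With this layout a Glue job can restrict its read to recent partitions (for example via the `push_down_predicate` argument of `create_dynamic_frame.from_catalog`) instead of rescanning the whole bucket every day.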



A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must parallel process a large collection of data files and apply a specific transformation to each file.
Which Step Functions state should the data engineer use to meet these requirements?

  1. Parallel state
  2. Choice state
  3. Map state
  4. Wait state

Answer(s): C

Explanation:

The Map state iterates over a collection and applies the same processing to each item in parallel.
A) Parallel state runs a fixed set of distinct branches concurrently; it does not iterate over a dynamically sized collection of items.
B) Choice state selects between branches based on conditions, not for per-item processing across a collection.
C) Map state scales per-element processing by applying a defined workflow to each item in an input array, ideal for transforming every file in parallel.
D) Wait state introduces a delay and does not perform any per-item processing or parallel work.
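A hypothetical state-machine fragment showing the Map state shape: it iterates over the `files` array in the input and invokes the same (placeholder) Lambda transformation for each element, up to ten at a time. Expressed here as a Python dict in Amazon States Language form.

```python
import json

MAP_STATE = {
    "TransformFiles": {
        "Type": "Map",
        "ItemsPath": "$.files",          # the collection to iterate over
        "MaxConcurrency": 10,            # 0 would mean no concurrency limit
        "ItemProcessor": {
            "StartAt": "TransformOneFile",
            "States": {
                "TransformOneFile": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {
                        "FunctionName": "transform-file",  # placeholder name
                        "Payload.$": "$"                   # the current item
                    },
                    "End": True
                }
            }
        },
        "End": True
    }
}

# Serializes to the JSON you would paste into a state machine definition.
definition_fragment = json.dumps(MAP_STATE, indent=2)
```

A Parallel state, by contrast, would need one hard-coded branch per file, which is why it cannot handle a collection whose size varies per execution.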



A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information.
The data engineer must identify and remove duplicate information from the legacy application data.
Which solution will meet these requirements with the LEAST operational overhead?

  1. Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame.drop_duplicates() function by importing the Pandas library to perform data deduplication.
  2. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
  3. Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
  4. Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.

Answer(s): B

Explanation:

Summary: the AWS Glue FindMatches ML transform provides managed deduplication with minimal operational overhead.
A) Incorrect: Pandas drop_duplicates is in-memory and requires custom orchestration, not scalable with large S3 data; increases operational overhead.
B) Correct: AWS Glue FindMatches ML transform identifies duplicates with built-in, serverless deduplication; minimal maintenance and seamless integration with Glue ETL.
C) Incorrect: Python dedupe library requires custom code and management of similarity schemas and performance tuning; higher operational burden.
D) Incorrect: Importing Python dedupe in AWS Glue adds dependency management and custom logic, increasing complexity versus using managed FindMatches.
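A hedged sketch of registering a FindMatches transform through the boto3 Glue API. The database, table, column, and role names are placeholders; the parameter shape follows `create_ml_transform`.

```python
FIND_MATCHES_PARAMS = {
    "Name": "legacy-dedupe",
    "Role": "arn:aws:iam::123456789012:role/GlueServiceRole",  # placeholder
    "GlueVersion": "3.0",
    "InputRecordTables": [
        {"DatabaseName": "legacy_db", "TableName": "customers"}
    ],
    "Parameters": {
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",
            # Closer to 1.0 favors precision (fewer false merges).
            "PrecisionRecallTradeoff": 0.9,
        },
    },
}

def create_transform(params: dict) -> str:
    """Register the ML transform and return its id."""
    import boto3  # local import so the sketch loads without the SDK installed
    glue = boto3.client("glue")
    return glue.create_ml_transform(**params)["TransformId"]
```

After teaching the transform with labeled match/no-match examples, a Glue ETL job applies it to flag duplicate records, with no deduplication code to write or infrastructure to manage.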



A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3.
Which actions will provide the FASTEST queries? (Choose two.)

  1. Use gzip compression to compress individual files to sizes that are between 1 GB and 5 GB.
  2. Use a columnar storage file format.
  3. Partition the data based on the most common query predicates.
  4. Split the data into files that are less than 10 KB.
  5. Use file formats that are not splittable.

Answer(s): B,C

Explanation:

Using a columnar storage file format and partitioning the data by common predicates yields the fastest Redshift Spectrum queries.
A) Not correct: gzip files are not splittable, so 1–5 GB compressed files limit parallel reads; Spectrum performs better with splittable, columnar files of moderate size.
B) Correct: Columnar formats (e.g., ORC, Parquet) enable predicate pushdown and selective column reading, speeding scans.
C) Correct: Partitioning by common predicates reduces the data scanned and improves query performance via pruning.
D) Not correct: 10 KB files create excessive metadata operations and overhead, hurting performance.
E) Not correct: Non-splittable formats hinder parallelism and slow queries; splittable formats enable efficient parallel reads.
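The two winning choices combine naturally in one Spectrum table definition: columnar Parquet storage plus a partition column matching the common query predicate. The schema, column, and bucket names below are illustrative.

```python
EXTERNAL_TABLE_DDL = """
CREATE EXTERNAL TABLE spectrum.sales (
    order_id   bigint,
    amount     decimal(10,2)
)
PARTITIONED BY (sale_date date)
STORED AS PARQUET
LOCATION 's3://example-datalake/sales/';
"""

# A query such as
#   SELECT sum(amount) FROM spectrum.sales WHERE sale_date = '2024-03-01';
# prunes to one partition and reads only the "amount" column from Parquet.
```

Partition pruning cuts the files scanned; the columnar format then cuts the bytes read within each file, so the two choices compound.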



A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance.
The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet.
Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)

  1. Turn on the public access setting for the DB instance.
  2. Update the security group of the DB instance to allow only Lambda function invocations on the database port.
  3. Configure the Lambda function to run in the same subnet that the DB instance uses.
  4. Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.
  5. Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port.

Answer(s): C,D

Explanation:

C) Correct: Running the Lambda function in the same VPC subnets as the RDS instance keeps its traffic on the private network, enabling private connectivity without internet exposure.
D) Correct: Attaching the same security group to both the function and the DB instance, with a self-referencing inbound rule on the database port, permits traffic between members of the group without additional routing or public endpoints.
A) Incorrect: Turning on public access exposes the DB instance to the internet, contradicting the private-connectivity requirement.
B) Incorrect: A security-group rule alone does not place the Lambda function inside the VPC; without VPC configuration the function cannot reach the private subnet at all.
E) Incorrect: Modifying network ACLs adds unnecessary complexity; security-group isolation already suffices.
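The self-referencing rule from option D can be sketched as the `IpPermissions` entry below; the security group ID and port (5432 for PostgreSQL) are placeholders.

```python
def self_referencing_rule(sg_id: str, port: int) -> dict:
    """Ingress rule whose source is the same security group: members may
    reach other members on the database port, and nothing else may."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }

def authorize(sg_id: str, port: int) -> None:
    import boto3  # local import so the sketch loads without the SDK installed
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[self_referencing_rule(sg_id, port)],
    )
```

With the Lambda function's `VpcConfig` pointing at the DB instance's subnets and this shared group attached to both, no NACL edits, NAT, or public endpoints are involved.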



Viewing page 11 of 27
Viewing questions 81 - 88 out of 341 questions


