Become Amazon Certified with updated Data-Engineer-Associate exam questions and correct answers
A company has three subsidiaries. Each subsidiary uses a different data warehousing solution. The first subsidiary hosts its data warehouse in Amazon Redshift. The second subsidiary uses Teradata Vantage on AWS. The third subsidiary uses Google BigQuery.

The company wants to aggregate all the data into a central Amazon S3 data lake. The company wants to use Apache Iceberg as the table format.

A data engineer needs to build a new pipeline to connect to all the data sources, run transformations by using each source engine, join the data, and write the data to Iceberg.

Which solution will meet these requirements with the LEAST operational effort?
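One low-effort pattern for a scenario like this is to register each source as an Amazon Athena federated data source and write the joined result to an Iceberg table with a single CTAS statement. The sketch below is illustrative only, not the graded answer: the catalog names, table names, and join columns are hypothetical, and the AWS call is skipped unless `dry_run=False`.

```python
# Sketch: Athena federated query joining three external warehouses and
# writing the result to an Iceberg table via CTAS. All catalog, schema,
# table, and column names below are hypothetical.
import textwrap

ICEBERG_CTAS = textwrap.dedent("""\
    CREATE TABLE lake.sales_iceberg
    WITH (table_type = 'ICEBERG',
          location = 's3://example-data-lake/sales_iceberg/',
          format = 'PARQUET')
    AS
    SELECT r.order_id, r.amount, t.region, b.channel
    FROM redshift_catalog.sales.orders      r
    JOIN teradata_catalog.sales.territories t ON r.territory_id = t.id
    JOIN bigquery_catalog.sales.channels    b ON r.channel_id   = b.id
    """)

def submit(sql, workgroup="primary", dry_run=True):
    """Build (and optionally submit) the Athena query.

    dry_run=True returns the request instead of calling AWS, so the
    sketch can run without credentials.
    """
    if dry_run:
        return {"Sql": sql, "WorkGroup": workgroup}
    import boto3  # requires AWS credentials and configured connectors
    athena = boto3.client("athena")
    return athena.start_query_execution(QueryString=sql, WorkGroup=workgroup)

request = submit(ICEBERG_CTAS)
```

Because Athena is serverless and the connectors push work down to each source engine, this keeps the pipeline itself free of managed infrastructure.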
A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file.

The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files into the table. The company must increase the speed of the data ingestion. The company does not want to increase the cost of the process.

Which solution will meet these requirements?
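The usual remedy for many sequential per-location COPY commands is a single COPY that reads a manifest file, so the cluster's slices can fetch the listed files in parallel. The sketch below is a hedged illustration; the bucket paths, table name, and IAM role ARN are hypothetical, and the Redshift Data API call is skipped unless `dry_run=False`.

```python
# Sketch: one manifest-driven COPY instead of many per-prefix COPYs.
# All S3 paths, the table name, and the role ARN are hypothetical.
import json

# Manifest listing files from every source location; Redshift loads
# these entries in parallel across its slices.
manifest = {
    "entries": [
        {"url": "s3://example-lake/source_a/part-0000.csv", "mandatory": True},
        {"url": "s3://example-lake/source_b/part-0000.csv", "mandatory": True},
        {"url": "s3://example-lake/source_c/part-0000.csv", "mandatory": True},
    ]
}
manifest_json = json.dumps(manifest)  # upload this to S3 before the COPY

COPY_SQL = (
    "COPY analytics.events "
    "FROM 's3://example-lake/manifests/events.manifest' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
    "MANIFEST CSV;"
)

def run_copy(dry_run=True):
    """Return the COPY statement, or execute it via the Redshift Data API."""
    if dry_run:
        return COPY_SQL
    import boto3  # requires AWS credentials
    rs = boto3.client("redshift-data")
    return rs.execute_statement(
        ClusterIdentifier="example-cluster", Database="dev", Sql=COPY_SQL)
```

Since the same data volume is loaded on the same cluster, this speeds up ingestion without adding cost.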
A company has a JSON file that contains personally identifiable information (PII) data and non-PII data. The company needs to make the data available for querying and analysis. The non-PII data must be available to everyone in the company. The PII data must be available only to a limited group of employees.

Which solution will meet these requirements with the LEAST operational overhead?
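A low-overhead way to split access like this is AWS Lake Formation column-level permissions: grant the broad audience SELECT on all columns except the PII ones, and grant the limited group SELECT on the full table. The sketch below only builds the `grant_permissions` request (the database, table, column, and principal names are hypothetical) and skips the AWS call unless `dry_run=False`.

```python
# Sketch: Lake Formation column-level grant that hides PII columns from
# the company-wide principal. Names and ARNs below are hypothetical.

PII_COLUMNS = ["ssn", "email", "phone"]

def grant_non_pii(dry_run=True):
    """Grant SELECT on every column except the PII columns."""
    request = {
        "Principal": {
            "DataLakePrincipal": "arn:aws:iam::123456789012:role/AllEmployees"
        },
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": "analytics",
                "Name": "customers",
                # Column wildcard minus the PII columns: everything else
                # in the table stays queryable.
                "ColumnWildcard": {"ExcludedColumnNames": PII_COLUMNS},
            }
        },
        "Permissions": ["SELECT"],
    }
    if dry_run:
        return request
    import boto3  # requires AWS credentials and Lake Formation setup
    return boto3.client("lakeformation").grant_permissions(**request)

req = grant_non_pii()
```

A second, full-table grant to the restricted group's principal completes the picture; no views or duplicate datasets need to be maintained.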
A data engineer has two datasets that contain sales information for multiple cities and states. One dataset is named reference, and the other dataset is named primary. The data engineer needs a solution to determine whether a specific set of values in the city and state columns of the primary dataset exactly match the same specific values in the reference dataset. The data engineer wants to use Data Quality Definition Language (DQDL) rules in an AWS Glue Data Quality job. Which rule will meet these requirements?
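Cross-dataset checks like this are expressed in DQDL with a rule that compares the primary dataset's columns against a named reference dataset. A minimal sketch, assuming the reference dataset is registered in the job under the alias `reference` and the columns are literally `city` and `state`; verify the exact rule syntax against the DQDL reference before use.

```python
# Sketch: a DQDL ruleset string for an AWS Glue Data Quality job. The
# ReferentialIntegrity rule checks that city/state value pairs in the
# primary dataset all appear in the "reference" dataset (ratio = 1.0).

RULESET = """
Rules = [
    ReferentialIntegrity "city,state" "reference.{city,state}" = 1.0
]
"""
```

The `= 1.0` threshold requires a complete match; a lower ratio would tolerate partial overlap.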
A large e-commerce company is looking to improve the search and recommendation capabilities on its platform. The company's data engineering team has recently built a data lake on Amazon S3, consisting of user interaction logs, product catalog information, and transactional data. The data is ingested from various sources, including RDBMS exports, streamed clickstream data, and batch processed log files, resulting in diverse data formats such as JSON, CSV, and Parquet.
The team wants to leverage this data for advanced analytics and machine learning but is facing challenges in consistently cataloging and querying this data efficiently. Additionally, they need to manage frequently evolving schemas as new product attributes and user interaction types are introduced.
How should the team use AWS Glue to address these challenges?
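The standard Glue approach for this scenario is crawlers that populate the Data Catalog across the mixed JSON/CSV/Parquet prefixes, with a schema change policy that updates table definitions as new attributes appear. The sketch below only assembles the `create_crawler` request; the crawler name, role ARN, database, and S3 paths are hypothetical, and the AWS call is skipped unless `dry_run=False`.

```python
# Sketch: a Glue crawler over the data lake's prefixes, configured to
# absorb schema evolution. Names, ARN, and paths are hypothetical.

def crawler_request():
    return {
        "Name": "datalake-crawler",
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "DatabaseName": "ecommerce_lake",
        "Targets": {"S3Targets": [
            {"Path": "s3://example-lake/clickstream/"},
            {"Path": "s3://example-lake/catalog/"},
            {"Path": "s3://example-lake/transactions/"},
        ]},
        # Update existing table definitions when new columns appear;
        # log (rather than delete) when objects disappear.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

def create_crawler(dry_run=True):
    """Build (and optionally submit) the crawler definition."""
    req = crawler_request()
    if dry_run:
        return req
    import boto3  # requires AWS credentials
    return boto3.client("glue").create_crawler(**req)

req = create_crawler()
```

Once cataloged, the same tables serve Athena, Glue ETL jobs, and ML feature pipelines without per-format custom readers.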
© Copyright DumpsCertify 2025. All Rights Reserved