You are working with a large dataset of customer reviews stored in Cloud Storage. The dataset contains several inconsistencies, such as missing values, incorrect data types, and duplicate entries. You need to clean the data to ensure that it is accurate and consistent before using it for analysis.
What should you do?
- A. Use the PythonOperator in Cloud Composer to clean the data and load it into BigQuery. Use SQL for analysis.
- B. Use BigQuery to batch load the data into BigQuery. Use SQL for cleaning and analysis.
- C. Use Storage Transfer Service to move the data to a different Cloud Storage bucket. Use event triggers to invoke Cloud Run functions to load the data into BigQuery. Use SQL for analysis.
- D. Use Cloud Run functions to clean the data and load it into BigQuery. Use SQL for analysis.
Answer(s): B
Explanation:
Batch loading the data into BigQuery and then cleaning and analyzing it with SQL is the best approach for this scenario. BigQuery SQL can address each of the listed problems directly: COALESCE or IFNULL handles missing values, CAST or SAFE_CAST enforces correct data types, and SELECT DISTINCT or a ROW_NUMBER() window removes duplicate entries. Because cleaning and analysis both run in BigQuery, the pipeline stays simple. The other options each add a service (Cloud Composer, Storage Transfer Service, or Cloud Run functions) that must be deployed and maintained for work BigQuery can already do.
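The SQL cleaning step described above can be sketched locally. The example below is a minimal illustration, not a BigQuery job: it uses Python's built-in sqlite3 module as a stand-in so it runs without a Google Cloud project. The table names (reviews_raw, reviews_clean), the column names, and the sample rows are all hypothetical, and dropping rows with a missing rating is just one possible policy. The same COALESCE, CAST, and ROW_NUMBER() pattern carries over to BigQuery SQL, where SAFE_CAST is the usual choice when some values may not convert.

```python
import sqlite3

# Hypothetical raw rows as they might look after a batch load from CSV:
# (review_id, customer, rating stored as text, review_text)
RAW_ROWS = [
    ("r1", "alice", "5", "Great product"),
    ("r1", "alice", "5", "Great product"),   # exact duplicate entry
    ("r2", "bob", "3", None),                # missing review text
    ("r3", "carol", None, "Too expensive"),  # missing rating
]

# One SQL statement covers all three cleaning tasks:
#   duplicates    -> ROW_NUMBER() keeps one row per review_id
#   data types    -> CAST turns the text rating into an integer
#   missing values -> COALESCE fills missing text; rows with no rating are dropped
CLEAN_SQL = """
CREATE TABLE reviews_clean AS
SELECT
    review_id,
    customer,
    CAST(rating AS INTEGER) AS rating,
    COALESCE(review_text, '') AS review_text
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY review_id ORDER BY customer) AS row_num
    FROM reviews_raw
)
WHERE row_num = 1
  AND rating IS NOT NULL
"""


def clean_reviews(rows):
    """Load raw rows into an in-memory table, clean them with SQL, return the result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE reviews_raw "
            "(review_id TEXT, customer TEXT, rating TEXT, review_text TEXT)"
        )
        conn.executemany("INSERT INTO reviews_raw VALUES (?, ?, ?, ?)", rows)
        conn.execute(CLEAN_SQL)
        cursor = conn.execute(
            "SELECT review_id, customer, rating, review_text "
            "FROM reviews_clean ORDER BY review_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in clean_reviews(RAW_ROWS):
        print(row)
```

With the sample rows above, the cleaned table keeps one copy of r1, keeps r2 with an empty review text, and drops r3 because its rating is missing. Whether to drop, flag, or impute such rows depends on the analysis, so treat that WHERE clause as a placeholder for your own rule.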