You are training an object detection machine learning model on a dataset that consists of three million X-ray images, each roughly 2 GB in size. You are using Vertex AI Training to run a custom training application on a Compute Engine instance with 32 cores, 128 GB of RAM, and one NVIDIA P100 GPU. You notice that model training is taking a very long time. You want to decrease training time without sacrificing model performance. What should you do?
- Increase the instance memory to 512 GB and increase the batch size.
- Replace the NVIDIA P100 GPU with a v3-32 TPU in the training job.
- Enable early stopping in your Vertex AI Training job.
- Use the tf.distribute.Strategy API and run a distributed training job.