You are working on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split it into train and test samples (in a 90/10 ratio). After training a neural network and evaluating it on the test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?
- Increase the share of the test sample in the train-test split.
- Try to collect more data and increase the size of your dataset.
- Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
- Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
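
For reference, the train/test comparison described in the question can be reproduced with a few lines of Python. This is only a minimal sketch: the helper `rmse` and the toy label and prediction lists are made up for illustration and are not part of the original problem.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy data: every train prediction is off by 2, every test prediction by 1.
train_true, train_pred = [1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 1.0, 2.0]
test_true, test_pred = [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 2.0, 3.0]

train_rmse = rmse(train_true, train_pred)
test_rmse = rmse(test_true, test_pred)

# A ratio above 1 means the error is larger on the train set than on the test set.
ratio = train_rmse / test_rmse
print(f"train RMSE={train_rmse}, test RMSE={test_rmse}, ratio={ratio}")
```

In the scenario above this ratio is 2, which is the observation the answer options need to explain.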