Convolutional Neural Networks (CNNs) have been shown to be effective in “end-to-end” speech modeling tasks such as acoustic scene/object classification, automatic speech recognition, and emotion recognition from speech. Training these deep architectures requires large amounts of speech data; unfortunately, the relatively small size of emotion datasets hinders the application of deep architectures to emotion estimation. We explore transfer learning with fully convolutional neural networks as a way around this limitation. Specifically, we train a model to detect anger in speech by fine-tuning SoundNet, a fully convolutional neural network trained to classify natural environmental sounds. SoundNet was trained multimodally on a large dataset of paired audio and video, with ground truth generated by vision-based classifiers. In our experiments, we use acted, elicited, and natural emotion datasets to evaluate model performance. We also examine the language dependence of anger expression in speech by testing on a dataset in a different language.
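To make the fine-tuning setup concrete, the following is a minimal PyTorch sketch of the transfer-learning recipe described above: a pretrained fully convolutional encoder over raw waveforms is reused, its earliest layers are frozen, and a new fully convolutional binary head is trained for anger detection. The layer sizes, checkpoint path, and class names below are illustrative assumptions, not SoundNet's actual configuration or the paper's exact training procedure.

```python
# Hedged transfer-learning sketch. The conv stack is a simplified stand-in
# for SoundNet's 1-D convolutional feature extractor over raw audio;
# "soundnet_pretrained.pth" is a hypothetical checkpoint path.
import torch
import torch.nn as nn

class SoundNetLikeEncoder(nn.Module):
    """Fully convolutional encoder over raw waveforms (illustrative sizes)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=2, padding=32),
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=16),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(32, 64, kernel_size=16, stride=2, padding=8),
            nn.BatchNorm1d(64), nn.ReLU(),
        )

    def forward(self, x):  # x: (batch, 1, num_samples)
        return self.features(x)

class AngerClassifier(nn.Module):
    """Pretrained encoder plus a new fully convolutional binary head."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        # A 1x1 convolution keeps the model fully convolutional; global
        # average pooling over time handles variable-length inputs.
        self.head = nn.Conv1d(64, 2, kernel_size=1)

    def forward(self, x):
        h = self.encoder(x)                 # (batch, 64, T')
        return self.head(h).mean(dim=-1)    # (batch, 2) logits

encoder = SoundNetLikeEncoder()
# encoder.load_state_dict(torch.load("soundnet_pretrained.pth"))  # hypothetical

# Freeze the earliest conv/batch-norm parameters; fine-tune the rest.
for param in list(encoder.features.parameters())[:8]:
    param.requires_grad = False

model = AngerClassifier(encoder)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch of raw audio.
waveforms = torch.randn(4, 1, 22050)        # four 1-second clips at 22.05 kHz
labels = torch.randint(0, 2, (4,))          # 0 = not angry, 1 = angry
loss = criterion(model(waveforms), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Keeping the network fully convolutional, as in SoundNet, lets the same model process utterances of arbitrary duration; only the temporal pooling step reduces the output to a fixed-size prediction.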