In this presentation, we introduce the recommended technical architecture for content recommendation scenarios on Alibaba Cloud, along with our performance optimization work and its results for this scenario on large-scale distributed GPU VM nodes in Alibaba Cloud. We need to train on about 20 billion samples within an hour, which is a very challenging goal. The model has a high communication-to-computation ratio and is implemented in TensorFlow, which scales poorly across large numbers of distributed nodes, and especially poorly over a cloud virtual network. What's more, performance is severely bottlenecked by distributed file reading. We optimized both communication and I/O, achieving a speedup of over 14x on 64 GPU VMs compared with the original implementation, and ultimately trained over 20 billion samples within an hour on 64 GPU VMs in Alibaba Cloud.
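To give a sense of scale, the figures quoted above can be turned into a rough throughput budget. The sketch below is only back-of-the-envelope arithmetic: it assumes the load is spread evenly across the 64 VMs, that the run takes exactly one hour, and that the speedup is exactly 14x measured on the same 64 VMs. The variable and function names are illustrative and do not come from the talk.

```python
# Back-of-the-envelope throughput implied by the figures in the abstract.
TOTAL_SAMPLES = 20_000_000_000   # "about 20 billion samples"
TIME_BUDGET_S = 3600             # "within an hour", taken as exactly one hour
NUM_VMS = 64                     # GPU VMs used in the optimized run
SPEEDUP = 14                     # "over 14x", taken as exactly 14 (assumption)


def required_throughput(total_samples, seconds):
    """Samples per second needed to process total_samples in the given time."""
    return total_samples / seconds


# Aggregate rate the whole cluster must sustain.
cluster_rate = required_throughput(TOTAL_SAMPLES, TIME_BUDGET_S)

# Per-VM rate, assuming a perfectly even split across VMs (assumption).
per_vm_rate = cluster_rate / NUM_VMS

# Rough rate of the original implementation on the same 64 VMs,
# assuming the speedup is exactly 14x (assumption).
baseline_rate = cluster_rate / SPEEDUP

# Time the original implementation would need at that rate, in hours.
baseline_hours = TOTAL_SAMPLES / baseline_rate / 3600

print(f"cluster:  {cluster_rate:,.0f} samples/s")
print(f"per VM:   {per_vm_rate:,.0f} samples/s")
print(f"baseline: {baseline_rate:,.0f} samples/s (~{baseline_hours:.0f} h for the full set)")
```

Under these assumptions the cluster has to sustain roughly 5.6 million samples per second, or about 87 thousand per VM, which is why both the communication path and the file-reading path have to keep up.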
Liang leads the Elastic Artificial Intelligence Team at Alibaba Cloud, a subsidiary of Alibaba Group. He focuses on AI platform solutions and performance optimization for large-scale distributed deep learning training and inference on the GPU platform of Alibaba Cloud...