June 25 - 27 - Beijing, China

Wednesday, June 27 • 11:30 - 12:10
Performance Optimization for Content Recommendation Workload on Large-scale Distributed GPU VM Nodes on Alibaba Cloud - Liang You, Alibaba Cloud (slides attached)

In this presentation, we introduce the recommended technical architecture for content recommendation scenarios on Alibaba Cloud, along with our performance optimization work and its results for this scenario on large-scale distributed GPU VM nodes. We need to train on about 20 billion samples within an hour, which is a very challenging goal. The model has a high communication-to-computation ratio and is implemented in TensorFlow, which scales very poorly across large-scale distributed nodes, and especially poorly over the cloud's virtual network. In addition, performance is heavily bottlenecked by distributed file reading. We optimized both communication and I/O, achieved a speedup of over 14x on 64 GPU VMs compared with the original implementation, and ultimately trained on over 20 billion samples within an hour on 64 GPU VMs in Alibaba Cloud.
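To give a sense of scale, the figures quoted in the abstract can be turned into rough throughput numbers. The short Python sketch below is only back-of-the-envelope arithmetic, not anything from the talk itself. The function names are made up for illustration, and it assumes the 14x speedup was measured on the same 64 GPU VMs and that "within an hour" means exactly 3600 seconds. The abstract does not state how many GPUs each VM has, so the per-VM figure is not broken down further.

```python
# Back-of-the-envelope throughput implied by the abstract's figures.
# All constants come from the abstract; the helper names are hypothetical.

SAMPLES = 20_000_000_000  # "about 20 billion samples"
SECONDS = 3600            # "within an hour" (assumed to be exactly 1 h)
NUM_VMS = 64              # "64 GPU VMs"
SPEEDUP = 14              # "over 14x speedup" (treated as a lower bound)


def aggregate_throughput(samples: int, seconds: int) -> float:
    """Samples per second the whole cluster must sustain."""
    return samples / seconds


def per_vm_throughput(samples: int, seconds: int, num_vms: int) -> float:
    """Samples per second each VM must sustain, assuming an even split."""
    return aggregate_throughput(samples, seconds) / num_vms


def baseline_throughput(samples: int, seconds: int, speedup: float) -> float:
    """Upper bound on the original throughput.

    Assumption: the speedup is measured on the same 64 VMs, so the
    original implementation ran at most 1/speedup of the final rate.
    """
    return aggregate_throughput(samples, seconds) / speedup


total = aggregate_throughput(SAMPLES, SECONDS)
per_vm = per_vm_throughput(SAMPLES, SECONDS, NUM_VMS)
baseline = baseline_throughput(SAMPLES, SECONDS, SPEEDUP)

print(f"cluster:  {total:,.0f} samples/s")     # about 5.56 million
print(f"per VM:   {per_vm:,.0f} samples/s")    # about 86.8 thousand
print(f"baseline: {baseline:,.0f} samples/s")  # at most about 397 thousand
```

Under these assumptions, the original implementation would have needed at least 14 hours to process the same 20 billion samples on the same 64 VMs.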

Speakers

Liang You (游亮)

Senior Technical Expert, Alibaba Cloud
Liang is in charge of the Elastic Artificial Intelligence Team at Alibaba Cloud, a subsidiary of Alibaba Group. He focuses on AI platform solutions and performance optimization for large-scale distributed deep learning training and inference on Alibaba Cloud's GPU platform...



Wednesday June 27, 2018 11:30 - 12:10
306A