Owing to its low latency and high bandwidth with congestion, RDMA has been adopted by many Alibaba Cloud services, such as storage and HPC. Compared with the mature TCP stack in the Linux kernel, developing RDMA in userspace poses many challenges. First, RDMA is harder to share and monitor in userspace, and also hard to virtualize. Second, RDMA is more sensitive to performance, failures and cluster scale. Lastly, RDMA is unfriendly for software developers to program.
In spite of these challenges, Alibaba services such as Pangu Storage have deployed large-scale RDMA clusters in production and development. In this presentation, we share our views on these challenges at Alibaba.
Li Qiang is a senior engineer at Alibaba Cloud, where his work focuses on the Pangu distributed storage system for large-scale services. He received his Ph.D. degree from the Institute of Computing Technology in 2012. Since 2012, he has been an associate professor in the Institute of Computing Technology...