Enabling fairer data clusters for machine learning

August 20, 2020
(Photo of a very large, dark room filled with glowing computers. Credit: Google)

Recently published research by CSE investigators can make training machine learning (ML) models fairer and faster. With a tool called AlloX, Mosharaf Chowdhury and a team from Stony Brook University developed a new way to fairly schedule high volumes of ML jobs in data centers that use several types of computing hardware, such as CPUs, GPUs, and specialized accelerators. As these so-called heterogeneous clusters become the norm, fair scheduling systems like AlloX will be essential to their efficient operation.

This project is a new step for Chowdhury’s lab, which has recently published several tools for speeding up the training and testing of ML models. Its past projects, Tiresias and Salus, sped up GPU resource sharing at two scales: within a single GPU (Salus) and across many GPUs in a cluster (Tiresias).