Distributed Machine Learning: Systems, Platforms and Algorithms

Edge Accelerators for Training

The figure shows an Nvidia Jetson Orin edge device with 12 ARM Cortex CPU cores, an Ampere GPU with 2048 CUDA cores and 64 tensor cores, and 64 GB of RAM shared between the CPU and GPU. It delivers up to 275 TOPS of performance, comparable to an RTX 3060 Ti GPU workstation.

With the increase in IoT deployments, edge computing is emerging as a paradigm that enables low-latency, privacy-preserving computation close to the data source. Some edge devices, like the Nvidia Jetson, feature multiple accelerators such as a GPU and DLA in addition to a multicore ARM CPU, making it feasible to train deep learning models on them in addition to running inference. These accelerated edge devices differ from desktop and server-grade GPUs in their architecture, so using them effectively requires fresh exploration. We investigate systems optimizations to use accelerated edge devices efficiently for deep learning workloads. Our work includes characterizing deep learning training and inference on edge devices [PAISE 2022, SIGMETRICS 2023], understanding the interference between concurrent workloads, modelling performance and energy, intelligently scheduling concurrent training and inference workloads to better utilize heterogeneous hardware [CCGRID 2023], and virtualization and containerization of edge devices.



Federated Learning (System)

The figure shows hierarchical federated learning in action. Clients are clustered into groups of similar performance, and each group performs local model training and aggregation. The aggregated model is then sent to a central server for the second level of aggregation.

Federated Learning (FL) is a machine learning paradigm that enables privacy-driven, distributed learning by training a shared model across clients. However, one of the main issues affecting the reliability and accuracy of FL is heterogeneity, both in the data and in the device specifications. Our work includes surveying state-of-the-art approaches that tackle these issues, identifying their shortcomings, and designing a new framework around techniques that address them. We have been working on HaDFL, a novel synchronous FL framework built on top of the popular FedML library. HaDFL leverages the power modes available on edge devices to reduce device heterogeneity, and it operates in a cluster-based hierarchical aggregation setting. It tries to involve as many clusters as possible in each round to tackle data heterogeneity, reducing model skew towards particular clusters and enabling faster convergence. We have compared HaDFL against existing frameworks in a simulation environment and shown that it outperforms them using simple heuristic tweaks. The same techniques also improve the existing frameworks by addressing their gaps, which we take as evidence that our methodology is consistent across scenarios.
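The two-level aggregation described above can be illustrated with a minimal sketch. It assumes sample-count-weighted averaging (FedAvg-style) at both levels; this is an illustrative choice, not necessarily the aggregation rule HaDFL uses. The names `weighted_average` and `hierarchical_aggregate` are hypothetical and are not part of FedML, and model weights are represented as plain lists of floats to keep the example self-contained.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]
# A model update paired with the number of samples it was trained on.
Update = Tuple[Weights, int]


def weighted_average(updates: List[Update]) -> Update:
    """Average model weights, weighting each update by its sample count.

    Assumes every update has the same parameter names and shapes.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one training sample")
    first_weights = updates[0][0]
    averaged: Weights = {}
    for name, values in first_weights.items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in updates) / total
            for i in range(len(values))
        ]
    return averaged, total


def hierarchical_aggregate(clusters: List[List[Update]]) -> Weights:
    """First aggregate within each cluster, then across cluster models."""
    # Level 1: each cluster produces one model and its total sample count.
    cluster_models = [weighted_average(cluster) for cluster in clusters]
    # Level 2: the central server combines the cluster-level models.
    global_weights, _ = weighted_average(cluster_models)
    return global_weights


if __name__ == "__main__":
    cluster_a = [({"w": [1.0]}, 1), ({"w": [3.0]}, 3)]
    cluster_b = [({"w": [5.0]}, 4)]
    print(hierarchical_aggregate([cluster_a, cluster_b]))
```

Because each cluster model carries its total sample count to the second level, the result under this assumption matches a flat weighted average over all clients. A real system may instead weight clusters differently to counter skew.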


Video Data Analytics

The figure shows the end-to-end video data analytics pipeline in action.

With advances in UAVs (Unmanned Aerial Vehicles) and computer vision, automated management and analysis of video data captured by UAV-mounted cameras is an area of growing interest. The video and metadata collected by a fleet of UAVs can be used to answer historical queries. Managing and semantically analyzing video data is known to be challenging because of its volume and computational overhead. Videos captured by a fleet of drones pose additional problems: they are short and their visual level of detail varies. This makes efficient analysis by a single model difficult, and it rules out the traditional technique of reducing video volume by downscaling to a predetermined static resolution. We therefore analyzed existing methods in the literature for measuring level of detail, and proposed and built a data (video and metadata) processing pipeline that scales UAV videos dynamically, using metadata captured by the drone sensors, to meet a user-specified level of detail on average. We have also implemented this pipeline on a heterogeneous edge cluster and observed a reduction in turnaround time for the given experimental setup. Finally, we integrated and evaluated load balancers that are aware of device and load heterogeneity, and reported a further reduction in makespan [HiPCW 2022].
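To make the idea of metadata-driven dynamic scaling concrete, here is a minimal sketch. It assumes that level of detail is measured as ground sampling distance (GSD, metres of ground per pixel) for a downward-facing camera, computed from altitude and camera intrinsics. The text above does not specify the measure the pipeline uses, so this metric, the `FrameMetadata` fields, and the function names are all hypothetical. The sketch also works per frame, whereas the pipeline targets the level of detail on average.

```python
from dataclasses import dataclass


@dataclass
class FrameMetadata:
    """Hypothetical per-frame metadata from drone sensors and camera specs."""
    altitude_m: float        # height above ground
    focal_length_mm: float   # camera focal length
    sensor_width_mm: float   # physical sensor width
    image_width_px: int      # native frame width


def ground_sampling_distance(meta: FrameMetadata) -> float:
    """Metres of ground covered by one pixel (nadir-view approximation)."""
    return (meta.sensor_width_mm * meta.altitude_m) / (
        meta.focal_length_mm * meta.image_width_px
    )


def scale_factor(meta: FrameMetadata, target_gsd_m: float) -> float:
    """Factor to resize the frame so its GSD matches the target.

    Downscaling by s coarsens the GSD to native / s, so s = native / target.
    The factor is capped at 1.0 because upscaling adds no real detail.
    """
    if target_gsd_m <= 0:
        raise ValueError("target GSD must be positive")
    return min(1.0, ground_sampling_distance(meta) / target_gsd_m)


def scaled_width(meta: FrameMetadata, target_gsd_m: float) -> int:
    """Frame width in pixels after applying the scale factor."""
    return max(1, round(meta.image_width_px * scale_factor(meta, target_gsd_m)))


if __name__ == "__main__":
    low = FrameMetadata(40.0, 8.8, 13.2, 4000)
    high = FrameMetadata(100.0, 8.8, 13.2, 4000)
    for meta in (low, high):
        print(meta.altitude_m, scaled_width(meta, target_gsd_m=0.03))
```

Under these assumptions, a frame captured at low altitude has more detail than requested and is downscaled, while a frame captured at high altitude is already coarser than the target and is left at native resolution. A static target resolution could not make this distinction.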

We are currently developing a system to query the data captured by UAVs using geospatial, temporal, and semantic predicates, and are exploring further optimizations to this pipeline.


  • [HiPCW 2022] Bharati Khanijo, Harshil Gupta and Y. Simmhan, Video Ingest Pipeline on the Edge for Drone Videos, p. 75, in IEEE International Conference on High Performance Computing, Data and Analytics Workshop, 2022, 10.1109/HiPCW57629.2022.00015.