The unprecedented ability to collect massive datasets from large scientific instruments and enterprise data warehouses poses grand challenges for data-intensive computing. At the same time, computing infrastructure is undergoing swift changes in both architecture and resource access models. With even smartphones and Raspberry Pis being equipped with multiple CPU cores, distributed computing is the norm rather than the exception.
This confluence of “Big Data” applications with emerging computing infrastructure can lead to transformative scientific and societal advances. However, translating this opportunity into scientific discovery and sustainable cities requires advances in software platforms and middleware. This includes novel programming abstractions to compose distributed applications over new classes of datasets such as streams and dynamic graphs, innovative algorithms that exploit such abstractions to scale, and execution platforms that allow transparent, resilient, and efficient usage of distributed computing facilities. Such an integrated framework needs to be as tuned to the characteristics of the data that it operates upon (e.g., volume, velocity, variety) as to the computing infrastructure that it executes upon (e.g., elasticity, cost, power).
The Distributed Research on Emerging Applications and Machines Lab (DREAM:Lab) focuses on holistic distributed systems research that enables the effective and efficient use of emerging distributed data and computing systems. We use scalable software architectures, innovative programming and data abstractions, and algorithms for optimal distributed execution to support data-intensive scientific and engineering applications, which can lead to transformative advances for society.
Housed at the Indian Institute of Science’s Department of Computational and Data Sciences (CDS), a unique inter-disciplinary department in India offering programs on computational and data sciences, the DREAM:Lab explores the verticals of the data science stack, from data-driven applications to Big Data platforms to emerging distributed infrastructure. Prof. Yogesh Simmhan heads the group.
Active Research Areas:
Some of the concepts the lab explores include:
- The distributed machines we consider include public and private Clouds, infrastructure and platform as a service (IaaS and PaaS), and virtualized and commodity clusters. Our emphasis is on the distinctive features of the Cloud, such as elastic acquisition and release of Virtual Machines (VMs) and pay-as-you-go billing or spot pricing [0. Cloudy with a Spot of Opportunity: Analysis of Spot-Priced VMs for Practical Job Scheduling, Vedsar Kushwaha and Yogesh Simmhan, Cloud Computing for Emerging Markets (CCEM), 2014], rather than generic features that are common with commodity clusters. More recently, we focus on the emerging paradigm of “Edge Computing” [1. IoT Analytics Across Edge and Cloud Platforms, Yogesh Simmhan, IEEE Internet of Things, 2017], where smartphone and Raspberry Pi devices at the edge of the network work cooperatively with the Cloud. We are also examining the use of accelerated Fog devices such as NVIDIA TX1 and low-power ARM64 servers that can support deep learning closer to the data source [195. Demystifying Fog Computing: Characterizing Architectures, Applications and Abstractions, Prateeksha Varshney, Yogesh Simmhan, IEEE International Conference on Fog and Edge Computing, 2017] [2. ARM Wrestling with Big Data: A Study of ARM64 and x64 Servers for Data Intensive Workloads, Jayanth Kalyanasundaram, Yogesh Simmhan, arXiv:1701.05996, 2017].
- System software fabrics offer the equivalent of an “OS for distributed machines”. While Cloud fabrics have used virtualization to manage thousands of servers at data centers efficiently, we are examining the role of containers in supporting lightweight sandboxing of application environments and resource allocation. In particular, we are developing ECHO as an IoT Fabric to offer a manageable interface over thousands of edge and Fog devices that will be part of IoT deployments.
- Big Data Platforms & programming abstractions are a core competency of our team.
- The GoFFish [3. GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics, Yogesh Simmhan, et al., International European Conference on Parallel Processing (EuroPar), 2014] and GoDB [4. GoDB: From Batch Processing to Distributed Querying over Property Graphs, Nitin Jamadagni and Yogesh Simmhan, IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid), 2016] platforms support distributed graph processing and querying using a novel subgraph-centric programming model we have developed. They scale to web, social network, and physical infrastructure graphs with billions of vertices and edges while using the elasticity of Clouds [43. A Meta-graph Approach to Analyze Subgraph-centric Distributed Programming Models, Ravikant Dindokar, Neel Choudhury and Yogesh Simmhan, IEEE International Conference on Big Data (Big Data), 2016],[46. Elastic Partition Placement for Non-stationary Graph Algorithms, Ravikant Dindokar and Yogesh Simmhan, IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid), 2016]. We also develop new distributed graph algorithms using such abstractions [47. Subgraph Rank: PageRank for Subgraph-Centric Distributed Graph Processing, Nitin Badam and Yogesh Simmhan, International Conference on Management of Data (COMAD), 2014]. These are being extended to operate on dynamic and temporal graphs [5. Distributed Programming over Time-series Graphs, Yogesh Simmhan, et al., IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015]. See the GoFFish project page for more details.
- We develop scheduling and resource allocation strategies for distributed stream processing systems to handle high-velocity data on Cloud VMs [6. Model-driven Scheduling for Distributed Stream Processing Systems, Anshu Shukla, Yogesh Simmhan, arXiv:1702.01785],[7. Reactive Resource Provisioning Heuristics for Dynamic Dataflows on Cloud Infrastructure, Kumbhare, Simmhan, Frincu and Prasanna, IEEE Transactions on Cloud Computing (TCC), 2015]. These are validated on fast data platforms like Apache Storm and outperform existing scheduling algorithms on accuracy and resource cost. We are also investigating consistent updates and intelligent reuse of actively running streaming dataflow applications. See the Fast Data project page for more details.
- Edge and Fog computing suffer from a lack of programming frameworks that allow transparent composition and seamless execution of IoT applications on them. The ECHO platform [8. ECHO: An Adaptive Orchestration Platform for Hybrid Dataflows across Cloud and Edge, Pushkara Ravindra, Aakash Khochare, Siva Prakash Reddy, Sarthak Sharma, Prateeksha Varshney, Yogesh Simmhan, arXiv:1707.00889, 2017] aims to address these gaps through an application platform for dataflow composition and distributed execution on edge, Fog, and Cloud resources using compute- and power-efficient scheduling algorithms [9. Distributed Scheduling of Event Analytics across Edge and Cloud, Rajrup Ghosh, Yogesh Simmhan, arXiv:1608.01537, 2016]. Training and inference of distributed deep-learning models using TensorFlow is a key focus at present. See the ECHO project page for more details.
- Data Science Algorithms, Applications, and Benchmarks: As the scientific and engineering domains contend with an influx of massive data, they offer a valuable context to apply the advances made in distributed systems research as well as a rich space for discovering novel problems that are as yet unaddressed. Distributed algorithms help translate the application requirements to underlying programming and runtime abstractions, and we particularly work on distributed and dynamic graph algorithms. We also investigate benchmarks to validate emerging applications, platforms, or machines, such as for stream processing and edge analytics [101. RIoTBench: A Real-time IoT Benchmark for Distributed Stream Processing Platforms, Anshu Shukla, Shilpa Chaturvedi, Yogesh Simmhan, Concurrency and Computation: Practice and Experience, 2017 (To Appear)],[102. Benchmarking Fast Data Platforms for the Aadhaar Biometric Database, Yogesh Simmhan, Anshu Shukla, Arun Verma, Workshop on Big Data Benchmarking (WBDB), 2015]. Smart Cities offer a vast application domain with foundations in Cyber Physical Systems (CPS) and the Internet of Things (IoT). The IISc Smart Campus project aims to validate distributed technologies in the field to make a sustainable impact [103. Towards a Practical Architecture for Internet of Things: An India-centric View, Prasant Misra, Yogesh Simmhan and Jay Warrior, IEEE IoT Newsletter, 2015],[104. An Open Smart City IoT Test Bed: Street Light Poles as Smart City Spines, Amrutur, Rajaraman, Acharya, Ramesh, Joglekar, Sharma, Simmhan, Lele, Mahesh and Sankaran, International Conference on Internet-of-Things Design and Implementation (IoTDI), 2017]. See the Software and Smart Campus project pages for more details.
The research activities of the DREAM:Lab will advance fundamental knowledge on effectively scaling data-driven scientific applications on contemporary and emerging distributed computing infrastructure. Further, the applied nature of this research will translate novel research outcomes into sustainable software prototypes that will help accelerate scientific discovery in critical application domains of national importance. Taking an integrated view across the research stack, from the system to the application, is important. It avoids conducting research in a vacuum, under idealized conditions detached from reality. This is particularly important for systems research due to the fast-changing nature of computing technology and advances in hardware architectures. At the same time, this must not degenerate into building software, systems, or applications as an end in themselves, in the absence of tangible research outcomes. We also collaborate with industry partners such as NetApp and VMware, and with other research groups at the Robert Bosch Center for Cyber Physical Systems and the University of Melbourne. Such practical grounding will also illustrate to students the value of interdisciplinary research while helping train the research scientists and workforce of the future on advanced technologies.
We acknowledge the support of our current and past sponsors:
- IISc’s Robert Bosch Center for Cyber Physical Systems (RBCCPS)
- GoI’s Ministry of Electronics and Information Technology (MeitY)
- NetApp Inc.
- Microsoft Azure for Research
- Tech Mahindra