PUB-Web: Paderborn University BSP-based Web Computing

Overview

The Paderborn University BSP-based Web Computing (PUB-Web) library (formerly known as PUBWCL) combines aspects of both volunteer computing and grid computing: Like in the volunteer computing approach, web computing means to unify lots of PCs connected via the Internet to a virtual supercomputer (rather than connecting a few supercomputers or computing clusters), and to utilize only the continually fluctuating, donated idle time on these computers. And like in the grid computing approach, web computing means to run coupled, massively parallel algorithms (rather than distributed data processing). Previously organized according to the master-worker paradigm, PUB-Web in its second generation is now realized as a pure peer-to-peer (P2P) system, using some supernodes only for the management of the system.


In recent years volunteer computing and grid computing have received considerable attention. In the web computing approach we combine aspects of both volunteer computing and grid computing: Like in the volunteer computing approach, web computing means to unify lots of PCs connected via the Internet to a virtual supercomputer (rather than connecting a few supercomputers or computing clusters). And like in the grid computing approach, web computing means to run coupled, massively parallel algorithms (rather than distributed data processing). Thus, web computing can be seen as grid computing with the additional challenge to manage a big dynamic set of computers, where not only the set itself is dynamic, i.e., computers may be added and removed at any time, but also the computing power is dynamically changing as we were only donated the idle time on the particular computers. Alternatively, web computing can be seen as volunteer computing with the additional challenge that the processes of a parallel program have to run simultaneously, communicate with each other, and may have certain dependencies.


In order to avoid arbitrarily high communication and synchronization delays in such a heterogeneous computing environment, we restrict the parallel applications to a round-based model with computing, communication, and synchronization phases. Rather than inventing a new variant, we stick to the well-established Bulk-Synchronous Parallel (BSP) model, which has been introduced by Leslie G. Valiant in order to simplify the development of parallel algorithms. It forms a bridge between the hardware to use and the software to develop by giving the developer an abstract view of the technical structure and the communication features of the hardware to use (e.g., a supercomputer with shared memory, a cluster of workstations or PCs connected via the Internet). A BSP program consists of a set of BSP processes and a sequence of supersteps—time intervals bounded by a barrier synchronization. Within a superstep each process performs local computations and sends messages to other processes.


Other well-known BSP implementations are the Oxford BSP programming library (BSPlib) and the Paderborn University BSP (PUB) library. The Bayanihan BSP implementation is a first attempt to support BSP programs in a volunteer computing context: The central master node decomposes the BSP program to be executed into pieces of work, each consisting of one superstep in one BSP process. The worker nodes download a work package consisting of the current process state of one BSP process and its incoming messages for the current superstep, execute the superstep, and send the resulting state together with the outgoing messages back to the master. When the master has received the results of the current superstep for all BSP processes, it moves the messages to their destination work packages. Then the workers continue with the next superstep. With this approach all communication between the BSP processes passes though the server additionally to the overhead generated by starting / stopping the processes and saving / restoring their process states.


The first generation of PUB-Web has removed this bottle-neck at the master node: All computing nodes communicate directly with each other; the master node was only involved when a computing node had to look up another computing node before their first interaction, or when something went wrong. This has removed the immense overhead of communication and work package management at the master node; however, the master node still had to fulfill tasks like scheduling, load balancing, user management, etc., so that this node still was a single point of failure.


The second generation of PUB-Web has no master-worker structures anymore but is realized as a pure peer-to-peer (P2P) system, using some supernodes only for the management of the system. A supernode is mainly responsible for scheduling and load balancing. Since the computing power available on the assigned peers continually fluctuates depending on how intensively the people, who donate their unused computing power, currently utilize their computers, the supernode's prediction where to optimally schedule the BSP processes may appear to be wrong after some time. As the whole BSP program is delayed at the end of the current superstep if only one peer does not provide the expected amount of computing power, we migrate BSP processes on such peers to other, faster peers.


As not only the available computing power is dynamic, but also the P2P network itself, i.e., peers may disappear out of a sudden, PUB-Web creates backup copies of the process states during each synchronization phase. Thus, we are able to restore processes of a BSP program on-the-fly, whereat a superstep is delayed by a factor of at most 2.


Because the computers in our scenario are not only highly heterogeneous with respect to its hardware, but also run various types and versions of operating systems, PUB-Web needs to be platform independent. The choice of Java does not only fulfill this requirement, but also provides a basic security model out of the box: The Java Sandbox allows to grant code specific permissions depending on its origin. For example, access to (part of) the file system or network can be denied. In order to guarantee a high level of security, we grant user programs only read access to a few Java properties which are needed to write completely platform independent code.

Key Facts

Websites:
Homepage
PUB-Web

More Information

Principal Investigators

contact-box image

Joachim Gehweiler

About the person
contact-box image

Prof. Dr. Friedhelm Meyer auf der Heide

Algorithmen und Komplexität / Heinz Nixdorf Institut (bis 2023)

About the person

Contact

If you have any questions about this project, contact us!

Joachim Gehweiler

contact-box image