|
|
|
CS342: Algorithms for Parallel and Distributed Systems Active Exercise 4: Load Balancing In class we discussed the problem of parallelizing a loop of the following
form: FOR I = 1,N DO /* EXECUTE LOOP BODY FOR VALUE I */ Where the body of the loop for each value of I is independent of the body of the loop for other values of I. We noted that a first naïve approach would be to distribute N/P loop iterations to each of the P processors, but that this might lead to significant load imbalance and poor performance. The major purpose of this exercise is for you to design and experiment with techniques for load balancing for these sorts of loops. The application is motivated by datamining which is essentially the process of searching a database for patterns among the data. On all 14 workstations, excluding ebbets, you will find a directory called /data. /data contains 500 files named f1 through f500. Think of each file as records of transactions from a different supermarket. Each supermarket sells 50 products, and a transaction is simply represented by a binary string of length 50 which indicates which products were purchased in that transaction. Each file contains hundreds or thousands of transactions; they are the same on each machine. In these 500 files of transactions you wish to find interesting pairs: pairs of products such that when one is purchased it is likely that the other is purchased - for example, bagels and cream cheese or beer and chips. We precisely define an interesting pair to be a pair of products (x,y) such that in at least 40% of the transactions in which x appears, y appears, and in at least 40% of the transactions in which y appears, x appears. Write an MPI program that , for each of the 500 files, finds all interesting pairs in that file, and then determines the ten most interesting pairs over all files - the most interesting pairs are those that are interesting for the most files. Your goal is to make this program run as fast as possible. The winning group will receive some kind of (small) prize;either choice of a free book and/or pizza. Since the 500 files contain as much as 143 megabytes of data, you may not be able to run your program on all 50 products - we'll see as we go along how much we can expect. Some tips: If you are in a group of 3, have one person focus on writing a good sequential program and the code that determines the most interesting pairs, and each of the other 2 members focus on a different approach to the load balancing. Each group should implement two reasonably different approaches to the load balancing problem.Your written report is due in class on October 21st. You will have the entire week of the 7th to work on this problem in class. |