|
|
|
In this exercise, we continue with issues in Parallel Database Systems, studying
the problem of parallelizing the join operation. The Join: For those of you who have not studied relational databases, we present here a very informal discussion of everything you need to know for this assignment. Think of a relation as a collection of tuples. So, for example, suppose a bank had records on its account holders. Each record consisted of several fields:
The relation that captured this information would be a collection
of these tuples. Often you want to join two relations together. If you have two relations R1 and R2, you can join R1 and R2 on a field f1 in R1 and f2 in R2. The result is a new relation R3 whose tuples/records are the union of records in R1 and R2; for every record r1 in R1 and r2 in R2 such that the f1 field of r1 is equal to the f2 field of r2, there is a new record in R3 that is the union
of r1 and r2. The Parallel Problem: Consider the situation in which you have two relations, R1 and R2, that are spread out over P processors; e.g. each processor contains 1/P of the records of R1 and of R2. For simplicity we'll assume that all the values in each relation are integers. The goal of this exercise is for you to implement an algorithm that will execute an efficient parallel join on any
pair of fields in the two relations. Data: You will find in /data on the local (non-ebbets) workstations two files rel1 and rel2. Each has four integer fields; it is the first field of rel1
on which you should carry out the sorting exercise. Your report on this project should of course discuss the issues in a parallel implementation and explain your algorithmic choices, and give results
of experiments in a scientific fashion. |