Why combined architecture?
In recent years, the architectures of supercomputers have remained essentially unchanged: vector/pipeline systems (Cray Y-MP, Cray C90, NEC SX-3, etc.) and MIMD multiprocessors (TMC CM-5, KSR1, Intel XP/S, etc.), while progress in performance has been achieved mostly by improving the physical characteristics of the integrated circuits.
However, things are changing.
The sharp drop in government financing of high-performance computing compels designers and vendors to seek new applications for their complicated and expensive products. Instead of the highest performance, the most economical performance is becoming the leading demand in the field.
The new application problems now drawing the attention of specialists are not limited to pure computation, but are rather based on quite different paradigms. These problems are usually characterized by their specific data types, algorithms, and implementation techniques. For different problems, and for various fragments of the same problem, it is expedient to use the features of different architectures. Using a single rigid architecture leads to a gap between peak and real performance.
In this situation it is natural to use more sophisticated ways of exploiting the rich resources of modern electronic components. One such way is to support the computer hardware and software with specialized processors.
The idea of using specialized hardware in computer systems is actually not new. However, for a long time technological and economic reasons hampered the broad use of dedicated hardware in high-performance computing systems. Now, with the advent of modern FPGA devices and silicon compilers, these reasons fall away.
In our opinion, the only important problem is to find an integral framework for utilizing different specialized processors. The combined architecture provides such a framework, as well as an appropriate methodology for selecting the types of processors.
Combined Architecture (Figure 1) is a cooperation of a highly parallel host computer with a set of specialized processors, or Hardware Modules (HMs). In this architecture, the solving of any problem is considered as an interaction of several processes, and the execution of each process is delegated to the specialized subsystem most efficient in implementing that process. The subsystems are controlled in such a way that their balanced operation is ensured and the complementary features of the subsystems are best exploited. For each subsystem, a structure is chosen which best corresponds to the function it should perform.
The hardware modules are connected to the host by a suitable interconnection network.
When choosing the architecture of the subsystems, the following considerations should be taken into account.
In the combined architecture, the main working load of the processing is delegated to the coprocessors. Hence, the requirements on the performance of the basic computer can be moderate. However, in order to ensure effective interaction between the subsystems and the necessary data flows, this computer should have a sufficiently high degree of parallelism, a large main-memory capacity, and adequate software.
At the same time, in view of the general objectives of the system, extremely high demands must be made on the performance of each coprocessor. This means that special care is needed in selecting the structures of the coprocessors. The most suitable architecture for the basic subsystem seems to be a fine-grained SIMD machine similar to the DAP or the CM [4]. These computers fit the above-mentioned requirements (memory size, parallelism, etc.), while they are much cheaper than other parallel systems because they use simple single-bit processing elements.
This feature was, in fact, the weak point of this remarkable architecture, preventing it from taking a leading place in computation-intensive applications. In the case of the combined architecture, however, this drawback is not a decisive factor, because hard computations, according to the main idea of the combined architecture, are delegated to the coprocessors. On the other hand, the fine-grained SIMD host computer can ensure the necessary input/output and preprocessing of large data arrays. Thus, the fine-grained SIMD host and the set of specialized processors make a mutually complementary pair.
Another benefit of using a fine-grained SIMD machine as the host is that all of its features and advantages are retained, to be exploited when the system solves problems that fit the "data parallel" paradigm well.
Thus, the combined architecture might extend the lifetime of existing fine-grained SIMD computers.
Two types of specialization of material processing and article manufacturing are conceivable in every area of industry: item-oriented and process-oriented. In the first case, production is oriented to the manufacturing of a single article or a group of similar articles. In the second, it is adapted to the implementation of definite technological procedures which are typical in producing many different articles, and production is carried out by combining these procedures in the required sequences. In developing process-oriented manufacturing, people have invented very efficient tools for casting, punching, welding, etc., and those tools are successfully used in producing a large variety of things.
What is the analogy in data processing? Here, too, we have some well-settled "technologies": procedures and types of processing with broad application to various problems. Thus, in numerical problems, one of the most widespread procedures is vector/matrix multiplication; in non-numerical tasks, the lion's share belongs to searching and sorting.
Analyzing the common numerical methods, algorithms, and programming languages, one can select a set of recurring general technologies, or styles of data treatment, which we call processing types. In such an analysis, the specific features of existing hardware should be taken into account in order to make a reasonable mapping of the processing types onto efficient hardware modules possible.
Our goal is to provide a classification of processing styles, matching them against the known hardware structures, in order to find a reasonable mapping:
PROCESSING TYPE -> HARDWARE MODULE
Indeed, the variety of styles, or technologies, involved in the machine realization of different application problems is not too large.
We will describe a Processing Type (PT) as a three-tuple PT = {A1, A2, T}, where A1 (A2) corresponds to the data type of the first (second) argument, and T corresponds to the transformation to be executed upon these arguments.
In massive procedures, the terms A1 and A2 usually take the values "Vector (V)", "Scalar (S)", and "Binary (B)". Other possible types of arguments can be: "Array", "Set", "Relation", "Tree", etc. Examples of transformations T are: "Arithmetic (A)", "Logic (L)", "Permute (P)".
Table 1 shows the notations of the most common processing types. In this table, some HMs suitable for the implementation of the different PTs are also shown. Of course, this list of basic processing types needs further extensions and corrections.
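To make the notation concrete, the following minimal sketch (ours, not part of the original system) shows how a PT might be encoded in software and its mnemonic code derived. The module names are taken from Table 1; the data structures and function names are hypothetical.

```python
# Hypothetical encoding of processing types (PTs); an illustration only.
from typing import NamedTuple, Optional

class ProcessingType(NamedTuple):
    a1: str            # data type of the first argument, e.g. "Vector"
    a2: Optional[str]  # data type of the second argument, or None
    t: str             # transformation, e.g. "Arithmetic", "maX"

def pt_code(pt: ProcessingType) -> str:
    """Derive the mnemonic code of Table 1: the first letters of A1 and
    A2 ('o' when the second argument is absent), then the capitalized
    letter of T ("Arithmetic" -> A, "maX" -> X, "miN" -> N)."""
    t = next(c for c in pt.t if c.isupper())
    a2 = pt.a2[0] if pt.a2 else "o"
    return pt.a1[0] + a2 + t

# An excerpt of the Table 1 mapping PT -> suitable hardware modules.
PT_TO_HM = {
    "VVA": ["Pipeline", "PVM", "PVA", "Systolic"],
    "VSS": ["Associative Processor"],
    "VoX": ["Extremum Selector"],
    "VoC": ["Digital Compressor"],
}

assert pt_code(ProcessingType("Vector", None, "Compress")) == "VoC"
assert pt_code(ProcessingType("Binary", "Binary", "Logic")) == "BBL"
```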
Our investigations in the field of homogeneous specialized processors are summarized in [5]. There, the so-called Distributed Functional structures (DF-structures) are discussed, in which a direct mapping of algorithms into circuits is accomplished. This means that the algorithms of the basic procedures are modelled by specialized logical networks in the process of signal propagation.
Two examples of DF-structures are given below.
Extremum Selector (alpha-structure)
Consider a two-dimensional homogeneous structure of size $m \times n$, with $m$ $n$-bit elements (binary numbers) of the processed array written into its memory so that each element occupies one row of the matrix (most significant digits to the left).

For maximum selection, a known algorithm of column-wise (left-to-right) inspection of the bit values $a$ of the array elements is used, described as follows.
Step 1. The contents of the first (left) column are looked over, that is, the most significant digits of all $m$ elements. If all these digits are zeros, then at the following step the second digits of all $m$ elements are looked over. If, however, the first column contains both zeros and ones, then at the second step only those elements which had ones in the first position are looked over.
Step j. The contents of the j-th column (j-th digits of all elements) are looked over, in those rows which were singled out at the (j-1)-th step. If all these digits are zeros, then at the following step the (j+1)-th digits of the same rows are looked over. If there are both zeros and ones in the memory elements looked over at the j-th step, then at the (j+1)-th step only rows corresponding to ones are looked over.
The subset of the rows singled out at the last ($n$-th) step (it may consist of only one row) contains the maximal elements.
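As a software illustration only (the actual selector is a logical network, not a program), the column-wise inspection can be sketched as follows; all names are ours.

```python
# A direct software model of the column-wise maximum-selection
# algorithm.  Elements are given as n-bit strings, most significant
# digit on the left.
def select_max_rows(rows):
    """Return the indices of the maximal elements of `rows`."""
    n = len(rows[0])
    candidates = set(range(len(rows)))   # all rows are looked over
    for j in range(n):                   # step j: inspect column j
        ones = {i for i in candidates if rows[i][j] == "1"}
        if ones:                         # zeros and ones both present:
            candidates = ones            # keep only the rows with ones
        # if the column is all zeros, the candidates stay unchanged
    return candidates

rows = ["0110", "1011", "1010", "1011"]
print(select_max_rows(rows))             # {1, 3}: both rows hold 1011
```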
As was shown in [5], this algorithm is realized when each cell of the distributed logical network implements the functions $z' = z(a \vee \bar{y})$ and $x' = x \vee az$, where the variables $x$ and $z$ belong to the vertical and horizontal look-over channels, respectively, and $y$ in each column should be set to $x'_m$, the bottom output of that column's vertical channel.
Evidently, the implementation of minimum selection differs only in that the negations $\bar{a}$ of all binary variables should be used in the array inspection.
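The cell equations can be checked with a short gate-level simulation; this is our sketch of the signal propagation, not the hardware description from [5].

```python
# A simulation sketch of the alpha-structure described above.  Cell
# equations from the text: z' = z & (a | not y), x' = x | (a & z),
# with y in each column tied to the bottom output x'_m of the column.
def alpha_structure(a):
    """a: m*n bit matrix; returns the horizontal outputs z of each row."""
    m, n = len(a), len(a[0])
    z = [1] * m                        # every row enters as a candidate
    for j in range(n):                 # propagate column by column
        x = 0                          # vertical look-over channel
        for i in range(m):
            x |= a[i][j] & z[i]        # x' = x | az
        y = x                          # y = x'_m (bottom of the column)
        for i in range(m):
            z[i] &= a[i][j] | (1 - y)  # z' = z(a | not-y)
    return z                           # z[i] = 1 marks a maximal element
    # (for minimum selection, feed in the negated bits, per the text)

a = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 0, 1, 0], [1, 0, 1, 1]]
print(alpha_structure(a))              # [0, 1, 0, 1]: rows 1 and 3 win
```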
Digital Compressor
We call a digital compressor a functional unit realizing the compression of binary vectors, that is, the conversion of an arbitrary binary vector into a prefix (suffix) vector of the same weight; for example, compression of 01011 yields the prefix vector 11100. Various types of digital compressors are known. Consider one of them, based on a simple DF-structure called the lambda-matrix.
This is a two-dimensional homogeneous array (Figure 2a), each cell of which contains two logical gates, AND and OR (Figure 2b), and realizes the logical functions $z' = zt$ (the horizontal channel) and $t' = z \vee t$ (the vertical channel).
Let an arbitrary binary vector be applied to the inputs $z$ of the left boundary of the lambda-matrix.

Consider the first (leftmost) column of the matrix. The variable $t$ retains its initial value 0 in the vertical channel of this column only as long as $z = 0$. In some cell $i_1$, where $z = 1$ is encountered for the first time, the value of $t$ changes to 1, and it cannot change again down to the lower boundary. The cell $i_1$ itself, however, still receives the signal $t = 0$. Hence, it is the only cell in the whole column where the combination $z\bar{t} = 1$ is present. This combination may serve as an indication for extracting the "one".

The horizontal channel of the cell $i_1$ thus indicated is closed by the signal $t = 0$. Hence, the first "one" of the given vector does not propagate further along its row. In all cells below the indicated one, $t = 1$, so that $z' = z$. Thus, the inputs of the second column receive a duplicate of the given vector, except for its first "one".
Similar transformations are performed in the second, the third, and the subsequent columns: in some cell $i_2$ of the second column the second "one" of the given vector is indicated, in some cell $i_3$ of the third column the third "one" is indicated, and so on (so that $i_1 < i_2 < i_3 < \dots$).
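Again as an illustrative sketch of ours (not the circuit itself), the propagation through the lambda-matrix and the indicated cells can be simulated as follows; positions are 0-indexed here.

```python
# Simulation of the lambda-matrix.  Cell functions from the text:
# z' = z & t along the row, t' = z | t down the column; a cell with
# z & not-t indicates a "one" of the input vector.
def lambda_matrix(z, n):
    """Propagate the binary vector z through n columns; return the
    (row, column) positions where the k-th 'one' is indicated."""
    indicated = []
    for k in range(n):                 # column k
        t = 0                          # vertical channel enters as 0
        new_z = []
        for i, zi in enumerate(z):
            if zi and not t:           # z & not-t: the k-th "one" found
                indicated.append((i, k))
            new_z.append(zi & t)       # z' = z t: that "one" is removed
            t |= zi                    # t' = z | t
        z = new_z
    return indicated

print(lambda_matrix([0, 1, 0, 1, 1], 3))   # [(1, 0), (3, 1), (4, 2)]
```

Note that the indicated rows indeed satisfy $i_1 < i_2 < i_3$, since each column removes the topmost remaining "one".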
The distinguishing characteristics of this approach are:
A technique of parallel programming for the combined architecture is now under development, based on the language VEPRAN [6] extended by appropriate means to specify the processing types and to control the movement of parallel data in the given heterogeneous system. Using this language, one can analyze the algorithm for the solution of a given problem and present it as a sequence of processing types. In the course of the execution of such a program, each processing type is assigned to a definite hardware module (or a combination of modules) taken from the set of real modules included in the system.
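The assignment step might look as follows; this is a hypothetical sketch of ours, not the VEPRAN implementation, and only the module and PT names follow Table 1.

```python
# Hypothetical dispatch of a program (a sequence of PT codes) onto the
# hardware modules actually present in a given system.
AVAILABLE_MODULES = {"Pipeline", "Associative Processor",
                     "Extremum Selector", "Digital Compressor"}

SUITABLE = {                       # excerpt of Table 1
    "VSA": ["Pipeline"],
    "VSS": ["Associative Processor"],
    "VoX": ["Extremum Selector"],
    "VoC": ["Digital Compressor"],
    "VoO": ["Sorting Network", "Systolic"],
}

def assign(program):
    """Map each PT of the program to the first suitable module present."""
    plan = []
    for pt in program:
        hm = next((m for m in SUITABLE[pt] if m in AVAILABLE_MODULES),
                  None)            # fall back to the SIMD host
        plan.append((pt, hm or "host (fine-grained SIMD)"))
    return plan

# VoO has no suitable module installed here, so it falls back to the
# host (our assumption about the fallback policy).
print(assign(["VSS", "VoC", "VoX", "VoO"]))
```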
The combined architecture is considered an open system which can be supplemented with additional processors according to the requirements of the application. From the user's point of view, a computer of combined architecture appears conventional. In fact, however, in addition to the usual library of programs, it has a "library" of hardware modules.
In some sense, the combined architecture realizes what Daniel Slotnick predicted in [7]: "What we can place, however, is the evolution of large systems ... to comprehend entire families of special-purpose 'peripheral devices' in a way not different in principle from the way they now comprehend their library of programs".
Further research in this subject will cover:
The results of this investigation will contribute to the European efforts in designing new HPC architectures.
Figure: Application Problem --> Design of Algorithm <-- Processing Types
Table 1. Basic processing types and suitable hardware modules.

A1     | A2       | T          | PT  | HM(s)
-------+----------+------------+-----+------------------------------
Vector | Vector   | Arithmetic | VVA | Pipeline, PVM, PVA, Systolic
Vector | --       | Arithmetic | VoA | Pipeline, PVM, PVA, Systolic
Vector | Vector   | Logic      | VVL | Set Intersection Processor
Vector | Scalar   | Arithmetic | VSA | Pipeline
Vector | Scalar   | Search     | VSS | Associative Processor
Vector | Interval | Search     | VIS | Associative Processor
Vector | Vector   | Search     | VVS | Set Intersection Processor
Vector | --       | Order      | VoO | Sorting Network, Systolic
Vector | --       | maX        | VoX | Extremum Selector
Vector | --       | miN        | VoN | Extremum Selector
Vector | Vector   | Permute    | VVP | Permutation Network
Vector | --       | Permute    | VoP | Permutation Network
Vector | --       | Compress   | VoC | Digital Compressor
Vector | --       | Expand     | VoE | Permutation Network
Vector | --       | Logic      | VoL | Associative Processor
Binary | Binary   | Logic      | BBL | Associative Processor
Binary | Binary   | Permute    | BBP | Permutation Network
Binary | --       | Permute    | BoP | Permutation Network