Why a combined architecture?
In recent years the architectures of supercomputers have remained essentially unchanged: vector/pipeline systems (Cray Y-MP, Cray C90, NEC SX-3, etc.) and MIMD multiprocessors (TMC CM-5, KSR1, Intel XP/S, etc.), while progress in performance has been achieved mostly by improving the physical characteristics of the integrated circuits.
However, things are changing.
The sharp drop in government financing of high-performance computing compels designers and vendors to seek new applications for their complicated and expensive products. Instead of the highest performance, the most economical performance is becoming the leading demand in the field.
The new application problems now drawing the attention of specialists are not limited to pure computation, but are based on quite different paradigms. These problems are usually characterized by their specific data types, algorithms, and implementation techniques. For different problems, and for various fragments of the same problem, it is expedient to use the features of different architectures. Utilization of a rigid architecture leads to a gap between the peak and the real performance.
In this situation it is natural to use more sophisticated ways of exploiting the rich resources of modern electronic components. One such way is to support the computer's hardware and software with specialized processors.
The idea of using specialized hardware in computer systems is actually not new. However, for a long time technological and economic reasons hampered the broad use of dedicated hardware in high-performance computing systems. Now, with the advent of modern FPGA devices and silicon compilers, these reasons fall away.
In our opinion, the only important problem is to find an integral framework for utilizing different specialized processors. The combined architecture provides such a framework, as well as an appropriate methodology for selecting the types of processors.
The Combined Architecture (Figure 1) is a cooperation of a highly parallel host computer with a set of specialized processors, or Hardware Modules (HMs). In this architecture, the solving of any problem is treated as the interaction of several processes, and the execution of each process is delegated to the specialized subsystem most efficient at implementing it. The subsystems are controlled so that their balanced operation is ensured and their complementary features are best exploited. For each subsystem, a structure is chosen that best corresponds to the function it performs.
The hardware modules are connected to the host by a suitable interconnection network.
In choosing the architecture of the subsystems, the following considerations should be taken into account.
In the combined architecture, the main working load of the processing is delegated to the coprocessors. Hence, the performance requirements on the basic computer can be moderate. However, in order to ensure effective interaction between the subsystems and the necessary data flows, this computer should have a sufficiently high degree of parallelism, a large main memory, and adequate software.
At the same time, in view of the general objectives of the system, extremely high demands should be made on the performance of each coprocessor. This means that special care is needed in selecting the structures of the coprocessors. The most suitable architecture for the basic subsystem seems to be the fine-grained SIMD, similar to the DAP or CM [4]. These computers fit the above-mentioned requirements (memory size, parallelism, etc.), while they are much cheaper than other parallel systems because they use simple single-bit processing elements.
This simplicity was, in fact, the weak point of this remarkable architecture, preventing it from taking a leading place in computation-intensive applications. In the combined architecture, however, this drawback is not a decisive factor, because hard computations are, according to the main idea of the combined architecture, delegated to the coprocessors. On the other hand, the fine-grained SIMD host computer can ensure the necessary input/output and preprocessing of large data arrays. Thus, the fine-grained SIMD host and the set of specialized processors form a mutually complementary pair.
Another benefit of using a fine-grained SIMD machine as the host is that all its features and advantages are preserved and can be exploited when the system solves problems that fit the "data parallel" paradigm well.
Thus, the combined architecture might extend the lifetime of existing fine-grained SIMD computers.
Two types of specialization of material processing and article manufacturing are conceivable in every area of industry: item-oriented and process-oriented. In the first case, the production is oriented to manufacturing a single article, or a group of similar articles. In the second, it is adapted to implementing definite technological procedures which are typical in producing many different articles, and the production is done by combining these procedures in the required sequences. In the development of process-oriented manufacturing, people invented very efficient tools for casting, punching, welding, etc., and those tools are successfully used in producing a large variety of things.
What is the analogy in data processing? Here, too, we have some well-established "technologies": procedures and types of processing with broad application to various problems. Thus, in numerical problems one of the most widespread procedures is vector/matrix multiplication, while in non-numerical tasks the lion's share belongs to searching and sorting.
Analyzing the common numerical methods, algorithms, and programming languages, one can select a set of recurring general technologies, or styles of data treatment, which we call processing types. In such an analysis, specific features of existing hardware should be taken into account in order to make possible a reasonable mapping of the processing types onto efficient hardware modules.
Our goal is to provide a classification of processing styles, matching them against the known hardware structures, in order to find a reasonable mapping:
PROCESSING TYPE -> HARDWARE MODULE
Indeed, the variety of styles, or technologies involved in machine realization of different application problems is not too large.
We will describe the Processing Type (PT) as a three-tuple: PT = {A1,A2,T},
where A1(A2) corresponds to the data type of the first (second) argument, and T corresponds to the transformation to be executed upon these arguments.
In massive procedures, the terms A1 and A2 usually take the values "Vector (V)", "Scalar (S)", and "Binary (B)". Other possible types of arguments can be: "Array", "Set", "Relation", "Tree", etc. Examples of transformations T are: "Arithmetic (A)", "Logic (L)", "Permute (P)".
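The three-tuple notation can be sketched in code. The following Python fragment is illustrative only: the letter codes for arguments and transformations follow the abbreviations above, but the tuple encoding and the dictionary contents are assumptions made for this sketch, not a fixed catalogue.

```python
# Illustrative sketch: encoding processing types PT = {A1, A2, T} as
# Python tuples and mapping them to the hardware modules named in Table 1.
# None stands for an absent second argument (written "--" in the table).

PT_TO_HM = {
    # (A1, A2, T): suitable hardware modules
    ("V", "V", "A"): ["Pipeline", "PVM", "PVA", "Systolic"],  # VVA
    ("V", "S", "A"): ["Pipeline"],                            # VSA
    ("V", None, "X"): ["Extremum Selector"],                  # VoX (maX)
    ("V", None, "C"): ["Digital Compressor"],                 # VoC (Compress)
    ("B", "B", "L"): ["Associative Processor"],               # BBL
}

def candidate_modules(a1, a2, t):
    """Return the hardware modules suitable for the processing type (a1, a2, t)."""
    return PT_TO_HM.get((a1, a2, t), [])
```

An unknown processing type simply yields no candidate modules, leaving the work to the host.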
Table 1 shows the notations of the most common processing types, together with some HMs suitable for implementing the different PTs. Of course, this list of basic processing types needs further extensions and corrections.
Our investigations in the field of homogeneous specialized processors are summarized in [5]. There, the so-called Distributed Functional structures (DF-structures) are discussed, in which a direct mapping of algorithms into circuits is accomplished: the algorithms of the basic procedures are modelled by specialized logical networks in the course of signal propagation.
Two examples of DF-structures are given below.
Extremum Selector (alpha-structure)
Consider a two-dimensional homogeneous structure of size m*n with m n-bit elements (binary numbers) of the processed array written in its memory so that each element occupies one row of the matrix (most significant digits to the left).
For the maximum selection, a known algorithm of column-wise (left-to-right) inspection of the bit values a of the array elements is used; it is described as follows.
Step 1. The contents of the first (left) column are looked over, that is, the most significant digits of all m elements. If all these digits are zeros, then at the following step the second digits of all m elements are looked over. If, however, the first column contains both zeros and ones, then at the second step only those elements which had ones in the first position are looked over.
Step j. The contents of the j-th column (j-th digits of all elements) are looked over, in those rows which were singled out at the (j-1)-th step. If all these digits are zeros, then at the following step the (j+1)-th digits of the same rows are looked over. If there are both zeros and ones in the memory elements looked over at the j-th step, then at the (j+1)-th step only rows corresponding to ones are looked over.
The subset of the rows singled out at the last (n-th) step (it may consist of a single row) contains the maximal elements.
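Steps 1 through n above can be simulated directly. A minimal Python sketch, assuming the array is given as a list of equal-length bit rows with the most significant digit first:

```python
def select_max_rows(bits):
    """Column-wise maximum selection over a list of equal-length bit rows.

    bits: list of m rows, each a list of n bits (most significant first).
    Returns the indices of the rows holding the maximal element, mirroring
    Steps 1..n of the column-wise inspection algorithm.
    """
    n = len(bits[0])
    candidates = set(range(len(bits)))   # rows still "singled out"
    for j in range(n):                   # inspect column j, MSB first
        ones = {i for i in candidates if bits[i][j] == 1}
        if ones:
            # the column contains ones: keep only the rows that had them
            candidates = ones
        # if the column is all zeros among the candidates, keep them all
    return sorted(candidates)
```

For example, for the rows 011, 101, 110 the single surviving row index is 2 (the value 110); equal maximal elements all survive, as in the description above.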
As shown in [5], this algorithm is realized when each cell of the distributed logical network implements the functions

  z' = z(a OR NOT y),   x' = x OR az,

where the variables x and z belong to the vertical and horizontal look-over channels, respectively, and y in each column should be set to x'_m, the x output of the last cell of that column.
Evidently, the implementation of minimum selection differs only in that the negations of all bit values a should be used during the array inspection.
Digital Compressor
We call a digital compressor a functional unit realizing the compression of binary vectors, that is, the conversion of an arbitrary binary vector into a prefix (suffix) vector of the same weight. Various types of digital compressors are known. Consider one of them, based on a simple DF-structure called the lambda-matrix.
This is a two-dimensional homogeneous array (Figure 2a), each cell of which contains two logical gates, AND and OR (Figure 2b), and realizes the logical functions z' = z AND t (the horizontal channel) and t' = z OR t (the vertical channel).
Let an arbitrary binary vector be applied to the z inputs of the left boundary of the lambda-matrix.
Consider the first (leftmost) column of the matrix. The variable t retains its initial value 0 in the vertical channel of this column only while z = 0. In the i1-th cell, where z = 1 is encountered for the first time, the value of t changes to 1 and cannot change back before the lower boundary. However, the i1-th cell itself still receives the signal t = 0. Hence, it is the single cell in the whole column where the combination z AND (NOT t) = 1 is present. This combination may serve as an indication for extracting this "one".
The horizontal channel of the cell thus indicated is closed by the signal t = 0. Hence, the first "one" of the given vector does not propagate further along its row. In all cells below the indicated one, t = 1, so that z' = z. Thus, the inputs of the second column receive a duplicate of the given vector with its first "one" removed.
Similar transformations are performed in the second, the third, and the subsequent columns: in some i2-nd cell of the 2nd column the second "one" of the given vector is indicated, in some i3-rd cell of the 3rd column the third "one" is indicated, and so on.
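The column-by-column behaviour of the lambda-matrix (z' = z AND t, t' = z OR t, with t = 0 at the top of every column) can be simulated directly. In the sketch below, the result is read off the t signals leaving the bottom boundary of the columns, which form a prefix vector of the same weight as the input; this particular read-out convention is an assumption of the sketch, since the text does not fix where the output is taken.

```python
def compress(vector):
    """Pass a binary vector through an n-column lambda-matrix.

    vector: list of n bits applied to the z inputs of the left boundary.
    Each cell computes z' = z AND t (horizontal) and t' = z OR t (vertical),
    with t = 0 entering the top of every column, so each column blocks the
    first remaining "one". Returns the t values at the bottom boundary,
    which come out as a prefix vector of the same weight.
    """
    n = len(vector)
    z = list(vector)
    bottom = []
    for _ in range(n):              # one column of the matrix at a time
        t = 0                       # t = 0 at the top of the column
        z_next = []
        for zi in z:                # propagate down the column, top to bottom
            z_next.append(zi & t)   # z' = z AND t: the first "one" is blocked
            t = zi | t              # t' = z OR t
        z = z_next
        bottom.append(t)            # 1 iff this column still saw a "one"
    return bottom
```

For instance, the vector 0101 (weight 2) comes out as 1100: the two "ones" have been compressed to the first two positions.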
The distinguishing characteristics of this approach are:
A technique of parallel programming for the combined architecture is now under development, based on the language VEPRAN [6] extended by appropriate means to specify the processing types and to control the movement of parallel data in the given concentrated heterogeneous system. Using this language, one can analyze the algorithm for solving any given problem and present it as a sequence of processing types. In the course of executing such a program, each processing type is assigned to a definite hardware module (or a combination of modules) taken from the set of real modules included in the system.
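The binding of a program, presented as a sequence of processing types, to the modules actually present in a system might look schematically as follows. All names in this Python sketch are hypothetical, since the text leaves the VEPRAN-level mechanism unspecified; types with no installed module fall back to the SIMD host.

```python
# Hypothetical sketch of binding processing types to installed modules.
# Module names follow Table 1; the sets and preferences are illustrative.

INSTALLED = {"Pipeline", "Extremum Selector", "Permutation Network"}

PREFERRED = {
    "VSA": ["Pipeline"],            # vector-scalar arithmetic
    "VoX": ["Extremum Selector"],   # maximum of a vector
    "VoC": ["Digital Compressor"],  # vector compression
    "VVP": ["Permutation Network"], # vector permutation
}

def bind(program):
    """Assign each processing type in the program to an installed module,
    falling back to the SIMD host when no suitable module is present."""
    plan = []
    for pt in program:
        module = next((m for m in PREFERRED.get(pt, []) if m in INSTALLED), None)
        plan.append((pt, module or "host (SIMD)"))
    return plan
```

Here a VoC step would run on the host, because no Digital Compressor is installed in this (hypothetical) configuration.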
The combined architecture is considered an open system which can be supplemented with additional processors according to the requirements of the application. From the user's point of view, a computer of combined architecture appears conventional. But actually, in addition to the usual library of programs, it has a "library" of hardware modules.
In some sense, the combined architecture realizes what Daniel Slotnick predicted in [7]: "What we can place, however, is the evolution of large systems ... to comprehend entire families of special-purpose 'peripheral devices' in a way not different in principle from the way they now comprehend their library of programs".
Further research in this subject will cover:
The results of this investigation will contribute to the European efforts in designing new HPC architectures.
Conclusion
A new approach to the organization of high-performance parallel computing systems is suggested.
BIBLIOGRAPHY
[Figure: Application Problem --> Design of Algorithm <-- Processing Types]
Table 1. Basic processing types (PT) and suitable hardware modules (HMs).

  A1      A2        T           PT    HM(s)
  ------  --------  ----------  ----  ----------------------------
  Vector  Vector    Arithmetic  VVA   Pipeline, PVM, PVA, Systolic
  Vector  --        Arithmetic  VoA   Pipeline, PVM, PVA, Systolic
  Vector  Vector    Logic       VVL   Set Intersection Processor
  Vector  Scalar    Arithmetic  VSA   Pipeline
  Vector  Scalar    Search      VSS   Associative Processor
  Vector  Interval  Search      VIS   Associative Processor
  Vector  Vector    Search      VVS   Set Intersection Processor
  Vector  --        Order       VoO   Sorting Network, Systolic
  Vector  --        maX         VoX   Extremum Selector
  Vector  --        miN         VoN   Extremum Selector
  Vector  Vector    Permute     VVP   Permutation Network
  Vector  --        Permute     VoP   Permutation Network
  Vector  --        Compress    VoC   Digital Compressor
  Vector  --        Expand      VoE   Permutation Network
  Vector  --        Logic       VoL   Associative Processor
  Binary  Binary    Logic       BBL   Associative Processor
  Binary  Binary    Permute     BBP   Permutation Network
  Binary  --        Permute     BoP   Permutation Network