Why a combined architecture?
In recent years the architectures of supercomputers have remained essentially unchanged: vector/pipeline systems (Cray Y-MP, Cray C90, NEC SX-3, etc.) and MIMD multiprocessors (TMC CM-5, KSR1, Intel XP/S, etc.), while progress in performance has been achieved mostly by improving the physical characteristics of the integrated circuits.
However, things are changing.
The sharp drop in government financing of high-performance computing compels designers and vendors to seek new applications for their complicated and expensive products. Instead of the highest performance, the most economical performance is becoming the leading demand in the field.
The new application problems now drawing the attention of specialists are not limited to pure computation, but are based on quite different paradigms. These problems are usually characterized by their specific data types, algorithms, and implementation techniques. For different problems, and for various fragments of the same problem, it is expedient to use the features of different architectures. Utilization of a rigid architecture leads to a gap between the peak and the real performance.
In this situation it is natural to use more sophisticated ways of exploiting the rich resources of modern electronic components. One such way is to support the computer's hardware and software with specialized processors.
The idea of using specialized hardware in computer systems is actually not new. However, for a long time technological and economic reasons hampered the broad use of dedicated hardware in high-performance computing systems. Now, with the advent of modern FPGA devices and silicon compilers, these reasons fall away.
In our opinion, the only important problem is to find an integral framework for utilizing different specialized processors. The combined architecture provides such a framework, as well as an appropriate methodology for selecting the types of processors.
The Combined Architecture (Figure 1) is a cooperation of a highly parallel host computer with a set of specialized processors, or Hardware Modules (HMs). In this architecture, the solving of any problem is treated as the interaction of several processes, and the execution of each process is delegated to the specialized subsystem most efficient at implementing it. The subsystems are controlled so that their balanced operation is ensured and their complementary features are best exploited. For each subsystem, a structure is chosen that best corresponds to the function it performs.
The hardware modules are connected to the host by a suitable interconnection network.
In choosing the architecture of the subsystems, the following considerations should be taken into account.
In the combined architecture, the main working load of the processing is delegated to the coprocessors. Hence, the performance requirements on the basic computer can be moderate. However, in order to ensure effective interaction between the subsystems and the necessary data flows, this computer should have a sufficiently high degree of parallelism, a large main memory, and adequate software.
At the same time, in view of the general objectives of the system, extremely high demands should be made on the performance of each coprocessor. This means that special care is needed in selecting the structures of the coprocessors. The most suitable architecture for the basic subsystem seems to be the fine-grained SIMD, similar to the DAP or CM [4]. These computers fit the above-mentioned requirements (memory size, parallelism, etc.), while they are much cheaper than other parallel systems because they use simple single-bit processing elements.
This simplicity was, in fact, the weak point of this remarkable architecture, preventing it from taking a leading place in computation-intensive applications. In the combined architecture, however, this drawback is not a decisive factor, because hard computations are, according to the main idea of the combined architecture, delegated to the coprocessors. On the other hand, the fine-grained SIMD host computer can ensure the necessary input/output and preprocessing of large data arrays. Thus, the fine-grained SIMD host and the set of specialized processors form a mutually complementary pair.
Another benefit of using a fine-grained SIMD machine as the host is that all its features and advantages are preserved and can be exploited when the system solves problems that fit the "data parallel" paradigm well.
Thus, the combined architecture might extend the lifetime of existing fine-grained SIMD computers.
Two types of specialization of material processing and article manufacturing are conceivable in every area of industry: item-oriented and process-oriented. In the first case, the production is oriented to manufacturing a single article, or a group of similar articles. In the second, it is adapted to implementing definite technological procedures which are typical in producing many different articles, and the production is done by combining these procedures in the required sequences. In the development of process-oriented manufacturing, people invented very efficient tools for casting, punching, welding, etc., and those tools are successfully used in producing a large variety of things.
What is the analogy in data processing? Here, too, we have some well-established "technologies": procedures and types of processing with broad application to various problems. Thus, in numerical problems one of the most widespread procedures is vector/matrix multiplication, while in non-numerical tasks the lion's share belongs to searching and sorting.
Analyzing the common numerical methods, algorithms, and programming languages, one can select a set of recurring general technologies, or styles of data treatment, which we call processing types. In such an analysis, specific features of existing hardware should be taken into account in order to make possible a reasonable mapping of the processing types onto efficient hardware modules.
Our goal is to provide a classification of processing styles, matching them against the known hardware structures, in order to find a reasonable mapping:
PROCESSING TYPE -> HARDWARE MODULE
Indeed, the variety of styles, or technologies involved in machine realization of different application problems is not too large.
We will describe the Processing Type (PT) as a three-tuple: PT = {A1,A2,T},
where A1(A2) corresponds to the data type of the first (second) argument, and T corresponds to the transformation to be executed upon these arguments.
In massive procedures, the terms A1 and A2 usually take the values "Vector (V)", "Scalar (S)", and "Binary (B)". Other possible types of arguments can be: "Array", "Set", "Relation", "Tree", etc. Examples of transformations T are: "Arithmetic (A)", "Logic (L)", "Permute (P)".
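The three-tuple notation can be sketched in code. The following Python fragment is illustrative only: the letter codes for arguments and transformations follow the abbreviations above, but the tuple encoding and the dictionary contents are assumptions made for this sketch, not a fixed catalogue.

```python
# Illustrative sketch: encoding processing types PT = {A1, A2, T} as
# Python tuples and mapping them to the hardware modules named in Table 1.
# None stands for an absent second argument (written "--" in the table).

PT_TO_HM = {
    # (A1, A2, T): suitable hardware modules
    ("V", "V", "A"): ["Pipeline", "PVM", "PVA", "Systolic"],  # VVA
    ("V", "S", "A"): ["Pipeline"],                            # VSA
    ("V", None, "X"): ["Extremum Selector"],                  # VoX (maX)
    ("V", None, "C"): ["Digital Compressor"],                 # VoC (Compress)
    ("B", "B", "L"): ["Associative Processor"],               # BBL
}

def candidate_modules(a1, a2, t):
    """Return the hardware modules suitable for the processing type (a1, a2, t)."""
    return PT_TO_HM.get((a1, a2, t), [])
```

An unknown processing type simply yields no candidate modules, leaving the work to the host.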
Table 1 shows the notations of the most common processing types, together with some HMs suitable for implementing the different PTs. Of course, this list of basic processing types needs further extensions and corrections.
Our investigations in the field of homogeneous specialized processors are summarized in [5]. There, the so-called Distributed Functional structures (DF-structures) are discussed, in which a direct mapping of algorithms into circuits is accomplished: the algorithms of the basic procedures are modelled by specialized logical networks in the course of signal propagation.
Two examples of DF-structures are given below.
Extremum Selector (alpha-structure)
Consider a two-dimensional homogeneous structure of size m*n with m n-bit elements (binary numbers) of the processed array written in its memory so that each element occupies one row of the matrix (most significant digits to the left).
For the maximum selection, a known algorithm of column-wise (left-to-right) inspection of the bit values a of the array elements is used; it is described as follows.
Step 1. The contents of the first (left) column are looked over, that is, the most significant digits of all m elements. If all these digits are zeros, then at the following step the second digits of all m elements are looked over. If, however, the first column contains both zeros and ones, then at the second step only those elements which had ones in the first position are looked over.
Step j. The contents of the j-th column (j-th digits of all elements) are looked over, in those rows which were singled out at the (j-1)-th step. If all these digits are zeros, then at the following step the (j+1)-th digits of the same rows are looked over. If there are both zeros and ones in the memory elements looked over at the j-th step, then at the (j+1)-th step only rows corresponding to ones are looked over.
The subset of the rows singled out at the last (n-th) step (it may consist of a single row) contains the maximal elements.
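Steps 1 through n above can be simulated directly. A minimal Python sketch, assuming the array is given as a list of equal-length bit rows with the most significant digit first:

```python
def select_max_rows(bits):
    """Column-wise maximum selection over a list of equal-length bit rows.

    bits: list of m rows, each a list of n bits (most significant first).
    Returns the indices of the rows holding the maximal element, mirroring
    Steps 1..n of the column-wise inspection algorithm.
    """
    n = len(bits[0])
    candidates = set(range(len(bits)))   # rows still "singled out"
    for j in range(n):                   # inspect column j, MSB first
        ones = {i for i in candidates if bits[i][j] == 1}
        if ones:
            # the column contains ones: keep only the rows that had them
            candidates = ones
        # if the column is all zeros among the candidates, keep them all
    return sorted(candidates)
```

For example, for the rows 011, 101, 110 the single surviving row index is 2 (the value 110); equal maximal elements all survive, as in the description above.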
As shown in [5], this algorithm is realized when each cell of the distributed logical network implements the functions

  z' = z(a OR NOT y),   x' = x OR az,

where the variables x and z belong to the vertical and horizontal look-over channels, respectively, and y in each column should be set to x'_m, the x output of the last cell of that column.
Evidently, the implementation of minimum selection differs only in that the negations of all bit values a should be used during the array inspection.
Digital Compressor
We call a digital compressor a functional unit realizing the compression of binary vectors, that is, the conversion of an arbitrary binary vector into a prefix (suffix) vector of the same weight. Various types of digital compressors are known. Consider one of them, based on a simple DF-structure called the lambda-matrix.
This is a two-dimensional homogeneous array (Figure 2a), each cell of which contains two logical gates, AND and OR (Figure 2b), and realizes the logical functions z' = z AND t (the horizontal channel) and t' = z OR t (the vertical channel).
Let an arbitrary binary vector be applied to the z inputs of the left boundary of the lambda-matrix.
Consider the first (leftmost) column of the matrix. The variable t retains its initial value 0 in the vertical channel of this column only while z = 0. In the i1-th cell, where z = 1 is encountered for the first time, the value of t changes to 1 and cannot change back before the lower boundary. However, the i1-th cell itself still receives the signal t = 0. Hence, it is the single cell in the whole column where the combination z AND (NOT t) = 1 is present. This combination may serve as an indication for extracting this "one".
The horizontal channel of the cell thus indicated is closed by the signal t = 0. Hence, the first "one" of the given vector does not propagate further along its row. In all cells below the indicated one, t = 1, so that z' = z. Thus, the inputs of the second column receive a duplicate of the given vector with its first "one" removed.
Similar transformations are performed in the second, the third, and the subsequent columns: in some i2-nd cell of the 2nd column the second "one" of the given vector is indicated, in some i3-rd cell of the 3rd column the third "one" is indicated, and so on.
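The column-by-column behaviour of the lambda-matrix (z' = z AND t, t' = z OR t, with t = 0 at the top of every column) can be simulated directly. In the sketch below, the result is read off the t signals leaving the bottom boundary of the columns, which form a prefix vector of the same weight as the input; this particular read-out convention is an assumption of the sketch, since the text does not fix where the output is taken.

```python
def compress(vector):
    """Pass a binary vector through an n-column lambda-matrix.

    vector: list of n bits applied to the z inputs of the left boundary.
    Each cell computes z' = z AND t (horizontal) and t' = z OR t (vertical),
    with t = 0 entering the top of every column, so each column blocks the
    first remaining "one". Returns the t values at the bottom boundary,
    which come out as a prefix vector of the same weight.
    """
    n = len(vector)
    z = list(vector)
    bottom = []
    for _ in range(n):              # one column of the matrix at a time
        t = 0                       # t = 0 at the top of the column
        z_next = []
        for zi in z:                # propagate down the column, top to bottom
            z_next.append(zi & t)   # z' = z AND t: the first "one" is blocked
            t = zi | t              # t' = z OR t
        z = z_next
        bottom.append(t)            # 1 iff this column still saw a "one"
    return bottom
```

For instance, the vector 0101 (weight 2) comes out as 1100: the two "ones" have been compressed to the first two positions.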
The distinguishing characteristics of this approach are:
A technique of parallel programming for the combined architecture is now under development, based on the language VEPRAN [6] extended by appropriate means to specify the processing types and to control the movement of parallel data in the given concentrated heterogeneous system. Using this language, one can analyze the algorithm for solving any given problem and present it as a sequence of processing types. In the course of executing such a program, each processing type is assigned to a definite hardware module (or a combination of modules) taken from the set of real modules included in the system.
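The binding of a program, presented as a sequence of processing types, to the modules actually present in a system might look schematically as follows. All names in this Python sketch are hypothetical, since the text leaves the VEPRAN-level mechanism unspecified; types with no installed module fall back to the SIMD host.

```python
# Hypothetical sketch of binding processing types to installed modules.
# Module names follow Table 1; the sets and preferences are illustrative.

INSTALLED = {"Pipeline", "Extremum Selector", "Permutation Network"}

PREFERRED = {
    "VSA": ["Pipeline"],            # vector-scalar arithmetic
    "VoX": ["Extremum Selector"],   # maximum of a vector
    "VoC": ["Digital Compressor"],  # vector compression
    "VVP": ["Permutation Network"], # vector permutation
}

def bind(program):
    """Assign each processing type in the program to an installed module,
    falling back to the SIMD host when no suitable module is present."""
    plan = []
    for pt in program:
        module = next((m for m in PREFERRED.get(pt, []) if m in INSTALLED), None)
        plan.append((pt, module or "host (SIMD)"))
    return plan
```

Here a VoC step would run on the host, because no Digital Compressor is installed in this (hypothetical) configuration.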
The combined architecture is considered an open system which can be supplemented with additional processors according to the requirements of the application. From the user's point of view, a computer of combined architecture appears conventional. But actually, in addition to the usual library of programs, it has a "library" of hardware modules.
In some sense, the combined architecture realizes what Daniel Slotnick predicted in [7]: "What we can place, however, is the evolution of large systems ... to comprehend entire families of special-purpose 'peripheral devices' in a way not different in principle from the way they now comprehend their library of programs".
Further research in this subject will cover:
The results of this investigation will contribute to the European efforts in designing new HPC architectures.
Conclusion
A new approach to the organization of high-performance parallel computing systems is suggested.
BIBLIOGRAPHY
[Figure: Application Problem --> Design of Algorithm <-- Processing Types]
Table 1. Basic processing types (PT) and suitable hardware modules (HMs).

  A1      A2        T           PT    HM(s)
  ------  --------  ----------  ----  ----------------------------
  Vector  Vector    Arithmetic  VVA   Pipeline, PVM, PVA, Systolic
  Vector  --        Arithmetic  VoA   Pipeline, PVM, PVA, Systolic
  Vector  Vector    Logic       VVL   Set Intersection Processor
  Vector  Scalar    Arithmetic  VSA   Pipeline
  Vector  Scalar    Search      VSS   Associative Processor
  Vector  Interval  Search      VIS   Associative Processor
  Vector  Vector    Search      VVS   Set Intersection Processor
  Vector  --        Order       VoO   Sorting Network, Systolic
  Vector  --        maX         VoX   Extremum Selector
  Vector  --        miN         VoN   Extremum Selector
  Vector  Vector    Permute     VVP   Permutation Network
  Vector  --        Permute     VoP   Permutation Network
  Vector  --        Compress    VoC   Digital Compressor
  Vector  --        Expand      VoE   Permutation Network
  Vector  --        Logic       VoL   Associative Processor
  Binary  Binary    Logic       BBL   Associative Processor
  Binary  Binary    Permute     BBP   Permutation Network
  Binary  --        Permute     BoP   Permutation Network