Kraeva M.A., Malyshkin V.E. Assembly Technology for Parallel Realization of Numerical Models on MIMD-Multicomputers. In the special issue of the International Journal on Future Generation Computer Systems, devoted to Parallel Computing Technologies. Vol. 17 (2001), No. 6, pp. 755-765.
1. Introduction
Parallel implementation of realistic numerical models, which directly model a physical phenomenon on the basis of a description of its behaviour in a local area, usually requires high performance computations. However, these models are also remarkable for irregularity, and even dynamically changing irregularity, of their data structures (adaptive meshes, variable time steps, particles, etc.). For example, in the PIC method the test particles are the source of such irregularity. Additionally, parallel programs implementing realistic numerical models should possess new dynamic properties such as dynamic load balancing. Hence, these models are very difficult to parallelize effectively and to implement with high performance using conventional programming systems.
The Assembly Technology (AT) [1-3] was especially created to support the development of fragmented parallel programs for multicomputers. Fragmentation and automatic dynamic load balancing are the key features of programming and program execution under the AT. In order to provide high performance of target programs, the AT is implemented in different ways for different application areas. In this paper, the application of the AT to the implementation of large-scale numerical models is demonstrated on the example of a parallel implementation of the PIC method [4,5].
2. The PIC method and the problems of its parallel implementation
Particle simulation is a powerful tool for modeling the behaviour of complex non-linear phenomena in plasmas and fluids. In the PIC method, the trajectories of a huge number of test particles are calculated as these particles move under the influence of electromagnetic fields computed self-consistently on a discrete mesh. These trajectories represent the desired solution of the system of differential equations describing the physical phenomenon under study.
A real physical space is represented by a model of the simulation domain called the space of modeling (SM). The electric field E and the magnetic field B are defined as vectors discretised upon a rectangular mesh (or several shifted meshes, as shown in Figs. 1 and 2). Thus, as distinct from other numerical methods on rectangular meshes, in the PIC method there are two different data structures: particles and meshes. No particle affects another particle directly. At any moment of modeling, a particle belongs to a certain cell of each mesh.
Each charged particle is characterized by its mass, co-ordinates and velocity. Instead of solving the equations in the 6D space of co-ordinates and velocities, the dynamics of the system is determined by integrating the equations of motion of every particle in a series of discrete time steps. At each time step t_{k+1} := t_k + Δt the following is done (a code sketch follows the list):
- for each particle, the Lorentz force is calculated from the values of the electromagnetic fields at the nearest mesh points (gathering phase);
- for each particle, the new co-ordinates and velocity are calculated; a particle can move from one cell to another (moving phase);
- for each particle, the charge carried by the particle to the new cell vertices is calculated in order to obtain the current charge and density, which are also discretised upon the rectangular mesh (scattering phase);
- Maxwell's equations are solved to update the electromagnetic fields (mesh phase).
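For illustration only, one time step can be sketched in C as follows. This is a minimal sketch, not the authors' actual code; all type and function names are hypothetical and merely stand for the four phases above.

```c
/* Minimal sketch of one PIC time step; all names are hypothetical. */
typedef struct { double x[3], v[3], mass, charge; } Particle;
typedef struct { double *E, *B, *rho, *J; int nx, ny, nz; } Mesh;

void gather_lorentz_force(const Particle *p, const Mesh *m, double f[3]);
void move_particle(Particle *p, const double f[3], double dt);
void scatter_charge(const Particle *p, Mesh *m);
void solve_maxwell(Mesh *m, double dt);

void pic_time_step(Particle *p, int np, Mesh *mesh, double dt)
{
    for (int i = 0; i < np; i++) {
        double f[3];
        gather_lorentz_force(&p[i], mesh, f); /* phase 1: gathering  */
        move_particle(&p[i], f, dt);          /* phase 2: moving     */
        scatter_charge(&p[i], mesh);          /* phase 3: scattering */
    }
    solve_maxwell(mesh, dt);                  /* phase 4: mesh phase */
}
```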
Fig. 1. A cell of the SM with the electric field E and the magnetic field B, discretised upon shifted meshes.
Fig. 2. The whole space of modeling (SM) assembled out of cells.
The sizes of a time step and of a cell are chosen in such a manner that a particle cannot fly farther than into an adjacent cell within one time step of modeling. The number of time steps depends on the physical experiment.
The PIC algorithm has great potential for parallelisation, because all the particles move independently. The volume of computations in the first three phases of each time step is proportional to the number of particles, and about 90% of multicomputer resources are spent on particle processing. Thus, in order to implement the PIC code on a multicomputer with high performance, an equal number of particles should be assigned for processing to each processor element (PE). However, on MIMD distributed memory multicomputers, the performance characteristics of the PIC code crucially depend on how the mesh and the particles are distributed among the PEs. In order to decrease the communication overheads in the first and third phases of a time step, a PE should contain both the cells (the values of the electromagnetic fields at the mesh points) and the particles located inside them. Unfortunately, in the course of modeling, some particles fly from one cell to another. To satisfy the previous requirement, two basic decompositions can be used [4,5].
In the so-called Lagrangian decomposition, an equal number of particles is assigned to each PE with no regard to their positions in the SM. In this case, the values of the electromagnetic fields, the current charge and the density at all the mesh points should be copied into every PE; otherwise, the communication overheads in the first and third phases will decrease the effectiveness of parallelisation. The disadvantages of the Lagrangian decomposition are the following (see the sketch after this list):
- severe memory requirements;
- communication overheads at the second phase (to update the current charge and the current density in each PE).
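The update of the replicated mesh copies amounts to a global reduction. As a purely illustrative sketch (the paper's own communication layer is not shown here; the use of MPI and flat arrays of double is an assumption), such an update could be written as:

```c
#include <mpi.h>

/* Sketch: sum the partial charge/density contributions that every PE
   scattered into its private copy of the mesh, so that afterwards each
   PE holds the complete mesh values. MPI and flat storage are assumed. */
void reduce_replicated_mesh(double *charge, double *density, int n_mesh_points)
{
    MPI_Allreduce(MPI_IN_PLACE, charge,  n_mesh_points,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, density, n_mesh_points,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```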
In the Eulerian decomposition, each PE contains a fixed rectangular sub-domain, including the electromagnetic fields at the corresponding mesh points and the particles in the corresponding cells. If a particle leaves its sub-domain and flies into another sub-domain in the course of modeling, then this particle should be transferred to the PE containing the latter sub-domain. Thus, even with an equal initial workload of the PEs, after several steps of modeling some PEs might contain more particles than others. This results in load imbalance. The character of the particles' motion depends not only on the equations, but also on the initial particle distribution and the initial values of the electromagnetic fields.
Let us consider some examples of particle distributions, which correspond to different real physical experiments.
- Uniform distribution of the particles in the entire space of modeling.
- The case of a plate. The space of modeling has size n1 x n2 x n3. The particles are uniformly distributed in a space of size k x n2 x n3 (k << n1).
- Flow. The set of particles is divided into two subsets: the particles with zero initial velocity and the active particles with initially nonzero velocity. The active particles are organized as a flow crossing the space along a certain direction.
- Explosion. There are two subsets of particles. The background particles with zero initial velocities are uniformly distributed in the entire space of modeling. All the active particles form a symmetric cloud (r << h, where r is the radius of the cloud and h is the mesh step). The velocities of the active particles are directed along the radius of the cloud.
The main problem of programming is that the data distribution among the PEs depends not only on the volume of data, but also on the data properties (particle velocities, the configuration of the electromagnetic fields, etc.). With the same volume of data but different particle distributions inside the space, the data processing is organized in different ways.
As the particle distribution is not stable in the course of modeling, the program control and the data distribution among the PEs should change dynamically. It is clear that a parallel implementation of the PIC method on a distributed memory multicomputer strongly demands dynamic load balancing.
3. Substantial features of the AT
Numerical methods in general, and the PIC method in particular, are very suitable for the application of the AT. The fragmented representation of an algorithm affects different stages of the development of an application program. Here we briefly list some peculiarities of the AT.
3.1 Algorithm and program fragmentation
In accordance with the AT, the description of an application problem should
be divided into a system of reasonably small atomic fragments, representing
the implementation entities of a problem description. Fragments might
be represented in programming languages by variables, procedures, subroutines,
macros, nets, notions, functions, etc. depending on the method of the
AT implementation. The program, which realizes an atomic fragment (P_fragment),
contains both data and code. In other words, a program, realizing an application
problem, is assembled out of such small P_fragments of computations which
are connected through variables for data transfer.
The fragmented structure of an application parallel program is kept in
the executable code and provides the possibility for organization of flexible
and high performance execution of the assembled program. The general idea
is the following. An assembled program is represented in a programming
language, and is composed as a system of executable P_fragments. This
program is executed inside every PE, looping over all the P_fragments,
loaded to the PE. If these fragments are small enough, then initially
for each PE of a multicomputer the equal workload is assembled out of
these P_fragments. This is of course a general idea only. Actually, in
order to attain high performance of a target program its assembling is
realized in different ways.
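A minimal sketch of this execution model follows; the concrete layout of a P_fragment is not specified in the paper, so all names here are hypothetical. Each P_fragment couples its data with the code that processes them, and every PE loops over the fragments currently loaded into it.

```c
/* Sketch of the fragmented execution model; all names are hypothetical. */
typedef struct PFragment {
    void *data;                  /* the fragment's own data              */
    void (*run)(void *data);     /* the code that processes these data   */
    struct PFragment *next;      /* next fragment loaded into this PE    */
} PFragment;

/* Executed inside every PE: loop over the P_fragments loaded into it. */
void execute_loaded_fragments(PFragment *loaded)
{
    for (PFragment *f = loaded; f != NULL; f = f->next)
        f->run(f->data);
}
```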
The workload of the PEs can change in the course of computation. If at least one PE becomes overloaded, then a part of the P_fragments (together with the data being processed) which were assigned for execution to the overloaded PE should fly to underloaded PEs, equalizing the workload of the multicomputer's PEs. Dynamic load balancing of a multicomputer is based on such fragmentation [5,6]. The algorithms of assembling a target program out of atomic fragments are not discussed here in detail.
3.2 Assembling vs. Partitioning
Our basic key word is assembly. Contrary to partitioning, the AT supports
explicit assembling of a whole program out of ready-made fragments of
computations, rather then dividing a problem, defined as a whole, into
the suitable fragments to be executed on the different PEs of a multicomputer.
These fragments are the elementary blocks, the "bricks", to
construct a whole program (a whole problem description), there is no a
difference here). An algorithm of a problem assembling is kept and used
later for dynamic program/problem parallelization. Assembling defines
the "seams" of a whole computation, the way of the fragments
connection. Therefore, these seams are the most suitable way to cut the
entire computation for parallel execution. For this reason program parallelization
can be always done if the appropriate size of atomic fragments is chosen.
3.3 Explicitly two-level programming
Under the AT, suitable atomic fragments of computation are first designed, programmed and debugged separately. Then the whole computation (the problem solution) is assembled out of these fragments.
3.4 Separation of the fine grain and the coarse grain computations
The fine grain and the coarse grain computations are executed on different levels of hardware and software. Under the AT, the fine grain computations are encapsulated inside a module (P_fragment) that realizes the computations bound up with an atomic fragment. Such a module can be implemented very efficiently on a processor element. The whole parallel program is assembled out of these ready-made modules. The system of modules of a program defines a system of interacting processes (the coarse grain computations). Encapsulating the fine grain computations, with all their complexity, inside an atomic fragment provides the possibility of formalizing parallel program construction and of using an explicitly two-level representation of an algorithm: the programming level inside an atomic fragment and the scheme level of application program assembling.
3.5 Separation of semantics and scheme of computation
The fine grain computations define a substantial part of the semantics (the functions) of the computations. They are realized within the atomic fragments. Therefore, on the coarse grain level, only a scheme (non-interpreted or semi-interpreted) of the computations is assembled. This means that formal methods of automatic synthesis of parallel programs [1] can be successfully used, and standard schemes of parallel computations can be accumulated in libraries.
4. Parallelization of numerical methods with AT
Let us consider the assembly approach to the parallelisation of numerical methods on the example of a parallel implementation of the PIC method.
The line and the 2D grid structures of interprocessor communications of a multicomputer are sufficient for the effective parallel implementation of numerical methods on rectangular meshes. In [3], an algorithm for mapping the 2D grid into the hypercube that keeps the neighbourhood of PEs is given.
A cell is a natural atomic fragment of computation for the implementation of a numerical method. It contains both data (the particles inside the cells of a fragment and the values of the electromagnetic fields and current density at their mesh points) and the procedures which operate on these data. In the PIC method, when a particle moves from one cell to another, it should be removed from the former cell and added to the latter cell (see the sketch below). Thus, we can say that with the AT the Eulerian decomposition is realized.
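A possible layout of such a cell and of the particle transfer between cells is sketched below. This is a sketch under assumed data structures, not the actual implementation; all names are hypothetical.

```c
/* Sketch of a cell as an atomic fragment; layout is hypothetical. */
typedef struct { double x[3], v[3]; } Particle;

typedef struct Cell {
    double E[3], B[3], rho, J[3]; /* field values at the cell's mesh points */
    Particle *particles;          /* particles currently inside this cell   */
    int np, cap;                  /* particle count and storage capacity    */
} Cell;

/* When a particle crosses a cell boundary, it is removed from the old
   cell and appended to the new one (swap-remove keeps the array dense).
   For simplicity the sketch assumes to->cap is sufficient. */
void transfer_particle(Cell *from, int i, Cell *to)
{
    Particle p = from->particles[i];
    from->particles[i] = from->particles[--from->np];
    to->particles[to->np++] = p;
}
```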
4.1 PIC parallelization on the line of PEs
Let us first consider in what way the PIC is parallelized for multicomputers with the line structure of interprocessor communications. The three-dimensional simulation domain is initially partitioned into N blocks (where N is the number of PEs). Each block_i consists of several adjacent layers of cells and contains approximately the same number of particles (Figs. 3, 4). When the load imbalance becomes crucial, some layers of the block located in an overloaded PE are transferred to another, less loaded PE. In the course of modeling, adjacent blocks are located in linked PEs; therefore, adjacent layers are located in the same or in linked PEs. This is important for the second phase, in which some particles can fly from one cell into another, and for the fourth phase, when the recalculation of the values of the electromagnetic fields in a certain cell also uses the values in the adjacent cells.
Fig. 3. Decomposition of the SM for implementation of the PIC on the line of PEs.
4.2 PIC parallelization on the 2D grid of PEs
Let us now consider the parallelization of the PIC method for the 2D grid of PEs. Let the number of PEs be equal to l x m. The SM is divided into l blocks orthogonal to a certain axis. Each block consists of several adjacent layers and contains about NP/l particles (in the same way as was done for the line of PEs). Block_i is assigned for processing to the i-th row of the 2D grid (Fig. 4). The blocks are formed so as to provide an equal total workload of every row of the processor grid. Then every block_i is divided into m sub-blocks block_i_j, which are distributed for processing among the m PEs of the row. These sub-blocks are composed in such a way as to provide an equal workload of every PE of the row.
If an overload of at least one PE occurs in the course of modeling, this PE is able to recognize it at the moment when its number of particles substantially exceeds NP/(l x m). Then this PE initiates the re-balancing procedure.
Fig. 4. Decomposition of the SM for implementation of the PIC on the 2D grid of PEs.
If the number of layers k is comparable with N (or with l in the case of the grid of PEs), it is difficult or even impossible to divide the SM into blocks with an equal number of particles. Also, if the particles are concentrated inside a single cell, it is definitely impossible to divide the SM into equal sub-domains. In order to attain a better load balance, the following modified domain decomposition is used. A layer containing more than the average number of particles is copied into 2 or more neighbouring PEs (Fig. 5); these are virtual layers. The set of particles located inside such a layer is distributed among all these PEs. In the course of load balancing, the particles inside the virtual layers are the first to be redistributed among the PEs, and only if necessary are the layers themselves also redistributed. For the computations inside a PE, there is no difference between virtual and non-virtual layers. Any layer can become virtual in the course of modeling, and a virtual layer can cease to be virtual.
Fig. 5. Virtual layers for implementation of the PIC on the 2D grid of PEs.
We can see that in both cases there is no necessity to provide the flying of individual cells. A cell is a very small fragment; therefore, large resources would be spent to provide its flying. Thus, it is necessary to use bigger indivisible fragments at the step of execution (not at the step of problem/program assembling!). In the case of the line of PEs, a layer of the SM is chosen as the indivisible fragment of the concrete implementation of the PIC. Such a fragment is called a minimal fragment. For the PIC implementation on the grid of PEs, a column is taken as the minimal indivisible fragment (see the sketch below). The procedure realizing a minimal fragment is composed statically out of P_fragments, before the assembling of the whole program. This essentially improves the performance of the executable code.
In such a way, a cell is used as the atomic fragment at the step of the numerical algorithm description. At the step of execution of the numerical algorithm, different minimal fragments, assembled out of atomic fragments, are chosen depending on the architecture of the multicomputer.
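The relation between the two levels can be sketched as follows (a sketch only; the Cell type is taken from the earlier sketch, and the remaining names are hypothetical): a minimal fragment groups the cells of a layer or column, and its procedure sweeps their statically composed cell procedures.

```c
typedef struct Cell Cell;      /* the atomic fragment, as sketched earlier */
void process_cell(Cell *cell); /* its statically composed procedures       */

/* Sketch of a minimal fragment: the indivisible unit at execution time. */
typedef struct {
    Cell **cells;   /* the atomic fragments assembled into this unit */
    int    n_cells;
    int    weight;  /* number of particles; used for load balancing  */
} MinimalFragment;

void run_minimal_fragment(MinimalFragment *mf)
{
    for (int c = 0; c < mf->n_cells; c++)
        process_cell(mf->cells[c]);
}
```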
4.3 Implementation of the PIC method on multicomputers
Using the AT, the PIC method has been implemented on different multicomputer systems (MBC-100, a multitransputer system, the Intel iPSC/860 hypercube, Parsytec PowerXplorer, Cray J90, IBM SP2). In order to provide good portability, the language C was chosen for the parallel PIC code implementation. Since different message passing systems are used on different multicomputers, special communication functions were designed. At compilation time, the actual functions of a target multicomputer are substituted for these communication functions; in particular, the MPI functions might be substituted (see the sketch below). For the dynamic load balancing of the PIC, several algorithms were developed [7,8]. In the cases of the grid communication structure (Fig. 6) and the virtual fragments (particles might fly not only to the neighbouring PEs), special tracing functions are used.
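Such a portability layer can be sketched in C as follows. The macro names are hypothetical and only the MPI branch is shown; the paper confirms the substitution mechanism but not this particular spelling.

```c
/* Sketch of the portability layer: generic communication names are
   bound to a concrete message passing system at compilation time. */
#ifdef USE_MPI
#include <mpi.h>
#define COMM_SEND(buf, bytes, dst) \
    MPI_Send((buf), (bytes), MPI_BYTE, (dst), 0, MPI_COMM_WORLD)
#define COMM_RECV(buf, bytes, src) \
    MPI_Recv((buf), (bytes), MPI_BYTE, (src), 0, MPI_COMM_WORLD, \
             MPI_STATUS_IGNORE)
#endif
```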
Fig. 6. Directions of data transfer for implementation of the PIC on the 2D grid.
According to the AT, the array of particles is divided into N parts, where N is the number of minimal fragments (layers of the SM in the case of the line of PEs, and columns in the case of the 2D grid). When elements of the mesh variables (electromagnetic fields, current charge and density) of different minimal fragments hit upon the same point of the SM, they are unified (Fig. 7). The elements of the mesh variables of one minimal fragment are not added to the data structure of this fragment, but are stored in 3D arrays together with the elements of the mesh variables of the other minimal fragments in the PE. This is possible due to the rectangular shape of the blocks block_i (block_i_j in the case of the 2D grid). Such a decision allows us to decrease the memory requirements and to speed up the computations during the fourth phase of the PIC algorithm.
Fig. 7. Unification of the Hx mesh variables for two minimal fragments (layers).
When, in the course of dynamic load balancing, some minimal fragments are transferred from one PE to another, the size of the 3D arrays of elements of the mesh variables changes dynamically. This demands a special implementation of such dynamic arrays (a sketch follows).
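One possible implementation of such a resizable array is sketched below; the layout and names are assumptions, not the paper's actual code. The sketch relies on the layer index being the slowest-varying dimension of the flat storage, so that a realloc preserves the existing layers.

```c
#include <stdlib.h>

/* Sketch of a resizable mesh-variable array: when minimal fragments
   (layers) migrate between PEs, the extent along the decomposition
   axis changes and the storage is reallocated. */
typedef struct {
    double *v;       /* nx*ny*nz values, flat, layer-major storage */
    int nx, ny, nz;  /* nx = current local number of layers        */
} DynArray3D;

int resize_layers(DynArray3D *a, int new_nx)
{
    double *p = realloc(a->v, (size_t)new_nx * a->ny * a->nz * sizeof *p);
    if (!p) return -1;   /* allocation failed, keep the old array */
    a->v = p;
    a->nx = new_nx;
    return 0;
}
```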
5. Dynamic load balancing
To attain high performance of the parallel PIC implementation, a number of centralized and decentralized load balancing algorithms were especially developed.
5.1 Initial load balancing
A layer of cells is chosen as the minimal fragment for the PIC implementation on the line of PEs. Each minimal fragment has its own weight; the weight of a minimal fragment is equal to the number of particles in this fragment. The sum of the weights of all the minimal fragments in a PE determines the workload of that PE. The 3D simulation domain is initially partitioned by a certain algorithm into N blocks (where N is the number of PEs). Each block consists of several adjacent minimal fragments and contains approximately the average number of particles. For the initial load balancing, two heuristic centralized algorithms were designed [7,8]. These algorithms employ the information about the weights of all the minimal fragments. Each PE has this information and constructs the workload card by the same algorithm (one possible heuristic is sketched below). The workload card contains the list of the minimal fragments that should be loaded into each PE. If the number of minimal fragments is much greater than the number of PEs, it is usually possible to distribute the minimal fragments among the PEs in such a way that every block contains approximately the same number of particles.
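A simple heuristic of this kind is sketched below. It is a sketch only and does not reproduce the actual algorithms of [7,8]: it scans the fragment weights and closes the current block once the accumulated weight reaches the running average load per PE.

```c
/* Sketch of a workload-card heuristic (not the algorithms of [7,8]).
   Block i receives fragments [first_frag[i], first_frag[i+1]). */
void build_workload_card(const int *weight, int n_frag,
                         int n_pe, int *first_frag /* n_pe+1 entries */)
{
    long total = 0;
    for (int f = 0; f < n_frag; f++) total += weight[f];

    long target = total / n_pe, acc = 0;
    int pe = 0;
    first_frag[0] = 0;
    for (int f = 0; f < n_frag && pe < n_pe - 1; f++) {
        acc += weight[f];
        if (acc >= (pe + 1) * target)  /* block reached its share */
            first_frag[++pe] = f + 1;
    }
    while (pe < n_pe) first_frag[++pe] = n_frag;
}
```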
If a considerable portion of the particles is concentrated inside a single cell, it is impossible to divide the SM into blocks with a reasonably equal workload. To solve this problem, the notion of a virtual layer is introduced. The centralized algorithm was modified for the case of virtual fragments.
If an overload of at least one PE occurs in the course of modeling, this PE is able to recognize it at the moment when its number of particles substantially exceeds NP/N. Then this PE initiates the re-balancing procedure.
5.2 Dynamic load balancing
If a load imbalance occurs, the procedure BALANCE is called. In this procedure, the decision about the direction of data transfer and the volume of data to be transferred is taken. Each load balancing algorithm has its own realization of the procedure BALANCE. The procedure TRANSFER is used for the data transfer. There are two implementations of this procedure: for the line of PEs and for the grid of PEs. The procedure TRANSFER is the same for any load balancing algorithm on the line of PEs; the parameters of the procedure are the number of particles to be exchanged and the direction of the data transfer.
Let us consider the algorithms of the dynamic load balancing of the PIC. All the PEs are numbered. In the case of the line of PEs, each PE has a number i, where 0 <= i < number_of_PEs. In the case of the (l x m) grid of PEs, the number of a PE is a pair (i,j), where 0 <= i < l, 0 <= j < m. In the same way, the layers and columns of the SM are numbered.
5.3 Centralized dynamic load balancing algorithm
For the dynamic load balancing, the initial load balancing algorithm can be used. One of the PEs collects the information about the weights of the minimal fragments and broadcasts this information to all the other PEs. All the PEs build the new workload card. After that, neighbouring PEs exchange minimal fragments according to the information in the new workload card.
Imbalance threshold. If centralized algorithms are used for the dynamic load balancing, the PEs exchange the load information, and therefore every PE has information about the number of particles in all the PEs. In every PE, the difference mnp - NP/N (where mnp is the maximum number of particles in a PE, NP is the total number of particles, and N is the number of PEs) is calculated. If the difference is greater than a threshold Th, the procedure BALANCE is called. The threshold can be a constant chosen in advance, or an adaptive number. In the latter case, initially Th = 0. In the course of modeling, the time t_part, which is required to execute phases 1-3 of the PIC algorithm for one particle, is measured. After every call of BALANCE, the balancing time t_bal is measured, and Th is set equal to t_bal/t_part (the number of particles that could be processed in the time one balancing requires). After each subsequent step of the PIC algorithm, Th is decreased by mnp - NP/N. When the value of Th becomes negative, BALANCE is called (see the sketch below).
If the threshold is always equal to zero, the procedure BALANCE is called after each time step of modeling.
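The adaptive threshold logic above can be sketched directly in C; variable names follow the text, everything else is hypothetical.

```c
void BALANCE(void);   /* redistributes minimal fragments (see 5.2) */

double Th = 0.0;      /* adaptive imbalance threshold, initially 0 */

/* Called after every time step. mnp: maximum number of particles in a
   PE; NP: total number of particles; N: number of PEs; t_bal: duration
   of the last balancing; t_part: per-particle time of phases 1-3. */
void check_imbalance(long mnp, long NP, int N, double t_bal, double t_part)
{
    Th -= (double)mnp - (double)NP / N;  /* decrease Th by the imbalance */
    if (Th < 0.0) {
        BALANCE();
        Th = t_bal / t_part;  /* particles processable in one balancing time */
    }
}
```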
5.4 Decentralized dynamic load balancing algorithm for the constant
number of particles
The use of centralized algorithms is good enough for multicomputers containing a few PEs. But if the number of PEs is large, the communication overheads can neutralize the advantages of the dynamic load balancing. In this case it is preferable to use decentralized algorithms. Such algorithms use information about the load balance in a local domain of PEs only.
If the number of model particles does not change in the course of modeling, a simple decentralized algorithm can be suggested. Each PE has the information on how many particles were moved to/from its neighbouring PEs. To equalize the load, it is sufficient just to receive/send the same number of particles from/to the neighbouring PEs (a sketch follows). It should be noted that this algorithm works in the case of virtual fragments only.
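The balancing rule can be sketched as follows for the line of PEs; the bookkeeping of net particle flux and the transport helpers are assumptions, not the paper's code.

```c
void send_particles(int neighbour, int n);    /* hypothetical transport */
void request_particles(int neighbour, int n);

/* Sketch: net_flux[d] is the net number of particles that flew from
   this PE to neighbour d (0 = left, 1 = right) during modeling. With a
   constant total number of particles, the load is equalized by moving
   the same number of particles back. */
void balance_with_neighbours(int net_flux[2])
{
    for (int d = 0; d < 2; d++) {
        if (net_flux[d] > 0)
            request_particles(d, net_flux[d]);  /* get the surplus back */
        else if (net_flux[d] < 0)
            send_particles(d, -net_flux[d]);    /* return the deficit   */
        net_flux[d] = 0;
    }
}
```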
5.5 Specialized decentralized algorithm
To equalize the load balance for the PIC implementation, the following
specialized algorithm was designed. In the course of modeling in every
PE the main direction of particles motion is calculated from the values
of particles velocities. In the load balancing, particles are delivered
to the direction opposite to the main direction. As in the previous algorithm
it is assumed that the number of particles is not changed in the course
of modeling. The number of particles to be transferred from overloaded
PEs to their neighbours in the direction opposite to the main direction
is calculated from the average number of particles and the number of particles
in the PE. Some particles are transferred in advance, in order to reduce
the number of calls of the dynamic load balancing procedure.
5.6 Diffusive load balancing algorithms
The basic diffusive load balancing algorithm was implemented and tested for the parallel PIC implementation [7-9]. The size of the local domain is equal to two. Any diffusive algorithm is characterized by its number of steps; the number of steps defines how far the data of one PE can be transferred in the course of load balancing. For every step, the procedure TRANSFER is called (a single step is sketched below). The greater the number of steps of the diffusive algorithm, the better the load balance that can be attained, but also the more time is required for the load balancing. The tests have shown that the total program time does not decrease with the growth of the number of steps.
5.7 Algorithms comparison
In the following tables, the results of testing different dynamic load balancing algorithms on the problem of modeling a plasma cloud explosion are shown. The size of the problem is the following:
• The size of the space of modeling: 24 x 24 x 36
• The number of background particles: 24*24*36*27 = 559872
• The number of cloud particles: 240128
• The total number of particles: 800000
                                    2 PE    3 PE    4 PE    5 PE    6 PE    7 PE    8 PE   20 PE
Max number of particles in a PE   400000  353256  223190  207663  213364  153948  152389   79930
Duration of particles processing  110190   97462   62012   57375   59072   42821   46980   22267
Total time                        121768  107878   68801   63481   65332   47470   46716   24687
Tab. 1. Dynamic load balancing is absent
                                    2 PE    3 PE    4 PE    5 PE    7 PE    8 PE
Max number of particles in a PE   400000  266787  200102  160091  114388  100091
Duration of particles processing  110625   73862   55722   44498   32189   27579
Total time                        122294   81900   61902   49453   35826   30744
Tab. 2. Centralized load balancing algorithm
                                    2 PE    3 PE    4 PE    5 PE    6 PE    7 PE    8 PE   20 PE
Max number of particles in a PE   400000  266977  200001  160484  133679  153948  100727   42092
Duration of particles processing  110459   75116   56246   45036   59072   37270   28260   11887
Total time                        122048   83191   62606   50083   65332   47470   31508   13353
Tab. 3. Diffusive load balancing algorithm (the number of steps is equal to 2)
                                    2 PE    3 PE    4 PE    5 PE    6 PE    7 PE    8 PE   20 PE
Max number of particles in a PE   400000  328325  200000  161630  133473  115489  100238   41388
Duration of particles processing  110329   90333   56203   45926   36821   32669   28504   11647
Total time                        121952  100022   62337   50946   40980   36419   31801   13093
Tab. 4. Specialized decentralized algorithm
If the number of PEs is greater than 2, load balancing is required. The results of the tests confirm that for a small number of PEs (less than 10), the use of a centralized algorithm is preferable, though the decentralized algorithms also give satisfactory results.
6. Automatic generation of parallel code
The PIC method is applied to the simulation of many natural phenomena. In order to facilitate parallel PIC implementations, a special system of automatic generation of parallel programs was designed. This system consists of the VISual system of parallel program Assembling (VisA) and a generator of parallel code in the C language.
The process of generating a parallel program for the PIC and for the other numerical algorithms on rectangular meshes consists of three steps.
At the first step, a user defines the atomic fragment of computation, i.e., a cell of the SM. This cell contains the elements of the mesh variables at several mesh points, an array of particles (for the PIC method) and procedures in the C language which describe all the computations inside the cell: {procedure_1, ..., procedure_k} (Fig. 8).
Fig. 8. Review window for the Hx mesh variable of an atomic fragment of computation.
At the second step, the user chooses the target multicomputer and the topology of the communication net for which the parallel code should be generated. After that, the generator constructs a minimal fragment (a layer or a column) from the atomic fragments according to the chosen communication topology. The particle arrays of the atomic fragments are merged into a single particle array of the minimal fragment. The elements of the mesh variables which hit upon the same point of the SM are unified. In such a way, for every mesh variable, only one 3D array of its elements is formed.
At the third step, the decision on the PEs' workload is made. The generator creates a parallel program for the target multicomputer. This program includes the data initialization and a time loop. At each iteration of the time loop, k loops (where k is the number of procedures in the description of an atomic fragment) over all the minimal fragments of a PE are run. After each of the k loops, those elements of the mesh variables which are copied in several PEs are updated (if necessary); a sketch of this structure follows.
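The overall structure of the generated program can be sketched as follows. All names are hypothetical (MinimalFragment as in the earlier sketch); only the loop structure reflects the description above.

```c
typedef struct MinimalFragment MinimalFragment;  /* as sketched earlier    */
void update_shared_mesh_elements(void);          /* sync replicated copies */

/* Sketch of the generated time loop: k procedure loops per time step,
   each sweeping the m minimal fragments loaded into this PE, followed
   by an update of the mesh elements copied in several PEs. */
void generated_time_loop(MinimalFragment **mf, int m,
                         void (**procedure)(MinimalFragment *),
                         int k, int n_steps)
{
    for (int t = 0; t < n_steps; t++)
        for (int p = 0; p < k; p++) {
            for (int f = 0; f < m; f++)
                procedure[p](mf[f]);
            update_shared_mesh_elements();
        }
}
```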
All the particles are stored in m arrays (where m is the number of minimal fragments in a certain PE). However, similarly to the case of the minimal fragment assembling, the elements of a mesh variable in all the minimal fragments of a PE form one 3D array.
The user develops the procedures (the computations inside a cell) in the C language, using also several additional statements for defining the computations over the mesh variables.
7. Conclusion
The described algorithms of assembling numerical models provide high performance of execution of the assembled program, its high flexibility in reconstruction of the code, and dynamic tunability to the available resources of a multicomputer.
High performance of the program execution enables the modeling of large-scale problems such as the study of a plasma cloud explosion in a magnetized background, the modeling of the interaction of a laser pulse with plasma, etc.
We apply the assembly technology to the implementation of different numerical methods, and hope to create a general tool to support the implementation of mathematical approximating models.
Publications:
1. V.A. Valkovskii, V.E. Malyshkin. Synthesis of Parallel Programs and Systems on the Basis of Computational Models. Nauka, Novosibirsk, 1988 (in Russian).
2. V. Malyshkin. Functionality in ASSY System and Language of Functional Programming. In Proc. of the First Aizu International Symposium on Parallel Algorithms/Architecture Synthesis, Aizu-Wakamatsu, Fukushima, Japan, 1995. IEEE Comp. Soc. Press, Los Alamitos, California, pp. 92-97.
3. M.A. Kraeva, V.E. Malyshkin. Assembly Technology for Parallel Realization of Numerical Models on MIMD-Multicomputers. In the special issue of the International Journal on Future Generation Computer Systems, devoted to Parallel Computing Technologies. Vol. 17 (2001), No. 6, pp. 755-765.
4. M. Kraeva, V. Malyshkin. Implementation of PIC Method on MIMD Multicomputers with Assembly Technology. In Proc. of HPCN Europe 1997 (High Performance Computing and Networking) Int. Conference, LNCS, Vol. 1255, Springer Verlag, 1997, pp. 541-549.
5. M.A. Kraeva, V.E. Malyshkin. Algorithms of Parallel Realization of PIC Method with Assembly Technology. In Proc. of the 7th HPCN Europe 1999, LNCS, Vol. 1593, Springer Verlag, 1999, pp. 329-338.
6. M. Kraeva, V. Malyshkin. Dynamic Load Balancing Algorithms for Implementation of PIC Method on MIMD Multicomputers. Programmirovanie, 1999, No. 1, pp. 47-53 (in Russian).
Supervisor: Malyshkin V.E.
Ph.D. Student: Nechaev S.P.