Exploiting The Ultimate Parallelism-Bit-level
ABSTRACT
The finest grain parallelism is the ultimate way of getting maximum performance. Bit-level parallelism is the finest grain that we can ever achieve. Bit-level parallelism is also easy to exploit and understand theoretically because it is just a truth table. Problems associated with bit-level parallelism such as hardware limitation and verification techniques are being overcome by advances in VLSI and compiler technologies. With the advent of XILINK PGA, exploiting bit-level parallelism is no longer a dream but there are some issues which must be studied.
INTRODUCTION
SPLASH1 board which comprises 2 VME boards with 32 XC3090 XILINK PGA from Super Computing Research and Brown University, managed to beat CRAY 2 by time on a DNA matching benchmark. It even managed to beat PNAC full custom NMOS chip by 45.5 times. The reason why SPLASH1 can beat a full custom chip which is surely bit-level optimised is the gap in semiconductor technology which goes to the problem of exploiting the techniques.
SPLASH1 is primitive compared to what can be done with full exploitation of bit-level parallelism. First is the problem of realisation. SPLASH1 comes equipped with standard CAD tools where we can enter our schematic and then convert into the format required by XILINK PGAs. The schematic entry end is well established but not satisfactory yet. XILINK software development tools come with logic minimization and auto router. The highest level of problem specification is the VHDL syntax which is still not easy to write in. Research work on higher level abstraction such as Boyer-Moore theorem prover is currently ongoing but it still cannot interface directly to the final device level. The most realistic expectation is just a high level computer language such as C language, where users can specify their problems which can be converted to a form suitable for the devices.
Compared to coarser grained parallelism such as distributed processors, optimization and realization is comparatively easy. After all bit-level parallel is just a realization in the form of truth tables. If we can describe our problems into truth tables, we have achieved or aim. Any student of electrical engineering can form a truth table when confronted with a problem. A truth table can solve any problem including sequential logical problems which emulates the random nature any problem. Optimizing the truth table can easily be done by using logic minimization packages such as ESPRESSO[1]. Its efficiency may be lower than other forms of parallelism such as vector processors but there is little danger such as hazards which may reduce performance. The relative inefficiency can be compensated by customization such as being done by SPLASH1. Because we are working at bit levels Information Theory can be used to analyze the hardware requirements for a particular problem.
IDEAL CASE OF FULL EXPLOITATION
When we say exploitation, we are visualising a scenario where a user can just enter his problem in a high level language,the system will convert it into the truth-table form and finally to the most straight-forward hardware implementation, PLA(programmable logic array) if we have enough hardware to accommodate the truth table of any size(Figure 1).
How much hardware do we need? Depends on the problem. For an 8-bit multiplier, giving 8-bit output, for a ROM(read only implementation), you would need 2^(8+8)X8=512Kbit memory elements i.e. 12 column truth table where there is no redundancy. With redundancy we can reduce the number of storage elements if implemented as PLA structures. The number of storage elements is equal to the number of entries of the output portion of the truth table(lookup table), i.e. 2^(no of inputs) X (no of outputs). The amount of predictive hardware(memory is the knowledge which is used to do the prediction) required to decode the multiplier is 512 Kbits, equivalent to the information content of the multiplication process. In the PLA case, the kind of logic gates(which operate at bit levels), and the quantity can be deduced from the number of minterms. However PLA is not a true truth table because it requires two gate stages for operations.
Imagine if you need to do a 32 X 32 matrix multiplication of 32-bit numbers. Each 32-bit multiplier requires 1.4E11. You need 32 X 32 multipliers. Total 1.4E14. The amount may be reduced by using logic minimization and PLA structures but by how many orders of magnitude can we reduce it? Maximising utilisation is not a big problem because all circuit elements are similar but we may depart from the ideal case of full parallelization, assuming we can get the perfect truth table implementer. Actually this is impossible because of limitations of gate driving capability. Mead and Conway[2] calculated that the optimum next load is 2.7 times that of the basic gate. The optimum fan-out is 2.7.
PROBLEMS OF BIT-LEVEL PARALLELISM
In the real world, we face limitations of amount of hardware and speed trade-off as well as basic problems such as driver loading problems. To drive a very large truth table made up of PLA would slow down its cycle time tremendously compared to the simplest logic gate. We may have a nice parallel representation of our problem but we cannot implement them in just one cycle of minimum logic gate delay. Winograd theorem actually predicts the performances of various circuits in fan-in limited circuitry. Some of his results are very interesting and not so apparent, but this bit-level(gate level) analysis shows its advances compared to other coarser grained analysis[3].
There are various problems which are devised to reduce and optimise bit-level parallelism. In fact this problem is old and had confronted even the early logic designers, except that they do not use the term bit-level parallelism. Early logic designers have limited number of gates but more fan-outs and higher minimum gate delays. However the terms serial and parallel adders are already well known but the problems they face is always the exploitation of their techniques, so when microprocessors were introduced, very few users need to resort to lower lever logic designs. The main problem had never been in techniques because the techniques are relatively easy to do but very tedious and consumes a lot of hardware. It is because they are not flexible(or intelligent) so finally they lost out to computers. The basic problem is just that of exploitation of these techniques. By using Information Theory, they now have a more reliable and robust theory to explain their circuitry compared to the coarser grained parallelism. By understanding what intelligence really is[4], we can incorporate it into logic gate designs.
In terms of research papers produced, coarse grained parallelism far exceed that of bit-level parallelism. For example the ESPRIT program is basically data-flow parallelism which is yet to be utilised and commercialised. The problems are both theoretical and exploitations. Users find it hard to understand the theories and also to exploit the parallelism using current tools. So much money is poured into these course grained parallelism all over the world that little is left to bit-level parallelism. Most work on exploiting bit-level parallelism is in VLSI design tools but they are not tailored to bit-level parallelism.
Another major area of research is Neural Network, which are actually bit-level(or can be converted to this form) parallel architectures because once the Neural Network is trained, it behaves like a truth table. It should be possible to design programmable truth tables that can emulate the functions of neural networks. The difficulty is in understanding the basic principles of Intelligence and Cognition. I have devised a method of measuring Intelligence which just equate Intelligence to random behaviour but its major contribution is just to demystify Intelligence and revert back to the simplicity of the Scientific Method. Current efforts of trying to emulate a human brain is similar to trying to fly like a bird, and the participants can be an acknowledged genius, such as Leonardo Da Vinci.
USING DSP TO EXPLOIT BIT-LEVEL PARALLELISM
Special purpose signal processors had been in existence long before DSP(Digital Signal Processors) but when DSP such as TMS32010 came out, it became an instant success. A DSP is basically a highly bit-level parallel multiplier connected to bare minimum computer elements. It is by far the most successful exploitation of bit-level parallelism, far exceeding that of specialised signal processors because it allows intelligence to be utilised. Intelligence is just flexibility and random behaviour so the conditional jumps and removable EPROM, provide the intelligent behaviours.
DSPs are designed only to use parallel multipliers but with the advent of higher level of device integration, more complicated operations may be supportable. One very important operation is the convolution. Even the string search used in the SPLASH1 benchmark is actually a class of convolution operation. For proper support for parallel convolution, the input data may be multi-dimensional. This varying data sizes lead to a scheme of getting away with data type tagging to a microprocessor instruction.
SCOP(scalable compressed opcode processor) is designed with flag-bit loadable data type selections. The data and instruction widths are separately scalable to cater for various levels of bit-level parallelism which may be available. 2-size instruction formats are supported and multiple register stacks(replacing internal data memory) are used to reduce the amount of instruction bits.
The objective of SCOP is to reduce system cost through the reduction in program memory and cache, using the parallelism which is inherent in stack operation through auto-incremented addressing modes, and support for zero-overhead loops. It serves as the intelligence interface to the powerful ALUs. It is designed to be the processor core for PGA or other methods of VLSI designs.
Similar intelligence can be incorporated into full truth-table implementation using partitioning techniques but may not be efficient because it is hard to identify the mapping from high-level language to gates when the structure is collapsed into a truth table, however in the end this method will override the processor techniques which are currently used, when optimization technologies become more mature.
CONCLUSION
Bit-level parallelism is relatively simple to design compared to coarser grained parallelism but was not widely used because of problems of exploiting the techniques using existing technology and tools. DSP, SCOP and PGA are attempts at simplifying the use of the bit-level design but the full exploitation can only come through incorporating intelligence(programmability using high level languages). However they are only an intermediate solution compared to the full gate/bit level design implementations.
REFERENCES
[1] R.K. Brayton,G.D. Hachtel,C.T. McMullen, A.L.Sangiovanni-Vincentelli, 'Logic Minimization Algorithms for VLSI Synthesis' Kluwer Academic Publishers (1984)
[2] C. Mead,L. Conway, 'Introduction to VLSI Systems' Addison-Wesley Publishing Company, Inc. (1980)
[3] S. Waser, M.J. Flynn, 'Introduction to Arithmetic for Digital System Designers' Holt, Rinehart and Winston
[4] Othman Ahmad, 'Quantitative Measure of Intelligence' ICCS/ISITA '92,16-20 November,1992, Singapore (to be published)