基于FPGA多通道采樣系統(tǒng)設(shè)計(jì)資料
基于FPGA多通道采樣系統(tǒng)設(shè)計(jì)資料,基于,fpga,通道,采樣系統(tǒng),設(shè)計(jì),資料
High Level Design For High Speed FPGA Devices
Man. Ng mcn99
Department of Computing
Imperial College
June 13, 2002
Acknowledgement
Before starting the report, I would like to thank the following people for helping me throughout the project. Without their help, it would be impossible for me to finish the project:
I would like to thank my supervisor Dr. Wayne Luk for giving me a lot of useful advices and encouragement throughout the project. He also guided me towards the problems I should focus on during the implementation. I would like to thank Professor Yang for letting me to implement his gel-image processing algorithm on hardware. He also gave me references and example sources to understand the theories underneath. And I would like to thank for his teaching in his excellent multimedia course. The course conveyed many useful concepts for me to understand the gel image processing I would like to thank Altaf and Shay. They are two Ph.D research students who helped me a lot throughout the implementation of the application.
Abstract
In the project, I have discovered a systematic approach for high-level hardware design. With this approach, I successfully implemented the sophisticated gel image processing on high speed hardware. In the report, I will also introduced a new technique which can automate the process of high level hardware performance optimization by rearranging the code sequence so that the it can be run at minimum number of clock cycles. The report will be split into 4 Chapters:
Chapter 1 is Introduction. It includes the background, all the related works and my contribution to the project.
Chapter 2 is Optimization. In this chapter, I will focus on the techniques for optimization. I will also demonstrate some techniques which can automate the optimization process.
Chapter 3 is Hardware Development. In this chapter, I will generalize the steps of converting a software programme into hardware. These include several techniques which can improve the performance or save the hardware resources.
Chapter 4 is Case Study : Gel Image Processing. In this chapter, I will use gel image processing as an example to show the effect on resource and performance of the techniques discussed in chapter 2. In this chapter, I will also compare the performance of the application between two devices and the software version: Pilchard and RC1000.
Chapter 5 is Conclusion. It includes the assessed achievements and expected future works.
There is also an online version available for this report, the URL is:
http://www.doc.ic.ac.uk/~mcn99/project/report.pdf
Chapter 1
Introduction
Since the emergent of Handel-C [5], a C-like hardware language, a complete high level FPGA design approach is realized. However, most of developers will stick on the lower-level language such as VHDL when they are aiming to design high performance hardware. It is because developers have greater control on the actual circuit implementation in low-level approach. But low-level design probably will reach its limit when FPGA chips grow bigger and bigger. Developers will not be able to develop new application quick enough with low level design which consists of billions of gates. A high-level approach will then be the answer. The purpose of this project is to introduce a systematic way of developing high performance hardware under high-level approach.
1.1 Background and Related Works
In this section, I am going to present the materials that are necessary to understand the content of this report.
1.1.1 Field Programmable Gate Arrays(FPGAs) [1]
Like Programmable Logic Devices(PLDs), FPGA is a piece of hardware which is programmable. However, while the size of PLDs is limited by power consumption and time delay, FPGA can easily implement designs with million of gates on a single IC. The re-programmable nature of FPGA allows developers implements design with shorter development times and lower cost than an equivalent custom VLSI chips. It worths mentioning that development of FPGA is faster than Moore’s Law with capacity doubling every year. With millions of gates available on the newest chip, FPGA is an ideal platform to develop reconfigurable system which is capable of execute complicate application at performance. Therefore, FPGA is the chip I am developing application for.
1.1.2 Pilchard [2]
Pilchard is a reconfigurable computing platform employing a field programmable gate array(FPGA) which plugs into a standared personal computer’s 133MHz synchronous dynamic RAM Dual In-line Memory Modules(DIMMS)slot. Comparing to traditional FPGA devices which are utilizing the PCI nterface, Pilchard allows data to be transferred to and from the host computer in much shorter time, due to the higher bandwidth as well as the lower latency of the DIMM interface. However, as DIMMS is not originally designed for Input/Output(I/O), extra control signals will be needed for Pilchard to indicate the start and the end of data processing. As a result, high level behavioral design approach is preferable to low level structural design approach for developing applications for Pilchard. That’s proves why it is vital to have a systematic way of high level development for high performance FPGA.
1.1.3 RC1000 [3]
RC1000 is a 32-bit PCI card designed for reconfigurable computing applications. It has full board support package in Handel-C with libraries which ease the circuit design for this device. It also features 4 SRAM banks(2Mbytes each) on the board which can be accessed by the FPGA or host CPU. The board can be configured to be run between 4000KHz to 100MHz. This device is very different from Pilchard in many aspects. In the report, I will show that the development steps introduced in this project is general and can be applicable to application development on different devices.
1.1.4 VHDL [4]
VHDL is one of the first high-level languages emerged in the market for designing applications with programmable logic devices. VHDL provides high-level language constructs that enable designers to describe large circuits and bring products to market rapidly. It supports the creation of design libraries in which to store components for reuse in subsequent designs.Because it is a standard language (IEEE standard 1076), VHDL provides portability of code between synthesis and simulation tools, as well as device-independent design. It also facilitates converting a design from a programmable logic to an ASIC implementation. The disadvantage of this language is it is not completely high level, the language still expects user to know the hardware behaviors of the components. Therefore, I decided to use another even higher level hardware language, i.e. Handel-C.
1.1.5 Handel-C [5]
Handel-C is a high level C-like programming language designed for compiling program into hardware images of FPGAs or ASICs. Handel-C provides some extra features which are not appeared in C to support few hardware optimizations. One of those is the language supports specifying the width of each signal so that just optimization can be achieved by targeting the exact resources needed by Handel-C compilers. Handel-C compilers target hardware directly by mapping the program into hardware at the netlist level in xnf or edif format. The advantage of Handel-C over VHDL is that it doesn’t expect users to know too much about the hardware in low level which VHDL does. It is a completely high level language! Figure 1.1 shows the design flows I will adopted in converting Handel-C program to hardware. Although several tools are involved in different steps, but users won’t need to worry about the hardware detail. Because what users need to do is just clicking several buttons to launch the program for converting the file into next step, it is as simply as that.
1.1.6 Extending the Handel-C language [7]
Dong U Lee, a Ph.D students, has invented a language which supports both hardware and software. His approach is to combine both C and Handel-C language. In the language, user can specify which part is done by software and which part is done by hardware. In the project, he also developed an more friendly interface for communication between the host and the FPGA device. However, the number of devices currently supported by this language is limited. That’s why I finally gave up on using this language.
1.2 Contribution
2 I have developed an easy but efficient optimization method which can rearrange code so that it can be run in minimum of cycles.
2 I have developed a systematic design flow for high level hardware design target for high speed devices
2 I have implemented the complicated 2D gel image processing on hardware
Chapter 2
Optimization
In this chapter, I am going to discuss various methods to optimization the high level code. Optimization is the main part which we try to exploit and utilize parallelism to achieve speed up which PC software normally is not able to do so because of the limited resources of CPU. The main focus of this chapter will be on how to automate these optimizations processes. We will also discuss some evaluation equation so to measure the speedup we can achieve after optimization.
2.1 Performance Optimization
This is exploiting the potential parallelism of the program and then run different non-conflicting operations at the same clock cycle to acquire speed up. In normal applications, tens of even hundreds of operations which run sequentially on PC’s CPU may be able to run in parallel. However, PC can’t run them in parallel because of the limited hardware resource. But by designing specific hardware to run as many operations as possible in parallel, significant speed up can obtain. This is the main reason why application on FPGA can sometimes run faster than the corresponding software version even though FPGA hardware run at much slower clock speed(of course we also need to take account of the CPI, but even then PC CPU can still do the same individual operation much faster). There are several techniques we could apply:
2.1.1 Balance The Delay Of Each Path
Balancing the delay of each pat is important because the hardware clock speed will at most be the same as the path with longest delay. Therefore, if the delay of one particular path is much later than the others, then it means we have wasted resource as other paths is capable of running at much higher speed. By balancing the delay, it can make sure that the the 5 parallel optimization will be optimal in later stage. The delay of a path can be defined as:
Tdelay = Tlogic + Trouting (2.1)
where Tdelay is the total delay of the path
Tlogic is the delay due to logic
Trouting is the delay due to routing
Therefore, reducing the delay is done by reducing one of the Tlogic or Trouting or both. There are 2 main steps to achieve this:
2 Break up complex operation into simple operations;
2 Use components with pre-defined placement and routing constraints
Breaking Up Complex Operation
First of all, the simplest step to do is to break the complicated operation into several simpler operations. This step effiectively reduce the logic in each operation thus reduce Tlogic in equation 2.1. In software program, the effect of complicated operation normally will run as quick or even quicker than the simpler operations with the same result as the compiler will optimize the instructions execution for us. However, for hardware, a complex operation mean it needs a longer clock to finish, while other simpler operations need not take that long to finish. Figure 2.1 shows an example of breaking up a complex operation into simple operations. In this example, we can see that sometimes extra registers are needed to store intermediate result of the calculation to make the operation simple enough.
Predefined Placement and Routing Component
If the result of the first method is not satisfactory enough or the timing constraints still isn’t met, we can then use a timing analyzer provided by the FPGA chip developer to find out the longest paths. For example, we can use Timing Analyzer coming along with Xilinx ISE Foundation 4 for timing analysis of Xilinx FPGA chips. After finding out the longest path, we will then know which operation run too slow. At this time, we could try 2 methods to increase the speed of this operation. The first is try to write a constraint file ourselves which specify explicitly the placement and routing of this component. This could enhance the timing as the tools which automatic do the placement and routing is normally not very smart.Then we will include this constraint file before the placement and routing process. However, this approach require the developer to have fair amount of knowledge on the FPGA chip they are using. This includes knowledge on the primitives component supported by the chip and the relative placement and routing of these primitives which can achieve the minimum delay. The second approach is easier, it is to use the macro of the predefined placement and routing components defined by the chip developer. Put it simpler, the chip developer has done the job for you, and we should just use it! For Xilinx, they have a program called Core Generator which does exactly what I mentioned. How to include these components in Handel-C program is specified in Handel-C menu. However, users must know that the timing of output and input of these components requires extra care. Because of the language limitation of Handel-C. The input signal will always arrive one cycle late into the component. This step will reduce Trouting of equation 2.1 because by nicely placing the logic blocks the routing of the signal will be much shorter thus the delay due to routing will significantly reduced.
Possibility of Automating This Process
Above, we have discussed 2 methods to achieve this step. This step is difficult to be automated, because it is difficult to define what is complex operations and what isn’t. This depends on the device and chip we use as well as the functionality of the program . For device which need a very restricted timing constraint. A 16 bits multiplication may be considered as complex while for some other device which cannot run at high speed, it may be a waste to break up the operations into very simple one which can run at very high speed as the device won’t be run at that speed anyway.However, we can borrow the idea from Lee again to make automation of this process possible. While compiling the source, we can include information about the device we are using as well as the timing constraint. The library of the device will include the delay of each logic unit. It will also include some information on how the FPGA development tools will route the signals. Then the compiler will be able to approximate the Tlogic and Trouting thus Tdelay of each path. The result will then be compared with the timing constraints specified. If violation is detected, the compiler will use the 2 methods mentioned above to balance the delay of each path until the constraint is met.
2.1.2 Basic Parallelism
This is the first step of the actual performance optimization process and is the easiest. The following rules can be applied to automate the process.Scan through the program sequentially. Group as many operations as possible into one clock cycle until there is violation of data dependency. Then repeat the process again for the next cycle. As we have discussed how to detect data dependency in earlier section, this process is possible to be done automatically. Figure 2.2 is a simple example on how to achieve this. We can see that without parallel execution, it takes 8 cycles to complete the 8 operations. However, with parallel execution, it takes only 2 cycles.
2.1.3 Re-arrange Code Sequence
Sometimes, a program can have high parallelism, but the execution sequence of the operations prevents the method just mentioned above to achieve that level. For example, the code in figure 2.3 will have the same effect on the above example shown in figure 2.2. But if we use the ”basic parallelism” method. We will need 4 cycles to finish the operations instead of 2.We can solve the problem by re-arranging code sequence of the code so that the code can run in least cycles as possible. For example, the code above can first change to the same sequence as in 2.2. Then we will apply ”basic parallelism” method again. For this process to be automated by compiler, the compiler will need to have fair amount knowledge and reasoning on the program. I have discovered a method to automate this process:
1. Firstly, choose a group of codes to start with, preferably in the innermost loop.
2. Create an empty table with variables names as index and labels as value. The label is at the formal var:n where var is the name of a variable and n is the number specifying the operation sequence.
3. Scan through the code sequentially. For each variable assignment(Either modification/initialization), assign a label to the operation following the rules listed below:
step 1 search the table to find out the label of the variable being assigned to.
step 2a if no entry is found, the variable is first being assigned. Add an entry in the table, the content(label) is specified as:
step 3ai if the variable is assigned a constant value or a signal from outside the block we are working with, specify the label as varname:1 where varname is the name of the variable being assigned.
step 3aii if the variable value depends on other variables, get the labels of these variables from the table. Assign the label of the variable same as the labels we got with the biggest order but with the order incremented by 1. Eg, for a = b + c, if label for b is d:3 and c is e:4, then label for a should be e:5.
step 2b if an entry is found, the variable has been assigned before.Get the label of that variable.
step 3b Update the label of the variable as specified in step 3ai and step 3aii but with a change that when finding the biggest order label, we should include label of itself in the comparison. For
example, if a = b + c and the labels of b, c are same as above but the label of a this time exists in the table with value a:5 then the new label will be a:6. Note that we can treat constant as having order of 0 when doing comparison.
step 4 associate the operation with the label we just specified.
4. After labelling all the operations, we will rearrange the operations such that operations with the same order are placed together.
5. Now ”basic parallelism” method will return code which can run in minimum cycles which is the same as the highest order of the labels within the block.
6. Then we can work with one outer loop, repeat from step two again until the whole program is covered.
The method works because what it does is to push all operations to execute at the earliest cycles. As the result, the modified code must be able to execute at minimum number of cycles. It is obvious that this method can be automated quite easily. By combining this method with the basic parallelism method mentioned before, a surprising efficient parallelisation tools is developed!
Figure 2.4 shows examples of how the method works.
2.1.4 Add Register To Store Intermediate Result
Just re-arranging the operations sometimes is still not enough. If we find that there are a lot of operations on the same variable while other variables just got few operations on them. Then we will come out with many operations in the first few cycles while there are just operations on that variable in the last few cycles. So, can we balance the operations in each cycles?
The possible solution is to add registers to store Intermediate Result of the variable and run the calculation of the Intermediate result concurrently to shorten the cycles taken. This technique doesn’t always work and is difficult to automated. Because this requires a lot of reasoning and logics. Figure 2.5 shows an example of this technique. We can see that bef
收藏