Software-based Ultrasound Phase Rotation Beamforming on Multi-core DSP

Jieming Ma¹, Kerem Karadayi¹, Murtaza Ali², and Yongmin Kim¹,³,⁴,⁵
Departments of ¹Electrical Engineering and ³Bioengineering, University of Washington, Seattle, WA 98195 USA
Departments of ⁴Electrical Engineering and ⁵Computer Science and Engineering, POSTECH, Pohang, Korea
²Texas Instruments, Dallas, TX 75266 USA

Abstract—Phase rotation beamforming (PRBF) is a commonly used digital receive beamforming technique. However, due to its high computational requirement, it has traditionally been supported by hardwired architectures (e.g., application-specific integrated circuits (ASICs) or, more recently, field-programmable gate arrays (FPGAs)). In this paper, we investigate the feasibility of supporting software-based PRBF on a multi-core DSP. To alleviate the high computing requirement, analog front-end (AFE) chips that integrate quadrature demodulation in addition to analog-to-digital conversion can be adopted. Under this condition, only delay alignment and phase rotation need to be performed by the DSP, substantially reducing the computational load. We implemented the delay alignment and phase rotation modules on a Texas Instruments C6678 DSP with 8 cores. With a sampling rate of 40 MHz and 2:1 decimation, it takes 200 μs to generate one scanline (2048 samples/scanline) on two cores. With 4 cores, it can support beamforming for 64 channels at 10k scanlines/s, e.g., 200 scanlines/frame at 50 frames/s. The remaining 4 cores can work on back-end processing tasks and applications, e.g., color Doppler or ultrasound elastography.

Keywords—ultrasound, beamforming, phase rotation, DSP, programmable, software-based

I. INTRODUCTION

Ultrasound imaging is an important diagnostic tool. It transmits high-frequency sound waves and generates images using echoes that are produced at tissue interfaces where changes in acoustic impedance occur. Compared to other cross-sectional imaging modalities, e.g., CT, MRI and PET, ultrasound has the advantages of being portable and low cost.

In modern ultrasound machines, phase rotation beamforming (PRBF) is a commonly used digital receive beamforming technique [1]. However, due to its high computational requirement, it has traditionally been supported by hardwired architectures (e.g., application-specific integrated circuits (ASICs) or, more recently, field-programmable gate arrays (FPGAs)). Programmable processors, e.g., digital signal processors (DSPs), have been used for back-end signal and image processing in clinical ultrasound systems for some time, but not for front-end beamforming.

However, with the trend towards using more software in ultrasound imaging systems, some attempts have been made to support software-based beamforming using DSPs. Sohn et al. presented a system using 16 ADSP-TS201 DSPs for beamforming [2]. Chen et al. proposed an implementation of digital beamforming on the Am2045 Massively Parallel Processor Array (MPPA) [3]. Other software-based approaches used simplified algorithms to reduce the computational burden. A pixel-based beamforming algorithm was introduced [4]. Also, we proposed a two-stage demodulation phase rotation beamforming method that significantly reduces the amount of computation [5], at the cost of some image quality degradation.

With the development of new analog front-end (AFE) chips and multi-core DSPs, it might be possible to support both full (not simplified) PRBF and back-end processing on the same multi-core DSP, which could lead to a lower-cost ultrasound machine that consumes much less power. The industry trend in new AFEs is towards more integration of both analog and digital functionalities. To meet the requirement for high-performance and small-sized ultrasound systems, an AFE that includes an analog-to-digital converter (ADC), quadrature demodulation and decimation is preferred. Such a highly integrated AFE could substantially reduce the computation required on the DSP. At the same time, with programmable processors going multi-core, more computational power is now available. In this paper, we explored the feasibility of supporting software-based PRBF on a multi-core DSP.

II. METHODS

A. Phase Rotation Beamforming

The basic concept of receive beamforming is to align the time-delayed echoes and sum them, as illustrated in Fig. 1. Ultrasound echoes arrive at different transducer elements at different times, determined by the distance between the target in the tissue and the position of each transducer element. As the distance from the target to the center element is shorter, the center element receives the echo sooner than the off-center elements. By applying appropriate delay times, the echoes can be aligned before the summation, which leads to higher SNR and better image quality.
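As a minimal sketch of the alignment step described above, the following Python snippet computes per-element delays for a single focal point from simple geometry. The function name, the 0.3-mm element pitch and the 20-mm target depth are illustrative assumptions, not values from this paper; only the 1540 m/s speed of sound is used elsewhere in the text.

```python
import math

SPEED_OF_SOUND = 1540.0  # m/s, the tissue value used elsewhere in this paper


def receive_delays(element_x, focus_x, focus_z, c=SPEED_OF_SOUND):
    """Return per-element delays (seconds) that align echoes from one target.

    element_x: lateral element positions in metres (elements lie on z = 0).
    The element farthest from the target gets zero delay; closer elements,
    which receive the echo sooner, are delayed by the difference.
    """
    flight = [math.hypot(x - focus_x, focus_z) / c for x in element_x]
    latest = max(flight)
    return [latest - t for t in flight]


# Five elements at a hypothetical 0.3 mm pitch, target 20 mm deep on axis.
elements = [(i - 2) * 0.3e-3 for i in range(5)]
delays = receive_delays(elements, focus_x=0.0, focus_z=20e-3)
```

In this example the center element receives the largest delay and the two outermost elements receive none, which mirrors the five-channel situation of Fig. 1.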

Typically, there are two ways to improve the time delay accuracy. One is by doing interpolation. In terms of implementation, interpolation-based beamforming requires low-pass filtering after up-sampling, which is computationally intensive. Another way is to perform coarse delay adjustment by sample shift followed by fine delay adjustment by phase rotation, i.e., phase rotation beamforming. Typically, PRBF consists of three main modules (quadrature demodulation, delay alignment and phase rotation). Filtering in demodulation is computationally expensive since it is applied to each one of the pre-beamformed channels [5]. To alleviate this high computing requirement, AFE chips integrating quadrature demodulation can be used. By utilizing such AFE chips, only delay alignment and phase rotation need to be performed by the DSP, reducing the computational load substantially.
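The combination of a coarse sample shift and a phase rotation can be illustrated with a small, idealized Python sketch. It assumes a 5-MHz carrier, a Gaussian echo envelope and four hypothetical channel delays, none of which come from this paper. Because demodulation precedes the delay here, the sketch computes the rotation angle from the full channel delay; that convention is an assumption of the example rather than a statement about our DSP code.

```python
import cmath
import math

F0 = 5e6    # carrier (demodulation) frequency in Hz, an assumed value
FS = 20e6   # complex baseband rate: 40-MHz sampling after 2:1 decimation


def baseband_echo(n, delay_s, centre=100.0, sigma=20.0):
    """Ideal demodulated echo: a Gaussian envelope delayed by delay_s that
    carries the phase term exp(-j*2*pi*F0*delay_s) left by demodulation."""
    t = n - delay_s * FS - centre
    envelope = math.exp(-0.5 * (t / sigma) ** 2)
    return envelope * cmath.exp(-2j * math.pi * F0 * delay_s)


def prbf_sample(channels, delays_s, n):
    """Output sample n: coarse sample shift, phase rotation, then summation."""
    total = 0j
    for data, tau in zip(channels, delays_s):
        shift = int(round(tau * FS))               # coarse delay, whole samples
        rot = cmath.exp(2j * math.pi * F0 * tau)   # phase rotation factor
        idx = n + shift
        if 0 <= idx < len(data):
            total += data[idx] * rot
    return total


delays = [0.0, 0.37e-6, 0.81e-6, 1.12e-6]          # hypothetical channel delays
channels = [[baseband_echo(n, tau) for n in range(256)] for tau in delays]

aligned = prbf_sample(channels, delays, 100)
# Same sample shift but without the phase rotation, for comparison.
shift_only = sum(ch[100 + int(round(tau * FS))] for ch, tau in zip(channels, delays))
```

With the rotation applied, the four channels add almost fully coherently, whereas the sample shift alone leaves phase errors that partly cancel the sum.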

**Figure 1. Simplified schematic diagram with 5 channels to illustrate the principle of receive beamformer.**

The flow diagram of our implementation of the PRBF algorithm is shown in Fig. 2. For our implementation in this paper, we assume that demodulation and decimation have been performed in the AFE chip. After digitization, demodulation removes the carrier frequency of the received ultrasound signals to extract the complex baseband data (i.e., in-phase (I) and quadrature (Q) components). Both I and Q data are passed on to the delay modules, which select appropriate samples for each channel based on the delay determined by the distance between the target and the transducer element. The delay module is followed by a phase rotator, which compensates for the phase shift in the I and Q signals to improve image quality. Then, summation across all the channels is performed to produce the beamformed I and Q signals for back-end processing.

**Figure 2. Block diagram of phase rotation beamformer assuming quadrature demodulation is performed in the AFE chip.**

B. Mapping of PRBF Algorithm on TMS320C6678 DSP

The C6678 is a new-generation DSP from Texas Instruments [6] with 8 cores running at a clock frequency of up to 1.25 GHz. Each core is a very-long-instruction-word (VLIW) processor with two data paths, each with four functional units (L, S, M and D). Each core has 32 kbytes of level-1 program (L1P) memory, 32 kbytes of level-1 data (L1D) memory, and 512 kbytes of local level-2 (L2) memory. L1P, L1D and L2 can be configured as cache and/or addressable SRAM. In addition, the C6678 DSP has 4 Mbytes of Multicore Shared Memory Controller (MSMC) SRAM shared by all 8 cores.

Each core has a programmable internal direct memory access (IDMA) controller that allows data transfer between L2 and L1 memory concurrent with CPU processing [7]. In addition, a separate programmable enhanced direct memory access (EDMA) controller transfers data across external memory, memory-mapped peripherals and on-chip memory.

The phase rotation beamforming algorithm described above was implemented on the C6678. To save computation, delay and phase rotation parameters were pre-computed and stored in look-up tables (LUTs). To reduce the LUT size and the cache misses incurred in loading it, all these parameters were combined into one LUT by using differential delay addressing, as shown in Fig. 3. The first entry in the LUT contains the absolute delay address for the first sample. Starting from the second entry, each entry combines the sine and cosine values for phase rotation with a differential delay address. The least significant bit of each 16-bit value is used for the differential delay address, and the remaining 15 bits are used for a phase rotation parameter. The two 16-bit values of an entry thus provide 2 bits for the differential delay address, so the differential address between adjacent samples in the axial direction is at most 3. Once the LUT is loaded, the sine and cosine values for phase rotation can be extracted, and the absolute delay address for each sample can be calculated.

**Figure 3. Structure of LUT by using differential delay addressing.**
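The entry layout can be sketched in Python as follows. This is an illustration under stated assumptions rather than our DSP code: the function names are hypothetical, and the choice of which 16-bit half holds the cosine and which half carries the upper bit of the differential delay address is made up for the example.

```python
def pack_entry(cos_q15, sin_q15, diff_delay):
    """Pack two signed 16-bit phase values and a 2-bit differential delay
    address into one 32-bit LUT word.

    The LSB of each 16-bit half is replaced by one bit of diff_delay, which
    leaves 15 significant bits per phase value. The assignment of halves and
    bits is an assumption of this sketch.
    """
    if not 0 <= diff_delay <= 3:
        raise ValueError("differential delay address must fit in 2 bits")
    hi = (cos_q15 & 0xFFFE) | ((diff_delay >> 1) & 1)
    lo = (sin_q15 & 0xFFFE) | (diff_delay & 1)
    return (hi << 16) | lo


def _signed16(value):
    """Interpret a 16-bit pattern as a two's-complement signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def unpack_entry(word):
    """Return (cos, sin, diff_delay) recovered from a packed 32-bit word."""
    hi, lo = (word >> 16) & 0xFFFF, word & 0xFFFF
    diff_delay = ((hi & 1) << 1) | (lo & 1)
    return _signed16(hi & 0xFFFE), _signed16(lo & 0xFFFE), diff_delay


def absolute_addresses(first_address, words):
    """Rebuild absolute delay addresses from the differential ones."""
    addresses = [first_address]
    for word in words:
        addresses.append(addresses[-1] + unpack_entry(word)[2])
    return addresses


words = [pack_entry(20000, -12346, d) for d in (1, 0, 2)]
addresses = absolute_addresses(100, words)
```

Storing only a small difference per entry is what allows the delay information to share a word with the phase rotation parameters, at the cost of one bit of precision in each of them.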

Figure 4 shows the tight loop of PRBF on the C6678, implemented using the C6678 instruction set [8]. The beamforming is performed in the axial direction. Two LUT entries for two consecutive samples in the axial direction are loaded, each containing a differential delay address and phase rotation parameters. The absolute delay address is calculated by adding the differential delay address to the absolute delay address of the previous sample. The first absolute delay address is contained in the LUT. With the delay address, the appropriate sample is loaded. Phase compensation is performed by using the CMPY instruction. Then, the partial sum for the current sample is loaded and updated.
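The sequence of operations in the loop can be sketched in plain Python for one channel. This is only a functional illustration: it processes one sample per iteration instead of two, uses floating-point complex numbers instead of packed 16-bit values, and takes the rotation of the first sample as a separate argument, all of which are assumptions of the sketch. The function name is hypothetical.

```python
def beamform_channel(iq, first_address, lut, partial_sum, first_rotation=1 + 0j):
    """Accumulate one channel into partial_sum along the axial direction.

    iq: complex baseband samples of one channel.
    lut: one (diff_delay, rotation) pair per output sample after the first,
         where rotation is a unit-magnitude complex number (cos + j*sin).
    """
    address = first_address                        # absolute address from the LUT
    partial_sum[0] += iq[address] * first_rotation
    for k, (diff_delay, rotation) in enumerate(lut, start=1):
        address += diff_delay                      # differential -> absolute address
        sample = iq[address]                       # load the delayed sample
        partial_sum[k] += sample * rotation        # complex multiply, then update sum
    return partial_sum


iq = [complex(i, 0) for i in range(10)]
lut = [(1, 1j), (2, -1 + 0j), (0, 1 + 0j)]
sums = beamform_channel(iq, first_address=2, lut=lut, partial_sum=[0j] * 4)
```

Calling the function once per channel with the same `partial_sum` list accumulates the beamformed output, which corresponds to the load-and-update step of the partial sum described above.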

C. Data Flow Optimization

To reduce the penalty caused by cache misses in loading the LUT, the double-buffering technique and IDMA were used to transfer LUT data while the CPU was computing with previous data [9]. Part of the L1D memory was configured as SRAM, with one half used as the ping buffer and the other half as the pong buffer. While the core processor is working on delay and phase compensation by using the data from the ping buffer, the IDMA controller brings the next LUT block from local L2 memory to the pong buffer. When both the computation and the data transfer are complete, the core processor switches to the pong buffer and the IDMA controller works on the ping buffer.
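The ping-pong scheme can be sketched in Python as below. The function name and the `compute` callback are hypothetical, and the copy into the idle buffer merely stands in for an IDMA transfer: in this sequential sketch the transfer and the computation do not actually overlap as they would on the DSP.

```python
def process_with_double_buffering(lut_blocks, compute):
    """Process LUT blocks using two alternating buffers (ping and pong).

    On the DSP, the copy into the idle buffer would be an IDMA transfer that
    runs concurrently with compute(); here the two steps run one after the
    other, so only the buffer bookkeeping is illustrated.
    """
    buffers = [None, None]
    results = []
    if not lut_blocks:
        return results
    buffers[0] = list(lut_blocks[0])                   # prime the ping buffer
    for i in range(len(lut_blocks)):
        active, idle = i % 2, (i + 1) % 2
        if i + 1 < len(lut_blocks):
            buffers[idle] = list(lut_blocks[i + 1])    # stand-in for the IDMA transfer
        results.append(compute(buffers[active]))       # CPU works on the active buffer
    return results


block_sums = process_with_double_buffering([[1, 2], [3, 4], [5]], sum)
```

The roles of the two buffers swap on every iteration, so the processor never waits for a block that was requested one iteration earlier, provided the transfer finishes within the compute time.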

TABLE I. IDEAL PERFORMANCE AND COMPILER OUTPUT

<table>
<thead>
<tr>
<th></th>
<th>Ideal Performance</th>
<th>Compiler Output</th>
</tr>
</thead>
<tbody>
<tr>
<td>Performance (cycles per IQ output)</td>
<td>1.75</td>
<td>1.75</td>
</tr>
</tbody>
</table>

TABLE II. CYCLE APPROXIMATE SIMULATOR PROFILING OUTPUT AND MULTI-CORE PERFORMANCE ASSUMING 1-GHZ CLOCK FREQUENCY

<table>
<thead>
<tr>
<th></th>
<th>Profiling Output</th>
<th>Multi-core Performance (including multi-core partitioning overhead)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Performance (per scanline with 2 cores)</td>
<td>174 µs</td>
<td>200 µs</td>
</tr>
</tbody>
</table>

**Figure 5. Example timing diagram with 4 cores, assuming one scanline is acquired every 100 µs and it takes 200 µs for each core to process 32-channel data.**

III. RESULTS

We developed the code in Code Composer Studio. Table I shows the ideal performance and the compiler output of our implementation. The compiler output is the same as the ideal performance of 1.75 cycles/IQ output. Based on this performance, assuming a sampling rate of 40 MHz, 2:1 decimation and a clock frequency of 1 GHz, it takes 229.4 µs to beamform one scanline (1.75 cycles × 2048 samples × 64 channels = 229,376 cycles).
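The 229.4-µs figure is consistent with 1.75 cycles spent on each of 2048 samples for each of 64 channels at 1 GHz. A short Python check of that arithmetic, with hypothetical names and with this per-channel, per-sample reading stated as an assumption, is:

```python
CYCLES_PER_IQ_OUTPUT = 1.75     # ideal and compiler-reported figure (Table I)
SAMPLES_PER_SCANLINE = 2048
CHANNELS = 64
CLOCK_HZ = 1e9                  # 1-GHz clock assumed in the text


def scanline_cycles(cycles_per_output=CYCLES_PER_IQ_OUTPUT,
                    samples=SAMPLES_PER_SCANLINE, channels=CHANNELS):
    """Tight-loop cycles for one scanline, one IQ output per channel-sample."""
    return cycles_per_output * samples * channels


def scanline_time_us(clock_hz=CLOCK_HZ):
    """Time in microseconds to run the tight loop for one scanline."""
    return scanline_cycles() / clock_hz * 1e6


cycles = scanline_cycles()
time_us = scanline_time_us()
```

This estimate covers the tight loop only and ignores memory stalls, which is why the profiled figures reported next differ from it.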

As the tight loop needs to load the demodulated IQ data (4 bytes/sample), the LUT (4 bytes/sample) and the partial sum values (8 bytes/sample), a large number of cache misses were expected, which would cause severe performance degradation. To reduce the number of cache misses, we configured L1D as 8 kbytes of data cache and 24 kbytes of SRAM. The 24-kbyte SRAM was divided into two parts: 16 kbytes were used to keep the partial sum array, while 8 kbytes were used for IDMA double buffering. In addition, for a system with a 4-byte LUT entry, 64 channels and 2048 samples/scanline, the size of the LUT is 512 kbytes. It is not feasible to keep the whole LUT in local L2 since the size of the local L2 memory is 512 kbytes, which is also needed for other purposes. To fit the LUT into local L2 so that it can be transferred independently by IDMA, we used each core to process the data for 32 channels (a 256-kbyte LUT per core). Therefore, 2 cores were needed to beamform one scanline for 64 channels and 2048 samples/scanline. With this configuration, the total CPU cycle count for the tight loop from the cycle approximate simulator is 174,417 cycles, corresponding to 174 µs to beamform one scanline with 2 cores at a clock frequency of 1 GHz, as shown in Table II.
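The memory budget behind this partitioning can be checked with a few lines of Python. The variable names are illustrative, and reading the 16-kbyte partial sum array as 2048 samples of 8 bytes each is our interpretation of the figures above.

```python
KBYTE = 1024


def lut_bytes(channels, samples_per_scanline=2048, bytes_per_entry=4):
    """Size of the combined delay and phase rotation LUT in bytes."""
    return channels * samples_per_scanline * bytes_per_entry


full_lut_kb = lut_bytes(64) // KBYTE         # whole 64-channel LUT
per_core_lut_kb = lut_bytes(32) // KBYTE     # LUT for one core handling 32 channels
partial_sum_kb = 2048 * 8 // KBYTE           # 8 bytes per partial-sum sample
l1d_total_kb = 8 + partial_sum_kb + 8        # cache + partial sums + ping/pong SRAM

fits_in_l2 = per_core_lut_kb < 512           # local L2 is 512 kbytes per core
```

Under these assumptions the values come out to 512, 256, 16 and 32 kbytes respectively, so the 32-channel LUT occupies half of the local L2 and the L1D split adds up to the full 32 kbytes.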

After one core finished its processing, the partial sum results for its half of the channels were kept in MSMC, to be loaded by the other core that was working on the second 32-channel data to produce the final results. Taking into account this partitioning overhead and the communication overhead among cores, it takes 200 µs to generate one scanline (2048 samples/scanline) with two cores, as shown in Table II. Therefore, with 4 cores, it can support beamforming of 64 channels at 10k scanlines/s, e.g., 200 scanlines/frame at 50 frames/s. The remaining 4 cores can work on back-end processing tasks and applications, e.g., color Doppler or ultrasound elastography.
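The throughput figures follow from treating each pair of cores as an independent unit that completes one scanline every 200 µs. A small Python sketch of this estimate, with a hypothetical function name, is:

```python
def scanlines_per_second(cores, cores_per_scanline=2, scanline_time_s=200e-6):
    """Scanline rate when every group of cores works on its own scanline."""
    groups = cores // cores_per_scanline
    return groups / scanline_time_s


rate = scanlines_per_second(4)        # two pairs of cores
frame_rate = rate / 200               # 200 scanlines per frame
```

Four cores form two such pairs, which gives 10k scanlines/s, or 50 frames/s at 200 scanlines/frame.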

Figure 5 gives an example timing diagram with 4 cores, assuming one scanline is acquired every 100 µs. When the first 32-channel data are available in MSMC, core 0 starts processing the first 32 channels for scanline 0. Then, 50 µs later, core 1 starts beamforming the second 32-channel data for scanline 0. Scanline 0 is generated at 300 µs. After that, one scanline is completed every 100 µs. Figure 6 shows the simplified high-level architecture of the PRBF system.

IV. DISCUSSION AND CONCLUSION

In this paper, we investigated the feasibility of using one multi-core DSP to support PRBF. Our results show that, when quadrature demodulation is performed in the AFE chips and the AFE output data can be fed to the C6678, beamforming for 64 channels at 10k scanlines/s and 2048 samples/scanline can be supported on the C6678 with 4 cores, assuming a clock frequency of 1 GHz. If we increase the clock frequency to 1.25 GHz, 12.5k scanlines can be beamformed per second with 4 cores.

The beamforming performance enabled by this architecture can support various clinical applications, as listed in Table III. One example is neck imaging, where the carotid arteries are examined for plaques and flow abnormality. With the AFE chip performing digitization at 40 MHz, demodulation and 2:1 decimation, there are 1299 samples per scanline assuming an imaging depth of 5 cm (40 MHz × 10 cm (round-trip distance) / 1540 m/s (speed of sound in tissue) / 2 (2:1 decimation)). Given the round-trip time of sound in tissue, the maximum number of scanlines that can be acquired is 15.4k per second (1540 m/s / 10 cm), which can be supported by our implementation since we can beamform up to 15.7k scanlines/s with 4 cores (10k scanlines/s × 2048 samples/scanline / 1299 samples/scanline). If 308 scanlines are acquired per frame, a frame rate of 50 Hz can be achieved, which is typically sufficient for imaging the carotid artery.
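The arithmetic in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the assumption that beamforming capacity scales inversely with the number of samples per scanline is the same scaling used in the text above.

```python
SPEED_OF_SOUND = 1540.0   # m/s in tissue


def round_trip_s(depth_m, c=SPEED_OF_SOUND):
    """Time for sound to travel to the given depth and back."""
    return 2.0 * depth_m / c


def samples_per_scanline(depth_m, adc_hz=40e6, decimation=2):
    """Baseband samples per scanline after digitization and decimation."""
    return round(round_trip_s(depth_m) * adc_hz / decimation)


def max_scanlines_per_s(depth_m):
    """Acquisition limit: one scanline per round trip."""
    return 1.0 / round_trip_s(depth_m)


neck_samples = samples_per_scanline(0.05)                  # 5-cm depth
acquisition_limit = max_scanlines_per_s(0.05)              # scanlines/s
beamformer_capacity = 10_000 * 2048 / neck_samples         # scanlines/s, 4 cores
frame_rate = acquisition_limit / 308                       # 308 scanlines/frame
```

The same functions give the other sample counts in Table III for depths of 12 cm and 20 cm, and they make it easy to check whether the beamformer capacity exceeds the acquisition limit at a given depth.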

As the C6678 DSP has 8 cores, there are 4 spare cores that can be used for additional processing. One possibility is to use the remaining 4 cores for back-end processing, so that one DSP can support both front-end beamforming and back-end processing. Typically, back-end processing, such as B-mode, color Doppler and spectral Doppler, needs only one or two cores. Thus, more complicated algorithms, e.g., ultrasound elastography or speed of sound correction, might be supported on the 4 cores. Also, an 8-core DSP could be used for supporting 128 channels with a beamforming throughput of 10k scanlines/s, or 64-channel full-rate dual receive beamforming with a throughput of 20k scanlines/s. With an increased number of cores in the future, the number of scanlines that can be supported per second could increase further, and/or more sophisticated front-end and back-end algorithms could be supported in real time.

In addition, this software-based approach could allow reusability and scalability in ultrasound systems. Based on the performance we obtained, we estimate that PRBF for 64 channels at 10k scanlines/s and 1024 samples/scanline (40-MHz digitization with 4:1 decimation) can be supported on the C6678 with 2 cores. Therefore, one 4-core DSP might be able to support both front-end beamforming and back-end processing, which would reduce cost and power consumption.

These results were obtained by using the cycle approximate simulator since we did not have access to the hardware for this study. In addition, the overhead for streaming the demodulated IQ data via HyperLink into the MSMC SRAM of the C6678 has not been included.

In this paper, we demonstrated the feasibility of using one multi-core DSP to support both front-end and back-end processing. This software-based architecture could be advantageous in achieving shorter time-to-market (e.g., compared to designing the beamformer ASIC chip and making it work), flexibility (different algorithms can be supported) and scalability. This unified architecture could lead to low-power and low-cost ultrasound machines, benefiting ultrasound imaging in general, but particularly portable ultrasound machines.

TABLE III. EXAMPLE CLINICAL SCENARIOS

<table>
<thead>
<tr>
<th>Application</th>
<th>Depth (cm)</th>
<th>Typical Frame Rate (frames/s)</th>
<th>Number of Samples/scanline (40-MHz ADC, 2:1 decimation)</th>
<th>Transducer Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Neck</td>
<td>5</td>
<td></td>
<td>1,299</td>
<td>Linear</td>
</tr>
<tr>
<td>Kidney</td>
<td>12</td>
<td></td>
<td>3,117</td>
<td>Convex</td>
</tr>
<tr>
<td>Liver/Aorta/OB</td>
<td>20</td>
<td></td>
<td>5,195</td>
<td>Convex</td>
</tr>
</tbody>
</table>

REFERENCES

Copyright © 2011, Texas Instruments Incorporated