New hardware-accelerated AI model speeds up real-time breast cancer diagnosis
Timely and precise breast cancer diagnosis remains one of the most pressing priorities in modern healthcare. With millions of women undergoing routine screenings each year, the demand for diagnostic tools that are not only fast and reliable but also energy-efficient and scalable has never been greater. A team of researchers from the University of Monastir in Tunisia has proposed a hardware-accelerated artificial intelligence (AI) system designed to significantly boost speed, accuracy, and energy efficiency in ultrasound-based tumor detection.
The research focuses on the deployment of convolutional neural networks (CNNs) on field-programmable gate arrays (FPGAs), specifically the Zynq XC7Z020 SoC, to enable real-time, low-power inference in resource-constrained settings. By implementing core layers such as Conv2D, ReLU, and Average Pooling as custom Intellectual Property (IP) blocks and integrating them into a hybrid software-hardware architecture using the PYNQ-Z2 platform, the team achieved a 16.3% improvement in execution speed and a 63.15% reduction in power consumption compared to CPU-only systems, all while maintaining competitive classification accuracy.
Their study, titled "FPGA Hardware Acceleration of AI Models for Real-Time Breast Cancer Classification," was published in the journal AI.
What makes this FPGA-based AI system uniquely suited for breast cancer diagnostics?
Unlike traditional CPU or GPU deployments of AI models, which require high computational resources and often suffer from latency issues, the proposed framework leverages the parallel processing capabilities of FPGAs for real-time diagnostics. The researchers developed a modular architecture consisting of core CNN components synthesized using Vivado High-Level Synthesis (HLS), allowing for high-speed, low-latency computation optimized for medical imaging workflows.
The design begins with ultrasound image classification using a CNN trained to distinguish between malignant, benign, and normal breast tissue. The dataset, sourced from Fattouma Bourguiba University Hospital in Tunisia, includes 1,380 well-balanced ultrasound images. Preprocessing techniques such as normalization, rotation, flipping, and elastic deformation were applied to enhance data quality and model generalization. The CNN training process employed a standard 80:20 train-test split and achieved an accuracy of 94.11% on the ARM Cortex-A9 processor.
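The 80:20 split described above works out cleanly for the 1,380-image dataset. A minimal sketch of such a partition (the shuffling seed and integer image IDs are illustrative, not from the paper):

```python
import random

def train_test_split(items, train_frac=0.80, seed=42):
    """Shuffle a list and split it into train/test partitions."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 1,380 ultrasound images, as in the study's dataset
images = list(range(1380))
train, test = train_test_split(images)
print(len(train), len(test))  # 1104 train, 276 test
```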
To enable hardware acceleration, the team mapped the most computationally intensive layers to FPGA logic. The Conv2D IP core processed input feature maps of 256×256 with 1 channel to produce 32 output channels, using loop unrolling and pipelining to increase throughput. The Average Pooling and ReLU layers were also customized and optimized with unrolling factors of 16 and 8, respectively, to minimize latency and power consumption.
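The loop nest that such a Conv2D core implements can be sketched in plain Python. This is a naive functional reference at toy scale (a 5×5 single-channel input and two 2×2 kernels standing in for the 256×256×1 → 32-channel layer); the unrolling and pipelining directives are a hardware concern and are not reproduced here:

```python
def conv2d(img, kernels, bias):
    """Naive valid-padding 2-D convolution: 1 input channel -> C output channels."""
    H, W = len(img), len(img[0])
    C, K = len(kernels), len(kernels[0])
    out = []
    for c in range(C):
        plane = []
        for i in range(H - K + 1):
            row = []
            for j in range(W - K + 1):
                acc = bias[c]
                for u in range(K):        # these innermost loops are the ones
                    for v in range(K):    # unrolled/pipelined in the HLS core
                        acc += img[i + u][j + v] * kernels[c][u][v]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

def relu(planes):
    """Element-wise rectification, as in the ReLU IP core."""
    return [[[max(0.0, x) for x in row] for row in p] for p in planes]

def avg_pool2(planes):
    """2x2 average pooling with stride 2, as in the pooling IP core."""
    out = []
    for p in planes:
        pooled = [
            [(p[i][j] + p[i][j + 1] + p[i + 1][j] + p[i + 1][j + 1]) / 4.0
             for j in range(0, len(p[0]) - 1, 2)]
            for i in range(0, len(p) - 1, 2)
        ]
        out.append(pooled)
    return out

# Tiny stand-in for the 256x256x1 -> 32-channel pipeline in the paper
img = [[1.0, 2.0, 3.0, 4.0, 5.0],
       [5.0, 4.0, 3.0, 2.0, 1.0],
       [1.0, 2.0, 3.0, 4.0, 5.0],
       [5.0, 4.0, 3.0, 2.0, 1.0],
       [1.0, 2.0, 3.0, 4.0, 5.0]]
kernels = [[[1.0, 0.0], [0.0, -1.0]], [[0.25, 0.25], [0.25, 0.25]]]
bias = [0.0, 0.0]
fmap = avg_pool2(relu(conv2d(img, kernels, bias)))  # 2 channels of 2x2 maps
```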
The co-design architecture allows the ARM processor to handle high-level tasks like memory loading and control flow, while the FPGA executes CNN layer computations using hardware overlays. This hybrid approach maximizes efficiency and supports real-time inference suitable for embedded healthcare devices.
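The division of labor in this co-design can be summarized in a short sketch. The accelerator functions below are pure-Python stand-ins, not the paper's IP interfaces; on the PYNQ-Z2 these calls would go through a bitstream overlay, with the ARM core handling only data movement and control flow:

```python
# Simplified sketch of the software/hardware split described above.
# fpga_conv2d_stub and fpga_relu_stub are hypothetical stand-ins for the
# FPGA IP cores; real deployments would invoke them via a hardware overlay.

def fpga_conv2d_stub(frame):        # stand-in for the Conv2D IP core
    return [[x * 2 for x in row] for row in frame]

def fpga_relu_stub(frame):          # stand-in for the ReLU IP core
    return [[max(0, x) for x in row] for row in frame]

def host_inference(frame):
    """ARM-side control flow: load data, dispatch layers, collect the result."""
    fmap = fpga_conv2d_stub(frame)  # offloaded to programmable logic
    fmap = fpga_relu_stub(fmap)     # offloaded to programmable logic
    # the classification head and post-processing would follow on the CPU
    return fmap

result = host_inference([[-1, 2], [3, -4]])
```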
How does the performance compare to traditional machine learning models and platforms?
In head-to-head tests, the proposed CNN outperformed conventional machine learning algorithms such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGB) on the same dataset. While these methods delivered accuracies ranging from 89.2% to 94.1%, the FPGA-accelerated CNN achieved comparable or superior accuracy with significantly better energy efficiency and latency. For example, the proposed model ran in just 0.821 seconds on the FPGA, compared to 6.44 seconds for a previously published lightweight CNN on the same platform.
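The reported figures can be sanity-checked with simple arithmetic (all numbers taken from the article; the power values appear in the following paragraph):

```python
# Reported figures from the study
fpga_runtime_s = 0.821   # proposed model on the FPGA
prior_runtime_s = 6.44   # previously published lightweight CNN, same platform
fpga_power_w = 1.4       # FPGA implementation
cpu_power_w = 3.8        # ARM CPU implementation

speedup = prior_runtime_s / fpga_runtime_s
power_reduction_pct = (cpu_power_w - fpga_power_w) / cpu_power_w * 100

print(f"speedup vs prior CNN: {speedup:.2f}x")              # ~7.84x
print(f"power reduction: {power_reduction_pct:.2f}%")       # ~63.16%, consistent
                                                            # with the reported 63.15%
```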
Power consumption is another area where the system excels. The FPGA implementation consumed only 1.4 W compared to 3.8 W on the ARM CPU, which is critical for mobile and edge deployments in medical diagnostics. The study emphasizes that this energy saving does not come at the cost of reliability: the hardware-accelerated platform maintained an accuracy of 89.87%, a modest drop from the 94.11% achieved in software that the authors attribute to the 8-bit fixed-point arithmetic used for hardware efficiency.
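Fixed-point arithmetic of this kind trades a little precision for much cheaper hardware. A minimal illustration of 8-bit signed fixed-point quantization (the Q4.4 format here, 4 integer and 4 fractional bits, is chosen for illustration; the paper's exact format is not specified in this article):

```python
def to_fixed(x, frac_bits=4, total_bits=8):
    """Quantize a float to signed fixed-point (Q4.4 by default)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = round(x * scale)
    return max(lo, min(hi, q))         # saturate to the 8-bit range

def to_float(q, frac_bits=4):
    """Recover the (rounded) real value a fixed-point integer represents."""
    return q / (1 << frac_bits)

w = 0.37                # a floating-point weight
q = to_fixed(w)         # the integer the hardware actually stores
print(q, to_float(q))   # 6 -> 0.375: a small, systematic rounding error
```

Accumulated across millions of multiply-accumulates, rounding errors of this kind are the plausible source of the accuracy gap between the software and hardware versions.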
Performance benchmarks show that the proposed Conv2D IP core completes its computation in nearly 35 million clock cycles at a frequency of 120.17 MHz, demonstrating how FPGA acceleration can match, and in some cases exceed, the capability of more power-hungry platforms. Meanwhile, the ReLU and pooling cores demonstrated ultra-low latencies, making them highly suitable for real-time embedded AI.
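Converting the reported cycle count to wall-clock latency is straightforward arithmetic ("nearly 35 million" is taken here as exactly 35 × 10⁶, so the result is approximate):

```python
cycles = 35_000_000      # "nearly 35 million cycles" for the Conv2D IP core
freq_hz = 120.17e6       # reported clock frequency, 120.17 MHz

latency_s = cycles / freq_hz
print(f"Conv2D core latency: {latency_s:.3f} s")  # ~0.291 s per pass
```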
The framework was tested against existing implementations such as CNNs on ZCU104 and RNNs on generic FPGAs, all of which consumed significantly more power and exhibited slower response times. The new implementation offers a balanced trade-off, ensuring high throughput, low energy usage, and respectable accuracy.
Can FPGA-accelerated AI redefine real-time medical imaging for broader healthcare access?
The implications of this research extend far beyond performance metrics. By demonstrating that FPGA-accelerated CNNs can operate efficiently on compact and energy-constrained platforms like the PYNQ-Z2, the study lays the groundwork for broader deployment of AI diagnostics in rural clinics, mobile units, and other low-resource healthcare environments.
In traditional healthcare setups, real-time AI-based imaging often depends on cloud connectivity or expensive GPU arrays. This limits accessibility and introduces latency that can compromise time-sensitive diagnostics. With FPGAs, however, inference can be performed locally on edge devices without compromising on speed or diagnostic reliability. This decentralization of AI brings advanced diagnostics closer to underserved populations, helping close the health equity gap.
Furthermore, the modular IP design approach ensures that the system can be scaled or modified for other types of medical imaging, such as lung X-rays or liver scans, without requiring a complete redesign. By adjusting layer configurations and retraining on new datasets, the same platform can serve multiple diagnostic functions.
Future work could expand the FPGA implementation across the full CNN model rather than just the initial layers, potentially improving accuracy while preserving the performance gains. Researchers aim to explore higher bit-width configurations, such as 16-bit and 32-bit fixed-point formats, to balance numerical precision with computational efficiency.
The work validates a critical paradigm shift in medical AI deployment: moving from cloud-heavy, GPU-dependent systems toward lightweight, hardware-efficient designs that are faster, more secure, and more sustainable. This is particularly vital in a world increasingly focused on energy efficiency and decentralized care.
- FIRST PUBLISHED IN: Devdiscourse

