Supercharged AI? Scientists make LLMs 'twice as fast' on CPUs

CO-EDP, VisionRI | Updated: 07-02-2025 19:53 IST | Created: 07-02-2025 19:53 IST

The growing demand for artificial intelligence (AI) has fueled a race to develop ever-more powerful models, but this progress comes at a cost. Large language models (LLMs) require immense computational power, expensive hardware, and substantial energy resources, making them inaccessible to many and environmentally unsustainable.

Researchers at Rice University, led by Anshumali Shrivastava, are tackling these issues head-on. Their recent studies, presented at the prestigious Neural Information Processing Systems (NeurIPS) conference in Vancouver, showcase innovative solutions to enhance the efficiency and accessibility of AI systems. The three papers introduced at the conference focus on memory optimization, computational efficiency, and reducing AI’s reliance on expensive hardware, thereby paving the way for more sustainable and broadly accessible AI technologies.

Optimizing AI memory with Sketch Structured Transforms

One of the core challenges with LLMs is their reliance on massive weight matrices, which function as their working memory. The Rice team introduced Sketch Structured Transforms (SS1), a technique that borrows parameter-sharing ideas from probabilistic sketching algorithms. By letting many weights draw on a smaller pool of shared parameters, the method significantly reduces the memory and computational requirements of AI models while preserving their expressivity and accuracy. The researchers demonstrated that SS1 improved processing times by over 11% in popular LLMs without requiring extensive fine-tuning. This advancement is a crucial step toward lighter, more efficient models, reducing the need for vast computational resources and making AI adaptable to a wider range of applications.
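
The article does not spell out the SS1 construction, but the general flavor of sketch-based parameter sharing can be shown with a short, assumed example: a linear layer that never stores its full weight matrix, because a cheap hash maps every weight position into a much smaller bank of shared parameters. The class, hash, and sizes below are purely illustrative, not the SS1 implementation.

```python
import numpy as np

# Illustrative only: a linear layer whose full weight matrix is never
# stored. A multiplicative hash maps each (row, col) position into a
# much smaller bank of shared parameters, so many weights reuse the
# same value. This mimics sketch-style parameter sharing in spirit;
# it is NOT the SS1 construction from the paper.
class HashedSharedLinear:
    def __init__(self, in_dim, out_dim, bank_size, seed=0):
        rng = np.random.default_rng(seed)
        self.bank = rng.normal(0.0, 0.02, size=bank_size)   # shared parameter bank
        flat = (np.arange(out_dim, dtype=np.int64)[:, None] * in_dim
                + np.arange(in_dim, dtype=np.int64)[None, :])
        self.index = (flat * 2654435761) % bank_size          # hash positions into the bank

    def __call__(self, x):
        W = self.bank[self.index]   # materialize an (out_dim x in_dim) view on the fly
        return x @ W.T

layer = HashedSharedLinear(in_dim=512, out_dim=512, bank_size=4096)
y = layer(np.random.randn(8, 512))   # 262,144 virtual weights backed by 4,096 parameters
print(y.shape)                        # (8, 512)
```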

Currently, LLMs and other foundation models rely on high-performance GPUs (graphics processing units), making them expensive to run and largely confining their deployment to major cloud data centers. Shrivastava's team introduced the NoMAD Attention algorithm, which redesigns the attention computation so that it runs efficiently on standard computer processors (CPUs). Rather than performing the usual multiply-add arithmetic, NoMAD Attention exploits fast lookup capabilities already built into CPUs to streamline these operations. The result effectively doubles AI processing speed on CPUs without compromising accuracy. The implications are significant: advanced AI systems could run directly on personal devices such as laptops and smartphones, making AI tools more accessible and reducing dependence on energy-intensive data centers.
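
For intuition, lookup-based attention scoring can be sketched as follows. This is a simplified, assumed design rather than the NoMAD Attention code: keys are compressed into small codebook indices ahead of time, and each incoming query builds a tiny table of dot products with the codebook entries, so scoring every key reduces to table lookups and additions instead of full multiply-add dot products.

```python
import numpy as np

# Simplified, assumed sketch of multiply-add-free attention scoring (not
# the NoMAD Attention code): keys are product-quantized into codebook
# indices once, offline. At query time a small table of query-centroid
# dot products is built per sub-space, and every key's attention score
# becomes table lookups plus additions. The real method keeps such
# tables in fast CPU (SIMD) registers.
d, n_sub, n_cent, seq = 64, 8, 16, 128
sub = d // n_sub
rng = np.random.default_rng(0)

codebooks = rng.normal(size=(n_sub, n_cent, sub))   # one centroid set per sub-space
keys = rng.normal(size=(seq, d))

# Encode each key sub-vector by its nearest centroid (done once per key).
codes = np.empty((seq, n_sub), dtype=np.int64)
for s in range(n_sub):
    diff = keys[:, s*sub:(s+1)*sub][:, None, :] - codebooks[s][None, :, :]
    codes[:, s] = np.argmin((diff ** 2).sum(-1), axis=1)

def scores_via_lookup(query):
    # Build the per-query lookup table: dot products with every centroid.
    table = np.stack([codebooks[s] @ query[s*sub:(s+1)*sub]
                      for s in range(n_sub)])        # shape (n_sub, n_cent)
    # Score all keys with lookups and additions only.
    return sum(table[s, codes[:, s]] for s in range(n_sub))

q = rng.normal(size=d)
approx = scores_via_lookup(q)
exact = keys @ q
print(np.corrcoef(approx, exact)[0, 1])  # approximate scores track the exact dot products
```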

Another significant limitation of LLMs is their growing context memory, known as the key-value (KV) cache, which expands as an AI interaction progresses and demands substantial high-speed memory. The Rice team tackled this issue with a coupled quantization method that compresses related pieces of cached memory together rather than individually. Their findings show that models can operate effectively using just one bit per stored value, the smallest possible unit, without degrading output quality. This makes AI models far more memory-efficient and reduces the computational strain of long-running interactions, making real-time AI applications more viable and sustainable.
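
The paper's exact scheme is not described in this article, but the "compress related pieces together" idea can be illustrated in a simplified, assumed form: groups of neighboring KV-cache channels are quantized jointly against a small shared codebook, so that, for example, four channels sharing a 16-entry codebook cost log2(16) / 4 = 1 bit per value. The grouping, codebook size, and calibration loop below are illustrative choices, not the paper's method.

```python
import numpy as np

# Assumed, simplified illustration of quantizing related values jointly:
# groups of 4 neighboring KV-cache channels share one 16-entry codebook,
# so storage cost is log2(16) / 4 = 1 bit per value.
rng = np.random.default_rng(0)
group, n_codes = 4, 16
kv = rng.normal(size=(1024, 64))              # toy cache: 1024 tokens x 64 channels
blocks = kv.reshape(-1, group)                # coupled groups of 4 values

# Learn the shared codebook with a few k-means steps (offline calibration).
codebook = blocks[rng.choice(len(blocks), n_codes, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(((blocks[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    for c in range(n_codes):
        if np.any(assign == c):
            codebook[c] = blocks[assign == c].mean(axis=0)

# Quantize: each 4-channel group is replaced by a single 4-bit code...
codes = np.argmin(((blocks[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
# ...and dequantized by looking the code back up in the codebook.
restored = codebook[codes].reshape(kv.shape)

bits_per_value = np.log2(n_codes) / group
rel_err = np.mean((kv - restored) ** 2) / np.mean(kv ** 2)
print(f"{bits_per_value:.1f} bit/value, relative reconstruction error {rel_err:.3f}")
```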

The future of AI: Democratization through efficiency

With this work, Shrivastava's team at Rice University envisions a future where AI is no longer restricted to major corporations with vast computing resources. By significantly reducing energy and computational demands, these advancements lower the barriers for businesses and organizations to develop their own specialized AI tools. However, as Shrivastava emphasizes, achieving this vision requires more than algorithmic innovations alone; the challenge of balancing AI's rapid expansion with sustainability remains critical.

“If we want AI to play a central role in solving major global issues - whether in healthcare, climate science, or other fields - we need to make it vastly more efficient,” Shrivastava asserts. His team’s research underscores that the next frontier in AI development will prioritize efficiency, ensuring that AI remains both powerful and accessible to a broader audience beyond tech giants. Supported by the National Science Foundation, the Ken Kennedy Institute, Adobe, and VMware, their work marks a significant step toward a future where AI is smarter, faster, and more environmentally sustainable.

  • FIRST PUBLISHED IN:
  • Devdiscourse