Apple Collaborates with NVIDIA to Research Faster LLM Performance

12/19/2024

In a blog post published today, Apple engineers shared new details on a collaboration with NVIDIA to speed up text generation with large language models.

 

Apple published and open-sourced its Recurrent Drafter (ReDrafter) technique earlier this year. It is a new method for generating text with LLMs that is significantly faster and “achieves state of the art performance.” It combines two techniques: beam search, which explores multiple candidate continuations, and dynamic tree attention, which handles those candidates efficiently.
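
ReDrafter is a speculative decoding method: a small, cheap draft model proposes several tokens ahead, and the full model verifies them, keeping the ones it agrees with. The toy Python sketch below illustrates only that generic draft-and-verify loop with greedy decoding. It is not Apple's implementation: ReDrafter's recurrent draft model, beam search, and dynamic tree attention are all omitted, and the `target` and `draft` functions are hypothetical stand-ins for real models.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def plain_greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Return (tokens, number of verification rounds by the target model)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model guesses k tokens ahead.
        guess: List[Token] = []
        for _ in range(k):
            guess.append(draft(out + guess))
        # 2. The target model checks the guesses. A real system scores all
        #    positions in one batched forward pass; this loop is for clarity.
        passes += 1
        accepted: List[Token] = []
        for g in guess:
            t = target(out + accepted)
            accepted.append(t)  # the target's own token is always kept
            if t != g:          # first mismatch: drop the rest of the draft
                break
        else:
            # Every guess matched, so the same pass also yields a bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes


# Hypothetical toy "models": the draft agrees with the target except after token 5.
def target(seq: List[Token]) -> Token:
    return (seq[-1] * 3 + 1) % 7


def draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 5 else target(seq)


tokens, passes = speculative_greedy(target, draft, prompt=[1], max_new=12)
assert tokens == plain_greedy(target, [1], 12)  # same text as plain greedy decoding
print(passes, "verification rounds for 12 new tokens")
```

In this sketch the generated tokens are identical to what the target model would produce on its own; the saving comes from needing fewer target-model rounds whenever the draft guesses correctly.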

 

While its research demonstrated strong results, Apple collaborated with NVIDIA to apply ReDrafter in production. As part of this collaboration, ReDrafter was integrated into NVIDIA TensorRT-LLM, a tool that helps run LLMs faster on NVIDIA GPUs.

 

Here are the results:

 

  • To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, which considerably improved TensorRT-LLM’s capability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.
  • In benchmarks of a production model with tens of billions of parameters on NVIDIA GPUs, running ReDrafter in the NVIDIA TensorRT-LLM inference acceleration framework, Apple reports a 2.7x speed-up in generated tokens per second for greedy decoding. These results indicate that the technique could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
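
To put the 2.7x figure in perspective, the short Python calculation below converts a throughput speed-up into generation time for a fixed-length response. Only the 2.7x factor comes from the article; the baseline throughput and response length are hypothetical values chosen for illustration, and the calculation says nothing about GPU count or power draw.

```python
SPEEDUP = 2.7  # reported tokens-per-second speed-up for greedy decoding


def generation_time(tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate `tokens` at a steady throughput."""
    return tokens / tokens_per_second


baseline_tps = 20.0    # hypothetical baseline throughput
response_tokens = 500  # hypothetical response length

before = generation_time(response_tokens, baseline_tps)
after = generation_time(response_tokens, baseline_tps * SPEEDUP)
reduction = 1 - after / before  # equals 1 - 1/2.7, whatever the baseline is

print(f"before: {before:.1f} s, after: {after:.1f} s, reduction: {reduction:.0%}")
```

Whatever baseline is assumed, a 2.7x throughput gain means the same response takes 1/2.7 of the original time, a reduction of roughly 63 percent.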

 

“LLMs are increasingly being used to power production applications, and improving inference efficiency can both impact computational costs and reduce latency for users,” Apple’s machine learning researchers conclude. “With ReDrafter’s novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.”

 

Source: 9to5mac
