With the introduction of artificial intelligence (AI), and especially generative AI, the automotive industry is undergoing a profound transformation. A recent McKinsey survey of executives in the automotive and manufacturing industries found that over 40% of respondents have invested up to 5 million euros in generative AI research and development, and more than 10% have invested over 20 million euros.
As the industry continues to evolve toward software-defined vehicles (SDVs), the amount of code in cars is expected to grow from around 100 million lines per vehicle today to approximately 300 million lines by 2030. Combining generative AI with SDVs enables in-car use cases spanning performance and comfort, helping to enhance the driving experience.
This article introduces an in-vehicle generative AI use case developed in collaboration between Arm and Amazon Web Services (AWS), along with its implementation details.
Use case introduction
As cars become increasingly sophisticated, owners can now continue to receive updates to features such as parking assistance or lane keeping after delivery. The accompanying challenge is how to keep owners informed of new updates and features in a timely manner. Traditional methods, such as paper or online manuals, have proven inadequate, leaving owners unable to tap the full potential of their vehicles.
To meet this challenge, AWS developed an on-board generative AI demonstration that combines generative AI, edge computing, and the Internet of Things (IoT). The solution presented in the demonstration is an in-car application powered by a small language model (SLM), designed to let drivers obtain the latest vehicle information through natural voice interaction. Once deployed, the application can run offline, ensuring that drivers can access important vehicle information without an internet connection.
The solution integrates multiple advanced technologies to create a more seamless and efficient experience for users. The demonstration application runs a small language model locally in the car and uses routines optimized by Arm KleidiAI to improve performance. Without KleidiAI optimization, the system's response time is around 8 to 19 seconds; with KleidiAI, the small language model's inference response time is 1 to 3 seconds. Using KleidiAI also shortened application development by six weeks, because developers did not need to spend time optimizing the underlying software.
Arm Virtual Hardware provides access to many popular IoT development kits on AWS. When physical devices are unavailable, or when teams around the world cannot access them, developing and testing on Arm Virtual Hardware saves development time for embedded applications. AWS successfully tested the demo application on the automotive virtual platform, using the virtual Raspberry Pi instances that Arm Virtual Hardware provides. The same KleidiAI optimizations also apply on Arm Virtual Hardware.
One key feature of this generative AI application running on an edge device is its ability to receive over-the-air (OTA) updates, some of which are delivered using AWS IoT Greengrass Lite, ensuring that the driver always has the latest information. AWS IoT Greengrass Lite occupies only 5 MB of RAM on the edge device, making it highly memory-efficient. In addition, the solution includes automatic quality monitoring and a feedback loop that continuously evaluates the relevance and accuracy of the small language model's responses: a comparison system flags responses that fall outside the expected quality threshold for review. The collected feedback data is then visualized in near real time on an AWS dashboard, allowing the vehicle manufacturer's quality assurance team to review it, identify areas for improvement, and initiate updates as needed.
The value of this generative-AI-powered solution goes beyond providing accurate information to drivers. It also reflects a paradigm shift in SDV lifecycle management, enabling a more continuous improvement cycle. Vehicle manufacturers can add new content based on user interactions, while the small language model can be fine-tuned with updated information and seamlessly deployed over the air. Keeping vehicle information current improves the user experience, and it also gives manufacturers the opportunity to introduce new features or guide users toward additional functions available for purchase. By harnessing generative AI, IoT, and edge computing, this application can serve as a vehicle owner's guide, and the approach it demonstrates helps deliver a more connected, informed, and adaptive driving experience in the SDV era.
End-to-end solution implementation
The solution architecture shown in the figure below covers fine-tuning the model, testing it on Arm Virtual Hardware, deploying the small language model to the edge device, and a feedback collection mechanism.
The numbers in the above figure correspond to the following:
1. Model tuning: The AWS demo application development team chose TinyLlama-1.1B-Chat-v1.0 as the base model, which is pre-trained for conversational tasks. To optimize the owner's-guide chat interface for drivers, the team designed concise, focused replies to suit situations where a driver can spare only limited attention while driving. The team created a custom dataset of 1,000 question-answer pairs and fine-tuned the model using Amazon SageMaker Studio.
2. Storage: The optimized small language model is stored in Amazon Simple Storage Service (Amazon S3).
3. Initial deployment: The small language model is initially deployed to an Ubuntu-based Amazon EC2 instance.
4. Development and optimization: The team developed and tested the generative AI application on the EC2 instance, quantizing the small language model with llama.cpp using the Q4_0 scheme. The KleidiAI optimizations come pre-integrated in llama.cpp. Quantization also compressed the model significantly, reducing the file size from 3.8 GB to 607 MB.
5. Virtual testing: The application and small language model are transferred to the virtual Raspberry Pi environment on Arm Virtual Hardware for initial testing.
6. Virtual verification: Comprehensive testing is conducted on the virtual Raspberry Pi device to confirm proper functionality.
7. Edge deployment: The generative AI application and small language model are deployed to a physical Raspberry Pi device using AWS IoT Greengrass Lite, with AWS IoT Core jobs used for deployment management.
8. Deployment orchestration: AWS IoT Core manages the jobs that deploy software to the edge Raspberry Pi device.
9. Installation: AWS IoT Greengrass Lite processes the software packages downloaded from Amazon S3 and completes the installation automatically.
10. User interface: The deployed application provides voice-based interaction for end users on the edge Raspberry Pi device.
11. Quality monitoring: The generative AI application monitors the quality of user interactions. Data is collected through AWS IoT Core, processed through Amazon Kinesis Data Streams and Amazon Data Firehose, and stored in Amazon S3. Vehicle manufacturers can monitor and analyze the data through an Amazon QuickSight dashboard, promptly identifying and resolving any small language model quality issues.
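As a rough illustration of the quality-monitoring step, the sketch below builds the kind of feedback event an edge application could publish to AWS IoT Core for downstream processing. The field names, scoring, and threshold are all hypothetical; the demo's actual schema is not described in this article.

```python
import json
import time

QUALITY_THRESHOLD = 0.7  # hypothetical review threshold

def make_feedback_record(question: str, answer: str, quality_score: float) -> str:
    """Build a JSON feedback event for the quality-monitoring pipeline.

    Responses whose score falls outside the expected quality threshold are
    flagged so the quality assurance team can review them on the dashboard.
    """
    record = {
        "timestamp": int(time.time()),
        "question": question,
        "answer": answer,
        "quality_score": quality_score,
        "needs_review": quality_score < QUALITY_THRESHOLD,
    }
    return json.dumps(record)

# Example: a low-scoring response gets flagged for review.
event = json.loads(make_feedback_record(
    "How do I enable parking assist?", "Sorry, I am not sure.", 0.42))
print(event["needs_review"])  # True
```

In a real deployment, an event like this would be published to an IoT topic, streamed through Kinesis Data Streams and Data Firehose into S3, and surfaced in QuickSight.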
Next, we will take a closer look at KleidiAI and the quantization scheme used in this demonstration.
Arm KleidiAI
Arm KleidiAI is an open-source library designed for AI framework developers. It provides optimized, performance-critical routines for Arm CPUs. First released in May 2024, the library now offers matrix multiplication optimizations for a range of data types, from 32-bit floating point and Bfloat16 down to ultra-low-precision formats such as 4-bit fixed point. These optimizations take advantage of several Arm CPU technologies, such as SDOT and i8mm for 8-bit computation, and MLA for 32-bit floating-point operations.
With its four Arm Cortex-A76 cores, the Raspberry Pi 5 used in the demonstration relies on KleidiAI's SDOT optimization. SDOT was one of the earliest Arm CPU instructions designed for AI workloads and was introduced in Armv8.2-A, released in 2016.
SDOT also reflects Arm's ongoing commitment to improving AI performance on CPUs. Since SDOT, Arm has steadily introduced new instructions for running AI on CPUs, such as i8mm for more efficient 8-bit matrix multiplication, and Bfloat16 support, which improves on 32-bit floating-point performance while halving memory usage.
In the Raspberry Pi 5 demonstration, KleidiAI plays a key role in accelerating matrix multiplication through a block-wise quantization scheme with 4-bit integer quantization (known as Q4_0 in llama.cpp).
Q4_0 quantization format in llama.cpp
Q4_0 matrix multiplication in llama.cpp involves the following components:
The left-hand side (LHS) matrix stores the activations as 32-bit floating-point values.
The right-hand side (RHS) matrix contains the weights in a 4-bit fixed-point format. In this format, a quantization scale is applied to each block of 32 consecutive 4-bit integer values and is encoded as a 16-bit floating-point value.
So when we speak of 4-bit integer matrix multiplication, we are referring specifically to the format used for the weights, as shown in the following figure:
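To make the block-wise scheme concrete, here is a simplified pure-Python sketch of quantizing one block of 32 float weights to signed 4-bit integers that share a single scale. This illustrates the idea only; llama.cpp's actual Q4_0 format (nibble packing, scale derivation, fp16 storage) differs in detail.

```python
def quantize_block_q4(block):
    """Quantize one block of 32 float values to signed 4-bit integers in
    [-8, 7] that share a single scale (stored as fp16 in the real format)."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    scale = amax / 7.0 if amax > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize_block_q4(scale, q):
    """Recover approximate float values from the 4-bit block."""
    return [v * scale for v in q]

weights = [i / 16.0 - 1.0 for i in range(32)]   # toy weights in [-1, 1)
scale, q = quantize_block_q4(weights)
approx = dequantize_block_q4(scale, q)
print(max(abs(a - w) for a, w in zip(approx, weights)))  # small reconstruction error
```

Because each group of 32 weights carries its own scale, large values in one part of the tensor do not destroy the precision of small values elsewhere.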
At this stage, neither the LHS nor the RHS matrix is in 8-bit format. So how can KleidiAI use the SDOT instruction, which is designed for 8-bit integer dot products? Both input matrices must first be converted to 8-bit integer values.
For the LHS matrix, an additional step is required before the matrix multiplication routine: dynamic quantization to an 8-bit fixed-point format. This step uses a block-wise quantization scheme to quantize the LHS matrix to 8 bits on the fly, with a quantization scale applied to each block of 32 consecutive 8-bit integer values and stored as a 16-bit floating-point value, mirroring the 4-bit quantization approach.
Dynamic quantization minimizes the risk of accuracy loss, because the quantization scale factor is computed from the minimum and maximum values in each block at inference time. By contrast, the scale factor in static quantization is predetermined and fixed.
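The accuracy benefit can be seen with a small, self-contained comparison: round-tripping a low-magnitude activation block through signed 8-bit quantization using a scale computed from that block (dynamic) versus a hypothetical tensor-wide scale fixed in advance (static). All numbers here are illustrative.

```python
def quantize_dequantize_q8(block, scale):
    """Round-trip a block through signed 8-bit quantization at a given scale."""
    return [max(-127, min(127, round(x / scale))) * scale for x in block]

# A block whose values are much smaller than the overall tensor range
# a static scheme would have to accommodate.
block = [(-1) ** i * (i / 310.0) for i in range(32)]   # values within +/-0.1

dynamic_scale = max(abs(x) for x in block) / 127.0     # computed at inference time
static_scale = 4.0 / 127.0                             # hypothetical fixed, tensor-wide scale

err_dynamic = max(abs(a - x) for a, x in
                  zip(quantize_dequantize_q8(block, dynamic_scale), block))
err_static = max(abs(a - x) for a, x in
                 zip(quantize_dequantize_q8(block, static_scale), block))
print(err_dynamic < err_static)  # True: the per-block scale preserves accuracy
```

The dynamic scale adapts to each block's actual range, so small activations keep far more precision than under one scale chosen for the whole tensor.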
For the RHS matrix, no additional step is needed before the matrix multiplication routine. The 4-bit quantization serves as a compression format, while the actual computation is performed in 8 bits. Each 4-bit value is therefore converted to 8 bits before being passed to the dot-product instruction. This conversion is cheap, requiring only simple shift/mask operations.
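The shift/mask conversion can be sketched in a few lines: each byte packs two 4-bit values, and each nibble is sign-extended into a signed 8-bit integer. The packing order (low nibble first) is an assumption for illustration, not necessarily llama.cpp's exact layout.

```python
def unpack_q4_to_q8(packed: bytes) -> list[int]:
    """Expand packed 4-bit signed values (two per byte, low nibble first)
    into signed 8-bit integers using only shift and mask operations."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, (byte >> 4) & 0x0F):
            # Sign-extend: nibbles 0x8..0xF represent -8..-1.
            out.append(nibble - 16 if nibble & 0x08 else nibble)
    return out

print(unpack_q4_to_q8(bytes([0x21, 0xF8])))  # [1, 2, -8, -1]
```

Because only bitwise AND, shift, and a conditional subtract are involved, the unpacking cost is negligible next to the dot products themselves.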
If the conversion is so cheap, why not store the weights in 8 bits and skip the conversion altogether?
There are two key advantages to using 4-bit quantization:
Smaller model size: 4-bit values need only half the memory of 8-bit values, which is especially beneficial on platforms with limited RAM.
Better text generation performance: Text generation relies on a series of matrix-vector operations that are typically memory-bound. In other words, performance is limited by the speed of data transfer between memory and the processor, rather than by the processor's compute capability. Because memory bandwidth is the limiting factor, shrinking the data reduces memory traffic and significantly improves performance.
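A back-of-the-envelope calculation shows why this matters for a memory-bound workload. For a hypothetical 4096-by-4096 weight matrix read once per generated token, compare the bytes of weight traffic at 8 bits per weight against a Q4_0-style layout, where each block of 32 weights stores 32 four-bit values plus one 16-bit scale.

```python
ROWS, COLS = 4096, 4096          # hypothetical layer dimensions
weights = ROWS * COLS

bytes_q8 = weights               # 1 byte per weight
# Q4_0-style block: 32 weights * 0.5 byte + 2-byte fp16 scale = 18 bytes
bytes_q4 = (weights // 32) * 18

print(bytes_q4 / bytes_q8)       # 0.5625: roughly 44% less memory traffic
```

When the matrix-vector multiply is bandwidth-limited, moving ~44% fewer bytes per token translates almost directly into faster token generation.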
How do you combine KleidiAI with llama.cpp?
It's simple: KleidiAI is already integrated into llama.cpp. Developers therefore need no additional dependencies to get the full performance of Arm CPUs on Armv8.2 and newer architecture versions.
This integration means that developers running llama.cpp on mobile devices, embedded computing platforms, and servers based on Arm processors now get better performance out of the box.
Are there any other options besides llama.cpp?
Although llama.cpp is a good choice for running large language models on Arm CPUs, developers can also use other high-performance generative AI frameworks optimized with KleidiAI, for example (in alphabetical order): ExecuTorch, MediaPipe, MNN, and PyTorch. Simply choose the latest version of the framework.
So if you are considering deploying generative AI models on Arm CPUs, exploring the frameworks above can help you optimize performance and efficiency.
Summary
The integration of SDVs and generative AI is ushering in a new era of automotive innovation, making future cars more intelligent and user-centric. The in-car generative AI demonstration introduced in this article, optimized with Arm KleidiAI and backed by AWS services, shows how emerging technologies can help solve practical challenges in the automotive industry. With response times of 1 to 3 seconds and development time shortened by several weeks, the solution proves that efficient, offline-capable generative AI applications are not only feasible but well suited to in-vehicle deployment.
The future of automotive technology lies in solutions that seamlessly integrate edge computing, IoT, and AI. As cars continue to evolve and their software grows more complex, solutions such as the one introduced in this article will be key to bridging the gap between advanced vehicle features and user understanding.