AI inference at the edge refers to running trained machine learning (ML) models closer to end users than traditional cloud AI inference does. Edge inference accelerates the response time of ML models, enabling real-time AI applications in industries such as gaming, healthcare, and retail.
What is AI inference at the edge?
Before we look at AI inference specifically at the edge, it’s worth understanding what AI inference is in general. In the AI/ML development lifecycle, inference is where a trained ML model performs tasks on new, previously unseen data, such as making predictions or generating content. AI inference happens when end users interact directly with an ML model embedded in an application. For example, when a user inputs a prompt to ChatGPT and gets a response back, the time when ChatGPT is “thinking” is when inference is occurring, and the output is the result of that inference.
AI inference at the edge is a subset of AI inference whereby an ML model runs on a server close to end users; for example, in the same region or even the same city. This proximity reduces latency to milliseconds for faster model response, which is beneficial for real-time applications like image recognition, fraud detection, or gaming map generation.
Head of AI Product at Gcore.
How AI inference at the edge relates to edge AI
AI inference at the edge is a subset of edge AI. Edge AI involves processing data and running ML models closer to the data source rather than in the cloud. Edge AI includes everything related to edge AI computing, from edge servers (the metro edge) to IoT devices and telecom base stations (the far edge). Edge AI also includes training at the edge, not just inference. In this article, we’ll focus on AI inference on edge servers.
How inference at the edge compares to cloud inference
With cloud AI inference, you run an ML model on a remote cloud server, and the user's data is sent to and processed in the cloud. In this case, an end user may interact with the model from a different region, country, or even continent. As a result, cloud inference latency ranges from hundreds of milliseconds to seconds. This type of AI inference is suitable for applications that don’t require local data processing or low latency, such as ChatGPT, DALL-E, and other popular GenAI tools. Edge inference differs in two related ways:
- Inference happens closer to the end user
- Latency is lower
How AI inference at the edge works
AI inference at the edge relies on an IT infrastructure with two main architectural components: a low-latency network and servers powered by AI chips. If you need scalable AI inference that can handle load spikes, you also need a container orchestration service, such as Kubernetes; this runs on edge servers and enables your ML models to scale up and down quickly and automatically. Today, only a few providers have the infrastructure to offer global AI inference at the edge that meets these requirements.
Low-latency network: A provider offering AI inference at the edge should have a distributed network of edge points of presence (PoPs) where servers are located. The more edge PoPs, the shorter the network round-trip time, which means ML model responses reach end users faster. A provider should have dozens—or even hundreds—of PoPs worldwide and should offer smart routing, which directs each user request to the closest edge server so the globally distributed network is used efficiently and effectively.
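The core idea behind smart routing can be sketched in a few lines: given measured round-trip times from a user to each PoP, pick the PoP with the lowest latency. This is a minimal illustration; the PoP names and RTT values below are hypothetical, and real providers derive these measurements from live network probes or anycast routing rather than a static table.

```python
# Hypothetical edge PoPs with measured round-trip times in milliseconds.
# Real systems would populate this from live latency probes.
POPS = {
    "frankfurt": 12.4,
    "amsterdam": 18.9,
    "paris": 25.1,
    "warsaw": 41.7,
}

def pick_closest_pop(rtts: dict) -> str:
    """Return the name of the PoP with the lowest round-trip time."""
    return min(rtts, key=rtts.get)

print(pick_closest_pop(POPS))  # frankfurt
```

The principle scales: the more PoPs in the table, the more likely one of them is only a few milliseconds away from any given user.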
Servers with AI accelerators: To reduce computation time, you need to run your ML model on a server or VM powered by an AI accelerator, such as an NVIDIA GPU. Some GPUs are designed specifically for AI inference. For example, NVIDIA quotes up to 5x higher inference performance for the L40S than for the A100; the A100 and H100 are designed primarily for training large ML models, though they are also widely used for inference. Inference-oriented GPUs like the L40S are therefore a strong fit for serving models at the edge.
Container orchestration: Deploying ML models in containers makes models scalable and portable. A provider can manage an underlying container orchestration tool on your behalf. In that setup, an ML engineer looking to integrate a model into an application would simply upload a container image with an ML model and get a ready-to-use ML model endpoint. When a load spike occurs, containers with your ML model will automatically scale up, and then scale back down when the load subsides.
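The scale-up/scale-down behavior described above can be illustrated with the proportional rule that Kubernetes' HorizontalPodAutoscaler uses: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule to a hypothetical requests-per-second metric; the numbers are illustrative, not taken from any real deployment.

```python
import math

def desired_replicas(current: int, current_rps: float, target_rps: float,
                     min_r: int = 1, max_r: int = 10) -> int:
    """Replica count an autoscaler would request, using the same
    proportional rule as the Kubernetes HorizontalPodAutoscaler:
    desired = ceil(current * current_metric / target_metric),
    clamped to the configured min/max replica bounds."""
    desired = math.ceil(current * current_rps / target_rps)
    return max(min_r, min(max_r, desired))

# Load spike: 2 replicas serving 900 req/s against a 300 req/s target
print(desired_replicas(2, 900, 300))   # 6
# Load subsides: 6 replicas now see only 150 req/s in total
print(desired_replicas(6, 150, 300))   # 3
```

In a managed setup, this logic runs inside the provider's orchestration layer; the ML engineer only sees the endpoint staying responsive as containers are added and removed.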
Key benefits of AI inference at the edge
AI inference at the edge offers three key benefits across industries and use cases: low latency, security and sovereignty, and cost efficiency.
Low latency
The lower the network latency, the faster your model will respond. If a provider’s average network latency is under 50 ms, it’s appropriate for most apps requiring a near-instant response. By comparison, cloud latency can be as high as a few hundred milliseconds, depending on your location relative to the cloud server. That’s a noticeable difference for an end user, with cloud latency potentially leading to frustration as end users are left waiting for their AI responses.
Keep in mind that a low-latency network only accounts for the travel time of the data. A 50 ms network latency doesn’t mean users will get an AI output in 50 ms; you need to add the time the ML model takes to perform inference. That inference time depends on the model being used and may account for the majority of the total response time. That’s all the more reason to ensure you’re using a low-latency network, so your users get the best possible response time while ML model developers continue to improve model inference speed.
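The arithmetic above is simple but worth making explicit: the time a user waits is the network round trip plus the model's compute time. The figures below are illustrative only, assuming a model that takes 120 ms to run regardless of where it is hosted.

```python
def end_to_end_latency_ms(network_rtt_ms: float, inference_ms: float) -> float:
    """Total time the user waits: network round trip plus model compute.
    Illustrative values; real numbers depend on the model and network."""
    return network_rtt_ms + inference_ms

# Edge: 50 ms network round trip + 120 ms model inference
print(end_to_end_latency_ms(50, 120))   # 170.0
# Cloud: 300 ms network round trip + the same 120 ms inference
print(end_to_end_latency_ms(300, 120))  # 420.0
```

Note that moving to the edge only shrinks the first term; the second term is why model optimization still matters even on the fastest network.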
Security and sovereignty
Keeping data at the edge—meaning local to the user—simplifies compliance with local laws and regulations, such as GDPR and its equivalents in other countries. An edge inference provider should set up its inference infrastructure to adhere to local laws to ensure that you and your users are protected appropriately.
Edge inference also increases the confidentiality and privacy of your end users’ data because it’s processed locally rather than being sent to remote cloud servers. This reduces the attack surface and minimizes the risk of data exposure during transmission.
Cost efficiency
Typically, a provider charges only for the computational resources utilized by the ML model. This, along with carefully configured autoscaling and model execution schedules, can significantly reduce inference costs.
Who should use AI inference at the edge?
Here are some common scenarios where inference at the edge would be the optimal choice:
- Low latency is critical for your application and users. A wide range of real-time applications, from facial recognition to trade analysis, require low latency. Edge inference provides the lowest latency inference option.
- Your user base is spread across multiple geographical locations. In this case, you need to provide the same user experience—meaning the same low latency—to all of your users regardless of their location. This requires a globally distributed edge network.
- You don’t want to deal with infrastructure maintenance. If supporting cloud and AI infrastructure isn’t part of your core business, it may be worth delegating these processes to an experienced, expert partner. You can then focus your resources on developing your application.
- You want to keep your data local, for example, within the country where it’s generated. In this case, you need to perform AI inference as close to your end users as possible. A globally distributed edge network can meet this need, whereas the cloud is unlikely to offer the extent of distribution you require.
Which industries benefit from AI inference at the edge?
AI inference at the edge benefits any industry where AI/ML is used, but especially those developing real-time applications. In the technology sector, this includes generative AI applications, chatbots and virtual assistants, data augmentation, and AI tools for software engineers. In gaming, it means AI content and map generation, real-time player analytics, and real-time AI bot customization and conversation. For the retail market, typical applications include smart grocery with self-checkout and merchandising, virtual try-on, and content generation, predictions, and recommendations.
In manufacturing, the benefits apply to real-time defect detection in production pipelines, VR/XR applications, and rapid-response feedback. In media and entertainment, they include content analysis, real-time translation, and automated transcription. Another sector that develops real-time applications is automotive, particularly rapid response for autonomous vehicles, vehicle personalization, advanced driver assistance, and real-time traffic updates.
Conclusion
For organizations looking to deploy real-time applications, AI inference at the edge is an essential component of their infrastructure. It significantly reduces latency, ensuring ultra-fast response times. For end users, this means a seamless, more engaging experience, whether playing online games, using chatbots, or shopping online with a virtual try-on service. Enhanced data security means businesses can offer superior AI services while protecting user data. AI inference at the edge is a critical enabler to AI/ML production deployment at scale, driving AI/ML innovation and efficiency across numerous industries.
This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro