The Booming Market of Video Analytics: Opportunities and Challenges for VMS and NVR Providers
The use of video analytics is experiencing a rapid surge, with the market projected to grow from $9 billion currently to a staggering $52.7 billion...
Computer vision holds the promise of transforming industries like retail, security, and manufacturing by equipping machines with the ability to interpret visual data and act on it. For seasoned engineers, crafting an initial model can be a matter of weeks, not months. Yet, despite its vast potential, computer vision hasn't become as ubiquitous as one might expect. Why is that? The answer lies in the complexities of real-world deployment—where the devil is in the details, from data quality to the integration of these systems into existing workflows.
In this article, we’ll explore the developer journey from creating a model in the lab to deploying it at scale in the real world, the challenges along the way, and how Deep Perception can make your life a whole lot easier!
When it comes to deploying computer vision models, choosing both the right model and the right AI accelerator is crucial. The decision largely depends on the specific use case you are trying to solve plus hardware constraints driven by cost and power budget. Developers also need to consider whether the deployment will run in the cloud or at the edge, each of which comes with its own tradeoffs.
Whether your solution will ultimately run in the cloud or at the edge is probably the first decision you should make when developing a computer vision application, since it has the greatest impact on the overall solution architecture.
Cloud based solutions work great when you do not have bandwidth or latency concerns between the video source and the cloud. The cloud also lends itself to bursty workloads where you do not need to process video streams 24x7. Edge based deployments on the other hand run close to the video source on dedicated hardware, which eliminates sending all of the video frames to the cloud and supports near real-time applications that are not possible with cloud solutions.
One concerning pattern we have seen is that computer vision engineers default to cloud based solutions simply because they are familiar with the tooling and accelerator architecture. If for some reason the cloud version is not viable for production, they look to port it over to the NVIDIA Jetson product line since the tooling is similar.
No need to dance around it: NVIDIA dominates the computer vision space because they have done a great job developing tooling around their accelerators. Unfortunately, NVIDIA also does a great job of locking you in with their premium-priced Jetson line. At Deep Perception we aim to provide developers access to a wide range of significantly more cost effective edge AI platforms while providing an industry leading development and deployment experience.
For most computer vision engineers, this is the fun part of developing the software application. For this article, we’re going to assume that you already have a dataset that will allow you to validate model performance (accuracy) and do re-training if necessary. Building and managing a dataset is not a trivial task, but there are some great tools out there, such as Voxel51, that help.
Being able to rapidly test existing models can significantly accelerate development efforts. Oftentimes there’s no need to train a model from scratch. Instead, you can build upon pre-existing models; in many cases lightweight retraining is all that’s needed. We also know the throughput of these base models on the variety of accelerators we support, which helps with the model selection process.
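To make that concrete, here’s roughly what a first pass with an off-the-shelf detector can look like in Python. The ultralytics package and the sample image are illustrative choices on our part, not a requirement of any particular workflow:

```python
# Quick check of an off-the-shelf detector before committing to custom training.
# Assumes the ultralytics package is installed and a sample frame is on disk;
# swap in frames from your own dataset.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # small pre-trained detection model
results = model("sample_frame.jpg")  # run inference on one frame

for box in results[0].boxes:
    # class id, confidence and [x1, y1, x2, y2] box for each detection
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```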
Due to model quantization plus performance characteristics of various AI accelerators, developers need to validate their model not only in the lab using a GPU but also test their models on the candidate AI accelerator targets. This is where things become difficult again for most computer vision engineers and an area where Deep Perception can help.
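A lightweight sanity check along those lines is to compare the detections from the accelerator (quantized) build of a model against the FP32 reference on the same validation frames. The sketch below is illustrative; the box lists would come from your own evaluation harness:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def agreement(reference_boxes, quantized_boxes, threshold=0.5):
    """Fraction of FP32 reference detections matched by the quantized model."""
    if not reference_boxes:
        return 1.0
    matched = sum(
        any(iou(r, q) >= threshold for q in quantized_boxes) for r in reference_boxes
    )
    return matched / len(reference_boxes)
```

Tracking this agreement score per frame across your validation set quickly surfaces cases where quantization hurts a particular class or scene type.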
The Deep Perception Platform includes a variety of pre-trained models for various AI accelerators so you can quickly cycle through various base models to find viable options for your use case. We also provide (re)training and compilation toolchains for the accelerators our platform supports which means our users don’t need to become an expert on the toolchain to try out a specific accelerator. Developers can do their initial model development in the toolchain they are already familiar with and use the Deep Perception platform to easily port it over to any of the edge AI accelerators we support.
Once you have a better idea of the models you will potentially be using, it’s time to narrow down the list of accelerators you plan on using, since this will have other downstream implications. At this stage you are trying to form a good understanding of the inference latency and frames per second your target model(s) will achieve on the hardware accelerators you are considering. This is done by comparing published numbers for similar models and, ideally, by running your own model on actual hardware to identify not only AI accelerator bottlenecks but also system level bottlenecks.
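As a starting point, a simple timing loop like the sketch below usually gives a first read on latency percentiles and sustained throughput; run_inference() is a placeholder for whichever accelerator SDK call you end up using:

```python
import statistics
import time

def benchmark(run_inference, frame, warmup=20, iterations=200):
    """Rough latency/FPS estimate for one model on one device."""
    for _ in range(warmup):                       # let caches and clocks settle
        run_inference(frame)
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference(frame)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"p50 latency: {p50 * 1e3:.2f} ms, p95 latency: {p95 * 1e3:.2f} ms")
    print(f"throughput: {1.0 / statistics.mean(latencies):.1f} FPS")
```

Running the same loop while the rest of the system is busy decoding video is a quick way to expose the system level bottlenecks mentioned above.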
One option with edge AI architectures that differs from GPUs used at the edge is the ability to scale up the number of accelerators in a single system. Edge AI accelerators only consume a few watts each and the relative cost per accelerator is low, so it’s very common to have multi-accelerator systems. 2-4 accelerators in a system is fairly common, but we’ve seen machines with large numbers of Hailo-8 accelerators in them.
The best way to pick the right number of accelerators per system is through experimentation on actual hardware. Deep Perception makes it easy to work with multiple accelerators in one system with the ability to load-balance a workload across accelerators and intelligently route frames to models running on different accelerators to support cascaded networks.
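Conceptually, the simplest form of load balancing is round-robin dispatch across device handles. The accelerator objects and their infer() method below are hypothetical stand-ins, not the Deep Perception API:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

class RoundRobinBalancer:
    """Spread inference requests across several accelerator handles."""

    def __init__(self, accelerators):
        self._cycle = itertools.cycle(accelerators)          # rotate through devices
        self._pool = ThreadPoolExecutor(max_workers=len(accelerators))

    def submit(self, frame):
        """Send the frame to the next accelerator; returns a Future with the result."""
        device = next(self._cycle)
        return self._pool.submit(device.infer, frame)
```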
One of the key aspects of an AI system design is the image acquisition pipeline, which has to reliably maintain connections to video sources plus handle the necessary video decoding tasks. We’ll cover this and other critical components necessary to support real time and low latency computer vision applications.
Real time computer vision applications deliver events “instantly” whereas near real time applications typically deliver information in seconds. In both cases you need a resilient media processing pipeline that handles video acquisition including decode, AI model execution (inference) and the generation of meaningful insights from the model output.
We’ll dive into all of these in a bit more detail below. One thing to point out is that AI model development and testing is typically done in Python. Media pipelines, due to their real time nature, are developed in lower level languages, with C++ being the most common choice. Oftentimes a highly skilled model developer or data scientist is proficient in Python, but it’s rare to find one who can also build production grade real time media pipelines in C++.
A computer vision model is only useful if you can reliably feed it with video frames. In addition to taking into account the throughput of the computer vision model, developers also need to make sure they can acquire and decode video sources at the framerate their use case requires. They also need to handle muxing multiple video streams into a single accelerator.
Video may come from a variety of sources including media files, ONVIF IP cameras, RTSP video sources and directly attached cameras through various interfaces such as USB or CSI. The media pipeline needs to handle not only making the initial connection but also monitoring the connection so that auto-remediation can be performed if necessary.
Files and video sources transported across the network are typically encoded in H.264 or H.265 to reduce the amount of network traffic and storage used. Software decoders may be viable options for a small number of streams, but ideally you can offload this task to dedicated hardware on the underlying platform. The good news is that a majority of hardware platforms have a hardware decoding option and the Deep Perception platform can take advantage of it when present. Both the Intel iGPU and Rockchip platforms can easily support 16-32 1080p streams, and some edge AI accelerators include H.264 decoders as well.
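To make the acquisition side concrete, here is a minimal sketch using GStreamer from Python: connect to an RTSP camera, decode H.264, hand raw frames to an appsink and restart the pipeline if the connection drops. The camera URL is a placeholder, avdec_h264 is a software decoder that can be swapped for a hardware element where one is available, and this is an illustration rather than the Deep Perception pipeline itself:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)
pipeline = Gst.parse_launch(
    "rtspsrc location=rtsp://camera.local/stream latency=200 ! "
    "rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! "
    "appsink name=sink emit-signals=true max-buffers=4 drop=true"
)

def on_new_sample(appsink):
    sample = appsink.emit("pull-sample")   # one decoded frame
    # map the buffer here and hand it to the muxing/inference stages
    return Gst.FlowReturn.OK

pipeline.get_by_name("sink").connect("new-sample", on_new_sample)

def restart():
    pipeline.set_state(Gst.State.PLAYING)
    return False                           # one-shot timeout

def on_bus_message(bus, message):
    if message.type in (Gst.MessageType.ERROR, Gst.MessageType.EOS):
        # auto-remediation: tear down and retry the connection after a short delay
        pipeline.set_state(Gst.State.NULL)
        GLib.timeout_add_seconds(5, restart)

bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message", on_bus_message)

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```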
Once you have the video decoded into raw frames, you’re ready to feed them into your model. In most cases an individual accelerator will be able to process more video frames than a single source provides. In order to maximize the utilization of an AI accelerator, multiple video streams are combined into a single stream through the process of muxing. The interleaved frames are tagged with the video source so other pipeline elements know what they are operating on and so that the combined stream can be demuxed back into individual streams if needed.
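In Python-flavored pseudocode the idea looks something like this; real pipelines carry the tag as buffer metadata rather than wrapping frames in Python objects:

```python
import itertools
import queue
from dataclasses import dataclass
from typing import Any

@dataclass
class TaggedFrame:
    source_id: str   # which camera/file produced the frame
    data: Any        # decoded frame buffer (e.g. a numpy array)

def mux(sources, muxed):
    """Interleave frames from several per-source queues into one tagged stream."""
    for source_id in itertools.cycle(sources):
        try:
            frame = sources[source_id].get(timeout=0.1)
        except queue.Empty:
            continue                       # that source has nothing ready right now
        muxed.put(TaggedFrame(source_id, frame))

def demux(muxed, per_source):
    """Split the combined stream back into per-source queues using the tag."""
    while True:
        tagged = muxed.get()
        per_source[tagged.source_id].put(tagged.data)
```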
In order to support low latency applications, the inference engine should be integrated directly with this media pipeline. Some AI accelerator vendors provide plug-ins for GStreamer, but in many cases only a C++ API is provided and building a GStreamer plug-in is left to the user. GStreamer development can be challenging, which is why Deep Perception provides plug-ins for all of the accelerators we support as part of our platform.
Leveraging our experience with these accelerators, we set sensible defaults for parameters such as batch size and color space, and we use image manipulation capabilities provided by the AI accelerator when available. Our platform also supports load balancing across accelerators running the same model so that your application is not limited by the throughput of a single accelerator.
Single model use cases are fairly straightforward to implement, but many computer vision use cases require the use of multiple models. License plate reader applications typically use an object detection model to find vehicles, a second model to locate and extract the license plate, and a third model that actually reads the plate.
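A simplified version of that cascade, with detect_vehicles, detect_plate and read_plate standing in for the three separate models and frames assumed to be numpy-style arrays, looks like this:

```python
def process_frame(frame, detect_vehicles, detect_plate, read_plate):
    """Run the three-stage cascade on one frame and return any plate strings found."""
    plates = []
    for (x1, y1, x2, y2) in detect_vehicles(frame):
        vehicle_crop = frame[y1:y2, x1:x2]           # only this region moves downstream
        plate_box = detect_plate(vehicle_crop)
        if plate_box is None:                        # no plate visible on this vehicle
            continue
        px1, py1, px2, py2 = plate_box
        plates.append(read_plate(vehicle_crop[py1:py2, px1:px2]))
    return plates
```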
For these more complex use cases you need to intelligently route only the relevant frames to the correct model which requires coordination between the tracking algorithms and the media pipeline. Our platform provides the building blocks necessary to create complex applications so you don’t have to build them yourself.
This section covers everything downstream of the computer vision model, the components that combine to provide actionable insights to users of the application. For low latency and real time applications it’s critical to have these components as part of the media pipeline, which is exactly how the Deep Perception platform is structured.
Tracking algorithms provide continuity for detected objects across the individual frames the computer vision model operates on. Relatively simple algorithms just maintain the same object ID for a detected object, while more complex algorithms can predict movement, re-identify objects that left the frame or even track objects across multiple camera streams, such as following a person as they move through a retail store. Similar to models, off the shelf object trackers are usually a good starting point for most applications and achieve excellent results with some minor tuning. Our approach is to provide some out of the box trackers while also enabling developers to bring their own tracking code if desired.
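To give a feel for the simple end of that spectrum, here is a toy IoU-based tracker that keeps the same ID for a detection overlapping a track from the previous frame; production trackers layer motion prediction and re-identification on top of this idea:

```python
class SimpleTracker:
    """Toy tracker: greedily match detections to last frame's tracks by IoU."""

    def __init__(self, iou_fn, iou_threshold=0.3):
        self.iou_fn = iou_fn              # e.g. the iou() helper sketched earlier
        self.iou_threshold = iou_threshold
        self.tracks = {}                  # track_id -> last known box
        self._next_id = 0

    def update(self, detections):
        assignments = {}
        unmatched = dict(self.tracks)
        for box in detections:
            best_id, best_iou = None, self.iou_threshold
            for track_id, prev_box in unmatched.items():
                overlap = self.iou_fn(box, prev_box)
                if overlap > best_iou:
                    best_id, best_iou = track_id, overlap
            if best_id is None:
                best_id = self._next_id   # new object enters the scene
                self._next_id += 1
            else:
                del unmatched[best_id]    # each track matches at most one detection
            assignments[best_id] = box
        self.tracks = assignments         # tracks with no detection this frame are dropped
        return assignments
```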
The Deep Perception platform also provides an events emitter plug-in that can execute higher level business logic and emit events directly from the media pipeline. An event could be as simple as an object-detected notification, or something more complex that, for example, keeps track of the number of people in a room and sends an alert if capacity limits are exceeded. Similar to object tracking, we provide a variety of templates to cover common use cases, but our framework allows developers to bring their own code if needed.
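As a sketch of what that business logic can look like, the emit() helper and the capacity value below are illustrative rather than part of our API:

```python
import queue
import time

events = queue.Queue()

def emit(event_type, **payload):
    """Place a structured event on the outgoing queue."""
    events.put({"type": event_type, "timestamp": time.time(), **payload})

def check_capacity(tracked_objects, room_capacity=50):
    """Count tracked people and raise an alert event if the room is over capacity."""
    people = [t for t in tracked_objects.values() if t.get("label") == "person"]
    emit("occupancy", count=len(people))
    if len(people) > room_capacity:
        emit("capacity_exceeded", count=len(people), limit=room_capacity)
```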
The emitted events are placed on a message queue where various integration adapters outside of the media pipeline can handle the processing from there. Deep Perception has built integrations with VMS systems and traditional analytics systems and in some cases can provide these as supported components if needed.
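One possible adapter, shown purely as an example, drains that queue and publishes each event to an MQTT broker. The paho-mqtt package and the broker.local hostname are assumptions for the sake of the sketch, not requirements of the platform:

```python
import json
import queue
import paho.mqtt.client as mqtt

def run_adapter(events: queue.Queue, broker_host="broker.local", topic="video/insights"):
    """Forward pipeline events to an MQTT topic. Assumes paho-mqtt >= 2.0."""
    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    client.connect(broker_host, 1883)
    client.loop_start()                   # network loop runs in a background thread
    while True:
        event = events.get()              # blocks until the pipeline emits an event
        client.publish(topic, json.dumps(event))
```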
Deploying computer vision applications at the edge saves significant costs over cloud deployments and unlocks the ability to build low latency use cases. Recently a variety of new edge AI accelerators have come to market, offering an extremely attractive alternative to GPUs for inference workloads from both a cost and efficiency perspective.
Deep Perception makes it easy to build edge AI applications that take advantage of these innovative accelerators providing a much needed alternative to expensive and power hungry GPUs. Contact Deep Perception today for a free evaluation version of our platform and to discuss what edge AI platform is best for your use case.