Object Detection in Computer Vision

Author:
Innokrea Team
Date of publication: 2025-07-24
Categories: AI, Innovation

After a series of previous articles, we now understand how neural networks work and what classification means in the context of computer vision. If you are not yet familiar with these topics, we encourage you to start with our earlier articles:

https://www.innokrea.com/machine-learning-part-1-is-it-worth-it/

https://www.innokrea.com/neural-networks-introduction/

Today, we will take it a step further and focus on object detection: what it is, how it differs from “regular” image classification, what it is used for, and how one of the detection algorithms (YOLO) can be applied in practice.

 

What is Object Detection?

Object detection is a process that involves both classifying and locating objects within an image (or a series of images, such as a video). The goal of detection, in addition to recognizing what is in the image (e.g., a dog, a car, a license plate), is to specify exactly where these objects are by defining a surrounding bounding box. In practice, this means that an object detection model returns the coordinates of rectangular bounding boxes and assigns the appropriate label to the object contained within each of them. With this approach, multiple object classes can be identified in a single image; objects may repeat or even contain one another within a single sample. It is also possible that no objects are detected in the image at all.
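
To make this output format concrete, here is a minimal, hypothetical sketch of the kind of structure a detector might return – the class and field names are illustrative and not taken from any particular library:

```python
# Illustrative only: the Detection class below is a made-up example of the
# information an object detector returns, not an API of any specific library.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # predicted class, e.g. "dog" or "car"
    confidence: float  # model confidence in [0, 1]
    x: int             # top-left corner of the bounding box (pixels)
    y: int
    width: int         # box size (pixels)
    height: int

# One image can yield zero, one, or many detections, possibly of the same class:
detections = [
    Detection("car", 0.93, x=40, y=120, width=210, height=95),
    Detection("car", 0.88, x=300, y=110, width=180, height=90),
    Detection("person", 0.81, x=520, y=140, width=60, height=150),
]
```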

For example: in an image of a street with cars, pedestrians, and cyclists, a model trained to recognize vehicles and faces will be able to identify where the cars, bicycles, and other road users are – sometimes even if a driver’s face is completely enclosed within the car’s bounding box.

 

Object Detection in Practice: YOLO

One of the most popular object detection algorithms is YOLO (You Only Look Once). YOLO identifies objects and their locations in a single pass over the image: it proposes many candidate bounding boxes, evaluates the likelihood that an object is inside each box (against a chosen confidence threshold), and then discards redundant boxes so that each object ends up with exactly one bounding box that describes it best.

YOLO first divides the image into a grid with a fixed number of cells, and then, for each cell, it looks for a bounding box whose center falls inside that cell and which has the highest probability of representing one of the classes. For each box, the model predicts its dimensions (x, y – the coordinates of the box center, plus its width and height) along with the probabilities of the object belonging to the individual classes.
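
As a rough numeric illustration of this encoding (the grid size, input resolution, and box values below are example assumptions, not a specification of any particular YOLO version):

```python
# A small illustrative calculation, assuming a YOLO-style S x S grid.
S = 13                                   # grid size: the image is split into 13 x 13 cells
img_w, img_h = 416, 416                  # example square network input
cell_w, cell_h = img_w / S, img_h / S    # 32 x 32 pixels per cell

# Suppose cell (row=6, col=4) predicts a box whose center lies inside that cell.
# YOLO-style models typically encode the center relative to the cell and the
# size relative to the whole image:
cx_rel, cy_rel = 0.5, 0.3   # center offset within the cell, in [0, 1]
w_rel, h_rel = 0.25, 0.4    # box size as a fraction of the whole image

# Convert to absolute pixel coordinates:
center_x = (4 + cx_rel) * cell_w              # 144.0
center_y = (6 + cy_rel) * cell_h              # 201.6
box_w, box_h = w_rel * img_w, h_rel * img_h   # 104.0, 166.4
top_left_x = center_x - box_w / 2             # 92.0
top_left_y = center_y - box_h / 2             # 118.4
print(top_left_x, top_left_y, box_w, box_h)
```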

YOLO works differently from traditional object detection methods. In traditional approaches, the image is divided into regions, and detection is run separately for each of those regions, sometimes recursively. YOLO, however, treats the image as a whole and, as described above, evaluates both the object classes and their locations in a single pass.

 

Fig. 1: Example output of YOLO object detection.

 

Advantages of YOLO

The main advantage of using YOLO is its speed – it is one of the fastest algorithms for object detection, making it suitable for real-time applications. YOLO also achieves high accuracy – it can detect objects of various sizes, even in cases where objects partially obscure each other. Another advantage is that YOLO can be easily trained on new datasets and adapted to various detection problems.

 

Limitations of YOLO

While YOLO is one of the most commonly chosen object detection algorithms, it does have limitations. One of them is the difficulty in detecting very small objects compared to other algorithms (e.g., Faster R-CNN). This limitation arises from the mechanics of the model – each grid cell is responsible for detecting only one object, so if the centers of two objects fall within the same cell, only one of them will be detected. Additionally, depending on the version of YOLO, there may be a trade-off between speed and detection accuracy – YOLO models come in several generations and various sizes (tiny, small, medium, etc.), with newer and larger models offering more precise results at the cost of slower processing.

 

YOLO in Practice

There are many ready-to-use libraries that offer pre-implemented versions of the YOLO model and repositories of pre-trained weights for various use cases. Regardless of the chosen class set or network version, the detection process generally follows a similar approach. It typically involves:

  • Loading the pre-trained model (hyperparameter configuration, weights);
  • Loading class names;
  • Loading the image(s) on which object detection is to be performed;
  • Preprocessing the input image so the detection algorithm can work effectively (typically including pixel-value normalization, resizing to the network’s square input size, and, depending on the implementation, reordering the color channels, e.g., from BGR to RGB);
  • Using the loaded model to detect objects – potential bounding boxes are generated based on the number of grid cells YOLO divides the image into (this depends on the selected model size);
  • Among the defined bounding boxes, selecting those that best represent the detected objects and eliminating duplicates. Non-Max Suppression [1] can be used to discard less confident boxes that overlap too strongly with a better one (a simplified sketch follows this list);
  • Displaying the detected bounding boxes on the image along with class labels and model confidence or saving the coordinates for further processing.
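
To make the Non-Max Suppression step more concrete, here is a simplified, self-contained sketch of the idea (in a real pipeline you would usually call a library routine such as OpenCV’s cv2.dnn.NMSBoxes instead; the boxes and scores below are made-up example values):

```python
# A simplified, illustrative sketch of Non-Max Suppression (NMS), assuming
# boxes are given as (x, y, width, height) with an associated confidence score.
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept  # indices of the boxes to keep

# Example: two strongly overlapping "car" boxes and one separate box.
boxes = [(40, 120, 210, 95), (45, 118, 205, 100), (300, 110, 180, 90)]
scores = [0.93, 0.80, 0.88]
print(non_max_suppression(boxes, scores))  # [0, 2] – the duplicate box is dropped
```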

A Python library that makes it easy to use YOLO is OpenCV. A script example can be found in the package documentation [2].
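
Below is a hedged, end-to-end sketch of the pipeline described above, using OpenCV’s dnn module with YOLOv3 weights in the Darknet format. The file names (yolov3.cfg, yolov3.weights, coco.names, street.jpg) and thresholds are placeholders to adapt to your own setup, not values prescribed by the library:

```python
# Sketch of YOLO detection with OpenCV's dnn module; file names are placeholders.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # load the pre-trained model
with open("coco.names") as f:                                     # load class names
    classes = [line.strip() for line in f]

image = cv2.imread("street.jpg")
h, w = image.shape[:2]

# Preprocess: scale pixels to [0, 1], resize to the square network input, swap BGR -> RGB.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, confidences, class_ids = [], [], []
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:  # confidence threshold
            # YOLO returns the box center and size as fractions of the image.
            cx, cy, bw, bh = detection[:4] * np.array([w, h, w, h])
            x, y = int(cx - bw / 2), int(cy - bh / 2)
            boxes.append([x, y, int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)

# Drop duplicate boxes with Non-Max Suppression, then draw the survivors.
keep = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
for i in np.array(keep).flatten():
    x, y, bw, bh = boxes[i]
    label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
    cv2.rectangle(image, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
    cv2.putText(image, label, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

cv2.imwrite("detections.jpg", image)
```

Running the script saves detections.jpg with the surviving bounding boxes, class labels, and confidences drawn on the input image.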

 

Summary

Object detection is one of the key challenges in the field of computer vision, allowing systems to understand both what is in an image and exactly where the objects are located. Compared to image classification, object detection not only assigns labels but also locates the objects. Ready-to-use algorithms and pre-trained models are available that can efficiently detect objects in images, and applying them in basic cases requires only a few lines of Python code. That’s all for today!

 

References:

[1] https://medium.com/analytics-vidhya/non-max-suppression-nms-6623e6572536

[2] https://opencv-tutorial.readthedocs.io/en/latest/yolo/yolo.html
