Detecting Objects in an Image

This page describes how to detect other vehicles in an image from a forward-facing camera using computer vision techniques with OpenCV. Below is a walkthrough of the fundamental principles and methods used in the project and its Python implementation. The information supplied is not exhaustive, and it is recommended to view the source code and comments on GitHub for a deeper understanding of the project. A link to the GitHub repository is here. A video of the project in action can be found here.


The project can be combined with driving lane line detection; if you wish to learn more about how to perform lane detection, a link is provided here.



Accomplishing object detection with a traditional computer vision technique involves the following steps.

     1.  Slide varying sized windows over the image.

     2.  Process the image data within each sliding window.

     3.  Classify the processed data with a supervised learning model.

     4.  Locate the position of the object within the image.


Sliding Window Search



The first part of detecting objects in an image is to break the image down into smaller images and then extract the features within those smaller images. The technique we will use is the sliding window method: a box of a set size, called a window, moves across and down the image so that only the pixel data within that window is processed.


As objects will appear at varying sizes, dependent on their distance from the camera, multiple window sizes will be employed to locate objects both close to and distant from the camera. The code below is used to scale the window sizes and call the sliding window search functions.

while px_count >= 48:
    windows = self.slide_window(image, x_start_stop=[None, None], y_start_stop=[ystart, ystop],
                                xy_window=(px_count, px_count), xy_overlap=(0.5, 0.5))

    box_list.extend(self.search_windows(image, windows, self.svc, self.X_scaler,
                                        spatial_size=spatial_size, hist_bins=hist_bins,
                                        orient=orient, pix_per_cell=pix_per_cell,
                                        hog_channel=self.hog_channel, spatial_feat=spatial_feat,
                                        hist_feat=hist_feat, hog_feat=hog_feat))

    # make the search window smaller
    px_count = int(px_count // 1.5)


The initial window size is the full height of the search area, to help find close objects. After each pass of the sliding windows and window classification, the window size is divided by 1.5, down to a minimum of 48 pixels, to help classify objects that are further away or smaller.
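As a concrete illustration of the scaling loop, here are the window sizes it would produce from a hypothetical starting size of 162 pixels (the real starting value depends on the configured search-area height):

```python
# Hypothetical starting window size; in the project it is derived from
# the height of the search area
px_count = 162
sizes = []
while px_count >= 48:
    sizes.append(px_count)
    # make the search window smaller, as in the loop above
    px_count = int(px_count // 1.5)

print(sizes)  # [162, 108, 72, 48]
```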


Because the image is processed as many sub-images, the search can require heavy computation, reducing system performance. To help minimise this, each image is only searched in the area in which objects are expected, which in our case is the bottom half of the image.


Each search window overlaps its neighbouring windows by 50% (xy_overlap=(0.5, 0.5)). This balances the amount of overlap, preventing redundant calculations, against the risk of missed or false detections at the borders of the search windows. The function self.slide_window() creates a list of each window's edge pixel positions so that the self.search_windows() function can search each window individually for any of the trained objects.
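The project's slide_window() itself is not reproduced on this page. A minimal sketch of how such a function can generate window coordinates is shown below; the parameter names follow the calls shown above, but the body is an assumption rather than the project's exact implementation:

```python
import numpy as np

def slide_window(image, x_start_stop=(None, None), y_start_stop=(None, None),
                 xy_window=(64, 64), xy_overlap=(0.5, 0.5)):
    # Default the search region to the full image
    x_start = x_start_stop[0] or 0
    x_stop = x_start_stop[1] or image.shape[1]
    y_start = y_start_stop[0] or 0
    y_stop = y_start_stop[1] or image.shape[0]

    # Step size in pixels for the chosen overlap
    x_step = int(xy_window[0] * (1 - xy_overlap[0]))
    y_step = int(xy_window[1] * (1 - xy_overlap[1]))

    # Collect ((x1, y1), (x2, y2)) corners for every window position
    windows = []
    for y in range(y_start, y_stop - xy_window[1] + 1, y_step):
        for x in range(x_start, x_stop - xy_window[0] + 1, x_step):
            windows.append(((x, y), (x + xy_window[0], y + xy_window[1])))
    return windows

# A 128x128 image with 64x64 windows and 50% overlap gives a 3x3 grid
windows = slide_window(np.zeros((128, 128, 3)), xy_window=(64, 64))
print(len(windows))  # 9
```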


# Searches a list of windows to extract features present in the window
# img : image for feature extraction
# windows : list of each window's bounding box pixel positions
# clf : supervised classifier
# scaler : per-column scaler for the classifier
# spatial_size : size of each grid box
# hist_bins : no. of groups for the colour histogram features
# hist_range : min and max values for the colour histogram
# orient : number of different orientations of the HOG
# pix_per_cell : no. of pixels per cell
# cell_per_block : no. of cells per block
# hog_channel : which image channel to do the processing on
# spatial_feat : True => include the image spatial features
# hist_feat : True => include the image colour histogram features
# hog_feat : True => include the image HOG features
# vis : True => will return an image of the HOG
# feature_vec : return HOG data as a feature vector
def search_windows(self, img, windows, clf, scaler, spatial_size=(32, 32), hist_bins=32,
                   hist_range=(0, 256), orient=9, pix_per_cell=8, cell_per_block=2,
                   hog_channel=0, spatial_feat=True, hist_feat=True, hog_feat=True,
                   vis=False, feature_vec=False):
    on_windows = []
    # Iterate over all windows in the list
    for window in windows:
        # Extract the test window from the original image and resize
        test_img = cv2.resize(img[window[0][1]:window[1][1], window[0][0]:window[1][0]], (64, 64))

        # Extract features for that window
        features = self.get_object_features(test_img, spatial_size=spatial_size,
                                            hist_bins=hist_bins, orient=orient,
                                            pix_per_cell=pix_per_cell,
                                            cell_per_block=cell_per_block, hist_range=hist_range,
                                            hog_channel=hog_channel, spatial_feat=spatial_feat,
                                            hist_feat=hist_feat, hog_feat=hog_feat,
                                            vis=vis, feature_vec=feature_vec)

        # Scale the extracted features to be fed to the classifier and predict
        test_features = scaler.transform(np.array(features).reshape(1, -1))
        prediction = clf.predict(test_features)

        # If positive prediction, save the window
        if prediction == 1:
            on_windows.append(window)

    return on_windows


Processing each window



Each window extracted from the image is rescaled to a fixed size, creating consistent data for feature extraction. We can then pass this 64x64 image to the function which extracts the features for image classification. The function is as follows.


# Gets the supplied image's features using hog_extract(),
# bin_spatial() and color_hist()
# image : supplied image for feature extraction
# spatial_size : size of each grid box
# orient : number of different orientations of the HOG
# hist_bins : no. of groups for the colour histogram features
# hist_range : min and max values for the colour histogram
# pix_per_cell : no. of pixels per cell
# cell_per_block : no. of cells per block
# hog_channel : which image channel to do the processing on
# vis : True => will return an image of the HOG
# feature_vec : return HOG data as a feature vector
# spatial_feat : True => include the image spatial features
# hist_feat : True => include the image colour histogram features
# hog_feat : True => include the image HOG features
# return : combined feature vector for the image
def get_object_features(self, image, spatial_size=None, orient=None,
                        hist_bins=None, hist_range=None, pix_per_cell=None,
                        cell_per_block=None, hog_channel=None, vis=False, feature_vec=False,
                        spatial_feat=None, hist_feat=None, hog_feat=None):
    if spatial_size is None: spatial_size = self.spatial_size
    if hist_bins is None: hist_bins = self.hist_bins
    if hist_range is None: hist_range = self.hist_range
    if orient is None: orient = self.orient
    if pix_per_cell is None: pix_per_cell = self.pix_per_cell
    if cell_per_block is None: cell_per_block = self.cell_per_block
    if hog_channel is None: hog_channel = self.hog_channel
    if spatial_feat is None: spatial_feat = self.spatial_feat
    if hist_feat is None: hist_feat = self.hist_feat
    if hog_feat is None: hog_feat = self.hog_feat

    image = self.convert_color(image)

    # Call hog_extract() to get the HOG features
    hog_features = np.ravel(self.hog_extract(image, orient=orient, pix_per_cell=pix_per_cell,
                            cell_per_block=cell_per_block, hog_channel=hog_channel,
                            vis=vis, feature_vec=feature_vec))

    # Apply bin_spatial() to get spatial colour features
    spatial_features = self.bin_spatial(image, size=spatial_size)

    # Bin the colour values
    hist_features = self.color_hist(image, nbins=hist_bins, bins_range=hist_range)
    features = np.concatenate((spatial_features, hist_features, hog_features))

    # Return the combined feature vector
    return features


To reduce the effects of changes in road surface, glare from the sun or headlights, darkness at night, and shadows cast on the road surface, the input window has its colour space converted to aid object detection under different environmental conditions. The colour space that has been observed to be the most consistent and reliable is the Hue, Saturation and Lightness (HSL) colour space.


The first features we will extract are the Histogram of Oriented Gradients, or HOG. HOG takes an image, in this case 64 x 64 pixels, and computes its gradients. The gradient image is then separated into a grid of cells, and in each cell we calculate the orientation and magnitude of each gradient. These orientations are then binned into a histogram, with the histogram weighted by the gradient magnitudes. This effectively lets each cell vote for which orientation it thinks the gradient is. We then combine neighbouring cells into blocks, which provides contrast normalisation, and finally collect all these blocks to form the HOG features. Luckily, as we are using Python, the skimage.feature library has a built-in function called hog that we will use. If you wish to learn more about HOG, check out this Scikit-image page or this YouTube video.


The code to extract the HOG is as simple as follows.


# img : supplied image for feature extraction
# orient : number of different orientations of the HOG
# pix_per_cell : no. of pixels per cell
# cell_per_block : no. of cells per block
# feature_vec : return HOG data as a feature vector
# return : HOG features for the image channel

# note: older scikit-image versions spell the parameter 'visualise'
features = hog(img, orientations=orient, pixels_per_cell=(pix_per_cell, pix_per_cell),
               cells_per_block=(cell_per_block, cell_per_block), transform_sqrt=True,
               visualize=vis, feature_vector=feature_vec)


An example of a HOG image with orientations=9, pixels_per_cell=(13, 13) and cells_per_block=(2, 2) can be viewed below.
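As a quick sanity check of the HOG output size, with the commonly used defaults of orientations=9, pixels_per_cell=(8, 8) and cells_per_block=(2, 2) (treat these as assumed values, not necessarily the project's), a 64 x 64 window has an 8 x 8 grid of cells, giving 7 x 7 block positions and 7 * 7 * 2 * 2 * 9 = 1764 features per channel:

```python
import numpy as np
from skimage.feature import hog

# A random single-channel 64x64 window stands in for a real image patch
window = np.random.rand(64, 64)

# 8x8 cells -> 7x7 block positions, each block holding 2*2*9 values
features = hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
print(features.shape)  # (1764,)
```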




The second set of features we extract comes from spatial binning. Spatial binning helps reduce the effects of observation error in the image. This is accomplished by resizing the image, from 64 x 64 to 12 x 12 in our case, and converting the data to a one-dimensional feature vector as follows.


# Computes the image's spatially binned colour features
# img : input image data to extract features from
# size : 2D tuple with the target (width, height) to resize to
# return : feature vector
def bin_spatial(self, img, size=None):
    if size is None: size = self.spatial_size

    # Use cv2.resize().ravel() to create the feature vector
    features = cv2.resize(img, size).ravel()

    return features



The final features to be included for classification are a histogram of the image's colours. As vehicles come in a variety of colours, this feature mainly aids in separating vehicles from the environment, as most cars have a distinct difference between their colours and their surroundings.


# Computes the image's colour histogram
# img : input image data to extract features from
# nbins : number of bins to group colour values into
# bins_range : the lower and upper value limits to be included in the bins
# return : the concatenated channel histogram feature vector
def color_hist(self, img, nbins=None, bins_range=None):
    if nbins is None: nbins = self.hist_bins
    if bins_range is None: bins_range = self.hist_range

    # Compute the histogram of the colour channels separately
    channel1_hist = np.histogram(img[:,:,0], bins=nbins, range=bins_range)
    channel2_hist = np.histogram(img[:,:,1], bins=nbins, range=bins_range)
    channel3_hist = np.histogram(img[:,:,2], bins=nbins, range=bins_range)

    # Concatenate the histograms into a single feature vector
    hist_features = np.concatenate((channel1_hist[0], channel2_hist[0], channel3_hist[0]))

    return hist_features
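As a quick check of the colour histogram's output shape, with hist_bins=32 the three concatenated channel histograms give a 96-element feature vector (a standalone numpy demonstration of the same computation):

```python
import numpy as np

# A synthetic 8x8 three-channel image with values in [0, 256)
img = np.random.randint(0, 256, size=(8, 8, 3))

# With 32 bins per channel, the concatenated histograms hold
# 3 * 32 = 96 counts, summing to the total pixel count per channel
hists = [np.histogram(img[:, :, c], bins=32, range=(0, 256))[0]
         for c in range(3)]
hist_features = np.concatenate(hists)
print(hist_features.shape)  # (96,)
```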



We have now extracted all the features that we require to classify an object. We can now train our supervised classifier.



Classifying objects from the features 


As we are, for this example, classifying only a single class of object, car or not car, we will train and use a Support Vector Machine (SVM) classifier, as it performs well in complicated domains with a clear margin of separation.


The training code for the SVM can be found below.


# Trains an SVM classifier with a car, non-car dataset
# obj_data : data of the images containing the correct class
# not_obj_data : data images of false information
def train_svm(self, obj_data, not_obj_data):
    img_data = self.get_training_data([obj_data, not_obj_data])
    car_features = []
    notcar_features = []

    # Extract the features from each class of training image
    for image in img_data[0]:
        img = cv2.imread(image)
        car_features.append(self.get_object_features(img))

    for image in img_data[1]:
        img = cv2.imread(image)
        notcar_features.append(self.get_object_features(img))

    # Create an array stack of feature vectors
    X = np.vstack((car_features, notcar_features)).astype(np.float64)
    # Define the labels vector
    y = np.hstack((np.ones(len(car_features)), np.zeros(len(notcar_features))))

    # Fit a per-column scaler
    self.X_scaler = StandardScaler().fit(X)
    # Apply the scaler to X
    scaled_X = self.X_scaler.transform(X)

    # Split up data into randomised training and test sets
    rand_state = np.random.randint(0, 100)
    X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.2,
                                                        random_state=rand_state)

    # Use a linear SVC
    self.svc = LinearSVC()
    # Train the SVC
    self.svc.fit(X_train, y_train)
    # Check the score of the SVC
    print('Test Accuracy of SVC = ', round(self.svc.score(X_test, y_test), 4))



The dataset used for training contained two classes, car and non-car. This data can be downloaded from the GitHub repository here. All the images to be trained and tested with must first have their features extracted, as outlined above. This is accomplished with the following lines.


for image in img_data[0]:
    img = cv2.imread(image)
    car_features.append(self.get_object_features(img))

for image in img_data[1]:
    img = cv2.imread(image)
    notcar_features.append(self.get_object_features(img))



Once we have the engineered features, we stack them together, create their class labels, and randomly shuffle them. It is now time to train and test our classifier.


# Use a linear SVC
self.svc = LinearSVC()
# Train the SVC
self.svc.fit(X_train, y_train)
# Check the score of the SVC
print('Test Accuracy of SVC = ', round(self.svc.score(X_test, y_test), 4))


With the supplied dataset and the program's default parameters, the SVM achieves a high test accuracy (the exact figure is printed by the training code). When the prediction returns a one, we classify the image as containing a car, and a zero for non-car.


This trained classifier is then used to predict whether the window being searched contains a car. It is called after the features are extracted, as can be seen in the search_windows() code block above.



Location of objects in the image 


So far we have created a sliding window, extracted features from the window and classified if the image contains a car or no car. If we test this with an image we get the following.


The process so far classifies the majority of the windows correctly, but there are two problems:

   1.  There are false predictions.

   2.  The windows don't perfectly align with the vehicles' positions.


To fix these issues we will use a method called a heat-map. A heat-map gives each window's pixels a value of one and then sums the pixel values of all the windows, so that where two windows overlap, the pixel values of the overlap add to two, and so on for multiple overlapping windows. This means areas with multiple vehicle classifications accumulate larger values than incorrect classifications, providing a higher-confidence prediction of the location of a vehicle. The code below creates a heat-map of the data.


# Combines overlapping boxes to create a heat-map
# heatmap : 2D array with a shape equal to or greater than the coordinates in bbox_list
# bbox_list : list of box coordinates
# return : heat-map of the box coordinates
def add_heat(self, heatmap, bbox_list):
    # Iterate through the list of boxes
    for box in bbox_list:
        # Add += 1 for all pixels inside each bbox
        # Assuming each "box" takes the form ((x1, y1), (x2, y2))
        heatmap[box[0][1]:box[1][1], box[0][0]:box[1][0]] += 1
    # Return the updated heatmap
    return heatmap
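A quick standalone demonstration of add_heat() on two overlapping detections (the function body is repeated here, without self, so the snippet runs on its own):

```python
import numpy as np

def add_heat(heatmap, bbox_list):
    # Same logic as the method above, minus self
    for box in bbox_list:
        heatmap[box[0][1]:box[1][1], box[0][0]:box[1][0]] += 1
    return heatmap

# Two overlapping boxes: the shared region accumulates a value of 2,
# the non-overlapping parts stay at 1, and untouched pixels stay at 0
heat = add_heat(np.zeros((10, 10)), [((0, 0), (6, 6)), ((4, 4), (10, 10))])
print(heat[5, 5], heat[1, 1], heat[9, 0])  # 2.0 1.0 0.0
```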


The output of the heat-map from the above image is below.



To eliminate noise and false predictions, all we need to do is threshold the heat-map at a chosen value. This is done as follows.


# Removes low counts from the heat-map
# heatmap : 2D array of a heat-map
# threshold : values at or below this will be removed from the map
# return : the thresholded heat-map
def apply_threshold(self, heatmap, threshold):
    # Zero out pixels at or below the threshold
    heatmap[heatmap <= threshold] = 0

    return heatmap


With a threshold of two, the binary output of the heat-map is as follows.
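The step that turns the thresholded heat-map into bounding boxes is not shown in the excerpts above. One common approach, sketched here as an assumption about the implementation, is to label the connected regions with scipy.ndimage.label and take each region's extents:

```python
import numpy as np
from scipy.ndimage import label

# A thresholded heat-map with two separate hot regions
heatmap = np.zeros((10, 10))
heatmap[1:4, 1:4] = 1
heatmap[6:9, 5:9] = 1

# label() assigns a distinct integer to each connected region
labels, n_objects = label(heatmap)

# Take each labelled region's extents as a bounding box
bboxes = []
for obj in range(1, n_objects + 1):
    ys, xs = np.nonzero(labels == obj)
    bboxes.append(((xs.min(), ys.min()), (xs.max(), ys.max())))

print(n_objects)  # 2
```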



We have now predicted the locations of the cars within the input image. From this data there is one final step: to draw bounding boxes around the vehicles and overlay them onto the input image.


# Draws each box in bboxes onto the image
# img : image to draw on
# bboxes : list of bounding box coordinates
# color : box edge colour
# thick : thickness of the box edge
# return : image with boxes drawn
def draw_boxes(self, img, bboxes, color=(0, 0, 255), thick=6):
    # Make a copy of the image
    imcopy = np.copy(img)

    # Iterate through the bounding boxes
    for bbox in bboxes:
        # Draw a rectangle given the bbox coordinates
        cv2.rectangle(imcopy, bbox[0], bbox[1], color, thick)

    return imcopy


This produces the final image.



That is it. 


There is, however, a more modern and quicker method of locating objects in an image: deep learning. The code available on GitHub has the option to use the HOG method described above or to use YOLO, an object detection method built on convolutional neural networks. The benefit of using YOLO is that it has been specifically designed for object detection with speed as the key factor. The program uses the Darkflow library, and there is no need to train the model, as you can download a pre-trained model that can classify multiple objects. The video available for object detection uses the YOLO method.

The complete project code can be viewed and downloaded from GitHub here.

A video of the lane detection system with object detection can be viewed below.







I hope you enjoyed reviewing the project. Let's get in touch; LinkedIn messages work best.
