
Quantifying Generalization: CVEDIA Detection Technology vs. Seven Open Source Datasets

A research study exploring biases in open source image classification datasets and demonstrating how CVEDIA's synthetic data technology outperforms seven major open source datasets in cross-domain generalization.


Synthetic data has emerged in the academic machine learning community as an alternative to real-world data collection. In this research, we explore biases found in open source image classification datasets, as well as issues present in common data annotation methods. We built a detection model using CVEDIA's synthetic data technology to counteract the biases present in seven open source datasets, and conclude that the CVEDIA model generalizes better than models trained on any of the open datasets analyzed.

Background

It has long been recognized in the machine learning community that many of the barriers to robust AI models lie in the data that trains and validates them. Data collection itself is often costly and time-consuming, while open datasets are often broadly defined, unspecific to individual projects, and carry their own sets of biases.

In our own research, we've also found that manual annotation is not without issue: the misclassification of confusing and small objects, cultural and contextual differences in definitions, and objects simply missed in the background in favour of foreground objects all pose problems for the generalization of a dataset.

Image 1.0 - A toy train cake tagged as 'train', showing contextual confusion
Image 1.1 - A road sign with a car icon tagged as 'car', showing contextual confusion
Image 1.2 - A highway scene with missing annotations in the Boxy Vehicles dataset

Synthetic data has been shown to be a viable alternative to real-world data when used correctly. However, knowledge about the use and efficacy of synthetic data has been largely confined to academic circles. CVEDIA found that mentions of the term “synthetic data” have increased rapidly over time in academic papers on ArXiv.org, signalling a growing interest in, and need for, the technology among data scientists.

Table 1.0 - Mentions of 'synthetic data' and 'synthetic dataset' as a percentage of all papers on ArXiv.org, 1991-2019

CVEDIA speculates that, due to the complex, visual nature of computer vision training sets, it is not possible with current technology to create an unbiased real-world dataset: there is no effective way for its creators to fully gauge or represent the biases present in the real world. Taking the synthetic data route, however, allows for a model-based design in which data can be created specifically to counteract biases.

Methodology

Our data science team selected seven popular open datasets, including both synthetic and real image datasets. We chose datasets with a variety of conditions to compare different types of bias: certain datasets have a fixed point of view (POV) while others are mixed; some have day and night conditions, others only day; some mix camera qualities, others use a single camera; and so on.

Our hypothesis is that CVEDIA's synthetic algorithms generalize better across domains; therefore, models trained using CVEDIA technology should perform better when presented with a novel domain.

Training Conditions

  • All models were trained from the same base model: RefineDet512 with a VGG16 backbone, pretrained on COCO
  • All models were trained on the full extent of their target dataset - no sampling was applied
  • We fine-tuned hyperparameters for each dataset to ensure the highest score possible
  • All models were trained using SGD
  • All models were trained until they reached the highest combined precision (mAP) and recall (mAR) - the selection rule is sketched after this list
  • Batch size was fixed at 4 for both training and validation
  • A single existing class, Car, was fine-tuned
  • Standard Caffe augmentation stack was used
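As a rough illustration of these conditions, the sketch below captures the shared configuration and one plausible reading of the cutoff rule (picking the checkpoint that maximizes mAP + mAR). All names here are hypothetical, not CVEDIA's actual training code.

```python
# Illustrative sketch only - field names and the selection rule are
# assumptions based on the training conditions listed above.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    base_model: str = "RefineDet512-VGG16-COCO"  # shared backbone for every run
    optimizer: str = "SGD"
    batch_size: int = 4            # fixed for both training and validation
    target_class: str = "Car"      # the single fine-tuned class
    use_full_dataset: bool = True  # no sampling applied

def select_best_checkpoint(checkpoints):
    """Pick the checkpoint with the best combined precision/recall.

    Summing mAP and mAR is one plausible reading of 'highest precision
    and recall combination'; a harmonic mean would be another.
    """
    return max(checkpoints, key=lambda c: c["mAP"] + c["mAR"])
```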

Validation Conditions

  • Models were measured against each validation dataset every 100 epochs
  • All dataset annotations were converted to a common format without any data loss - a minimal conversion example follows this list
  • Original dataset images were untouched and used as-is
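The post does not name the common annotation format, so the sketch below assumes a simple corner-coordinate convention; since corner and width/height boxes encode the same information, the conversion is lossless.

```python
# Minimal sketch of a lossless annotation conversion; the record schema
# is an assumption, not the actual format of any dataset in this study.

def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [x1, y1, x2, y2].

    Both conventions carry identical information, so converting every
    dataset to one shared format loses no data.
    """
    x, y, w, h = box
    return [x, y, x + w, y + h]

record = {"image": "frame_000.png", "class": "Car", "bbox": [10, 20, 50, 40]}
record["bbox"] = xywh_to_xyxy(record["bbox"])  # -> [10, 20, 60, 60]
```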

Cross Domain Test

Each dataset's best-performing model was then validated against each of the other datasets. This produced a cross-domain score matrix that allows us to compare performance across domains.
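Conceptually, the test is a nested loop over models and validation sets. In this sketch, `evaluate` is a hypothetical stand-in for running one trained model against one validation set and returning its score.

```python
# Hedged sketch of the cross-domain evaluation loop; `evaluate` is a
# hypothetical callable returning, e.g., mAP for a (model, dataset) pair.

def cross_domain_matrix(models, datasets, evaluate):
    """Score every best-per-dataset model against every validation set."""
    return {
        model_name: {
            dataset_name: evaluate(model, dataset)
            for dataset_name, dataset in datasets.items()
        }
        for model_name, model in models.items()
    }
```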

CVEDIA Model

  • The CVEDIA model was trained exclusively on synthetic data created by our simulation engine, SynCity, which produces high-fidelity, realism-based imagery
  • We applied the same base model, fine-tuning, and optimization process as for all other models
  • The training cutoff strategy was the same as for the other models

Datasets Analyzed

| Name | Description | Source | Size (images) | POV |
| --- | --- | --- | --- | --- |
| Apollo | Small dataset comprising 100 real images for validation and training | Real | 100 | Car bonnet, cropped |
| BDD | Highly varied dataset, including day/night scenarios, different weather conditions, heavy traffic, and camera aberrations | Real | 100k | Car bonnet |
| Boxy Vehicles | Mainly composed of highway shots, with a single POV. Includes 3D annotations | Real | 137k | Car bonnet |
| Cityscapes | Focused on city environments, highly diversified, daylight only | Real | 3.5k | Car bonnet |
| Kitti | Focused on European city environments, daylight only, wide camera angle | Real | 7.5k | Car bonnet, cropped, wide angle |
| P4B | Synthetic dataset created using GTA5. Includes city and highway environments, day/night conditions | Synthetic | 184k | Front bumper |
| Synthia SF | Extensive synthetic dataset with exclusively daylight city scenarios. Not realism-focused | Synthetic | 4.5k | Front bumper |
Image 2.0 - Screenshot of the synthetic dataset P4B
Image 2.1 - Screenshot of the synthetic dataset Synthia SF

Dataset Bias

Image 2.2 - An example of a photorealistic 3D car model used in SynCity

Before we discuss our analysis, we must address the fact that all datasets are inherently flawed. The following is a summary of annotation issues discovered during the testing process:

Apollo: Small (100 images), single POV, camera color bias

BDD: >90% of occluded objects annotated as whole yet inconsistently treated, questionable small object annotations, single POV despite large size

Boxy Vehicles: Single camera type, missing annotations especially for oncoming traffic, inconsistent bounding box sizes

Cityscapes: Single car/camera/POV, daylight only, pristine image quality, missing annotations, inconsistent bounding boxes

Kitti: Single camera/POV, missing annotations, inconsistent bounding boxes, questionable group annotations

P4B: Fully occluded objects annotated, single-pixel annotations, the subject car's own bonnet annotated, mistagged objects

Synthia SF: Single scenario/POV, mistagged objects, fully occluded objects annotated, linear camera

The Bias Problem

Every dataset and algorithm has a unique bias. Even when unnoticeable from a human perspective, biases in datasets are an inherent reality - from lighting conditions, object distribution, camera aberrations, and POV to liberties taken by the person annotating.

The side effects of biases are many:

  • Reduction of the operating envelope of a model
  • Poor accuracy
  • Lack of adaptability cross-domain
  • Misanchored detections

Real data suffers from an additional, inescapable problem: each collected data point carries limited value on its own, and feature-rich content occurs unevenly across the dataset.

Using CVEDIA's proprietary SynCity simulation engine, we were able to control these sources of bias, creating linearly distributed data points specifically targeted at the model using realistic, high-fidelity imagery, and aiming for bias eradication or mitigation at scale.
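To make "linearly distributed data points" concrete: where collected footage inherits whatever distribution the real world happened to offer, a simulator can sample scene parameters uniformly across their ranges. The parameters below are invented for illustration and are not SynCity's actual API.

```python
# Illustrative only - parameter names and ranges are hypothetical,
# showing uniform (linear) sampling of scene conditions rather than
# inheriting the skewed distributions of collected real-world footage.
import random

def sample_scene():
    return {
        "sun_elevation_deg": random.uniform(-10.0, 90.0),  # night through noon
        "camera_height_m":   random.uniform(0.5, 2.5),     # bumper to bonnet POV
        "target_distance_m": random.uniform(5.0, 150.0),   # large to small objects
        "occlusion_ratio":   random.uniform(0.0, 0.8),
    }

scenes = [sample_scene() for _ in range(10_000)]
```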

Results

As expected, models that were trained and validated on the same dataset had the highest performance scores: when a dataset serves as its own reference, its biases go undetected. Looking at how each dataset performed when compared against the other datasets, however, flaws and biases surfaced.

Precision Weighted Average (mAP)

| Model | Score |
| --- | --- |
| SynCity (CVEDIA) | 0.5359 |
| BDD | 0.4175 |
| Cityscapes | 0.4007 |
| COCO | 0.3845 |
| P4B | 0.3320 |
| Kitti | 0.2815 |
| Synthia-SF | 0.2631 |
| Boxy Vehicles | 0.2195 |
| Apollo | 0.2186 |
Table 2.2 - Precision weighted average comparison, sorted by score
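The post does not spell out the weighting behind these averages; one plausible reading, sketched below under that assumption, weights each per-domain score by the size of the validation set it came from.

```python
# Assumption: per-domain scores weighted by validation-set size. The
# numbers below are placeholders, not values from this study's tables.

def weighted_average(scores, weights):
    """Weighted mean of per-dataset scores (e.g. mAP per domain)."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

scores  = {"BDD": 0.45, "Kitti": 0.28, "Cityscapes": 0.40}       # placeholder mAPs
weights = {"BDD": 100_000, "Kitti": 7_500, "Cityscapes": 3_500}  # dataset sizes
print(round(weighted_average(scores, weights), 4))  # 0.4369
```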

Recall Weighted Average (mAR)

| Model | Score |
| --- | --- |
| SynCity (CVEDIA) | 0.5665 |
| BDD | 0.4489 |
| Cityscapes | 0.4338 |
| COCO | 0.4047 |
| P4B | 0.3532 |
| Kitti | 0.3095 |
| Synthia-SF | 0.3011 |
| Apollo | 0.2730 |
| Boxy Vehicles | 0.2486 |
Table 2.3 - Recall weighted average comparison, sorted by score

Performance Analysis

Certain datasets performed worse than others, namely Boxy Vehicles and P4B. In our analysis this appeared to be due to mislabelled annotations, which damage model convergence by making a portion of the training data unlearnable - the model receives mixed signals. Models trained on these datasets performed worse than the baseline metric cross-domain, with significantly reduced precision.

Size Breakdown Results

We defined object size based on bounding box area, following the standard COCO thresholds of 32² (1024) and 96² (9216) pixels; a small classifier for these buckets follows the list:

  • Large: more than 9216 pixels
  • Medium: between 1024 and 9216 pixels
  • Small: less than 1024 pixels
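The bucket boundaries translate directly into code; this is a plain restatement of the definitions above rather than anything from the study's tooling.

```python
# Direct translation of the size buckets above (areas in pixels).

def size_bucket(width: float, height: float) -> str:
    area = width * height
    if area > 9216:    # larger than 96 x 96
        return "large"
    if area >= 1024:   # between 32 x 32 and 96 x 96
        return "medium"
    return "small"     # smaller than 32 x 32

assert size_bucket(100, 100) == "large"
assert size_bucket(50, 50) == "medium"
assert size_bucket(20, 20) == "small"
```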

CVEDIA’s SynCity model achieved the highest scores across all object sizes:

| Size | SynCity mAP | Next Best |
| --- | --- | --- |
| Large | 0.8085 | COCO (0.6759) |
| Medium | 0.5792 | BDD (0.4530) |
| Small | 0.2199 | BDD (0.1408) |

Precision by Object Size

Table 3.2 - Precision weighted average for large objects (over 9216 pixels); SynCity leads at 0.8085
Table 3.3 - Precision weighted average for medium objects (1024-9216 pixels); SynCity leads at 0.5792
Table 3.4 - Precision weighted average for small objects (under 1024 pixels); SynCity leads at 0.2199

Recall by Object Size

Table 3.5 - Recall weighted average for large objects (over 9216 pixels); SynCity leads at 0.8358
Table 3.6 - Recall weighted average for medium objects (1024-9216 pixels); SynCity leads at 0.6342
Table 3.7 - Recall weighted average for small objects (under 1024 pixels); SynCity leads at 0.2296

Conclusion

As the cross-domain score matrices show, CVEDIA's synthetic algorithm performed substantially better than algorithms created from major open source datasets when challenged with different domains and novel features. The CVEDIA algorithm retained the majority of the existing features from the backbone model and even indirectly improved on them beyond the original dataset metrics.

This points to a new method of algorithm creation: no real data or manual labelling was required to reach our scores, suggesting that data requirements can largely be solved with carefully developed synthetic techniques.

Our team also noted that adding real data to the CVEDIA training set increased scores beyond those seen here - a result worth exploring in future tests, and a potential strategy for machine learning teams.

Key Takeaways

  1. Synthetic data outperforms real data cross-domain: CVEDIA's synthetic model achieved 0.5359 mAP vs. 0.4175 for the next best (BDD) - a 28% relative improvement
  2. No real data required: The CVEDIA model was trained exclusively on synthetic data yet outperformed all real-world datasets
  3. Better generalization: Cross-domain performance is where synthetic data truly shines, as it’s not biased to a single camera, location, or condition
  4. Scalable solution: Synthetic data eliminates the need for costly data collection and manual annotation while delivering superior results