Procedural Generation Engines for Vision Training: A Review of the Tooling

Abstract

This review surveys the engines and pipelines a computer-vision team can use to manufacture labelled training imagery, and organises them by the property that actually decides which tool fits a task rather than by the broad label of synthetic data. We separate the public tooling available through the review quarter into five recognisable families and credit each to its authors: photoreal game-engine renderers that recover dense labels from a rendering pipeline [1][2][3]; physically based and Blender-driven offline pipelines that script a renderer to emit images with their annotations [6][9]; structured domain-randomisation generators that abandon realism deliberately so a network learns to ignore it [5]; procedural geometry-and-texture synthesis embodied in engine-level packages [7]; and 2D compositing pipelines that paste rendered objects onto real photographs [4]. We argue these families sort cleanly along one axis the slogan hides, the trade between scene photorealism and the share of pixels a generator can label exactly because it drew them, and that the second property, not the first, is what most segmentation tasks reward. We then locate our own synthetic well-log curve generator, built for the VeerNet raster-log digitiser, at the label-fidelity corner of that map, and report the only quantities we measured: a 15,000-instance synthetic training set drawn at 3,200 to 12,800 pixels wide. The curve generator is a narrow-domain instance of the surveyed tooling, not a new category of it.

The idea of rendering a training set instead of collecting one is older than the deep-learning wave that made it urgent, but it became a practical engineering choice when the rendering pipelines that games and film already used were turned, deliberately, into label factories. The cleanest early demonstration took a commercial driving game and recovered dense semantic labels straight from its rendering pipeline, showing that the engine already knew, per pixel, which object it had drawn [1]. A parallel line did not borrow an engine but built one: a purpose-made virtual city rendered to produce pixel-accurate semantic masks for urban scene segmentation, with the scene composition under the authors' control rather than a game studio's [2]. A third effort cloned a real tracking benchmark inside a game engine so that ground truth could be generated, perturbed, and re-rendered at will, which made the synthetic copy a controllable laboratory for the real dataset [3]. Read together, these three established the game-engine family and its defining advantage: when the renderer draws the scene, the labels are a by-product of the draw, not a separate and error-prone act of human annotation.

A second family grew out of offline, physically based rendering rather than real-time game engines, and its tooling matured into reusable pipelines. A Blender-based system packaged photorealistic rendering with automatic annotation so that a practitioner could script scenes and harvest images with their masks, depth, and poses [6]. Later, an open generator scripted a renderer and a physics engine together to emit images with rich, exact annotations at scale, pushing the offline family toward reproducible, configurable dataset production [9]. The engine vendors themselves entered the picture as well: a package built on a mainstream game engine exposed randomisable scene parameters and label exporters directly, lowering the cost of standing up a generator for a new object class [7]. What unites the offline and engine-package families with the game-engine family is procedurality, the scene is specified by rules and parameters, so a generator can produce an arbitrary number of distinct, fully labelled examples without a human in the loop.

Two further strands complete the period-correct picture and they pull in opposite directions on realism. The domain-randomisation strand argued, against the photorealism instinct, that a generator should vary textures, lighting, and clutter so wildly and so unrealistically that the network stops treating any particular appearance as signal, which was shown to train detectors that transfer to real images [5]. The compositing strand stayed close to real data by pasting rendered content, text glyphs in the canonical case, onto real background photographs, so that the foreground was synthetic and exactly labelled while the background was genuine [4]. The survey literature on synthetic data is the right frame for all of these, because it is explicit that generated imagery substitutes for collected imagery only when the generator captures the variation that matters, and otherwise stops transferring [8]. The architecture that consumes the dense labels these generators emit is, in the cases nearest our own work, the symmetric encoder-decoder with skip connections designed to learn dense labels from scarce data [10]. None of this tooling is ours; the families and their exemplars belong to the groups cited here, and the contribution of this review is to organise them and then add a single honest instance.

Method

This is a tooling review with one embedded instance, and it is worth being exact about where each part comes from. The review portion selects procedural-generation tooling that was public on or before the review quarter and that bears on producing labelled imagery for supervised vision, and it characterises each tool from its published description rather than by re-running it. The selection is a sample chosen to expose the families above, not a census; it deliberately omits adjacent topics such as generative-adversarial image synthesis and neural rendering, which manufacture pixels but not, in the period covered, the exact dense labels a segmentation task needs.

The organising axis is chosen, not measured. We place each family on a plane whose horizontal extent is the scene photorealism the family can reach and whose vertical extent is the label fidelity it delivers for free, by which we mean the share of pixels the generator can label exactly because it produced them. The placements are an illustrative reading of how each tool is characterised in its own literature, and the interactive exhibit below flags that on its face. What the exhibit does measure honestly is the consequence of a priority: it lets the reader weight the two axes and re-ranks the families by fitness to that weighting, so the claim being tested is not where any single tool sits but which property a task should optimise for.

The embedded instance is our synthetic well-log curve generator, the procedural component of the VeerNet raster-log digitiser, whose task is to recover machine-readable curves from scanned paper logs. The facts we assert about it are these and only these: it is a domain-specific generator that draws curves on a synthetic log canvas and emits the segmentation mask together with the image, so its labels are exact by construction; it produced the 15,000-instance training set used for the multiclass phase; and the synthetic canvases span 3,200 to 12,800 pixels in width and 480 to 640 pixels in height. Those quantities are sourced from the engagement archive and are the only measured numbers in this review. The claim we attach to them is positional, not comparative: the generator sits at the label-fidelity corner of the surveyed map, and we did not run the other families on our task to produce a head-to-head score.

Results

Sorting the surveyed tooling along the chosen axis produces a clear and, we think, useful picture: the families do not compete on a single ladder of quality, they occupy different corners of a trade. The game-engine and physically based renderers cluster toward high photorealism, because their entire heritage is making scenes look real, and they carry high label fidelity too when the labels are read from the rendering pipeline rather than annotated by hand [1][2][6][9]. The domain-randomisation generators sit lower on photorealism by deliberate choice, since their thesis is that realism is a distraction, while keeping high label fidelity because the generator still knows exactly what it drew [5]. The compositing pipelines sit in the middle on realism, with a real background and synthetic foreground, and their label fidelity is high for the pasted content [4]. The instrument below renders this as a plane the reader can interrogate, and its single orange marker is ours.

A comparator of the ways a vision team can manufacture labelled training imagery, scored against two demands that decide whether a generator earns its keep. The x-axis is the scene photorealism a family can reach and the y-axis is the label fidelity it delivers for free, the share of pixels it can label exactly because it drew them. The teal markers are five public approaches credited in the article: photoreal game-engine renderers, physically based Blender and offline render pipelines, structured domain-randomisation generators, procedural geometry-and-texture synthesis, and 2D compositing or paste pipelines. Drag the lever to reweight the two demands from a photorealism-first priority, where the scene must look real, toward a label-fidelity-first priority, where the labels must be exact; the right-hand column re-ranks every approach by fitness to that weighting. The orange diamond is ours and it carries the argument: the synthetic well-log curve generator we built for VeerNet produced a 15,000-instance training set at 3,200 to 12,800 pixels wide, trading broad photorealism for total label fidelity because it emits the mask with the image. As the priority slides toward exact labels, that domain-specific generator becomes the fittest point on the surveyed plane. The instance count and image-width range are sourced from the engagement archive; the two demand axes and the fitness weighting are an illustrative reading of the cited procedural-generation literature, not a measured leaderboard.

The result our marker argues is what the plane makes visible once the reader stops treating photorealism as the figure of merit. Our curve generator is poor on the photorealism axis on purpose, a synthetic paper log does not need to look like a photograph of a real one, and it is at the ceiling on the label-fidelity axis because every curve it draws it also labels, with no human annotator and therefore no annotation noise. When the priority lever is pushed toward photorealism first, the renderer families dominate the ranking, exactly as their literature would predict. When the lever is pushed toward label fidelity first, which is the regime a thin-curve segmentation task actually inhabits, our domain-specific generator rises to the fittest point on the surveyed plane. The defensible content of the exhibit is structural: it shows that the property our generator maximises is a real and recognised one in the surveyed tooling, and that a task can rationally prefer a generator that is unphotorealistic but exactly labelled. The 15,000-instance scale and the 3,200 to 12,800 pixel width range are the sourced facts the marker rests on; the family coordinates and the fitness weighting are the illustrative reading the method line declares.

Discussion

The map clarifies something the synthetic-data slogan tends to flatten, which is that there is no single best generator, only a generator best matched to a task's figure of merit. For perception in driving and robotics, where the appearance of a real scene is most of the problem, the realism-heavy families are the natural fit, and the field's investment in photoreal engines and physically based pipelines is well placed [1][2][6][7][9]. For tasks where the structure to be labelled is thin, sparse, and synthesisable from rules, the calculus inverts: photorealism buys little, and the property that matters is whether the generator can emit a perfect mask for the structure it drew. Domain randomisation made the first half of that argument by showing realism can be sacrificed for transfer [5]; a domain-specific generator like ours makes the second half by showing that, in a narrow enough domain, the generator can be the annotator.

It is worth being precise about what is ours and what is borrowed, because this review is written by people who built one of the generators it describes. The procedural-generation idea is not ours; the game-engine, offline-render, randomisation, and compositing families are credited above to their authors, and the survey framing for when synthetic data transfers is theirs too [8]. The consumer architecture nearest our work, the skip-connected encoder-decoder, is prior art [10]. What we claim authorship of is the VeerNet system and, within it, the specific curve generator that draws labelled raster logs, along with the decision to spend the synthetic-data budget on label fidelity rather than on chasing a photorealistic facsimile of a scanned page. The contribution of our instance to this review is therefore narrow and exact: it is a worked example at the label-fidelity corner, demonstrating in an industrial setting that the corner is occupiable and that a real task lives there.

There is a practical reading for a team weighing whether to render its own training data, and it is the most useful part of the discussion. The first question to settle is not which engine to license but which axis the task rewards, because that answer chooses the family before any tool is evaluated. A team whose task is dominated by scene appearance should look first at the realism-heavy families and budget for the domain gap the synthetic-data literature warns about [8]. A team whose target is thin, rule-describable structure should look first at whether a narrow generator can draw and label that structure directly, because if it can, the generator buys exact labels at a throughput no human pipeline matches, and it does so without ever solving the harder and often irrelevant problem of looking real.

Limitations

This review carries the limitations of its dual form, and naming them is part of meeting the bar it sets for the tooling it surveys. As a review, its selection is a sample rather than a census, restricted to procedural-generation tooling public on or before the review quarter and chosen to expose five families, which means it omits both later systems and adjacent strands, notably generative-adversarial and neural-rendering image synthesis, that manufacture pixels without, in the period covered, the exact dense labels a segmentation task needs. The two-axis map is an organising device, not a measurement: the photorealism and label-fidelity coordinates assigned to each family are an illustrative reading of how those tools are described in their own literature, and the fitness weighting that re-ranks them is a chosen function, so the plane should be read as a way to reason about the trade and not as a benchmark of any tool. As a piece carrying one embedded instance, its empirical content is exactly the two facts we measured about our own generator, the 15,000-instance training-set size and the 3,200 to 12,800 pixel width range; it is not a controlled comparison. We did not run the surveyed families on our raster-log task, so the review cannot report how a game-engine or Blender pipeline would have scored against ours, only that our generator occupies the label-fidelity corner by construction. Finally, the placement is domain-specific: the conclusion that an unphotorealistic but exactly labelled generator can be the fittest choice holds for thin, rule-describable targets like well-log curves, and a reader should not carry it over to tasks dominated by real-scene appearance, where the realism-heavy families remain the right starting point.

What this review establishes

Procedural-generation tooling for vision training sorts into five public families, each credited to its authors: game-engine renderers, physically based or Blender-driven offline pipelines, domain-randomisation generators, engine-level procedural synthesis, and 2D compositing pipelines.
The axis the synthetic-data slogan hides is the trade between scene photorealism and label fidelity, the share of pixels a generator can label exactly because it drew them. Most segmentation tasks reward the second property more than the first.
Realism-heavy families fit tasks dominated by real-scene appearance, such as driving and robotics. Domain randomisation showed realism can be sacrificed for transfer, establishing that an unphotorealistic generator can still be useful.
Our VeerNet curve generator is a narrow-domain instance of this tooling at the label-fidelity corner: it draws synthetic raster logs and emits the mask with the image, so its labels are exact by construction. It produced a 15,000-instance training set at 3,200 to 12,800 pixels wide.
The practical first question is which axis a task rewards, because that chooses the family before any engine is evaluated. For thin, rule-describable targets, a generator that is the annotator beats one that merely looks real.

References

[1] Richter, S. R., Vineet, V., Roth, S., and Koltun, V. Playing for Data: Ground Truth from Computer Games. ECCV (2016). Showed that a commercial game engine can be turned into a source of densely labelled training imagery, with labels recovered from the rendering pipeline. https://arxiv.org/abs/1608.02192

[2] Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A. M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. CVPR (2016). A purpose-built virtual city rendered to produce pixel-accurate semantic labels for urban scene segmentation. https://ieeexplore.ieee.org/document/7780721

[3] Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. CVPR (2016). The Virtual KITTI work, which cloned a real benchmark inside a game engine so that ground truth could be generated and varied at will. https://arxiv.org/abs/1605.06457

[4] Gupta, A., Vedaldi, A., and Zisserman, A. Synthetic Data for Text Localisation in Natural Images. CVPR (2016). The SynthText pipeline, which composites rendered text onto real background photographs to manufacture detection labels at scale. https://arxiv.org/abs/1604.06646

[5] Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S., and Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops (2018). Demonstrated that deliberately unrealistic, randomised renders can train networks that transfer to real images. https://arxiv.org/abs/1804.06516

[6] Denninger, M., Sundermeyer, M., Winkelbauer, D., Zidan, Y., Olefir, D., Elbadrawy, M., Lodhi, A., and Katam, H. BlenderProc. arXiv (2019). A procedural, physically based rendering pipeline built on Blender for generating photorealistic images with automatically derived annotations. https://arxiv.org/abs/1911.01911

[7] Borkman, S., Crespi, A., Dhakad, S., Ganguly, S., Hogins, J., Jhang, Y.-C., Kamalzadeh, M., Li, B., Leal, S., Parisi, P., Romero, C., Smith, W., Thaman, A., Warren, S., and Yadav, N. Unity Perception: Generate Synthetic Data for Computer Vision. arXiv (2021). A package that turns the Unity engine into a configurable, randomisable generator of labelled vision data with built-in label exporters. https://arxiv.org/abs/2107.04259

[8] Nikolenko, S. I. Synthetic Data for Deep Learning. Springer (2021). A survey of when generated data substitutes for collected data, and the domain-gap conditions under which the substitution stops transferring. https://arxiv.org/abs/1909.11512

[9] Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D. J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al. Kubric: A Scalable Dataset Generator. arXiv (2022). An open pipeline that scripts a renderer and a physics engine together to emit images with rich, exact annotations. https://arxiv.org/abs/2203.03570

[10] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder with skip connections, the consumer architecture for the dense labels a procedural generator produces. https://arxiv.org/abs/1505.04597

Procedural Generation Engines for Vision Training: A Review of the Tooling

Abstract

Method

Results

Discussion

Limitations

References

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on

Abstract

Background and related work

Method

Results

Discussion

Limitations

References

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on