Predicting PBR Maps from a Single Input Image with a Conditioned Diffusion Model

Leading the Innovation in Surface Digitization

15 minute read

Turn images into PBR maps

Meet the colormass Material AI that can convert simple top-down images into a set of PBR maps.

Our ongoing research in material digitization has led us to develop a cutting-edge AI-based system for transforming photographs of textile samples into full material models that can be used for realistic 3D renders, web-based configurators, and more. Leveraging years of high-quality calibrated scanning data, we now train generative models on NVIDIA GPUs that predict full Physically Based Rendering (PBR) maps from a single overhead image of a textile. In this post, we outline some of the technical details of our approach, from data acquisition to model architecture and training strategies.

Results

Before diving into the technical details of the AI model, the visualization below showcases the original, unseen data inputs on the left and the resulting PBR maps on the right, produced entirely by Material AI, without any manual editing or post-processing.

You can find the original input data and the resulting maps in the Example Inputs and Outputs section.

Training Data Acquisition

Over multiple generations of development, our material scanning systems have been refined to capture ever higher-fidelity data, increasing both the volume and consistency of our datasets. Each scanner is rigorously calibrated to ensure accurate color and surface measurements, producing tens of gigabytes of raw images per scan under carefully varied lighting conditions. Once this data is uploaded to our cloud platform, an “inverse rendering” algorithm processes it into PBR maps. By pairing these calibrated raw captures with the resulting PBR outputs, we create a robust, real-world dataset that serves as the foundation for training our AI models.

Inverse rendering works best when a BRDF solver can leverage many carefully captured images, typically obtained through a specialized scanner like ours. Although this yields highly accurate PBR maps, it also requires physical access to each sample, which can become challenging for large catalogs. An AI-based solution can fill this gap by inferring material properties from a more limited set of measurements, broadening the range of assets that can be converted into photorealistic, 3D-ready materials.
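
To make the inverse-rendering idea concrete, here is a deliberately simplified sketch of what a BRDF solver does: given several calibrated captures of the same flat sample under known directional lights, it recovers per-pixel reflectance parameters by minimizing a photometric loss. The Lambertian-plus-Blinn-Phong shading model, the flat-normal assumption, and all names below are illustrative; a production solver also recovers normals and the other PBR channels.

```python
import torch
import torch.nn.functional as F

def fit_brdf(captures, light_dirs, steps=500, lr=0.05):
    """Toy per-pixel BRDF fit (illustrative only, not the colormass solver).

    captures:   (K, 3, H, W) linear-light images under K known lights
    light_dirs: (K, 3) unit light directions
    """
    K, _, H, W = captures.shape
    view = torch.tensor([0.0, 0.0, 1.0])                    # overhead camera
    normal = torch.tensor([0.0, 0.0, 1.0]).view(3, 1, 1)    # flat-surface assumption;
                                                            # a real solver recovers normals too
    # Unknown per-pixel parameters: diffuse albedo, specular strength, log-shininess.
    albedo = torch.full((3, H, W), 0.5).requires_grad_()
    spec = torch.full((1, H, W), 0.1).requires_grad_()
    shin = torch.full((1, H, W), 2.0).requires_grad_()
    opt = torch.optim.Adam([albedo, spec, shin], lr=lr)

    for _ in range(steps):
        loss = 0.0
        for k in range(K):
            l = light_dirs[k]
            n_dot_l = (normal * l.view(3, 1, 1)).sum(0).clamp(min=0.0)
            h = F.normalize(l + view, dim=0)                # Blinn-Phong half vector
            n_dot_h = (normal * h.view(3, 1, 1)).sum(0).clamp(min=1e-4)
            pred = albedo.clamp(0, 1) * n_dot_l + spec.clamp(0, 1) * n_dot_h ** shin.exp()
            loss = loss + (pred - captures[k]).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return albedo.detach().clamp(0, 1), spec.detach().clamp(0, 1), shin.detach().exp()
```

With only a single image (K = 1), a direct fit like this is badly under-constrained, which is exactly the gap the AI model is meant to fill.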

We validate our system on a separate dataset drawn from sources beyond our own hardware scanning system, including overhead photographs taken under neutral lighting and scans from a commercial flatbed scanner. This ensures our model is tested beyond the ideal conditions of our specialized hardware, reflecting a wider range of real-world scenarios.

Example Inputs and Outputs

Below are examples from the colormass scanner dataset, demonstrating how the model derives a full set of PBR maps from a single image captured under uniform lighting. The top image is the overhead photo; the middle row shows the “ground truth” PBR maps from our calibrated inverse rendering pipeline (Base Color, Normal, Roughness, Metalness, Anisotropy, Specular Strength); and the bottom row shows the model’s predictions after 25 diffusion steps.

Next, we present examples of the model’s performance on unseen data acquired using a standard flatbed scanner. Ground truth maps are not available for these samples, as they were not captured with the colormass scanner. Nonetheless, these examples highlight the model’s ability to generalize beyond tightly controlled data.

The .zip files below contain the original source image, named ORIGINAL.png, along with the resulting maps.

Image-Conditioned Diffusion

Our system builds on the same diffusion principles popularized by text-to-image models, but here the conditioning input is a photo rather than a text prompt. This setup is particularly well-suited to generating PBR maps, because diffusion models sample from the full distribution of potential outputs instead of converging on a single, “average” result. By preserving the variability and high-frequency detail found in real-world surfaces, they avoid the blurriness often seen in straightforward regression methods, ultimately delivering more natural-looking renders.
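
To make this concrete, here is a minimal, generic DDIM-style sampling loop in which the conditioning photo is concatenated to the noisy sample at each of the 25 denoising steps. Concatenation is just the simplest conditioning scheme, not necessarily the exact mechanism used here (the dual-branch design described below is more elaborate), and the model interface, channel count, and noise schedule are illustrative assumptions.

```python
import math
import torch

@torch.no_grad()
def sample_pbr_maps(model, photo, num_steps=25, pbr_channels=10):
    """Generic DDIM-style sampler conditioned on an input photo.

    Assumptions (not the exact colormass interface): `model` takes the noisy
    PBR stack concatenated with the photo, plus the current noise level, and
    predicts the added noise. `photo`: (B, 3, H, W); channel count depends on
    how the PBR maps are packed.
    """
    B, _, H, W = photo.shape
    x = torch.randn(B, pbr_channels, H, W, device=photo.device)  # start from pure noise

    # Cosine noise schedule evaluated at num_steps + 1 points, clamped away from 0.
    t = torch.linspace(1.0, 0.0, num_steps + 1, device=photo.device)
    alpha_bar = torch.cos(t * math.pi / 2).pow(2).clamp(1e-4, 1.0)

    for i in range(num_steps):
        a_t, a_next = alpha_bar[i], alpha_bar[i + 1]
        # Predict the noise from the current sample and the conditioning photo.
        eps = model(torch.cat([x, photo], dim=1), a_t.repeat(B))
        # Estimate the clean PBR stack, then take a deterministic DDIM step.
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps

    return x  # later split channel-wise into Base Color, Normal, Roughness, ...
```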

Joint Generation in Pixel Space

Many state-of-the-art generative models rely on a Variational Autoencoder (VAE) to move between raw pixels and a latent space. In our trials, however, we noticed that incorporating a VAE tended to blur some of the fine details in PBR maps, so we chose to remove it altogether. While training a diffusion model directly in pixel space can be more demanding, careful initialization, architectural design, and optimization strategies allow us to achieve high-resolution results even on limited GPU resources, without sacrificing image sharpness.
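
As a sketch of what training directly in pixel space can look like under these choices, the step below stacks all PBR maps channel-wise into a single tensor, adds noise at a random level, and trains the network to predict that noise given the clean photo, with no VAE encode/decode anywhere in the loop. The names, the conditioning-by-concatenation interface, and the uniform noise-level sampling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(model, photo, pbr_maps, optimizer):
    """One denoising-training step directly in pixel space (no VAE).

    photo:    (B, 3, H, W) conditioning image
    pbr_maps: dict of ground-truth maps from the inverse-rendering pipeline
    """
    # Stack every PBR map channel-wise into one pixel-space target, e.g.
    # base color (3) + normal (3) + roughness (1) + metalness (1) + ...
    target = torch.cat([pbr_maps[k] for k in sorted(pbr_maps)], dim=1)

    B = target.shape[0]
    # Sample a random noise level per example and form the noisy input.
    alpha_bar = torch.rand(B, device=target.device)          # in [0, 1)
    a = alpha_bar.view(B, 1, 1, 1)
    noise = torch.randn_like(target)
    noisy = a.sqrt() * target + (1 - a).sqrt() * noise

    # The model sees the noisy PBR stack plus the clean photo and predicts the noise.
    pred = model(torch.cat([noisy, photo], dim=1), alpha_bar)
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```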

Architectural Highlights
  • Aliasing-Aware U-Net Backbone: We employ a modified U-Net architecture designed to reduce artifacts caused by aliasing and padding/truncation during upsampling and downsampling. This aspect is especially important for tiled textures, where even minor inconsistencies between the center and edges quickly become visible. By minimizing these artifacts, our approach maintains more uniform texture statistics, leading to cleaner, more seamless tiling (a minimal sketch of such a block follows this list).
  • Dual-Branch U-Net for Conditioned Diffusion: We adopt a dual-branch approach: one branch focuses on the conditioning image (feature extraction), while the other handles the diffusion sample (denoising). Skip connections from the feature-extraction branch feed into the downsampling layers of the denoising branch, which then guides the upsampling stack. By isolating the high-variance diffusion process, we find that the model more quickly reproduces key features in the PBR maps across multiple spatial scales, improving both convergence and final output quality.
  • Orientation and Color Equivariance: Our model is designed to remain consistent under translations, rotations, and hue shifts, a property that’s crucial for seamless tiling and accurate color reproduction. Although fully convolutional architectures are translation-equivariant, the downsampling and upsampling stages in typical U-Nets can still introduce aliasing artifacts. By employing an aliasing-aware U-Net backbone and other measures to preserve color consistency, our system produces stable outputs regardless of how the input is transformed.
  • Resampling-Aware Tiled Inference: At inference time, we carefully split processing of each U-Net block into separate tiles of dynamically chosen size, while still preserving all of the properties described above. This allows us to trade off between GPU memory usage and GPU-host communication overhead without any impact on the model output. As a result, we can process high-resolution images efficiently, without compromising the model’s capacity or the quality of the final output (a simplified tiling sketch also follows this list).
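
As a rough illustration of the aliasing-aware, tiling-friendly building blocks described above, the module below low-pass filters before striding (in the spirit of blur-pool / anti-aliased CNNs) and uses circular padding so convolutions see the texture as if it were already tiled. It is a generic sketch, not the production backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TileFriendlyDownsample(nn.Module):
    """Blur-then-stride downsampling with circular padding.

    Low-pass filtering before the strided step suppresses aliasing, and
    circular padding keeps statistics identical at tile centers and edges,
    which matters for seamlessly tiling textures. Generic sketch only.
    """

    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=0)
        # Fixed 3x3 binomial blur kernel, applied per channel (depthwise).
        k = torch.tensor([1.0, 2.0, 1.0])
        blur = torch.outer(k, k)
        blur = blur / blur.sum()
        self.register_buffer("blur", blur.expand(channels, 1, 3, 3).clone())

    def forward(self, x):
        # Circular padding: the network effectively sees the texture pre-tiled.
        x = F.pad(x, (1, 1, 1, 1), mode="circular")
        x = F.silu(self.conv(x))
        # Blur (low-pass) first, then subsample by striding the blur itself.
        x = F.pad(x, (1, 1, 1, 1), mode="circular")
        x = F.conv2d(x, self.blur, stride=2, groups=self.channels)
        return x
```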
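
And a simplified view of tiled inference as referenced in the last bullet: each tile is taken with a halo of wrap-around context, processed independently, and only its central region is kept, so the stitched output matches a full-image pass as long as the halo covers the block’s receptive field. The tile and halo sizes, and the assumption that the block preserves spatial resolution, are illustrative.

```python
import torch

@torch.no_grad()
def tiled_apply(block, x, tile=256, halo=32):
    """Apply `block` tile by tile with circular wrap-around context.

    Assumes `block` preserves spatial resolution and has a receptive-field
    radius no larger than `halo`, so stitching the tile centers reproduces
    the full-image result while bounding peak GPU memory.
    """
    B, C, H, W = x.shape
    out = None
    for top in range(0, H, tile):
        for left in range(0, W, tile):
            th, tw = min(tile, H - top), min(tile, W - left)
            # Gather tile + halo with wrap-around indices (texture treated as tiling).
            rows = torch.arange(top - halo, top + th + halo) % H
            cols = torch.arange(left - halo, left + tw + halo) % W
            patch = x[:, :, rows][:, :, :, cols]
            y = block(patch)
            if out is None:
                out = x.new_zeros(B, y.shape[1], H, W)
            # Keep only the central region; the halo existed just for context.
            out[:, :, top:top + th, left:left + tw] = y[:, :, halo:halo + th, halo:halo + tw]
    return out
```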

Training

Our training pipeline integrates experimental yet promising optimization strategies. We adopt the recently introduced Muon optimizer, and draw inspiration from the modular duality framework to restructure typical neural blocks in a way that encourages fast, stable training. This combination helps stabilize training dynamics across different architectures and hyperparameter choices, making it easier to iterate and refine our model without exhaustive tuning.
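
For readers unfamiliar with Muon, its core move is to orthogonalize the momentum of each 2-D weight matrix via a Newton-Schulz iteration before applying the update. The sketch below follows the published description of the algorithm; coefficients, momentum flavor, and scaling conventions differ between implementations, so treat it as an illustration rather than the exact optimizer configuration we run.

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximate the orthogonal factor of `g` (U @ V^T of its SVD)
    with the quintic Newton-Schulz iteration from the Muon write-up."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)                 # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                           # iterate on the "wide" orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(weights, momenta, lr=0.02, beta=0.95):
    """One Muon-style update for a list of 2-D weight matrices.

    Non-matrix parameters (biases, norms, embeddings) are typically handled
    by a standard optimizer such as AdamW and are omitted here.
    """
    for w, m in zip(weights, momenta):
        m.mul_(beta).add_(w.grad)                       # momentum accumulation
        update = newton_schulz_orthogonalize(m)
        # Scale so the update magnitude is roughly independent of matrix shape.
        scale = max(1.0, w.shape[0] / w.shape[1]) ** 0.5
        w.add_(update, alpha=-lr * scale)
```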

If you'd like to test the Beta version of the Material AI, please contact us through the Contact Form.
