November 11, 2024

GPU texture encoder


Creating fast, real-time 3D applications always involves balancing quality and performance, especially when targeting platforms without top-tier GPUs. One of the main bottlenecks in these scenarios is memory throughput. The amount of texture data sampled directly affects memory bandwidth, and hardware texture compression alleviates the problem by reducing both the required bandwidth and the memory footprint.

All GPUs support block compression formats; however, there is no universal standard that works seamlessly across all platforms. Currently, different GPUs support three major compression formats:

  • BC1-5 (BC1-3 are also known as DXT or S3TC, BC4-5 as RGTC) – Supported by all desktop GPUs.
  • BC6-7 – Supported by D3D11+ desktop GPUs.
  • ASTC – Supported by modern mobile GPUs.

There are also older mobile compression formats that are still in use today, such as ETC, ETC2, EAC, ATC, and PVRTC. Unfortunately, using BC formats on mobile devices and ASTC on desktops is not feasible, necessitating different data packs for various platforms.

Compressing textures to BC1-5 formats was relatively straightforward using either the CPU or GPU, as the encoding algorithm was simple. However, the introduction of BC6-7 increased the complexity due to additional compression modes, making the encoding process significantly slower. ASTC formats further increased complexity due to the vast number of modes and the use of integer coding with trits and quints.
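
To make the trit and quint remark more concrete: ASTC stores values with three or five states in groups, because five trits (3^5 = 243 combinations) fit into 8 bits and three quints (5^3 = 125 combinations) fit into 7 bits. The following is only a minimal sketch of that counting argument using plain base-3 and base-5 packing. The real ASTC bit layout interleaves the bits differently, and the helper names `pack` and `unpack` are hypothetical.

```python
import math

def pack(values, base):
    """Pack a list of digits (each in [0, base)) into one integer, first digit lowest."""
    number = 0
    for value in reversed(values):
        if not 0 <= value < base:
            raise ValueError("digit out of range")
        number = number * base + value
    return number

def unpack(number, base, count):
    """Inverse of pack(): extract `count` digits, lowest digit first."""
    values = []
    for _ in range(count):
        number, value = divmod(number, base)
        values.append(value)
    return values

# Largest possible groups: five trits and three quints.
trits = pack([2, 2, 2, 2, 2], 3)
quints = pack([4, 4, 4], 5)

# Bits needed per group, and the cost per value compared to the ideal log2(base).
print(trits.bit_length(), 8 / 5, math.log2(3))
print(quints.bit_length(), 7 / 3, math.log2(5))
```

Grouping keeps the cost per value close to the theoretical minimum, which is part of why an ASTC encoder has so many more combinations to search than a BC1 encoder.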

Compressing textures offline and shipping them with the project is common but often suboptimal, especially for dynamic or procedural textures. For instance, glTF and USD resources typically include embedded JPEG images to reduce asset size, while some algorithms generate procedural textures at runtime. In such cases, fast, real-time compression is necessary.

Tellusim SDK provides real-time compression for BC1-5, BC6-7, and ASTC formats on all platforms. BC1 texture compression remains a viable option for PCs because BC formats do not have variable block sizes and BC1 is twice as compact as BC7, with excellent compression speed. A practical use case for BC1 compression is real-time streaming of map tiles, such as Google Maps or XYZ tiles, where it helps minimize memory overhead and reduce compression stalls.
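
The "twice as compact" figure follows from the block sizes: a BC1 block is 64 bits, while BC7 and ASTC blocks are 128 bits. The sketch below works through that arithmetic for a single mip level. The function names are illustrative and not part of the Tellusim API.

```python
def div_up(value, divisor):
    """Integer division rounding up."""
    return (value + divisor - 1) // divisor

def bits_per_pixel(block_bits, block_width, block_height):
    """Average storage cost of one pixel in a block-compressed format."""
    return block_bits / (block_width * block_height)

def compressed_size(width, height, block_bits, block_width, block_height):
    """Size in bytes of one compressed mip level (partial blocks are padded)."""
    blocks_x = div_up(width, block_width)
    blocks_y = div_up(height, block_height)
    return blocks_x * blocks_y * block_bits // 8

# 1024x512 texture, base level only
bc1 = compressed_size(1024, 512, 64, 4, 4)
bc7 = compressed_size(1024, 512, 128, 4, 4)
astc55 = compressed_size(1024, 512, 128, 5, 5)

print("BC1     ", bits_per_pixel(64, 4, 4), "bpp", bc1, "bytes")
print("BC7     ", bits_per_pixel(128, 4, 4), "bpp", bc7, "bytes")
print("ASTC 5x5", bits_per_pixel(128, 5, 5), "bpp", astc55, "bytes")
```

For comparison, the same uncompressed RGBA8 level takes 1024 × 512 × 4 = 2,097,152 bytes.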

Our SDK provides GPU encoders via the EncoderBC15, EncoderBC67, and EncoderASTC interfaces. Each encoder has its own flags and can be initialized for only the required formats, because the initial kernel compilation can take some time.

The encoder input is a standard texture, and the output is an integer texture in RGBAu16 or RGBAu32 format, with one pixel per compressed block. RGBAu16 (64 bits per pixel) is required only for the BC1 and BC4 formats; all other formats use RGBAu32 (128 bits per pixel). The application must copy this intermediate integer texture into the final block texture because writing directly to block-compressed formats is typically unsupported.

Integer textures cannot fully represent all required mipmap levels due to size truncation. This issue needs to be managed manually, either by reducing the number of mipmaps being compressed or by increasing the size of the integer texture. The truncation occurs because the final 1×1 mipmap level in the integer texture represents a 4×4 (or 5×4, 5×5) pixel block, leaving no space for the smallest mipmaps (2×2 and 1×1).
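
The truncation can be checked with a little arithmetic. The sketch below assumes the usual mip rule (each level is the previous one halved and rounded down, never below 1) and counts how many source mip levels still fit into the mip chain of the intermediate integer texture. The function names and the optional padded size are illustrative and not part of the SDK.

```python
def div_up(value, divisor):
    """Integer division rounding up."""
    return (value + divisor - 1) // divisor

def mip_size(size, level):
    """Size of a mip level under the usual floor-halving rule."""
    return max(1, size >> level)

def mip_count(width, height):
    """Number of levels in a full mip chain."""
    return max(width, height).bit_length()

def encodable_levels(width, height, block_w, block_h, base_w=None, base_h=None):
    """Count source mip levels whose blocks fit into the integer texture mip chain.

    base_w/base_h is the integer texture size; by default one pixel per block
    of the base level, optionally padded to something larger.
    """
    base_w = base_w or div_up(width, block_w)
    base_h = base_h or div_up(height, block_h)
    levels = 0
    for level in range(mip_count(base_w, base_h)):
        need_w = div_up(mip_size(width, level), block_w)
        need_h = div_up(mip_size(height, level), block_h)
        if mip_size(base_w, level) < need_w or mip_size(base_h, level) < need_h:
            break
        levels += 1
    return levels

print(mip_count(1024, 512))                         # levels in the source texture
print(encodable_levels(1024, 512, 4, 4))            # 4x4 blocks, 256x128 integer texture
print(encodable_levels(1024, 512, 5, 5))            # 5x5 blocks, 205x103 integer texture
print(encodable_levels(1024, 512, 5, 5, 256, 128))  # 5x5 blocks, padded integer texture
```

Under these assumptions a 1024×512 texture with 4×4 blocks loses only the two smallest levels, while 5×5 blocks already run out of space at the second level unless the integer texture is enlarged.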

Below is an example of GPU ASTC 4×4 texture compression using the Tellusim SDK:

// texture format
Format format = FormatASTC44RGBAu8n;

// create intermediate image
uint32_t width = src_texture.getWidth();
uint32_t height = src_texture.getHeight();
uint32_t block_width = getFormatBlockWidth(format);
uint32_t block_height = getFormatBlockHeight(format);
Image dest_image = Image(Image::Type2D, FormatRGBAu32, Size(udiv(width, block_width), udiv(height, block_height)));

// create intermediate texture
Texture dest_texture = device.createTexture(dest_image, Texture::FlagSurface | Texture::FlagSource);
if(!dest_texture) return 1;

// dispatch encoder
{
    Compute compute = device.createCompute();
    encoder.dispatch(compute, EncoderASTC::ModeASTC44RGBAu8n, dest_texture, src_texture);
}

// flush context
context.flush();

// get intermediate image data
if(!device.getTexture(dest_texture, dest_image)) return 1;

// copy image data
Image image = Image(Image::Type2D, format, Size(width, height));
memcpy(image.getData(), dest_image.getData(), min(image.getDataSize(), dest_image.getDataSize()));

// save encoded image
image.save("texture.astc");

Additionally, the SDK includes a fast GPU JPEG decompression interface, which significantly accelerates JPEG to BC or ASTC conversions.

Achieving real-time compression speed comes at the cost of some texture quality. The tables below show PSNR and encoding time for a 1024×512 RGB test image on an Apple M1 Max:

PSNR RGB (dB) CPU Fast CPU Default CPU Best GPU
BC1 39.83 39.85 39.69
BC7 48.27 48.53 44.85
ASTC 4×4 48.13 48.29 48.50 44.97
ASTC 5×4 46.18 46.34 46.50 42.74
ASTC 5×5 44.48 44.61 44.73 40.96

PSNR RG (dB) CPU Fast CPU Default CPU Best GPU
BC1 49.52 49.62 40.11
BC7 50.14 50.38 46.28
ASTC 4×4 51.07 51.18 51.37 47.43
ASTC 5×4 48.65 48.77 48.92 44.02
ASTC 5×5 46.41 46.54 46.67 41.85

Time RGB (ms) CPU Fast CPU Default CPU Best GPU
BC1 28 43 0.4
BC7 105 186 1.0
ASTC 4×4 100 157 386 2.2
ASTC 5×4 117 175 464 4.8
ASTC 5×5 138 212 542 3.1

Reducing the number of input texture components improves PSNR values, which is beneficial for normal maps and luminance-only textures. ASTC encoding performance can be further optimized by limiting the number of compression modes if needed. However, the current performance is satisfactory for applications using JPEG input textures.
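
For readers unfamiliar with the metric, PSNR is commonly defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between the original and the decoded values and MAX is 255 for 8-bit data. The sketch below is a minimal version of that definition over flat lists of channel values. How the Image Processing Tool aggregates channels is not stated here, so treat the exact averaging as an assumption.

```python
import math

def psnr(reference, encoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized value lists."""
    if len(reference) != len(encoded) or not reference:
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, encoded)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical data
    return 10.0 * math.log10(max_value * max_value / mse)

# Every value is off by exactly one step, so MSE = 1.
original = [10, 20, 30, 200]
decoded = [11, 19, 31, 199]
print(round(psnr(original, decoded), 2))
```

An error of one 8-bit step per value corresponds to roughly 48 dB, which gives a feel for the numbers in the tables: every 6 dB lost means the average error approximately doubles.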

The latest version of the reference astcenc compressor demonstrates excellent CPU encoding performance, while we stopped our CPU ASTC encoder optimizations at BC7 performance level:

PSNR RGB (dB)   Fast    Medium  Thorough
ASTC 4×4        47.31   48.19   48.50
ASTC 5×4        45.58   46.34   46.63
ASTC 5×5        43.56   44.61   44.97

Time RGB (ms)   Fast    Medium  Thorough
ASTC 4×4        22      28      65
ASTC 5×4        20      26      64
ASTC 5×5        21      25      67

All textures and metrics were produced with the Tellusim Image Processing Tool from the Core SDK using this script:

November 9, 2024

10 Hello Image


Efficient image processing is crucial for any application, given the increasing resolutions and growing data volumes. Even small optimizations in this pipeline can save hours, days, or even weeks of processing time. Additionally, flexible access to image conversion and manipulation utilities across various languages and environments is a valuable feature.

The Image interface is a central component of image handling in Tellusim. It supports loading and saving 2D, 3D, Cube, 2D Array, and Cube Array images, as well as format/type conversion, component/region extraction, scaling, and mipmap generation. All heavy operations are optimized with SIMD and multithreading. The extension system allows for custom format and conversion operation support. The Python API ensures compatibility with popular libraries such as NumPy, PyTorch, and Pillow. The Image interface is simple to use and fully compatible with all supported programming languages.

The following Python snippets showcase basic image operations that can be useful for batch image processing.

Getting image information, including Exif metadata, without loading image content:

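# create an empty image (assumes the Tellusim Python bindings are already imported)
image = Image()

# fast operation without content loading
image.info("image.png")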
print(image.description)

Performing basic operations with image:

# load Image from file
image.load("image.png")
print(image.description)

# swap red and blue components
image.swap(0, 2)

# rotate image by 90 degrees CCW
image = image.getRotated(-1)

# convert image to RGBA format
image = image.toFormat(FormatRGBAu8n)

# crop image
image = image.getRegion(Region(40, 150, 64, 94))

# upscale image using default Cubic filter
image = image.getResized(image.size * 4)

# create mipmap chain using default mipmap filter
image = image.getMipmapped(Image.FilterMip, Image.FlagGamma)

# save image
image.save("test_basic.dds")
print(image.description)
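
As background for the gamma flag used above: averaging sRGB-encoded values directly tends to darken high-contrast detail, so mipmap filters often convert to linear light before averaging. The sketch below shows that idea for a single 2×2 footprint using the standard sRGB transfer function. It is an assumption that Image.FlagGamma behaves similarly, and the function names are hypothetical.

```python
def srgb_to_linear(value):
    """Decode an 8-bit sRGB value to linear light in [0, 1]."""
    c = value / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(linear):
    """Encode linear light in [0, 1] back to an 8-bit sRGB value."""
    c = linear * 12.92 if linear <= 0.0031308 else 1.055 * linear ** (1.0 / 2.4) - 0.055
    return round(c * 255.0)

def downsample_2x2(a, b, c, d, gamma=True):
    """Box-filter four 8-bit values into one, optionally in linear space."""
    if not gamma:
        return round((a + b + c + d) / 4.0)
    linear = sum(srgb_to_linear(v) for v in (a, b, c, d)) / 4.0
    return linear_to_srgb(linear)

# A black and white checker footprint
print(downsample_2x2(0, 255, 255, 0, gamma=False))
print(downsample_2x2(0, 255, 255, 0, gamma=True))
```

The gamma-aware result is noticeably brighter than the naive average, which matches how the fine pattern looks when viewed from a distance.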

The ImageSampler interface provides access to individual pixels of a specific image layer, mipmap, or face. A high-order Catmull-Rom filter is available for high-quality image resampling when needed. The following snippet demonstrates how to create a simple procedural image:

# create new image
image.create2D(FormatRGBu8n, 512, 256)

# create image sampler from the first image layer
sampler = ImageSampler(image)

# fill image
color = ImageColor(255)
for y in range(image.height):
    for x in range(image.width):
        v = ((x ^ y) & 255) / 32.0
        color.r = int(math.cos(Pi * 1.0 + v) * 127.5 + 127.5)
        color.g = int(math.cos(Pi * 0.5 + v) * 127.5 + 127.5)
        color.b = int(math.cos(Pi * 0.0 + v) * 127.5 + 127.5)
        sampler.set2D(x, y, color)

# save image
image.save("test_xor.png")
print(image.description)

Conversions between panorama and cube formats are straightforward. The following example converts an RGB cube image to a panoramic projection:

# create Cube image
image.createCube(FormatRGBu8n, 128)
print(image.description)

# clear image
for face in range(0, 6, 3):
    ImageSampler(image, Slice(Face(face + 0))).clear(ImageColor(255, 0, 0))
    ImageSampler(image, Slice(Face(face + 1))).clear(ImageColor(0, 255, 0))
    ImageSampler(image, Slice(Face(face + 2))).clear(ImageColor(0, 0, 255))

# convert to 2D panorama
# it will be a horizontal cross without the Panorama flag
image = image.toType(Image.Type2D, Image.FlagPanorama)

image.save("test_panorama.png")
print(image.description)
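
Conceptually, a cube to panorama conversion visits every panorama pixel, turns its longitude and latitude into a direction, and looks up the cube face that direction points at. The sketch below illustrates those two steps. The axis orientation and the face order (+X, -X, +Y, -Y, +Z, -Z) are assumptions chosen for illustration and may differ from the conventions Tellusim uses.

```python
import math

def panorama_direction(u, v):
    """Map normalized panorama coordinates (u, v in [0, 1]) to a unit direction."""
    phi = (u - 0.5) * 2.0 * math.pi   # longitude, 0 at the image center
    theta = (0.5 - v) * math.pi       # latitude, +pi/2 at the top row
    x = math.cos(theta) * math.sin(phi)
    y = math.sin(theta)
    z = math.cos(theta) * math.cos(phi)
    return x, y, z

def cube_face(x, y, z):
    """Pick the cube face hit by a direction: 0..5 = +X, -X, +Y, -Y, +Z, -Z."""
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        return 0 if x > 0 else 1
    if ay >= az:
        return 2 if y > 0 else 3
    return 4 if z > 0 else 5

# Sample a few panorama positions
for u, v in ((0.5, 0.5), (0.75, 0.5), (0.25, 0.5), (0.5, 0.0), (0.5, 1.0)):
    print((u, v), cube_face(*panorama_direction(u, v)))
```

A complete converter would additionally compute the texel position inside the selected face and filter the result.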

Our CPU texture encoders are fast and deliver excellent compression quality. Only a single function call is needed to compress a texture to any BC or ASTC format. An Async interface can be supplied to the function for more precise thread control. By default, the compressors utilize all available CPU cores:

# load and resize Image
image.load("image.png")
image = image.getResized(image.size * 2)

# create mipmaps
image = image.getMipmapped()

# compress image to BC1 format
image_bc1 = image.toFormat(FormatBC1RGBu8n)
image_bc1.save("test_bc1.dds")
print(image_bc1.description)

# compress image to BC7 format
image_bc7 = image.toFormat(FormatBC7RGBAu8n)
image_bc7.save("test_bc7.dds")
print(image_bc7.description)

# compress image to ASTC4x4 format
image_astc44 = image.toFormat(FormatASTC44RGBAu8n)
image_astc44.save("test_astc44.ktx")
print(image_astc44.description)

# compress image to ASTC8x8 format
image_astc88 = image.toFormat(FormatASTC88RGBAu8n)
image_astc88.save("test_astc88.ktx")
print(image_astc88.description)

Python buffer protocol support simplifies data sharing between Tellusim and other Python frameworks. The following snippet demonstrates modifying image content using NumPy operations:

# load image and convert to float32 format
image.load("image.png")
image = image.toFormat(FormatRGBf32)

# create array with specified dimension and format (rows first: height, width, channels)
array = numpy.zeros(shape = (image.height, image.width, 3), dtype = numpy.float32)

# copy image data into the array
image.getData(array)

# set inverted data into the image
image.setData(1.0 - array)

# save inverted image
image.save("test_numpy.dds")
print(image.description)
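
The buffer protocol itself can be demonstrated with the standard library alone. The sketch below inverts float32 data in place through a memoryview, without copying, on any object that exports a contiguous buffer. It only illustrates the mechanism and uses no Tellusim API; `invert_in_place` is a hypothetical helper.

```python
from array import array

def invert_in_place(buffer):
    """Replace every float32 value v in a writable buffer with 1.0 - v."""
    # Reinterpret the exporter's memory as float32 without copying it.
    view = memoryview(buffer).cast("B").cast("f")
    for index in range(len(view)):
        view[index] = 1.0 - view[index]

# array('f') stores float32 values and exports the buffer protocol
pixels = array("f", [0.0, 0.25, 0.5, 1.0])
invert_in_place(pixels)
print(list(pixels))
```

Because the memoryview refers to the same memory as the original object, the change is visible through `pixels` directly. This is the mechanism that lets libraries exchange image data without intermediate copies.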

The following file formats can be loaded directly: ASTC, BMP, BW, CUR, DDS, DEM, EXR, HDR, HGT, ICO, IMAGE, JPEG, KTX, LA, PBM, PGM, PNG, PPM, PSD, RGB, RGBA, SGI, TGA, and TIFF. The list of supported saving formats excludes only DEM and HGT files. Any other formats can be added via a C++ plugin and will function as native formats.

While it is easy to create such scripts in Python or other supported languages, this can also be avoided by using the Tellusim Image Processing Tool. The command-line options work as a pipeline of operations on the loaded images, dramatically simplifying batch processing. For example, the following command loads all JPEG images in the directory, scales them to half size, creates gamma mipmaps, encodes them to ASTC 5×5, and saves all images with an “_astc” suffix and a .ktx extension:

ts_image *.jpg -scale 0.5 -mipmaps gamma -format astc55rgbau8n -p _astc -e ktx

For GravityMark, we used the following script to convert NASA Topo maps to the appropriate dimensions and formats:

#!/bin/bash

SRC=world.topo.bathy.200412

SIZE=8192
SCALE=0.5
NAME=color
mkdir -p $NAME.$SIZE

ts_image -v -create rgbu8n 86400 43200 \
    $SRC.3x21600x21600.A1.png -insert 0     0     -remove \
    $SRC.3x21600x21600.A2.png -insert 0     21600 -remove \
    $SRC.3x21600x21600.B1.png -insert 21600 0     -remove \
    $SRC.3x21600x21600.B2.png -insert 21600 21600 -remove \
    $SRC.3x21600x21600.C1.png -insert 43200 0     -remove \
    $SRC.3x21600x21600.C2.png -insert 43200 21600 -remove \
    $SRC.3x21600x21600.D1.png -insert 64800 0     -remove \
    $SRC.3x21600x21600.D2.png -insert 64800 21600 -remove \
    -s $SCALE -cube -mipmaps gamma \
    -clone -push -face 0 -o $NAME.$SIZE/$NAME.0.jpg -remove -pop \
    -clone -push -face 1 -o $NAME.$SIZE/$NAME.1.jpg -remove -pop \
    -clone -push -face 2 -o $NAME.$SIZE/$NAME.2.jpg -remove -pop \
    -clone -push -face 3 -o $NAME.$SIZE/$NAME.3.jpg -remove -pop \
    -clone -push -face 4 -o $NAME.$SIZE/$NAME.4.jpg -remove -pop \
    -clone -push -face 5 -o $NAME.$SIZE/$NAME.5.jpg -remove -pop \
    -clone -push -format bc1rgbu8n -o $NAME.$SIZE/$NAME.bc1.ktx -remove -pop \
    -clone -push -format etc2rgbu8n -o $NAME.$SIZE/$NAME.etc2.ktx -remove -pop \
    -clone -push -format astc66rgbau8n -o $NAME.$SIZE/$NAME.astc.ktx -remove -pop
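
The insert offsets in the script follow a simple grid: the four letter columns A to D and two number rows of 21600×21600 tiles fill the 86400×43200 canvas. The sketch below reproduces those offsets. The function name is hypothetical and the tile naming is taken from the file names above.

```python
def tile_offsets(columns="ABCD", rows="12", tile=21600):
    """Return {tile name: (x, y)} pixel offsets for a grid of square tiles."""
    return {
        column + row: (column_index * tile, row_index * tile)
        for column_index, column in enumerate(columns)
        for row_index, row in enumerate(rows)
    }

offsets = tile_offsets()
for name in sorted(offsets):
    print(name, *offsets[name])
```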

The Image Processing Tool supports fast GPU compression to BC and ASTC formats. To use this, you simply need to specify the “gpu” flag for the “format” operation:

We will compare the performance and quality of the CPU and GPU encoders in the next post. Stay tuned.

November 8, 2024

SDK 40


New Engine features:
  • GLSL user structure SSBO views (references).
  • Real-time GPU ASTC encoder for 4×4, 5×4, and 5×5 blocks.
  • Bindless buffer support has been added via the BufferTable interface.
  • Bindless resource support has been added to macOS and iOS.
  • Full Brep render on macOS and iOS via Mesh shader.
Build changes:
  • ts_chword.sh, ts_echo.sh, and ts_exec.sh ask to install clang++ if it is not available.
Plugin changes:
  • interface/flow plugin for block-based editors.
  • interface/element plugin for the single CanvasElement insertion into Controls system.
New samples:
  • tests/graphics/cube_filter sample showcases real-time Cube texture diffuse convolution.
  • tests/graphics/decoder_jpeg sample showcases real-time GPU JPEG texture loading.
  • tests/graphics/encoder_astc sample showcases real-time GPU ASTC texture encoding.
  • tests/graphics/multi_window sample showcases multi-window rendering.
  • tests/platform/reference sample showcases shader references.
  • tests/platform/alpha sample showcases anti-aliased alpha tests.
  • tests/platform/atomic sample showcases buffer atomic operations.
  • tests/platform/bindless sample showcases bindless buffer and texture tables.
  • tests/platform/blending sample showcases uniform Blend color parameter.
  • tests/platform/buffer sample showcases cross-API access to an array of buffers.
  • tests/platform/indirect sample showcases MDI per-instance parameters.
  • tests/platform/multisample sample showcases AA window target.
  • tests/platform/preprocessor sample showcases macro-based function generalization.
  • tests/parallel/spatial_tree sample has optional BVH tree visualization.
  • tests/core/radix sample showcases the radix sort algorithm.
Tools changes:
  • The project generation tool can create new draft project files for Core and Engine SDK and convert Makefile-based projects to Xcode, VS, CMake, and Gradle projects.
  • ts_image supports GPU ASTC texture compression with “gpu” command line argument flag.
  • ts_shader supports texture and sampler argument buffer generation with the flags command line argument.
API changes:
  • Set and Map containers can use const char* keys for fast constant table initializations.
  • radixSort algorithm has been added to TellusimSort.h
  • String::split() methods split the string by provided delimiters.
  • String::tprintf() method provides type-based printf system with {0}, {1}, … argument accessors.
  • The draw statistics methods have been added to the Canvas interface.
  • Comparison operators for FontStyle, StrokeStyle, and GradientStyle structures have been added.
  • BufferTable interface for bindless Buffer data has been added.
  • Command and Compute functions that receive Array<Sampler|Texture|Buffer|Tracing|*Table> arguments have been added.
  • ShaderCompiler flags for generated shader output control (MTLIndirect enables argument buffers for all textures and samplers on Metal).
  • BrepModel interface for low-level Brep rendering has been added.
  • EncoderASTC interface for GPU ASTC texture compression has been added.
  • Dedicated HDR ASTC color formats have been added.
  • LDR ASTC formats have u8n and u8ns postfixes.
  • Dynamic storage buffer argument has been removed from Kernel, Pipeline, and Traversal interface. The more flexible replacement is the BindFlags enum.
  • Improved Kernel, Pipeline, and Traversal interfaces support BufferTable and TextureTable bindings, Shader::Mask, and BindingFlags.
  • Swizzled viewport rendering feature has been removed from Pipeline.
  • CanvasStrip has dedicated methods for creating quadratic and cubic curves.
  • ControlArea supports content scaling and absolute children transformations.
Internal features:
  • AtomicCompareExchange operation has been fixed.
  • Swizzled accessors to bindless buffers have been added to HLSL shader generator.
  • HLSL and MSL shader translators support bindless arrays for buffers and textures.
  • Correct structure and array access for atomic and load/store operations in HLSL.
  • Fast MeshAttribute and MeshIndices allocation for addAttribute() and addIndices() methods.
  • ControlSplit renders lines via the CanvasStrip element.
  • WinApp Window supports all mouse buttons and wheel events.
  • Emscripten Window supports mouse wheel events.
  • Correct WebGPU window initialization due to the latest WebGPU API update.
  • Metal private resources are Heap allocated.
  • Material shader pragmas can control the pipeline type (Vertex or Mesh shader mode).
  • SceneTexture supports non-uniform block compression sizes.

September 25, 2024

09 Hello Controls


September 1, 2024

08 Hello Canvas


September 1, 2024

SDK 39


September 18, 2023

07 Hello Splatting


August 28, 2023

06 Hello Traversal


August 13, 2023

05 Hello Tracing


August 12, 2023

04 Hello Raster


July 31, 2023

03 Hello Mesh


July 7, 2023

Scene Import


June 25, 2023

02 Hello Compute


May 15, 2023

01 Hello USDZ


May 14, 2023

00 Hello Triangle


April 24, 2023

WebGPU Update


April 4, 2023

Tellusim Upscaler Demo


February 10, 2023

DLSS 3.1.1 vs DLSS 2.4.0


January 31, 2023

Dispatch, Dispatch, Dispatch


October 28, 2022

Tellusim upscaler


October 14, 2022

Upscale SDK comparison


September 20, 2022

Improved Blue Noise


June 19, 2022

Intel Arc 370M analysis


January 16, 2022

Mesh Shader Emulation


December 16, 2021

Mesh Shader Performance


October 10, 2021

Blue Noise Generator


October 7, 2021

Ray Tracing versus Animation


September 24, 2021

Ray Tracing Performance Comparison


September 13, 2021

Compute versus Hardware


September 9, 2021

MultiDrawIndirect and Metal


September 4, 2021

Mesh Shader versus MultiDrawIndirect


June 30, 2021

Shader Pipeline