
A Statistics-Driven Differentiable Approach for Sound Texture Synthesis and Analysis

Esteban Gutiérrez1, Frederic Font1, Xavier Serra1, and Lonce Wyse1

1 Department of Information and Communications Technologies, Universitat Pompeu Fabra

📄 Paper · GitHub: TexDSP repository · GitHub: TexStat repository · 📚 BibTeX

This webpage provides supplementary materials for our paper "A Statistics-Driven Differentiable Approach for Sound Texture Synthesis and Analysis", to be presented at the International Conference on Digital Audio Effects (DAFx25) in Ancona, Italy.

1. Introduction

In this work we introduce TexStat, a perceptually grounded loss function inspired by McDermott and Simoncelli's work. Alongside it, we present TexEnv, a lightweight differentiable synthesizer, and TexDSP, a DDSP-style generative model tailored for texture audio. All tools are open-source, implemented in PyTorch, and designed for efficient training and evaluation. A small set of highlighted examples generated with TexDSP can be found below.


Figure 1.1. TexDSP architecture diagram.

Fire Model
Water Model
Wind Model
Fire Model
Bubbles Model
Wind Model
Sound Examples 1.1. Sound examples generated with TexDSP models trained using the TexStat loss function.

2. Models

This work introduces three tools that can be used for texture sound analysis and synthesis. These models can work in conjunction, but also as components of other models. A brief introduction to each of them can be found here.

2.1. TexStat Loss 📖

TexStat is a loss function based on a direct comparison of a revised version of McDermott and Simoncelli's summary statistics [McDermott et al., 2020]. This approach allows TexStat to train texture sound generative models by focusing strictly on the statistical properties of sounds rather than on the sounds themselves. As a result, the synthesized textures naturally differ from the original inputs while still preserving the essential perceptual qualities that define their type.


Figure 2.1. TexStat's summary statistics extraction diagram.
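To make the idea concrete, here is a minimal sketch of a statistics-matching loss in PyTorch. It is an illustration only, not the released TexStat implementation: it compares low-order moments and band correlations of subband envelopes, whereas TexStat uses the full revised McDermott and Simoncelli statistics set.

```python
import torch

def toy_summary_stats(envs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Toy statistics for subband envelopes of shape (batch, bands, time):
    per-band mean, std, skewness, kurtosis, plus pairwise band correlations,
    flattened into a single feature vector per batch item."""
    mu = envs.mean(dim=-1, keepdim=True)
    sigma = envs.std(dim=-1, keepdim=True)
    z = (envs - mu) / (sigma + eps)                 # standardized envelopes
    skew = z.pow(3).mean(dim=-1)                    # third moment
    kurt = z.pow(4).mean(dim=-1)                    # fourth moment
    corr = torch.einsum('bit,bjt->bij', z, z) / envs.shape[-1]
    return torch.cat(
        [mu.squeeze(-1), sigma.squeeze(-1), skew, kurt, corr.flatten(1)], dim=-1
    )

def toy_stats_loss(x_envs: torch.Tensor, y_envs: torch.Tensor) -> torch.Tensor:
    # Compare statistics rather than waveforms: two different realizations
    # of the same texture should yield (nearly) identical statistics.
    return torch.nn.functional.mse_loss(
        toy_summary_stats(x_envs), toy_summary_stats(y_envs)
    )
```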

2.2. TexEnv Synthesizer 📖

TexEnv is a differentiable signal processor that uses the inverse discrete Fourier transform to create a set of cyclic functions, which are then imposed as amplitude envelopes on a subband decomposition of white noise.


Figure 2.2. TexEnv synthesizer diagram.
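The mechanism can be sketched in a few lines of PyTorch. This is a simplified stand-in rather than the released TexEnv code, and it assumes the subband noise has already been generated:

```python
import torch

def toy_texenv(coeffs: torch.Tensor, noise_bands: torch.Tensor) -> torch.Tensor:
    """coeffs: (bands, n_coeffs) complex Fourier coefficients per band.
    noise_bands: (bands, time) subband-filtered white noise."""
    n_time = noise_bands.shape[-1]
    # The inverse real FFT turns each band's coefficients into a cyclic
    # (periodic) amplitude envelope of length n_time.
    envs = torch.fft.irfft(coeffs, n=n_time, dim=-1)
    envs = torch.relu(envs)                 # envelopes must be non-negative
    # Impose the envelopes on the noise subbands and sum the result.
    return (envs * noise_bands).sum(dim=0)
```

Because every step (IDFT, rectification, multiplication) is differentiable, gradients from a loss such as TexStat can flow back to the coefficients.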

2.3. TexDSP Architecture 📖

TexDSP is an architecture based on Differentiable Digital Signal Processing (DDSP) [Engel et al., 2020], introduced here to showcase the two previous models. At its core, TexDSP is a simple neural network that learns, guided by the TexStat loss function, to map simple features to the parameters TexEnv needs to generate a particular texture sound. It does so by finding statistical patterns in the amplitude envelopes of a subband decomposition of the training data and combining them. A figure summarizing this architecture can be found below.


Figure 2.3. TexDSP architecture diagram.
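Schematically, the network only has to map a small per-frame feature vector to TexEnv's coefficients. The sketch below uses hypothetical feature and parameter shapes; the actual architecture and feature set are described in the paper:

```python
import torch
import torch.nn as nn

class ToyTexDSP(nn.Module):
    """Maps per-frame features (e.g. loudness trajectories) to the complex
    Fourier coefficients a TexEnv-style synthesizer expects."""

    def __init__(self, n_features: int = 2, n_bands: int = 16, n_coeffs: int = 256):
        super().__init__()
        self.n_bands, self.n_coeffs = n_bands, n_coeffs
        self.net = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_bands * n_coeffs * 2),   # real and imaginary parts
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        out = self.net(feats).view(-1, self.n_bands, self.n_coeffs, 2)
        return torch.view_as_complex(out.contiguous())  # (batch, bands, coeffs)
```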

3. Experiments and Sound Examples

Several experiments were conducted to validate the ideas and models proposed in this work. The details regarding all these experiments can be found here.

3.1. TexStat Properties 📊

Two desirable properties for a loss function tailored to texture sounds are stability under time shifting and robustness to added noise. To evaluate these in the TexStat loss, we measured the loss between original and transformed sounds from the MicroTex dataset, focusing specifically on the Freesound class. This subset was selected because it includes the most representative environmental textures—long and dynamic enough to permit meaningful transformations. The other two classes were excluded as their sounds are generally too short or too quiet for these operations without introducing significant distortions. For comparison, the same analysis was conducted using the MSS loss, and a summary of the results is shown below.

Transformation   TexStat                                     MSS
                 10%           30%           50%             10%            30%            50%
Time-Shift       0.04 ± 0.03   0.04 ± 0.03   0.04 ± 0.03     6.09 ± 1.22    6.27 ± 1.38    6.29 ± 1.41
Noise-Add        2.08 ± 1.99   2.51 ± 2.21   2.65 ± 2.27     11.79 ± 4.91   16.84 ± 5.92   19.57 ± 6.26

Table: Loss measurements (mean ± std) between original sounds in the Freesound class and their time-shifted or noise-added versions. Time shift is defined as a percentage of the total signal duration, and noise level is defined by its maximum amplitude relative to the original. All values were computed over one-second segments. For reference, well-trained TexStat models typically converge below 3, while MSS loss values remain acceptable below 10.

The results show that TexStat is highly stable under time shifting, consistently incurring only a minor loss increase. It also handles added noise with resilience, displaying a sublinear increase in loss as noise intensity grows—indicating strong robustness under both transformations.
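The protocol itself is straightforward; a sketch, assuming a loss callable with the signature `loss_fn(x, y)` (the transformation definitions follow the table caption above):

```python
import torch

def robustness_check(x: torch.Tensor, loss_fn, shift_frac: float = 0.1,
                     noise_amp: float = 0.1):
    """x: (batch, time) one-second segments. Returns the loss between the
    originals and (a) circularly time-shifted and (b) noise-added versions."""
    n = x.shape[-1]
    shifted = torch.roll(x, shifts=int(shift_frac * n), dims=-1)
    # Noise level is defined relative to each signal's maximum amplitude.
    peak = x.abs().amax(dim=-1, keepdim=True)
    noisy = x + noise_amp * peak * (2 * torch.rand_like(x) - 1)
    return loss_fn(x, shifted), loss_fn(x, noisy)
```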

3.2. TexStat Benchmarks 📊

To evaluate the computational efficiency of the TexStat loss function, we benchmarked its forward computation time, backward pass duration, and GPU memory usage. These metrics were measured over multiple runs, capturing the time taken for both loss evaluation and gradient descent while monitoring memory allocation. For reference, we included measurements for other commonly used loss functions such as MSS, MSE, and MAE. The results are summarized in the table below.

Loss Forward Pass Time (ms) Backward Pass Time (ms) Memory Usage (MB)
TexStat 93.5 ± 0.4 154.6 ± 0.4 0.84 ± 2.5
MSS 3.9 ± 0.3 8.5 ± 0.3 0.85 ± 2.6
MSE 0.2 ± 0.3 0.2 ± 0.1 1.7 ± 5.0
MAE 0.1 ± 0.0 0.2 ± 0.1 0.8 ± 2.5

Table: Runtime and memory benchmarks for four loss functions on batches of 32 audio signals (each of size 65536, ~1.5 seconds at 44.1kHz). All measurements were performed using CUDA on an NVIDIA RTX 4090 GPU.

As expected, TexStat is computationally more intensive than simpler loss functions like MSE or MAE, due to its domain-specific structure. However, it maintains a comparable memory footprint to other losses, demonstrating that its expressiveness does not come at a significant memory cost.
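Timings of this kind are typically gathered with CUDA events; below is a sketch of such a measurement loop, with batch shapes as in the table (the exact harness used for the paper may differ):

```python
import torch

def benchmark(loss_fn, batch=32, length=65536, reps=100, device='cuda'):
    x = torch.randn(batch, length, device=device)
    y = torch.randn(batch, length, device=device, requires_grad=True)
    start, mid, end = (torch.cuda.Event(enable_timing=True) for _ in range(3))
    fwd_ms, bwd_ms = [], []
    torch.cuda.reset_peak_memory_stats(device)
    for _ in range(reps):
        start.record()
        loss = loss_fn(x, y)       # forward pass
        mid.record()
        loss.backward()            # backward pass
        end.record()
        torch.cuda.synchronize()
        fwd_ms.append(start.elapsed_time(mid))
        bwd_ms.append(mid.elapsed_time(end))
        y.grad = None              # reset accumulated gradients
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return fwd_ms, bwd_ms, peak_mb
```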

3.3. Summary Statistics as a Feature Vector 📊

To evaluate the effectiveness of TexStat summary statistics as a powerful feature representation—comparable to embeddings used in metrics like FAD—we conducted a classification experiment. All data from the three selections in the MicroTex dataset were segmented, and both TexStat summary statistics and VGGish embeddings [VGGish] were computed. For each feature type, we trained a downstream multi-layer perceptron (MLP) classifier with hidden layers of size 128 and 64. The performance comparison is summarized in the table below.

Model     Selection   Accuracy   Precision   Recall   F1
TexStat   BOReilly    0.94       0.94        0.94     0.94
VGGish    BOReilly    0.71       0.73        0.71     0.71
TexStat   Freesound   0.99       0.99        0.99     0.99
VGGish    Freesound   0.98       0.99        0.98     0.98
TexStat   Syntex      1.00       1.00        1.00     1.00
VGGish    Syntex      0.95       0.95        0.95     0.94

Table: Classification performance of MLP models trained using either TexStat summary statistics or VGGish embeddings. Results are shown for the three subsets of the MicroTex dataset.

These results suggest that, in the domain of texture sounds, TexStat summary statistics are a more informative representation than general-purpose embeddings like VGGish. This makes them promising candidates for use in downstream evaluation metrics and perceptual comparisons.
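The downstream classifier is deliberately simple. A sketch with scikit-learn, using the hidden layer sizes reported above; the data here is synthetic, standing in for the precomputed per-segment features (TexStat statistics or VGGish embeddings) and class labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Placeholder data: in the experiment, X holds per-segment TexStat summary
# statistics (or VGGish embeddings) and y the texture class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 128))
y = rng.integers(0, 5, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```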

3.4. TexEnv Resynthesis 🎧

We conducted an extensive exploration of TexEnv in resynthesis tasks, using a signal-processing-based parameter extractor, to better understand the synthesizer's behavior and limitations. A sketch of the extraction idea and a summary of sound examples can be found below.
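One plausible form for such an extractor, sketched under simple assumptions (rectangular FFT-mask subbands and Hilbert envelopes, truncated to the first coefficients per band; the exact extractor used in the paper may differ):

```python
import torch

def analytic_signal(x: torch.Tensor) -> torch.Tensor:
    """Analytic signal via the FFT (the standard Hilbert-transform trick)."""
    n = x.shape[-1]
    spec = torch.fft.fft(x)
    h = torch.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return torch.fft.ifft(spec * h)

def extract_texenv_params(x: torch.Tensor, n_bands: int = 16, n_coeffs: int = 256):
    """x: (time,) mono frame, assumed long enough that each band's envelope
    spectrum has at least n_coeffs bins. Returns (n_bands, n_coeffs) complex
    Fourier coefficients describing each subband's amplitude envelope."""
    n = x.shape[-1]
    spec = torch.fft.rfft(x)
    edges = torch.linspace(0, spec.shape[-1], n_bands + 1).long()
    params = []
    for b in range(n_bands):
        band_spec = torch.zeros_like(spec)
        band_spec[edges[b]:edges[b + 1]] = spec[edges[b]:edges[b + 1]]  # crude band-pass
        band = torch.fft.irfft(band_spec, n=n)
        env = analytic_signal(band).abs()              # Hilbert envelope
        params.append(torch.fft.rfft(env)[:n_coeffs])  # keep first n_coeffs
    return torch.stack(params)
```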

Input Texture
NF = 16, parameter count = 256
NF = 16, parameter count = 512
NF = 24, parameter count = 256
NF = 24, parameter count = 512
Sound Examples 3.1. Four sound textures are resynthesized using TexEnv, with parameters computed by a DSP-based parameter extractor. Synthesis is run with different combinations of parameters to test the need for bigger filterbanks and higher per-band parameter counts. Parameters are counted over frames of around 0.74 seconds (roughly 32768 samples at 44100 Hz). For reference, 16 filters with 256 parameters each (4096 values per frame) correspond to a compression ratio of about 8:1 (800%), while 24 filters with 512 parameters each (12288 values per frame) correspond to about 2.7:1 (266%).

Some key findings were the following:

  • Water-like sounds (e.g., flowing water, rain, bubbling) benefited from larger filterbanks but not larger parameter sets.
  • Crackling sounds (e.g., fireworks, bonfires) improved with larger parameter sets but were less sensitive to filterbank size.
These insights were used to determine the optimal parameters for model training.

3.5. TexDSP Trained Models 🎧📊

To demonstrate the capabilities of TexStat, we trained a set of TexDSP models using it as the sole loss function. Each model was trained with different parameters suited to specific texture sound classes. The goal was to explore how well TexStat alone could guide learning in a generative setting.

Training Details: A curated selection of texture sounds from Freesound was used per model, each tailored with unique parameters chosen from the prior resynthesis exploration (see Section 3.4). The encoder and decoder MLPs had at most 3 layers and no more than 512 parameters. This kept each model under 25 MB, suitable for real-time applications. All models used the default TexStat α and β parameters, a shared optimizer configuration, and trained for up to 1500 epochs with early stopping. For comparison, a NoiseBandNet model was also trained under default settings for each case.
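In outline, training reduces to a standard loop with early stopping. The sketch below is fully self-contained, so random tensors and an MSE loss stand in for the real features, synthesizer, and TexStat loss:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, 512))
loss_fn = nn.functional.mse_loss          # placeholder for the TexStat loss
opt = torch.optim.Adam(model.parameters())

best_val, patience, bad = float('inf'), 50, 0
for epoch in range(1500):                 # epoch cap from the setup above
    feats = torch.randn(32, 2)            # placeholder frame features
    target = torch.randn(32, 512)         # placeholder synthesis targets
    loss = loss_fn(model(feats), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    val = loss.item()                     # placeholder validation score
    if val < best_val:
        best_val, bad = val, 0
    else:
        bad += 1
        if bad >= patience:               # early stopping
            break
```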

Validation Method: A held-out subset of each dataset was resynthesized using both the TexDSP and NoiseBandNet models. We then segmented both original and resynthesized signals and computed Fréchet Audio Distance (FAD) using VGGish and our proposed summary statistics. We also computed frame-level TexStat and MSS losses, reporting mean ± standard deviation. Results are shown below.
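For reference, FAD fits a Gaussian to the embeddings of the real and generated sets and computes the Fréchet distance between them; a sketch of that computation (embedding extraction omitted):

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real: np.ndarray, emb_fake: np.ndarray) -> float:
    """emb_*: (n_segments, dim) embeddings, e.g. VGGish vectors or TexStat
    summary statistics computed on fixed-length segments."""
    mu_r, mu_f = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_f = np.cov(emb_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real   # drop tiny imaginary residue
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2.0 * covmean))
```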

Texture     FAD (VGGish)        FAD (Ours)          TexStat Loss               MSS Loss
            TexDSP     NBN      TexDSP    NBN       TexDSP       NBN           TexDSP      NBN
Bubbles     35.20      21.37    1.86      1.15      1.2 ± 0.3    0.7 ± 0.1     6.6 ± 0.3   4.7 ± 0.1
Fire        11.86      2.53     6.14      1.52      2.8 ± 2.1    1.7 ± 1.0     9.6 ± 1.3   4.5 ± 0.2
Keyboard    13.02      9.70     16.64     277.12    5.7 ± 2.0    20.0 ± 7.7    9.1 ± 0.7   13.8 ± 0.6
Rain        9.09       11.31    0.98      6.19      0.5 ± 0.2    2.4 ± 2.0     9.0 ± 0.2   9.1 ± 0.4
River       43.66      49.85    0.80      1.75      0.5 ± 0.1    0.6 ± 0.1     6.0 ± 0.6   6.7 ± 0.3
Shards      4.64       1.36     3.79      7.58      1.0 ± 0.2    1.1 ± 0.3     7.9 ± 0.2   8.8 ± 0.2
Waterfall   18.23      25.88    0.53      1.06      0.3 ± 0.0    0.4 ± 0.0     5.0 ± 0.0   6.3 ± 0.0
Wind        9.66       31.35    1.95      8.48      0.8 ± 0.5    1.1 ± 0.7     5.6 ± 0.1   5.8 ± 0.2

Table: Validation metrics for TexDSP and NoiseBandNet (NBN) models across various texture sounds. FAD metrics are computed using VGGish embeddings and our proposed feature representation (lower is better). Frame-level TexStat and MSS loss values are shown as mean ± std.

Results: These results yield three primary insights. First, performance varied between textures, mirroring observations from McDermott and Simoncelli and aligning with the limitations discussed in the last section. Second, although TexDSP was not designed for precise reconstruction, some models unexpectedly outperformed their NoiseBandNet counterparts—even in metrics favoring reconstruction. Third, the metrics derived from our models appeared to align more closely with perceptual quality as judged informally. However, to substantiate this, a formal subjective listening test would be necessary—an evaluation left for future work.

[Audio players: input and resynthesized excerpts for each trained texture model.]

Sound Examples 3.2. Resynthesized sounds from different trained TexDSP models.

3.6. TexDSP Timbre Transfer 🎧

A notable application of DDSP is timbre transfer, where a model trained on one timbre is driven by another sound. The original DDSP paper showcased this by transferring the timbre of a violin to a voice recording, using pitch and loudness as the key conditioning features. Our models can achieve similar results with textural sounds, although the process is more intricate: unlike musical timbres, where pitch plays a central role, textural sounds lack such defining features, which makes the transfer harder to control. Nevertheless, some compelling examples of this phenomenon are highlighted below, and more can be found through the link at the end of this section.
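Mechanically, the transfer amounts to driving a decoder trained on one texture with features extracted from another sound; a sketch with placeholder callables (all names hypothetical):

```python
import torch

def timbre_transfer(source_audio, model, feature_extractor, synthesizer,
                    noise_bands):
    """Render the dynamics of `source_audio` with a model trained on a
    different texture. All arguments are placeholders: `feature_extractor`
    computes the frame-level features the model was trained on, `model`
    maps them to synthesizer parameters, `synthesizer` renders audio."""
    feats = feature_extractor(source_audio)   # features of the source sound
    params = model(feats)                     # decoder trained on the target texture
    return synthesizer(params, noise_bands)   # target texture, source dynamics
```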

[Audio players: input texture, texture model, and transfer result for each example.]

  • Wind → Bubbles Model
  • Bubbles → Bubbles Model
  • Fire → Bubbles Model
Sound Examples 3.3. Timbre transfer examples using different TexDSP trained models.
🎧 See more examples here
Legend:
  • 🎧 Sound examples included
  • 📊 Numerical experiments included
  • 📖 Theory included
  • 🚧 Still under construction

Acknowledgements

This work has been supported by the project “IA y Música: Cátedra en Inteligencia Artificial y Música (TSI-100929-2023-1)”, funded by the “Secretaría de Estado de Digitalización e Inteligencia Artificial and the Unión Europea-Next Generation EU”.
