Matcha-TTS

The model has 3 inputs:

  • x: (batch_size, max_num_tokens)

  • x_lengths: (batch_size,)

  • spks: None, or of shape (batch_size,)

If spks is not None, it is passed through an embedding layer to produce a tensor of shape (batch_size, spk_embedding_dim).
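A minimal sketch of that lookup (the names n_spks and spk_embedding_dim are illustrative, not taken from the source):

```python
import torch
import torch.nn as nn

n_spks, spk_embedding_dim = 10, 64            # illustrative sizes
spk_emb = nn.Embedding(n_spks, spk_embedding_dim)

spks = torch.tensor([0, 3])                   # (batch_size,) integer speaker IDs
spks = spk_emb(spks)                          # (batch_size, spk_embedding_dim)
```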

x, x_lengths, and spks are sent to the TextEncoder.

Inside the TextEncoder, x is sent to another embedding layer, which outputs a tensor of shape (batch_size, max_num_tokens, embedding_dim); this is then transposed to (batch_size, embedding_dim, max_num_tokens).

It builds a mask from x_lengths. The mask is of shape (batch_size, max_num_tokens). Valid positions in the mask are True; padded positions are False.

The mask is then reshaped to (batch_size, 1, max_num_tokens).
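A sketch of how such a mask can be built from the lengths (the helper name sequence_mask is an assumption):

```python
import torch

def sequence_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    # True for valid positions, False for padding.
    positions = torch.arange(max_len, device=lengths.device)   # (max_len,)
    return positions.unsqueeze(0) < lengths.unsqueeze(1)       # (batch, max_len)

x_lengths = torch.tensor([5, 3])
mask = sequence_mask(x_lengths, max_len=5)    # (batch_size, max_num_tokens)
mask = mask.unsqueeze(1)                      # (batch_size, 1, max_num_tokens)
```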

Then x is sent to a prenet.

The prenet contains 3 blocks. Each block consists of Conv1D, LayerNorm, ReLU, and Dropout. The last layer of the prenet is a kernel-size-1 Conv1D, which is equivalent to a Linear layer; the advantage of using a Conv1D in place of a Linear layer is that x does not need to be transposed. Also note that the LayerNorm in the prenet is user-implemented so that it normalizes the channel dimension of a (batch, channels, time) tensor directly, which saves two transpose operations.

The prenet also contains a residual connection, which means its input and output channel counts have to be the same.

Note: inside the prenet, the mask is applied only to the input of each Conv1D; the output of the prenet is also multiplied by the mask.
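Putting these pieces together, here is a minimal sketch of such a prenet; hyperparameters and the exact masking points follow the description above, not the reference implementation:

```python
import torch
import torch.nn as nn

class LayerNorm1d(nn.Module):
    """LayerNorm over the channel dim of a (B, C, T) tensor. Normalizing
    dim 1 directly saves the two transposes nn.LayerNorm would require."""

    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

class Prenet(nn.Module):
    """3 x (Conv1D -> LayerNorm -> ReLU -> Dropout), then a 1x1 Conv1D
    (equivalent to Linear) and a residual connection."""

    def __init__(self, channels: int, kernel_size: int = 5, p_dropout: float = 0.5):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
            for _ in range(3)
        )
        self.norms = nn.ModuleList(LayerNorm1d(channels) for _ in range(3))
        self.dropout = nn.Dropout(p_dropout)
        self.proj = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        residual = x                      # residual => in/out channels match
        for conv, norm in zip(self.convs, self.norms):
            x = self.dropout(torch.relu(norm(conv(x * mask))))  # mask conv input
        return (residual + self.proj(x)) * mask                 # mask the output
```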

After the prenet, the speaker embedding is broadcast along the time axis and concatenated with x, so the resulting x is of shape (batch_size, embedding_dim + spk_embedding_dim, max_num_tokens).
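A sketch of that step (the broadcast along the time axis is an assumption implied by the shapes; sizes are illustrative):

```python
import torch

B, C, D, T = 2, 192, 64, 5                  # illustrative sizes
x = torch.randn(B, C, T)                    # prenet output
spks = torch.randn(B, D)                    # speaker embeddings
spk = spks.unsqueeze(-1).expand(-1, -1, T)  # broadcast along time: (B, D, T)
x = torch.cat([x, spk], dim=1)              # (B, C + D, T)
```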

At this point, we only need to process x and mask, where

  • x: (batch_size, embedding_dim + spk_embedding_dim, max_num_tokens)

  • mask: (batch_size, 1, max_num_tokens)

The next component is the Encoder inside the TextEncoder.

attn_mask is of shape (batch_size, 1, max_num_tokens, max_num_tokens) and is generated from mask: position (i, j) is valid only when both token i and token j are valid.
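One common way to build it is an outer product of the padding mask with itself (a sketch, not necessarily the exact code):

```python
import torch

mask = torch.ones(2, 1, 5, dtype=torch.bool)        # (B, 1, T) padding mask
attn_mask = mask.unsqueeze(2) * mask.unsqueeze(-1)  # (B, 1, T, T)
```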

The Encoder in the TextEncoder is a Transformer encoder.

After processing with the Encoder, a Conv1D projection converts the channel dimension to n_feats.

The duration predictor outputs a tensor of shape (batch_size, 1, max_num_tokens).

Note that the duration predictor predicts the log duration of each token; at the end, exp is applied to recover the per-token durations.

It then uses ceil to obtain integer durations. length_scale can optionally be applied to scale each token's duration (e.g., to control speaking rate).

Summing over tokens gives the total duration, in frames, of each utterance.
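A sketch of the whole duration pipeline (logw stands in for the predictor output; applying length_scale before the ceil is one plausible reading of the steps above):

```python
import torch

logw = torch.randn(2, 1, 5)                  # (B, 1, T) predicted log durations
mask = torch.ones(2, 1, 5)                   # (B, 1, T) padding mask
length_scale = 1.0                           # >1 slows speech, <1 speeds it up
w = torch.exp(logw) * mask                   # per-token durations, padding -> 0
w_ceil = torch.ceil(w * length_scale)        # integer frame counts per token
y_lengths = w_ceil.sum(dim=[1, 2]).long()    # total frames per utterance
```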

fix_len_compatibility(): rounds the total length up to the next multiple of 4 if it is not already one, so that the decoder, which downsamples twice, receives a compatible length.
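A sketch of that rounding (the link to the number of downsampling steps is an assumption):

```python
import math

def fix_len_compatibility(length: int, num_downsamplings: int = 2) -> int:
    # Round up to the next multiple of 2**num_downsamplings (4 by default),
    # so a U-Net-style decoder can downsample without shape mismatches.
    factor = 2 ** num_downsamplings
    return math.ceil(length / factor) * factor
```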

generate_path: despite its name, it actually returns an alignment mask of shape (batch, max_num_tokens, max_y), indicating which output frames are assigned to which token.
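A simplified sketch of the idea, not the reference implementation:

```python
import torch

def generate_path(durations: torch.Tensor, max_y: int) -> torch.Tensor:
    """Expand integer per-token durations (B, T_x) into a hard alignment
    mask (B, T_x, max_y): entry (b, i, j) is 1 iff frame j belongs to token i."""
    ends = torch.cumsum(durations, dim=1)                  # (B, T_x) segment ends
    starts = ends - durations                              # (B, T_x) segment starts
    frames = torch.arange(max_y, device=durations.device)  # (max_y,)
    path = (frames >= starts.unsqueeze(-1)) & (frames < ends.unsqueeze(-1))
    return path.float()                                    # (B, T_x, max_y)

# Token 0 covers frames 0-1, token 1 covers frames 2-4:
print(generate_path(torch.tensor([[2, 3]]), max_y=5))
```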

The decoder model

Its inputs:

  • x: (batch_size, num_feats, max_y)

  • mask: (batch_size, 1, max_y)

  • t: a scalar tensor (the flow-matching time step)

First, an embedding is computed from t using SinusoidalPosEmb. Note that it is not a standard sinusoidal positional encoding. The shape of the embedding is (batch_size, in_channels).
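A sketch of such an embedding; the differences from the standard positional encoding shown here (embedding a continuous t scaled by a large factor, and concatenating rather than interleaving sin/cos) are assumptions about what the note refers to:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPosEmb(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, t: torch.Tensor, scale: float = 1000.0) -> torch.Tensor:
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1)
        )
        args = scale * t.view(-1, 1) * freqs.view(1, -1)     # (batch, half)
        return torch.cat([args.sin(), args.cos()], dim=-1)   # (batch, dim)
```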

The embedding is then processed by TimestepEmbedding, which consists of a linear layer, an activation layer, and another linear layer. The output shape is (batch_size, time_embed_dim).
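A sketch of TimestepEmbedding under those assumptions (the SiLU activation is a guess at "an activation layer"):

```python
import torch.nn as nn

class TimestepEmbedding(nn.Module):
    def __init__(self, in_channels: int, time_embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, time_embed_dim),
            nn.SiLU(),                          # assumed activation
            nn.Linear(time_embed_dim, time_embed_dim),
        )

    def forward(self, t_emb):
        return self.mlp(t_emb)                  # (batch_size, time_embed_dim)
```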