음성인식 - 음성인식 알소리즘 소개

우리가 일상생활에서 자주 접하는 음성인식이 어떤 원리로 작동하는지 알아본다

이번 포스트로 대략적으로 음성 인식 모델이 어떠한 구조로 이루어져 있는지 알아보고 차후 포스트에서 중요한 부분들을 하나하나 살펴볼 예정이다

음성인식은 다음과 같이 6가지 파트로 구성된다

음성 전처리 Speech Preprocessing
텍스트 전처리 Text Preprocessing
데이터 증강 기법 Data Augmentation
음성인식 인코더 ASR Encoder
음성인식 디코더 ASR Decoder
언어모델 Language Model

encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module:  true
    cnn_module_kernel: 31

ESPnetASRModel https://github.com/espnet/espnet/blob/5c21f63e45e0961a5d817017c282b0cafd68a3aa/espnet2/asr/espnet_model.py#L142

PositionalEmbedding https://github.com/espnet/espnet/blob/5c21f63e45e0961a5d817017c282b0cafd68a3aa/espnet/nets/pytorch_backend/transformer/embedding.py#L173

subsampling + positionembedding + linearlayer https://github.com/espnet/espnet/blob/5c21f63e45e0961a5d817017c282b0cafd68a3aa/espnet/nets/pytorch_backend/transformer/subsampling.py#L44

conformer model https://github.com/espnet/espnet/blob/master/espnet2/asr/encoder/conformer_encoder.py

conformer block https://github.com/espnet/espnet/blob/5c21f63e45e0961a5d817017c282b0cafd68a3aa/espnet/nets/pytorch_backend/conformer/encoder_layer.py#L17

pos_enc_layer_type = “rel_pos”
https://arxiv.org/pdf/2005.09940.pdf 에 소개된 positional encoding 방식이다

positional embedding을 transformer에서 해주지 않을 경우 “Queen And King” 과 “King And Queen”의 결과가 동일하게 나올 것이다

absolute한 position이 아닌 relative position을 고려해야할 경우를 알아본다

distance between token is important

token 마다 positional embedding이 한개만 존재하는 것이 아닌 다른 token 들 모두에 relationship을 고려한다

5개의 토큰이 있다면 결과가 5개가 나온다

w_0 w_1 w_2 w_3 w_4 w_5 … w_-3 w_-2 w_-1, w_0, w_1, w_2 … w_-5 w_-4 w_-3 w_-2 w_-1 w_0

5개의 positional embedding 결과가 나온다

clipping 이 제시될 경우 k = -2 2,-2 를 벗어날 경우 -2와 2의 결과를 준다

모든 토큰을 다 더해야 할까 어떻게 해야할까

self-attention 을 이용한다?

weighted sum over tokens

Conv2dSubsampling

https://github.com/espnet/espnet/blob/5c21f63e45e0961a5d817017c282b0cafd68a3aa/espnet/nets/pytorch_backend/transformer/subsampling.py#L44

Subsampling Layer

[논문의 음성 feature extraction 방식]
80-channel filterbanks features computed from 25ms window with a stride of 10ms

10ms 단위의 음성 데이터를 40ms 단위로 subsampling한다

self.conv = torch.nn.Sequential(
    torch.nn.Conv2d(1, odim, 3, 2),
    torch.nn.ReLU(),
    torch.nn.Conv2d(odim, odim, 3, 2),
    torch.nn.ReLU(),
)
self.out = torch.nn.Sequential(
    torch.nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim),
    pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
)

마지막 과정에서 relative positional embedding 을 한

Conv2dSubsampling

여기에서 conformer block 이전 과정들이랑 pos_enc까지 모두 해주는거넹?

Conformer.EncoderLayer(nn.Module)

https://github.com/espnet/espnet/blob/5c21f63e45e0961a5d817017c282b0cafd68a3aa/espnet/nets/pytorch_backend/conformer/encoder_layer.py#L17

x, pos_emb = x_input[0], x_input[1]

positional embedding 된 결과

if positionwise_layer_type == “linear”: positionwise_layer = PositionwiseFeedForward positionwise_layer_args = ( output_size, linear_units, dropout_rate, activation, )

x = residual + stoch_layer_coeff * self.ff_scale * self.dropout( self.feed_forward_macaron(x) )

내일 가서 할 것

Encoder Layers , conformer 개수 Encoder Dim : Attention Heads Conv Kernel size Decoder Layers Decoder Dim

Conv2dSubsampling

Subsampling Layer

Conv2dSubsampling

Conformer.EncoderLayer(nn.Module)

내일 가서 할 것

Share this post