Character-Aware Neural Language Models (2016)

hwkim94 edited this page Jan 24, 2018 · 18 revisions

Abstract

Uses a CNN, a highway network, an LSTM, and an RNN-LM.
Works well for morphologically rich languages (Arabic, Czech, French, German, Spanish, Russian, etc.).
Character-level modeling can capture semantic and orthographic information at the same time.

Introduction

On the Penn Treebank corpus, the number of parameters could be reduced by 60%.
On morphologically rich languages, the model outperformed other LSTM-based models while using fewer parameters.

Model

Architecture

architecture

RNN

LSTM

RNN-LM

char-CNN
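
As a rough illustration of the char-CNN step, the sketch below applies one filter of width w over a word's character embeddings, passes each window through tanh, and keeps the maximum over positions (max-over-time pooling). It is a minimal pure-Python sketch, not the authors' implementation; the function names `char_cnn_feature` and `char_cnn` and the list-of-lists layout are assumptions made for illustration.

```python
import math

def char_cnn_feature(char_embeddings, kernel, bias=0.0):
    """One filter: narrow convolution + tanh + max-over-time pooling.

    char_embeddings: list of l character vectors, each of dimension d
    kernel:          list of w vectors, each of dimension d (hypothetical layout)
    """
    l, w = len(char_embeddings), len(kernel)
    if l < w:
        raise ValueError("word is shorter than the filter width; pad it first")
    feature_map = []
    for i in range(l - w + 1):
        # Inner product between the filter and the window starting at position i.
        s = bias
        for j in range(w):
            s += sum(c * h for c, h in zip(char_embeddings[i + j], kernel[j]))
        feature_map.append(math.tanh(s))
    # Max-over-time pooling: keep the strongest response of this filter.
    return max(feature_map)

def char_cnn(char_embeddings, kernels, biases):
    """Word representation: one pooled feature per filter."""
    return [char_cnn_feature(char_embeddings, k, b) for k, b in zip(kernels, biases)]
```

Each filter contributes a single number, so a word of any length maps to a vector whose size equals the number of filters.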

highway network
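
A highway layer mixes a nonlinear transform of its input with the input itself, using a learned transform gate t: z = t * g(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T). The sketch below is a minimal pure-Python version under the assumption that g is ReLU; `matvec` and `highway_layer` are hypothetical helper names, and the weights are passed in as plain lists.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def highway_layer(y, W_H, b_H, W_T, b_T):
    """One highway layer; input and output have the same dimension."""
    # Candidate transform g(W_H y + b_H), with g assumed to be ReLU.
    h = [max(0.0, a + b) for a, b in zip(matvec(W_H, y), b_H)]
    # Transform gate t in (0, 1); (1 - t) acts as the carry gate.
    t = [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(matvec(W_T, y), b_T)]
    return [ti * hi + (1.0 - ti) * yi for ti, hi, yi in zip(t, h, y)]
```

With a strongly negative gate bias the layer passes its input through almost unchanged, which is why the gate bias is often initialized to a negative value.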

Experimental Setup

Data set
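Evaluation metric: perplexity (PPL)
- PPL (perplexity of a model over a sequence [w_1, ..., w_T]): `PPL = exp(NLL / T)`
- NLL (negative log-likelihood): `NLL = -sum_{t=1..T} log Pr(w_t | w_1, ..., w_{t-1})`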
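
Perplexity is the exponential of the average per-token negative log-likelihood. The sketch below computes both quantities from the probabilities a model assigned to each token of a sequence; the function names and the `token_probs` input format are assumptions made for illustration.

```python
import math

def negative_log_likelihood(token_probs):
    """NLL of a sequence, given the model probability of each observed token."""
    return -sum(math.log(p) for p in token_probs)

def perplexity(token_probs):
    """PPL = exp(NLL / T), where T is the sequence length."""
    return math.exp(negative_log_likelihood(token_probs) / len(token_probs))
```

For example, a model that assigns probability 0.25 to every token behaves like a uniform choice among 4 words and has a perplexity of 4.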
Hyperparameters were tuned on the English Penn Treebank and then applied to the other languages.
Only singleton words were replaced with `<unk>`.
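
The singleton replacement can be sketched as follows. This is a minimal illustration, assuming the corpus is already tokenized into a flat list and that the unknown-word token is spelled `<unk>`; `replace_singletons` is a hypothetical helper name.

```python
from collections import Counter

def replace_singletons(tokens, unk="<unk>"):
    """Replace every word that occurs exactly once in the corpus with `unk`."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else unk for tok in tokens]
```

Words that occur two or more times are kept, so the vocabulary stays much larger than with a fixed top-k cutoff.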

Optimization

architecture2

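Trained by backpropagation with stochastic gradient descent.
Trained for 25 epochs on non-Arabic data and 30 epochs on Arabic data.
Dropout is used for regularization:
- probability 0.5 on the LSTM input-to-hidden layers (except on the initial Highway to LSTM layer) and the hidden-to-output softmax layer.
The gradient norm is constrained to be <= 5; if it exceeds 5, the gradient is rescaled to norm 5 before the update.
Number of clusters for the hierarchical softmax: c = ceiling(square_root(|V|)), where |V| is the vocabulary size.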
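
The last two settings are simple arithmetic and can be sketched as below. This is a minimal pure-Python illustration, assuming the gradient is a flat list of floats; `clip_gradient_norm` and `num_clusters` are hypothetical names, and `math.isqrt` requires Python 3.8 or later.

```python
import math

def clip_gradient_norm(grad, max_norm=5.0):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def num_clusters(vocab_size):
    """c = ceil(sqrt(V)), computed with integer arithmetic to avoid rounding issues."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    return math.isqrt(vocab_size - 1) + 1
```

Clipping changes only the length of the gradient, not its direction, so the update still points the same way.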

Result

result

With the same architecture, character-based input performs better than word- or morpheme-based input.

English Penn Treebank

result1

The small model outperformed previous results (specific figures are not shown here).
- Small model: 200 hidden units (word embedding size = 200)
The large model used 60% fewer parameters than previous models.
- Large model: 650 hidden units (word embedding size = 650)

Other Language

result2

Compared with the MLBL model (Botha, 2014), the model also performed better with markedly fewer parameters.
- MLBL is a morphological log-bilinear model, not an LSTM-based one
Significant perplexity reductions are observed even on English when V (the vocabulary size) is large.
No significant differences are observed going from word to morpheme LSTMs on Spanish, French, and English.

Discussion

Learned Word Representation

result3

The model learned better word representations with the highway layer than without it.
Out-of-vocabulary (OOV) words were also learned well.

Learned Character N-gram Representation

result4

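With the char-CNN, morphemes were often tagged incorrectly.
The model distinguishes prefixes, suffixes, and hyphenated n-grams from one another.
- prefix = an affix attached to the start of a word, suffix = an affix attached to the end of a word, hyphenated = words joined by a hyphen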

Highway Layer

result5

An MLP layer performs poorly.
A single highway layer gives the best performance.
Stacking more convolution layers before max-pooling does not improve performance.
Using the highway network alone, without the CNN, does not help.

Effect of Corpus/Vocab size

As the corpus grows, the relative perplexity reduction shrinks.
However, the char-CNN still performs better in all cases as the corpus grows.

Further Observation

Reports on several additional experiments:

Combining word embeddings with the CharCNN’s output to form a combined representation of a word (to be used as input to the LSTM) resulted in slightly worse performance
The model takes longer to run than the other models, but it uses the GPU more efficiently.

Reference

https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewFile/12489/12017