Poster
SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models
Xin Zhang · Dong Zhang · Shimin Li · Yaqian Zhou · Xipeng Qiu
Halle B
Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
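For readers unfamiliar with residual vector quantization, the following is a minimal pure-Python sketch of the generic technique, not the SpeechTokenizer implementation or its API. It assumes a single encoder output frame and small random (untrained) codebooks. The names `nearest`, `rvq_encode`, and `rvq_decode` and all sizes are hypothetical and chosen only for illustration. Each layer quantizes what the previous layers left unexplained, so a frame is represented by one token per layer.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual left by the previous ones."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        tokens.append(index)
        # Subtract the chosen entry so the next layer sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected entry from each layer."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for index, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Toy setup (assumed values): 3 RVQ layers, 8 entries per codebook, 4-dimensional frames.
# Real codebooks are learned during training; these are random placeholders.
rng = random.Random(0)
dim, n_layers, codebook_size = 4, 3, 8
codebooks = [
    [[rng.gauss(0.0, 1.0 / (layer + 1)) for _ in range(dim)] for _ in range(codebook_size)]
    for layer in range(n_layers)
]

frame = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for one encoder output frame
tokens = rvq_encode(frame, codebooks)               # one token per RVQ layer
recon = rvq_decode(tokens, codebooks)               # sum of the selected codebook entries
```

Decoding with only a prefix of the layers, for example `rvq_decode(tokens[:1], codebooks[:1])`, gives a coarser reconstruction. That layered structure is what allows different RVQ layers to carry different aspects of the signal.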