MiningLamp Technology's Blockformer speech recognition model achieves SOTA results on the AISHELL-1 test set
2022-09-13
MiningLamp Technology will soon open-source its Blockformer speech recognition model, enhancing conversation intelligence throughout the sales process and helping industries across the board with their digital and intelligent transformation.
Deep learning has been successfully applied to speech recognition, and a wide variety of neural networks have been studied and explored, such as the deep neural network (Deep Neural Network, DNN), the convolutional neural network (Convolutional Neural Network, CNN), the recurrent neural network (Recurrent Neural Network, RNN), and end-to-end neural network models.
At present there are three main end-to-end model frameworks: the Neural Transducer (NT), the Attention-based Encoder-Decoder (AED), and Connectionist Temporal Classification (CTC).
NT is an enhanced version of CTC. It introduces a prediction network module, analogous to the language model in conventional speech recognition frameworks, and its decoder takes the previously predicted history as context input. NT training is unstable and requires more memory, which can limit training speed.
AED consists of an encoder, a decoder, and an attention module: the encoder encodes the acoustic features, the decoder generates the sentence, and the attention mechanism aligns the encoder's input features with the decoding states. Many ASR system architectures in industry are based on AED. However, an AED model emits its output unit by unit, where each unit depends both on previously generated results and on subsequent context, which introduces recognition latency.
In addition, in real-world speech recognition tasks, the alignment produced by AED's attention mechanism can sometimes be corrupted by noise.
CTC decodes faster than AED, but because of the conditional independence between output units and the lack of a language-model constraint, its recognition accuracy still has room for improvement.
There is now a body of research on fusing the AED and CTC frameworks: multi-task learning with a shared encoder, trained jointly on both the CTC and AED objectives. On the model-architecture side, the Transformer has shown enormous advantages in machine translation, speech recognition, and computer vision.
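The shared-encoder multi-task idea above can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not MiningLamp's actual code: a GRU stands in for the Transformer encoder, the decoder omits cross-attention for brevity, and names such as `ctc_weight` are assumptions.

```python
import torch
import torch.nn as nn

class JointCTCAttentionASR(nn.Module):
    """Sketch of joint CTC/AED training with one shared encoder.

    The total loss interpolates the two objectives:
        L = ctc_weight * L_CTC + (1 - ctc_weight) * L_AED
    """
    def __init__(self, feat_dim=80, hidden=64, vocab=100, ctc_weight=0.3):
        super().__init__()
        self.ctc_weight = ctc_weight
        # Shared encoder over acoustic features (stand-in for a Transformer).
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        # CTC branch: frame-level classifier, blank symbol = index 0.
        self.ctc_proj = nn.Linear(hidden, vocab)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        # AED branch: autoregressive decoder (cross-attention omitted).
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.att_proj = nn.Linear(hidden, vocab)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feats, feat_lens, tokens, token_lens):
        enc, _ = self.encoder(feats)                      # (B, T, H)
        # CTC branch: per-frame log-probs, shaped (T, B, V) for nn.CTCLoss.
        log_probs = self.ctc_proj(enc).log_softmax(-1)
        loss_ctc = self.ctc_loss(log_probs.transpose(0, 1), tokens,
                                 feat_lens, token_lens)
        # AED branch: teacher forcing, seeded with the last encoder state.
        dec_in = self.embed(tokens[:, :-1])               # (B, U-1, H)
        h0 = enc[:, -1].unsqueeze(0).contiguous()         # (1, B, H)
        dec_out, _ = self.decoder(dec_in, h0)
        logits = self.att_proj(dec_out)                   # (B, U-1, V)
        loss_att = self.ce(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att

# Toy forward pass: 2 utterances, 50 frames of 80-dim features, 10 tokens.
torch.manual_seed(0)
model = JointCTCAttentionASR()
feats = torch.randn(2, 50, 80)
feat_lens = torch.full((2,), 50, dtype=torch.long)
tokens = torch.randint(1, 100, (2, 10))                   # avoid blank (0)
token_lens = torch.full((2,), 10, dtype=torch.long)
loss = model(feats, feat_lens, tokens, token_lens)
```

Interpolating the two losses lets the monotonic CTC alignment regularize the attention decoder during training; at inference time the two branches' scores can also be combined in rescoring.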
Ã÷ÂԿƼ¼¼¯ÍŸ߼¶×ܼࡢÓïÒôÊÖÒÕÈÏÕæÈËÖì»á·åÏÈÈÝ£¬£¬£¬£¬£¬£¬£¬Ã÷ÂÔÍŶÓÖØµãÑо¿ÁËÔÚCTCºÍAEDÈÚºÏѵÁ·¿ò¼ÜÏ£¬£¬£¬£¬£¬£¬£¬ÔõÑùʹÓÃTransformerÄ£×ÓÀ´Ìá¸ßʶ±ðЧ¹û¡£¡£¡£¡£¡£¡£¡£

Through visualization, the MiningLamp team analyzed the attention information across different blocks and heads. The diversity of this information proved very helpful: the output of any single block in the encoder or decoder is not fully complete on its own, and the blocks may be complementary to one another. (https://doi.org/10.48550/arXiv.2207.11697)
Based on this insight, the MiningLamp team proposed a model architecture, the Block-augmented Transformer (Blockformer), studied how to fuse the essential information of each block in a complementary, parameterized way, and implemented two block-ensemble methods: the Weighted Sum of the Blocks Output (Base-WSBO) and the Squeeze-and-Excitation module applied to WSBO (SE-WSBO).
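The two fusion methods named above can be sketched as follows. This is an illustrative PyTorch sketch of the general idea, under assumed shapes and module details, not the paper's exact implementation: Base-WSBO learns one scalar weight per block, while SE-WSBO derives input-dependent weights with a Squeeze-and-Excitation bottleneck.

```python
import torch
import torch.nn as nn

class BaseWSBO(nn.Module):
    """Weighted sum of N block outputs with learnable scalar weights."""
    def __init__(self, num_blocks):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_blocks))

    def forward(self, block_outs):                   # list of (B, T, D)
        stacked = torch.stack(block_outs, dim=0)     # (N, B, T, D)
        w = torch.softmax(self.weights, dim=0)       # normalize over blocks
        return (w.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, T, D)

class SEWSBO(nn.Module):
    """Squeeze-and-Excitation over the block axis: global-average-pool
    each block's output ("squeeze"), then a small bottleneck MLP emits
    per-block weights conditioned on the input ("excitation")."""
    def __init__(self, num_blocks, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_blocks, num_blocks // reduction),
            nn.ReLU(),
            nn.Linear(num_blocks // reduction, num_blocks),
            nn.Sigmoid(),
        )

    def forward(self, block_outs):
        stacked = torch.stack(block_outs, dim=0)             # (N, B, T, D)
        squeeze = stacked.mean(dim=(2, 3)).transpose(0, 1)   # (B, N)
        w = self.fc(squeeze).transpose(0, 1)                 # (N, B)
        return (w.unsqueeze(-1).unsqueeze(-1) * stacked).sum(0)

# Toy usage: fuse the outputs of 6 blocks, each (batch=2, T=20, D=32).
torch.manual_seed(0)
blocks = [torch.randn(2, 20, 32) for _ in range(6)]
fused_base = BaseWSBO(6)(blocks)
fused_se = SEWSBO(6)(blocks)
```

In both cases the fused tensor keeps the per-layer shape, so it can replace the final block's output without touching the rest of the encoder or decoder; SE-WSBO differs from Base-WSBO only in that its weights are recomputed for every input.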


Experiments show that on the Mandarin Chinese test set AISHELL-1, the Blockformer model achieves a 4.35% CER without a language model and a 4.10% CER with one.



AISHELL-1 is an open-source Mandarin Chinese speech corpus released by Beijing Shell Shell Technology (AISHELL) in 2017, with 178 hours of recordings produced by 400 speakers from different regions of China. The corpus covers 11 domains, including smart homes, autonomous driving, and industrial production; it is widely used in speech technology development and experiments, and is one of the authoritative benchmarks for Mandarin speech recognition evaluation today.
According to the AI wiki site Papers With Code, Blockformer achieved SOTA recognition results on AISHELL-1, lowering the character error rate to 4.10% (with a language model).
(https://paperswithcode.com/sota/speech-recognition-on-aishell-1)
Hao Jie, CTO of MiningLamp Technology Group, said that MiningLamp's conversation intelligence products target sales scenarios built on online WeCom conversations and offline in-store conversations, and that the speech recognition team focuses on scenario optimization and customized training for industries such as beauty, automotive, and education, while not letting up on its exploration of new general-purpose speech recognition frameworks and models. Blockformer's SOTA result provides a high starting point for customized speech recognition optimization, and MiningLamp will open-source Blockformer soon.