I2R, A*STAR, Singapore

Shuo Sun

Technical Lead of MERaLiON, Singapore's national multimodal AI initiative. Working on multimodal large language models, speech processing, and language technologies for Southeast Asia.


01 — About

Building the future of multimodal AI

I am a researcher and technical lead at the Institute for Infocomm Research (I2R), A*STAR, Singapore. I lead the development of MERaLiON—Singapore's national multimodal large language model initiative—building AI systems that understand both speech and language with a focus on Southeast Asian contexts.

I received my Ph.D. in Computer Science from Johns Hopkins University, where I was part of the Center for Language and Speech Processing (CLSP), advised by Kevin Duh. Before JHU, I was a research engineer at the Baidu-I2R Research Centre in Singapore.

National AI Initiative

Leading Singapore's flagship multimodal LLM programme

Speech & Language

Models from 0.6B to 10B parameters

Southeast Asia Focus

Designed for regional languages and accents

02 — Flagship Project

MERaLiON

Multimodal Empathetic Reasoning and Learning in One Network
MERaLiON is a family of multimodal AI models designed for speech and language understanding, with a focus on Southeast Asian languages and accents. The project spans speech-language models, speech encoders, automatic speech recognition, and speech emotion recognition—powering applications from conversational AI to multilingual audio understanding.
National LLM Funding Initiative · NRF Singapore & IMDA
Audio Language Models
End-to-end speech-language models that jointly understand audio and text. These models accept raw speech input and generate text responses, enabling conversational AI, audio question-answering, and speech understanding.
  • MERaLiON-3 is the latest generation with improved multilingual and reasoning capabilities
  • Available in 3B and 10B parameter sizes for different deployment needs
  • Optimized for Southeast Asian languages, accents, and acoustic conditions
  • MERaLiON-AudioLLM integrates a Whisper-based speech encoder with the SEA-LION language model for robust regional speech understanding
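As a rough illustration of how such an end-to-end speech-language model is queried, here is a minimal sketch using the Hugging Face transformers library. The repository ID, the processor call signature, and the prompt format are assumptions made for illustration only; the official model card documents the exact interface.

```python
# Minimal sketch of querying a speech-language model such as MERaLiON-AudioLLM.
# The repo ID, prompt wording, and processor keyword arguments are assumptions;
# consult the official model card for the supported usage.
import librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

MODEL_ID = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"  # assumed repo ID

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load 16 kHz mono audio and pair it with a text instruction.
speech, _ = librosa.load("meeting_clip.wav", sr=16000)
prompt = "Please transcribe this recording and summarise the speaker's request."

inputs = processor(text=prompt, audios=speech, sampling_rate=16000, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```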
Automatic Speech Recognition
Dedicated ASR model built on MERaLiON-2 for high-accuracy speech-to-text transcription, with strong performance on Southeast Asian accented English and regional languages.
  • Fine-tuned specifically for transcription accuracy on diverse accents
  • Strong performance on Singaporean English, Mandarin-accented English, and Malay
  • Built on the 10B MERaLiON-2 backbone for robust speech representations
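A transcription call might look like the sketch below, assuming the released ASR checkpoint is compatible with the standard transformers automatic-speech-recognition pipeline; the model ID is an assumption, and the checkpoint may require its own loading code instead.

```python
# Illustrative sketch: speech-to-text with the standard transformers ASR pipeline.
# The model ID is an assumption; see the official model card for the actual interface.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="MERaLiON/MERaLiON-2-10B-ASR",  # assumed repo ID
    trust_remote_code=True,
)

# Transcribe a local recording of Singaporean accented English.
result = asr("call_centre_sample.wav")
print(result["text"])
```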
Speech Encoders
Foundation speech encoders that extract rich acoustic representations from raw audio. These serve as the audio backbone for downstream tasks like ASR, speaker identification, and emotion detection.
  • SpeechEncoder-2 offers improved feature extraction over v1 with better multilingual coverage
  • Compact 0.6B parameter size enables efficient deployment
  • Can be used as a plug-in encoder for building custom audio-language pipelines
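The plug-in encoder pattern looks roughly like the following: extract frame-level acoustic representations, then pool or feed them into a downstream head. The repository ID and its pairing with the generic AutoModel/AutoFeatureExtractor classes are assumptions for illustration.

```python
# Sketch: extracting acoustic representations from a foundation speech encoder
# for a downstream task. The repo ID and auto-class compatibility are assumptions.
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModel

ENCODER_ID = "MERaLiON/MERaLiON-SpeechEncoder-v2"  # assumed repo ID

feature_extractor = AutoFeatureExtractor.from_pretrained(ENCODER_ID, trust_remote_code=True)
encoder = AutoModel.from_pretrained(ENCODER_ID, trust_remote_code=True)

speech, _ = librosa.load("utterance.wav", sr=16000)
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = encoder(**inputs).last_hidden_state  # (batch, frames, dim)

# Pool over time to get one utterance-level embedding for a custom classifier.
utterance_embedding = hidden_states.mean(dim=1)
print(utterance_embedding.shape)
```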
Speech Emotion Recognition
Detects emotional states from speech audio, enabling empathetic AI interactions. Classifies emotions such as happiness, sadness, anger, fear, and neutral states from voice characteristics.
  • Multi-class emotion classification from raw speech audio
  • Trained on diverse emotional speech datasets for robust cross-domain performance
  • Lightweight 0.8B model suitable for real-time applications
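In practice, multi-class emotion prediction from an audio file could be run as in this sketch, assuming the released checkpoint works with the standard transformers audio-classification pipeline; the model ID is an assumption.

```python
# Sketch: emotion classification from raw speech, assuming pipeline compatibility.
# The model ID is an assumption made for illustration.
from transformers import pipeline

ser = pipeline(
    "audio-classification",
    model="MERaLiON/MERaLiON-SER",  # assumed repo ID
    trust_remote_code=True,
)

# Returns a ranked list of emotion labels (e.g. happy, sad, angry, neutral) with scores.
for prediction in ser("customer_call.wav"):
    print(f"{prediction['label']}: {prediction['score']:.2f}")
```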
Language Models
Instruction-tuned text language model built on LLaMA-3, fine-tuned for Southeast Asian language understanding and generation tasks including translation, summarization, and dialogue.
  • Based on Meta's LLaMA-3 architecture with regional fine-tuning
  • Instruction-tuned for helpful, safe, and contextually appropriate responses
  • Enhanced performance on Southeast Asian language tasks and cultural context
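For the text-only model, usage follows the standard chat-template pattern for instruction-tuned Llama-3-based checkpoints; the repository ID below is an assumption, and the presence of a chat template is assumed from the Llama-3 lineage.

```python
# Sketch: instruction-following with a Llama-3-based regional language model.
# The repo ID is an assumption; the chat-template flow is standard transformers usage.
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "MERaLiON/MERaLiON-LLaMA-3-8B-Instruct"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Malay instruction: "Translate to English: Good morning, how are you?"
messages = [{"role": "user", "content": "Terjemahkan ke Bahasa Inggeris: Selamat pagi, apa khabar?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```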
Datasets
Open datasets released by the MERaLiON team to advance speech and language research, with a focus on Singapore's multilingual, code-switching environment.
  • MNSC (Multitask National Speech Corpus) — A large-scale multitask speech dataset derived from IMDA's National Speech Corpus, with 15M+ rows covering ASR, spoken QA, dialogue summarization, and paralinguistic QA, specialized for Singlish and code-switching
  • CPQA (Contextual Paralinguistic Question-Answering) — An evaluation dataset of 2,647 LLM-generated QA pairs designed to assess speech-LLMs' ability to understand contextual and paralinguistic cues such as tone, emotion, and emphasis
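Given the corpus size, streaming access with the Hugging Face datasets library is the natural way to explore MNSC; the dataset repository ID, configuration name, and field names in this sketch are assumptions, so check the dataset card for the actual layout.

```python
# Sketch: streaming a slice of MNSC with the Hugging Face `datasets` library.
# Repo ID, configuration name, and field names are assumptions for illustration.
from datasets import load_dataset

mnsc = load_dataset(
    "MERaLiON/Multitask-National-Speech-Corpus",  # assumed repo ID
    "ASR",                                        # assumed task configuration
    split="train",
    streaming=True,                               # avoid downloading 15M+ rows up front
)

for example in mnsc.take(3):
    print(example.keys())  # e.g. audio, transcription, task metadata (names assumed)
```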
Benchmarks
Evaluation benchmarks developed to rigorously assess Audio LLM capabilities across diverse tasks and languages.
  • AudioBench — A universal benchmark for evaluating Audio LLMs, covering 50+ datasets across ASR, speech translation, spoken QA, audio scene understanding, emotion recognition, and code-switching
  • Supports multilingual evaluation including English, Chinese, Thai, Vietnamese, and Indonesian
  • Features a live leaderboard tracking model performance; the AudioBench paper was accepted at NAACL 2025
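To make the ASR-style scores on such a leaderboard concrete, here is a tiny worked example of word error rate (WER) computed with the jiwer library. This is not AudioBench's interface; it only illustrates the kind of reference-versus-hypothesis scoring an ASR track performs.

```python
# Not AudioBench's actual API: a minimal illustration of ASR scoring by word error rate.
from jiwer import wer

references = [
    "can you send me the report by tomorrow morning",
    "the meeting is postponed to next wednesday",
]
hypotheses = [
    "can you send me the report by tomorrow morning",
    "the meeting is postpone to next wednesday",
]

# Corpus-level WER over all reference/hypothesis pairs; lower is better.
print(f"WER: {wer(references, hypotheses):.3f}")
```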
03 — Research

Research Interests

Multimodal AI

Unifying speech, language, and audio understanding in joint architectures

Speech Processing

Recognition, encoding, and emotion detection from raw audio signals

Large Language Models

Training and adaptation for diverse languages and domains

Cross-lingual NLP

Information retrieval and transfer across language barriers

Machine Translation

Quality estimation and evaluation for multilingual systems

Southeast Asian AI

Language technology for underrepresented regions and accents


04 — Education

Academic Background

Ph.D. in Computer Science
Johns Hopkins University
Center for Language and Speech Processing (CLSP)
Advisor: Kevin Duh
B.S. in EECS
University of California, Berkeley
Electrical Engineering & Computer Science