I2R, A*STAR, Singapore

Shuo Sun

Technical Lead of MERaLiON, Singapore's national multimodal AI initiative. Working on multimodal large language models, speech processing, and language technologies for Southeast Asia.


01 — About

Building the future of multimodal AI

I am a researcher and technical lead at the Institute for Infocomm Research (I2R), A*STAR, Singapore. I lead the development of MERaLiON—Singapore's national multimodal large language model initiative—building AI systems that understand both speech and language with a focus on Southeast Asian contexts.

I received my Ph.D. in Computer Science from Johns Hopkins University, where I was part of the Center for Language and Speech Processing (CLSP), advised by Kevin Duh. Before JHU, I was a research engineer at the Baidu-I2R Research Centre in Singapore.

National AI Initiative

Leading Singapore's flagship multimodal LLM programme

Speech & Language

Models from 0.6B to 10B parameters

Southeast Asia Focus

Designed for regional languages and accents

02 — Flagship Project

MERaLiON

Multimodal Empathetic Reasoning and Learning in One Network
MERaLiON is a family of multimodal AI models designed for speech and language understanding, with a focus on Southeast Asian languages and accents. The project spans speech-language models, speech encoders, automatic speech recognition, and speech emotion recognition—powering applications from conversational AI to multilingual audio understanding.
National LLM Funding Initiative · NRF Singapore & IMDA
Audio Language Models
End-to-end speech-language models that jointly understand audio and text. These models accept raw speech input and generate text responses, enabling conversational AI, audio question-answering, and speech understanding.
  • MERaLiON-3 is the latest generation with improved multilingual and reasoning capabilities
  • Available in 3B and 10B parameter sizes for different deployment needs
  • Optimized for Southeast Asian languages, accents, and acoustic conditions
  • MERaLiON-AudioLLM integrates a Whisper-based speech encoder with the SEA-LION language model for robust regional speech understanding
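As a rough illustration of how such an end-to-end speech-language model is queried, here is a minimal sketch using the Hugging Face transformers library. The repository ID, the processor call signature, and the prompt format are assumptions made for illustration only; the official model card documents the exact interface.

```python
# Minimal sketch of querying a speech-language model such as MERaLiON-AudioLLM.
# The repo ID, prompt wording, and processor keyword arguments are assumptions;
# consult the official model card for the supported usage.
import librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

MODEL_ID = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"  # assumed repo ID

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load 16 kHz mono audio and pair it with a text instruction.
speech, _ = librosa.load("meeting_clip.wav", sr=16000)
prompt = "Please transcribe this recording and summarise the speaker's request."

inputs = processor(text=prompt, audios=speech, sampling_rate=16000, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```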
Automatic Speech Recognition
Dedicated ASR model built on MERaLiON-2 for high-accuracy speech-to-text transcription, with strong performance on Southeast Asian accented English and regional languages.
  • Fine-tuned specifically for transcription accuracy on diverse accents
  • Strong performance on Singaporean English, Mandarin-accented English, and Malay
  • Built on the 10B MERaLiON-2 backbone for robust speech representations
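A transcription call might look like the sketch below, assuming the released ASR checkpoint is compatible with the standard transformers automatic-speech-recognition pipeline; the model ID is an assumption, and the checkpoint may require its own loading code instead.

```python
# Illustrative sketch: speech-to-text with the standard transformers ASR pipeline.
# The model ID is an assumption; see the official model card for the actual interface.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="MERaLiON/MERaLiON-2-10B-ASR",  # assumed repo ID
    trust_remote_code=True,
)

# Transcribe a local recording of Singaporean accented English.
result = asr("call_centre_sample.wav")
print(result["text"])
```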
Speech Encoders
Foundation speech encoders that extract rich acoustic representations from raw audio. These serve as the audio backbone for downstream tasks like ASR, speaker identification, and emotion detection.
  • SpeechEncoder-2 offers improved feature extraction over v1 with better multilingual coverage
  • Compact 0.6B parameter size enables efficient deployment
  • Can be used as a plug-in encoder for building custom audio-language pipelines
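The plug-in encoder pattern looks roughly like the following: extract frame-level acoustic representations, then pool or feed them into a downstream head. The repository ID and its pairing with the generic AutoModel/AutoFeatureExtractor classes are assumptions for illustration.

```python
# Sketch: extracting acoustic representations from a foundation speech encoder
# for a downstream task. The repo ID and auto-class compatibility are assumptions.
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModel

ENCODER_ID = "MERaLiON/MERaLiON-SpeechEncoder-v2"  # assumed repo ID

feature_extractor = AutoFeatureExtractor.from_pretrained(ENCODER_ID, trust_remote_code=True)
encoder = AutoModel.from_pretrained(ENCODER_ID, trust_remote_code=True)

speech, _ = librosa.load("utterance.wav", sr=16000)
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = encoder(**inputs).last_hidden_state  # (batch, frames, dim)

# Pool over time to get one utterance-level embedding for a custom classifier.
utterance_embedding = hidden_states.mean(dim=1)
print(utterance_embedding.shape)
```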
Speech Emotion Recognition
Detects emotional states from speech audio, enabling empathetic AI interactions. Classifies emotions such as happiness, sadness, anger, fear, and neutral states from voice characteristics.
  • Multi-class emotion classification from raw speech audio
  • Trained on diverse emotional speech datasets for robust cross-domain performance
  • Lightweight 0.8B model suitable for real-time applications
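In practice, multi-class emotion prediction from an audio file could be run as in this sketch, assuming the released checkpoint works with the standard transformers audio-classification pipeline; the model ID is an assumption.

```python
# Sketch: emotion classification from raw speech, assuming pipeline compatibility.
# The model ID is an assumption made for illustration.
from transformers import pipeline

ser = pipeline(
    "audio-classification",
    model="MERaLiON/MERaLiON-SER",  # assumed repo ID
    trust_remote_code=True,
)

# Returns a ranked list of emotion labels (e.g. happy, sad, angry, neutral) with scores.
for prediction in ser("customer_call.wav"):
    print(f"{prediction['label']}: {prediction['score']:.2f}")
```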
Language Models
Instruction-tuned text language model built on LLaMA-3, fine-tuned for Southeast Asian language understanding and generation tasks including translation, summarization, and dialogue.
  • Based on Meta's LLaMA-3 architecture with regional fine-tuning
  • Instruction-tuned for helpful, safe, and contextually appropriate responses
  • Enhanced performance on Southeast Asian language tasks and cultural context
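For the text-only model, usage follows the standard chat-template pattern for instruction-tuned Llama-3-based checkpoints; the repository ID below is an assumption, and the presence of a chat template is assumed from the Llama-3 lineage.

```python
# Sketch: instruction-following with a Llama-3-based regional language model.
# The repo ID is an assumption; the chat-template flow is standard transformers usage.
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "MERaLiON/MERaLiON-LLaMA-3-8B-Instruct"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Malay instruction: "Translate to English: Good morning, how are you?"
messages = [{"role": "user", "content": "Terjemahkan ke Bahasa Inggeris: Selamat pagi, apa khabar?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```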
Datasets
Open datasets released by the MERaLiON team to advance speech and language research, with a focus on Singapore's multilingual, code-switching environment.
  • MNSC (Multitask National Speech Corpus) — A large-scale multitask speech dataset derived from IMDA's National Speech Corpus, with 15M+ rows covering ASR, spoken QA, dialogue summarization, and paralinguistic QA, specialized for Singlish and code-switching
  • CPQA (Contextual Paralinguistic Question-Answering) — An evaluation dataset of 2,647 LLM-generated QA pairs designed to assess speech-LLMs' ability to understand contextual and paralinguistic cues such as tone, emotion, and emphasis
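Given the corpus size, streaming access with the Hugging Face datasets library is the natural way to explore MNSC; the dataset repository ID, configuration name, and field names in this sketch are assumptions, so check the dataset card for the actual layout.

```python
# Sketch: streaming a slice of MNSC with the Hugging Face `datasets` library.
# Repo ID, configuration name, and field names are assumptions for illustration.
from datasets import load_dataset

mnsc = load_dataset(
    "MERaLiON/Multitask-National-Speech-Corpus",  # assumed repo ID
    "ASR",                                        # assumed task configuration
    split="train",
    streaming=True,                               # avoid downloading 15M+ rows up front
)

for example in mnsc.take(3):
    print(example.keys())  # e.g. audio, transcription, task metadata (names assumed)
```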
Benchmarks
Evaluation benchmarks developed to rigorously assess Audio LLM capabilities across diverse tasks and languages.
  • AudioBench — A universal benchmark for evaluating Audio LLMs, covering 50+ datasets across ASR, speech translation, spoken QA, audio scene understanding, emotion recognition, and code-switching
  • Supports multilingual evaluation including English, Chinese, Thai, Vietnamese, and Indonesian
  • Features a live leaderboard tracking model performance; the AudioBench paper was accepted at NAACL 2025
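To make the ASR-style scores on such a leaderboard concrete, here is a tiny worked example of word error rate (WER) computed with the jiwer library. This is not AudioBench's interface; it only illustrates the kind of reference-versus-hypothesis scoring an ASR track performs.

```python
# Not AudioBench's actual API: a minimal illustration of ASR scoring by word error rate.
from jiwer import wer

references = [
    "can you send me the report by tomorrow morning",
    "the meeting is postponed to next wednesday",
]
hypotheses = [
    "can you send me the report by tomorrow morning",
    "the meeting is postpone to next wednesday",
]

# Corpus-level WER over all reference/hypothesis pairs; lower is better.
print(f"WER: {wer(references, hypotheses):.3f}")
```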
03 — Research

Research Interests

Multimodal AI

Unifying speech, language, and audio understanding in joint architectures

Speech Processing

Recognition, encoding, and emotion detection from raw audio signals

Large Language Models

Training and adaptation for diverse languages and domains

Cross-lingual NLP

Information retrieval and transfer across language barriers

Machine Translation

Quality estimation and evaluation for multilingual systems

Southeast Asian AI

Language technology for underrepresented regions and accents


04 — Education

Academic Background

Ph.D. in Computer Science
Johns Hopkins University
Center for Language and Speech Processing (CLSP)
Advisor: Kevin Duh
B.S. in EECS
University of California, Berkeley
Electrical Engineering & Computer Science