Project by: Rabimba
Showcase for: Google Developer Expert (GDE) AI Sprint

Imagine having a private, real-time AI coach that watches your presentations, listens to your speech, analyzes your slides and gestures, and provides actionable feedback to help you improve confidently. Using Google’s new Gemma 3n, we built exactly that: SpeakWise, an AI-powered public speaking coach that leverages multimodal understanding to transcribe, analyze, and critique your talks—all while keeping your data private.
GitHub Code
🚀 Why Gemma 3n?
Gemma 3n is Google’s open multimodal model designed for on-device, privacy-preserving, real-time AI applications. It is uniquely capable of:
- 📡 Simultaneously processing audio, image, and text, forming a holistic understanding of your talk.
- 🗂️ Following advanced instructions (e.g., “Act as a world-class presentation coach”) and structuring output into clear, actionable insights.
- 🔐 Enabling offline, private skill development without requiring cloud uploads.

These features make Gemma 3n a perfect foundation for SpeakWise, letting us transform how presenters practice and refine their public speaking skills using AI that understands context deeply.
🛠️ Building SpeakWise: Step-by-Step
1️⃣ Setup & Installation
We start by installing the necessary libraries for **Gemma 3n execution, video processing, and audio extraction** in a Colab environment:
!pip install -q -U "transformers>=4.53.0" "timm>=1.0.16" bitsandbytes accelerate
!pip install -q decord ffmpeg-python librosa
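Before loading the model, it helps to confirm the Colab runtime actually has a GPU attached (this check is our addition; any CUDA-capable runtime works, though bfloat16 is fastest on Ampere-class GPUs or newer):
import torch
# Confirm a GPU is attached; switch the Colab runtime type if this prints the warning
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU found - switch the runtime to GPU")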
2️⃣ Loading Gemma 3n
We load and compile Gemma 3n using mixed precision (bfloat16) for **speed and memory efficiency**:
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

GEMMA_PATH = "google/gemma-3n-E4B-it"
processor = AutoProcessor.from_pretrained(GEMMA_PATH)
model = AutoModelForImageTextToText.from_pretrained(
    GEMMA_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = torch.compile(model, fullgraph=False)
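Before wiring up video, a quick text-only smoke test confirms the model responds. This is our sketch using the standard transformers chat-template API for Gemma 3n; the prompt itself is just an illustration:
# Text-only smoke test (our addition); the prompt is illustrative
messages = [{"role": "user", "content": [{"type": "text", "text": "Give me one public speaking tip."}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])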
3️⃣ Extracting Frames and Audio
Using decord and ffmpeg, we extract evenly spaced frames for visual analysis and resample audio for speech transcription:
def process_video_with_decord(video_path, num_frames=8):
    # Uses decord for frame sampling and ffmpeg + librosa for audio extraction
    # Prepares data for multimodal input into Gemma 3n
    ...
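Here is a minimal sketch of what this helper might contain, assuming decord, ffmpeg-python, librosa, NumPy, and Pillow as installed above (the 16 kHz sample rate and the audio.wav temp file are our choices; the notebook's exact implementation may differ):
import ffmpeg, librosa
import numpy as np
from decord import VideoReader
from PIL import Image

def process_video_with_decord(video_path, num_frames=8):
    # Sample num_frames evenly spaced frames with decord
    vr = VideoReader(video_path)
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]
    # Dump a mono 16 kHz WAV with ffmpeg, then load it with librosa
    ffmpeg.input(video_path).output("audio.wav", ar=16000, ac=1).overwrite_output().run(quiet=True)
    audio, sr = librosa.load("audio.wav", sr=16000)
    return frames, audio, sr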
4️⃣ Analysis Pipeline with Gemma 3n
We designed a three-step pipeline:
- Transcribe Speech: Using Gemma 3n’s built-in speech-to-text to create a full transcript.
- Visual Frame Analysis: Checking posture, gestures, and slide clarity frame-by-frame.
- Generate Coaching Report: Synthesizing findings into a clear, encouraging critique with actionable next steps.
generate_definitive_analysis(model, processor, quick_test=True)
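For illustration, the transcription step might be assembled like this (our sketch, assuming the process_video_with_decord helper above; talk.mp4 is a placeholder path and the prompt wording is ours, so the notebook's generate_definitive_analysis may differ in detail):
# Step 1 sketch: have Gemma 3n transcribe the extracted audio
frames, audio, sr = process_video_with_decord("talk.mp4")
messages = [{"role": "user", "content": [
    {"type": "audio", "audio": "audio.wav"},  # WAV written by the helper above
    {"type": "text", "text": "Transcribe this presentation audio verbatim."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=model.dtype)  # cast float audio features to the model's bfloat16
out = model.generate(**inputs, max_new_tokens=512)
transcript = processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]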
✨ Results: What SpeakWise Delivers
✅ A full transcript of your presentation, automatically.
✅ Clear visual feedback on posture and slides.
✅ A personalized AI coaching report with actionable steps for improvement.
All of this is powered by Gemma 3n’s multimodal, privacy-first, on-device AI capabilities, showcasing how accessible it is to integrate advanced LLMs into skill-building workflows.
🌱 Next Steps
- 📲 Deploy SpeakWise as a mobile app for offline, real-time coaching.
- ⚡ Integrate with teleprompter tools for live presentation feedback.
- 🔍 Explore advanced gesture analysis using frame sequences.
🙌 Conclusion
Using Gemma 3n, we built SpeakWise to transform how presenters practice and refine their public speaking skills, leveraging real-time, multimodal, privacy-preserving AI. If you’re looking to build projects that truly understand human context using advanced LLMs while preserving user privacy, Gemma 3n is the tool you’ve been waiting for.
🔗 Project by Rabimba for Google Developer Expert (GDE) AI Sprint.
Colab: https://colab.research.google.com/drive/1j7S9QhoFYfyZq_rTKD_i1dRMFdqBZcu4?usp=sharing