
🎤 SpeakWise: Build an AI Public Speaking Coach with Gemma 3n

Project by: Rabimba
Showcase for: Google Developer Expert (GDE) AI Sprint

SpeakWise AI Public Speaking Coach Demo

Imagine having a private, real-time AI coach that watches your presentations, listens to your speech, analyzes your slides and gestures, and provides actionable feedback to help you improve confidently. Using Google’s new Gemma 3n, we built exactly that: SpeakWise, an AI-powered public speaking coach that leverages multimodal understanding to transcribe, analyze, and critique your talks—all while keeping your data private.

GitHub code.

🚀 Why Gemma 3n?

Gemma 3n is Google’s open multimodal model designed for on-device, privacy-preserving, real-time AI applications. It is uniquely capable of:

  • 📡 Simultaneously processing audio, image, and text, forming a holistic understanding of your talk.
  • 🗂️ Following advanced instructions (e.g., “Act as a world-class presentation coach”) and structuring its output into clear, actionable insights.
  • 🔐 Enabling offline, private skill development without requiring cloud uploads.
Gemma 3n multimodal capabilities

These features make Gemma 3n a perfect foundation for SpeakWise, letting us transform how presenters practice and refine their public speaking skills using AI that understands context deeply.


🛠️ Building SpeakWise: Step-by-Step

1️⃣ Setup & Installation

We start by installing the necessary libraries for Gemma 3n execution, video processing, and audio extraction in a Colab environment:

!pip install -q -U "transformers>=4.53.0" "timm>=1.0.16" bitsandbytes accelerate
!pip install -q decord ffmpeg-python librosa

2️⃣ Loading Gemma 3n

We load and compile Gemma 3n using mixed precision (bfloat16) for **speed and memory efficiency**:

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

GEMMA_PATH = "google/gemma-3n-E4B-it"

processor = AutoProcessor.from_pretrained(GEMMA_PATH)
model = AutoModelForImageTextToText.from_pretrained(
    GEMMA_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = torch.compile(model, fullgraph=False)
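The setup step also installs bitsandbytes, which the snippet above does not use. On memory-constrained GPUs, the model can instead be loaded with 4-bit quantization; this is an optional variant of the loading code, not part of the original notebook:

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Optional: 4-bit NF4 quantization via bitsandbytes for smaller GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3n-E4B-it",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Quantized weights trade a small amount of quality for a much smaller memory footprint, which matters for the on-device deployment goal discussed later.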

3️⃣ Extracting Frames and Audio

Using decord and ffmpeg, we extract evenly spaced frames for visual analysis and resample audio for speech transcription:

Frame and Audio Extraction
def process_video_with_decord(video_path, num_frames=8):
    """Sample evenly spaced frames with decord and extract audio via
    ffmpeg + librosa, preparing multimodal inputs for Gemma 3n."""
    ...
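The stub above can be fleshed out roughly as follows. This is a minimal sketch assuming `decord`, `ffmpeg-python`, `librosa`, and Pillow are installed (as in the setup step); the helper `sample_frame_indices` and the 16 kHz target rate are our own choices, not taken from the original notebook:

```python
import numpy as np

def sample_frame_indices(total_frames, num_frames=8):
    """Return num_frames evenly spaced frame indices across the video."""
    num_frames = min(num_frames, total_frames)
    return np.linspace(0, total_frames - 1, num_frames).astype(int).tolist()

def process_video_with_decord(video_path, num_frames=8, sr=16_000):
    """Extract evenly spaced RGB frames and a mono audio track from a video."""
    # Lazy imports: these are heavy, optional dependencies.
    from decord import VideoReader, cpu
    import ffmpeg
    import librosa
    from PIL import Image

    # Evenly spaced frames for visual analysis.
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = sample_frame_indices(len(vr), num_frames)
    frames = [Image.fromarray(vr[i].asnumpy()) for i in indices]

    # Demux the audio with ffmpeg, then load and resample it with librosa.
    wav_path = video_path.rsplit(".", 1)[0] + ".wav"
    (
        ffmpeg.input(video_path)
        .output(wav_path, ac=1, ar=sr)
        .overwrite_output()
        .run(quiet=True)
    )
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)

    return frames, audio
```

Separating the index math into `sample_frame_indices` keeps the sampling logic easy to test independently of the video-decoding dependencies.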

4️⃣ Analysis Pipeline with Gemma 3n

We designed a three-step pipeline:

  • Transcribe Speech: Using Gemma 3n’s STT to create a full transcript.
  • Visual Frame Analysis: Checking posture, gestures, and slide clarity frame-by-frame.
  • Generate Coaching Report: Synthesizing findings into a clear, encouraging critique with actionable next steps.
AI Coaching Report Generation
generate_definitive_analysis(model, processor, quick_test=True)
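The three steps above can be sketched as follows. The `build_step` and `run_step` helper names and the prompt wording are our own illustrations of what `generate_definitive_analysis` does internally; the message layout follows the Transformers multimodal chat convention of `role`/`content` dicts with typed parts:

```python
def build_step(prompt_text, audio_path=None, images=None):
    """Assemble one multimodal chat turn (audio and/or images plus text)."""
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "audio": audio_path})
    for img in images or []:
        content.append({"type": "image", "image": img})
    content.append({"type": "text", "text": prompt_text})
    return [
        {"role": "system",
         "content": [{"type": "text",
                      "text": "Act as a world-class presentation coach."}]},
        {"role": "user", "content": content},
    ]

def run_step(model, processor, messages, max_new_tokens=512):
    """Apply the chat template, generate, and decode only the new tokens."""
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = out[:, inputs["input_ids"].shape[-1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

A transcription pass would then call `run_step` with an audio-only turn, the visual pass with image-only turns, and the final coaching report with a text turn that includes the earlier outputs.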

✨ Results: What SpeakWise Delivers

✅ A full transcript of your presentation, automatically.
✅ Clear visual feedback on posture and slides.
✅ A personalized AI coaching report with actionable steps for improvement.

All of this is powered by Gemma 3n’s multimodal, privacy-first, on-device AI capabilities, showcasing how accessible it is to integrate advanced LLMs into skill-building workflows.


🌱 Next Steps

  • 📲 Deploy SpeakWise as a mobile app for offline, real-time coaching.
  • ⚡ Integrate with teleprompter tools for live presentation feedback.
  • 🔍 Explore advanced gesture analysis using frame sequences.
SpeakWise Future Possibilities

🙌 Conclusion

Using Gemma 3n, we built SpeakWise to transform how presenters practice and refine their public speaking skills, leveraging real-time, multimodal, privacy-preserving AI. If you’re looking to build projects that truly understand human context using advanced LLMs while preserving user privacy, Gemma 3n is the tool you’ve been waiting for.

🔗 Project by Rabimba for Google Developer Expert (GDE) AI Sprint.

Colab: 
https://colab.research.google.com/drive/1j7S9QhoFYfyZq_rTKD_i1dRMFdqBZcu4?usp=sharing


