(For educational purposes, such as tutorial creation or voice notes)
Creating AI-powered voice video notes sounds complex, but with the right setup, it becomes a smooth and repeatable process. In this guide, I’ll walk you through a fully stable, tested method to generate AI voice-cloned video notes using Google Colab, Coqui XTTS v2, and MoviePy.
This tutorial uses a safe, stable PyTorch environment (2.1.0) to avoid compatibility issues and includes a proven Hindi language fix, making it suitable for both English and Hindi voice generation.
If you follow each step exactly, this will work without errors.
🎥 Project Overview
We are building an AI Voice Clone & Video Generator that:
- Clones your voice from a short reference recording
- Converts text into natural speech (English or Hindi)
- Combines the generated audio with an image
- Outputs a ready-to-download MP4 video
✅ What You Need Before Starting
Prepare these two files on your local system:
- reference_voice.wav: 10–15 seconds of your voice; clear audio with no background noise
- slide_image.png: any image you want as the video background (a slide, note, poster, or plain background)
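Before uploading, you can quickly verify that your recording meets these requirements. This is a minimal sketch using Python's built-in wave module (it assumes a standard PCM WAV file; the helper name is mine, not part of the tutorial's script):

```python
import wave

def check_reference(path="reference_voice.wav"):
    """Report duration and sample rate of a PCM WAV reference file."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        print(f"Duration: {duration:.1f}s, sample rate: {wav.getframerate()} Hz")
        if not 10 <= duration <= 15:
            print("⚠️ Aim for 10–15 seconds of clean speech.")
        return duration
```

Run it locally before heading to Colab; re-record if the duration falls outside the 10–15 second window.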
🧩 Step 1: Install the Stable Environment (Must Be First)
Open Google Colab, create a new notebook, and paste the following code into the first cell.
This step removes unstable pre-installed libraries and installs the most reliable configuration for voice cloning.
# @title 1. Install & System Setup (Stable Version 2.1.0)
import os
print("🔄 Cleaning up unstable libraries (ensuring a clean slate)...")
os.system("pip uninstall -y torch torchaudio torchvision torchtext torchdata torchcodec")
print("⬇️ Installing Stable PyTorch 2.1.0...")
!pip install -q torch==2.1.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
print("⬇️ Installing Voice & Video tools...")
# Pin MoviePy 1.x: the script below uses the moviepy.editor API, which was removed in MoviePy 2.0
!pip install -q coqui-tts moviepy==1.0.3
!sudo apt-get install -y ffmpeg
# Hindi number conversion fix
!pip install num2words
print("✅ INSTALLATION COMPLETE! Move to Step 2.")
📌 Important:
Do not skip this step or upgrade PyTorch. This exact version avoids CUDA and XTTS crashes.
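After the install cell finishes, it is worth sanity-checking the environment in a fresh cell before moving on. A small sketch (the helper name is mine; CUDA builds of PyTorch append a suffix such as `+cu118` to the version string, which the check strips):

```python
def check_pinned_version(installed: str, pinned: str = "2.1.0") -> bool:
    """Return True if the installed version matches the pinned release,
    ignoring any local build suffix like '+cu118'."""
    return installed.split("+")[0] == pinned

# In Colab you would run:
#   import torch
#   assert check_pinned_version(torch.__version__), \
#       "Wrong PyTorch build — restart the runtime and re-run the install cell."
```

A mismatch here almost always means the runtime was not restarted after installation.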
📂 Step 2: Upload Your Files (Mandatory)
Before running the next script:
- Go to the left sidebar in Colab
- Click the Folder icon (📁)
- Drag and drop:
  - reference_voice.wav
  - slide_image.png
- Wait until uploads finish completely
⚠️ If the runtime disconnects, files are deleted — re-upload them.
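To catch a disconnected runtime early, you can confirm the uploads survived with a quick check in its own cell. A minimal sketch (the helper name is mine, not part of the tutorial's master script):

```python
import os

def uploads_ready(required=("reference_voice.wav", "slide_image.png")):
    """Return the list of required files that are missing from the runtime."""
    missing = [name for name in required if not os.path.exists(name)]
    for name in missing:
        print(f"❌ Missing: {name} — re-upload it via the Files sidebar.")
    return missing
```

An empty return value means you are clear to run Step 3.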
🎙️ Step 3: Generate AI Voice & Video (Master Script)
Paste the following code into the second cell.
How to Customize:
- Change LANGUAGE → "en" or "hi"
- Edit YOUR_TEXT → the text the AI should speak
import torch
from TTS.api import TTS
from moviepy.editor import AudioFileClip, ImageClip
import num2words
import warnings
import os

# ==========================================
# ⚙️ USER SETTINGS (EDIT THIS AREA)
# ==========================================
LANGUAGE = "en"  # "en" for English, "hi" for Hindi
YOUR_TEXT = """
This is a test of the fully stable voice clone system.
I am creating this video note using Python 3.10 and Coqui TTS.
"""
INPUT_VOICE = "reference_voice.wav"
INPUT_IMAGE = "slide_image.png"
OUTPUT_VIDEO = "final_video_output.mp4"

# ==========================================
# 🚀 SYSTEM CODE (DO NOT EDIT BELOW)
# ==========================================
print("🔧 Applying system fixes...")
warnings.filterwarnings("ignore")

# Hindi number fix: num2words has no Hindi converter, so reuse the English one
try:
    if 'hi' not in num2words.CONVERTER_CLASSES:
        num2words.CONVERTER_CLASSES['hi'] = num2words.CONVERTER_CLASSES['en']
except AttributeError:
    pass

if not os.path.exists(INPUT_VOICE) or not os.path.exists(INPUT_IMAGE):
    print("❌ ERROR: Required files not found. Upload them via the Files sidebar.")
else:
    print(f"✅ Files detected. Language: {LANGUAGE.upper()}")
    print("\n🎙️ Generating AI Voice...")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device.upper()}")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    tts.tts_to_file(
        text=YOUR_TEXT,
        speaker_wav=INPUT_VOICE,
        language=LANGUAGE,
        file_path="temp_audio.wav"
    )
    print("✅ Voice generated successfully.")
    print("\n🎬 Creating video...")
    audio = AudioFileClip("temp_audio.wav")
    image = ImageClip(INPUT_IMAGE).set_duration(audio.duration)
    video = image.set_audio(audio)
    video.write_videofile(
        OUTPUT_VIDEO,
        fps=1,
        codec="libx264",
        audio_codec="aac",
        preset="ultrafast",
        logger=None
    )
    print("\n🎉 DONE!")
    print(f"Your video file: {OUTPUT_VIDEO}")
⏳ Voice generation may take 30–60 seconds on first run.
⬇️ Step 4: Download the Final Video
- Open the Files sidebar
- Click Refresh
- Right-click final_video_output.mp4
- Select Download
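If clicking through the sidebar is fiddly, Colab can also trigger the download programmatically. A small sketch that falls back gracefully outside Colab (the google.colab module only exists inside a Colab runtime; the helper name is mine):

```python
def download_video(path="final_video_output.mp4"):
    """Trigger a browser download inside Colab; print a hint elsewhere."""
    try:
        from google.colab import files  # only importable inside a Colab runtime
        files.download(path)
        return "colab"
    except ImportError:
        print(f"Not running in Colab — fetch {path} from the working directory.")
        return "local"
```

Add it as a final cell and run it after the video finishes rendering.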
Your AI voice video note is ready.
🛠️ Troubleshooting & Best Practices
❌ File Not Found Error
- Files were not uploaded
- Runtime restarted → upload again
🗣️ Hindi Number Issues
- Write numbers in words
- ❌ "10" → ✅ "दस"
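If you would rather keep digits in your script, a small preprocessing pass can replace them with Hindi words before the text reaches XTTS. This is only an illustrative sketch: the lookup table is mine and covers 1–10, and num2words currently has no Hindi converter (which is why the tutorial's patch falls back to English):

```python
import re

# Tiny illustrative lookup — extend as needed for your scripts.
HINDI_WORDS = {"1": "एक", "2": "दो", "3": "तीन", "4": "चार", "5": "पाँच",
               "6": "छह", "7": "सात", "8": "आठ", "9": "नौ", "10": "दस"}

def expand_numbers(text: str) -> str:
    """Replace bare digits with Hindi words so XTTS never sees raw numerals."""
    return re.sub(r"\b\d+\b",
                  lambda m: HINDI_WORDS.get(m.group(), m.group()), text)
```

Call it on YOUR_TEXT before passing the text to tts_to_file; digits outside the table are left untouched.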
🎧 Poor Voice Quality
- Use a clearer reference recording
- Avoid background noise
- Speak naturally, not fast
🎯 Final Thoughts (From Ankit)
This setup is stable, scalable, and production-ready.
Once configured, you can reuse it for:
- AI lecture notes
- Voice-over slides
- Educational reels
- Personal branding videos
- Course content creation
If you want enhancements like multiple slides, subtitle generation, or batch processing, this foundation supports it.
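As one example, batch processing is mostly a loop around the same pipeline. A rough sketch: `batch_render` and its injected `synthesize` callable are hypothetical names of mine, where `synthesize(text, path)` would wrap the `tts.tts_to_file` and MoviePy code from Step 3:

```python
def batch_render(texts, synthesize, out_pattern="note_{:02d}.mp4"):
    """Render one video per text using an injected synthesize(text, path) callable."""
    outputs = []
    for i, text in enumerate(texts, start=1):
        path = out_pattern.format(i)
        synthesize(text, path)  # e.g. voice generation + MoviePy render from Step 3
        outputs.append(path)
    return outputs
```

Injecting the synthesis step as a callable keeps the loop trivially testable and lets you swap languages or voices per item later.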
Summary: AI Voice Clone & Video Notes Using Google Colab
By Ankit
AI-driven voice and video creation has reached a point where individual creators, educators, and professionals can produce studio-quality content using only a browser and a basic system setup. This guide presents a fully stable and error-free approach to creating AI voice–based video notes using Google Colab, Coqui XTTS v2, and MoviePy, with specific attention given to long-term reliability and multilingual support.
The core objective of this project is to convert written text into a natural-sounding voice clone and seamlessly combine it with a static image to produce a downloadable video file. Unlike experimental setups that frequently break due to library mismatches or GPU incompatibilities, this workflow is intentionally built on a safe-mode environment using PyTorch 2.1.0, which has proven to be the most dependable version for XTTS-based voice cloning. By explicitly removing pre-installed Colab libraries and installing a controlled stack, the tutorial eliminates common runtime crashes and CUDA conflicts.
A key strength of this implementation is its language flexibility. The system supports both English and Hindi voice generation using the same pipeline. Special care is taken to address known Hindi text-processing issues, particularly number conversion errors that can cause crashes or missing speech output. By applying a targeted monkey patch using the num2words library, the tutorial ensures uninterrupted Hindi narration, making the solution practical for Indian educators, trainers, and content creators.
The process begins with two essential inputs: a short voice reference file and a background image. The voice sample, typically 10 to 15 seconds long, allows the XTTS v2 model to accurately capture tone, pacing, and vocal characteristics. This reference-based cloning enables personalized narration without the need for extended training or fine-tuning. The background image acts as a visual canvas, transforming what would otherwise be an audio-only output into a shareable video format suitable for presentations, notes, and social media platforms.
Once the environment is prepared and files are uploaded, the master script handles the entire workflow in a single execution. It automatically detects GPU availability, loads the XTTS v2 multilingual model, generates the speech audio, and then merges it with the image using MoviePy. The video rendering process uses optimized settings such as the ultrafast encoding preset, ensuring that video creation completes quickly without freezing or memory errors—an issue many users face in Colab-based video workflows.
Another important advantage of this setup is its repeatability. After the initial configuration, the same notebook can be reused indefinitely. Users only need to update the text content, switch languages if required, and replace the reference voice or image when necessary. This makes the system ideal for batch content creation, lecture recording, revision notes, explainer videos, and even personal branding content.
From a practical standpoint, this workflow significantly lowers the barrier to AI-powered media production. There is no need for high-end hardware, paid software, or complex local installations. Everything runs inside Google Colab using open-source tools, making it accessible to students, educators, freelancers, and startups alike. The final output—a standard MP4 video—can be downloaded directly and used across platforms without additional processing.
In conclusion, this guide demonstrates that AI voice cloning and video generation can be both stable and production-ready when approached with the right architectural choices. By prioritizing compatibility, controlled dependencies, and real-world usability, the tutorial offers a reliable foundation that can be extended with subtitles, multiple slides, automation, or integration into larger content pipelines. For anyone looking to adopt AI voice technology without constant troubleshooting, this workflow provides a clear, tested, and future-proof starting point.
