feat(web-test): per-caption voice + speechRate for multi-voice narration

- addNarration: use cap.voice override per caption (fallback to global)
- showCaption/showImage/showTitleSlide: pass opts.voice to caption entry
- showCaption: record caption when text is empty but speech is explicit
- startRecording: add speechRate option (default 70ms/char, 85 for ElevenLabs)
- run.mjs: increase exec timeout to 30min for long recordings
- docs: update recording.md and web-test-recording-guide.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-07-29 08:01:02 +03:00 · 2026-03-21 16:30:02 +03:00
parent ca0dac2693
commit 6f36e36166
4 changed files with 59 additions and 17 deletions
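
As a reading aid, the two behaviours in this commit (the configurable smart TTS wait and the per-caption voice override) can be sketched in isolation. This is a minimal sketch, not the code from the commit: the helper names `estimateSmartWaitMs`, `remainingWaitMs`, `buildCaption` and `resolveTtsOpts` are hypothetical and only mirror the expressions visible in the hunks below. `remainingWaitMs` is an assumption drawn from the documented `wait()` behaviour, since that function's implementation is not part of this diff.

```javascript
// Hypothetical helpers that mirror expressions from the diff; none of these
// names exist in the repository.

const DEFAULT_SPEECH_RATE = 70; // ms per character (default from the commit message)
const MIN_WAIT_MS = 2000;       // the "min 2s" floor mentioned in the docs

// Estimated pause after a caption so the voiceover has enough screen time.
// As in the diff, `||` means a falsy rate (undefined, 0) falls back to 70.
function estimateSmartWaitMs(speech, speechRate) {
  return Math.max(MIN_WAIT_MS, speech.length * (speechRate || DEFAULT_SPEECH_RATE));
}

// Assumption: how a later wait() could subtract the smart wait already done.
function remainingWaitMs(explicitMs, alreadyWaitedMs) {
  return Math.max(0, explicitMs - alreadyWaitedMs);
}

// Caption entry: `voice` is only present when the caller supplied one.
function buildCaption(text, opts, videoTimeMs) {
  const speech = typeof opts.speech === 'string' ? opts.speech : text;
  return {
    text: text || speech,
    speech,
    time: Math.round(videoTimeMs),
    ...(opts.voice ? { voice: opts.voice } : {}),
  };
}

// Per-caption TTS options: the caption's voice wins over the global one.
function resolveTtsOpts(ttsOpts, cap) {
  return cap.voice ? { ...ttsOpts, voice: cap.voice } : ttsOpts;
}

const sample = 'x'.repeat(40);
console.log(estimateSmartWaitMs(sample, 70)); // 2800
console.log(estimateSmartWaitMs(sample, 85)); // 3400
console.log(estimateSmartWaitMs('Hi', 85));   // 2000 (the floor applies)
console.log(remainingWaitMs(800, 3400));      // 0
```

Because the caption entry omits `voice` when none is given, captions saved before this change would still fall through to the global voice in `resolveTtsOpts`.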
@@ -57,6 +57,7 @@ Start recording the browser viewport to an MP4 file.
 | `opts.fps` | number | 25 | Target framerate |
 | `opts.quality` | number | 80 | JPEG quality (1-100) |
 | `opts.ffmpegPath` | string | auto | Explicit path to ffmpeg binary |
+| `opts.speechRate` | number | 70 | Ms per character for smart TTS wait. Increase for slower TTS providers (e.g. 85 for ElevenLabs) |

 - Output directory is created automatically if it doesn't exist
 - Throws if already recording or browser not connected
@@ -89,10 +90,13 @@ Display a text overlay on the page (visible in recording). Calling again updates
 | `opts.background` | string | `'rgba(0,0,0,0.7)'` | Background color |
 | `opts.color` | string | `'#fff'` | Text color |
 | `opts.speech` | string \| false | - | TTS narration text. Omit = use displayed text, string = custom narration, false = skip narration |
+| `opts.voice` | string | - | Per-caption voice override (provider-specific voice name/ID). Used by `addNarration` instead of the global voice |
+
+When `text` is empty but `speech` is a string, the caption is still recorded for TTS (no visible overlay). Useful for narration-only captions (e.g. podcast mode).

 The overlay uses `pointer-events: none` — does not interfere with clicking.

-**Smart TTS wait** (during recording): `showCaption` automatically pauses for the estimated TTS speech duration (~70ms per character, min 2s). The next `wait()` call accounts for this — if the explicit pause is shorter than the TTS wait already done, no extra delay is added. If longer, only the remaining difference is waited. This means script authors don't need to calculate TTS timing manually.
+**Smart TTS wait** (during recording): `showCaption` automatically pauses for the estimated TTS speech duration (default ~70ms per character, min 2s; configurable via `startRecording({ speechRate })`). The next `wait()` call accounts for this — if the explicit pause is shorter than the TTS wait already done, no extra delay is added. If longer, only the remaining difference is waited. This means script authors don't need to calculate TTS timing manually.

 ### `hideCaption()`

@@ -110,6 +114,7 @@ Display a full-screen title slide overlay (gradient background, centered text).
 | `opts.color` | string | `'#fff'` | Text color |
 | `opts.fontSize` | number | 36 | Title font size in px |
 | `opts.speech` | string \| false | - | TTS narration text. String = custom text, `true` = use title text, omit/false = no narration |
+| `opts.voice` | string | - | Per-caption voice override for `addNarration` |

 The overlay covers the entire viewport with `z-index: 999999` and `pointer-events: none`.

@@ -128,6 +133,7 @@ Display a full-screen image overlay (e.g. presentation slide screenshot). Reads
 | `opts.background` | string | - | Custom background (overrides preset) |
 | `opts.shadow` | boolean | preset | Show drop shadow on image |
 | `opts.speech` | string \| false | - | TTS narration text while image is shown |
+| `opts.voice` | string | - | Per-caption voice override for `addNarration` |

 **Style presets:**
 - `blur` — blurred+dimmed copy of the image as background, centered image with shadow
@@ -287,7 +293,7 @@ Generate TTS and merge audio with video. Call after `stopRecording()`.
 | Parameter | Type | Description |
 |-----------|------|-------------|
 | `videoPath` | `string` | Path to the recorded MP4 file |
-| `opts.captions` | `Array` | Explicit captions (default: from last recording or `.captions.json`) |
+| `opts.captions` | `Array` | Explicit captions (default: from last recording or `.captions.json`). Each caption may include a `voice` field to override the global voice for that segment |
 | `opts.provider` | `string` | `'edge'` (default), `'openai'`, or `'elevenlabs'` |
 | `opts.voice` | `string` | Voice name (provider-specific) |
 | `opts.apiKey` | `string` | API key (for openai) |
@@ -300,7 +306,7 @@ Generate TTS and merge audio with video. Call after `stopRecording()`.

 ### `getCaptions()`

-Returns captions from the current or last recording: `Array<{ text, speech, time }>`.
+Returns captions from the current or last recording: `Array<{ text, speech, time, voice? }>`.

 ### Example: Record and narrate

@@ -3916,7 +3916,8 @@ export async function startRecording(outputPath, opts = {}) {
    if (dupes > 0) lastFrameTime = now;
  };

-  recorder = { cdp, ffmpeg, startTime: Date.now(), outputPath: resolvedPath, ffmpegError: '', captions: [], videoTimeMs: 0, _flushFrames };
+  const speechRate = opts.speechRate || 70; // ms per character for smart TTS wait
+  recorder = { cdp, ffmpeg, startTime: Date.now(), outputPath: resolvedPath, ffmpegError: '', captions: [], videoTimeMs: 0, _flushFrames, speechRate };
  // Redirect stderr accumulation to the recorder object
  ffmpeg.stderr.removeAllListeners('data');
  ffmpeg.stderr.on('data', d => { recorder.ffmpegError += d.toString(); });
@@ -3997,12 +3998,12 @@ export async function showCaption(text, opts = {}) {

  // Collect caption for TTS narration if recording
  let smartWaitMs = 0;
-  if (recorder && text.trim() && opts.speech !== false) {
+  if (recorder && (text.trim() || typeof opts.speech === 'string') && opts.speech !== false) {
    const speech = typeof opts.speech === 'string' ? opts.speech : text;
    // Use video timeline position (accounts for frame duplication) instead of wall-clock
-    recorder.captions.push({ text, speech, time: Math.round(recorder.videoTimeMs) });
+    recorder.captions.push({ text: text || speech, speech, time: Math.round(recorder.videoTimeMs), ...(opts.voice ? { voice: opts.voice } : {}) });
    // Estimate TTS duration and wait so the video has enough screen time for voiceover
-    smartWaitMs = Math.max(2000, speech.length * 70);
+    smartWaitMs = Math.max(2000, speech.length * (recorder.speechRate || 70));
  }
  const position = opts.position || 'bottom';
  const fontSize = opts.fontSize || 24;
@@ -4067,7 +4068,7 @@ export function getCaptions() {
 * Generates speech from captions and merges audio with the video.
 * @param {string} videoPath — path to the recorded MP4 file
 * @param {object} [opts]
- * @param {Array<{text: string, speech: string, time: number}>} [opts.captions] — explicit captions (default: from last recording or .captions.json)
+ * @param {Array<{text: string, speech: string, time: number, voice?: string}>} [opts.captions] — explicit captions (default: from last recording or .captions.json). Each caption may include a `voice` field to override the global voice for that segment
 * @param {string} [opts.provider='edge'] — TTS provider: 'edge' or 'openai'
 * @param {string} [opts.voice] — voice name (provider-specific)
 * @param {string} [opts.apiKey] — API key (for openai provider)
@@ -4145,12 +4146,13 @@ export async function addNarration(videoPath, opts = {}) {
      const promises = batch.map(async (cap, batchIdx) => {
        const idx = batchStart + batchIdx;
        const ttsFile = pathJoin(tempDir, `tts_${idx}.mp3`);
+        const capTtsOpts = cap.voice ? { ...ttsOpts, voice: cap.voice } : ttsOpts;
        try {
-          await ttsProvider(cap.speech, ttsFile, ttsOpts);
+          await ttsProvider(cap.speech, ttsFile, capTtsOpts);
        } catch (err) {
          // Retry once
          try {
-            await ttsProvider(cap.speech, ttsFile, ttsOpts);
+            await ttsProvider(cap.speech, ttsFile, capTtsOpts);
          } catch (retryErr) {
            warnings.push(`TTS failed for caption ${idx}: ${retryErr.message || retryErr.cause?.message || String(retryErr)}`);
            // Generate 1s silence as placeholder
@@ -4269,8 +4271,8 @@ export async function showTitleSlide(text, opts = {}) {
  if (recorder && speech && speech !== false) {
    const captionText = typeof speech === 'string' ? speech : text.replace(/\n/g, ' ');
    if (captionText) {
-      recorder.captions.push({ text: captionText, speech: captionText, time: Math.round(recorder.videoTimeMs) });
-      smartWaitMs = Math.max(2000, captionText.length * 70);
+      recorder.captions.push({ text: captionText, speech: captionText, time: Math.round(recorder.videoTimeMs), ...(opts.voice ? { voice: opts.voice } : {}) });
+      smartWaitMs = Math.max(2000, captionText.length * (recorder.speechRate || 70));
    }
  }

@@ -4380,8 +4382,8 @@ export async function showImage(imagePath, opts = {}) {
  if (recorder && speech && speech !== false) {
    const captionText = typeof speech === 'string' ? speech : '';
    if (captionText) {
-      recorder.captions.push({ text: captionText, speech: captionText, time: Math.round(recorder.videoTimeMs) });
-      smartWaitMs = Math.max(2000, captionText.length * 70);
+      recorder.captions.push({ text: captionText, speech: captionText, time: Math.round(recorder.videoTimeMs), ...(opts.voice ? { voice: opts.voice } : {}) });
+      smartWaitMs = Math.max(2000, captionText.length * (recorder.speechRate || 70));
    }
  }

@@ -230,7 +230,7 @@ async function cmdExec(fileOrDash, flags = {}) {
  const result = await new Promise((resolve, reject) => {
    const req = http.request({
      hostname: '127.0.0.1', port: sess.port, path: '/exec',
-      method: 'POST', timeout: 10 * 60 * 1000, headers,
+      method: 'POST', timeout: 30 * 60 * 1000, headers,
    }, res => {
      let data = '';
      res.on('data', chunk => data += chunk);
@@ -123,12 +123,12 @@ Claude will call `addNarration` with a different voice. Text
  "videoTimestamps": true,
  "captions": [
    { "text": "Go to the 'Sales' section", "speech": "Go to the Sales section", "time": 3160 },
-    { "text": "Open customer orders", "speech": "Open customer orders", "time": 7040 }
+    { "text": "Open customer orders", "speech": "Open customer orders", "time": 7040, "voice": "bqbHGIIO5oETYIqhWmfk" }
  ]
 }
 ```

-You can edit `speech` (the narration text) and re-narrate:
+You can edit `speech` (the narration text) or add `voice` (the voice for a specific line) and re-narrate:

 ```
 > Edit the captions in recordings/demo.captions.json: replace "Sales" with
@@ -204,6 +204,36 @@ await clickElement('Create');
 await wait(2);   // the form is loading
 ```

+### Two voices (podcast / dialogue)
+
+The `voice` parameter in `showCaption` sets the voice for a specific line. `addNarration` automatically uses it instead of the global voice:
+
+```js
+const MALE   = 'bqbHGIIO5oETYIqhWmfk'; // Alexander
+const FEMALE = '0ArNnoIAWKlT4WweaVMY'; // Elena Gromova
+
+// speechRate: 85, because ElevenLabs is slower than Edge TTS and needs headroom
+await startRecording('podcast.mp4', { speechRate: 85 });
+
+await showImage('slides/slide-01.png', { style: 'full', speech: false });
+await showCaption('', { speech: 'Hello! Today we will talk about...', voice: MALE });
+await wait(0.8);
+await showCaption('', { speech: 'And I will be asking the questions...', voice: FEMALE });
+await wait(0.8);
+
+const video = await stopRecording();
+const result = await addNarration(video.file, {
+  provider: 'elevenlabs',
+  apiKey: 'sk_...',
+  // no global voice needed: each caption carries its own
+});
+```
+
+Techniques:
+- `showCaption('', { speech, voice })`: empty text (no on-screen caption), but the speech is recorded for narration
+- `showImage` with `speech: false`: a slide without narration; the spoken lines go through `showCaption`
+- `speechRate: 85`: for ElevenLabs, raise the multiplier (default 70 ms/char) so that phrases do not overlap
+
 ### Separating text and narration

 The `speech` parameter in `showCaption` lets you display one thing and narrate another:
@@ -267,6 +297,9 @@ await showCaption('Technical details', { speech: false });
 | Olga Orlova | `d60rsXo2p0OwikDR5bS7` | Clear and Engaging |
 | Artem | `WTn2eCRCpoFAC50VD351` | Friendly & Professional |
 | Denis | `0BcDz9UPwL3MpsnTeUlO` | Pleasant, Engaging and Friendly |
+| Alexander | `bqbHGIIO5oETYIqhWmfk` | Pleasant, Warm and Natural |
+| Elena Gromova | `0ArNnoIAWKlT4WweaVMY` | Podcasts & Conversation |
+| Victor | `9fjVd0EYNNXHllJquVdT` | Moscow accent |

 ```json
 {
@@ -345,6 +378,7 @@ const narrated = await addNarration(video.file, {
 | `No captions available` | Use `showCaption()` during recording |
 | TTS timeout | Check your internet connection (Edge TTS requires network access) |
 | Narration gets cut off | Increase the `wait()` pauses between captions |
+| Phrases overlap each other | Increase `speechRate` in `startRecording` (85 for ElevenLabs) |

 ## Related skills