A guide to a Veo 3 AI video workflow: automatic prompt generation, longer videos, consistent characters

#ai

How do you turn an idea into a video that matches what you had in mind with Google's Veo 3? How do you make the video longer? How do you control the camera angles in a clip? Why does Veo 3 in Flow throw an error when I ask for Vietnamese dialogue? Those are the questions I've seen people asking lately about making videos with Veo 3.

In this post I'll share a workflow for making videos with Veo 3: from producing a detailed English prompt in the simplest, least painful way, to the settings to adjust in Flow's interface so you quickly get the video you want.

See also:

Guide to creating videos with Google Veo AI: telling the account types apart, video prompts, things to note

There are two main parts:

Writing a complete video prompt
Getting Veo 3 in Flow to generate the video, and going beyond that

Step 1: Finalize a prompt to feed into Veo 3

As with any other AI app, creating a video requires a prompt that tells the AI what you want. Below is a prompt template with a slightly cinematic style; it works for most videos of real people talking in real-world settings.

Template for Cinematic Scene Prompt:

SPECS: Camera specs, lenses etc
NO CAPTIONS OR TEXT.
[CHARACTER NAME] ([Age], [2–3 distinctive physical features]) [primary action]
[LOCATION DESCRIPTION]. [1–2 sentences establishing visual environment and lighting].
WIDE SHOT: [Description of what's visible in wide frame].
MEDIUM SHOT: [Focus on mid-range detail or character position].
CLOSE-UP: [Specific facial feature or important object detail].
EXTREME CLOSE-UP: [Micro detail that communicates emotion or importance].
DIALOGUE:
[CHARACTER NAME] ([voice quality description], translate to Vietnamese and say): "[First line of dialogue...]"
[CHARACTER NAME] ([voice quality description], translate to Vietnamese and say): "[Second line of dialogue...]"
[Brief description of environmental movement or lighting change].
AUDIO: [Background sounds], [ambient noises], [voice characteristics], [acoustic qualities].

KEY ELEMENTS: [4–5 essential thematic or visual elements separated by commas].

A quick explanation of the prompt above. It has seven main components:

SPECS: describes the camera that "shot" the video. It could be a professional camera with an ultra-wide lens or a specific focal length in mm, or a phone recording with its selfie camera, and so on.
CHARACTER NAME: describes the number of characters, their appearance, and their distinctive features.
LOCATION DESCRIPTION: describes the setting of the shot, the environment, the lighting, and so on.
Camera angles: the template above includes 4 basic shots: a wide shot, a medium shot, a close-up, and an extreme close-up. You can drop some or keep them all, depending on the video you want. You can also interleave dialogue between the shots if needed.
DIALOGUE: each character's lines. Here you describe the delivery style and voice quality, followed by the content of the line. If you create videos with Veo 3 in Flow and hit an error when inserting Vietnamese dialogue, write the dialogue in English instead and add the instruction translate to Vietnamese and say; that works around the error. The underlying reason is that it currently supports only English, so this is a small trick to fool it.
AUDIO: describes the ambient sound and any sound effects.
KEY ELEMENTS: adds details about the style and distinctive traits of the video.
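
If you would rather fill the template programmatically than by hand, the components listed above can be sketched in Python roughly as follows. This is only an illustration: the `Scene` and `DialogueLine` classes, the `render_prompt` function, and the sample character are hypothetical names made up here, not part of Veo 3 or Flow. The dialogue is kept in English with the translate to Vietnamese and say instruction, as described in the DIALOGUE item.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    speaker: str
    voice: str  # voice quality description, e.g. "warm, soft"
    text: str   # the line itself, written in English


@dataclass
class Scene:
    specs: str
    character: str
    location: str
    shots: dict         # shot label -> description; drop shots you don't need
    dialogue: list      # list of DialogueLine
    movement: str       # environmental movement or lighting change
    audio: str
    key_elements: list  # short strings


def render_prompt(scene: Scene) -> str:
    """Fill the cinematic template with the values of one scene."""
    lines = [
        f"SPECS: {scene.specs}",
        "NO CAPTIONS OR TEXT.",
        scene.character,
        scene.location,
    ]
    lines += [f"{label}: {text}" for label, text in scene.shots.items()]
    lines.append("DIALOGUE:")
    for d in scene.dialogue:
        lines.append(
            f'{d.speaker} ({d.voice}, translate to Vietnamese and say): "{d.text}"'
        )
    lines.append(scene.movement)
    lines.append(f"AUDIO: {scene.audio}")
    lines.append("")
    lines.append("KEY ELEMENTS: " + ", ".join(scene.key_elements))
    return "\n".join(lines)


# Hypothetical sample scene, only to show the shape of the output.
example = Scene(
    specs="Smartphone selfie camera, handheld",
    character="LAN (30s, short black hair, round glasses) pours a cup of coffee",
    location="A small cafe. Soft morning light comes through the window.",
    shots={"MEDIUM SHOT": "Lan behind the counter.", "CLOSE-UP": "Steam rising from the cup."},
    dialogue=[DialogueLine("LAN", "warm, soft", "Good morning!")],
    movement="A curtain sways gently.",
    audio="Quiet cafe chatter, the clink of cups.",
    key_elements=["morning light", "quiet cafe"],
)
print(render_prompt(example))
```

The shots are stored in a dictionary so that removing or reordering camera angles does not require touching the rendering code.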

I save this prompt as a plain-text .txt or .md file. Each time I want to make a video, I upload the file to an AI chatbot such as Gemini or ChatGPT, briefly describe the idea and what I want, and ask it to produce a prompt that follows this template exactly. This is very convenient: you don't have to type much, which helps especially if you're not comfortable writing in English.

Next, I give the chatbot the template, roughly describe the video I want, and ask it to suggest extra details. If the prompt it returns isn't quite right, I ask the chatbot to revise it. After that, the dialogue is the part I edit by hand so it comes out exactly as I want. That completes the prompt for one 8-second video generated by Veo 3.

At this point, if you want a video with more scenes, a short film for example, you can ask Gemini or ChatGPT to generate several prompts at once, one per scene you want in the film, and then paste them into Veo 3 one at a time.

In my example, I asked Gemini to create 3 prompts, one for each of 3 scenes, for a short horror-comedy script about a haunted school. I described the style, the basic characters, the setting, and so on.
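
The request you send to the chatbot can also be assembled from the saved template and a short idea. The sketch below is just one possible wording; `build_chatbot_request` is a hypothetical helper, and the instruction text inside it is an assumption, not something Gemini or ChatGPT requires.

```python
def build_chatbot_request(template: str, idea: str, scene_count: int = 1) -> str:
    """Combine the saved template and a short idea into one chatbot message."""
    if scene_count < 1:
        raise ValueError("scene_count must be at least 1")
    noun = "prompt" if scene_count == 1 else "prompts"
    return (
        f"Here is my video prompt template:\n\n{template.strip()}\n\n"
        f"My idea: {idea.strip()}\n\n"
        f"Write {scene_count} {noun}, one per 8-second scene, that follow the "
        "template exactly. Write all dialogue in English and keep the "
        "'translate to Vietnamese and say' instruction in each dialogue line."
    )


# Hypothetical usage: in practice the template would be read from the .txt or .md file.
request = build_chatbot_request(
    template="SPECS: Camera specs, lenses etc\nNO CAPTIONS OR TEXT.",
    idea="A short horror-comedy about a haunted school",
    scene_count=3,
)
print(request)
```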

Step 2: Create the video with Veo 3 in Flow

Once you have the prompt, open Flow (LINK) to create the video with Veo 3. Remember to choose Text to Video, and in the Settings options on the right, set the number of output videos to 1 and pick the highest-end model, Veo 3. Then paste the prompt in and wait.

And here is the result.

But what if you want a video longer than 8 seconds, with more scenes and more complex content? The obstacle is that the feature for using an image as a reference currently works only with Veo 2, which means video without sound. Veo 3 doesn't support it yet. Even so, you can still use the prompt to maximize character consistency across separate text-to-video generations. My approach is to split the script into sub-prompts, each corresponding to one 8-second clip. The description of the main character and the environment, however, is written out carefully on its own and then attached to every scene prompt to try to get matching characters.

Below is a video I made from 6 different scenes, 48 seconds in total, using this approach. The prompt is included further down for anyone interested.
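
The "write the shared description once, attach it to every scene prompt" idea can be sketched like this. `attach_shared_description` is a hypothetical helper and the sample strings are placeholders; it only prepares text to paste into Flow and does not call Veo 3 in any way.

```python
def attach_shared_description(shared: str, scene_prompts: list) -> list:
    """Prepend the same character/setting description to every scene prompt."""
    shared = shared.strip()
    return [f"{shared}\n\n{prompt.strip()}" for prompt in scene_prompts]


# Placeholder text: the real blocks would be the full character and setting details.
shared = "TEACHER: (40s) dark suit, black sunglasses.\nCLASSROOM: old room, flickering light."
scenes = [
    "SPECS: ...\nTEACHER (as described above) stands behind the lectern.",
    "SPECS: ...\nTEACHER (as described above) turns towards a student.",
]

for full_prompt in attach_shared_description(shared, scenes):
    print(full_prompt)
    print("-" * 40)
```

Because every generation starts from a blank slate, repeating the identical description in each prompt is the only lever available here; it improves consistency but does not guarantee it.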
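
As a quick planning aid, the number of scene prompts you need follows directly from the 8-second clip length mentioned above. The helper names below are hypothetical.

```python
import math

CLIP_SECONDS = 8  # length of one Veo 3 clip, as stated in this article


def clips_needed(target_seconds: int) -> int:
    """Number of 8-second clips needed to cover the target duration."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return math.ceil(target_seconds / CLIP_SECONDS)


def total_seconds(clip_count: int) -> int:
    """Total running time of a given number of clips."""
    return clip_count * CLIP_SECONDS


print(total_seconds(6))   # 6 scenes -> 48 seconds
print(clips_needed(60))   # a one-minute video -> 8 clips
```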

As you can see, even though the prompt pushes the model to generate separate videos with the same character, errors still show up: the character's voice and face differ from one generation to the next. Overall the video is only passable, not perfect. On the other hand, if you use the feature that takes a frame from a video as the reference for the next generated video, the resulting clips are more consistent but have no sound. Of course this is only a beta, so Google will most likely update it soon to let us use this feature. When that happens, making long videos with a consistent character will be much simpler.

Prompt for the short video above, for anyone who wants to try it:

Character & Setting Details (Recap):

TEACHER: (40s, Vietnamese) Impeccably dressed in a sharp, dark, well-fitted suit (charcoal grey) with a crisp white shirt. Always wears black, opaque sunglasses that completely hide his eyes. Neat hair. Carries an air of serious, academic authority mixed with a subtle, knowing smirk. Precise movements. Clear, commanding Southern Vietnamese accent.
CLASSROOM LOCATION DESCRIPTION: An old, somewhat decaying room in a Vietnamese building. Stuffy air smelling of old paper, dust, faint incense, and a hint of something chemical.
- Lighting: Primarily lit by a single, bare, flickering fluorescent tube light (cool white) overhead, casting harsh, shifting shadows. Grimy, barred windows let in minimal, fading daylight.
- Walls & Decor: Faded walls covered with pseudo-scientific diagrams of ghosts, faded occult symbols, crudely drawn illustrations of human-ghost encounters, and some out-of-place old human anatomical charts.
- Props: Worn wooden lectern. Mannequin with fake ectoplasm. Dusty shelves with jars of "essences." Stained chalkboard/whiteboard with bizarre formulas. Mismatched old wooden student desks.

Scene 1 Prompt
SPECS: Camera: ARRI Alexa Mini, Lens: Cooke S4/i 32mm.
NO CAPTIONS OR TEXT.
TEACHER (as described above) stands with imposing calm behind the worn wooden lectern, his posture erect as he surveys the sparsely filled room. CLASSROOM (as described above). The single fluorescent tube flickers erratically, casting long, uneasy shadows from the bizarre diagrams and the few students.
WIDE SHOT: Establishes the entire strange classroom: TEACHER at the lectern, the eerie wall decor harshly illuminated, and a few apprehensive STUDENTS scattered at old wooden desks. The sheet-draped mannequin is a silent, unsettling figure in a darker corner. MEDIUM CLOSE-UP: On the TEACHER. His face is partly in shadow due to the overhead light, his sunglasses reflecting the flickering room. A tiny, almost imperceptible smirk is present.
DIALOGUE: TEACHER (authoritative, clear Southern Vietnamese accent, translate to Vietnamese and say): "Welcome to 'Survival When Encountering Supernatural Entities.' First lesson: When a ghost scares you, absolutely do not scream."
A loose ceiling tile visibly trembles, dislodging a small shower of dust. AUDIO: Persistent low electric hum and occasional sharp CRACKLE from the fluorescent light, the faint sound of falling dust, Teacher's distinct voice. KEY ELEMENTS: Mysterious teacher, detailed eerie classroom, supernatural rules, unsettling atmosphere, strong opening.

Scene 2 Prompt
SPECS: Camera: RED Komodo, Lens: Zeiss Supreme Prime 29mm.
NO CAPTIONS OR TEXT.
TEACHER (as described above) pauses, letting his first rule sink in, then offers his peculiar reasoning. STUDENT A (20s, wearing a simple, slightly worn university jacket, looking genuinely anxious and pale) slowly raises a trembling hand. CLASSROOM (as described above). The flickering light seems to pulse, making the shadows writhe. One of the jars on a high shelf appears to subtly rattle.
MEDIUM SHOT: TEACHER leaning slightly over the lectern, his gloved hands (if wearing them, otherwise bare) pressing down on its surface as he explains. He then turns his head with deliberate slowness towards STUDENT A. CLOSE-UP: STUDENT A’s face, eyes wide with a mixture of fear and morbid curiosity. They gulp audibly before asking their question.
DIALOGUE: TEACHER (grave, Southern Vietnamese accent, translate to Vietnamese and say): "Why? Because it will make you... hoarse! Very bad for the vocal cords!" STUDENT A (timid, voice slightly shaky, translate to Vietnamese and say): "Teacher, if a ghost grabs my leg, what should I do?"
The fluorescent light emits a prolonged, louder BUZZ, then briefly dims before returning to its erratic flickering. AUDIO: Teacher's emphatic voice, Student A's hesitant voice, the distinct, prolonged BUZZ and dimming of the light fixture. KEY ELEMENTS: Dark humor, absurd logic, student interaction, building tension, consistent eerie environment, sensory details.

Scene 3 Prompt
SPECS: Camera: Sony Venice, Lens: Panavision Primo 40mm.
NO CAPTIONS OR TEXT.
TEACHER (as described above) straightens up from the lectern, a subtle shift in his posture suggesting he relishes this question. He steps out to the small open space before the desks. CLASSROOM (as described above). The shadows cast by the TEACHER elongate and distort dramatically as he moves. The mannequin in the corner seems, for a split second, to have its head tilted.
MEDIUM LONG SHOT: TEACHER, now center stage in the small clearing, addresses the class. He then begins his demonstration with surprisingly fluid and precise hand gestures, miming the act of tickling empty air with intense focus. CLOSE-UP: On the TEACHER’S face (from the nose down, sunglasses still prominent) showing the serious, almost scientific concentration he applies to the tickling mime.
DIALOGUE: TEACHER (assured, a hint of theatricality, Southern Vietnamese accent, translate to Vietnamese and say): "Very simple! Immediately... cù lét lại nó! Ma cũng biết nhột như ai thôi! Đảm bảo nó sẽ buông ra ngay và cười không nhặt được mồm!" (Accompanies this with the vigorous, precise tickling mime).
A faint, dry, rustling sound, like laughter made of dead leaves, is heard from a dark corner of the room. AUDIO: Teacher's confident and slightly playful voice, the swish of his suit fabric, the unsettling, dry rustling laughter. KEY ELEMENTS: Physical comedy, absurd solution, Teacher's unwavering bizarre confidence, unsettling subtle sound, focused character action.

Scene 4 Prompt
SPECS: Camera: Canon C300 Mark III, Lens: Canon CN-E 35mm T1.5.
NO CAPTIONS OR TEXT.
STUDENT B (20s, dressed in a dark, faded band t-shirt, arms crossed initially, now leaning forward on their desk with a challenging glint in their eye) interjects. TEACHER (as described above) turns smoothly to face Student B, listening with polite, unwavering attention. CLASSROOM (as described above). The minimal light from the grimy windows is now almost non-existent. The room is increasingly dependent on the single, failing fluorescent bulb.
MEDIUM SHOT: STUDENT B delivering their question with a clear, skeptical tone. The TEACHER stands patiently, his silhouette framed against a particularly grotesque diagram on the wall. MEDIUM CLOSE-UP: TEACHER, offering a slow, deliberate nod to Student B. His sunglasses reflect the student's challenging face. A slight, almost condescending smile touches his lips before he speaks.
DIALOGUE: STUDENT B (challenging, firm voice, translate to Vietnamese and say): "Còn nếu ma hiện hình mặt đầy máu me thì sao thầy?" TEACHER (smooth, unperturbed, a tone of explaining something obvious to a child, Southern Vietnamese accent, translate to Vietnamese and say): "À, trường hợp này cần sự tinh tế. Hãy nhẹ nhàng hỏi: 'Anh/chị ơi, mình xài app filter gì mà 'real' quá vậy? Chỉ em với!'"
A distant, mournful howl (dog or something more ambiguous) echoes from outside the building. AUDIO: Student B's challenging voice, Teacher's smooth, condescendingly patient tone, the distant, mournful HOWL. KEY ELEMENTS: Student challenge, more satirical advice, escalating absurdity, Teacher's unshakable composure, ominous external sounds.

Scene 5 Prompt
SPECS: Camera: Panasonic Lumix S1H, Lens: Leica SL 50mm f/1.4.
NO CAPTIONS OR TEXT.
TEACHER (as described above) pushes off lightly from the lectern he had momentarily leaned against, beginning a slow, deliberate pace across the front of the classroom, addressing all students. CLASSROOM (as described above). The room is now very dim. The flickering fluorescent light casts stark, moving shadows, making the eerie diagrams seem to writhe on the walls. The air feels colder.
MEDIUM SHOT: TEACHER pacing, his dark suit making him almost blend into the deeper shadows at the edge of the light's reach, then re-emerging. He makes a sharp, decisive zigzag motion with his hand as he speaks. CLOSE-UP: On a student's notebook, where they have shakily scrawled "CHẠY ZÍC ZẮC???" next to a crude drawing of a ghost.
DIALOGUE: TEACHER (voice now brisk and commanding, a shift in energy, Southern Vietnamese accent, translate to Vietnamese and say): "Và nhớ nhé, khi bị ma rượt, đừng chạy đường thẳng! Hãy chạy theo đường 'zíc zắc'. Ma nó chóng mặt là nó bỏ cuộc ngay!"
The building groans, a deep, structural sound, as if settling or under strain. AUDIO: Teacher's firm, instructive voice, the sound of his footsteps on the old floorboards, the deep GROAN of the building. KEY ELEMENTS: Further absurd advice, building atmosphere, dynamic movement of Teacher, tangible student reaction (notebook), sense of environmental instability.

Scene 6 Prompt
SPECS: Camera: Blackmagic Pocket Cinema Camera 6K Pro, Lens: Sigma Cine 35mm T1.5.
NO CAPTIONS OR TEXT.
TEACHER (as described above) stops his pacing directly under the weakest point of the flickering fluorescent light. He clasps his hands behind his back, assuming a formal, almost final stance. CLASSROOM (as described above). The room is steeped in gloom. The faces of the students are pale and wide-eyed in the unsteady light. The mannequin seems to have its head turned directly towards the Teacher.
CLOSE-UP: On the TEACHER'S face, specifically his mouth and the lower rim of his sunglasses. His expression is serious, almost grave, as he delivers the homework. The failing light flickers intensely across his features. EXTREME CLOSE-UP: The filament inside the fluorescent tube sputtering violently, glowing erratically.
DIALOGUE: TEACHER (tone becoming slightly more conspiratorial, yet firm, Southern Vietnamese accent, translate to Vietnamese and say): "Bài tập về nhà: Tối nay mỗi người tự tắt đèn ở một mình 15 phút, nếu có gì 'vui' thì mai lên chia sẻ kinh nghiệm."
The fluorescent light emits a final, loud POP and ZAP, then DIES COMPLETELY, plunging the room into absolute darkness. A collective, sharp GASP from the students. AUDIO: Teacher's distinct voice delivering the ominous homework, the loud POP and ZAP of the light, the collective student GASP, followed by sudden, heavy silence and perhaps a single, terrified whimper. KEY ELEMENTS: Ominous homework assignment, dramatic lighting failure, cliffhanger ending, heightened sensory impact (sound and sudden darkness), peak suspense.

Source: tinhte.vn/thread/huong-dan-mot-flow-tao-video-voi-ai-veo-3-tao-prompt-tu-dong-lam-duoc-video-dai-nhan-vat-co-dinh.4022406/