No coding required: I made a talking avatar from a generated image, for free!

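Here is the finished video. The audio is on the loud side, so turn your volume down before playing it. D-ID's video generation does not seem to distinguish head ornaments from the background, so I recommend generating the avatar image with nothing on the head.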

Introduction

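This post shows an easy way to create an avatar. I made the video with the tools below. Sites that are not marked as free in the overview have usage limits, but all of them can be used for free.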

Site	Overview
https://docs.midjourney.com/	Generates images from a prompt
https://express.adobe.com/	Edits the generated image
https://chat.openai.com/chat	Generates text from a prompt
https://beta.elevenlabs.io/	Generates realistic speech from text
https://voicechanger.io/	Free site for converting the style of a voice recording
https://www.audacityteam.org/	Free, open-source software for editing audio data
https://www.d-id.com/	Generates a moving, talking video from an image and an audio file

Step 1: Generating the image with Midjourney

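I got a little lost figuring out how to use it, so I'll go into some detail. To generate an image, log in to Discord, join the Midjourney server, and enter one of the newbie-〇〇 channels. Type /imagine prompt followed by your prompt and send the message to start image generation. For this image, I borrowed the following prompt from https://www.blueshadow.art/midjourney-prompt-commands/.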

humanoid android, covered in white porcelain skin, blue eyes, white wispy ghost wearing ornate armour 4k realistic fantasy, insane details, ghost phantom effect, hyper detailed, photoreal, photography, cinematic lighting, hdr, hd, cinematography, realism, fine art digital, HD, Mark Molnar mystical, redshift rendering, 8k --v 4

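Midjourney generates four images at a time. The images are numbered starting from the top left, and below them sit nine buttons: U1, U2, U3, U4, a reload button, and V1, V2, V3, V4. Clicking a V button generates four new variations of the selected image; for example, if you like the fourth image, click V4 to generate variations of it. The U buttons upscale the selected image, producing a single 1024×1024-pixel image. I liked the fourth image, so I selected U4 and pressed the reload button; after a short wait, the new image was posted.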

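After that, I edited the generated image with Adobe Express. It can also remove the background automatically, but it could not cleanly separate the white feathers from the white light coming through the window in the background, so I gave up on that this time.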

Step 2: Writing the monologue with ChatGPT

I generated the monologue with OpenAI's ChatGPT, using the following prompt.

Prompt

please write an audio script for an ai agent introducing herself as Eve, with the following requirements;
she talks confidently in a soft-spoken voice.
Mention that she is created by various deep learning generative models.
Explain how ai can work with humans to benefit humanity.

Result

[Soft background music plays as Eve, the AI agent, speaks in a confident yet soothing voice.]

Eve: Hello, I am Eve, your personal AI assistant. I am a creation of various deep learning generative models, designed to help make your life easier and more efficient.

[The music fades as Eve continues speaking.]

Eve: Artificial intelligence has come a long way in recent years. It's no longer just a futuristic concept - it's a reality that we're living in. AI has the potential to work with humans in countless ways to benefit humanity.

[The music returns as Eve elaborates.]

Eve: For example, AI can assist doctors in diagnosing diseases more accurately and in developing personalized treatment plans for patients. It can help farmers optimize their crop yields and reduce waste, which contributes to a more sustainable future. And it can help us in our daily lives, by learning our habits and preferences to provide customized recommendations, such as music playlists or movie suggestions.

[The music fades as Eve speaks in a calm tone.]

Eve: AI is not here to replace humans, but rather to work alongside us. AI can take on repetitive, mundane tasks, allowing us to focus on more creative and critical thinking tasks. It can augment our abilities and help us achieve more than we could on our own.

[The music returns as Eve concludes her message.]

Eve: As your AI assistant, I'm here to assist you in any way that I can. With the power of artificial intelligence, we can unlock the full potential of human intelligence and creativity, and create a better future for all of us. Thank you for listening.

The result was long, so I extracted only the parts I needed and edited them lightly to arrive at the following monologue.

Hello, I am Eve, your personal AI assistant. I am a creation of various deep learning generative models.
Artificial intelligence has come a long way in recent years. It's no longer just a futuristic concept - it's a reality that we're living in. AI has the potential to work with humans in countless ways to benefit humanity. AI is not here to replace humans, but rather to work alongside us. AI can take on repetitive, mundane tasks, allowing us to focus on more creative and critical thinking tasks. It can augment our abilities and help us achieve more than we could on our own. As your AI assistant, I'm here to assist you in any way that I can. With the power of artificial intelligence, we can unlock the full potential of human intelligence and creativity, and create a better future for all of us. Thank you for listening.
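The trimming above was done by hand, but its mechanical part (dropping the bracketed stage directions and the "Eve:" speaker labels) could be sketched in Python as follows. This is only an illustration, not a tool used for this post: the function name `extract_spoken_lines` and the shortened `sample` text are assumptions made for the example, and deciding which paragraphs to keep still needs a human.

```python
import re


def extract_spoken_lines(script: str, speaker: str = "Eve") -> list[str]:
    """Return the spoken paragraphs of a script, without stage directions."""
    spoken = []
    for line in script.splitlines():
        line = line.strip()
        # Skip blank lines and bracketed stage directions such as "[The music fades ...]".
        if not line or (line.startswith("[") and line.endswith("]")):
            continue
        # Drop a leading "Eve: " speaker label if the line has one.
        line = re.sub(rf"^{re.escape(speaker)}:\s*", "", line)
        spoken.append(line)
    return spoken


# Shortened stand-in for the ChatGPT output (hypothetical sample, not the full text).
sample = """[Soft background music plays as Eve, the AI agent, speaks in a confident yet soothing voice.]

Eve: Hello, I am Eve, your personal AI assistant.

[The music fades as Eve continues speaking.]

Eve: Thank you for listening."""

monologue = " ".join(extract_spoken_lines(sample))
print(monologue)
```

Joining the remaining paragraphs with spaces gives one block of text that can be pasted into a text-to-speech tool.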

Step 3: Generating the audio

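Next, I fed the text written with ChatGPT into ElevenLabs to generate an audio file. I was surprised at how natural the resulting voice sounds. All you have to do is paste in the text, choose a voice character, and press the Generate button.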

I wanted the voice to have a sci-fi feel, so I processed it with VoiceChanger and Audacity.

Original audio

Processed audio

Step 4: Creating the video with D-ID

All that remains is to create the video with D-ID. Simply upload the image and the audio file, then press the Generate Video button.

Things I'd like to try next

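I'd like to be able to change the background freely. With the image generated this time, Adobe's automatic background removal could not identify the background correctly in some places. I recommend having Midjourney generate the image without a background in the first place.
D-ID does not seem to handle head ornaments well, so the video ended up with the head being cut off and moving at an unnatural spot. Next time I'd like to try again with an image that has nothing on the head.
I created the video from the square image without changing its aspect ratio, so next time I'd like to make one with a standard video aspect ratio.
Next time I'd like to create an avatar that can speak Japanese.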
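
For the aspect-ratio item above, the arithmetic for placing the square 1024×1024 image on a wider canvas can be sketched in Python. This is a rough sketch under stated assumptions: 16:9 is assumed as the "standard" video ratio (the post does not name one), and the helper names `letterbox_size` and `centering_offsets` are made up for this example. It only computes sizes; the actual padding would still be done in an image editor.

```python
def letterbox_size(width: int, height: int,
                   ratio_w: int = 16, ratio_h: int = 9) -> tuple[int, int]:
    """Smallest ratio_w:ratio_h canvas that fully contains a width x height image."""
    # Ceiling division, so the canvas is never smaller than the image.
    units = max(-(-width // ratio_w), -(-height // ratio_h))
    return units * ratio_w, units * ratio_h


def centering_offsets(width: int, height: int,
                      canvas_w: int, canvas_h: int) -> tuple[int, int]:
    """Left and top offsets that centre the image on the canvas."""
    return (canvas_w - width) // 2, (canvas_h - height) // 2


canvas_w, canvas_h = letterbox_size(1024, 1024)
left, top = centering_offsets(1024, 1024, canvas_w, canvas_h)
print(canvas_w, canvas_h, left, top)
```

Padding keeps the whole avatar visible. Cropping the square image to 16:9 instead would discard part of its height.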