Welcome to the tutorial on building your first multimodal generative AI (Gen AI) app! This repository contains all the resources and code you need to get started with creating an app that can generate text, audio, images, and videos using various AI models and APIs.
Before you begin, make sure you have the following:
- A GitHub account
- GitHub Codespaces enabled (comes with your GitHub account)
- API keys for OpenAI or Groq, and for Replicate (needed for full functionality)
- Basic knowledge of Python and Bash
Note about GitHub Codespaces:
- GitHub Codespaces is included with every GitHub account.
- There's a substantial monthly free tier for personal accounts (120 core hours/month as of 2024).
- If you exceed the free tier, you may need to purchase additional usage.
- For the latest information on GitHub Codespaces pricing and usage limits, please check the official GitHub documentation.
Note for Workshop Participants: If you are taking this workshop at a conference or other event, please check with your instructors or teachers to see if they are providing the API keys for you.
To get up and running, you can watch the walkthrough video (Building.a.Multi-Modal.AI.App.with.GitHub.Code.Spaces.--SUB.mp4; the instructions start at around 1:50, after a motivating demo) and/or follow the instructions below:
- Open the repository in GitHub.
- Click the "Code" button and select "Create codespace on main".
- Wait for the Codespace to spin up (this should take about 2 minutes).
- In the Codespace, navigate to the .streamlit directory inside the multimodal_app folder.
- Open the secrets.toml file.
- Add your API keys as follows:
OPENAI_API_KEY = "your_openai_api_key"
GROQ_API_KEY = "your_groq_api_key"
REPLICATE_API_TOKEN = "your_replicate_api_token"
- Save the file and ensure these keys are kept private and secure.
You'll need API keys for either OpenAI or Groq, and Replicate (for full functionality).
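Before launching the app, it can help to confirm that the keys in secrets.toml are filled in rather than left as placeholders. The following is a minimal sketch and not part of the repository: the helper names (`is_set`, `check_secrets`, `load_secrets`) are hypothetical, the file path assumes you run it from the repository root, and the rule that a value starting with `your_` is still a placeholder is an assumption based on the example values above.

```python
from pathlib import Path

# Key names taken from the secrets.toml example above.
TEXT_KEYS = ("OPENAI_API_KEY", "GROQ_API_KEY")
REPLICATE_KEY = "REPLICATE_API_TOKEN"


def is_set(secrets: dict, name: str) -> bool:
    """True if the key has a non-empty value that is not a 'your_...' placeholder."""
    value = str(secrets.get(name, "")).strip()
    return bool(value) and not value.startswith("your_")


def check_secrets(secrets: dict) -> list[str]:
    """Return a list of problems; an empty list means the keys look usable."""
    problems = []
    # Either OpenAI or Groq is enough, per the note above.
    if not any(is_set(secrets, key) for key in TEXT_KEYS):
        problems.append("Set OPENAI_API_KEY or GROQ_API_KEY.")
    if not is_set(secrets, REPLICATE_KEY):
        problems.append("Set REPLICATE_API_TOKEN for full functionality.")
    return problems


def load_secrets(path: Path) -> dict:
    """Parse a TOML file; tomllib is in the standard library from Python 3.11."""
    import tomllib

    with path.open("rb") as fh:
        return tomllib.load(fh)


if __name__ == "__main__":
    # Assumed location, relative to the repository root.
    secrets_path = Path("multimodal_app/.streamlit/secrets.toml")
    if not secrets_path.exists():
        print(f"Not found: {secrets_path}")
    else:
        for line in check_secrets(load_secrets(secrets_path)) or ["Keys look OK."]:
            print(line)
```

This only checks that values are present; it does not verify that a key is valid with the provider.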
- Once the Codespace finishes configuring, it will automatically install Poetry.
- In the Codespace terminal, activate the Poetry environment:
cd multimodal_app
poetry shell
To run the Streamlit app:
- Ensure you're in the multimodal_app directory and have activated the Poetry shell.
- Run the following command:
streamlit run main.py
- Click "Open in browser" when prompted to view the app.
The multimodal Gen AI app allows you to:
- Record speech or type text input.
- Transcribe speech to text.
- Generate text responses based on your input.
- Create audio versions of the text.
- Generate images based on the content.
- Create videos incorporating the generated content.
To use the app:
- Click the record button to speak, or type your input.
- Click "Transcribe" to convert speech to text (if applicable).
- Choose to run all tasks concurrently or step-by-step.
- Explore the generated text, audio, images, and videos.
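To illustrate the difference between running tasks step-by-step and concurrently, here is a small sketch using Python's standard `concurrent.futures` module. It is not the app's actual code: `make_audio`, `make_image`, and `make_video` are hypothetical stand-ins that only sleep briefly in place of real API calls, and the sketch assumes the three tasks are independent of each other once the text response exists.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the real API calls; each one just waits briefly.
def make_audio(text: str) -> str:
    time.sleep(0.1)
    return f"audio for: {text}"


def make_image(text: str) -> str:
    time.sleep(0.1)
    return f"image for: {text}"


def make_video(text: str) -> str:
    time.sleep(0.1)
    return f"video for: {text}"


TASKS = {"audio": make_audio, "image": make_image, "video": make_video}


def run_step_by_step(text: str) -> dict[str, str]:
    """Run each task in turn; total time is roughly the sum of all tasks."""
    return {name: task(text) for name, task in TASKS.items()}


def run_concurrently(text: str) -> dict[str, str]:
    """Submit all tasks at once; total time is roughly that of the slowest task."""
    with ThreadPoolExecutor(max_workers=len(TASKS)) as pool:
        futures = {name: pool.submit(task, text) for name, task in TASKS.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    for runner in (run_step_by_step, run_concurrently):
        start = time.perf_counter()
        runner("a short story about a lighthouse")
        print(f"{runner.__name__}: {time.perf_counter() - start:.2f}s")
```

Threads suit this kind of work because the tasks spend most of their time waiting on network responses. Step-by-step mode is still useful when you want to inspect each result before spending API credits on the next one.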
If you encounter any issues:
- Ensure all API keys are correctly entered in the secrets.toml file.
- Check that you're in the correct directory (multimodal_app) when running commands.
- Verify that all dependencies are installed by running poetry install if needed.
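The troubleshooting checks above can be sketched as a small preflight script. This is an illustrative helper, not something shipped with the repository: the function name `preflight` is hypothetical, and treating `streamlit` as the module that shows whether dependencies are installed is an assumption based on the run command above.

```python
import importlib.util
import shutil
from pathlib import Path


def preflight(app_dir: Path, modules: tuple[str, ...] = ("streamlit",)) -> list[str]:
    """Return a list of likely setup problems; an empty list means none were found."""
    problems = []
    # main.py is the entry point used by "streamlit run main.py".
    if not (app_dir / "main.py").is_file():
        problems.append(f"main.py not found in {app_dir}; run commands from multimodal_app.")
    if not (app_dir / ".streamlit" / "secrets.toml").is_file():
        problems.append("secrets.toml not found in the .streamlit directory.")
    if shutil.which("poetry") is None:
        problems.append("poetry is not on PATH.")
    # A module that cannot be located suggests dependencies are not installed
    # in the active environment.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"Package '{name}' is not importable; try 'poetry install'.")
    return problems


if __name__ == "__main__":
    for line in preflight(Path.cwd()) or ["No problems found."]:
        print(line)
```

Run it from the multimodal_app directory inside the Poetry shell so that it inspects the same environment the app will use.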
We welcome contributions to improve this project! Please feel free to submit issues or pull requests.
Happy building! We hope you enjoy creating your first multimodal Gen AI app. If you have any questions or feedback, please don't hesitate to reach out.