AI Generated GTAV


A Deep Learning project that uses Diffusion Transformers (DiT) to generate Grand Theft Auto V driving footage. This project is based on the Open-Oasis project.

Driving Sample 1 Driving Sample 2

👉 Also check out TEDD1104: Self Driving Car in GTAV

Architecture

Architecture overview diagram

This project implements a diffusion-based video generation model trained on GTA V gameplay footage using:

  • Vision Transformer (ViT) for encoding/decoding frames
  • Diffusion Transformer (DiT) for the generative process
  • Optional action conditioning for controlled generation
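
As a rough illustration of how these pieces interact at generation time, here is a minimal sketch (the loader and scheduler helpers are hypothetical placeholders, not this repository's actual API; generate.py contains the real implementation):

# Illustrative sketch only: load_vit_vae, load_dit and denoise_step are
# hypothetical placeholders, not functions from this repository.
import torch

vae = load_vit_vae("download_path/vit-l-20.safetensors")   # ViT frame encoder/decoder
dit = load_dit("download_path/dit.safetensors")             # Diffusion Transformer

latents = vae.encode(start_frames)                          # encode context frames to latents
for _ in range(num_new_frames):
    frame = torch.randn_like(latents[:, -1:])               # start the next frame from noise
    for t in reversed(range(noise_steps)):
        pred = dit(frame, context=latents, timestep=t, actions=actions)
        frame = denoise_step(frame, pred, t)                 # one reverse-diffusion step
    latents = torch.cat([latents, frame], dim=1)             # append the denoised frame
video = vae.decode(latents)                                  # decode latents back to pixels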

Features

  • ✨ Pretrained Models
  • 🚀 Inference code for generating driving sequences
  • 💻 Complete training pipeline
  • 📊 Training dataset with 1.2M sequences

⚠️ This is a personal exploration project for video diffusion models. The code prioritizes readability and visualization over performance. While functional, results may be imperfect due to limited training resources. Feel free to experiment with the code!

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • TorchVision
  • PyTorch Image Models (timm)
  • Hugging Face Accelerate
  • Hugging Face Transformers
  • Hugging Face Datasets
  • Weights & Biases (wandb), for logging
pip install --upgrade torch torchvision transformers accelerate datasets einops wandb webdataset matplotlib timm 
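
Optionally, a quick sanity check to confirm the main dependencies and a GPU are visible:

# Quick environment check: library versions and GPU visibility.
import torch, transformers, timm

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| timm:", timm.__version__)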

Running Inference

First download the 🤖 Pretrained Models from 🤗Iker/AI-Generated-GTA-V.
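
If you prefer to fetch them programmatically, a minimal sketch with the standard huggingface_hub API (the download_path directory name simply mirrors the paths used in the commands below):

# Download the pretrained weights from the Hugging Face Hub.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Iker/AI-Generated-GTA-V", local_dir="download_path")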

Generate a 32-frame video from random start frames of the test dataset, without action conditioning:

python3 generate.py \
--total-frames 32 \
--dit_model_path download_path/dit.safetensors \
--vae_model_path download_path/vit-l-20.safetensors \
--noise_steps 100 \
--output_path your_video.mp4

Generate from a custom start image, without action conditioning:

python3 generate.py \
--total-frames 32 \
--dit_model_path download_path/dit.safetensors \
--vae_model_path download_path/vit-l-20.safetensors \
--noise_steps 100 \
--output_path your_video.mp4 \
--start_frame images/start_image_1.jpg

Enable action conditioning; by default, every action presses the W key to drive forward. You should use the dit_action.safetensors model:

python3 generate.py \
--total-frames 32 \
--dit_model_path download_path/dit_action.safetensors \
--vae_model_path download_path/vit-l-20.safetensors \
--noise_steps 100 \
--output_path your_video_action_conditioning.mp4 \
--use_actions

Training your own model

The 📊 Full training dataset with 1.2M driving sequences is available at Iker/GTAV-Driving-Dataset.
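
As a hedged example, the dataset can likely be browsed with the standard datasets streaming API without downloading the full ~130 GB, assuming the repository exposes a standard loader (field names depend on the dataset card):

# Stream a few samples for inspection instead of downloading everything.
from datasets import load_dataset

ds = load_dataset("Iker/GTAV-Driving-Dataset", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample.keys())  # field names depend on the dataset card
    if i >= 2:
        break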

To train your own model, you first need to create a configuration file. See configs/train_dit_actions.yaml for an example of a training config with action conditioning, and configs/train_dit.yaml for an example without action conditioning.

Config params

Most of the params in the config files are self-explanatory. You can choose between dataset_type: hfdataset and dataset_type: webdataset. hfdataset is the most stable and fastest setting, but it will download the entire dataset to disk (~130 GB) and then load it into RAM, so it requires A LOT OF RAM. webdataset will stream the dataset from the Hugging Face repository, so only ~6 GB chunks are kept in RAM at a time. It is more memory-efficient but less stable, and you might get connection errors.

The training will store the latest checkpoint in the output folder. If you set resume_from_checkpoint: true and a checkpoint exists, training (optimizer, step, scheduler, dataset, etc.) will be restored from that checkpoint.

You can run the training with the following command; it will use as many GPUs as are available (data parallelism):

accelerate launch --mixed_precision bf16 train_dit.py configs/train_dit_actions.yaml

See train_scripts/ for a Slurm example to launch the training runs.

Results

View generated samples: