Integrate the audio modality in CoCa #94

Draft
wants to merge 129 commits into
base: main

Conversation

manasMauryax
Collaborator

These commits essentially bring in two things:

  • The Conformer audio encoder:
    The Conformer architecture is readily available via torchaudio, and only a few additional modules were coded.

  • Changes to the CoCa code which allow the Conformer encoder and the audio modality to be used with the CoCa architecture:
    These changes include renaming and introducing a few variables and defining their usage, as well as slightly modifying the forward-pass logic.

@manasMauryax manasMauryax requested a review from spravil March 28, 2024 15:40
@manasMauryax manasMauryax marked this pull request as ready for review April 8, 2024 07:55
@manasMauryax manasMauryax self-assigned this Apr 16, 2024
dropout=pre_conformer_dropout,
)

self.conformer = Conformer(
Member

Can we remove the dependency on Conformer and build it with components from the vision transformer? We may want to change the Conformer architecture in the future.

super().__init__()
self.sample_key = sample_key
self.prediction_key = prediction_key
self.pre_conformer = PreConformer(
Member

Is this a tokenization of the input audio? Maybe choose a better name.

Collaborator Author

This is not tokenization, just a reduction in the frame rate of the input. I will come up with a better name.

self.post_conformer = nn.Sequential(
nn.Linear(
input_dims,
n_embd,
Member

Why do we need to project from input_dims to n_embd? input_dims != n_embd?

Collaborator Author

yup, precisely -> input_dims!=n_embd

Collaborator Author

In the Conformer implementation that I have worked on now, this will not be needed. I will project it in the very beginning (before any computation occurs in the conformer blocks).

nn.Conv1d(
in_channels=n_input_dims,
out_channels=n_input_dims,
kernel_size=2,
Member

Two Conv1d layers? Is this common? I assumed we would apply ViT-style patching of the spectrogram with Conv2d.

Collaborator Author

Yup, in speech, sub-sampling like the one being performed here is common.
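
For readers unfamiliar with this kind of sub-sampling, the following is a minimal sketch of how two stacked Conv1d layers reduce the frame rate. The class name `FrameRateReducer`, the helper `subsampled_length`, the stride of 2, and the 80-dimensional input are assumptions made for illustration; only `kernel_size=2` and the matching input/output channel counts come from the diff above.

```python
import torch
from torch import nn


class FrameRateReducer(nn.Module):
    """Hypothetical sub-sampler: two strided Conv1d layers over the time axis."""

    def __init__(self, n_input_dims: int, kernel_size: int = 2, stride: int = 2):
        super().__init__()
        # stride=2 is an assumption; the diff above only shows kernel_size=2.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_input_dims, n_input_dims, kernel_size=kernel_size, stride=stride),
            nn.Conv1d(n_input_dims, n_input_dims, kernel_size=kernel_size, stride=stride),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, features); Conv1d expects (batch, channels, frames).
        x = x.transpose(1, 2)
        x = self.subsample(x)
        return x.transpose(1, 2)


def subsampled_length(n_frames: int, kernel_size: int = 2, stride: int = 2, n_layers: int = 2) -> int:
    """Number of frames left after n_layers unpadded, undilated Conv1d layers."""
    for _ in range(n_layers):
        n_frames = (n_frames - kernel_size) // stride + 1
    return n_frames


features = torch.randn(4, 100, 80)  # e.g. 100 frames of 80 filterbank features
reduced = FrameRateReducer(n_input_dims=80)(features)
```

Under these assumptions each layer halves the number of frames, so the sequence handed to the encoder blocks is roughly four times shorter than the input, while the feature dimension is unchanged.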

text_cls_prediction_key: str
vision_encoder_config: VisionTransformerConfig
modality_encoder_config: AudioTransformerConfig | VisionTransformerConfig | AVConfig
Member

Here we should have a vision config and an audio config, each defaulting to None. If a config is set, the corresponding model is created. With both set to None, we should end up with a normal language model.
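
A minimal sketch of what this suggestion could look like, using plain dataclasses as stand-ins. The config classes here are simplified placeholders rather than the project's real ones, and `CoCaConfigSketch`, `audio_encoder_config`, and `active_modalities` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisionTransformerConfig:  # simplified stand-in for the real config class
    n_embd: int = 768


@dataclass
class AudioTransformerConfig:  # simplified stand-in for the real config class
    n_embd: int = 768


@dataclass
class CoCaConfigSketch:  # hypothetical name, not the project's config
    text_cls_prediction_key: str
    # Each modality is optional; None means "do not build this encoder".
    vision_encoder_config: Optional[VisionTransformerConfig] = None
    audio_encoder_config: Optional[AudioTransformerConfig] = None


def active_modalities(config: CoCaConfigSketch) -> List[str]:
    """Return the modalities whose encoders would be built for this config."""
    modalities = []
    if config.vision_encoder_config is not None:
        modalities.append("vision")
    if config.audio_encoder_config is not None:
        modalities.append("audio")
    return modalities


text_only = CoCaConfigSketch(text_cls_prediction_key="text_cls")
audio_text = CoCaConfigSketch(
    text_cls_prediction_key="text_cls",
    audio_encoder_config=AudioTransformerConfig(),
)
```

With both configs left at None, `active_modalities` returns an empty list, which corresponds to the plain language-model case described in the comment.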

def _init_modality(self, encoder_class, encoder_config, n_queries):
encoder = encoder_class(**dict(encoder_config))
queries = nn.Parameter(torch.randn(n_queries + 1, encoder_config.n_embd))
attn_pool = AttentionPooling(
Member

The attention pooling layer should attend to the combination of the audio and vision encoder output tokens if both are activated.

Collaborator Author

Maybe this is something for the future since, currently, we don't have parallel data across all modalities.
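
To make the reviewer's suggestion concrete, here is a rough sketch of attention pooling over the concatenated vision and audio tokens. It uses `nn.MultiheadAttention` as a stand-in for the project's own `AttentionPooling` module, assumes both encoders emit tokens with the same `n_embd`, and the class name `JointAttentionPooling` is hypothetical.

```python
from typing import Optional

import torch
from torch import nn


class JointAttentionPooling(nn.Module):
    """Hypothetical pooling layer: learned queries attend to all available tokens."""

    def __init__(self, n_embd: int, n_heads: int, n_queries: int):
        super().__init__()
        # n_queries + 1 mirrors the diff above (one extra query for the cls token).
        self.queries = nn.Parameter(torch.randn(n_queries + 1, n_embd))
        self.attn = nn.MultiheadAttention(n_embd, n_heads, batch_first=True)

    def forward(
        self,
        vision_tokens: Optional[torch.Tensor] = None,
        audio_tokens: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        tokens = [t for t in (vision_tokens, audio_tokens) if t is not None]
        if not tokens:
            raise ValueError("at least one modality must provide tokens")
        # Concatenate along the sequence axis: (batch, n_vision + n_audio, n_embd).
        context = torch.cat(tokens, dim=1)
        queries = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        pooled, _ = self.attn(queries, context, context)
        return pooled


pool = JointAttentionPooling(n_embd=32, n_heads=4, n_queries=8)
vision = torch.randn(2, 10, 32)
audio = torch.randn(2, 25, 32)
joint = pool(vision_tokens=vision, audio_tokens=audio)
audio_only = pool(audio_tokens=audio)
```

The output shape depends only on the number of queries, so the same layer works whether one or both modalities are present in a batch.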

vision_embd, vision_cls_token = self._forward_encode_vision(inputs)
# TODO: The "modality_key" needs to be implemented.
if inputs[self.modality_key][0] == self.AUDIO:
modality_embd, modality_cls_token = self._forward_encode_audio(inputs)
Member

Apply this if the audio encoder exists. I'm not sure whether we also want to check that audio data is in the inputs. Checking explicitly might help with training on only two modalities at a time.

Collaborator Author

Again, for the same reason as mentioned above, currently we can only train on two modalities at a time.
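
The check discussed in this thread could be sketched as below: an encoder runs only if it was built and its data is present in the batch. The function name and the `images` and `audio` keys are hypothetical, and simple callables stand in for the real encoders.

```python
from typing import Any, Callable, Dict, Optional

Encoder = Callable[[Any], Any]


def encode_available_modalities(
    inputs: Dict[str, Any],
    vision_encoder: Optional[Encoder] = None,
    audio_encoder: Optional[Encoder] = None,
    vision_key: str = "images",  # hypothetical sample key
    audio_key: str = "audio",  # hypothetical sample key
) -> Dict[str, Any]:
    """Run each encoder only if it exists and its data is in the batch."""
    outputs: Dict[str, Any] = {}
    if vision_encoder is not None and vision_key in inputs:
        outputs["vision"] = vision_encoder(inputs[vision_key])
    if audio_encoder is not None and audio_key in inputs:
        outputs["audio"] = audio_encoder(inputs[audio_key])
    return outputs


# Toy usage: len() stands in for an encoder, and the batch holds audio and text only.
batch = {"audio": [0.1, 0.2, 0.3], "text": "a dog barking"}
encoded = encode_available_modalities(batch, vision_encoder=len, audio_encoder=len)
```

Because both conditions are checked, a model built with both encoders can still be trained on batches that carry only one non-text modality.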

4 participants