Tokenizer does not derive the newer config #6415

Open · 1 task done
xiaosu-zhu opened this issue Dec 21, 2024 · 0 comments

Labels
pending This problem is yet to be addressed

Reminder

  • I have read the README and searched the existing issues.

System Info

llamafactory version: 0.9.2.dev0
Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python version: 3.10.15
PyTorch version: 2.5.1+cu124 (GPU)
Transformers version: 4.46.1
Datasets version: 3.1.0
Accelerate version: 1.0.1
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA H100 80GB HBM3
DeepSpeed version: 0.15.4

Reproduction

I found that model_max_length in tokenizer_config.json is not updated when model_args.model_max_length (which equals cutoff_len) is changed.

The cause may be in model/loader.py:

def load_tokenizer(model_args: "ModelArguments") -> "TokenizerModule":
    r"""
    Loads pretrained tokenizer and optionally loads processor.

    Note: including inplace operation of model_args.
    """
    init_kwargs = _get_init_kwargs(model_args)
    config = load_config(model_args)
    try:
        tokenizer = AutoTokenizer.from_pretrained(
            model_args.model_name_or_path,
            use_fast=model_args.use_fast_tokenizer,
            split_special_tokens=model_args.split_special_tokens,
            padding_side="right",
            **init_kwargs,
        )
    except ValueError:  # try the fast one
        tokenizer = AutoTokenizer.from_pretrained(
            model_args.model_name_or_path,
            use_fast=True,
            padding_side="right",
            **init_kwargs,
        )
    except Exception as e:
        raise OSError("Failed to load tokenizer.") from e

...

Here the tokenizer is loaded from model_name_or_path, but model_max_length is never taken from model_args.
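
As a minimal sketch with plain transformers (the checkpoint name is only a placeholder, not from this project), the value from the checkpoint's tokenizer_config.json wins unless it is overridden at load time:

from transformers import AutoTokenizer

# Loaded as in load_tokenizer above: model_max_length comes from the
# checkpoint's tokenizer_config.json, not from model_args / cutoff_len.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.model_max_length)  # whatever the checkpoint's config says

# model_max_length is a standard tokenizer init kwarg, so overriding it
# at load time does take effect:
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", model_max_length=2048
)
print(tokenizer.model_max_length)  # 2048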

Expected behavior

I don't know whether tokenizer.model_max_length is configured elsewhere, but passing the argument during tokenizer creation would still make sense.
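
For example, one possible fix (a sketch, not an actual patch; it assumes model_args.model_max_length may be None when unset) would be to forward the value inside load_tokenizer:

init_kwargs = _get_init_kwargs(model_args)
config = load_config(model_args)

# Sketch of a fix: forward the user-specified maximum length so the loaded
# tokenizer (and the tokenizer_config.json written on save) reflects
# cutoff_len. model_max_length is a standard PreTrainedTokenizerBase
# init kwarg, so from_pretrained will honor it.
if getattr(model_args, "model_max_length", None) is not None:
    init_kwargs.setdefault("model_max_length", model_args.model_max_length)

tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    use_fast=model_args.use_fast_tokenizer,
    split_special_tokens=model_args.split_special_tokens,
    padding_side="right",
    **init_kwargs,
)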

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label on Dec 21, 2024