Tokenizer does not derive the newer config #6415

Open · 1 task done
xiaosu-zhu opened this issue Dec 21, 2024 · 0 comments

Labels
pending This problem is yet to be addressed

Reminder

  • I have read the README and searched the existing issues.

System Info

llamafactory version: 0.9.2.dev0
Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python version: 3.10.15
PyTorch version: 2.5.1+cu124 (GPU)
Transformers version: 4.46.1
Datasets version: 3.1.0
Accelerate version: 1.0.1
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA H100 80GB HBM3
DeepSpeed version: 0.15.4

Reproduction

I found that model_max_length in tokenizer_config.json is not updated when model_args.model_max_length (which equals cutoff_len) is changed.

The cause may be in model/loader.py:

def load_tokenizer(model_args: "ModelArguments") -> "TokenizerModule":
    r"""
    Loads pretrained tokenizer and optionally loads processor.

    Note: including inplace operation of model_args.
    """
    init_kwargs = _get_init_kwargs(model_args)
    config = load_config(model_args)
    try:
        tokenizer = AutoTokenizer.from_pretrained(
            model_args.model_name_or_path,
            use_fast=model_args.use_fast_tokenizer,
            split_special_tokens=model_args.split_special_tokens,
            padding_side="right",
            **init_kwargs,
        )
    except ValueError:  # try the fast one
        tokenizer = AutoTokenizer.from_pretrained(
            model_args.model_name_or_path,
            use_fast=True,
            padding_side="right",
            **init_kwargs,
        )
    except Exception as e:
        raise OSError("Failed to load tokenizer.") from e

...

Here the tokenizer is loaded from model_name_or_path, but model_max_length is never taken from model_args.
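
As a minimal sketch with plain transformers (the checkpoint name is only a placeholder, not from this project), the value from the checkpoint's tokenizer_config.json wins unless it is overridden at load time:

from transformers import AutoTokenizer

# Loaded as in load_tokenizer above: model_max_length comes from the
# checkpoint's tokenizer_config.json, not from model_args / cutoff_len.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.model_max_length)  # whatever the checkpoint's config says

# model_max_length is a standard tokenizer init kwarg, so overriding it
# at load time does take effect:
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", model_max_length=2048
)
print(tokenizer.model_max_length)  # 2048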

Expected behavior

I don't know whether tokenizer.model_max_length is configured elsewhere, but passing the argument during tokenizer creation would still make sense.
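
For example, one possible fix (a sketch, not an actual patch; it assumes model_args.model_max_length may be None when unset) would be to forward the value inside load_tokenizer:

init_kwargs = _get_init_kwargs(model_args)
config = load_config(model_args)

# Sketch of a fix: forward the user-specified maximum length so the loaded
# tokenizer (and the tokenizer_config.json written on save) reflects
# cutoff_len. model_max_length is a standard PreTrainedTokenizerBase
# init kwarg, so from_pretrained will honor it.
if getattr(model_args, "model_max_length", None) is not None:
    init_kwargs.setdefault("model_max_length", model_args.model_max_length)

tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    use_fast=model_args.use_fast_tokenizer,
    split_special_tokens=model_args.split_special_tokens,
    padding_side="right",
    **init_kwargs,
)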

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label on Dec 21, 2024