NLLB is unable to translate into a complete long sentence in Chinese. #5549

Open
logicvv opened this issue Oct 10, 2024 · 1 comment

logicvv commented Oct 10, 2024

🐛 Bug

Hi, I tested NLLB on translating English sentences into Chinese. All of my sentences are shorter than 60 tokens, but most sentences longer than 30 tokens are not translated completely: only half of the sentence, or less, is generated.

I also ran the same code for English to French, and it works: every sentence is generated completely.

I also tried setting min_length, but then, when the sentence is short, the last part of the output is sometimes generated repeatedly.
My code is below. Please help:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    r"nllb-200-distilled-600M", token=True, src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(r"nllb-200-distilled-600M", token=True)

input_path = r"eng_test_short.txt"
output_path = "./nllb_chn.txt"

# Open both files in one context manager so they are closed on exit.
with open(input_path, "r", encoding="utf-8") as input_file, \
        open(output_path, "w", encoding="utf-8") as f:
    # Translate the input one line at a time.
    for article in input_file:
        inputs = tokenizer(article, return_tensors="pt")
        # print(article)
        # print(inputs)
        translated_tokens = model.generate(
            # **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=200
            **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"), max_length=512
        )
        print(tokenizer.convert_tokens_to_ids("zho_Hans"))

        # model_max_length is not a decoding option, so it is dropped here.
        output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

        print(output)
        f.write(output + "\n")

The output looks like this:
input:
Politicians are loath to raise the tax even one penny when gas prices are high.
output:
政客们不愿意在高昂的燃油价格时,
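
A quick way to narrow this down is to check whether generation stopped because the model emitted its end-of-sequence token early, or because it ran into `max_length`. The helper below is a minimal sketch and not part of the original report: `stopped_early` is a hypothetical name, and the token IDs in the demo are made up for illustration.

```python
def stopped_early(token_ids, eos_id, pad_id, max_length):
    """Return True if the sequence ends with EOS before reaching max_length."""
    ids = list(token_ids)
    # Drop trailing padding that batched generation may append.
    while ids and ids[-1] == pad_id:
        ids.pop()
    # Require more than one token so a lone start token is not counted as EOS.
    return 1 < len(ids) < max_length and ids[-1] == eos_id


# Made-up IDs for illustration only: 2 plays the role of EOS, 1 of padding.
demo_ids = [2, 9, 10, 11, 2, 1, 1]
print(stopped_early(demo_ids, eos_id=2, pad_id=1, max_length=512))
```

With the script above, it could be called as `stopped_early(translated_tokens[0].tolist(), tokenizer.eos_token_id, tokenizer.pad_token_id, 512)`. A `True` result would suggest that the model itself ended the sequence, in which case raising `max_length` alone is unlikely to help.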

logicvv changed the title from "NLLN is unable to translate into a complete long sentence in Chinese." to "NLLB is unable to translate into a complete long sentence in Chinese." on Oct 10, 2024
@LiPengtao0504

I also encountered this problem.
Src: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.

Tgt: 他补充道:“我们现在有4个月大没有糖尿病的老鼠,但它们曾经得过该病。”

Predict: 他补充说:"我们现在有4个月的小鼠,
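
Both truncated outputs in this thread end in a comma, so a rough heuristic can flag suspicious lines in a larger output file. The sketch below is an assumption-laden check, not an NLLB or transformers feature: `looks_truncated`, the punctuation list, and the `min_ratio` threshold are all hypothetical choices.

```python
# Clause-level punctuation (ASCII and full-width) that rarely ends a full sentence.
TRAILING_CLAUSE_MARKS = (",", ",", "、", ";", ";", ":", ":")


def looks_truncated(prediction, reference=None, min_ratio=0.6):
    """Heuristically flag a translation that appears to be cut off."""
    text = prediction.strip()
    if not text:
        return True
    # An output that ends on a comma or similar mark probably stopped mid-sentence.
    if text.endswith(TRAILING_CLAUSE_MARKS):
        return True
    # If a reference is available, also flag outputs that are much shorter.
    if reference:
        return len(text) < min_ratio * len(reference.strip())
    return False


# Strings taken from the example in this thread.
prediction = '他补充说:"我们现在有4个月的小鼠,'
reference = "他补充道:“我们现在有4个月大没有糖尿病的老鼠,但它们曾经得过该病。”"
print(looks_truncated(prediction, reference))
print(looks_truncated(reference))
```

A check like this only helps to count and locate affected lines; it says nothing about why the model stops early.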
