NLLB is unable to translate into a complete long sentence in Chinese.
🐛 Bug
Hi, I tried to test NLLB for translating some English sentences into Chinese. All of my sentences are under 60 tokens, yet most sentences longer than 30 tokens are not translated completely: only half of the output, or less, is generated.
I also tried the same code for English to French, and it works: every sentence is generated completely.
I also set min_length, but it only sometimes helps: for short sentences the last part of the sentence is generated completely, while longer ones are still cut off.
Here is my code, please help:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    r"nllb-200-distilled-600M", token=True, src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(r"nllb-200-distilled-600M", token=True)

input_path = r"eng_test_short.txt"
output_path = "./nllb_chn.txt"

with open(input_path, 'r', encoding='utf-8') as input_file, \
     open(output_path, 'w', encoding='utf-8') as f:
    for article in input_file:
        inputs = tokenizer(article, return_tensors="pt")
        translated_tokens = model.generate(
            # The same call with "fra_Latn" (English -> French) works fine:
            # **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=200
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
            max_length=512,
        )
        # Decode the generated ids and write one translation per line.
        f.write(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] + "\n")
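(An aside, and only an assumption on my part: max_length in generate() caps the total decoder length, so 512 should be more than enough for sentences under 60 tokens, and the cap itself is probably not what truncates the output. In recent transformers versions, max_new_tokens is the more explicit knob; a variant of the call above:)

translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
    max_new_tokens=256,  # assumption: cap only the newly generated tokens
)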
The output looks like this:
input:
Politicians are loath to raise the tax even one penny when gas prices are high.
output:
政客们不愿意在高昂的燃油价格时，
(roughly: "Politicians are unwilling to, when fuel prices are high,"; the translation stops mid-sentence)
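The output stops at a comma rather than at the length cap, which suggests the decoder emits EOS early. Below is a minimal diagnostic sketch (my addition, not from the original report) that checks both possibilities, input-side truncation and early EOS, and tries min_new_tokens as an assumed workaround:

article = "Politicians are loath to raise the tax even one penny when gas prices are high."
inputs = tokenizer(article, return_tensors="pt")
# If this count is much smaller than expected, the tokenizer is truncating the input.
print("input token count:", inputs["input_ids"].shape[1])

translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
    max_length=512,
    min_new_tokens=40,  # assumption: force a longer output to rule out early EOS
)
# If the sequence ends with EOS well below max_length, the model chose to stop on its own.
print("ends with EOS:", translated_tokens[0, -1].item() == tokenizer.eos_token_id)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])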