Cutoff length in continual pre-training in Llama-factory? #4681
Replies: 2 comments
-
See https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt#preparing-the-dataset for the method we adopted to prepare the pretraining data.
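For reference, the chunking approach shown in that course chapter looks roughly like the sketch below. This is a minimal illustration of the course's method, not LLaMA-Factory's actual code; the `context_length` value, the `"gpt2"` tokenizer, and the `raw_dataset`/`"text"` column wiring are assumptions for illustration.

```python
from transformers import AutoTokenizer

context_length = 8192  # plays the role of cutoff_len
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer works

def tokenize(batch):
    outputs = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,  # a long text yields several chunks
        return_length=True,
    )
    # Keep only full-length chunks; leftovers shorter than context_length
    # are discarded, so the resulting sample count can differ substantially
    # from the raw document count.
    input_batch = [
        ids
        for length, ids in zip(outputs["length"], outputs["input_ids"])
        if length == context_length
    ]
    return {"input_ids": input_batch}

# Usage (assuming raw_dataset is a datasets.Dataset with a "text" column):
# tokenized = raw_dataset.map(tokenize, batched=True,
#                             remove_columns=raw_dataset.column_names)
```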
-
@hiyouga Question 1: Why is there such a large decrease in the number of samples after preprocessing? Thank you so much!
-
When I set cutoff_len = 8192, what will happen to samples that are longer than cutoff_len?
Case 1: They are truncated to a maximum length of 8192.
Case 2: They are split into multiple samples. For example, a sample with a length of 10,000 would be split into one sample of length cutoff_len = 8192 and another of length 10,000 - 8192 = 1808.
Which case will occur with your LLaMA-Factory library?
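For concreteness, here is a tiny hypothetical illustration of the difference between the two cases; the variable names are mine, not LLaMA-Factory's.

```python
tokens = list(range(10_000))  # a hypothetical 10,000-token sample
cutoff_len = 8192

# Case 1 (truncation): tokens past cutoff_len are discarded.
truncated = tokens[:cutoff_len]
assert len(truncated) == 8192  # the remaining 1,808 tokens are lost

# Case 2 (splitting): the sample becomes multiple chunks.
chunks = [tokens[i:i + cutoff_len] for i in range(0, len(tokens), cutoff_len)]
assert [len(c) for c in chunks] == [8192, 1808]  # nothing is lost
```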