Add tokenizer #78
Conversation
I'd like to make the shard-writer a little smaller and more specific: it should just receive tokens and write them, maybe as a concurrent process. I want to be able to encode Pfam in the lead-up to peptides. A minimal sketch of that is below.
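A minimal sketch of what that could look like, assuming the shard writer runs as its own goroutine fed token IDs over a channel; the names (`writeShard`, the shard filename) are illustrative assumptions, not the existing dnadesign code:

```go
package main

import (
	"bufio"
	"os"
)

// writeShard drains token IDs from the tokens channel and appends them to
// path as raw uint8 bytes, reporting any I/O error on the done channel.
// This is a hypothetical sketch of a "just receive tokens and write them"
// concurrent shard writer, not the PR's implementation.
func writeShard(path string, tokens <-chan uint8, done chan<- error) {
	f, err := os.Create(path)
	if err != nil {
		done <- err
		return
	}
	defer f.Close()
	w := bufio.NewWriter(f)
	for tok := range tokens {
		if err := w.WriteByte(tok); err != nil {
			done <- err
			return
		}
	}
	done <- w.Flush()
}

func main() {
	tokens := make(chan uint8, 1024)
	done := make(chan error, 1)
	go writeShard("shard_0000.bin", tokens, done)
	// Tokens would normally arrive from an upstream tokenizer.
	for _, tok := range []uint8{1, 7, 12, 3} {
		tokens <- tok
	}
	close(tokens)
	if err := <-done; err != nil {
		panic(err)
	}
}
```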
According to https://www.biorxiv.org/content/10.1101/2024.06.06.597716v1.full.pdf: "Using the UniParc database with 250 million protein sequences, research on ESM [72] shows that the datasets UR50/S and UR50/D, with 45M and 65M unique sequences respectively, outperform Uniref100 in perplexity (PPL) on a ~670M parameter MLM model." Figure 1 of that paper shows quite significant diminishing returns from using anything beyond UniRef50, and it notes later that UniRef90/50 are basically the best choices. This is interesting for training sparser models. UniRef90 contains roughly 65B tokens; encoded as uint8 that's about 60GB, and I bet I could shave off a little more if I zstd-encoded it.
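(For scale: 65×10⁹ tokens at one byte per token is 65 GB, roughly 60.5 GiB, which matches the ~60GB estimate above before any zstd compression.)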
I don't really have the time to pursue this. The code works, but I'd like more documentation, so I'm going to close this for now.
This PR creates a tokenizer in the dnadesign lib. It is primarily for tokenizing amino acids for consumption by an LLM, in particular llm.c.
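As a rough illustration of the idea (not the actual tokenizer added in this PR), an amino-acid tokenizer can be little more than a table from residue characters to uint8 token IDs; the alphabet ordering and the special pad/EOS tokens below are assumptions:

```go
package main

import "fmt"

const (
	padToken uint8 = 0 // assumed special tokens, purely illustrative
	eosToken uint8 = 1
)

// buildVocab assigns IDs 2..21 to the 20 canonical amino acids.
func buildVocab() map[rune]uint8 {
	vocab := make(map[rune]uint8)
	for i, aa := range "ACDEFGHIKLMNPQRSTVWY" {
		vocab[aa] = uint8(i) + 2
	}
	return vocab
}

// encode converts a protein sequence into token IDs, appending eosToken.
func encode(vocab map[rune]uint8, seq string) ([]uint8, error) {
	out := make([]uint8, 0, len(seq)+1)
	for _, aa := range seq {
		id, ok := vocab[aa]
		if !ok {
			return nil, fmt.Errorf("unknown amino acid %q", aa)
		}
		out = append(out, id)
	}
	return append(out, eosToken), nil
}

func main() {
	vocab := buildVocab()
	ids, err := encode(vocab, "MKVLA")
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // [12 10 19 11 2 1]
}
```

The resulting uint8 stream is what a shard writer (or an llm.c-style data loader) would consume.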