Add tokenizer #78
Conversation
I'd like to make the shard-writer a little smaller and more specific: it should just receive tokens and write them, maybe as a concurrent process. I want to be able to encode Pfam in the lead-up to peptides. A minimal sketch of that is below.
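A minimal sketch of what that could look like, assuming the shard writer runs as its own goroutine fed token IDs over a channel; the names (`writeShard`, the shard filename) are illustrative assumptions, not the existing dnadesign code:

```go
package main

import (
	"bufio"
	"os"
)

// writeShard drains token IDs from the tokens channel and appends them to
// path as raw uint8 bytes, reporting any I/O error on the done channel.
// This is a hypothetical sketch of a "just receive tokens and write them"
// concurrent shard writer, not the PR's implementation.
func writeShard(path string, tokens <-chan uint8, done chan<- error) {
	f, err := os.Create(path)
	if err != nil {
		done <- err
		return
	}
	defer f.Close()
	w := bufio.NewWriter(f)
	for tok := range tokens {
		if err := w.WriteByte(tok); err != nil {
			done <- err
			return
		}
	}
	done <- w.Flush()
}

func main() {
	tokens := make(chan uint8, 1024)
	done := make(chan error, 1)
	go writeShard("shard_0000.bin", tokens, done)
	// Tokens would normally arrive from an upstream tokenizer.
	for _, tok := range []uint8{1, 7, 12, 3} {
		tokens <- tok
	}
	close(tokens)
	if err := <-done; err != nil {
		panic(err)
	}
}
```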
According to https://www.biorxiv.org/content/10.1101/2024.06.06.597716v1.full.pdf: "Using the UniParc database with 250 million protein sequences, research on ESM [72] shows that the datasets UR50/S and UR50/D, with 45M and 65M unique sequences respectively, outperform Uniref100 in perplexity (PPL) on a ~670M parameter MLM model." Figure 1 of that paper shows quite significant diminishing returns from using anything beyond UniRef50, and it notes later that UniRef90/50 are basically the best choices. This is interesting for training sparser models. UniRef90 contains roughly 65B tokens; encoded as uint8 that's about 60GB, and I bet I could shave off a little more if I zstd-encoded it.
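(For scale: 65×10⁹ tokens at one byte per token is 65 GB, roughly 60.5 GiB, which matches the ~60GB estimate above before any zstd compression.)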
I don't really have the time to pursue this. The code works, but I'd like more documentation, so I'm going to close this for now.
This PR creates a tokenizer in the dnadesign lib. It is primarily for tokenizing amino acids for consumption by an LLM, in particular llm.c.
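As a rough illustration of the idea (not the actual tokenizer added in this PR), an amino-acid tokenizer can be little more than a table from residue characters to uint8 token IDs; the alphabet ordering and the special pad/EOS tokens below are assumptions:

```go
package main

import "fmt"

const (
	padToken uint8 = 0 // assumed special tokens, purely illustrative
	eosToken uint8 = 1
)

// buildVocab assigns IDs 2..21 to the 20 canonical amino acids.
func buildVocab() map[rune]uint8 {
	vocab := make(map[rune]uint8)
	for i, aa := range "ACDEFGHIKLMNPQRSTVWY" {
		vocab[aa] = uint8(i) + 2
	}
	return vocab
}

// encode converts a protein sequence into token IDs, appending eosToken.
func encode(vocab map[rune]uint8, seq string) ([]uint8, error) {
	out := make([]uint8, 0, len(seq)+1)
	for _, aa := range seq {
		id, ok := vocab[aa]
		if !ok {
			return nil, fmt.Errorf("unknown amino acid %q", aa)
		}
		out = append(out, id)
	}
	return append(out, eosToken), nil
}

func main() {
	vocab := buildVocab()
	ids, err := encode(vocab, "MKVLA")
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // [12 10 19 11 2 1]
}
```

The resulting uint8 stream is what a shard writer (or an llm.c-style data loader) would consume.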