Issue with train.py - chatset errors. #44

Open
ghost opened this issue May 17, 2018 · 8 comments

Comments

@ghost

ghost commented May 17, 2018

Any thoughts? I am using Windows.

Preprocessing file 2/6 (reddit-parse/output\output 1.bz2)...
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 49, in main
    train(args)
  File "train.py", line 55, in train
    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
  File "D:\bot\utils.py", line 39, in __init__
    self._preprocess(self.input_files[i], self.tensor_file_template.format(i))
  File "D:\bot\utils.py", line 107, in _preprocess
    data = file_reference.read()
  File "D:\python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 23267: character maps to <undefined>
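A note on the traceback above: the decode goes through cp1252.py, which suggests the file was read in text mode without an explicit encoding, so Python fell back to the Windows default code page. Byte 0x9d has no mapping in Python's cp1252 codec, but it is the last byte of the UTF-8 encoding of the right curly quote (U+201D), so one plausible reading is that the data is UTF-8. A minimal sketch of that mismatch follows. The sample string and the helper `decodes_as` are made up for illustration and are not part of the repo.

```python
# Hypothetical sample: curly quotes encoded as UTF-8 (not taken from the dataset).
raw = "He said \u201chello\u201d".encode("utf-8")


def decodes_as(data, encoding):
    """Return True if `data` decodes cleanly with `encoding` (illustrative helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# U+201D is E2 80 9D in UTF-8, so the offending byte is present in the sample.
has_0x9d = b"\x9d" in raw

cp1252_ok = decodes_as(raw, "cp1252")  # strict cp1252 has no character at 0x9d
utf8_ok = decodes_as(raw, "utf-8")

print(has_0x9d, cp1252_ok, utf8_ok)
```

If the same pattern holds for the real files, passing an explicit encoding when opening them would be the thing to try first.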

@sashasmirnova

Hi, I'm having the same problem when running train.py on new data.

@neofob

neofob commented May 31, 2018

This might not be the right solution but...here is a patch for that.
neofob@1f56cb9

@zhou-daniel-dz

zhou-daniel-dz commented Jul 30, 2018

Yeah, @neofob's patch changes the encoding that the utils use to read the training sets, but this should also match the encoding you used to write the training data (i.e. if your training files are encoded in utf-8, they should be read as utf-8).

Although this allows training to run, I'm not sure the char-rnn works with utf-8 encodings at all, since I am just getting gibberish back from the model when it is trained this way. (karpathy/char-rnn#113)
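To illustrate the point about matching the read encoding to the write encoding, here is a small sketch using a throwaway file. The file name and sample text are hypothetical, and cp1252 is assumed as the mismatched encoding only because it appears in the traceback earlier in this thread.

```python
import os
import tempfile

# Hypothetical sample text containing non-ASCII characters.
text = "caf\u00e9 \u201cquoted\u201d"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Write with an explicit encoding...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and read it back with the same one: the text survives intact.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Reading with a different encoding does not raise here (errors="replace"),
    # but the characters come back wrong.
    with open(path, "r", encoding="cp1252", errors="replace") as f:
        mismatched = f.read()

print(round_trip == text, mismatched == text)
```

A mismatch that is silenced with `errors="replace"` still feeds altered characters into training, which may or may not be related to the gibberish output mentioned above.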

@geroale

geroale commented Aug 30, 2018

Any news? Same problem here.

The @neofob patch doesn't work for me. I guess it's because the errors="ignore" or errors="replace" parameter of bz2.open is not working.

I am using the same @pender reddit dataset (https://github.com/pender/chatbot-rnn)
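On the guess about `errors=`: in the standard library, `bz2.open` accepts `encoding` and `errors` only in text mode (for example `mode="rt"`) and raises `ValueError` if they are passed in binary mode. The following self-contained check can be run on a local install to see whether the parameter behaves as expected. The file name and payload are made up, and the payload deliberately contains one byte that is invalid UTF-8.

```python
import bz2
import os
import tempfile

# Hypothetical payload: 0xff can never appear in valid UTF-8.
payload = b"ok \xff end"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bz2")

    with bz2.open(path, "wb") as f:
        f.write(payload)

    # Text mode: undecodable bytes become U+FFFD.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        replaced = f.read()

    # Text mode: undecodable bytes are dropped.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="ignore") as f:
        ignored = f.read()

print(repr(replaced), repr(ignored))
```

If this runs cleanly but train.py still fails, the cause is probably somewhere other than the `errors` argument, for instance a code path that opens the file without it.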

@zhou-daniel-dz

You just need to make sure the data you're training on is encoded in ANSI.

If your parser must read and write in a different encoding, just save the output text file as ANSI and it should be usable. Certain characters clearly cannot be mapped, but the percentage of those characters seems too small to make a difference.
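One way to do the "save as ANSI" step from a script is sketched below. It assumes that "ANSI" means Windows-1252 (cp1252), which matches the codec in the original traceback but may differ on other Windows locales, and that the source file is UTF-8. The function name `transcode_to_cp1252` and the file names are hypothetical.

```python
import os
import tempfile


def transcode_to_cp1252(src_path, dst_path):
    """Re-encode a UTF-8 text file as cp1252; unmappable characters become '?'."""
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="cp1252", errors="replace", newline="") as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "utf8.txt")
    dst = os.path.join(tmp, "ansi.txt")

    # Hypothetical input: accented letter, curly quotes, and one CJK character.
    with open(src, "w", encoding="utf-8") as f:
        f.write("caf\u00e9 \u201cok\u201d \u65e5")

    transcode_to_cp1252(src, dst)

    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

Characters that exist in cp1252 (accented letters, curly quotes) are kept; anything else is replaced, which is the lossy part mentioned above.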

@remotejob

@neofob @zhou-daniel-dz I tried to figure out how to make char-rnn work with utf-8, but this simple patch in utils.py:

if input_file.endswith(".bz2"):
    file_reference = bz2.open(input_file, mode='rt', encoding="utf-8", errors="replace")
elif input_file.endswith(".txt"):
    file_reference = io.open(input_file, mode='rt', encoding="utf-8", errors="replace")

doesn't work for me. Probably it's not enough?

egg82 pushed a commit to egg82/chatbot-rnn that referenced this issue Mar 23, 2019
@breadbrowser

No, it's just a bad or wrong format

@breadbrowser

of the bz2 or txt file, or of a file renamed from zst
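If a wrong container format is suspected (for example a Zstandard file renamed to .bz2), the first few bytes of the file can be checked before training. A rough sketch: bz2 streams start with the ASCII bytes `BZh`, and Zstandard frames start with the bytes 28 B5 2F FD. The helper name `sniff_format` is made up for illustration.

```python
import bz2

BZ2_MAGIC = b"BZh"                 # start of every bz2 stream
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"   # Zstandard frame magic number


def sniff_format(head):
    """Guess the compression format from the leading bytes of a file."""
    if head.startswith(BZ2_MAGIC):
        return "bz2"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "unknown"


# Demonstration on in-memory data rather than real dataset files.
real_bz2 = sniff_format(bz2.compress(b"hello")[:4])
fake_bz2 = sniff_format(ZSTD_MAGIC + b"rest of frame")
plain = sniff_format(b"plain text")

print(real_bz2, fake_bz2, plain)
```

For a file on disk, the same check would be `sniff_format(open(path, "rb").read(4))`.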
