A Japanese character library for Python that aims to let characters be more than `str`.

alexhsu-nlp/japanese_character_library


The Japanese Character Library Project

Current status: In development (incomplete; a basic version is planned to be done by the night of 2023.11.8 (CST))

This is an extension of my NLP course project done at CUHK(SZ) in 2023.

The aim is to provide a means of analyzing Japanese sentences using specialized Japanese character objects (kana and kanji), backed by known Japanese linguistic rules.
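To illustrate what "more than `str`" could mean, here is a minimal sketch of a character object that knows its own script. The class name `JChar`, its `kind` property, and the helper `to_jchars` are hypothetical and are not this library's API; the classification relies only on the standard `unicodedata` module and on Unicode character names.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class JChar:
    """Hypothetical wrapper around a single character (not the library's API)."""

    char: str

    @property
    def kind(self) -> str:
        # Unicode character names begin with the script name,
        # e.g. "HIRAGANA LETTER A" or "CJK UNIFIED IDEOGRAPH-6F22".
        name = unicodedata.name(self.char, "")
        if name.startswith("HIRAGANA"):
            return "hiragana"
        if name.startswith("KATAKANA"):
            return "katakana"
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            return "kanji"
        return "other"


def to_jchars(text: str) -> list[JChar]:
    """Wrap every character of a plain string in a JChar object."""
    return [JChar(ch) for ch in text]


kinds = [c.kind for c in to_jchars("漢字とカナ!")]
```

A real design would attach readings and linguistic rules to such objects; this sketch only shows the general shape of the idea.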

This package/repository uses the JMdict/EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.


TODO List:

  • youon (拗音) and gairaigo-only syllabaries problem (sutegana problem)
  • Treatment of "half-voiced" sounds (半濁音) (ぱぴぷぺぽ)
  • Converter that handles edge cases of sutegana, as the basis for SyllableStr
  • Redefine syllables as moras
  • Refactor the converter above and add treatment of 'ー' (long vowel mark)
  • Settle the design of JapaneseStr, including primitive kanji
  • Settle the design of kanji based on kanjidic2 from the KANJIDIC project
  • Use dynamic programming (DP) to store all possible furigana of words
  • Treatment for corner cases of pronunciation
  • Data reader support for the previously used corpus (https://github.com/ndl-lab/huriganacorpus-ndlbib)
  • A simple demo using jupyter notebook
  • Illustration of application: ruby generation for items of Korean-Japanese dictionary (https://korean.dict.naver.com/kojadict/)
  • Basic tests using pytest
  • Incorporate historical kana orthography (歴史的仮名遣い)
  • Kana iteration marks (踊り字): ゝ, ゞ, ヽ, ヾ
  • Kanji iteration mark '々'
  • List of common special symbols: 〆 (しめ), ゟ (より), ヿ (こと), 〇 (れい), and the problem of ヵ/ヶ.
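Several of the items above (sutegana, moras, half-voiced sounds, kana iteration marks) can be sketched with the standard library alone. The functions `split_moras`, `voicing`, and `expand_iteration_marks` below are hypothetical names for illustration and are not part of this library. The sketch assumes kana-only input and ignores corner cases such as a small kana following っ or ー.

```python
import unicodedata

# Small kana (sutegana) that merge with the preceding kana into one mora.
# っ/ッ are deliberately excluded: the sokuon counts as a mora of its own.
SMALL_KANA = set("ぁぃぅぇぉゃゅょゎァィゥェォャュョヮ")

DAKUTEN = "\u3099"      # combining voiced sound mark
HANDAKUTEN = "\u309A"   # combining semi-voiced sound mark


def split_moras(kana: str) -> list[str]:
    """Split a kana string into moras; 'ー', 'っ' and 'ん' stay separate."""
    moras: list[str] = []
    for ch in kana:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch  # e.g. き + ょ -> きょ
        else:
            moras.append(ch)
    return moras


def voicing(ch: str) -> str:
    """Classify a kana as plain, voiced (濁音) or half-voiced (半濁音)."""
    decomposed = unicodedata.normalize("NFD", ch)
    if DAKUTEN in decomposed:
        return "voiced"
    if HANDAKUTEN in decomposed:
        return "half-voiced"
    return "plain"


def expand_iteration_marks(text: str) -> str:
    """Replace ゝ/ヽ (unvoiced) and ゞ/ヾ (voiced) with the repeated kana."""
    out: list[str] = []
    for ch in text:
        if ch in "ゝゞヽヾ" and out:
            # Strip any voicing mark from the previous kana to get its base.
            base = unicodedata.normalize("NFD", out[-1])[0]
            if ch in "ゞヾ":
                base = unicodedata.normalize("NFC", base + DAKUTEN)
            out.append(base)
        else:
            out.append(ch)
    return "".join(out)


print(split_moras("きょうと"))           # ['きょ', 'う', 'と']
print(voicing("ぱ"))                     # half-voiced
print(expand_iteration_marks("すゞき"))  # すずき
```

Relying on Unicode normalization avoids hand-written voicing tables, at the cost of treating every kana with a canonical decomposition uniformly; a dedicated table may be preferable if finer control is needed.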

Other reference websites:

Difference between mora (モーラ) and syllable (音節):
