
What needs to be modified if we use this work for another language? #16

Open · huynhtruc0309 opened this issue Mar 30, 2021 · 2 comments

@huynhtruc0309

I am using your work to generate Vietnamese handwriting. I added my code to create_text_data.py:

elif dataset == 'VNOnDB':
        # Map the repo's mode names onto the VNOnDB split names
        if mode == 'tr':
            split = 'train'
        elif mode == 'te':
            split = 'test'
        elif mode == 'val':
            split = 'validation'
        title = 'word' if words else 'line'

        data_fol = os.path.join(root_dir, title)
        label_paths = os.path.join(data_fol, split + '_' + title + '.csv')

        with open(label_paths, 'rt', encoding='utf8') as f:
            reader = csv.reader(f, delimiter='\t')
            next(reader)  # skip the header row

            for row in reader:
                # row[1] is the image id, row[2] is the transcription
                image_path_list.append(os.path.join(data_fol, split + '_' + title, row[1] + '.png'))
                label_list.append(row[2])
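
A quick way to sanity-check this step before building the LMDB is to confirm that every image path actually exists and that every label is already in NFC (precomposed) form, so it will match the alphabet string later on. A minimal sketch (check_samples is just an illustration, not part of the repo):

import os
import unicodedata

def check_samples(image_path_list, label_list):
    # Every image referenced in the CSV should exist on disk, and every label
    # should already be in NFC (precomposed) form so it matches the alphabet.
    missing = [p for p in image_path_list if not os.path.exists(p)]
    not_nfc = [l for l in label_list if l != unicodedata.normalize('NFC', l)]
    print(len(missing), 'missing images,', len(not_nfc), 'labels not in NFC form')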

and added the dataset paths, alphabet, and lexicon entries to dataset_catalog.py:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT

_DATA_ROOT = '/extdata/ocr/truchlp/RESEARCH/generate_vietnamese_htr/Datasets/'
datasets = {"RIMEScharH32W16": _DATA_ROOT+'RIMES/h32char16to17/tr',
            "RIMEScharH32": _DATA_ROOT+'RIMES/h32/tr',
            "RIMEScharH32te": _DATA_ROOT+'RIMES/h32/te',
            "RIMEScharH32val": _DATA_ROOT+'RIMES/h32/val',
            
            "IAMcharH32W16rmPunct": _DATA_ROOT+'IAM/words/h32char16to17/tr_removePunc',
            "IAMcharH32rmPunct": _DATA_ROOT+'IAM/words/h32/tr_removePunc',
            "IAMcharH32rmPunct_te": _DATA_ROOT+'IAM/words/h32/te_removePunc',
            "IAMcharH32rmPunct_val1": _DATA_ROOT+'IAM/words/h32/va1',
            
            "CVLcharH32W16": _DATA_ROOT+'CVL/h32char16to17/tr',
            "CVLtrH32": _DATA_ROOT+'CVL/h32/train_new_partition',
            "CVLteH32": _DATA_ROOT+'CVL/h32/test_unlabeled',
            
            "VNOnDBcharH32W16": _DATA_ROOT+'VNOnDB/word/h32char16to17/tr',
            "VNOnDBcharH32": _DATA_ROOT+'VNOnDB/word/h32/tr',
            "VNOnDBcharH32te": _DATA_ROOT+'VNOnDB/word/h32/te',
            "VNOnDBcharH32val": _DATA_ROOT+'VNOnDB/word/h32/val'
            }

alphabet_dict = {'IAM': 'alphabetEnglish',
                 'RIMES': 'alphabetFrench',
                 'CVL': 'alphabetEnglish',
                 'VNOnDB': 'alphabetVietnamese'
                 }

lex_dict = {'IAM': _DATA_ROOT + 'Lexicon/english_words.txt',
            'RIMES': _DATA_ROOT + 'Lexicon/Lexique383.tsv',
            'CVL': _DATA_ROOT + 'Lexicon/english_words.txt',
            'VNOnDB': _DATA_ROOT + 'Lexicon/vietnamese_words.txt'}
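
The Lexicon/vietnamese_words.txt file referenced above has to be created as well. Assuming the lexicon is read as one word per line (as english_words.txt appears to be), a rough sketch for dumping the unique training words from the same CSV used in create_text_data.py (paths are illustrative, adjust to your layout):

import csv

# Collect the unique labels from the VNOnDB train split
words = set()
with open('VNOnDB/word/train_word.csv', 'rt', encoding='utf8') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader)              # skip the header row
    for row in reader:
        words.add(row[2])     # label column, as in create_text_data.py

# Write one word per line, UTF-8 encoded
with open('Lexicon/vietnamese_words.txt', 'wt', encoding='utf8') as f:
    f.write('\n'.join(sorted(words)))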

and also added the alphabet:

alphabetVietnamese = """! "#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqrstuvwxyz{|}~°ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚÝàáâãèéêìíòóôõùúýĂăĐđĨĩŨũƠơƯưẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹ""" # VNOnDB & Cinnamon
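
One thing worth checking at this point is that every character occurring in the VNOnDB labels is actually covered by alphabetVietnamese; in particular, labels stored in decomposed (NFD) form would not match the precomposed characters above. A minimal coverage check (illustrative only):

import unicodedata

def check_alphabet_coverage(label_list, alphabet):
    # Return characters that appear in the labels but are missing from the
    # alphabet string; labels are NFC-normalised first so decomposed
    # diacritics don't show up as stray combining marks.
    seen = set(''.join(unicodedata.normalize('NFC', label) for label in label_list))
    return seen - set(alphabet)

# e.g. check_alphabet_coverage(label_list, alphabetVietnamese) should come back empty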

I generated the LMDB file for training.
But then, during training, things got weird because the loss was negative:

(epoch: 1, iters: 36400, time: 0.034, data: 0.000) G: 8.161 D: 0.000 Dreal: 0.000 Dfake: 0.000 OCR_real: -57.267 OCR_fake: -1.565 grad_fake_OCR: 0.781 grad_fake_adv: 0.972
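
For what it's worth, if the recognizer loss here is CTC-based, one way it can go negative is when raw logits are passed where log-probabilities are expected. A self-contained PyTorch illustration (not taken from this codebase):

import torch
import torch.nn.functional as F

T, N, C = 10, 1, 5                      # time steps, batch size, classes (0 = blank)
logits = torch.full((T, N, C), -1.0)
logits[:, 0, 2] = 5.0                   # recognizer is very confident about class 2

targets = torch.tensor([[2, 2, 2]])     # shape (N, S)
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([3])

ctc = torch.nn.CTCLoss(blank=0)

# Correct usage: log-probabilities in, the loss is non-negative
good = ctc(F.log_softmax(logits, dim=2), targets, input_lengths, target_lengths)

# Incorrect usage: raw logits in, the "log-likelihood" can exceed 0 and the loss goes negative
bad = ctc(logits, targets, input_lengths, target_lengths)

print(good.item(), bad.item())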

Can you tell me where I went wrong? Thank you.

@rlit (Contributor) commented Apr 15, 2021

Please make sure you're making correct and consistent use of character encoding (it should be UTF-8 by default in Python 3, no?) throughout the *.py files as well as the other file types you use.
We ran some experiments with custom characters (currency symbols), and in some cases Python did not read them correctly until UTF-8 was set up properly.

One sanity check I would recommend is verifying that the GT labels (i.e. the one-hot vectors) look the way you would expect during train/test iterations.
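
A minimal version of that check, with hypothetical helper names since the exact label encoding depends on the setup, is to decode the encoded GT labels back to strings and eyeball them:

def decode_label(indices, alphabet):
    # Map class indices back to text; index 0 is assumed to be the CTC blank,
    # and indices 1..len(alphabet) are assumed to map onto the alphabet string.
    return ''.join(alphabet[i - 1] for i in indices if i > 0)

# During a training iteration, compare decode_label(label_tensor.tolist(),
# alphabetVietnamese) against the original label from the CSV.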

Looking forward to seeing Vietnamese handwriting!

@giangnv125

(quotes @huynhtruc0309's original post in full)

@huynhtruc0309 I have a few questions for you:

  1. Has this issue been solved?
  2. Did you get the Vietnamese handwriting dataset from this link?
  3. Did you solve the Vietnamese handwriting generation problem? Could you send me your code at [email protected]?

I hope to hear back from you soon.
