
What needs to be modified if we use this work for another language? #16

Open · huynhtruc0309 opened this issue Mar 30, 2021 · 2 comments

@huynhtruc0309

I am using your work to generate Vietnamese handwriting. I added my code to create_text_data.py:

elif dataset == 'VNOnDB':
        # Map the repo's mode names onto the VNOnDB split names
        if mode == 'tr':
            split = 'train'
        elif mode == 'te':
            split = 'test'
        elif mode == 'val':
            split = 'validation'
        title = 'word' if words else 'line'

        data_fol = os.path.join(root_dir, title)
        label_paths = os.path.join(data_fol, split + '_' + title + '.csv')

        with open(label_paths, 'rt', encoding='utf8') as f:
            reader = csv.reader(f, delimiter='\t')
            next(reader)  # skip the header row

            for row in reader:
                # row[1] is the image id, row[2] is the transcription
                image_path_list.append(os.path.join(data_fol, split + '_' + title, row[1] + '.png'))
                label_list.append(row[2])
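
A quick way to sanity-check this step before building the LMDB is to confirm that every image path actually exists and that every label is already in NFC (precomposed) form, so it will match the alphabet string later on. A minimal sketch (check_samples is just an illustration, not part of the repo):

import os
import unicodedata

def check_samples(image_path_list, label_list):
    # Every image referenced in the CSV should exist on disk, and every label
    # should already be in NFC (precomposed) form so it matches the alphabet.
    missing = [p for p in image_path_list if not os.path.exists(p)]
    not_nfc = [l for l in label_list if l != unicodedata.normalize('NFC', l)]
    print(len(missing), 'missing images,', len(not_nfc), 'labels not in NFC form')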

and added the dataset paths, alphabet, and lexicon entries to dataset_catalog.py:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT

_DATA_ROOT = '/extdata/ocr/truchlp/RESEARCH/generate_vietnamese_htr/Datasets/'
datasets = {"RIMEScharH32W16": _DATA_ROOT+'RIMES/h32char16to17/tr',
            "RIMEScharH32": _DATA_ROOT+'RIMES/h32/tr',
            "RIMEScharH32te": _DATA_ROOT+'RIMES/h32/te',
            "RIMEScharH32val": _DATA_ROOT+'RIMES/h32/val',
            
            "IAMcharH32W16rmPunct": _DATA_ROOT+'IAM/words/h32char16to17/tr_removePunc',
            "IAMcharH32rmPunct": _DATA_ROOT+'IAM/words/h32/tr_removePunc',
            "IAMcharH32rmPunct_te": _DATA_ROOT+'IAM/words/h32/te_removePunc',
            "IAMcharH32rmPunct_val1": _DATA_ROOT+'IAM/words/h32/va1',
            
            "CVLcharH32W16": _DATA_ROOT+'CVL/h32char16to17/tr',
            "CVLtrH32": _DATA_ROOT+'CVL/h32/train_new_partition',
            "CVLteH32": _DATA_ROOT+'CVL/h32/test_unlabeled',
            
            "VNOnDBcharH32W16": _DATA_ROOT+'VNOnDB/word/h32char16to17/tr',
            "VNOnDBcharH32": _DATA_ROOT+'VNOnDB/word/h32/tr',
            "VNOnDBcharH32te": _DATA_ROOT+'VNOnDB/word/h32/te',
            "VNOnDBcharH32val": _DATA_ROOT+'VNOnDB/word/h32/val'
            }

alphabet_dict = {'IAM': 'alphabetEnglish',
                 'RIMES': 'alphabetFrench',
                 'CVL': 'alphabetEnglish',
                 'VNOnDB': 'alphabetVietnamese'
                 }

lex_dict = {'IAM': _DATA_ROOT + 'Lexicon/english_words.txt',
            'RIMES': _DATA_ROOT + 'Lexicon/Lexique383.tsv',
            'CVL': _DATA_ROOT + 'Lexicon/english_words.txt',
            'VNOnDB': _DATA_ROOT + 'Lexicon/vietnamese_words.txt'}
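
The Lexicon/vietnamese_words.txt file referenced above has to be created as well. Assuming the lexicon is read as one word per line (as english_words.txt appears to be), a rough sketch for dumping the unique training words from the same CSV used in create_text_data.py (paths are illustrative, adjust to your layout):

import csv

# Collect the unique labels from the VNOnDB train split
words = set()
with open('VNOnDB/word/train_word.csv', 'rt', encoding='utf8') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader)              # skip the header row
    for row in reader:
        words.add(row[2])     # label column, as in create_text_data.py

# Write one word per line, UTF-8 encoded
with open('Lexicon/vietnamese_words.txt', 'wt', encoding='utf8') as f:
    f.write('\n'.join(sorted(words)))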

and also added the alphabet:

alphabetVietnamese = """! "#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_abcdefghijklmnopqrstuvwxyz{|}~°ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚÝàáâãèéêìíòóôõùúýĂăĐđĨĩŨũƠơƯưẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹ""" # VNOnDB & Cinnamon
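
One thing worth checking at this point is that every character occurring in the VNOnDB labels is actually covered by alphabetVietnamese; in particular, labels stored in decomposed (NFD) form would not match the precomposed characters above. A minimal coverage check (illustrative only):

import unicodedata

def check_alphabet_coverage(label_list, alphabet):
    # Return characters that appear in the labels but are missing from the
    # alphabet string; labels are NFC-normalised first so decomposed
    # diacritics don't show up as stray combining marks.
    seen = set(''.join(unicodedata.normalize('NFC', label) for label in label_list))
    return seen - set(alphabet)

# e.g. check_alphabet_coverage(label_list, alphabetVietnamese) should come back empty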

I generated the LMDB file for training.
But then, during training, things got weird because the loss was negative:

(epoch: 1, iters: 36400, time: 0.034, data: 0.000) G: 8.161 D: 0.000 Dreal: 0.000 Dfake: 0.000 OCR_real: -57.267 OCR_fake: -1.565 grad_fake_OCR: 0.781 grad_fake_adv: 0.972
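
For what it's worth, if the recognizer loss here is CTC-based, one way it can go negative is when raw logits are passed where log-probabilities are expected. A self-contained PyTorch illustration (not taken from this codebase):

import torch
import torch.nn.functional as F

T, N, C = 10, 1, 5                      # time steps, batch size, classes (0 = blank)
logits = torch.full((T, N, C), -1.0)
logits[:, 0, 2] = 5.0                   # recognizer is very confident about class 2

targets = torch.tensor([[2, 2, 2]])     # shape (N, S)
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([3])

ctc = torch.nn.CTCLoss(blank=0)

# Correct usage: log-probabilities in, the loss is non-negative
good = ctc(F.log_softmax(logits, dim=2), targets, input_lengths, target_lengths)

# Incorrect usage: raw logits in, the "log-likelihood" can exceed 0 and the loss goes negative
bad = ctc(logits, targets, input_lengths, target_lengths)

print(good.item(), bad.item())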

Can you tell me where I went wrong? Thank you.

@rlit (Contributor) commented Apr 15, 2021

Please make sure you're making correct and consistent use of character encoding (it should be UTF-8 by default in Python 3, no?) throughout the *.py files as well as the other file types you use.
We ran some experiments with custom characters (currency symbols), and in some cases Python did not read them correctly until UTF-8 was set up properly.

One sanity check I would recommend is verifying that the GT labels (i.e. the one-hot vectors) look the way you would expect during train/test iterations.
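
A minimal version of that check, with hypothetical helper names since the exact label encoding depends on the setup, is to decode the encoded GT labels back to strings and eyeball them:

def decode_label(indices, alphabet):
    # Map class indices back to text; index 0 is assumed to be the CTC blank,
    # and indices 1..len(alphabet) are assumed to map onto the alphabet string.
    return ''.join(alphabet[i - 1] for i in indices if i > 0)

# During a training iteration, compare decode_label(label_tensor.tolist(),
# alphabetVietnamese) against the original label from the CSV.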

Looking forward to seeing Vietnamese handwriting!

@giangnv125

(quotes @huynhtruc0309's original post in full)

@huynhtruc0309 I have a few questions for you:

  1. Has this issue been solved?
  2. Did you get the Vietnamese handwriting dataset from this link?
  3. Did you solve the Vietnamese handwriting generation problem? Could you send me your code at [email protected]?

I hope to hear back from you soon.
