What datasets are used in AnyText?
Question
Answers ( 2 )
AnyText uses the AnyWord-3M dataset, which contains 3.03 million images and 9.18 million lines of text. For evaluation, 1,000 images are drawn from each of the Wukong and LAION subsets to measure the accuracy and quality of text generation in Chinese and English. The dataset was improved in AnyText-v1.1: OCR annotations were processed using PP-OCRv4 for Chinese and MARIO-LAION for English, resulting in an English-to-Chinese ratio of approximately 1:1.
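AnyText was trained on the AnyWord-3M dataset, which contains 3.03 million images and over 9.18 million lines of text, covering more than 21.5 million characters/words. By language, roughly 1.6 million of the images are Chinese and 1.39 million are English, with the small remainder in other languages. (These two figures sum to about 3 million, which matches the image count rather than the 9.18 million line count.)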
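As a rough sanity check on those totals, the averages they imply can be worked out directly. This is only a back-of-the-envelope sketch: the three constants are the rounded figures quoted in this answer (two of them stated as lower bounds), and the variable names are made up for illustration, not taken from the AnyText code.

```python
# Rounded totals quoted above for AnyWord-3M (in millions).
# "over 9.18M" and "more than 21.5M" are lower bounds, so the
# averages below are approximate.
IMAGES_M = 3.03          # images
TEXT_LINES_M = 9.18      # lines of text
CHARS_OR_WORDS_M = 21.5  # characters (Chinese) or words (Latin scripts)

# Average number of annotated text lines per image.
lines_per_image = TEXT_LINES_M / IMAGES_M

# Average number of characters/words per text line.
units_per_line = CHARS_OR_WORDS_M / TEXT_LINES_M

print(f"about {lines_per_image:.1f} text lines per image")
print(f"about {units_per_line:.1f} characters/words per line")
```

So, on these figures, a typical image carries around three text lines, each only a couple of characters or words long.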