Illustrating Classic Brazilian Books using a Text-To-Image Diffusion Model
Keywords:
Image Generation, Diffusion Models, Text-to-Image, Illustration

Abstract
In recent years, Generative Artificial Intelligence (GenAI) has advanced rapidly on complex tasks across modalities, including text, audio, and image generation. Within this spectrum, text-to-image (TTI) models have emerged as a powerful approach to producing varied and aesthetically appealing compositions, with applications ranging from artistic creation to realistic facial synthesis, and have driven significant advances in computer vision, image processing, and multimodal tasks. The advent of Latent Diffusion Models (LDMs) marks a further step change in these capabilities. This article examines the feasibility of employing the Stable Diffusion LDM to illustrate literary works, using seven classic Brazilian books as case studies. The objective is to assess the practicality of this endeavor and to evaluate the potential of Stable Diffusion to produce illustrations that enrich the reader's experience. We outline the benefits, such as the capacity to generate distinctive and contextually relevant images, as well as the drawbacks, including failures to faithfully capture the essence of intricate literary descriptions. Through this study, we aim to provide a comprehensive assessment of the viability and efficacy of AI-generated illustrations in literary contexts, elucidating both the prospects and the challenges encountered in this novel application of the technology.
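To make the approach concrete, the following is a minimal sketch of how a scene from one of the selected books could be rendered as an illustration with Stable Diffusion through the Hugging Face diffusers library. The checkpoint name, the example prompt, and the sampling parameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: illustrating a literary passage with Stable Diffusion.
# Assumptions (not from the paper): the diffusers library, the
# stable-diffusion-v1-5 checkpoint, and this example prompt/settings.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # requires a CUDA GPU; omit for (much slower) CPU use

# Hypothetical prompt summarizing a scene from Dom Casmurro.
prompt = (
    "Book illustration, 19th-century Rio de Janeiro, a young man gazing "
    "at a woman with enigmatic, undertow-like eyes, ink and watercolor style"
)

image = pipe(
    prompt,
    num_inference_steps=50,  # number of denoising steps
    guidance_scale=7.5,      # how strongly the image follows the prompt
).images[0]
image.save("illustration.png")
```

In a workflow like the one the paper describes, the prompt would be derived from the book's own text; the guidance scale then trades fidelity to that textual description against visual diversity.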