Although text-to-image generation has advanced in recent years, the controlled generation of images that follow a specified layout of multiple complex objects remains a challenging problem. In particular, the GLIDE text-to-image model has difficulty with controlling the number of objects, scale conversion, unstable generation (objects mentioned in the text do not appear in the image), and unnatural object structure. In addition, generating images from text alone leaves so much freedom that the correspondence between text and image is difficult to learn, and a large amount of training data is required. In fact, current generative models are often trained on hundreds of millions of text-image pairs, and the large scale of the models themselves makes the training cost enormous. In this study, we propose and validate a new image generation method that takes both segmentation and text as input; that is, image generation from text is assisted by segmentation information. This is intended to address the issues that GLIDE had difficulty with: controlling the number of objects, scale conversion, unstable generation, and unnatural object structure. A further goal is to reduce the amount of data needed for training by making the relationship between the text and the generated image easier to learn. In our evaluation, the model achieved an FID-10k score of 17.12 after training on only about 120,000 training samples, and we confirmed that it can handle complex layouts and maintain natural object structure even when the image contains a large number of objects.