Meet our new ruDALL-E neural network!

Write a text query and get an image generated by ruDALL-E

Download the «Salute» app for faster, higher-quality image generation.

Bright bedroom with a large bed and large green palm trees around

Goal

Our goal was to create a multimodal neural network that learns concepts in several modalities, namely text and images, in order to build a better understanding of the world. The transformer is trained to autoregressively model text and image tokens as a single data stream.
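As a rough illustration of what "a single data stream" means, the sketch below joins text tokens and image tokens into one sequence and derives next-token prediction pairs from it. The id-offset scheme and the helper names `build_stream` and `next_token_pairs` are assumptions made for illustration, not details of the actual ruDALL-E implementation; only the 16,000-token text vocabulary comes from this page.

```python
from typing import List, Tuple

# Text vocabulary size listed for the models on this page.
TEXT_VOCAB_SIZE = 16_000


def build_stream(text_tokens: List[int], image_tokens: List[int]) -> List[int]:
    """Concatenate text and image tokens into one sequence.

    Image token ids are shifted past the text vocabulary so the two
    modalities never collide (an assumed scheme, for illustration only).
    """
    return list(text_tokens) + [TEXT_VOCAB_SIZE + t for t in image_tokens]


def next_token_pairs(stream: List[int]) -> List[Tuple[List[int], int]]:
    """Autoregressive targets: predict token i from all tokens before it."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]


# Toy example: 3 text tokens followed by 4 image tokens.
stream = build_stream([5, 17, 42], [0, 3, 1, 2])
pairs = next_token_pairs(stream)
```

Because the text comes first in the stream, every image token is predicted with the whole text query as context, which is what lets the same model act as a text-to-image generator.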

Application

Image generation solves two important problems that information retrieval cannot: 1) it follows the exact description of what you want, and 2) it creates an original image that did not exist before. Image generation can be used, for example, to illustrate articles or to produce visuals for copywriting and advertising.

The biggest computational challenge in Russian history

On the Christofari cluster, the model was trained for 37 days on 512 Tesla V100 GPUs and then for another 11 days on 128 GPUs, a total of 20,352 GPU-days. Our largest trained model, XXL (12 billion parameters), is comparable in size to OpenAI's English-language DALL-E!
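The GPU-day total follows directly from the two training phases quoted above. A minimal check, using only the figures from this page:

```python
# (days, GPUs) for each training phase, taken from the figures above.
phases = [(37, 512), (11, 128)]

# GPU-days = days multiplied by the number of GPUs, summed over phases.
gpu_days = sum(days * gpus for days, gpus in phases)

print(gpu_days)  # 20352 (18944 from the first phase + 1408 from the second)
```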

ruDALL-E Malevich (XL)

Given a short text description, ruDALL-E generates bright and colourful images on a wide variety of subjects. The model understands a wide range of concepts and generates completely new images and objects that do not exist in the real world.

Training and model parameters:

  • 1.3 billion parameters
  • Image encoder: a custom VQGAN model that converts an image into a sequence of 32×32 tokens
  • YTTM (YouTokenToMe) text tokenizer with a vocabulary of 16,000 tokens
  • Specialized attention masks for visual sequences
  • Support for re-ranking of results by the ruCLIP model
  • Support for upscaling the output with the RealESRGAN model
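To make the 32×32 image sequence concrete: a 32×32 grid of codes gives 1,024 positions per image. The sketch below maps between grid coordinates and positions in the flattened sequence. Row-major (left to right, top to bottom) ordering is an assumption here, and the helper names are hypothetical.

```python
GRID_SIDE = 32                          # the encoder yields a 32x32 grid of codes
IMAGE_SEQ_LEN = GRID_SIDE * GRID_SIDE   # 1024 positions per image


def grid_to_index(row: int, col: int) -> int:
    """Position of grid cell (row, col) in the flattened sequence (row-major, assumed)."""
    if not (0 <= row < GRID_SIDE and 0 <= col < GRID_SIDE):
        raise ValueError("cell outside the 32x32 grid")
    return row * GRID_SIDE + col


def index_to_grid(index: int) -> tuple:
    """Inverse mapping: sequence position back to (row, col)."""
    if not 0 <= index < IMAGE_SEQ_LEN:
        raise ValueError("index outside the image sequence")
    return divmod(index, GRID_SIDE)


last = grid_to_index(31, 31)   # final position in the sequence
cell = index_to_grid(33)       # second row, second column
```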

Beautiful mountain landscape
Very beautiful dog
Beautiful yellow bird with a red beak

ruDALL-E Kandinsky (XXL)

A Russian-language text-to-image model. The architecture is the same as ruDALL-E XL, but the new version has even more parameters!

Training and model parameters:

  • 12 billion parameters
  • Image encoder: a custom VQGAN model that converts an image into a sequence of 32×32 tokens
  • YTTM (YouTokenToMe) text tokenizer with a vocabulary of 16,000 tokens
  • Specialized attention masks for visual sequences
  • Support for re-ranking of results by the ruCLIP model
  • Support for upscaling the output with the RealESRGAN model
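Re-ranking, as listed above, generally means generating several candidate images and keeping those that best match the text query. The sketch below shows the idea with cosine similarity over embedding vectors. The toy vectors and the `rerank` helper are assumptions for illustration; in practice the embeddings would come from a model such as ruCLIP, whose API is not shown here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(text_emb: Sequence[float],
           image_embs: List[Sequence[float]],
           top_k: int) -> List[int]:
    """Return indices of the top_k images most similar to the text embedding."""
    order = sorted(range(len(image_embs)),
                   key=lambda i: cosine(text_emb, image_embs[i]),
                   reverse=True)
    return order[:top_k]


# Toy 2-D embeddings standing in for real text and image embeddings.
text_emb = [1.0, 0.0]
image_embs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
best = rerank(text_emb, image_embs, top_k=2)
```

Here the second candidate points almost in the same direction as the text vector, so it ranks first, followed by the third.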

Beautiful sunset over the sea
An armchair in the shape of an avocado
Blue frog with bushy tail