We present a family of generative models from SberDevices and Sber AI!

These models let you create images that have never existed before. All you need is a text description in Russian or another language.

Below are the technical characteristics of each of the models, as well as examples of images created by them.

Create unique images together with the generative artists using your own prompts, or ask them to depict something special for you.

Meet our new Kandinsky 2.0 neural network!

This model generates colorful images in minutes.

You can try Kandinsky 2.0 here or in the Salute mobile app (if you know Russian). In the app, just say «Позови художника» ("Call the artist") and then ask it by voice to draw something 🧑‍🎨

The generative artists that preceded Kandinsky 2.0, ruDALL-E Malevich and ruDALL-E Kandinsky, are also at your service.

And with ruDALL-E Emojich you can generate new emojis.🧚👩‍💻💁‍♂️ Check it out 👀👀👀

Christmas night with a beautiful moon and a pretty Christmas tree

Kandinsky 2.0

The Kandinsky 2.0 model uses reverse diffusion to create colorful images on a wide range of topics in a matter of seconds from a text query in Russian or other languages. You can even combine different languages within a single query. The network was developed and trained by Sber AI researchers in close collaboration with scientists from the Artificial Intelligence Research Institute, using the combined datasets of Sber AI and SberDevices with 1 billion text-image pairs.

Try it out

Training and model parameters:

  • Two multilingual text encoders whose embeddings are concatenated: 344M and 300M parameters, respectively
  • Reverse diffusion is based on a UNet model with 1.2B parameters
  • Dynamic thresholding in the sampling process
  • Number of diffusion steps: 1000
  • batch_size=48
  • The length of the text prompt is 77 tokens
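
Dynamic thresholding, listed above, is usually described as clipping each predicted image at a percentile of its absolute pixel values and then rescaling it back into [-1, 1]. The following is a minimal pure-Python sketch of that idea, not Kandinsky 2.0's actual implementation: the function name, the flat list of floats, and the 0.995 percentile are all assumptions made for illustration.

```python
def dynamic_threshold(pixels, percentile=0.995):
    """Clip a predicted image to [-s, s] and rescale it to [-1, 1].

    `pixels` is a flat list of floats (an assumption; real samplers work
    on tensors). `s` is the chosen percentile of the absolute values,
    floored at 1.0 so predictions already inside [-1, 1] stay unchanged.
    """
    magnitudes = sorted(abs(p) for p in pixels)
    # Nearest-rank index of the requested percentile in the sorted list.
    idx = min(len(magnitudes) - 1, int(percentile * len(magnitudes)))
    s = max(1.0, magnitudes[idx])
    # Clamp every value to [-s, s], then divide by s.
    return [max(-s, min(s, p)) / s for p in pixels]
```

In a sampler this step would be applied to the predicted clean image at every diffusion step, which keeps pixel values from saturating.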

You can learn more about Kandinsky 2.0 here. Source code and model weights are available on GitHub and HuggingFace.

A red colored panda in the space, photo
Ancient idols worshiped by ancient people
Cover for a CD with epic metal and female vocals
Fair on Red Square in Moscow in the 17th century in the style of Surikov

ruDALL-E Kandinsky (XXL)

A Russian-language text-to-image model. The architecture is the same as ruDALL-E XL, but the new version has even more parameters.

Try it out

Training and model parameters:

  • 12 billion parameters
  • Image encoder: a custom VQGAN model that converts an image into a sequence of 32×32 tokens
  • YTTM text tokenizer with a vocabulary of 16,384 tokens
  • Specialized attention masks for visual sequences
  • Support for re-ranking results with the ruCLIP model
  • Upscaling support: RealESRGAN or guided diffusion
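
Re-ranking with a CLIP-style model, as listed above, generally means scoring each generated image against the prompt and keeping the best matches. The sketch below shows only that ranking step, assuming the text and image embeddings have already been computed. The function names and plain-list embeddings are hypothetical and are not part of the ruCLIP API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(text_emb, image_embs, top_k=1):
    """Return indices of the top_k images most similar to the text embedding."""
    scores = [cosine(text_emb, emb) for emb in image_embs]
    # Sort image indices by descending similarity score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]
```

For example, generating several candidates per prompt and calling `rerank(text_emb, image_embs, top_k=4)` would return the indices of the four candidates to keep.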

Sunflowers in a vase, Vincent van Gogh
Surrealism, style
An anime-stylized potato with electrical discharge effects against the background of a modern city in neon cyberpunk style
Sunset and the city

ruDALL-E Malevich (XL)

Based on a short text description, ruDALL-E generates bright and colorful images on a variety of topics and subjects. The model understands a wide range of concepts and generates completely new images and objects that do not exist in the real world.

Try it out

Training and model parameters:

  • 1.3 billion parameters
  • Image encoder: a custom VQGAN model that converts an image into a sequence of 32×32 tokens
  • YTTM text tokenizer with a vocabulary of 16,000 tokens
  • Specialized attention masks for visual sequences
  • Support for re-ranking results with the ruCLIP model
  • Upscaling support using the RealESRGAN model
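
To make the 32×32 image sequence and the idea of an attention mask over it more concrete, here is a small sketch. It assumes the grid is flattened in raster order (row by row) and builds one illustrative mask in which a position attends only to itself and to earlier positions in the same column. This is an example of the general technique, not the specific masks used in ruDALL-E, and all names are hypothetical.

```python
GRID = 32  # side of the VQGAN grid from the list above: 32 * 32 = 1024 positions


def to_row_col(pos, grid=GRID):
    """Map a flat sequence position to (row, col), assuming raster order."""
    return divmod(pos, grid)


def column_causal_mask(grid=GRID):
    """Build an n-by-n boolean mask, where n = grid * grid.

    mask[i][j] is True when position i may attend to position j, that is,
    when j is not later than i and lies in the same grid column.
    """
    n = grid * grid
    return [[j <= i and j % grid == i % grid for j in range(n)]
            for i in range(n)]
```

With a small grid such as `column_causal_mask(grid=3)`, position 7 (row 2, column 1) may attend to positions 1, 4, and 7 only.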

Beautiful mountain landscape
Very beautiful dog
Beautiful yellow bird with a red beak

ruDALL-E Emojich

Based on a short text description, ruDALL-E generates bright and colorful images on a variety of topics and subjects. The model understands a wide range of concepts and generates completely new images and objects that do not exist in the real world.

Try it out

Training and model parameters:

  • ruDALL-E Emojich is a fine-tuned version of ruDALL-E Malevich. For fine-tuning, 2,749 emoji icons and their Russian-language descriptions were collected
  • 1.3 billion parameters
  • Image encoder: a custom VQGAN model that converts an image into a sequence of 32×32 tokens
  • YTTM text tokenizer with a vocabulary of 16,000 tokens
  • Specialized attention masks for visual sequences
  • Upscaling support using the RealESRGAN model

Gandalf
Lego Donald Trump