Meta introduces AI editing tools Emu Edit and Emu Video: trained on 10 million synthetic samples and said to far outperform the competition

Shulou report, November 20: Meta announced yesterday the launch of two AI-based editing tools for Facebook and Instagram, "Emu Edit" and "Emu Video", which cover photos and videos respectively. Meta has since released more information about the two tools, summarized below.

According to Meta, the Emu Edit model can edit images precisely using only text instructions. Separately, by breaking down the text-to-video (T2V) generation process into steps, the development team arrived at a method called Emu Video, which improves the quality and diversity of the resulting video.

Meta describes Emu Edit as a novel image editing method that aims to simplify a variety of image manipulation tasks and bring more capabilities and higher precision to image editing.

Emu Edit accepts free-form user instructions covering many kinds of edits, including local and global editing, removing and adding a background, color and geometry transformations, and detecting and segmenting image elements.

Meta said that Emu Edit feeds computer vision tasks to the generative model as instructions, which gives better control over image generation and editing. The researchers point out that current image editing models often over-modify or under-modify an image, whereas Emu Edit's advantage is that it follows instructions precisely.

Meta trained Emu Edit on a dataset of 10 million synthetic samples, which it claims is the largest of its kind. Each sample contains an input image, a task description, and a target output image. According to Meta, this lets the model execute instructions faithfully and produce "better results than all the current competitors".
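To make the three-part sample layout concrete, here is a minimal Python sketch. The class name `EditSample`, its field names, the byte-string image representation, and the `is_complete` helper are all hypothetical illustrations; Meta's actual data format is not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSample:
    """Hypothetical container for one training sample (not Meta's real schema)."""
    input_image: bytes   # encoded source image (placeholder representation)
    instruction: str     # text description of the editing task
    target_image: bytes  # encoded image after the edit has been applied


def is_complete(sample: EditSample) -> bool:
    """Return True only if all three parts of the sample are present."""
    return (
        bool(sample.input_image)
        and bool(sample.instruction.strip())
        and bool(sample.target_image)
    )


# Toy sample with fake image bytes, for illustration only.
sample = EditSample(b"fake-input", "remove the background", b"fake-output")
```

A completeness check like this is one plausible way to filter out broken samples before training; whether Meta does anything similar is not stated.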

Emu Video is a simple and efficient text-to-video generation method that uses diffusion models and builds on Meta's Emu image generation model. The development team explained that this video generation architecture can respond to a variety of inputs, including text only, image only, and text and image together. In addition, Emu Video can accept a text prompt to "animate" a user-provided image, thus providing "the ability to surpass past models".

Emu Video divides the video generation process into two steps: first it generates an image from the text prompt, and then it generates video conditioned on both the text and the generated image. This split-step approach allows researchers to train the video generation model efficiently.
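The two-step flow described above can be sketched in Python as follows. This shows the control flow only: the function names, the nested-list image type, and the toy stand-in "models" are assumptions made for illustration and have nothing to do with Meta's actual diffusion models or code.

```python
from typing import Callable, List

Image = List[List[float]]  # stand-in for an image: rows of pixel values
Video = List[Image]        # stand-in for a video: a list of frames


def generate_video(
    prompt: str,
    text_to_image: Callable[[str], Image],
    image_to_video: Callable[[str, Image], Video],
) -> Video:
    """Two-step generation: text -> image, then (text, image) -> video."""
    first_frame = text_to_image(prompt)        # step 1: conditioned on text only
    return image_to_video(prompt, first_frame)  # step 2: conditioned on text and image


def toy_text_to_image(prompt: str) -> Image:
    """Placeholder 'model': a 2x2 image whose pixels equal the prompt length."""
    value = float(len(prompt))
    return [[value, value], [value, value]]


def toy_image_to_video(prompt: str, image: Image, num_frames: int = 4) -> Video:
    """Placeholder 'model': brighten the image by 1.0 per frame."""
    return [[[px + t for px in row] for row in image] for t in range(num_frames)]


video = generate_video("a cat", toy_text_to_image, toy_image_to_video)
```

Passing the two stages in as callables keeps the split explicit: the second stage always sees both the prompt and the image produced by the first, and either stage can be swapped out independently.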

The researchers further explained that Emu Video differs from previous work such as Make-A-Video, which requires a long series of deep generative models. Emu Video is simpler and uses only two diffusion models to generate 4-second videos at 512x512 resolution and 16 frames per second. Meta cited evaluation data to show that Emu Video's output is better than others in the industry in both video quality and "faithfulness to the prompt".
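As a quick sanity check on the output format quoted above, the short Python calculation below works out how many frames and pixels one clip contains. The variable names are illustrative, and the figures use only the resolution, frame rate, and duration stated in the article.

```python
WIDTH, HEIGHT = 512, 512  # resolution stated in the article
FPS = 16                  # frames per second
DURATION_S = 4            # clip length in seconds

frames = FPS * DURATION_S                # 16 * 4 = 64 frames per clip
pixels_per_frame = WIDTH * HEIGHT        # 512 * 512 = 262144 pixels
total_pixels = frames * pixels_per_frame  # 64 * 262144 = 16777216 pixels
```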

In terms of quality, 96 percent of respondents preferred Emu Video to the earlier Make-A-Video approach, while in terms of faithfulness to the prompt, Emu Video was preferred by 85 percent of respondents.

