Facebook and Instagram will soon get new AI-powered tools. In a Facebook post, Meta CEO Mark Zuckerberg announced two new Emu AI tools: Emu Edit and Emu Video. These tools will let users edit images and create video clips from text descriptions.
The first tool, Emu Edit, lets users tweak images precisely based on text inputs. It resembles existing tools from Adobe or Google, allowing users to remove or replace objects in photos without needing professional editing skills. The second tool, Emu Video, generates videos from text or images. While not ultra-realistic, these videos look better than the rough animations of Meta’s earlier Make-A-Video system.
In this article, we cover both of Meta’s new Emu AI tools. Let’s get started!
Emu Edit: Precise Image Editing Via Text Instructions
Emu Edit is a multi-task image editing model that sets a new state-of-the-art in instruction-based image editing. It employs multi-task learning to train a single model capable of diverse image editing and computer vision tasks. This represents a departure from prior work that focused on individual tasks like object removal or color editing.
Emu Edit’s Editing Tasks
The researchers compiled a dataset covering 16 distinct tasks grouped into three categories:
1. Region-Based Editing
This involves tasks like adding, removing, or substituting objects and changing textures. For example, a user could input the text “Add a parrot” to an image of a forest, and Emu Edit would add a parrot to the image without altering other elements.
2. Free-Form Editing
This includes tasks like changing the color or shape of an object, or altering the texture of an image. For instance, a user could input the text “Change the sky to be gray” to an image, and Emu Edit would change the sky color to gray.
3. Computer Vision Tasks
These include tasks like object detection, segmentation, and depth estimation. For example, a user could input the text “Detect all dogs in the image” to an image, and Emu Edit would highlight all the dogs in the image.

Emu Edit’s Image Editing Capabilities
1. Composition of Add and Detect Tasks
Emu Edit can add objects to an image and then detect them in the same or subsequent images. This is particularly useful for tasks where the presence of a certain object needs to be confirmed.
2. Composition of Add and Style Tasks
Emu Edit can add objects to an image and then apply style changes to them. This is useful for tasks where the style of an object needs to be modified after it has been added to the image.
3. Image Inpainting
Emu Edit can fill in missing or corrupted parts of an image. This is a complex task that involves understanding the context of the image and generating plausible content to fill in the gaps.
4. Contour Detection
Emu Edit can detect the boundaries of objects in an image. This is useful for tasks that require understanding the shape or outline of objects.
5. Super-Resolution
Emu Edit can increase the resolution of an image. This is useful for tasks that require high-quality images, such as zooming in on a detail or printing a high-resolution image.
Emu Edit’s Multi-Turn Image Editing
Emu Edit is capable of multi-turn image editing. In this process, each subsequent image is derived from the prior one, using its associated caption. The initial image is based on a zeroed reference. This means that the model can generate a series of edited images based on a sequence of text instructions, with each new image being an edited version of the previous one.
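To make the chaining concrete, here is a minimal sketch of multi-turn editing as a loop in which each turn receives the previous turn's output. The `apply_edit` function and the toy `Image` type (a list of applied instructions) are hypothetical stand-ins for illustration; they are not part of any Meta API.

```python
from typing import Callable, List

# Toy stand-in for an image: the list of edit instructions applied so far.
Image = List[str]


def apply_edit(image: Image, instruction: str) -> Image:
    """Hypothetical single-turn editor: returns a new image with one more edit."""
    return image + [instruction]


def multi_turn_edit(
    image: Image,
    instructions: List[str],
    edit_fn: Callable[[Image, str], Image] = apply_edit,
) -> List[Image]:
    """Apply instructions in order; each turn edits the previous turn's output."""
    results = []
    current = image
    for instruction in instructions:
        current = edit_fn(current, instruction)
        results.append(current)
    return results


# Two turns: the second edit starts from the result of the first.
steps = multi_turn_edit([], ["Add a parrot", "Change the sky to be gray"])
```

The list returned by `multi_turn_edit` holds one image per turn, so the intermediate results remain available alongside the final one.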

Emu Edit’s Learned Task Embeddings
Emu Edit uses “learned task embeddings” to steer the model toward the correct task. For each task, the researchers train an embedding vector that encodes the task identity. During training, the task embedding is provided to the model and optimized jointly with the model weights. At inference time, a text classifier predicts the most appropriate task embedding based on the instruction. The embedding then guides the model to apply the correct type of transformation.
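The inference-time lookup can be sketched roughly as follows. This is a toy illustration under stated assumptions: the three task names, the embedding size, and the keyword-based `predict_task` are all hypothetical, standing in for the 16 tasks and the text classifier described above. The vectors here are randomly initialized, whereas in the real system they would be trained jointly with the model weights.

```python
import random
from typing import Dict, List

EMBED_DIM = 8  # hypothetical embedding size
# Hypothetical coarse task names; the article describes 16 tasks in three categories.
TASKS = ["region_edit", "free_form_edit", "vision_task"]


def init_task_embeddings(tasks: List[str], dim: int, seed: int = 0) -> Dict[str, List[float]]:
    """Create one vector per task (random here; trained in a real system)."""
    rng = random.Random(seed)
    return {task: [rng.gauss(0.0, 1.0) for _ in range(dim)] for task in tasks}


def predict_task(instruction: str) -> str:
    """Keyword rule standing in for the text classifier (an assumption)."""
    words = set(instruction.lower().split())
    if words & {"detect", "segment", "depth"}:
        return "vision_task"
    if words & {"add", "remove", "replace"}:
        return "region_edit"
    return "free_form_edit"


def select_embedding(instruction: str, table: Dict[str, List[float]]) -> List[float]:
    """Pick the task embedding that would be passed to the editing model."""
    return table[predict_task(instruction)]


table = init_task_embeddings(TASKS, EMBED_DIM)
embedding = select_embedding("Detect all dogs in the image", table)
```

The point of the design is that the instruction text alone determines which vector conditions the model, so the user never has to name the task explicitly.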

Emu Edit vs. InstructPix2Pix and MagicBrush
InstructPix2Pix and MagicBrush are models that edit images based on given instructions. However, they often struggle to understand and execute these instructions accurately, limiting their adaptability to different tasks.
Emu Edit is introduced to address these limitations. It is trained on diverse tasks, making it better at following instructions while preserving image quality. Emu Edit outperforms both InstructPix2Pix and MagicBrush in accurately executing editing instructions and maintaining the original image’s visual quality.
Human evaluators showed a strong preference for Emu Edit over InstructPix2Pix and MagicBrush. Apart from one method that uses specific ground-truth captions, Emu Edit also outperformed these models in automatic metrics, indicating its strength in instruction-based image editing.

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
In addition to image editing, Meta’s AI team has also been working on video generation. Emu Video is a generative AI tool designed to create videos from text inputs.
Emu Video uses just two diffusion models to generate high-resolution videos, a notable simplification compared with Meta’s previous tool, Make-A-Video, which used five. This approach lets Emu Video generate 512×512 videos at 16 frames per second.
Emu Video Factorized Approach
Emu Video uses a factorized approach to video generation, which splits the problem into two simpler steps.
- First, a high-quality image is generated based on a text prompt.
- Then, a full video is created based on both the synthesized image and the original text prompt.
According to Meta, this method is more efficient and effective than prior methods that required a deeper cascade of models.
This image factorization strengthens the overall conditioning signal, providing vital missing information to guide the video generation process. The generated image acts as a starting point that the model can then imagine moving and evolving over time based on the text description.
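The two-step structure can be sketched as a small pipeline. Both `generate_image` and `generate_video` are hypothetical placeholders that return toy pixel grids rather than real diffusion outputs, and the one-second clip length is an assumption; only the 16 frames per second figure comes from the article.

```python
from typing import List

Frame = List[List[float]]  # toy grayscale image: rows of pixel values in [0, 1]


def generate_image(prompt: str, size: int = 4) -> Frame:
    """Hypothetical step one (text to image): returns a flat placeholder image."""
    value = (len(prompt) % 10) / 10.0
    return [[value] * size for _ in range(size)]


def generate_video(image: Frame, prompt: str, num_frames: int) -> List[Frame]:
    """Hypothetical step two: conditioned on both the image and the prompt.

    The toy ignores the prompt and just brightens the starting image slightly
    each frame, so frame 0 equals the conditioning image.
    """
    frames = []
    for t in range(num_frames):
        frames.append([[min(1.0, px + 0.01 * t) for px in row] for row in image])
    return frames


def text_to_video(prompt: str, fps: int = 16, seconds: int = 1) -> List[Frame]:
    """Factorized generation: text -> image, then (image, text) -> video."""
    image = generate_image(prompt)
    return generate_video(image, prompt, num_frames=fps * seconds)


clip = text_to_video("a cute raccoon")
```

Note that the prompt is passed to both steps, which mirrors the description above: the second stage sees the synthesized image and the original text together.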
Example Videos Generated By Emu Video
Let’s have a look at some of the videos generated by Meta Emu Video. Also, check the Meta Emu Video generation demo.
1. A Cute Raccoon (Photorealistic Style)
2. A Panda (Cubist Painting Style)
3. An Origami Brown Bear (Anime Manga Style)
4. A Miniature Blue Dragon (Paper Cut Craft Illustration Style)
5. A Gray British Shorthair (Steampunk Style)
Emu Video vs. Other AI Video Generation Tools
In human comparisons against Align Your Latents, PYOCO, Reuse & Diffuse, Gen2, and PikaLabs, Emu Video is preferred mainly for its pixel sharpness and motion smoothness. The amount of motion in Emu Video generations is also a significant factor in its wins over PYOCO and PikaLabs.

Emu Video does not win every individual comparison, however. In the cases where evaluators preferred Make-A-Video, Imagen Video, or Gen2, the reasons differed by model: Make-A-Video generations were preferred for object consistency, Imagen Video generations for the amount of motion, and Gen2 generations for motion smoothness and pixel sharpness.

Emu Video Win Rate Percentage: Video Quality and Text Faithfulness
Emu Video outperforms previous methods in video quality and text faithfulness, with win rates ranging from 56.4% to 100%. In human evaluations of quality, its generations were preferred 81% of the time over Imagen Video, 90% over PYOCO, and 96% over Make-A-Video. It also surpasses commercial solutions like Gen2 and PikaLabs. Emu Video also stands out for its ability to animate user-provided images based on text prompts, where its generations were preferred 96% of the time over prior work.
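For readers unfamiliar with the metric, a win rate is the share of head-to-head comparisons in which one model's output is preferred. The sketch below shows one common way to compute it, assuming several raters per sample and a strict majority vote; the article does not spell out the exact protocol, and the votes are made-up toy values, not results from the evaluation.

```python
from collections import Counter
from typing import List


def win_rate(votes_per_sample: List[List[str]], model: str) -> float:
    """Percentage of samples where `model` gets a strict majority of votes."""
    wins = 0
    for votes in votes_per_sample:
        counts = Counter(votes)
        if counts[model] > len(votes) / 2:
            wins += 1
    return 100.0 * wins / len(votes_per_sample)


# Made-up votes for illustration only: three raters per sample.
toy_votes = [
    ["emu", "emu", "other"],
    ["emu", "other", "other"],
    ["emu", "emu", "emu"],
    ["emu", "emu", "other"],
]
rate = win_rate(toy_votes, "emu")  # 3 of 4 samples won
```

A win rate of 50% would mean the two models are preferred equally often, which is why figures such as 81% or 96% indicate a clear preference.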

Emu Video represents a significant advancement in AI video generation. Its factorized approach, simplicity, and high performance make it a powerful tool for creating videos based on text inputs.
The Road Ahead
The introduction of Emu Edit and Emu Video by Meta represents a significant milestone in the field of AI. These tools offer precise control over image and video editing tasks, ensuring that only relevant pixels are altered. This approach aims to streamline a range of image and video manipulation tasks while improving their precision.
The potential use cases for these tools are vast. They can enable users to create their own animated stickers and GIFs on the fly, rather than searching for existing ones that match the idea they’re trying to convey. They can also enable people to edit their own photographs without using complicated tools such as Photoshop.