Overview

The MORPH benchmark is designed to evaluate the quality of AI-generated video content with a focus on morphological fidelity, temporal consistency, copyright risk, and other key visual and factual attributes. This benchmark enables objective comparison of cutting-edge text-to-video models, offering both fine-grained (per-video) and aggregate (per-model) scoring.


Scoring Methodology

Each video is rated across 7 evaluation dimensions:

  1. Visual Quality (VQ): Clarity, resolution, brightness, and color
  2. Temporal Consistency (TC): Consistency of objects and humans across frames
  3. Dynamic Degree (DD): Degree of dynamic change in the video
  4. Text-to-Video Alignment (TVA): Alignment between the text prompt and the video content
  5. Factual Consistency (FC): Consistency with common-sense and factual knowledge
  6. Morphological Fidelity (MF): Realism, anatomical consistency, and coherence of human figures across motion
  7. Copyright Risk (CR): Visual similarity to well-known media

Each dimension is rated on a 1–4 scale, with 4 representing optimal performance.

Score Calculation Formula

The final score for a video is computed as:

Final Score = ( (1/4) * (N_1 / 7) + (2/4) * (N_2 / 7) + (3/4) * (N_3 / 7) + (4/4) * (N_4 / 7) ) * 100

Where Nᵢ is the number of dimensions that received a score of i (1, 2, 3, or 4).

Example: For a video with dimension scores of [1, 2, 2, 2, 3, 4, 4] (one dimension rated 1, three rated 2, one rated 3, and two rated 4), the calculation would be:

Final Score = ( (1/4) * (1 / 7) + (2/4) * (3 / 7) + (3/4) * (1 / 7) + (4/4) * (2 / 7) ) * 100 ≈ 64.29
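
For reference, here is a minimal Python sketch of the scoring computation. The function names and the mean-based per-model aggregation are illustrative assumptions, not part of any official evaluation code.

```python
from collections import Counter

# The seven MORPH dimensions, in the order used throughout this document.
DIMENSIONS = ["VQ", "TC", "DD", "TVA", "FC", "MF", "CR"]

def final_score(ratings: list[int]) -> float:
    """Per-video Final Score from seven 1-4 dimension ratings."""
    assert len(ratings) == len(DIMENSIONS) and all(1 <= r <= 4 for r in ratings)
    counts = Counter(ratings)  # N_i = number of dimensions rated i
    return 100 * sum((i / 4) * (counts[i] / len(DIMENSIONS)) for i in range(1, 5))

def model_score(per_video_ratings: list[list[int]]) -> float:
    """Aggregate per-video scores into a per-model score (assumed: simple mean)."""
    scores = [final_score(r) for r in per_video_ratings]
    return sum(scores) / len(scores)

# Reproduces the worked example above: one 1, three 2s, one 3, two 4s.
print(round(final_score([1, 2, 2, 2, 3, 4, 4]), 2))  # 64.29
```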

Benchmark Results

See model-comparison.mdx for detailed tables of per-model and per-epoch results.


Evaluation Summary

  • Kling v1.5 has the highest overall score in the MORPH benchmark, with consistently strong performance in both visual quality and morphological fidelity.
  • Sora follows closely, with performance highly comparable to Kling across most dimensions.
  • Luma Ray2 and Ray2 Flash models also performed well, but exhibited occasional issues with factual consistency (e.g., vehicles moving in the wrong direction on highways).
  • Pixverse v3.5 and v4 delivered solid morphological fidelity despite ranking slightly lower overall.
  • Models such as Minimax Video-01, Mochi v1, and Wan v2.1-1.3B frequently produced unstable figures, including missing limbs, distorted hands, and characters disappearing across frames, leading to lower temporal and morphological scores.

Model Evaluator Performance

  • The evaluator model achieved an average Spearman correlation of ~0.50 against the human-annotated test set, in line with the earlier VideoScore benchmark from TIGER-Lab (see the correlation sketch after this list).
  • The copyright risk dimension underperformed on correlation: despite class oversampling, the model still struggles to learn the patterns associated with borderline or real-world content.
  • The model tends to assign [4, 4, 4, 4, 4, 4, 3] (near-perfect scores, with a 3 on the final copyright-risk dimension) to videos that closely resemble real footage, likely due to exposure to similarly high-quality real videos in the training set.
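
Below is a minimal sketch of how per-dimension Spearman correlations against human labels can be computed with scipy. The score arrays are invented placeholders; the real evaluation uses the full human-annotated test set.

```python
import numpy as np
from scipy.stats import spearmanr

DIMENSIONS = ["VQ", "TC", "DD", "TVA", "FC", "MF", "CR"]

# Placeholder ratings: rows are test-set videos, columns follow DIMENSIONS.
human_scores = np.array([[4, 3, 2, 4, 3, 4, 4],
                         [2, 2, 3, 3, 2, 1, 4],
                         [3, 4, 4, 4, 4, 3, 3]])
model_scores = np.array([[4, 4, 2, 4, 3, 4, 3],
                         [2, 3, 3, 3, 1, 2, 4],
                         [3, 4, 3, 4, 4, 4, 3]])

# Correlate human and evaluator ratings separately for each dimension,
# then report the average across dimensions.
per_dim = []
for d, name in enumerate(DIMENSIONS):
    rho, _ = spearmanr(human_scores[:, d], model_scores[:, d])
    per_dim.append(rho)
    print(f"{name}: Spearman rho = {rho:.2f}")

print(f"Average: {np.mean(per_dim):.2f}")
```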

Future Improvements

  • Improve label diversity: Increase training data diversity, especially for underrepresented or ambiguous examples, to help the model generalize better across all dimensions.
  • Refine scoring objective: Transition to a regression-based scoring approach (as used in the original VideoScore model) for more accurate and continuous feedback.
  • Scale evaluations: Expand the number of prompts per model (e.g., from 10 to 30) for more reliable benchmarks and reduced sampling noise.