MORPH Research Overview
Overview and methodology for the MORPH benchmark: evaluating AI-generated video content.
Overview
The MORPH benchmark is designed to evaluate the quality of AI-generated video content with a focus on morphological fidelity, temporal consistency, copyright risk, and other key visual and factual attributes. This benchmark enables objective comparison of cutting-edge text-to-video models, offering both fine-grained (per-video) and aggregate (per-model) scoring.
- Benchmark model: https://huggingface.co/ChimaAI/MORPH-benchmark
- Training dataset: https://huggingface.co/datasets/ChimaAI/MORPH-dataset
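To pull these artifacts locally, a minimal sketch using standard Hugging Face tooling might look like the following (the default dataset configuration is assumed; this is not taken from the repos' own documentation):

```python
# Sketch: fetch the MORPH artifacts from the Hugging Face Hub.
from datasets import load_dataset
from huggingface_hub import snapshot_download

dataset = load_dataset("ChimaAI/MORPH-dataset")            # annotation/training data
model_dir = snapshot_download("ChimaAI/MORPH-benchmark")   # evaluator checkpoint files

print(dataset)
print("Evaluator files downloaded to:", model_dir)
```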
Scoring Methodology
Each video is rated across 7 evaluation dimensions:
- Visual Quality (VQ): Clarity, resolution, brightness, and color
- Temporal Consistency (TC): Consistency of objects and humans across frames
- Dynamic Degree (DD): Degree of motion and dynamic change in the video
- Text-to-Video Alignment (TVA): Alignment between the text prompt and the video content
- Factual Consistency (FC): Consistency with common-sense and factual knowledge
- Morphological Fidelity (MF): Realism, anatomical consistency, and coherence of human figures across motion
- Copyright Risk (CR): Visual similarity to well-known media
Each aspect is rated on a 1–4 scale, with 4 representing optimal performance.
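As a concrete illustration of the rating scheme, a single per-video rating across the seven dimensions can be represented as below (the field names are illustrative shorthand, not the benchmark's actual schema):

```python
from dataclasses import dataclass, astuple

@dataclass
class MorphRating:
    """One per-video rating across the seven MORPH dimensions (each 1-4).

    Field names are illustrative shorthand, not the benchmark's actual schema.
    """
    visual_quality: int          # VQ
    temporal_consistency: int    # TC
    dynamic_degree: int          # DD
    text_video_alignment: int    # TVA
    factual_consistency: int     # FC
    morphological_fidelity: int  # MF
    copyright_risk: int          # CR

    def __post_init__(self):
        # Enforce the 1-4 rating scale on every dimension.
        for name, value in zip(self.__dataclass_fields__, astuple(self)):
            if not 1 <= value <= 4:
                raise ValueError(f"{name} must be in 1..4, got {value}")

rating = MorphRating(4, 4, 4, 4, 4, 4, 3)
```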
Score Calculation Formula
The final score for a video is computed as:

Score = (1·N₁ + 2·N₂ + 3·N₃ + 4·N₄) / 7

where Nᵢ is the number of aspects that received a score of i (1, 2, 3, or 4).
Example: for a video rated [4, 4, 4, 4, 4, 4, 3] across the seven aspects (N₄ = 6, N₃ = 1), the score is (4·6 + 3·1) / 7 = 27/7 ≈ 3.86.
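A minimal sketch of this calculation in code, assuming the final score is the count-weighted mean defined above (the function name is illustrative):

```python
from collections import Counter

def morph_score(ratings: list[int]) -> float:
    """Aggregate seven per-aspect ratings (each 1-4) into one video score.

    Implements sum(i * N_i) / 7, where N_i counts the aspects rated i.
    """
    assert len(ratings) == 7 and all(1 <= r <= 4 for r in ratings)
    counts = Counter(ratings)  # N_1 .. N_4
    return sum(i * counts[i] for i in range(1, 5)) / len(ratings)

# The worked example above: N_4 = 6, N_3 = 1 -> (4*6 + 3*1) / 7 ≈ 3.86
print(round(morph_score([4, 4, 4, 4, 4, 4, 3]), 2))
```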
Benchmark Results
See model-comparison.mdx for detailed tables of per-model and per-epoch results.
Evaluation Summary
- Kling v1.5 has the highest overall score in the MORPH benchmark, with consistently strong performance in both visual quality and morphological fidelity.
- Sora follows closely, with performance highly comparable to Kling across most dimensions.
- Luma Ray2 and Ray2 Flash models also performed well, but exhibited occasional issues with factual consistency (e.g., vehicles moving in the wrong direction on highways).
- Pixverse v3.5 and v4 delivered solid morphological fidelity despite ranking slightly lower overall.
- Models such as Minimax Video-01, Mochi v1, and Wan v2.1-1.3B frequently produced unstable figures, including missing limbs, distorted hands, and characters disappearing across frames, leading to lower temporal and morphological scores.
Model Evaluator Performance
- The evaluator model achieved an average Spearman correlation of ~0.50 against the human-annotated test set, in line with the earlier VideoScore benchmark from TIGER-Lab (a per-dimension computation is sketched after this list).
- The copyright risk dimension underperformed on correlation: despite class oversampling, the model still struggles to learn patterns associated with borderline or real-world content.
- The model tends to assign [4, 4, 4, 4, 4, 4, 3] to videos that closely resemble real footage, likely due to exposure to similarly high-quality real videos in the training set.
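A sketch of how such per-dimension correlations could be computed, assuming human and model ratings are available as parallel arrays (the variable names and stand-in data are illustrative, not real benchmark data):

```python
import numpy as np
from scipy.stats import spearmanr

DIMENSIONS = ["VQ", "TC", "DD", "TVA", "FC", "MF", "CR"]

def per_dimension_spearman(human: np.ndarray, model: np.ndarray) -> dict[str, float]:
    """Spearman correlation between human and model ratings for each dimension.

    Both arrays have shape (num_videos, 7); columns follow DIMENSIONS order.
    """
    correlations = {}
    for j, dim in enumerate(DIMENSIONS):
        rho, _ = spearmanr(human[:, j], model[:, j])
        correlations[dim] = rho
    return correlations

# Illustrative usage with random stand-in ratings:
rng = np.random.default_rng(0)
human = rng.integers(1, 5, size=(50, 7))
model = np.clip(human + rng.integers(-1, 2, size=(50, 7)), 1, 4)
print(per_dimension_spearman(human, model))
```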
Future Improvements
- Improve label diversity: Increase training data diversity, especially for underrepresented or ambiguous examples, to help the model generalize better across all dimensions.
- Refine scoring objective: Transition to a regression-based scoring approach (as used in the original VideoScore model) for more accurate and continuous feedback; a sketch follows this list.
- Scale evaluations: Expand the number of prompts per model (e.g., from 10 to 30) for more reliable benchmarks and reduced sampling noise.
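To make the regression-based scoring item concrete, here is a hedged sketch of what such a head and objective might look like; the architecture, feature dimension, and variable names are assumptions, not the MORPH evaluator's actual design:

```python
import torch
import torch.nn as nn

class RegressionScoringHead(nn.Module):
    """Map pooled video-text features to 7 continuous scores in [1, 4].

    Sketch only: feature_dim and the backbone producing the features are
    assumptions, not the MORPH evaluator's actual architecture.
    """
    def __init__(self, feature_dim: int = 768, num_dims: int = 7):
        super().__init__()
        self.proj = nn.Linear(feature_dim, num_dims)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Squash to (0, 1), then rescale onto the benchmark's 1-4 range.
        return 1.0 + 3.0 * torch.sigmoid(self.proj(features))

# Regression objective: mean-squared error against the human ratings.
head = RegressionScoringHead()
features = torch.randn(8, 768)                 # stand-in pooled features
targets = torch.randint(1, 5, (8, 7)).float()  # stand-in human ratings (1-4)
loss = nn.functional.mse_loss(head(features), targets)
loss.backward()
```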