Overview

The MORPH benchmark is designed to evaluate the quality of AI-generated video content with a focus on morphological fidelity, temporal consistency, copyright risk, and other key visual and factual attributes. This benchmark enables objective comparison of cutting-edge text-to-video models, offering both fine-grained (per-video) and aggregate (per-model) scoring.


Scoring Methodology

Each video is rated across 7 evaluation dimensions:

  1. Visual Quality (VQ): Clarity, resolution, brightness, and color
  2. Temporal Consistency (TC): Consistency of objects and humans across frames
  3. Dynamic Degree (DD): Degree of dynamic change in the video
  4. Text-to-Video Alignment (TVA): Alignment between the text prompt and the video content
  5. Factual Consistency (FC): Consistency with common-sense and factual knowledge
  6. Morphological Fidelity (MF): Realism, anatomical consistency, and coherence of human figures across motion
  7. Copyright Risk (CR): Visual similarity to well-known media

Each dimension is rated on a 1–4 scale, with 4 representing optimal performance.

Score Calculation Formula

The final score for a video is computed as:

Final Score = ( (1/4) * (N_1 / 7) + (2/4) * (N_2 / 7) + (3/4) * (N_3 / 7) + (4/4) * (N_4 / 7) ) * 100

Where Nᵢ is the number of dimensions that received a score of i (1, 2, 3, or 4).

Example: For a video with dimension scores of [1, 2, 2, 2, 3, 4, 4] (one dimension rated 1, three rated 2, one rated 3, and two rated 4), the calculation would be:

Final Score = ( (1/4) * (1 / 7) + (2/4) * (3 / 7) + (3/4) * (1 / 7) + (4/4) * (2 / 7) ) * 100 ≈ 64.29
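
For reference, here is a minimal Python sketch of the scoring computation. The function names and the mean-based per-model aggregation are illustrative assumptions, not part of any official evaluation code.

```python
from collections import Counter

# The seven MORPH dimensions, in the order used throughout this document.
DIMENSIONS = ["VQ", "TC", "DD", "TVA", "FC", "MF", "CR"]

def final_score(ratings: list[int]) -> float:
    """Per-video Final Score from seven 1-4 dimension ratings."""
    assert len(ratings) == len(DIMENSIONS) and all(1 <= r <= 4 for r in ratings)
    counts = Counter(ratings)  # N_i = number of dimensions rated i
    return 100 * sum((i / 4) * (counts[i] / len(DIMENSIONS)) for i in range(1, 5))

def model_score(per_video_ratings: list[list[int]]) -> float:
    """Aggregate per-video scores into a per-model score (assumed: simple mean)."""
    scores = [final_score(r) for r in per_video_ratings]
    return sum(scores) / len(scores)

# Reproduces the worked example above: one 1, three 2s, one 3, two 4s.
print(round(final_score([1, 2, 2, 2, 3, 4, 4]), 2))  # 64.29
```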

Benchmark Results

See model-comparison.mdx for detailed tables of per-model and per-epoch results.


Evaluation Summary

  • Kling v1.5 has the highest overall score in the MORPH benchmark, with consistently strong performance in both visual quality and morphological fidelity.
  • Sora follows closely, with performance highly comparable to Kling across most dimensions.
  • Luma Ray2 and Ray2 Flash models also performed well, but exhibited occasional issues with factual consistency (e.g., vehicles moving in the wrong direction on highways).
  • Pixverse v3.5 and v4 delivered solid morphological fidelity despite ranking slightly lower overall.
  • Models such as Minimax Video-01, Mochi v1, and Wan v2.1-1.3B frequently produced unstable figures, including missing limbs, distorted hands, and characters disappearing across frames, leading to lower temporal and morphological scores.

Model Evaluator Performance

  • The evaluator model achieved an average Spearman correlation of ~0.50 against the human-annotated test set, in line with the earlier VideoScore benchmark from TIGER-Lab (see the correlation sketch after this list).
  • The copyright risk dimension underperformed on correlation: despite class oversampling, the model still struggles to learn the patterns associated with borderline or real-world content.
  • The model tends to assign [4, 4, 4, 4, 4, 4, 3] (near-perfect scores, with a 3 on the final copyright-risk dimension) to videos that closely resemble real footage, likely due to exposure to similarly high-quality real videos in the training set.
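
Below is a minimal sketch of how per-dimension Spearman correlations against human labels can be computed with scipy. The score arrays are invented placeholders; the real evaluation uses the full human-annotated test set.

```python
import numpy as np
from scipy.stats import spearmanr

DIMENSIONS = ["VQ", "TC", "DD", "TVA", "FC", "MF", "CR"]

# Placeholder ratings: rows are test-set videos, columns follow DIMENSIONS.
human_scores = np.array([[4, 3, 2, 4, 3, 4, 4],
                         [2, 2, 3, 3, 2, 1, 4],
                         [3, 4, 4, 4, 4, 3, 3]])
model_scores = np.array([[4, 4, 2, 4, 3, 4, 3],
                         [2, 3, 3, 3, 1, 2, 4],
                         [3, 4, 3, 4, 4, 4, 3]])

# Correlate human and evaluator ratings separately for each dimension,
# then report the average across dimensions.
per_dim = []
for d, name in enumerate(DIMENSIONS):
    rho, _ = spearmanr(human_scores[:, d], model_scores[:, d])
    per_dim.append(rho)
    print(f"{name}: Spearman rho = {rho:.2f}")

print(f"Average: {np.mean(per_dim):.2f}")
```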

Future Improvements

  • Improve label diversity: Increase training data diversity, especially for underrepresented or ambiguous examples, to help the model generalize better across all dimensions.
  • Refine scoring objective: Transition to a regression-based scoring approach (as used in the original VideoScore model) for more accurate and continuous feedback.
  • Scale evaluations: Expand the number of prompts per model (e.g., from 10 to 30) for more reliable benchmarks and reduced sampling noise.