Qwen-Image-Agent — Bridging the Context Gap in Real-World Image Generation

2026-07-09 2026-07-15

Agent / Agentic Image Generation

a few seconds read (About 15 words)

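Pain point: a request like "make a poster" usually lacks detail: implicit intent, real-time knowledge, conversation history. T2I models are trained on complete prompts but deployed on incomplete context. The authors call this the Context Gap (the user's context ≠ the context generation needs).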

AIG, Agent, DL

InnoAds-Composer — Efficient Condition Composition for E-Commerce Poster Generation

2026-07-04 2026-07-15

AIGC / Diffusion Model

a few seconds read (About 14 words)

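Task: an e-commerce poster has to get the product subject, the promotional copy, and the background style right in a single image. Multi-stage pipelines (synthesize the scene first, then paste the text) often produce distorted subjects, misspelled text, and inconsistent style.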

AIGC, DL, DM

MoFu — Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

2026-07-04 2026-07-15

AIGC / Video Generation

a few seconds read (About 13 words)

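Task: multi-subject video generation. Given text plus multiple reference images, generate a video in which the subjects stay consistent and their relative scales look natural.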

AIGC, DL, VG

Lay2Story — Extending Diffusion Transformers for Layout-Togglable Story Generation

2026-07-03 2026-07-15

AIGC / Video Generation

a few seconds read (About 14 words)

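Storytelling: generate a multi-frame image sequence from a set of prompts while keeping the protagonist's appearance consistent. Existing training-free methods (which modify cross-frame attention) and training-based methods both struggle with fine-grained control over position, clothing, expression, and pose, and large-scale datasets with layout annotations are lacking.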

AIGC, DL, VG

Qihoo-T2X — An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

2026-07-03 2026-07-15

AIGC / Diffusion Model

a few seconds read (About 15 words)

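Problem: global self-attention in DiT costs $O(N^2)$ in the number of visual tokens $N$, which makes high-resolution images and long videos too expensive to compute. Moreover, PixArt attention maps show that tokens within the same window attend to distant positions almost identically, so much of the global attention is redundant.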
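To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch. The 8× VAE downsampling and patch size 2 are assumptions typical of latent DiT setups, not values stated in this post, and the helper names `num_tokens` and `attention_pairs` are hypothetical.

```python
def num_tokens(height, width, vae_down=8, patch=2):
    """Visual token count for one image.

    vae_down and patch are assumed defaults (8x latent downsampling,
    2x2 patches); swap in the values of the model you care about.
    """
    stride = vae_down * patch
    return (height // stride) * (width // stride)


def attention_pairs(n_tokens):
    """Entries in one full self-attention map: every token attends to every token."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    for side in (512, 1024, 2048):
        n = num_tokens(side, side)
        print(f"{side}px: {n} tokens, {attention_pairs(n)} attention entries")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the attention map size by 16, which is why full global attention becomes the bottleneck at high resolution.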

AIGC, DL, DM

U-StyDiT — Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers

2026-07-02 2026-07-15

AIGC / Diffusion Model

a few seconds read (About 14 words)

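Task: given a content image and a style image, generate an ultra-high-quality artistic stylization in which structure follows the content, brushwork follows the style, and there are no artifacts or disharmonious textures.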

AIGC, DL, DM

RelaCtrl — Relevance-Guided Efficient Control for Diffusion Transformers

2026-07-01 2026-07-15

AIGC / Diffusion Model

a few seconds read (About 13 words)

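Adding controllable generation (Canny, Depth, Seg, etc.) to a DiT is heavy with mainstream approaches: PixArt-δ copies the first 13 DiT blocks to build its ControlNet, raising parameters and FLOPs by about 50% each; OminiControl concatenates control tokens into the sequence, doubling the token count and raising FLOPs by about 70%. More importantly, both assume that every layer matters equally for the control signal and stack control modules uniformly, which creates substantial redundancy.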

AIGC, DL, DM

WISA — World Simulator Assistant for Physics-Aware Text-to-Video Generation

2026-07-01 2026-07-15

AIGC / Video Generation

a few seconds read (About 13 words)

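Sora, Kling, and CogVideoX generate realistic videos but often violate physics: an eraser makes writing darker the more it rubs, an apple falls into water without a splash, liquid moves like random noise. The root cause is a missing bridge between abstract physical laws and pixel generation: models learn what a frame looks like, not how a process should evolve.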

AIGC, DL, VG

FancyVideo — Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

2026-07-01 2026-07-15

AIGC / Video Generation

a few seconds read (About 13 words)

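Pain point: T2V models such as AnimateDiff copy the same text embedding to every frame for spatial cross-attention → the region the [verb] token attends to barely changes across frames → motion is weak, and the problem is more visible in long videos.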

AIGC, DL, VG

PixArt-δ — Fast and Controllable Image Generation with Latent Consistency Models

2026-06-30 2026-07-15

AIGC / Diffusion Model

a few seconds read (About 14 words)

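PixArt-α is already an efficient DiT text-to-image base model; PixArt-δ adds two capabilities on top of it: LCM distillation cuts sampling from 14 steps to 2–4 steps, reaching 0.5 s per 1024px image on an A100 (about 7× faster than α); ControlNet-Transformer injects conditions such as edges and depth into the DiT for fine-grained controllable generation.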

AIGC, DL, DM
