In today's AI boom, new AI applications keep entering the public eye, and AI is reshaping industry after industry. In the author's view, if ChatGPT marked the start of the AI revolution, then multimodal large models surely represent the future of AI applications.
This article is a resource collection for multimodal large language models (MLLMs), gathering papers, applications, datasets, and other learning materials for a wide range of models, large and small. Feel free to like and bookmark it.
The author has previously written about some of the projects below; a partial list follows for interested readers:
Audiocraft: a PyTorch-based deep-learning research library for AI audio generation
Recognize_Anything-Tag2Text: a powerful image tagging model together with Tag2Text
......
🔥🔥🔥 This is a curated list of Multimodal Large Language Models (MLLMs), covering datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, foundation models, and more.
🔥🔥🔥 The list is updated continuously.
🔥🔥🔥 A survey paper on MLLMs is in preparation and will be released soon!
Table of Contents
• Awesome Papers[1]
• Multimodal Instruction Tuning[2]
• Multimodal In-Context Learning[3]
• Multimodal Chain-of-Thought[4]
• LLM-Aided Visual Reasoning[5]
• Foundation Models[6]
• Others[7]
• Awesome Datasets[8]
• Datasets of Pre-Training for Alignment[9]
• Datasets of Multimodal Instruction Tuning[10]
• Datasets of In-Context Learning[11]
• Datasets of Multimodal Chain-of-Thought[12]
• Others[13]
Chinese translations of some of the papers below are available; contact the author if you need them.
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration[14] | arXiv | 2023-06-15 | Github[15] | Coming soon[16] |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[17] | arXiv | 2023-06-11 | Github[18] | Demo[19] |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[20] | arXiv | 2023-06-08 | Github[21] | Demo[22] |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning[23] | arXiv | 2023-06-08 | Github[24] | Demo[25] |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[26] | arXiv | 2023-06-07 | - | - |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding[27] | arXiv | 2023-06-05 | Github[28] | Demo[29] |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day[30] | arXiv | 2023-06-01 | Github[31] | - |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[32] | arXiv | 2023-05-30 | Github[33] | Demo[34] |
PandaGPT: One Model To Instruction-Follow Them All[35] | arXiv | 2023-05-25 | Github[36] | Demo[37] |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst[38] | arXiv | 2023-05-25 | Github[39] | - |
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models[40] | arXiv | 2023-05-24 | Github[41] | Local Demo |
DetGPT: Detect What You Need via Reasoning[42] | arXiv | 2023-05-23 | Github[43] | Demo[44] |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks[45] | arXiv | 2023-05-18 | Github[46] | Demo[47] |
Listen, Think, and Understand[48] | arXiv | 2023-05-18 | Github[49] | Demo[50] |
VisualGLM-6B | - | 2023-05-17 | Github[51] | Local Demo |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering[52] | arXiv | 2023-05-17 | Github[53] | - |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning[54] | arXiv | 2023-05-11 | Github[55] | Local Demo |
VideoChat: Chat-Centric Video Understanding[56] | arXiv | 2023-05-10 | Github[57] | Demo[58] |
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans[59] | arXiv | 2023-05-08 | Github[60] | Demo[61] |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[62] | arXiv | 2023-05-07 | Github[63] | - |
LMEye: An Interactive Perception Network for Large Language Models[64] | arXiv | 2023-05-05 | Github[65] | Local Demo |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model[66] | arXiv | 2023-04-28 | Github[67] | Demo[68] |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality[69] | arXiv | 2023-04-27 | Github[70] | Demo[71] |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models[72] | arXiv | 2023-04-20 | Github[73] | - |
Visual Instruction Tuning[74] | arXiv | 2023-04-17 | GitHub[75] | Demo[76] |
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention[77] | arXiv | 2023-03-28 | Github[78] | Demo[79] |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning[80] | ACL | 2022-12-21 | Github[81] | - |
The author has prepared Chinese versions of some of these papers; message the author privately to get them. A sample is shown below:
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
MIMIC-IT: Multi-Modal In-Context Instruction Tuning[82] | arXiv | 2023-06-08 | Github[83] | Demo[84] |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[85] | arXiv | 2023-04-19 | Github[86] | Demo[87] |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace[88] | arXiv | 2023-03-30 | Github[89] | Demo[90] |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[91] | arXiv | 2023-03-20 | Github[92] | Demo[93] |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering[94] | CVPR | 2023-03-03 | Github[95] | - |
Visual Programming: Compositional Visual Reasoning Without Training[96] | CVPR | 2022-11-18 | Github[97] | Local Demo |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA[98] | AAAI | 2022-06-28 | Github[99] | - |
Flamingo: a Visual Language Model for Few-Shot Learning[100] | NeurIPS | 2022-04-29 | Github[101] | Demo[102] |
Multimodal Few-Shot Learning with Frozen Language Models[103] | NeurIPS | 2021-06-25 | - | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought[104] | arXiv | 2023-05-24 | Github[105] | - |
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction[106] | arXiv | 2023-05-23 | - | - |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls[107] | arXiv | 2023-05-04 | Github[108] | Demo[109] |
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings[110] | arXiv | 2023-05-03 | Coming soon[111] | - |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[112] | arXiv | 2023-04-19 | Github[113] | Demo[114] |
Chain of Thought Prompt Tuning in Vision Language Models[115] | arXiv | 2023-04-16 | Coming soon | - |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[116] | arXiv | 2023-03-20 | Github[117] | Demo[118] |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models[119] | arXiv | 2023-03-08 | Github[120] | Demo[121] |
Multimodal Chain-of-Thought Reasoning in Language Models[122] | arXiv | 2023-02-02 | Github[123] | - |
Visual Programming: Compositional Visual Reasoning Without Training[124] | CVPR | 2022-11-18 | Github[125] | Local Demo |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering[126] | NeurIPS | 2022-09-20 | Github[127] | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[128] | arXiv | 2023-05-30 | Github[129] | Demo[130] |
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models[131] | arXiv | 2023-05-24 | Github[132] | - |
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models[133] | arXiv | 2023-05-24 | Github[134] | Local Demo |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[135] | arXiv | 2023-05-10 | Github[136] | - |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls[137] | arXiv | 2023-05-04 | Github[138] | Demo[139] |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[140] | arXiv | 2023-04-19 | Github[141] | Demo[142] |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace[143] | arXiv | 2023-03-30 | Github[144] | Demo[145] |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[146] | arXiv | 2023-03-20 | Github[147] | Demo[148] |
ViperGPT: Visual Inference via Python Execution for Reasoning[149] | arXiv | 2023-03-14 | Github[150] | Local Demo |
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions[151] | arXiv | 2023-03-12 | Github[152] | Local Demo |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models[153] | arXiv | 2023-03-08 | Github[154] | Demo[155] |
Prompt, Generate, then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners[156] | CVPR | 2023-03-03 | Github[157] | - |
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning[158] | CVPR | 2022-11-21 | Github[159] | - |
Visual Programming: Compositional Visual Reasoning Without Training[160] | CVPR | 2022-11-18 | Github[161] | Local Demo |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language[162] | arXiv | 2022-04-01 | Github[163] | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Transfer Visual Prompt Generator across LLMs[164] | arXiv | 2023-05-02 | Github[165] | Demo[166] |
GPT-4 Technical Report[167] | arXiv | 2023-03-15 | - | - |
PaLM-E: An Embodied Multimodal Language Model[168] | arXiv | 2023-03-06 | - | Demo[169] |
Prismer: A Vision-Language Model with An Ensemble of Experts[170] | arXiv | 2023-03-04 | Github[171] | Demo[172] |
Language Is Not All You Need: Aligning Perception with Language Models[173] | arXiv | 2023-02-27 | Github[174] | - |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[175] | arXiv | 2023-01-30 | Github[176] | Demo[177] |
VIMA: General Robot Manipulation with Multimodal Prompts[178] | ICML | 2022-10-06 | Github[179] | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Can Large Pre-trained Models Help Vision Models on Perception Tasks?[180] | arXiv | 2023-06-01 | Coming soon[181] | - |
Contextual Object Detection with Multimodal Large Language Models[182] | arXiv | 2023-05-29 | Github[183] | Demo[184] |
Generating Images with Multimodal Language Models[185] | arXiv | 2023-05-26 | Github[186] | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models[187] | arXiv | 2023-05-26 | Github[188] | - |
Evaluating Object Hallucination in Large Vision-Language Models[189] | arXiv | 2023-05-17 | Github[190] | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs[191] | ICML | 2023-01-31 | Github[192] | Demo[193] |
Name | Paper | Type | Modality |
---|---|---|---|
MS-COCO | Microsoft COCO: Common Objects in Context[194] | Caption | Image-Text |
SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs[195] | Caption | Image-Text |
Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning[196] | Caption | Image-Text |
LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs[197] | Caption | Image-Text |
VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations[198] | Caption | Image-Text |
Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models[199] | Caption | Image-Text |
AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding[200] | Caption | Image-Text |
Wukong | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark[201] | Caption | Image-Text |
Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks[202] | Caption | Video-Text |
MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language[203] | Caption | Video-Text |
Webvid10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval[204] | Caption | Video-Text |
WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research[205] | Caption | Audio-Text |
AISHELL-1 | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline[206] | ASR | Audio-Text |
AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale[207] | ASR | Audio-Text |
VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[208] | ASR | Image-Audio-Text |
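The caption-type pairs above (Conceptual Captions, LAION-400M, and similar) are commonly distributed not as image files but as rows of (caption, image URL) that you download and filter yourself. The two-column TSV layout below is an illustrative sketch of that convention, not the exact on-disk format of any one release:

```python
import csv
import io

# Hypothetical two-column TSV: caption <TAB> image URL, one pair per row.
# Real releases vary (TSV, parquet, webdataset shards), so treat this
# layout as an assumption for illustration only.
raw = (
    "a dog running on a beach\thttp://example.com/1.jpg\n"
    "two cups of coffee on a table\thttp://example.com/2.jpg\n"
)

# Parse each row into an (image_url, caption) training pair.
pairs = [
    (url, caption)
    for caption, url in csv.reader(io.StringIO(raw), delimiter="\t")
]

print(len(pairs))      # 2
print(pairs[0][1])     # a dog running on a beach
```

A downloader would then fetch each URL, drop broken links, and feed the surviving image-text pairs to a contrastive or captioning pre-training objective.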
Name | Paper | Link | Notes |
---|---|---|---|
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration[209] | Link[210] | A large-scale multimodal instruction dataset with multi-turn dialogues |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[211] | Link[212] | A comprehensive multimodal instruction-tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[213] | Link[214] | 100K high-quality video instruction data |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[215] | Coming soon[216] | Multimodal in-context instruction tuning |
M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[217] | Link[218] | A large-scale, broad-coverage multimodal instruction-tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day[219] | Coming soon[220] | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[221] | Link[222] | Tool-related instruction dataset |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst[223] | Coming soon[224] | A multimodal instruction-tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning[225] | Link[226] | An instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering[227] | Coming soon[228] | A large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding[229] | Link[230] | A video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[231] | Link[232] | A Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models[233] | Link[234] | A multimodal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models[235] | Link[236] | A multimodal aligned dataset for improving the model's usability and generation fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning[237] | Link[238] | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning[239] | Link[240] | The first multimodal instruction-tuning benchmark dataset |
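Most of the instruction-tuning sets above share a similar record layout: an image reference plus a multi-turn conversation, which is flattened into a single supervised-training string. The sketch below follows the JSON shape of LLaVA-Instruct-150K ("image", "conversations" with "from"/"value" turns), but the field names and the prompt template should be treated as illustrative assumptions rather than a guaranteed schema:

```python
# One hypothetical record in LLaVA-style visual instruction data.
# Field names mirror the released JSON but are assumptions here.
record = {
    "id": "000000215677",
    "image": "000000215677.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the cat doing?"},
        {"from": "gpt", "value": "The cat is sleeping on a red sofa."},
    ],
}

def to_training_text(rec, system="A chat between a user and an assistant."):
    """Flatten a multi-turn record into one supervised-training string."""
    parts = [system]
    for turn in rec["conversations"]:
        # Map the dataset's "human"/"gpt" roles onto prompt-template roles.
        role = "USER" if turn["from"] == "human" else "ASSISTANT"
        parts.append(f"{role}: {turn['value']}")
    return "\n".join(parts)

print(to_training_text(record))
```

During training, the `<image>` placeholder is typically replaced by the visual encoder's image tokens, and the loss is computed only on the ASSISTANT spans.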
Name | Paper | Link | Notes |
---|---|---|---|
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[241] | Coming soon[242] | A multimodal in-context instruction dataset |
Name | Paper | Link | Notes |
---|---|---|---|
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought[243] | Coming soon[244] | A large-scale embodied planning dataset |
VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction[245] | Coming soon | An inference-time dataset for evaluating VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering[246] | Link[247] | A large-scale multi-choice dataset featuring multimodal science questions across diverse domains |
Name | Paper | Link | Notes |
---|---|---|---|
IMAD | IMAD: IMage-Augmented multi-modal Dialogue[248] | Link[249] | A multimodal dialogue dataset |
LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[250] | Link[251] | A benchmark for quantitatively evaluating MLLMs on a variety of 2D/3D vision tasks |
OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality[252] | Link[253] | A dataset for evaluating multiple capabilities |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[254] | Link[255] | A quantitative evaluation framework for video conversation models |
LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models[256] | Link[257] | An evaluation platform for MLLMs |
CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[258] | Link[259] | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[260] | Link[261] | A manually photographed multimodal fine-tuning dataset for learning to reject instructions |
The content above is mainly translated and organized from GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models[262] and will be updated continuously. Please like and bookmark!
[1]
Awesome Papers: #超棒的论文
[2]
Multimodal Instruction Tuning: #多模态指令调整
[3]
Multimodal In-Context Learning: #多模态情境学习
[4]
Multimodal Chain-of-Thought: #多模态思维链条
[5]
LLM-Aided Visual Reasoning: #由llm辅助的视觉推理
[6]
Foundation Models: #基础模型
[7]
Others: #其他
[8]
Awesome Datasets: #超棒的数据集
[9]
Datasets of Pre-Training for Alignment: #对齐预训练的数据集
[10]
Datasets of Multimodal Instruction Tuning: #多模态指令调整的数据集
[11]
Datasets of In-Context Learning: #情境学习的数据集
[12]
Datasets of Multimodal Chain-of-Thought: #多模态思维链条的数据集
[13]
Others: #其他-1
[14]
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[15]
Github: https://github.com/lyuchenyang/Macaw-LLM
[16]
Coming soon:
[17]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[18]
Github: https://github.com/OpenLAMM/LAMM
[19]
Demo: https://huggingface.co/spaces/openlamm/LAMM
[20]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[21]
Github: https://github.com/mbzuai-oryx/Video-ChatGPT
[22]
Demo: https://www.ival-mbzuai.com/video-chatgpt
[23]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[24]
Github: https://github.com/Luodian/Otter
[25]
Demo: https://otter.cliangyu.com/
[26]
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[27]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding: https://arxiv.org/pdf/2306.02858.pdf
[28]
Github: https://github.com/DAMO-NLP-SG/Video-LLaMA
[29]
Demo: https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA
[30]
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[31]
Github: https://github.com/microsoft/LLaVA-Med
[32]
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[33]
Github: https://github.com/StevenGrove/GPT4Tools
[34]
Demo: https://huggingface.co/spaces/stevengrove/GPT4Tools
[35]
PandaGPT: One Model To Instruction-Follow Them All: https://arxiv.org/pdf/2305.16355.pdf
[36]
Github: https://github.com/yxuansu/PandaGPT
[37]
Demo: https://huggingface.co/spaces/GMFTBY/PandaGPT
[38]
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[39]
Github: https://github.com/joez17/ChatBridge
[40]
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models: https://arxiv.org/pdf/2305.15023.pdf
[41]
Github: https://github.com/luogen1996/LaVIN
[42]
DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[43]
Github: https://github.com/OptimalScale/DetGPT
[44]
Demo: https://d3c431c0c77b1d9010.gradio.live/
[45]
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks: https://arxiv.org/pdf/2305.11175.pdf
[46]
Github: https://github.com/OpenGVLab/VisionLLM
[47]
Demo: https://igpt.opengvlab.com/
[48]
Listen, Think, and Understand: https://arxiv.org/pdf/2305.10790.pdf
[49]
Github: https://github.com/YuanGongND/ltu
[50]
Demo: https://github.com/YuanGongND/ltu
[51]
Github: https://github.com/THUDM/VisualGLM-6B
[52]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[53]
Github: https://github.com/xiaoman-zhang/PMC-VQA
[54]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning: https://arxiv.org/pdf/2305.06500.pdf
[55]
Github: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
[56]
VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[57]
Github: https://github.com/OpenGVLab/Ask-Anything
[58]
Demo: https://ask.opengvlab.com/
[59]
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans: https://arxiv.org/pdf/2305.04790.pdf
[60]
Github: https://github.com/open-mmlab/Multimodal-GPT
[61]
Demo: https://mmgpt.openmmlab.org.cn/
[62]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[63]
Github: https://github.com/phellonchen/X-LLM
[64]
LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[65]
Github: https://github.com/YunxinLi/LingCloud
[66]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model: https://arxiv.org/pdf/2304.15010.pdf
[67]
Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[68]
Demo: http://llama-adapter.opengvlab.com/
[69]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[70]
Github: https://github.com/X-PLUG/mPLUG-Owl
[71]
Demo: https://huggingface.co/spaces/MAGAer13/mPLUG-Owl
[72]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[73]
Github: https://github.com/Vision-CAIR/MiniGPT-4
[74]
Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[75]
GitHub: https://github.com/haotian-liu/LLaVA
[76]
Demo: https://llava.hliu.cc/
[77]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention: https://arxiv.org/pdf/2303.16199.pdf
[78]
Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[79]
Demo: https://huggingface.co/spaces/csuhan/LLaMA-Adapter
[80]
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[81]
Github: https://github.com/VT-NLP/MultiInstruct
[82]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[83]
Github: https://github.com/Luodian/Otter
[84]
Demo: https://otter.cliangyu.com/
[85]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[86]
Github: https://github.com/lupantech/chameleon-llm
[87]
Demo: https://chameleon-llm.github.io/
[88]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[89]
Github: https://github.com/microsoft/JARVIS
[90]
Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[91]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[92]
Github: https://github.com/microsoft/MM-REACT
[93]
Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[94]
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering: https://arxiv.org/pdf/2303.01903.pdf
[95]
Github: https://github.com/MILVLG/prophet
[96]
Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[97]
Github: https://github.com/allenai/visprog
[98]
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA: https://ojs.aaai.org/index.php/AAAI/article/download/20215/19974
[99]
Github: https://github.com/microsoft/PICa
[100]
Flamingo: a Visual Language Model for Few-Shot Learning: https://arxiv.org/pdf/2204.14198.pdf
[101]
Github: https://github.com/mlfoundations/open_flamingo
[102]
Demo: https://huggingface.co/spaces/dhansmair/flamingo-mini-cap
[103]
Multimodal Few-Shot Learning with Frozen Language Models: https://arxiv.org/pdf/2106.13884.pdf
[104]
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[105]
Github: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[106]
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[107]
Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[108]
Github: https://github.com/ttengwang/Caption-Anything
[109]
Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[110]
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings: https://arxiv.org/pdf/2305.02317.pdf
[111]
Coming soon: https://github.com/dannyrose30/VCOT
[112]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[113]
Github: https://github.com/lupantech/chameleon-llm
[114]
Demo: https://chameleon-llm.github.io/
[115]
Chain of Thought Prompt Tuning in Vision Language Models: https://arxiv.org/pdf/2304.07919.pdf
[116]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[117]
Github: https://github.com/microsoft/MM-REACT
[118]
Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[119]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[120]
Github: https://github.com/microsoft/TaskMatrix
[121]
Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[122]
Multimodal Chain-of-Thought Reasoning in Language Models: https://arxiv.org/pdf/2302.00923.pdf
[123]
Github: https://github.com/amazon-science/mm-cot
[124]
Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[125]
Github: https://github.com/allenai/visprog
[126]
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[127]
Github: https://github.com/lupantech/ScienceQA
[128]
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[129]
Github: https://github.com/StevenGrove/GPT4Tools
[130]
Demo: https://c60eb7e9400930f31b.gradio.live/
[131]
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models: https://arxiv.org/pdf/2305.15393.pdf
[132]
Github: https://github.com/weixi-feng/LayoutGPT
[133]
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models: https://arxiv.org/pdf/2305.14985.pdf
[134]
Github: https://github.com/Hxyou/IdealGPT
[135]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[136]
Github: https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat
[137]
Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[138]
Github: https://github.com/ttengwang/Caption-Anything
[139]
Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[140]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[141]
Github: https://github.com/lupantech/chameleon-llm
[142]
Demo: https://chameleon-llm.github.io/
[143]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[144]
Github: https://github.com/microsoft/JARVIS
[145]
Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[146]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[147]
Github: https://github.com/microsoft/MM-REACT
[148]
Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[149]
ViperGPT: Visual Inference via Python Execution for Reasoning: https://arxiv.org/pdf/2303.08128.pdf
[150]
Github: https://github.com/cvlab-columbia/viper
[151]
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions: https://arxiv.org/pdf/2303.06594.pdf
[152]
Github: https://github.com/Vision-CAIR/ChatCaptioner
[153]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[154]
Github: https://github.com/microsoft/TaskMatrix
[155]
Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[156]
Prompt, Generate, then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners: https://arxiv.org/pdf/2303.02151.pdf
[157]
Github: https://github.com/ZrrSkywalker/CaFo
[158]
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning: https://arxiv.org/pdf/2211.11682.pdf
[159]
Github: https://github.com/yangyangyang127/PointCLIP_V2
[160]
Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[161]
Github: https://github.com/allenai/visprog
[162]
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language: https://arxiv.org/pdf/2204.00598.pdf
[163]
Github: https://github.com/google-research/google-research/tree/master/socraticmodels
[164]
Transfer Visual Prompt Generator across LLMs: https://arxiv.org/pdf/2305.01278.pdf
[165]
Github: https://github.com/VPGTrans/VPGTrans
[166]
Demo: https://3fc7715dbc44234a7f.gradio.live/
[167]
GPT-4 Technical Report: https://arxiv.org/pdf/2303.08774.pdf
[168]
PaLM-E: An Embodied Multimodal Language Model: https://arxiv.org/pdf/2303.03378.pdf
[169]
Demo: https://palm-e.github.io/#demo
[170]
Prismer: A Vision-Language Model with An Ensemble of Experts: https://arxiv.org/pdf/2303.02506.pdf
[171]
Github: https://github.com/NVlabs/prismer
[172]
Demo: https://huggingface.co/spaces/lorenmt/prismer
[173]
Language Is Not All You Need: Aligning Perception with Language Models: https://arxiv.org/pdf/2302.14045.pdf
[174]
Github: https://github.com/microsoft/unilm
[175]
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models: https://arxiv.org/pdf/2301.12597.pdf
[176]
Github: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[177]
Demo: https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb
[178]
VIMA: General Robot Manipulation with Multimodal Prompts: https://arxiv.org/pdf/2210.03094.pdf
[179]
Github: https://github.com/vimalabs/VIMA
[180]
Can Large Pre-trained Models Help Vision Models on Perception Tasks?: https://arxiv.org/pdf/2306.00693.pdf
[181]
Coming soon:
[182]
Contextual Object Detection with Multimodal Large Language Models: https://arxiv.org/pdf/2305.18279.pdf
[183]
Github: https://github.com/yuhangzang/ContextDET
[184]
Demo: https://huggingface.co/spaces/yuhangzang/ContextDet-Demo
[185]
Generating Images with Multimodal Language Models: https://arxiv.org/pdf/2305.17216.pdf
[186]
Github: https://github.com/kohjingyu/gill
[187]
On Evaluating Adversarial Robustness of Large Vision-Language Models: https://arxiv.org/pdf/2305.16934.pdf
[188]
Github: https://github.com/yunqing-me/AttackVLM
[189]
Evaluating Object Hallucination in Large Vision-Language Models: https://arxiv.org/pdf/2305.10355.pdf
[190]
Github: https://github.com/RUCAIBox/POPE
[191]
Grounding Language Models to Images for Multimodal Inputs and Outputs: https://arxiv.org/pdf/2301.13823.pdf
[192]
Github: https://github.com/kohjingyu/fromage
[193]
Demo: https://huggingface.co/spaces/jykoh/fromage
[194]
Microsoft COCO: Common Objects in Context: https://arxiv.org/pdf/1405.0312.pdf
[195]
Im2Text: Describing Images Using 1 Million Captioned Photographs: https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf
[196]
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning: https://aclanthology.org/P18-1238.pdf
[197]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs: https://arxiv.org/pdf/2111.02114.pdf
[198]
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations: https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf
[199]
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models: https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf
[200]
AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding: https://arxiv.org/pdf/1711.06475.pdf
[201]
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark: https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf
[202]
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks: https://arxiv.org/pdf/2306.04362.pdf
[203]
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language: https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf
[204]
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval: https://arxiv.org/pdf/2104.00650.pdf
[205]
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research: https://arxiv.org/pdf/2303.17395.pdf
[206]
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline: https://arxiv.org/pdf/1709.05522.pdf
[207]
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale: https://arxiv.org/pdf/1808.10583.pdf
[208]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[209]
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[210]
Link: https://github.com/lyuchenyang/Macaw-LLM/tree/main/data
[211]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[212]
Link: https://github.com/OpenLAMM/LAMM#lamm-dataset
[213]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[214]
Link: https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder
[215]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[216]
Coming soon: https://github.com/Luodian/Otter
[217]
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[218]
Link: https://huggingface.co/datasets/MMInstruction/M3IT
[219]
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[220]
Coming soon: https://github.com/microsoft/LLaVA-Med#llava-med-dataset
[221]
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[222]
Link: https://github.com/StevenGrove/GPT4Tools#dataset
[223]
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[224]
Coming soon: https://iva-chatbridge.github.io/
[225]
DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[226]
Link: https://github.com/OptimalScale/DetGPT/tree/main/dataset
[227]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[228]
Coming soon: https://xiaoman-zhang.github.io/PMC-VQA/
[229]
VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[230]
Link: https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
[231]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[232]
Link: https://github.com/phellonchen/X-LLM
[233]
LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[234]
Link: https://huggingface.co/datasets/YunxinLi/Multimodal_Insturction_Data_V2
[235]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[236]
Link: https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align
[237]
Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[238]
Link: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
[239]
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[240]
Link: https://github.com/VT-NLP/MultiInstruct
[241]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[242]
Coming soon: https://github.com/Luodian/Otter
[243]
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[244]
Coming soon: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[245]
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[246]
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[247]
Link: https://github.com/lupantech/ScienceQA#ghost-download-the-dataset
[248]
IMAD: IMage-Augmented multi-modal Dialogue: https://arxiv.org/pdf/2305.10512.pdf
[249]
Link: https://github.com/VityaVitalich/IMAD
[250]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[251]
Link: https://github.com/OpenLAMM/LAMM#lamm-benchmark
[252]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[253]
Link: https://github.com/X-PLUG/mPLUG-Owl/tree/main/OwlEval
[254]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[255]
Link: https://github.com/mbzuai-oryx/Video-ChatGPT#quantitative-evaluation-bar_chart
[256]
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models: https://arxiv.org/pdf/2306.09265.pdf
[257]
Link: https://github.com/OpenGVLab/Multi-Modality-Arena
[258]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[259]
Link: https://drive.google.com/drive/folders/1TqBzkyqxOSg1hgCXF8JjpYIAuRV-uVft
[260]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[261]
Link: https://drive.google.com/drive/folders/1Saaia2rRRb1nz5sKdmpzYdS4jHiMDaP0
[262]
GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models