In today's AI boom, new AI applications keep entering the public eye, and AI is reshaping industry after industry. In the author's view, if ChatGPT marked the start of the AI revolution, then multimodal large models surely represent the future of AI applications.
This article is a resource collection for multimodal large language models (MLLMs), gathering papers, applications, datasets, and other learning materials for a wide range of models, large and small. Feel free to like and bookmark it.
The author has previously written about some of the projects below; a partial list follows for interested readers:
Audiocraft: a PyTorch-based deep-learning research library for AI audio generation
Recognize_Anything-Tag2Text: a powerful image tagging model together with Tag2Text
......
🔥🔥🔥 This is a curated list of Multimodal Large Language Models (MLLMs), covering datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, foundation models, and more.
🔥🔥🔥 The list is updated continuously.
🔥🔥🔥 A survey paper on MLLMs is in preparation and will be released soon!
Table of Contents
• Awesome Papers[1]
• Multimodal Instruction Tuning[2]
• Multimodal In-Context Learning[3]
• Multimodal Chain-of-Thought[4]
• LLM-Aided Visual Reasoning[5]
• Foundation Models[6]
• Others[7]
• Awesome Datasets[8]
• Datasets of Pre-Training for Alignment[9]
• Datasets of Multimodal Instruction Tuning[10]
• Datasets of In-Context Learning[11]
• Datasets of Multimodal Chain-of-Thought[12]
• Others[13]
Chinese translations of some of the papers below are available; contact the author if you need them.
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration[14] | arXiv | 2023-06-15 | Github[15] | Coming soon[16] |
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[17] | arXiv | 2023-06-11 | Github[18] | Demo[19] |
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[20] | arXiv | 2023-06-08 | Github[21] | Demo[22] |
MIMIC-IT: Multi-Modal In-Context Instruction Tuning[23] | arXiv | 2023-06-08 | Github[24] | Demo[25] |
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[26] | arXiv | 2023-06-07 | - | - |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding[27] | arXiv | 2023-06-05 | Github[28] | Demo[29] |
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day[30] | arXiv | 2023-06-01 | Github[31] | - |
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[32] | arXiv | 2023-05-30 | Github[33] | Demo[34] |
PandaGPT: One Model To Instruction-Follow Them All[35] | arXiv | 2023-05-25 | Github[36] | Demo[37] |
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst[38] | arXiv | 2023-05-25 | Github[39] | - |
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models[40] | arXiv | 2023-05-24 | Github[41] | Local Demo |
DetGPT: Detect What You Need via Reasoning[42] | arXiv | 2023-05-23 | Github[43] | Demo[44] |
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks[45] | arXiv | 2023-05-18 | Github[46] | Demo[47] |
Listen, Think, and Understand[48] | arXiv | 2023-05-18 | Github[49] | Demo[50] |
VisualGLM-6B | - | 2023-05-17 | Github[51] | Local Demo |
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering[52] | arXiv | 2023-05-17 | Github[53] | - |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning[54] | arXiv | 2023-05-11 | Github[55] | Local Demo |
VideoChat: Chat-Centric Video Understanding[56] | arXiv | 2023-05-10 | Github[57] | Demo[58] |
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans[59] | arXiv | 2023-05-08 | Github[60] | Demo[61] |
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[62] | arXiv | 2023-05-07 | Github[63] | - |
LMEye: An Interactive Perception Network for Large Language Models[64] | arXiv | 2023-05-05 | Github[65] | Local Demo |
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model[66] | arXiv | 2023-04-28 | Github[67] | Demo[68] |
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality[69] | arXiv | 2023-04-27 | Github[70] | Demo[71] |
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models[72] | arXiv | 2023-04-20 | Github[73] | - |
Visual Instruction Tuning[74] | arXiv | 2023-04-17 | GitHub[75] | Demo[76] |
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention[77] | arXiv | 2023-03-28 | Github[78] | Demo[79] |
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning[80] | ACL | 2022-12-21 | Github[81] | - |
The author has prepared Chinese versions of some of these papers; message the author privately to get them. A sample is shown below:
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
MIMIC-IT: Multi-Modal In-Context Instruction Tuning[82] | arXiv | 2023-06-08 | Github[83] | Demo[84] |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[85] | arXiv | 2023-04-19 | Github[86] | Demo[87] |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace[88] | arXiv | 2023-03-30 | Github[89] | Demo[90] |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[91] | arXiv | 2023-03-20 | Github[92] | Demo[93] |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering[94] | CVPR | 2023-03-03 | Github[95] | - |
Visual Programming: Compositional Visual Reasoning Without Training[96] | CVPR | 2022-11-18 | Github[97] | Local Demo |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA[98] | AAAI | 2022-06-28 | Github[99] | - |
Flamingo: a Visual Language Model for Few-Shot Learning[100] | NeurIPS | 2022-04-29 | Github[101] | Demo[102] |
Multimodal Few-Shot Learning with Frozen Language Models[103] | NeurIPS | 2021-06-25 | - | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought[104] | arXiv | 2023-05-24 | Github[105] | - |
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction[106] | arXiv | 2023-05-23 | - | - |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls[107] | arXiv | 2023-05-04 | Github[108] | Demo[109] |
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings[110] | arXiv | 2023-05-03 | Coming soon[111] | - |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[112] | arXiv | 2023-04-19 | Github[113] | Demo[114] |
Chain of Thought Prompt Tuning in Vision Language Models[115] | arXiv | 2023-04-16 | Coming soon | - |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[116] | arXiv | 2023-03-20 | Github[117] | Demo[118] |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models[119] | arXiv | 2023-03-08 | Github[120] | Demo[121] |
Multimodal Chain-of-Thought Reasoning in Language Models[122] | arXiv | 2023-02-02 | Github[123] | - |
Visual Programming: Compositional Visual Reasoning Without Training[124] | CVPR | 2022-11-18 | Github[125] | Local Demo |
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering[126] | NeurIPS | 2022-09-20 | Github[127] | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[128] | arXiv | 2023-05-30 | Github[129] | Demo[130] |
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models[131] | arXiv | 2023-05-24 | Github[132] | - |
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models[133] | arXiv | 2023-05-24 | Github[134] | Local Demo |
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[135] | arXiv | 2023-05-10 | Github[136] | - |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls[137] | arXiv | 2023-05-04 | Github[138] | Demo[139] |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[140] | arXiv | 2023-04-19 | Github[141] | Demo[142] |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace[143] | arXiv | 2023-03-30 | Github[144] | Demo[145] |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[146] | arXiv | 2023-03-20 | Github[147] | Demo[148] |
ViperGPT: Visual Inference via Python Execution for Reasoning[149] | arXiv | 2023-03-14 | Github[150] | Local Demo |
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions[151] | arXiv | 2023-03-12 | Github[152] | Local Demo |
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models[153] | arXiv | 2023-03-08 | Github[154] | Demo[155] |
Prompt, Generate, then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners[156] | CVPR | 2023-03-03 | Github[157] | - |
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning[158] | CVPR | 2022-11-21 | Github[159] | - |
Visual Programming: Compositional Visual Reasoning Without Training[160] | CVPR | 2022-11-18 | Github[161] | Local Demo |
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language[162] | arXiv | 2022-04-01 | Github[163] | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Transfer Visual Prompt Generator across LLMs[164] | arXiv | 2023-05-02 | Github[165] | Demo[166] |
GPT-4 Technical Report[167] | arXiv | 2023-03-15 | - | - |
PaLM-E: An Embodied Multimodal Language Model[168] | arXiv | 2023-03-06 | - | Demo[169] |
Prismer: A Vision-Language Model with An Ensemble of Experts[170] | arXiv | 2023-03-04 | Github[171] | Demo[172] |
Language Is Not All You Need: Aligning Perception with Language Models[173] | arXiv | 2023-02-27 | Github[174] | - |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[175] | arXiv | 2023-01-30 | Github[176] | Demo[177] |
VIMA: General Robot Manipulation with Multimodal Prompts[178] | ICML | 2022-10-06 | Github[179] | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Can Large Pre-trained Models Help Vision Models on Perception Tasks?[180] | arXiv | 2023-06-01 | Coming soon[181] | - |
Contextual Object Detection with Multimodal Large Language Models[182] | arXiv | 2023-05-29 | Github[183] | Demo[184] |
Generating Images with Multimodal Language Models[185] | arXiv | 2023-05-26 | Github[186] | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models[187] | arXiv | 2023-05-26 | Github[188] | - |
Evaluating Object Hallucination in Large Vision-Language Models[189] | arXiv | 2023-05-17 | Github[190] | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs[191] | ICML | 2023-01-31 | Github[192] | Demo[193] |
Name | Paper | Type | Modality |
---|---|---|---|
MS-COCO | Microsoft COCO: Common Objects in Context[194] | Caption | Image-Text |
SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs[195] | Caption | Image-Text |
Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning[196] | Caption | Image-Text |
LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs[197] | Caption | Image-Text |
VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations[198] | Caption | Image-Text |
Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models[199] | Caption | Image-Text |
AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding[200] | Caption | Image-Text |
Wukong | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark[201] | Caption | Image-Text |
Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks[202] | Caption | Video-Text |
MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language[203] | Caption | Video-Text |
Webvid10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval[204] | Caption | Video-Text |
WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research[205] | Caption | Audio-Text |
AISHELL-1 | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline[206] | ASR | Audio-Text |
AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale[207] | ASR | Audio-Text |
VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[208] | ASR | Image-Audio-Text |
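The caption-type pairs above (Conceptual Captions, LAION-400M, and similar) are commonly distributed not as image files but as rows of (caption, image URL) that you download and filter yourself. The two-column TSV layout below is an illustrative sketch of that convention, not the exact on-disk format of any one release:

```python
import csv
import io

# Hypothetical two-column TSV: caption <TAB> image URL, one pair per row.
# Real releases vary (TSV, parquet, webdataset shards), so treat this
# layout as an assumption for illustration only.
raw = (
    "a dog running on a beach\thttp://example.com/1.jpg\n"
    "two cups of coffee on a table\thttp://example.com/2.jpg\n"
)

# Parse each row into an (image_url, caption) training pair.
pairs = [
    (url, caption)
    for caption, url in csv.reader(io.StringIO(raw), delimiter="\t")
]

print(len(pairs))      # 2
print(pairs[0][1])     # a dog running on a beach
```

A downloader would then fetch each URL, drop broken links, and feed the surviving image-text pairs to a contrastive or captioning pre-training objective.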
Name | Paper | Link | Notes |
---|---|---|---|
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration[209] | Link[210] | A large-scale multimodal instruction dataset with multi-turn dialogues |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[211] | Link[212] | A comprehensive multimodal instruction-tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[213] | Link[214] | 100K high-quality video instruction data |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[215] | Coming soon[216] | Multimodal in-context instruction tuning |
M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[217] | Link[218] | A large-scale, broad-coverage multimodal instruction-tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day[219] | Coming soon[220] | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[221] | Link[222] | Tool-related instruction dataset |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst[223] | Coming soon[224] | A multimodal instruction-tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning[225] | Link[226] | An instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering[227] | Coming soon[228] | A large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding[229] | Link[230] | A video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[231] | Link[232] | A Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models[233] | Link[234] | A multimodal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models[235] | Link[236] | A multimodal aligned dataset for improving the model's usability and generation fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning[237] | Link[238] | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning[239] | Link[240] | The first multimodal instruction-tuning benchmark dataset |
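Most of the instruction-tuning sets above share a similar record layout: an image reference plus a multi-turn conversation, which is flattened into a single supervised-training string. The sketch below follows the JSON shape of LLaVA-Instruct-150K ("image", "conversations" with "from"/"value" turns), but the field names and the prompt template should be treated as illustrative assumptions rather than a guaranteed schema:

```python
# One hypothetical record in LLaVA-style visual instruction data.
# Field names mirror the released JSON but are assumptions here.
record = {
    "id": "000000215677",
    "image": "000000215677.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the cat doing?"},
        {"from": "gpt", "value": "The cat is sleeping on a red sofa."},
    ],
}

def to_training_text(rec, system="A chat between a user and an assistant."):
    """Flatten a multi-turn record into one supervised-training string."""
    parts = [system]
    for turn in rec["conversations"]:
        # Map the dataset's "human"/"gpt" roles onto prompt-template roles.
        role = "USER" if turn["from"] == "human" else "ASSISTANT"
        parts.append(f"{role}: {turn['value']}")
    return "\n".join(parts)

print(to_training_text(record))
```

During training, the `<image>` placeholder is typically replaced by the visual encoder's image tokens, and the loss is computed only on the ASSISTANT spans.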
Name | Paper | Link | Notes |
---|---|---|---|
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[241] | Coming soon[242] | A multimodal in-context instruction dataset |
Name | Paper | Link | Notes |
---|---|---|---|
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought[243] | Coming soon[244] | A large-scale embodied planning dataset |
VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction[245] | Coming soon | An inference-time dataset for evaluating VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering[246] | Link[247] | A large-scale multi-choice dataset featuring multimodal science questions across diverse domains |
Name | Paper | Link | Notes |
---|---|---|---|
IMAD | IMAD: IMage-Augmented multi-modal Dialogue[248] | Link[249] | A multimodal dialogue dataset |
LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[250] | Link[251] | A benchmark for quantitatively evaluating MLLMs on a variety of 2D/3D vision tasks |
OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality[252] | Link[253] | A dataset for evaluating multiple capabilities |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[254] | Link[255] | A quantitative evaluation framework for video conversation models |
LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models[256] | Link[257] | An evaluation platform for MLLMs |
CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[258] | Link[259] | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[260] | Link[261] | A manually photographed multimodal fine-tuning dataset for learning to reject instructions |
The content above is mainly translated and organized from GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models[262] and will be updated continuously. Please like and bookmark!
[1]
Awesome Papers: #超棒的论文
[2]
Multimodal Instruction Tuning: #多模态指令调整
[3]
Multimodal In-Context Learning: #多模态情境学习
[4]
Multimodal Chain-of-Thought: #多模态思维链条
[5]
LLM-Aided Visual Reasoning: #由llm辅助的视觉推理
[6]
Foundation Models: #基础模型
[7]
Others: #其他
[8]
Awesome Datasets: #超棒的数据集
[9]
Datasets of Pre-Training for Alignment: #对齐预训练的数据集
[10]
Datasets of Multimodal Instruction Tuning: #多模态指令调整的数据集
[11]
Datasets of In-Context Learning: #情境学习的数据集
[12]
Datasets of Multimodal Chain-of-Thought: #多模态思维链条的数据集
[13]
Others: #其他-1
[14]
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[15]
Github: https://github.com/lyuchenyang/Macaw-LLM
[16]
Coming soon:
[17]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[18]
Github: https://github.com/OpenLAMM/LAMM
[19]
Demo: https://huggingface.co/spaces/openlamm/LAMM
[20]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[21]
Github: https://github.com/mbzuai-oryx/Video-ChatGPT
[22]
Demo: https://www.ival-mbzuai.com/video-chatgpt
[23]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[24]
Github: https://github.com/Luodian/Otter
[25]
Demo: https://otter.cliangyu.com/
[26]
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[27]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding: https://arxiv.org/pdf/2306.02858.pdf
[28]
Github: https://github.com/DAMO-NLP-SG/Video-LLaMA
[29]
Demo: https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA
[30]
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[31]
Github: https://github.com/microsoft/LLaVA-Med
[32]
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[33]
Github: https://github.com/StevenGrove/GPT4Tools
[34]
Demo: https://huggingface.co/spaces/stevengrove/GPT4Tools
[35]
PandaGPT: One Model To Instruction-Follow Them All: https://arxiv.org/pdf/2305.16355.pdf
[36]
Github: https://github.com/yxuansu/PandaGPT
[37]
Demo: https://huggingface.co/spaces/GMFTBY/PandaGPT
[38]
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[39]
Github: https://github.com/joez17/ChatBridge
[40]
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models: https://arxiv.org/pdf/2305.15023.pdf
[41]
Github: https://github.com/luogen1996/LaVIN
[42]
DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[43]
Github: https://github.com/OptimalScale/DetGPT
[44]
Demo: https://d3c431c0c77b1d9010.gradio.live/
[45]
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks: https://arxiv.org/pdf/2305.11175.pdf
[46]
Github: https://github.com/OpenGVLab/VisionLLM
[47]
Demo: https://igpt.opengvlab.com/
[48]
Listen, Think, and Understand: https://arxiv.org/pdf/2305.10790.pdf
[49]
Github: https://github.com/YuanGongND/ltu
[50]
Demo: https://github.com/YuanGongND/ltu
[51]
Github: https://github.com/THUDM/VisualGLM-6B
[52]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[53]
Github: https://github.com/xiaoman-zhang/PMC-VQA
[54]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning: https://arxiv.org/pdf/2305.06500.pdf
[55]
Github: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
[56]
VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[57]
Github: https://github.com/OpenGVLab/Ask-Anything
[58]
Demo: https://ask.opengvlab.com/
[59]
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans: https://arxiv.org/pdf/2305.04790.pdf
[60]
Github: https://github.com/open-mmlab/Multimodal-GPT
[61]
Demo: https://mmgpt.openmmlab.org.cn/
[62]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[63]
Github: https://github.com/phellonchen/X-LLM
[64]
LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[65]
Github: https://github.com/YunxinLi/LingCloud
[66]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model: https://arxiv.org/pdf/2304.15010.pdf
[67]
Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[68]
Demo: http://llama-adapter.opengvlab.com/
[69]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[70]
Github: https://github.com/X-PLUG/mPLUG-Owl
[71]
Demo: https://huggingface.co/spaces/MAGAer13/mPLUG-Owl
[72]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[73]
Github: https://github.com/Vision-CAIR/MiniGPT-4
[74]
Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[75]
GitHub: https://github.com/haotian-liu/LLaVA
[76]
Demo: https://llava.hliu.cc/
[77]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention: https://arxiv.org/pdf/2303.16199.pdf
[78]
Github: https://github.com/ZrrSkywalker/LLaMA-Adapter
[79]
Demo: https://huggingface.co/spaces/csuhan/LLaMA-Adapter
[80]
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[81]
Github: https://github.com/VT-NLP/MultiInstruct
[82]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[83]
Github: https://github.com/Luodian/Otter
[84]
Demo: https://otter.cliangyu.com/
[85]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[86]
Github: https://github.com/lupantech/chameleon-llm
[87]
Demo: https://chameleon-llm.github.io/
[88]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[89]
Github: https://github.com/microsoft/JARVIS
[90]
Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[91]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[92]
Github: https://github.com/microsoft/MM-REACT
[93]
Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[94]
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering: https://arxiv.org/pdf/2303.01903.pdf
[95]
Github: https://github.com/MILVLG/prophet
[96]
Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[97]
Github: https://github.com/allenai/visprog
[98]
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA: https://ojs.aaai.org/index.php/AAAI/article/download/20215/19974
[99]
Github: https://github.com/microsoft/PICa
[100]
Flamingo: a Visual Language Model for Few-Shot Learning: https://arxiv.org/pdf/2204.14198.pdf
[101]
Github: https://github.com/mlfoundations/open_flamingo
[102]
Demo: https://huggingface.co/spaces/dhansmair/flamingo-mini-cap
[103]
Multimodal Few-Shot Learning with Frozen Language Models: https://arxiv.org/pdf/2106.13884.pdf
[104]
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[105]
Github: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[106]
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[107]
Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[108]
Github: https://github.com/ttengwang/Caption-Anything
[109]
Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[110]
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings: https://arxiv.org/pdf/2305.02317.pdf
[111]
Coming soon: https://github.com/dannyrose30/VCOT
[112]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[113]
Github: https://github.com/lupantech/chameleon-llm
[114]
Demo: https://chameleon-llm.github.io/
[115]
Chain of Thought Prompt Tuning in Vision Language Models: https://arxiv.org/pdf/2304.07919.pdf
[116]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[117]
Github: https://github.com/microsoft/MM-REACT
[118]
Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[119]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[120]
Github: https://github.com/microsoft/TaskMatrix
[121]
Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[122]
Multimodal Chain-of-Thought Reasoning in Language Models: https://arxiv.org/pdf/2302.00923.pdf
[123]
Github: https://github.com/amazon-science/mm-cot
[124]
Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[125]
Github: https://github.com/allenai/visprog
[126]
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[127]
Github: https://github.com/lupantech/ScienceQA
[128]
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[129]
Github: https://github.com/StevenGrove/GPT4Tools
[130]
Demo: https://c60eb7e9400930f31b.gradio.live/
[131]
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models: https://arxiv.org/pdf/2305.15393.pdf
[132]
Github: https://github.com/weixi-feng/LayoutGPT
[133]
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models: https://arxiv.org/pdf/2305.14985.pdf
[134]
Github: https://github.com/Hxyou/IdealGPT
[135]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[136]
Github: https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat
[137]
Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[138]
Github: https://github.com/ttengwang/Caption-Anything
[139]
Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[140]
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[141]
Github: https://github.com/lupantech/chameleon-llm
[142]
Demo: https://chameleon-llm.github.io/
[143]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[144]
Github: https://github.com/microsoft/JARVIS
[145]
Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[146]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[147]
Github: https://github.com/microsoft/MM-REACT
[148]
Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[149]
ViperGPT: Visual Inference via Python Execution for Reasoning: https://arxiv.org/pdf/2303.08128.pdf
[150]
Github: https://github.com/cvlab-columbia/viper
[151]
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions: https://arxiv.org/pdf/2303.06594.pdf
[152]
Github: https://github.com/Vision-CAIR/ChatCaptioner
[153]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[154]
Github: https://github.com/microsoft/TaskMatrix
[155]
Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[156]
Prompt, Generate, then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners: https://arxiv.org/pdf/2303.02151.pdf
[157]
Github: https://github.com/ZrrSkywalker/CaFo
[158]
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning: https://arxiv.org/pdf/2211.11682.pdf
[159]
Github: https://github.com/yangyangyang127/PointCLIP_V2
[160]
Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[161]
Github: https://github.com/allenai/visprog
[162]
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language: https://arxiv.org/pdf/2204.00598.pdf
[163]
Github: https://github.com/google-research/google-research/tree/master/socraticmodels
[164]
Transfer Visual Prompt Generator across LLMs: https://arxiv.org/pdf/2305.01278.pdf
[165]
Github: https://github.com/VPGTrans/VPGTrans
[166]
Demo: https://3fc7715dbc44234a7f.gradio.live/
[167]
GPT-4 Technical Report: https://arxiv.org/pdf/2303.08774.pdf
[168]
PaLM-E: An Embodied Multimodal Language Model: https://arxiv.org/pdf/2303.03378.pdf
[169]
Demo: https://palm-e.github.io/#demo
[170]
Prismer: A Vision-Language Model with An Ensemble of Experts: https://arxiv.org/pdf/2303.02506.pdf
[171]
Github: https://github.com/NVlabs/prismer
[172]
Demo: https://huggingface.co/spaces/lorenmt/prismer
[173]
Language Is Not All You Need: Aligning Perception with Language Models: https://arxiv.org/pdf/2302.14045.pdf
[174]
Github: https://github.com/microsoft/unilm
[175]
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models: https://arxiv.org/pdf/2301.12597.pdf
[176]
Github: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[177]
Demo: https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb
[178]
VIMA: General Robot Manipulation with Multimodal Prompts: https://arxiv.org/pdf/2210.03094.pdf
[179]
Github: https://github.com/vimalabs/VIMA
[180]
Can Large Pre-trained Models Help Vision Models on Perception Tasks?: https://arxiv.org/pdf/2306.00693.pdf
[181]
Coming soon:
[182]
Contextual Object Detection with Multimodal Large Language Models: https://arxiv.org/pdf/2305.18279.pdf
[183]
Github: https://github.com/yuhangzang/ContextDET
[184]
Demo: https://huggingface.co/spaces/yuhangzang/ContextDet-Demo
[185]
Generating Images with Multimodal Language Models: https://arxiv.org/pdf/2305.17216.pdf
[186]
Github: https://github.com/kohjingyu/gill
[187]
On Evaluating Adversarial Robustness of Large Vision-Language Models: https://arxiv.org/pdf/2305.16934.pdf
[188]
Github: https://github.com/yunqing-me/AttackVLM
[189]
Evaluating Object Hallucination in Large Vision-Language Models: https://arxiv.org/pdf/2305.10355.pdf
[190]
Github: https://github.com/RUCAIBox/POPE
[191]
Grounding Language Models to Images for Multimodal Inputs and Outputs: https://arxiv.org/pdf/2301.13823.pdf
[192]
Github: https://github.com/kohjingyu/fromage
[193]
Demo: https://huggingface.co/spaces/jykoh/fromage
[194]
Microsoft COCO: Common Objects in Context: https://arxiv.org/pdf/1405.0312.pdf
[195]
Im2Text: Describing Images Using 1 Million Captioned Photographs: https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf
[196]
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning: https://aclanthology.org/P18-1238.pdf
[197]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs: https://arxiv.org/pdf/2111.02114.pdf
[198]
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations: https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf
[199]
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models: https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf
[200]
AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding: https://arxiv.org/pdf/1711.06475.pdf
[201]
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark: https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf
[202]
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks: https://arxiv.org/pdf/2306.04362.pdf
[203]
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language: https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf
[204]
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval: https://arxiv.org/pdf/2104.00650.pdf
[205]
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research: https://arxiv.org/pdf/2303.17395.pdf
[206]
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline: https://arxiv.org/pdf/1709.05522.pdf
[207]
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale: https://arxiv.org/pdf/1808.10583.pdf
[208]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[209]
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[210]
Link: https://github.com/lyuchenyang/Macaw-LLM/tree/main/data
[211]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[212]
Link: https://github.com/OpenLAMM/LAMM#lamm-dataset
[213]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[214]
Link: https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder
[215]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[216]
Coming soon: https://github.com/Luodian/Otter
[217]
M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[218]
Link: https://huggingface.co/datasets/MMInstruction/M3IT
[219]
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[220]
Coming soon: https://github.com/microsoft/LLaVA-Med#llava-med-dataset
[221]
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[222]
Link: https://github.com/StevenGrove/GPT4Tools#dataset
[223]
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[224]
Coming soon: https://iva-chatbridge.github.io/
[225]
DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[226]
Link: https://github.com/OptimalScale/DetGPT/tree/main/dataset
[227]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[228]
Coming soon: https://xiaoman-zhang.github.io/PMC-VQA/
[229]
VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[230]
Link: https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
[231]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[232]
Link: https://github.com/phellonchen/X-LLM
[233]
LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[234]
Link: https://huggingface.co/datasets/YunxinLi/Multimodal_Insturction_Data_V2
[235]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[236]
Link: https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align
[237]
Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[238]
Link: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
[239]
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[240]
Link: https://github.com/VT-NLP/MultiInstruct
[241]
MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[242]
Coming soon: https://github.com/Luodian/Otter
[243]
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[244]
Coming soon: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[245]
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[246]
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[247]
Link: https://github.com/lupantech/ScienceQA#ghost-download-the-dataset
[248]
IMAD: IMage-Augmented multi-modal Dialogue: https://arxiv.org/pdf/2305.10512.pdf
[249]
Link: https://github.com/VityaVitalich/IMAD
[250]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[251]
Link: https://github.com/OpenLAMM/LAMM#lamm-benchmark
[252]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[253]
Link: https://github.com/X-PLUG/mPLUG-Owl/tree/main/OwlEval
[254]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[255]
Link: https://github.com/mbzuai-oryx/Video-ChatGPT#quantitative-evaluation-bar_chart
[256]
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models: https://arxiv.org/pdf/2306.09265.pdf
[257]
Link: https://github.com/OpenGVLab/Multi-Modality-Arena
[258]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[259]
Link: https://drive.google.com/drive/folders/1TqBzkyqxOSg1hgCXF8JjpYIAuRV-uVft
[260]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[261]
Link: https://drive.google.com/drive/folders/1Saaia2rRRb1nz5sKdmpzYdS4jHiMDaP0
[262]
GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models