
A Curated Resource Library of Multimodal Large Language Models (LLMs)

Author: 山行AI (WeChat public account)
Published: 2023-06-26

Preface

In the current AI boom, new AI applications keep entering the public eye, and AI is reshaping every industry. In the author's view, if ChatGPT marked the start of the AI revolution, then multimodal large models surely represent the future of AI applications.

This article is a resource library for multimodal large language models. It catalogs papers, applications, datasets, and other learning resources for many MLLMs, large and small; consider liking and bookmarking it.

The author has previously written about some of the projects in this article; a partial list follows for interested readers:

• GPT4All: an AI assistant you can deploy locally
• MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
• Audiocraft: a PyTorch-based deep learning research library for AI audio generation
• Recognize_Anything-Tag2Text: powerful image tagging models (RAM and Tag2Text)
• MLC LLM: deploy any language model natively in local applications
• ...

Awesome Multimodal Large Language Models

🔥🔥🔥 This is a curated list of Multimodal Large Language Models (MLLMs), including datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain of thought, LLM-aided visual reasoning, foundation models, and others.

🔥🔥🔥 This list is updated in real time.

🔥🔥🔥 A survey paper on MLLMs is in preparation and will be released soon!


Contents

• Awesome Papers[1]
  • Multimodal Instruction Tuning[2]
  • Multimodal In-Context Learning[3]
  • Multimodal Chain of Thought[4]
  • LLM-Aided Visual Reasoning[5]
  • Foundation Models[6]
  • Others[7]
• Awesome Datasets[8]
  • Datasets of Pre-Training for Alignment[9]
  • Datasets of Multimodal Instruction Tuning[10]
  • Datasets of In-Context Learning[11]
  • Datasets of Multimodal Chain of Thought[12]
  • Others[13]


Awesome Papers

The author has Chinese translations of some of the papers below; contact the author if you need them.

Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration[14] | arXiv | 2023-06-15 | GitHub[15] | Coming soon[16] |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[17] | arXiv | 2023-06-11 | GitHub[18] | Demo[19] |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[20] | arXiv | 2023-06-08 | GitHub[21] | Demo[22] |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning[23] | arXiv | 2023-06-08 | GitHub[24] | Demo[25] |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[26] | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding[27] | arXiv | 2023-06-05 | GitHub[28] | Demo[29] |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day[30] | arXiv | 2023-06-01 | GitHub[31] | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[32] | arXiv | 2023-05-30 | GitHub[33] | Demo[34] |
| PandaGPT: One Model To Instruction-Follow Them All[35] | arXiv | 2023-05-25 | GitHub[36] | Demo[37] |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst[38] | arXiv | 2023-05-25 | GitHub[39] | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models[40] | arXiv | 2023-05-24 | GitHub[41] | Local Demo |
| DetGPT: Detect What You Need via Reasoning[42] | arXiv | 2023-05-23 | GitHub[43] | Demo[44] |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks[45] | arXiv | 2023-05-18 | GitHub[46] | Demo[47] |
| Listen, Think, and Understand[48] | arXiv | 2023-05-18 | GitHub[49] | Demo[50] |
| VisualGLM-6B | - | 2023-05-17 | GitHub[51] | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering[52] | arXiv | 2023-05-17 | GitHub[53] | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning[54] | arXiv | 2023-05-11 | GitHub[55] | Local Demo |
| VideoChat: Chat-Centric Video Understanding[56] | arXiv | 2023-05-10 | GitHub[57] | Demo[58] |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans[59] | arXiv | 2023-05-08 | GitHub[60] | Demo[61] |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[62] | arXiv | 2023-05-07 | GitHub[63] | - |
| LMEye: An Interactive Perception Network for Large Language Models[64] | arXiv | 2023-05-05 | GitHub[65] | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model[66] | arXiv | 2023-04-28 | GitHub[67] | Demo[68] |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality[69] | arXiv | 2023-04-27 | GitHub[70] | Demo[71] |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models[72] | arXiv | 2023-04-20 | GitHub[73] | - |
| Visual Instruction Tuning[74] | arXiv | 2023-04-17 | GitHub[75] | Demo[76] |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention[77] | arXiv | 2023-03-28 | GitHub[78] | Demo[79] |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning[80] | ACL | 2022-12-21 | GitHub[81] | - |
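
For a concrete sense of what "multimodal instruction tuning" trains on: most of the projects above consume records that pair an image with one or more instruction-response turns. The minimal sketch below follows the field layout of LLaVA's released data (see Visual Instruction Tuning[74]); other projects in the table use their own schemas.

```python
# One visual instruction-tuning record, LLaVA-style (illustrative only;
# other projects in the table define different schemas).
record = {
    "id": "000000215677",                        # sample id, often the image id
    "image": "coco/train2017/000000215677.jpg",  # path into the image corpus
    "conversations": [                           # alternating human/model turns
        {"from": "human", "value": "<image>\nWhat is the man holding?"},
        {"from": "gpt", "value": "The man is holding a red umbrella."},
    ],
}
```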

Chinese Versions of the Papers

The author has prepared Chinese translations of some of the papers; message the author privately to obtain them.

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning[82] | arXiv | 2023-06-08 | GitHub[83] | Demo[84] |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[85] | arXiv | 2023-04-19 | GitHub[86] | Demo[87] |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace[88] | arXiv | 2023-03-30 | GitHub[89] | Demo[90] |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[91] | arXiv | 2023-03-20 | GitHub[92] | Demo[93] |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering[94] | CVPR | 2023-03-03 | GitHub[95] | - |
| Visual Programming: Compositional Visual Reasoning Without Training[96] | CVPR | 2022-11-18 | GitHub[97] | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA[98] | AAAI | 2022-06-28 | GitHub[99] | - |
| Flamingo: a Visual Language Model for Few-Shot Learning[100] | NeurIPS | 2022-04-29 | GitHub[101] | Demo[102] |
| Multimodal Few-Shot Learning with Frozen Language Models[103] | NeurIPS | 2021-06-25 | - | - |
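
What unites this section is the Flamingo-style prompt: demonstrations interleave images with text, and the model completes the final query without any weight update. A schematic sketch follows; the segment layout and the final model call are assumptions for illustration, not any listed project's actual API.

```python
def build_interleaved_prompt(demos, query_image_path):
    """demos: list of (image_path, caption) pairs shown as in-context examples."""
    segments = []
    for path, caption in demos:
        segments.append({"image": path})        # demonstration image
        segments.append(f"Output: {caption}")   # its ground-truth text
    segments.append({"image": query_image_path})  # image the model must describe
    segments.append("Output:")                    # left open for the model
    return segments

prompt = build_interleaved_prompt(
    [("dog.jpg", "A dog catching a frisbee."),
     ("cat.jpg", "A cat sleeping on a sofa.")],
    "bird.jpg",
)
# prediction = few_shot_model(prompt)  # hypothetical interleaved image-text model
```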

Multimodal Chain of Thought

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought[104] | arXiv | 2023-05-24 | GitHub[105] | - |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction[106] | arXiv | 2023-05-23 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls[107] | arXiv | 2023-05-04 | GitHub[108] | Demo[109] |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings[110] | arXiv | 2023-05-03 | Coming soon[111] | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[112] | arXiv | 2023-04-19 | GitHub[113] | Demo[114] |
| Chain of Thought Prompt Tuning in Vision Language Models[115] | arXiv | 2023-04-16 | Coming soon | - |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[116] | arXiv | 2023-03-20 | GitHub[117] | Demo[118] |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models[119] | arXiv | 2023-03-08 | GitHub[120] | Demo[121] |
| Multimodal Chain-of-Thought Reasoning in Language Models[122] | arXiv | 2023-02-02 | GitHub[123] | - |
| Visual Programming: Compositional Visual Reasoning Without Training[124] | CVPR | 2022-11-18 | GitHub[125] | Local Demo |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering[126] | NeurIPS | 2022-09-20 | GitHub[127] | - |
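
Several entries above, notably Multimodal Chain-of-Thought Reasoning in Language Models[122], share a two-stage recipe: first generate a rationale conditioned on the image and question, then infer the answer conditioned on the question plus that rationale. A minimal sketch; `model.generate` is a hypothetical stand-in for any vision-language generator, not a specific library call.

```python
def multimodal_cot(model, image, question, options):
    # Stage 1: rationale generation, conditioned on vision + language.
    rationale = model.generate(
        image=image,
        text=f"{question}\nOptions: {options}\nRationale:",
    )
    # Stage 2: answer inference, conditioned on the question plus the rationale.
    answer = model.generate(
        image=image,
        text=f"{question}\nOptions: {options}\nRationale: {rationale}\nAnswer:",
    )
    return rationale, answer
```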

LLM-Aided Visual Reasoning

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[128] | arXiv | 2023-05-30 | GitHub[129] | Demo[130] |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models[131] | arXiv | 2023-05-24 | GitHub[132] | - |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models[133] | arXiv | 2023-05-24 | GitHub[134] | Local Demo |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[135] | arXiv | 2023-05-10 | GitHub[136] | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls[137] | arXiv | 2023-05-04 | GitHub[138] | Demo[139] |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models[140] | arXiv | 2023-04-19 | GitHub[141] | Demo[142] |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace[143] | arXiv | 2023-03-30 | GitHub[144] | Demo[145] |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action[146] | arXiv | 2023-03-20 | GitHub[147] | Demo[148] |
| ViperGPT: Visual Inference via Python Execution for Reasoning[149] | arXiv | 2023-03-14 | GitHub[150] | Local Demo |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions[151] | arXiv | 2023-03-12 | GitHub[152] | Local Demo |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models[153] | arXiv | 2023-03-08 | GitHub[154] | Demo[155] |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners[156] | CVPR | 2023-03-03 | GitHub[157] | - |
| PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning[158] | CVPR | 2022-11-21 | GitHub[159] | - |
| Visual Programming: Compositional Visual Reasoning Without Training[160] | CVPR | 2022-11-18 | GitHub[161] | Local Demo |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language[162] | arXiv | 2022-04-01 | GitHub[163] | - |
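
Most systems in this table (Visual ChatGPT, MM-REACT, HuggingGPT, GPT4Tools, ...) share one control flow: the LLM reads the request, picks a vision tool, and folds the tool's observation back into the context until it can answer. A schematic sketch; `llm` and the toy tools are hypothetical stand-ins, not any project's real API.

```python
TOOLS = {
    "detect": lambda image: "2 dogs, 1 frisbee",       # stand-in for a detector
    "caption": lambda image: "Dogs playing outside.",  # stand-in for a captioner
}

def llm_aided_visual_reasoning(llm, image, request, max_steps=5):
    history = f"User: {request}"
    for _ in range(max_steps):
        # The LLM either names a tool to run or emits a final answer.
        decision = llm(history + "\nNext action (a tool name, or 'answer: ...'):")
        if decision.startswith("answer:"):
            return decision[len("answer:"):].strip()
        observation = TOOLS[decision](image)           # run the chosen tool
        history += f"\nTool '{decision}' returned: {observation}"
    return "No answer within the step budget."
```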

Foundation Models

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| Transfer Visual Prompt Generator across LLMs[164] | arXiv | 2023-05-02 | GitHub[165] | Demo[166] |
| GPT-4 Technical Report[167] | arXiv | 2023-03-15 | - | - |
| PaLM-E: An Embodied Multimodal Language Model[168] | arXiv | 2023-03-06 | - | Demo[169] |
| Prismer: A Vision-Language Model with An Ensemble of Experts[170] | arXiv | 2023-03-04 | GitHub[171] | Demo[172] |
| Language Is Not All You Need: Aligning Perception with Language Models[173] | arXiv | 2023-02-27 | GitHub[174] | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[175] | arXiv | 2023-01-30 | GitHub[176] | Demo[177] |
| VIMA: General Robot Manipulation with Multimodal Prompts[178] | ICML | 2022-10-06 | GitHub[179] | - |
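
Of the foundation models above, BLIP-2[175] is the simplest to try locally, since its checkpoints ship with Hugging Face transformers (v4.27+). A minimal sketch; the checkpoint id, image path, and prompt are just examples.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"  # one published checkpoint; others exist
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # any local image
inputs = processor(images=image, text="Question: what is shown? Answer:",
                   return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```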

Others

| Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks?[180] | arXiv | 2023-06-01 | Coming soon[181] | - |
| Contextual Object Detection with Multimodal Large Language Models[182] | arXiv | 2023-05-29 | GitHub[183] | Demo[184] |
| Generating Images with Multimodal Language Models[185] | arXiv | 2023-05-26 | GitHub[186] | - |
| On Evaluating Adversarial Robustness of Large Vision-Language Models[187] | arXiv | 2023-05-26 | GitHub[188] | - |
| Evaluating Object Hallucination in Large Vision-Language Models[189] | arXiv | 2023-05-17 | GitHub[190] | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs[191] | ICML | 2023-01-31 | GitHub[192] | Demo[193] |


Awesome Datasets

Datasets of Pre-Training for Alignment

| Name | Paper | Type | Modalities |
| --- | --- | --- | --- |
| MS-COCO | Microsoft COCO: Common Objects in Context[194] | Caption | Image-Text |
| SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs[195] | Caption | Image-Text |
| Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning[196] | Caption | Image-Text |
| LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs[197] | Caption | Image-Text |
| VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations[198] | Caption | Image-Text |
| Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models[199] | Caption | Image-Text |
| AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding[200] | Caption | Image-Text |
| Wukong Captions | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark[201] | Caption | Image-Text |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks[202] | Caption | Video-Text |
| MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language[203] | Caption | Video-Text |
| Webvid10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval[204] | Caption | Video-Text |
| WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research[205] | Caption | Audio-Text |
| AISHELL-1 | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline[206] | ASR | Audio-Text |
| AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale[207] | ASR | Audio-Text |
| VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[208] | ASR | Image-Audio-Text |
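
The caption-type corpora above are consumed as (image, text) pairs, and one common way to "pre-train for alignment" on them is a CLIP-style symmetric contrastive loss that pulls matched pairs together in a shared embedding space. A minimal PyTorch sketch of that objective (the encoders producing the embeddings are out of scope here, and some listed models align with generative losses instead):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # pairwise similarities
    targets = torch.arange(logits.size(0))               # i-th image matches i-th text
    return (F.cross_entropy(logits, targets)             # image -> text direction
            + F.cross_entropy(logits.t(), targets)) / 2  # text -> image direction

# Toy usage: a batch of 4 pairs with 512-dimensional embeddings.
loss = clip_style_loss(torch.randn(4, 512), torch.randn(4, 512))
```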

Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration[209] | Link[210] | A large-scale multimodal instruction dataset with multi-turn dialogues |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[211] | Link[212] | A comprehensive multimodal instruction tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[213] | Link[214] | 100K high-quality video instruction pairs |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[215] | Coming soon[216] | Multimodal in-context instruction tuning |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning[217] | Link[218] | A large-scale, broad-coverage multimodal instruction tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day[219] | Coming soon[220] | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction[221] | Link[222] | Tool-related instruction data |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst[223] | Coming soon[224] | A multimodal instruction tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning[225] | Link[226] | An instruction tuning dataset with 5,000 images and about 30,000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering[227] | Coming soon[228] | A large-scale medical visual question answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding[229] | Link[230] | A video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages[231] | Link[232] | A Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models[233] | Link[234] | A multimodal instruction tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models[235] | Link[236] | A multimodal alignment dataset for improving model usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning[237] | Link[238] | GPT-generated multimodal instruction-following data |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning[239] | Link[240] | The first multimodal instruction tuning benchmark dataset |
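
Several of the datasets above are hosted on the Hugging Face Hub and can be pulled with the datasets library. A sketch using M3IT[218]; the "coco" task config follows its dataset card, and the record field names vary by task, so treat both as assumptions to verify.

```python
from datasets import load_dataset

# M3IT is organized by task; "coco" is one task config named on its dataset card.
ds = load_dataset("MMInstruction/M3IT", "coco", split="train")
print(ds[0].keys())  # inspect the fields (e.g. instruction/inputs/outputs/image)
```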

Datasets of In-Context Learning

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning[241] | Coming soon[242] | A multimodal in-context instruction dataset |

Datasets of Multimodal Chain of Thought

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought[243] | Coming soon[244] | A large-scale embodied planning dataset |
| VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction[245] | Coming soon | An inference-time dataset for evaluating video chain of thought |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering[246] | Link[247] | A large-scale multi-choice dataset of multimodal science questions across diverse domains |
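
ScienceQA is the usual training ground for the two-stage rationale-then-answer recipe shown earlier, because each multiple-choice question ships with a lecture and a solution that can serve as the gold rationale. A sketch of turning one record into a chain-of-thought training pair; the field names follow the ScienceQA release but should be verified against the download.

```python
def scienceqa_to_cot_pair(rec):
    """rec: one ScienceQA record (question/choices/answer/hint/lecture/solution)."""
    options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(rec["choices"]))
    prompt = f"Question: {rec['question']}\nContext: {rec['hint']}\nOptions: {options}"
    rationale = f"{rec['lecture']} {rec['solution']}".strip()  # gold chain of thought
    answer = f"The answer is ({chr(65 + rec['answer'])})."
    return prompt, rationale, answer

example = {
    "question": "Which animal is a mammal?", "hint": "",
    "choices": ["frog", "whale"], "answer": 1,
    "lecture": "Mammals feed milk to their young.",
    "solution": "A whale feeds milk to its young, so it is a mammal.",
}
print(scienceqa_to_cot_pair(example))
```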

Other Datasets

| Name | Paper | Link | Notes |
| --- | --- | --- | --- |
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue[248] | Link[249] | A multimodal dialogue dataset |
| LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark[250] | Link[251] | A benchmark for quantitatively evaluating MLLMs on a variety of 2D/3D vision tasks |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality[252] | Link[253] | A dataset for evaluating multiple capabilities |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models[254] | Link[255] | A quantitative evaluation framework for video conversation models |
| LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models[256] | Link[257] | An evaluation platform for MLLMs |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[258] | Link[259] | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation[260] | Link[261] | A manually photographed multimodal fine-tuning dataset for learning to reject instructions |

Statement

This article is mainly translated and organized from GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models[262]. It will be updated continuously; please like and bookmark!

References

[1] Awesome Papers: #超棒的论文
[2] Multimodal Instruction Tuning: #多模态指令调整
[3] Multimodal In-Context Learning: #多模态情境学习
[4] Multimodal Chain of Thought: #多模态思维链条
[5] LLM-Aided Visual Reasoning: #由llm辅助的视觉推理
[6] Foundation Models: #基础模型
[7] Others: #其他
[8] Awesome Datasets: #超棒的数据集
[9] Datasets of Pre-Training for Alignment: #对齐预训练的数据集
[10] Datasets of Multimodal Instruction Tuning: #多模态指令调整的数据集
[11] Datasets of In-Context Learning: #情境学习的数据集
[12] Datasets of Multimodal Chain of Thought: #多模态思维链条的数据集
[13] Others: #其他-1
[14] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[15] GitHub: https://github.com/lyuchenyang/Macaw-LLM
[16] Coming soon:
[17] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[18] GitHub: https://github.com/OpenLAMM/LAMM
[19] Demo: https://huggingface.co/spaces/openlamm/LAMM
[20] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[21] GitHub: https://github.com/mbzuai-oryx/Video-ChatGPT
[22] Demo: https://www.ival-mbzuai.com/video-chatgpt
[23] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[24] GitHub: https://github.com/Luodian/Otter
[25] Demo: https://otter.cliangyu.com/
[26] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[27] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding: https://arxiv.org/pdf/2306.02858.pdf
[28] GitHub: https://github.com/DAMO-NLP-SG/Video-LLaMA
[29] Demo: https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA
[30] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[31] GitHub: https://github.com/microsoft/LLaVA-Med
[32] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[33] GitHub: https://github.com/StevenGrove/GPT4Tools
[34] Demo: https://huggingface.co/spaces/stevengrove/GPT4Tools
[35] PandaGPT: One Model To Instruction-Follow Them All: https://arxiv.org/pdf/2305.16355.pdf
[36] GitHub: https://github.com/yxuansu/PandaGPT
[37] Demo: https://huggingface.co/spaces/GMFTBY/PandaGPT
[38] ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[39] GitHub: https://github.com/joez17/ChatBridge
[40] Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models: https://arxiv.org/pdf/2305.15023.pdf
[41] GitHub: https://github.com/luogen1996/LaVIN
[42] DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[43] GitHub: https://github.com/OptimalScale/DetGPT
[44] Demo: https://d3c431c0c77b1d9010.gradio.live/
[45] VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks: https://arxiv.org/pdf/2305.11175.pdf
[46] GitHub: https://github.com/OpenGVLab/VisionLLM
[47] Demo: https://igpt.opengvlab.com/
[48] Listen, Think, and Understand: https://arxiv.org/pdf/2305.10790.pdf
[49] GitHub: https://github.com/YuanGongND/ltu
[50] Demo: https://github.com/YuanGongND/ltu
[51] GitHub: https://github.com/THUDM/VisualGLM-6B
[52] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[53] GitHub: https://github.com/xiaoman-zhang/PMC-VQA
[54] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning: https://arxiv.org/pdf/2305.06500.pdf
[55] GitHub: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
[56] VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[57] GitHub: https://github.com/OpenGVLab/Ask-Anything
[58] Demo: https://ask.opengvlab.com/
[59] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans: https://arxiv.org/pdf/2305.04790.pdf
[60] GitHub: https://github.com/open-mmlab/Multimodal-GPT
[61] Demo: https://mmgpt.openmmlab.org.cn/
[62] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[63] GitHub: https://github.com/phellonchen/X-LLM
[64] LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[65] GitHub: https://github.com/YunxinLi/LingCloud
[66] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model: https://arxiv.org/pdf/2304.15010.pdf
[67] GitHub: https://github.com/ZrrSkywalker/LLaMA-Adapter
[68] Demo: http://llama-adapter.opengvlab.com/
[69] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[70] GitHub: https://github.com/X-PLUG/mPLUG-Owl
[71] Demo: https://huggingface.co/spaces/MAGAer13/mPLUG-Owl
[72] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[73] GitHub: https://github.com/Vision-CAIR/MiniGPT-4
[74] Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[75] GitHub: https://github.com/haotian-liu/LLaVA
[76] Demo: https://llava.hliu.cc/
[77] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention: https://arxiv.org/pdf/2303.16199.pdf
[78] GitHub: https://github.com/ZrrSkywalker/LLaMA-Adapter
[79] Demo: https://huggingface.co/spaces/csuhan/LLaMA-Adapter
[80] MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[81] GitHub: https://github.com/VT-NLP/MultiInstruct
[82] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[83] GitHub: https://github.com/Luodian/Otter
[84] Demo: https://otter.cliangyu.com/
[85] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[86] GitHub: https://github.com/lupantech/chameleon-llm
[87] Demo: https://chameleon-llm.github.io/
[88] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[89] GitHub: https://github.com/microsoft/JARVIS
[90] Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[91] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[92] GitHub: https://github.com/microsoft/MM-REACT
[93] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[94] Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering: https://arxiv.org/pdf/2303.01903.pdf
[95] GitHub: https://github.com/MILVLG/prophet
[96] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[97] GitHub: https://github.com/allenai/visprog
[98] An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA: https://ojs.aaai.org/index.php/AAAI/article/download/20215/19974
[99] GitHub: https://github.com/microsoft/PICa
[100] Flamingo: a Visual Language Model for Few-Shot Learning: https://arxiv.org/pdf/2204.14198.pdf
[101] GitHub: https://github.com/mlfoundations/open_flamingo
[102] Demo: https://huggingface.co/spaces/dhansmair/flamingo-mini-cap
[103] Multimodal Few-Shot Learning with Frozen Language Models: https://arxiv.org/pdf/2106.13884.pdf
[104] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[105] GitHub: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[106] Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[107] Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[108] GitHub: https://github.com/ttengwang/Caption-Anything
[109] Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[110] Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings: https://arxiv.org/pdf/2305.02317.pdf
[111] Coming soon: https://github.com/dannyrose30/VCOT
[112] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[113] GitHub: https://github.com/lupantech/chameleon-llm
[114] Demo: https://chameleon-llm.github.io/
[115] Chain of Thought Prompt Tuning in Vision Language Models: https://arxiv.org/pdf/2304.07919.pdf
[116] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[117] GitHub: https://github.com/microsoft/MM-REACT
[118] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[119] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[120] GitHub: https://github.com/microsoft/TaskMatrix
[121] Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[122] Multimodal Chain-of-Thought Reasoning in Language Models: https://arxiv.org/pdf/2302.00923.pdf
[123] GitHub: https://github.com/amazon-science/mm-cot
[124] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[125] GitHub: https://github.com/allenai/visprog
[126] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[127] GitHub: https://github.com/lupantech/ScienceQA
[128] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[129] GitHub: https://github.com/StevenGrove/GPT4Tools
[130] Demo: https://c60eb7e9400930f31b.gradio.live/
[131] LayoutGPT: Compositional Visual Planning and Generation with Large Language Models: https://arxiv.org/pdf/2305.15393.pdf
[132] GitHub: https://github.com/weixi-feng/LayoutGPT
[133] IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models: https://arxiv.org/pdf/2305.14985.pdf
[134] GitHub: https://github.com/Hxyou/IdealGPT
[135] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[136] GitHub: https://github.com/matrix-alpha/Accountable-Textual-Visual-Chat
[137] Caption Anything: Interactive Image Description with Diverse Multimodal Controls: https://arxiv.org/pdf/2305.02677.pdf
[138] GitHub: https://github.com/ttengwang/Caption-Anything
[139] Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
[140] Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models: https://arxiv.org/pdf/2304.09842.pdf
[141] GitHub: https://github.com/lupantech/chameleon-llm
[142] Demo: https://chameleon-llm.github.io/
[143] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace: https://arxiv.org/pdf/2303.17580.pdf
[144] GitHub: https://github.com/microsoft/JARVIS
[145] Demo: https://huggingface.co/spaces/microsoft/HuggingGPT
[146] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action: https://arxiv.org/pdf/2303.11381.pdf
[147] GitHub: https://github.com/microsoft/MM-REACT
[148] Demo: https://huggingface.co/spaces/microsoft-cognitive-service/mm-react
[149] ViperGPT: Visual Inference via Python Execution for Reasoning: https://arxiv.org/pdf/2303.08128.pdf
[150] GitHub: https://github.com/cvlab-columbia/viper
[151] ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions: https://arxiv.org/pdf/2303.06594.pdf
[152] GitHub: https://github.com/Vision-CAIR/ChatCaptioner
[153] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models: https://arxiv.org/pdf/2303.04671.pdf
[154] GitHub: https://github.com/microsoft/TaskMatrix
[155] Demo: https://huggingface.co/spaces/microsoft/visual_chatgpt
[156] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners: https://arxiv.org/pdf/2303.02151.pdf
[157] GitHub: https://github.com/ZrrSkywalker/CaFo
[158] PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning: https://arxiv.org/pdf/2211.11682.pdf
[159] GitHub: https://github.com/yangyangyang127/PointCLIP_V2
[160] Visual Programming: Compositional Visual Reasoning Without Training: https://openaccess.thecvf.com/content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf
[161] GitHub: https://github.com/allenai/visprog
[162] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language: https://arxiv.org/pdf/2204.00598.pdf
[163] GitHub: https://github.com/google-research/google-research/tree/master/socraticmodels
[164] Transfer Visual Prompt Generator across LLMs: https://arxiv.org/pdf/2305.01278.pdf
[165] GitHub: https://github.com/VPGTrans/VPGTrans
[166] Demo: https://3fc7715dbc44234a7f.gradio.live/
[167] GPT-4 Technical Report: https://arxiv.org/pdf/2303.08774.pdf
[168] PaLM-E: An Embodied Multimodal Language Model: https://arxiv.org/pdf/2303.03378.pdf
[169] Demo: https://palm-e.github.io/#demo
[170] Prismer: A Vision-Language Model with An Ensemble of Experts: https://arxiv.org/pdf/2303.02506.pdf
[171] GitHub: https://github.com/NVlabs/prismer
[172] Demo: https://huggingface.co/spaces/lorenmt/prismer
[173] Language Is Not All You Need: Aligning Perception with Language Models: https://arxiv.org/pdf/2302.14045.pdf
[174] GitHub: https://github.com/microsoft/unilm
[175] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models: https://arxiv.org/pdf/2301.12597.pdf
[176] GitHub: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
[177] Demo: https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb
[178] VIMA: General Robot Manipulation with Multimodal Prompts: https://arxiv.org/pdf/2210.03094.pdf
[179] GitHub: https://github.com/vimalabs/VIMA
[180] Can Large Pre-trained Models Help Vision Models on Perception Tasks?: https://arxiv.org/pdf/2306.00693.pdf
[181] Coming soon:
[182] Contextual Object Detection with Multimodal Large Language Models: https://arxiv.org/pdf/2305.18279.pdf
[183] GitHub: https://github.com/yuhangzang/ContextDET
[184] Demo: https://huggingface.co/spaces/yuhangzang/ContextDet-Demo
[185] Generating Images with Multimodal Language Models: https://arxiv.org/pdf/2305.17216.pdf
[186] GitHub: https://github.com/kohjingyu/gill
[187] On Evaluating Adversarial Robustness of Large Vision-Language Models: https://arxiv.org/pdf/2305.16934.pdf
[188] GitHub: https://github.com/yunqing-me/AttackVLM
[189] Evaluating Object Hallucination in Large Vision-Language Models: https://arxiv.org/pdf/2305.10355.pdf
[190] GitHub: https://github.com/RUCAIBox/POPE
[191] Grounding Language Models to Images for Multimodal Inputs and Outputs: https://arxiv.org/pdf/2301.13823.pdf
[192] GitHub: https://github.com/kohjingyu/fromage
[193] Demo: https://huggingface.co/spaces/jykoh/fromage
[194] Microsoft COCO: Common Objects in Context: https://arxiv.org/pdf/1405.0312.pdf
[195] Im2Text: Describing Images Using 1 Million Captioned Photographs: https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf
[196] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning: https://aclanthology.org/P18-1238.pdf
[197] LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs: https://arxiv.org/pdf/2111.02114.pdf
[198] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations: https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf
[199] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models: https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf
[200] AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding: https://arxiv.org/pdf/1711.06475.pdf
[201] Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark: https://proceedings.neurips.cc/paper_files/paper/2022/file/a90b9a09a6ee43d6631cf42e225d73b4-Paper-Datasets_and_Benchmarks.pdf
[202] Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks: https://arxiv.org/pdf/2306.04362.pdf
[203] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language: https://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf
[204] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval: https://arxiv.org/pdf/2104.00650.pdf
[205] WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research: https://arxiv.org/pdf/2303.17395.pdf
[206] AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline: https://arxiv.org/pdf/1709.05522.pdf
[207] AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale: https://arxiv.org/pdf/1808.10583.pdf
[208] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[209] Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration: https://arxiv.org/pdf/2306.09093.pdf
[210] Link: https://github.com/lyuchenyang/Macaw-LLM/tree/main/data
[211] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[212] Link: https://github.com/OpenLAMM/LAMM#lamm-dataset
[213] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[214] Link: https://github.com/mbzuai-oryx/Video-ChatGPT#video-instruction-dataset-open_file_folder
[215] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[216] Coming soon: https://github.com/Luodian/Otter
[217] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning: https://arxiv.org/pdf/2306.04387.pdf
[218] Link: https://huggingface.co/datasets/MMInstruction/M3IT
[219] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day: https://arxiv.org/pdf/2306.00890.pdf
[220] Coming soon: https://github.com/microsoft/LLaVA-Med#llava-med-dataset
[221] GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction: https://arxiv.org/pdf/2305.18752.pdf
[222] Link: https://github.com/StevenGrove/GPT4Tools#dataset
[223] ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst: https://arxiv.org/pdf/2305.16103.pdf
[224] Coming soon: https://iva-chatbridge.github.io/
[225] DetGPT: Detect What You Need via Reasoning: https://arxiv.org/pdf/2305.14167.pdf
[226] Link: https://github.com/OptimalScale/DetGPT/tree/main/dataset
[227] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering: https://arxiv.org/pdf/2305.10415.pdf
[228] Coming soon: https://xiaoman-zhang.github.io/PMC-VQA/
[229] VideoChat: Chat-Centric Video Understanding: https://arxiv.org/pdf/2305.06355.pdf
[230] Link: https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
[231] X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages: https://arxiv.org/pdf/2305.04160.pdf
[232] Link: https://github.com/phellonchen/X-LLM
[233] LMEye: An Interactive Perception Network for Large Language Models: https://arxiv.org/pdf/2305.03701.pdf
[234] Link: https://huggingface.co/datasets/YunxinLi/Multimodal_Insturction_Data_V2
[235] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models: https://arxiv.org/pdf/2304.10592.pdf
[236] Link: https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align
[237] Visual Instruction Tuning: https://arxiv.org/pdf/2304.08485.pdf
[238] Link: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
[239] MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning: https://arxiv.org/pdf/2212.10773.pdf
[240] Link: https://github.com/VT-NLP/MultiInstruct
[241] MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/pdf/2306.05425.pdf
[242] Coming soon: https://github.com/Luodian/Otter
[243] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought: https://arxiv.org/pdf/2305.15021.pdf
[244] Coming soon: https://github.com/EmbodiedGPT/EmbodiedGPT_Pytorch
[245] Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction: https://arxiv.org/pdf/2305.13903.pdf
[246] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering: https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
[247] Link: https://github.com/lupantech/ScienceQA#ghost-download-the-dataset
[248] IMAD: IMage-Augmented multi-modal Dialogue: https://arxiv.org/pdf/2305.10512.pdf
[249] Link: https://github.com/VityaVitalich/IMAD
[250] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark: https://arxiv.org/pdf/2306.06687.pdf
[251] Link: https://github.com/OpenLAMM/LAMM#lamm-benchmark
[252] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality: https://arxiv.org/pdf/2304.14178.pdf
[253] Link: https://github.com/X-PLUG/mPLUG-Owl/tree/main/OwlEval
[254] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models: https://arxiv.org/pdf/2306.05424.pdf
[255] Link: https://github.com/mbzuai-oryx/Video-ChatGPT#quantitative-evaluation-bar_chart
[256] LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models: https://arxiv.org/pdf/2306.09265.pdf
[257] Link: https://github.com/OpenGVLab/Multi-Modality-Arena
[258] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[259] Link: https://drive.google.com/drive/folders/1TqBzkyqxOSg1hgCXF8JjpYIAuRV-uVft
[260] Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation: https://arxiv.org/pdf/2303.05983.pdf
[261] Link: https://drive.google.com/drive/folders/1Saaia2rRRb1nz5sKdmpzYdS4jHiMDaP0
[262] GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: Latest Papers and Datasets on Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
