HuggingGPT

Notes on the HuggingGPT Paper

https://arxiv.org/pdf/2303.17580.pdf

Summary

HuggingGPT is a framework for solving complex tasks. It works in four stages:

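Task planning: A large language model (LLM) first breaks the user's request down into a plan. It works out each step, the dependencies between steps, each step's inputs and outputs, and the order of execution.
Model selection: The expert models come from a model community such as Hugging Face. With the tasks ordered by their dependencies, the LLM compares the description of the task to be executed with the descriptions of the expert models available in the community, picks one, and hands the task to it.
Task execution: Each expert model runs its task and returns the result to the LLM.
Response generation: The LLM combines the output of every step with the final result into an answer and returns it to the user.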
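
The four stages can be sketched as a small driver loop. This is only an illustration of the control flow described above, not the paper's implementation: the function name run_pipeline and the four callables passed into it are hypothetical, and the sketch assumes the planner already returns tasks in an executable order.

```python
from typing import Callable, Dict, List

Task = Dict[str, object]


def run_pipeline(
    request: str,
    plan: Callable[[str], List[Task]],
    select_model: Callable[[Task], str],
    execute: Callable[[str, Task], object],
    summarize: Callable[[str, Dict[object, object]], str],
) -> str:
    """Run the four stages in sequence (hypothetical sketch)."""
    # Stage 1: the LLM turns the request into a list of task dicts.
    tasks = plan(request)

    results: Dict[object, object] = {}
    for task in tasks:  # assumption: tasks are already in dependency order
        # Stage 2: choose an expert model for this task.
        model = select_model(task)
        # Stage 3: run the expert model and keep its output by task id.
        results[task["id"]] = execute(model, task)

    # Stage 4: the LLM merges all intermediate results into one answer.
    return summarize(request, results)
```

Passing the stages in as callables keeps the sketch independent of any particular LLM or model-hosting API; each one can be replaced by a stub when testing.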

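As I currently understand it, HuggingGPT can be seen as a more complex, larger-scale version of the LangChain framework.
The ideas are similar, but not identical.
The overall direction is the same: split the task into steps -> execute step by step -> combine the results.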

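The specific differences:

First, task dependencies and priorities. HuggingGPT works these out and handles them internally, whereas in LangChain you still have to handle them yourself in code.

A LangChain agent can only carry out one well-defined kind of task, because its Tools are fixed. HuggingGPT works out for itself which tools it needs and assembles them from the models on Hugging Face.

Communication between models is presumably handled by an internal protocol. In LangChain you write the code yourself, so what happens inside a Tool is entirely custom.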
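
To make the dependency point concrete, here is a minimal sketch of how an execution order could be derived from the "dep" field of each planned task. It assumes that a dep value of -1 means "no dependency", which is how the paper's demonstrations appear to use it. The function name execution_order is hypothetical, and graphlib needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def execution_order(tasks):
    """Return task ids ordered so that every dependency runs first."""
    # Map each task id to the set of ids it depends on.
    # Assumption: -1 marks "no dependency" and is dropped.
    graph = {
        task["id"]: {dep for dep in task["dep"] if dep != -1}
        for task in tasks
    }
    # static_order() raises graphlib.CycleError (a ValueError subclass)
    # if the plan contains a dependency cycle.
    return list(TopologicalSorter(graph).static_order())
```

Tasks with no dependency between them can come out in any relative order, which is also what makes them candidates for parallel execution.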

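Other thoughts:

The HuggingGPT approach matches the way most people naturally think about problems, and the same pattern is used in practice in many fields: one central brain, with a leader for each specialty beneath it. Each leader handles its own area, and the results are pooled at the end.
The prompt HuggingGPT uses for the planning stage is not actually complicated, yet the model manages to extract so much information from it, which feels like magic to me. Dependencies, ordering, and outputs are logic that is not easy to convey even in plain natural language, and the LLM has to understand all of it correctly.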

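HuggingGPT also represents a force in open source. As more and more open-source models join Hugging Face, the available tool models grow in both number and quality, which should improve the quality of the output as well.
But this places higher demands on HuggingGPT's accuracy in the planning and model selection stages, and it also requires the model descriptions to be accurate enough.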
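
A rough sketch of why the descriptions matter: if the LLM chooses among candidates by reading their descriptions, then the descriptions are effectively the whole interface. Everything below is an assumption made for illustration, not the paper's code: the field names ("id", "task", "downloads", "description"), the download-based ranking, and the prompt wording are all hypothetical.

```python
def shortlist(task_type, models, k=3):
    """Keep models tagged with task_type, most-downloaded first (assumed heuristic)."""
    matching = [m for m in models if m["task"] == task_type]
    return sorted(matching, key=lambda m: m["downloads"], reverse=True)[:k]


def selection_prompt(task, candidates):
    """Build the text an LLM would read to choose one candidate."""
    lines = [
        f"Task: {task['task']} with args {task['args']}",
        "Candidate models:",
    ]
    # The description is all the LLM knows about each model,
    # so a vague description leads directly to a poor choice.
    lines += [f"- {m['id']}: {m['description']}" for m in candidates]
    lines.append("Reply with the id of the most suitable model.")
    return "\n".join(lines)
```

Shortlisting before prompting keeps the prompt short as the number of available models grows.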

This part of the prompt states the requirements:
#1 Task Planning Stage - The AI assistant can parse user input to several tasks: [{"task": task, "id":
task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video":
URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the
current task relies on. A special tag "<resource>-task_id" refers to the generated text image,
audio and video in the dependency task with id as task_id. The task MUST be selected from the
following options: {{ Available Task List }}. There is a logical relationship between tasks, please
note their order. If the user input can’t be parsed, you need to reply empty JSON. Here are several
cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat Logs }}.
From the chat logs, you can find the path of the user-mentioned resources for your task planning

This part gives demonstrations that help the model understand:
Look at /exp1.jpg, Can you tell me how many objects in the picture?
[{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/exp1.jpg" }},
 {"task": "object-detection", "id": 0, "dep": [-1], "args": {"image": "/exp1.jpg" }}]

In /exp2.jpg, what’s the animal and what’s it doing?
[{"task": "image-to-text", "id": 0, "dep":[-1], "args": {"image": "/exp2.jpg" }},
 {"task":"image-classification", "id": 1, "dep": [-1], "args": {"image": "/exp2.jpg" }},
 {"task":"object-detection", "id": 2, "dep": [-1], "args": {"image": "/exp2.jpg" }},
 {"task": "visual-question-answering", "id": 3, "dep":[-1], "args": {"text": "What’s the animal doing?", "image": "/exp2.jpg" }}]

Given an image /exp3.jpg, first generate a hed image, then based on the hed image and a prompt: a girl is reading a book, you need to reply with a new image.
[{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/examples/boy.jpg" }},
 {"task": "openpose-control", "id": 1, "dep": [-1], "args": {"image": "/examples/boy.jpg" }},
 {"task": "openpose-text-to-image", "id": 2, "dep": [1], "args": {"text": "a girl is reading a book", "image": "<resource>-1" }}]
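
The "<resource>-1" value in the last demonstration is the special tag from the planning prompt: it stands for whatever task 1 produced. A minimal sketch of how such tags could be swapped for real outputs before a task runs is shown below. The function name resolve_args is hypothetical, and the sketch assumes a tag always fills a whole argument value.

```python
import re

# Matches a whole argument value of the form "<resource>-<task id>".
RESOURCE_TAG = re.compile(r"^<resource>-(\d+)$")


def resolve_args(args, results):
    """Replace "<resource>-N" values with the stored output of task N."""
    resolved = {}
    for key, value in args.items():
        match = RESOURCE_TAG.match(value) if isinstance(value, str) else None
        if match:
            # Raises KeyError if the dependency has not produced a result yet.
            resolved[key] = results[int(match.group(1))]
        else:
            resolved[key] = value
    return resolved
```

For example, with results = {1: "/tmp/pose.png"} (a made-up path), the args of task 2 above would resolve to the same text prompt plus "image": "/tmp/pose.png".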