Commit 15657c44cbb7148ee22a350ce9b137953d3c4924

Authored by 鲲鹏
1 parent dd49cf9a

Adjust the knowledge base distillation plan

... ... @@ -42,6 +42,8 @@
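  - Scans the backend code repository and generates three kinds of supplementary implementation knowledge: API contracts, enum states, and implementation constraints
- `scripts/build_usable_knowledge_pack.py`
  - Generates a ready-to-use knowledge base pack, `dist/usable_kb/`, for day-to-day Q&A and pre-review
  - The current output is the fully expanded topic version: the number of topics per module is no longer capped, and primary and supplementary facts are no longer reduced to a small sample
  - Normalizes `feature_scope`, module labels, and titles to minimize version prefixes, container prefixes, and dirty titles
- `scripts/build_dify_import_pack.py`
  - Repackages `dist/usable_kb/` into a medium-granularity pack, `dist/dify_import/`, better suited for import into Dify or a generic RAG platform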
- `scripts/rebuild_version_kb.sh`
... ...
No preview for this file type
This diff could not be displayed because it is too large.
... ... @@ -20,16 +20,16 @@
- `16_BACKSTAGE_后台.md`
- `17_GENERAL_通用.md`
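- Product topics: 2330
- Product topics: 2235
- Backend implementation atoms: 4048
## Module coverage
- AUTH / Authentication: 660 topics
- INCOME / Income and withdrawal: 537 topics
- INQUIRY / Consultation: 777 topics
- CLINIC / Outpatient clinic: 573 topics
- PATIENT / Patient: 973 topics
- NOTIFICATION / Notification: 358 topics
- BACKSTAGE / Admin backend: 297 topics
- GENERAL / General: 357 topics
- AUTH / Authentication: 668 topics
- INCOME / Income and withdrawal: 558 topics
- INQUIRY / Consultation: 768 topics
- CLINIC / Outpatient clinic: 565 topics
- PATIENT / Patient: 957 topics
- NOTIFICATION / Notification: 347 topics
- BACKSTAGE / Admin backend: 316 topics
- GENERAL / General: 354 topics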
... ...
This diff could not be displayed because it is too large.
... ... @@ -188,6 +188,11 @@ bash scripts/rebuild_all_kb.sh
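## 7. Minimum acceptance checks after each update
Start with structural checks:
- Check that the module main files in `dist/usable_kb/` and `dist/dify_import/` have been rebuilt
- Check that the module main files are the fully expanded topic version rather than the old summaries with only a few topics
- If this change touched topic normalization or title extraction rules, spot-check the titles in the `AUTH` and `INCOME` modules to confirm they are more stable and easier to retrieve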
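The first structural check can be partly automated. The following is a minimal sketch, assuming the module main files are the `*.md` files directly under each output directory and that "rebuilt" can be approximated by file modification time. The helper name `stale_markdown_files` is hypothetical and not part of the repository.

```python
from pathlib import Path


def stale_markdown_files(dist_dir: Path, rebuilt_after: float) -> list[str]:
    """Return names of *.md files under dist_dir last modified before rebuilt_after.

    rebuilt_after is a POSIX timestamp, for example time.time() captured just
    before the build scripts were run (an assumption of this sketch).
    """
    return sorted(
        path.name
        for path in dist_dir.glob("*.md")
        if path.stat().st_mtime < rebuilt_after
    )
```

An empty result for both `dist/usable_kb/` and `dist/dify_import/` suggests every module file was regenerated. It says nothing about whether the content is the fully expanded version, which still needs a manual spot check.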
### Product primary knowledge base
Test at least:
... ... @@ -229,6 +234,17 @@ bash scripts/rebuild_all_kb.sh
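- Replace the 4 files of the backend implementation supplementary knowledge base
- If the implementation constraints changed significantly, also update the implementation notes in the Feishu main document
### Changes to generator scripts / export rules / title-normalization logic
- Rerun the affected build scripts, including at least `python3 scripts/build_usable_knowledge_pack.py`
- If `dist/dify_import/` is affected, also run `python3 scripts/build_dify_import_pack.py`
- Also update:
  - `docs/产品研发RAG_总体方案与实施手册.md`
  - `docs/产品研发RAG_增量更新与Dify维护手册.md`
  - `docs/产品研发RAG_接手说明.md`
  - `skills/product-rag-maintainer/SKILL.md`
- This step is mandatory by default and needs no separate reminder
## 9. Principles
- There is a single set of underlying knowledge assets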
... ...
... ... @@ -571,6 +571,9 @@ flowchart TD
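Purpose:
- Consolidates primary facts, supplementary facts, and backend implementation information into a knowledge base pack better suited for direct use
- The default output is now the "full primary knowledge base version"; module files are no longer trimmed down to a handful of topic summaries
- Each topic fully expands its product primary facts and its interaction/testing supplementary facts
- Before export, `feature_scope`, module labels, and topic titles are normalized to minimize version prefixes, client-side container prefixes, and dirty titles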
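To make the normalization step concrete, here is a minimal sketch of stripping a leading version number and a client-side container prefix from a `feature_scope` path. The version regex follows the one used in `scripts/build_usable_knowledge_pack.py`. The function name `strip_prefixes`, the two-item prefix tuple, and the simplified `[-/]` separator class are assumptions made for illustration.

```python
import re

# Hypothetical subset of client/container prefixes, for illustration only.
CONTAINER_PREFIXES = ("医师端", "患者端")


def strip_prefixes(feature_scope: str) -> str:
    """Drop a leading version number and a leading client prefix from a scope path."""
    # Normalize the spacing around every ">" separator.
    text = re.sub(r"\s*>\s*", " > ", feature_scope.strip())
    # Remove a leading version such as "v3.2.1 > " or "3.10 > ".
    text = re.sub(r"^v?\d+(?:\.\d+)+(?:\s*>\s*)?", "", text, flags=re.I)
    for prefix in CONTAINER_PREFIXES:
        # Remove a prefix followed by "-" or "/", e.g. "医师端-提现".
        text = re.sub(rf"^{re.escape(prefix)}\s*[-/]\s*", "", text)
    return text.strip()


print(strip_prefixes("v3.2.1 > 医师端-提现 > 提现规则"))
print(strip_prefixes("3.10>收入提现"))
```

Under these assumptions the two calls print `提现 > 提现规则` and `收入提现`, so the same feature coming from different versions or clients ends up under one key.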
Output:
... ... @@ -584,6 +587,7 @@ flowchart TD
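- Keeps the common files and the module main files
- Automatically picks up high-priority reference files such as `inputs/priority_refs/*.md`
- Leaves further internal chunking to Dify at import time
- No longer compresses the module main files into a summary version again; it copies the fully expanded primary knowledge base files directly
Output: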
... ... @@ -608,6 +612,12 @@ python3 scripts/build_usable_knowledge_pack.py
python3 scripts/build_dify_import_pack.py
```
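## 8.1.1 Maintenance constraints
- Whenever the knowledge base generator scripts, export structure, topic normalization rules, or Dify import rules change, the corresponding documentation under `docs/` must be updated as well
- The in-repo maintenance skill, `skills/product-rag-maintainer/SKILL.md`, must be updated at the same time
- Do not change script behavior while leaving the docs and the skill describing the old workflow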
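These constraints could be checked mechanically before a commit. The sketch below is one possible approach, assuming the list of changed paths comes from something like `git diff --name-only` and that every generator change lives under `scripts/`. The function name `missing_sync_targets` is hypothetical.

```python
SKILL_PATH = "skills/product-rag-maintainer/SKILL.md"


def missing_sync_targets(changed_paths: list[str]) -> list[str]:
    """Return the follow-up updates still missing for a set of changed paths.

    Assumption of this sketch: a change under scripts/ must be accompanied by
    at least one docs/*.md change and a change to the maintenance skill file.
    """
    changed = set(changed_paths)
    if not any(path.startswith("scripts/") for path in changed):
        return []  # no generator change, nothing to sync
    missing = []
    if not any(path.startswith("docs/") and path.endswith(".md") for path in changed):
        missing.append("docs/*.md")
    if SKILL_PATH not in changed:
        missing.append(SKILL_PATH)
    return missing
```

A non-empty result lists what still has to be updated. An empty list only means the files were touched, not that their content matches the new behavior.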
To also connect the backend code repository, then run:
```bash
... ...
... ... @@ -83,6 +83,17 @@ bash scripts/rebuild_version_kb.sh <version> /Users/xwk/Downloads/studio-server2
python3 scripts/build_dify_import_pack.py
```
### Changing the knowledge base generation logic
- If you touched `scripts/build_usable_knowledge_pack.py`, `scripts/build_dify_import_pack.py`, or any other script that changes the export structure:
```bash
python3 scripts/build_usable_knowledge_pack.py
python3 scripts/build_dify_import_pack.py
```
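- Then update the main document, the maintenance manual, and the in-repo skill; do not change the scripts without updating the documentation
### Full rebuild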
```bash
... ... @@ -104,6 +115,10 @@ bash scripts/rebuild_all_kb.sh /Users/xwk/Downloads/studio-server2
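If this update adds new specialized rules, also sync:
- The corresponding `inputs/priority_refs/*.md`
If this update changes the knowledge base generation logic, also sync:
- `skills/product-rag-maintainer/SKILL.md`
- The runbooks and artifact descriptions in the relevant `docs/*.md`
## 6. Things not to do
- Do not cram all content back into a single Dify knowledge base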
... ...
... ... @@ -43,6 +43,99 @@ MODULE_NAMES = {
"BACKSTAGE": "后台",
"GENERAL": "通用",
}

# Generic test outcomes that carry no rule content; skipped when collecting rule entries.
GENERIC_RESULTS = {"满足预期", "搜索出结果", "成功", "失败", "显示成功", "显示失败", "显示正常", "表现正常", "逻辑同上", "无"}

# Maps raw module labels (canonical codes and Chinese aliases) to canonical module codes.
MODULE_ALIASES = {
    "AUTH": "AUTH",
    "认证": "AUTH",
    "身份认证": "AUTH",
    "医生认证": "AUTH",
    "医师资质": "AUTH",
    "互联网医院备案": "AUTH",
    "用户注册": "AUTH",
    "用户登录": "AUTH",
    "INCOME": "INCOME",
    "收入": "INCOME",
    "收入提现": "INCOME",
    "签约": "INCOME",
    "签约提现": "INCOME",
    "税收": "INCOME",
    "税务": "INCOME",
    "收入税务": "INCOME",
    "缴税": "INCOME",
    "收税方式": "INCOME",
    "税源地": "INCOME",
    "结算": "INCOME",
    "费用结算": "INCOME",
    "绩效收入": "INCOME",
    "工猫": "INCOME",
    "安易发": "INCOME",
    "提现": "INCOME",
    "INQUIRY": "INQUIRY",
    "问诊": "INQUIRY",
    "图文问诊": "INQUIRY",
    "电话问诊": "INQUIRY",
    "视频问诊": "INQUIRY",
    "问诊单": "INQUIRY",
    "问诊定价": "INQUIRY",
    "待接诊": "INQUIRY",
    "聊天": "INQUIRY",
    "消息会话": "INQUIRY",
    "医患聊天": "INQUIRY",
    "CLINIC": "CLINIC",
    "门诊": "CLINIC",
    "预约挂号": "CLINIC",
    "PATIENT": "PATIENT",
    "患者": "PATIENT",
    "患者端": "PATIENT",
    "患者管理": "PATIENT",
    "患者档案": "PATIENT",
    "患者分组": "PATIENT",
    "患者互动": "PATIENT",
    "患者通讯录": "PATIENT",
    "患者搜索": "PATIENT",
    "病历": "PATIENT",
    "随访": "PATIENT",
    "评价": "PATIENT",
    "锦旗": "PATIENT",
    "电子锦旗": "PATIENT",
    "NOTIFICATION": "NOTIFICATION",
    "通知": "NOTIFICATION",
    "BACKSTAGE": "BACKSTAGE",
    "后台": "BACKSTAGE",
    "医生管理": "BACKSTAGE",
    "二维码管理": "BACKSTAGE",
    "工作室设置": "BACKSTAGE",
    "工作室开通": "BACKSTAGE",
    "GENERAL": "GENERAL",
}

# Path segments too generic to stand alone as a feature key or title.
GENERIC_FEATURE_SEGMENTS = {
    "功能描述",
    "需求背景",
    "背景",
    "说明",
    "场景",
    "兼容性",
    "新版本",
    "老版本",
    "医师端",
    "患者端",
    "医生App",
    "APP端",
    "小程序端",
    "PC端",
}

# Keywords and leading words that mark a candidate title as unusable.
BAD_TITLE_KEYWORDS = {"目标", "背景", "说明", "场景", "功能描述", "需求背景", "兼容性"}
BAD_TITLE_STARTS = ("如果", "当", "该", "给", "通知", "有", "无", "进入", "直接", "还是", "已经", "支持", "显示", "不显示")

# Client / container prefixes stripped from feature segments and titles.
GENERIC_PREFIX_PATTERNS = (
    "医师端",
    "患者端",
    "医生App",
    "APP端",
    "小程序端",
    "PC端",
    "猫头鹰端",
    "猫头鹰后台",
)

def clean_text(text: str) -> str:
... ... @@ -78,6 +171,137 @@ def display_feature_scope(feature_scope: str) -> str:
    return clean_text(scope) or "未归类功能"


# Resolve a raw module label to a canonical module code; returns None when unknown.
def normalize_module(value: str) -> str | None:
    text = clean_text(value)
    if not text:
        return None
    upper = text.upper()
    if upper in MODULE_ORDER:
        return upper
    return MODULE_ALIASES.get(text)
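
The lookup order in `normalize_module` can be tried in isolation. The sketch below repeats the function with stand-ins for names the diff does not show: `MODULE_ORDER` and the body of `clean_text` are assumptions, and the alias table is cut down to three entries.

```python
from typing import Optional

# Stand-ins for names defined elsewhere in the script (assumed shapes).
MODULE_ORDER = ["AUTH", "INCOME", "PATIENT"]
MODULE_ALIASES = {"认证": "AUTH", "提现": "INCOME", "随访": "PATIENT"}


def clean_text(text: str) -> str:
    """Hypothetical stand-in: collapse runs of whitespace and trim."""
    return " ".join(str(text or "").split())


def normalize_module(value: str) -> Optional[str]:
    text = clean_text(value)
    if not text:
        return None
    upper = text.upper()
    if upper in MODULE_ORDER:  # canonical code, matched case-insensitively
        return upper
    return MODULE_ALIASES.get(text)  # Chinese alias, or None when unknown


print(normalize_module("auth"))
print(normalize_module(" 提现 "))
print(normalize_module("未知模块"))
```

Unknown labels come back as `None` rather than a default module, which lets the caller drop them instead of filing topics under the wrong module.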

# Split a feature_scope path into cleaned segments, dropping version numbers,
# bullet glyphs, client/container prefixes, and generic section labels.
def normalize_feature_segments(feature_scope: str) -> list[str]:
    text = clean_text(feature_scope)
    text = re.sub(r"\s*-\s*>\s*", " > ", text)
    text = re.sub(r"\s*>\s*", " > ", text)
    text = re.sub(r"^v?\d+(?:\.\d+)+(?:\s*>\s*)?", "", text, flags=re.I)
    parts = [display_feature_scope(part) for part in re.split(r"\s*>\s*", text) if display_feature_scope(part)]
    cleaned = []
    for part in parts:
        part = re.sub(r"^[❤♥•◦■]+", "", part).strip()
        for prefix in GENERIC_PREFIX_PATTERNS:
            part = re.sub(rf"^{re.escape(prefix)}\s*[--/]\s*", "", part)
        if re.fullmatch(r"v?\d+(?:\.\d+)+", part, flags=re.I):
            continue
        part = re.sub(r"^(?:功能描述|需求背景|背景|说明|场景)[::]\s*", "", part)
        part = clean_text(part)
        if not part:
            continue
        cleaned.append(part)
    return cleaned

# Reduce a feature_scope path to a short grouping key built from its last one or two segments.
def normalize_feature_key(feature_scope: str) -> str:
    parts = normalize_feature_segments(feature_scope)
    if not parts:
        return "未归类功能"
    if len(parts) == 1:
        return parts[0]
    tail = parts[-1]
    prev = parts[-2]
    if re.fullmatch(r"[\d.]+", tail):
        return prev
    if tail in GENERIC_FEATURE_SEGMENTS or len(tail) <= 2:
        return f"{prev} > {tail}"
    if len(prev) >= 18 and len(tail) <= 18:
        return tail
    if prev in GENERIC_FEATURE_SEGMENTS:
        return tail
    if len(tail) <= 12 or len(prev) <= 12:
        return f"{prev} > {tail}"
    return tail
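
The tail/parent rules are easier to follow on concrete inputs. The sketch below applies the same decision order to segments that are already cleaned, so it needs none of the helpers the diff leaves out. `pick_feature_key` and the three-item `GENERIC_SEGMENTS` set are illustrative assumptions, not names from the script.

```python
import re

# Small illustrative subset of the generic segments used by the script.
GENERIC_SEGMENTS = {"说明", "背景", "医师端"}
UNCATEGORIZED = "未归类功能"


def pick_feature_key(parts: list[str]) -> str:
    """Apply the tail/parent rules to already-cleaned path segments."""
    if not parts:
        return UNCATEGORIZED
    if len(parts) == 1:
        return parts[0]
    prev, tail = parts[-2], parts[-1]
    if re.fullmatch(r"[\d.]+", tail):
        return prev  # numeric tail: fall back to the parent
    if tail in GENERIC_SEGMENTS or len(tail) <= 2:
        return f"{prev} > {tail}"  # weak tail: keep the parent for context
    if len(prev) >= 18 and len(tail) <= 18:
        return tail  # long parent, compact tail: the tail alone is enough
    if prev in GENERIC_SEGMENTS:
        return tail  # a generic parent adds nothing
    if len(tail) <= 12 or len(prev) <= 12:
        return f"{prev} > {tail}"  # either segment is short: keep both levels
    return tail


print(pick_feature_key(["提现规则", "1.2"]))
print(pick_feature_key(["收入提现", "说明"]))
print(pick_feature_key(["医师端", "提现到账时间说明文案调整"]))
```

The three calls print `提现规则`, `收入提现 > 说明`, and `提现到账时间说明文案调整`: a numeric tail is dropped, a generic tail keeps its parent, and a generic parent is discarded.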

# Turn a raw scope or text fragment into a title candidate without prefixes, labels, or bullets.
def normalize_title_candidate(text: str) -> str:
    text = normalize_feature_key(text)
    text = re.sub(r"\s*-\s*>\s*", " > ", text)
    for prefix in GENERIC_PREFIX_PATTERNS:
        text = re.sub(rf"^{re.escape(prefix)}\s*[--/]\s*", "", text)
    text = re.sub(r"^(?:目标|背景|说明|场景|功能描述|需求背景)[::]\s*", "", text)
    text = re.sub(r"^[•◦■\-]+\s*", "", text)
    text = clean_text(text)
    return text


# Rewrite action-style or one-off awkward titles into short noun-like titles.
def rewrite_title(text: str) -> str:
    text = normalize_title_candidate(text)
    if not text:
        return text
    text = re.sub(r"^操作(?:切换)?", "", text).strip()
    text = re.sub(r"^点击(.+?) > (.+)$", r"\1 > \2", text)
    text = re.sub(r"^点击(.+)$", r"\1", text)
    text = re.sub(r"^去掉涉及到的(.+?)相关$", r"\1", text)
    text = re.sub(r"^去掉[“\"]?(.+?)[”\"]?$", r"\1", text)
    text = re.sub(r"^增加app的(.+)$", r"\1", text, flags=re.I)
    text = re.sub(r"^外治还是走原来的流程$", "外治流程", text)
    text = re.sub(r"^没有选择任何筛选条件$", "筛选条件为空", text)
    text = re.sub(r"^第四周放号数据生成$", "第四周放号", text)
    text = re.sub(r"^设置线下预约挂号时[::]\s*(.+)$", r"线下预约挂号设置", text)
    text = re.sub(r"^“我的-优惠券”.*$", "我的优惠券展示", text)
    text = re.sub(r"^(.+?)还是走原来的流程$", r"\1流程", text)
    text = clean_text(text.strip(" >-"))
    return text

# Accept a title only if it is 3 to 40 characters long and free of bad starts, prefixes, and keywords.
def is_good_title(text: str) -> bool:
    text = rewrite_title(text)
    if not text or text == "未归类功能":
        return False
    if len(text) < 3 or len(text) > 40:
        return False
    if text.startswith(BAD_TITLE_STARTS):
        return False
    if any(text.startswith(f"{prefix}-") or text.startswith(f"{prefix} >") for prefix in GENERIC_PREFIX_PATTERNS):
        return False
    if text in GENERIC_FEATURE_SEGMENTS:
        return False
    if any(keyword in text for keyword in BAD_TITLE_KEYWORDS):
        return False
    return True

# Break a raw text into candidate title fragments, clean each one, and de-duplicate in order.
def extract_title_fragments(text: str) -> list[str]:
    raw = clean_text(text)
    if not raw:
        return []
    raw = re.sub(r"\s*-\s*>\s*", " > ", raw)
    candidates = [raw]
    if ">" in raw:
        candidates.extend(part.strip() for part in raw.split(">") if part.strip())
    candidates.extend(re.split(r"[;;]", raw))
    enriched = []
    for item in candidates:
        item = clean_text(item)
        if not item:
            continue
        item = re.sub(r"^(?:\d+[.、)]\s*)+", "", item)
        item = re.sub(r"^(?:操作|点击|选择|设置|显示|进入|打开|查看|发送|支持|增加|新增)[::]?\s*", "", item)
        item = re.split(r"[,,。]", item, maxsplit=1)[0]
        item = re.split(r"\s{2,}", item, maxsplit=1)[0]
        item = rewrite_title(item)
        if item and not item.startswith(BAD_TITLE_STARTS):
            enriched.append(item)
    result = []
    seen = set()
    for item in enriched:
        if item in seen:
            continue
        seen.add(item)
        result.append(item)
    return result
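
The final loop of `extract_title_fragments` is an order-preserving de-duplication. As an aside, the same behavior can be sketched with `dict.fromkeys`, relying on dictionaries preserving insertion order (guaranteed since Python 3.7). `dedupe_keep_order` is a hypothetical helper name, not something the commit defines.

```python
def dedupe_keep_order(items: list[str]) -> list[str]:
    """Drop repeated items while keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


print(dedupe_keep_order(["提现规则", "提现", "提现规则"]))
```

Keeping the first occurrence matters here because earlier fragments come from the full path and later ones from its pieces.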

def normalize_rule(text: str) -> str:
    text = clean_text(text)
    text = re.sub(r"^[a-zA-ZivxIVX]+[.、)]\s*", "", text)
... ... @@ -89,21 +313,40 @@ def normalize_rule(text: str) -> str:
def choose_title(feature: str, atoms: list[dict]) -> str:
    candidates = [display_feature_scope(feature)]
    candidates: list[tuple[str, int]] = [
        (rewrite_title(feature), 3),
        (normalize_feature_key(feature), 2),
        (display_feature_scope(feature), 1),
    ]
    for atom in atoms:
        for raw in (atom.get("A", ""), atom.get("C", "")):
            value = display_feature_scope(raw)
            if value and value != "未归类功能":
                candidates.append(value)
    filtered = []
        for raw in (atom.get("feature_scope", ""),):
            for value in extract_title_fragments(raw):
                if value and value != "未归类功能":
                    candidates.append((value, 3))
        for raw in (atom.get("C", ""), atom.get("A", ""), atom.get("R", "")):
            for value in extract_title_fragments(raw):
                if value and value != "未归类功能":
                    candidates.append((value, 1))
    filtered: list[tuple[str, int]] = []
    seen = set()
    for item in candidates:
    for item, source_rank in candidates:
        if not item or item in seen:
            continue
        seen.add(item)
        filtered.append(item)
    filtered.sort(key=lambda x: (x == "未归类功能", len(x)))
    return filtered[0] if filtered else "未归类功能"
        filtered.append((item, source_rank))
    if not filtered:
        return "未归类功能"

    def score(entry: tuple[str, int]) -> tuple[int, int, int, int, str]:
        title, source_rank = entry
        title = rewrite_title(title)
        good = 1 if is_good_title(title) else 0
        path_bonus = 1 if " > " in title and not any(title.startswith(f"{prefix} >") for prefix in GENERIC_PREFIX_PATTERNS) else 0
        ideal_len = -abs(len(title) - 10)
        return (good, source_rank, path_bonus, ideal_len, title)

    filtered.sort(key=score, reverse=True)
    return filtered[0][0]
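
The new `choose_title` ranks candidates by a tuple key, so earlier fields dominate later ones. The sketch below reproduces that ordering on three made-up candidates. `is_good` is a simplified stand-in for `is_good_title`, `BAD_STARTS` is a subset of `BAD_TITLE_STARTS`, and the prefix check inside `path_bonus` is left out, so treat it as an illustration of the sort rather than the script's exact scoring.

```python
BAD_STARTS = ("如果", "当", "显示")  # illustrative subset of BAD_TITLE_STARTS


def is_good(title: str) -> bool:
    """Simplified stand-in: length window plus bad leading words."""
    return 3 <= len(title) <= 40 and not title.startswith(BAD_STARTS)


def score(entry: tuple[str, int]) -> tuple[int, int, int, int, str]:
    title, source_rank = entry
    good = 1 if is_good(title) else 0
    path_bonus = 1 if " > " in title else 0
    ideal_len = -abs(len(title) - 10)  # closer to 10 characters scores higher
    return (good, source_rank, path_bonus, ideal_len, title)


candidates = [
    ("提现规则", 1),
    ("收入提现 > 提现规则", 3),
    ("如果用户未认证则提示", 3),
]
best = sorted(candidates, key=score, reverse=True)[0][0]
print(best)
```

With these inputs the winner is `收入提现 > 提现规则`: it passes the quality check, comes from the highest-ranked source, and carries a path. The third candidate has the same source rank but loses on the first field because it starts with a bad leading word.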
def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]:
... ... @@ -124,16 +367,69 @@ def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]:
    return rules


# Collect one de-duplicated rule text per atom, ordered by version, atom type, and fingerprint.
def collect_rule_entries(atoms: list[dict]) -> list[dict]:
    entries = []
    seen = set()
    for atom in sorted(
        atoms,
        key=lambda x: (
            version_key(x.get("app_version", "")),
            x.get("atom_type", ""),
            x.get("merge_fingerprint", ""),
            x.get("R", ""),
            x.get("A", ""),
        ),
    ):
        # Take the first usable field among R, A, and canon_text.
        for raw in (atom.get("R", ""), atom.get("A", ""), atom.get("canon_text", "")):
            text = normalize_rule(raw)
            if not text or len(text) < 2:
                continue
            if text in GENERIC_RESULTS:
                continue
            key = (
                atom.get("app_version", ""),
                atom.get("atom_type", ""),
                text,
            )
            if key in seen:
                continue
            seen.add(key)
            entries.append(
                {
                    "version": atom.get("app_version", "") or "未知版本",
                    "source": atom.get("atom_type", "") or "unknown",
                    "text": text,
                }
            )
            break
    return entries
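
One detail of `collect_rule_entries` is easy to miss: when a field's text is generic or already seen, the loop moves on to the next field of the same atom instead of skipping the atom. The sketch below isolates that fallback. It simplifies on purpose: the sorting is dropped, the de-duplication key omits `atom_type`, whitespace collapsing stands in for `normalize_rule`, and `collect_texts` is a hypothetical name.

```python
GENERIC_RESULTS = {"成功", "失败", "无"}  # illustrative subset


def collect_texts(atoms: list[dict]) -> list[str]:
    """Keep, per atom, the first field among R, A, canon_text that is usable and unseen."""
    seen: set[tuple[str, str]] = set()
    texts = []
    for atom in atoms:
        for field in ("R", "A", "canon_text"):
            text = " ".join(str(atom.get(field, "") or "").split())
            if len(text) < 2 or text in GENERIC_RESULTS:
                continue  # empty, too short, or a generic outcome
            key = (atom.get("app_version", ""), text)
            if key in seen:
                continue  # duplicate: try the next field of this atom
            seen.add(key)
            texts.append(text)
            break  # at most one entry per atom
    return texts


atoms = [
    {"app_version": "3.2.0", "R": "成功", "A": "提现次日到账"},
    {"app_version": "3.2.0", "R": "提现次日到账", "A": "单日限额五万"},
    {"app_version": "3.2.0", "R": "无"},
]
print(collect_texts(atoms))
```

The first atom falls back from a generic `R` to `A`, the second falls back because its `R` duplicates an earlier entry, and the third contributes nothing.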

def group_product_features(master_atoms: list[dict]) -> dict[str, dict]:
    grouped: dict[str, dict] = {}
    by_feature: dict[str, list[dict]] = defaultdict(list)
    for atom in master_atoms:
        if atom.get("atom_type") not in {"doc_rule", "definition", "rule", "case_rule"}:
            continue
        by_feature[atom.get("feature_scope", "未归类功能")].append(atom)
        normalized_feature = normalize_feature_key(atom.get("feature_scope", "未归类功能"))
        by_feature[normalized_feature].append(atom)
    for feature, atoms in by_feature.items():
        modules = sorted({m for atom in atoms for m in atom.get("modules", []) if m})
        modules = sorted(
            {
                normalized
                for atom in atoms
                for normalized in [normalize_module(atom.get("primary_module", ""))]
                if normalized
            }
            | {
                normalized
                for atom in atoms
                for module in atom.get("modules", [])
                for normalized in [normalize_module(module)]
                if normalized
            }
        )
        primary = [a for a in atoms if a.get("atom_type") in {"doc_rule", "definition"}]
        supplement = [a for a in atoms if a.get("atom_type") in {"rule", "case_rule"}]
        versions = sorted({a.get("app_version", "") for a in atoms if a.get("app_version")}, key=version_key)
... ... @@ -243,13 +539,13 @@ def render_versions(product_features: dict[str, dict]) -> str:
"",
]
    items = sorted(product_features.values(), key=lambda x: (-len(x["versions"]), x["title"].lower()))
    for item in items[:220]:
    for item in items:
        lines.append(f"## {item['title']}")
        lines.append("")
        lines.append(f"- 模块:{', '.join(item['modules'])}")
        lines.append(f"- 版本:{', '.join(item['versions']) or '无'}")
        lines.append(f"- 主事实样例:{';'.join(sample_product_rules(item['primary'], 2)) or '无'}")
        lines.append(f"- 补充样例:{';'.join(sample_product_rules(item['supplement'], 2)) or '无'}")
        lines.append(f"- 主事实数:{len(collect_rule_entries(item['primary']))}")
        lines.append(f"- 补充事实数:{len(collect_rule_entries(item['supplement']))}")
        lines.append("")
    return "\n".join(lines)
... ... @@ -345,17 +641,33 @@ def render_module_file(module: str, items: list[dict], code_bucket: dict[str, li
lines.append(f"- 约束样例:{';'.join(constraint_samples)}")
    lines.extend(["", "## 主题清单", ""])
    for item in sorted(items, key=feature_rank)[:90]:
    for item in sorted(items, key=feature_rank):
        lines.append(f"### {item['title']}")
        lines.append("")
        if item["touchpoints"]:
            lines.append(f"- 触点:{', '.join(item['touchpoints'])}")
        if item["versions"]:
            lines.append(f"- 涉及版本:{', '.join(item['versions'])}")
        primary_rules = sample_product_rules(item["primary"], 3)
        supplement_rules = sample_product_rules(item["supplement"], 3)
        lines.append(f"- 产品主事实:{';'.join(primary_rules) or '无'}")
        lines.append(f"- 交互/测试补充:{';'.join(supplement_rules) or '无'}")
        primary_entries = collect_rule_entries(item["primary"])
        supplement_entries = collect_rule_entries(item["supplement"])
        lines.append(f"- 主事实条数:{len(primary_entries)}")
        lines.append(f"- 补充事实条数:{len(supplement_entries)}")
        lines.append("")
        lines.append("#### 产品主事实")
        lines.append("")
        if primary_entries:
            for entry in primary_entries:
                lines.append(f"- [{entry['version']}] {entry['text']}")
        else:
            lines.append("- 无")
        lines.append("")
        lines.append("#### 交互/测试补充")
        lines.append("")
        if supplement_entries:
            for entry in supplement_entries:
                lines.append(f"- [{entry['version']}] {entry['text']}")
        else:
            lines.append("- 无")
        lines.append("")
    return "\n".join(lines)
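
The expanded module file renders the primary and supplementary sections with the same pattern: a `####` heading, one `- [version] text` bullet per entry, and a `- 无` placeholder when the list is empty. A minimal sketch of that pattern as a single function follows. `render_fact_section` is a hypothetical helper for illustration; the commit writes both sections out inline.

```python
def render_fact_section(heading: str, entries: list[dict]) -> list[str]:
    """Render one '#### heading' block with a '- [version] text' bullet per entry."""
    lines = [f"#### {heading}", ""]
    if entries:
        lines.extend(f"- [{entry['version']}] {entry['text']}" for entry in entries)
    else:
        lines.append("- 无")  # same placeholder the script emits for an empty section
    lines.append("")
    return lines


section = render_fact_section("产品主事实", [{"version": "3.2.0", "text": "提现次日到账"}])
print("\n".join(section))
```

Because every fact is emitted with its version tag, a retrieval hit on a single bullet still shows which release the rule belongs to.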
... ...
... ... @@ -17,10 +17,22 @@ Use this skill when the task is to continue maintaining the repository at `产品
   - new high-priority reference
   - backend repo update
   - full rebuild
   - generator or export-rule update
5. Run the matching script:
   - version rebuild: `bash scripts/rebuild_version_kb.sh <version> [backend_repo]`
   - full rebuild: `bash scripts/rebuild_all_kb.sh [backend_repo]`
   - Dify import pack only: `python3 scripts/build_dify_import_pack.py`
   - if any generator / export / title-normalization logic changed, rebuild at least:
     - `python3 scripts/build_usable_knowledge_pack.py`
     - `python3 scripts/build_dify_import_pack.py`
## Documentation Sync
After any change to scripts, output structure, title-normalization logic, or maintenance behavior:
1. Update the matching docs under `docs/`.
2. Update this skill file if the workflow or rules changed.
3. Treat doc and skill sync as mandatory follow-up work, not an optional reminder.
## File placement rules
... ... @@ -61,8 +73,9 @@ After any update:
1. Check `dist/dify_import/`, `dist/backend_code/`, `dist/final_kb/`.
2. Check `dist/quality/atom_quality_summary.md`.
3. Run Dify retrieval tests using the examples in `references/validation-queries.md`.
4. After version updates, remind the user to sync the Feishu docs entry pages and version overview.
3. If `build_usable_knowledge_pack.py` changed, verify the module files are still complete expanded knowledge files rather than truncated summaries.
4. Run Dify retrieval tests using the examples in `references/validation-queries.md`.
5. After version updates, remind the user to sync the Feishu docs entry pages and version overview.
## Notes
... ...