Commit 15657c44cbb7148ee22a350ce9b137953d3c4924

Authored by 鲲鹏
1 parent dd49cf9a

Adjust the knowledge base distillation approach

@@ -42,6 +42,8 @@
   - Scans the backend code repository and generates three kinds of supplementary implementation knowledge: interface contracts, enum states, and implementation constraints
 - `scripts/build_usable_knowledge_pack.py`
   - Generates a usable knowledge base pack, `dist/usable_kb/`, for day-to-day Q&A and pre-review
+  - The current output is the fully expanded topic version: the number of topics per module is no longer capped, and primary and supplementary facts are no longer sampled down to a few
+  - `feature_scope`, module labels, and titles are normalized to reduce version prefixes, container prefixes, and dirty titles as far as possible
 - `scripts/build_dify_import_pack.py`
   - Repackages `dist/usable_kb/` into a medium-granularity pack, `dist/dify_import/`, better suited for import into Dify or a general-purpose RAG platform
 - `scripts/rebuild_version_kb.sh`
No preview for this file type
This diff could not be displayed because it is too large.
@@ -20,16 +20,16 @@
 - `16_BACKSTAGE_后台.md`
 - `17_GENERAL_通用.md`

-- Product topics: 2330
+- Product topics: 2235
 - Backend implementation atoms: 4048

 ## Module coverage

-- AUTH / Authentication: 660 topics
-- INCOME / Income and withdrawal: 537 topics
-- INQUIRY / Consultation: 777 topics
-- CLINIC / Clinic: 573 topics
-- PATIENT / Patient: 973 topics
-- NOTIFICATION / Notification: 358 topics
-- BACKSTAGE / Backstage: 297 topics
-- GENERAL / General: 357 topics
+- AUTH / Authentication: 668 topics
+- INCOME / Income and withdrawal: 558 topics
+- INQUIRY / Consultation: 768 topics
+- CLINIC / Clinic: 565 topics
+- PATIENT / Patient: 957 topics
+- NOTIFICATION / Notification: 347 topics
+- BACKSTAGE / Backstage: 316 topics
+- GENERAL / General: 354 topics
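A quick tally of the figures in this hunk, as a sketch only: the two dictionaries copy the old and new per-module counts shown above, and the names `old`, `new`, and `delta` are illustrative rather than anything in the repository.

```python
# Per-module topic counts copied from the removed (old) and added (new) lines.
old = {"AUTH": 660, "INCOME": 537, "INQUIRY": 777, "CLINIC": 573,
       "PATIENT": 973, "NOTIFICATION": 358, "BACKSTAGE": 297, "GENERAL": 357}
new = {"AUTH": 668, "INCOME": 558, "INQUIRY": 768, "CLINIC": 565,
       "PATIENT": 957, "NOTIFICATION": 347, "BACKSTAGE": 316, "GENERAL": 354}

# Change per module, positive when the module gained topics.
delta = {module: new[module] - old[module] for module in old}

old_total = sum(old.values())
new_total = sum(new.values())
```

The eight per-module counts sum to far more than the product topic count (2330 before, 2235 after). One reading, which the diff does not confirm, is that a topic is counted once under every entry of its `modules` list.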
This diff could not be displayed because it is too large.
@@ -188,6 +188,11 @@ bash scripts/rebuild_all_kb.sh

 ## 7. Minimum acceptance checks after each update

+Start with a structural check:
+- Check that the module main files in `dist/usable_kb/` and `dist/dify_import/` have been rebuilt
+- Check that the module main files are the fully expanded topic version rather than the old short topic summaries
+- If this change touched topic normalization or title extraction rules, spot-check the titles in the `AUTH` and `INCOME` modules to confirm they are more stable and easier to retrieve
+
 ### Product main knowledge base

 At least test:
@@ -229,6 +234,17 @@ bash scripts/rebuild_all_kb.sh
 - Replace the 4 files of the backend implementation supplementary knowledge base
 - If implementation constraints change significantly, also update the implementation notes in the Feishu main document

+### Changing generator scripts / export rules / title normalization logic
+
+- Rerun the affected build scripts, including at least `python3 scripts/build_usable_knowledge_pack.py`
+- If `dist/dify_import/` is affected, then run `python3 scripts/build_dify_import_pack.py`
+- Also update:
+  - `docs/产品研发RAG_总体方案与实施手册.md`
+  - `docs/产品研发RAG_增量更新与Dify维护手册.md`
+  - `docs/产品研发RAG_接手说明.md`
+  - `skills/product-rag-maintainer/SKILL.md`
+- This step is mandatory by default and needs no extra reminder
+
 ## 9. Principles

 - There is a single set of underlying knowledge assets
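The rebuild rule added in this hunk can be sketched as a small planner. This is a minimal sketch and not part of the commit: `plan_rebuild` and its `dify_affected` flag are hypothetical names, and only the two script paths come from the diff.

```python
def plan_rebuild(dify_affected: bool) -> list[list[str]]:
    """Return the build commands to run, in order, after a generator change."""
    # The usable knowledge pack is always rebuilt first.
    commands = [["python3", "scripts/build_usable_knowledge_pack.py"]]
    # The Dify import pack is derived from dist/usable_kb/, so it runs second
    # and only when dist/dify_import/ is affected.
    if dify_affected:
        commands.append(["python3", "scripts/build_dify_import_pack.py"])
    return commands
```

Each returned command could be passed to `subprocess.run(cmd, check=True)` from the repository root so that a failing build stops the sequence.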
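@@ -571,6 +571,9 @@ flowchart TD
 Purpose:

 - Consolidate primary facts, supplementary facts, and backend implementation information into a knowledge base pack better suited to direct use
+- The default output is now the "complete main knowledge base version"; module files are no longer trimmed down to a few topic summaries
+- Each topic fully expands its product primary facts and its interaction/test supplementary facts
+- Before export, `feature_scope`, module labels, and topic titles are normalized to reduce version prefixes, client-container prefixes, and dirty titles as far as possible

 Output:
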
@@ -584,6 +587,7 @@ flowchart TD
 - Keep the common files and the module main files
 - Automatically pick up high-priority reference files such as `inputs/priority_refs/*.md`
 - Leave further internal chunking to Dify at import time
+- Module main files are no longer compressed into a summary version a second time; the fully expanded main knowledge base files are copied as-is

 Output:

@@ -608,6 +612,12 @@ python3 scripts/build_usable_knowledge_pack.py
 python3 scripts/build_dify_import_pack.py
 ```

+## 8.1.1 Maintenance constraints
+
+- Whenever the knowledge base generator scripts, export structure, topic normalization rules, or Dify import rules change, the corresponding docs under `docs/` must be updated as well
+- The in-repo maintenance skill must also be updated: `skills/product-rag-maintainer/SKILL.md`
+- Do not change script behavior while leaving the docs and the skill on the old workflow
+
 To also bring in the backend code repository, then run:

 ```bash
@@ -83,6 +83,17 @@ bash scripts/rebuild_version_kb.sh <version> /Users/xwk/Downloads/studio-server2
 python3 scripts/build_dify_import_pack.py
 ```

+### Changing the knowledge base generation logic
+
+- If you touched `scripts/build_usable_knowledge_pack.py`, `scripts/build_dify_import_pack.py`, or any other script that changes the export structure:
+
+```bash
+python3 scripts/build_usable_knowledge_pack.py
+python3 scripts/build_dify_import_pack.py
+```
+
+- Then update the main document, the maintenance manual, and the in-repo skill; do not change the scripts without changing the docs
+
 ### Full rebuild

 ```bash
@@ -104,6 +115,10 @@ bash scripts/rebuild_all_kb.sh /Users/xwk/Downloads/studio-server2
 If this update adds new special-case rules, also sync:
 - The corresponding `inputs/priority_refs/*.md`

+If this update changes the knowledge base generation logic, also sync:
+- `skills/product-rag-maintainer/SKILL.md`
+- The runbook and artifact descriptions in the relevant `docs/*.md`
+
 ## 6. What not to do

 - Do not cram everything back into a single Dify knowledge base
@@ -43,6 +43,99 @@ MODULE_NAMES = {
     "BACKSTAGE": "后台",
     "GENERAL": "通用",
 }
+GENERIC_RESULTS = {"满足预期", "搜索出结果", "成功", "失败", "显示成功", "显示失败", "显示正常", "表现正常", "逻辑同上", "无"}
+MODULE_ALIASES = {
+    "AUTH": "AUTH",
+    "认证": "AUTH",
+    "身份认证": "AUTH",
+    "医生认证": "AUTH",
+    "医师资质": "AUTH",
+    "互联网医院备案": "AUTH",
+    "用户注册": "AUTH",
+    "用户登录": "AUTH",
+    "INCOME": "INCOME",
+    "收入": "INCOME",
+    "收入提现": "INCOME",
+    "签约": "INCOME",
+    "签约提现": "INCOME",
+    "税收": "INCOME",
+    "税务": "INCOME",
+    "收入税务": "INCOME",
+    "缴税": "INCOME",
+    "收税方式": "INCOME",
+    "税源地": "INCOME",
+    "结算": "INCOME",
+    "费用结算": "INCOME",
+    "绩效收入": "INCOME",
+    "工猫": "INCOME",
+    "安易发": "INCOME",
+    "提现": "INCOME",
+    "INQUIRY": "INQUIRY",
+    "问诊": "INQUIRY",
+    "图文问诊": "INQUIRY",
+    "电话问诊": "INQUIRY",
+    "视频问诊": "INQUIRY",
+    "问诊单": "INQUIRY",
+    "问诊定价": "INQUIRY",
+    "待接诊": "INQUIRY",
+    "聊天": "INQUIRY",
+    "消息会话": "INQUIRY",
+    "医患聊天": "INQUIRY",
+    "CLINIC": "CLINIC",
+    "门诊": "CLINIC",
+    "预约挂号": "CLINIC",
+    "PATIENT": "PATIENT",
+    "患者": "PATIENT",
+    "患者端": "PATIENT",
+    "患者管理": "PATIENT",
+    "患者档案": "PATIENT",
+    "患者分组": "PATIENT",
+    "患者互动": "PATIENT",
+    "患者通讯录": "PATIENT",
+    "患者搜索": "PATIENT",
+    "病历": "PATIENT",
+    "随访": "PATIENT",
+    "评价": "PATIENT",
+    "锦旗": "PATIENT",
+    "电子锦旗": "PATIENT",
+    "NOTIFICATION": "NOTIFICATION",
+    "通知": "NOTIFICATION",
+    "BACKSTAGE": "BACKSTAGE",
+    "后台": "BACKSTAGE",
+    "医生管理": "BACKSTAGE",
+    "二维码管理": "BACKSTAGE",
+    "工作室设置": "BACKSTAGE",
+    "工作室开通": "BACKSTAGE",
+    "GENERAL": "GENERAL",
+}
+GENERIC_FEATURE_SEGMENTS = {
+    "功能描述",
+    "需求背景",
+    "背景",
+    "说明",
+    "场景",
+    "兼容性",
+    "新版本",
+    "老版本",
+    "医师端",
+    "患者端",
+    "医生App",
+    "APP端",
+    "小程序端",
+    "PC端",
+}
+BAD_TITLE_KEYWORDS = {"目标", "背景", "说明", "场景", "功能描述", "需求背景", "兼容性"}
+BAD_TITLE_STARTS = ("如果", "当", "该", "给", "通知", "有", "无", "进入", "直接", "还是", "已经", "支持", "显示", "不显示")
+GENERIC_PREFIX_PATTERNS = (
+    "医师端",
+    "患者端",
+    "医生App",
+    "APP端",
+    "小程序端",
+    "PC端",
+    "猫头鹰端",
+    "猫头鹰后台",
+)


 def clean_text(text: str) -> str:
@@ -78,6 +171,137 @@ def display_feature_scope(feature_scope: str) -> str:
     return clean_text(scope) or "未归类功能"


+def normalize_module(value: str) -> str | None:
+    text = clean_text(value)
+    if not text:
+        return None
+    upper = text.upper()
+    if upper in MODULE_ORDER:
+        return upper
+    return MODULE_ALIASES.get(text)
+
+
+def normalize_feature_segments(feature_scope: str) -> list[str]:
+    text = clean_text(feature_scope)
+    text = re.sub(r"\s*-\s*>\s*", " > ", text)
+    text = re.sub(r"\s*>\s*", " > ", text)
+    text = re.sub(r"^v?\d+(?:\.\d+)+(?:\s*>\s*)?", "", text, flags=re.I)
+    parts = [display_feature_scope(part) for part in re.split(r"\s*>\s*", text) if display_feature_scope(part)]
+    cleaned = []
+    for part in parts:
+        part = re.sub(r"^[❤♥•◦■]+", "", part).strip()
+        for prefix in GENERIC_PREFIX_PATTERNS:
+            part = re.sub(rf"^{re.escape(prefix)}\s*[--/]\s*", "", part)
+        if re.fullmatch(r"v?\d+(?:\.\d+)+", part, flags=re.I):
+            continue
+        part = re.sub(r"^(?:功能描述|需求背景|背景|说明|场景)[::]\s*", "", part)
+        part = clean_text(part)
+        if not part:
+            continue
+        cleaned.append(part)
+    return cleaned
+
+
+def normalize_feature_key(feature_scope: str) -> str:
+    parts = normalize_feature_segments(feature_scope)
+    if not parts:
+        return "未归类功能"
+    if len(parts) == 1:
+        return parts[0]
+    tail = parts[-1]
+    prev = parts[-2]
+    if re.fullmatch(r"[\d.]+", tail):
+        return prev
+    if tail in GENERIC_FEATURE_SEGMENTS or len(tail) <= 2:
+        return f"{prev} > {tail}"
+    if len(prev) >= 18 and len(tail) <= 18:
+        return tail
+    if prev in GENERIC_FEATURE_SEGMENTS:
+        return tail
+    if len(tail) <= 12 or len(prev) <= 12:
+        return f"{prev} > {tail}"
+    return tail
+
+
+def normalize_title_candidate(text: str) -> str:
+    text = normalize_feature_key(text)
+    text = re.sub(r"\s*-\s*>\s*", " > ", text)
+    for prefix in GENERIC_PREFIX_PATTERNS:
+        text = re.sub(rf"^{re.escape(prefix)}\s*[--/]\s*", "", text)
+    text = re.sub(r"^(?:目标|背景|说明|场景|功能描述|需求背景)[::]\s*", "", text)
+    text = re.sub(r"^[•◦■\-]+\s*", "", text)
+    text = clean_text(text)
+    return text
+
+
+def rewrite_title(text: str) -> str:
+    text = normalize_title_candidate(text)
+    if not text:
+        return text
+    text = re.sub(r"^操作(?:切换)?", "", text).strip()
+    text = re.sub(r"^点击(.+?) > (.+)$", r"\1 > \2", text)
+    text = re.sub(r"^点击(.+)$", r"\1", text)
+    text = re.sub(r"^去掉涉及到的(.+?)相关$", r"\1", text)
+    text = re.sub(r"^去掉[“\"]?(.+?)[”\"]?$", r"\1", text)
+    text = re.sub(r"^增加app的(.+)$", r"\1", text, flags=re.I)
+    text = re.sub(r"^外治还是走原来的流程$", "外治流程", text)
+    text = re.sub(r"^没有选择任何筛选条件$", "筛选条件为空", text)
+    text = re.sub(r"^第四周放号数据生成$", "第四周放号", text)
+    text = re.sub(r"^设置线下预约挂号时[::]\s*(.+)$", r"线下预约挂号设置", text)
+    text = re.sub(r"^“我的-优惠券”.*$", "我的优惠券展示", text)
+    text = re.sub(r"^(.+?)还是走原来的流程$", r"\1流程", text)
+    text = clean_text(text.strip(" >-"))
+    return text
+
+
+def is_good_title(text: str) -> bool:
+    text = rewrite_title(text)
+    if not text or text == "未归类功能":
+        return False
+    if len(text) < 3 or len(text) > 40:
+        return False
+    if text.startswith(BAD_TITLE_STARTS):
+        return False
+    if any(text.startswith(f"{prefix}-") or text.startswith(f"{prefix} >") for prefix in GENERIC_PREFIX_PATTERNS):
+        return False
+    if text in GENERIC_FEATURE_SEGMENTS:
+        return False
+    if any(keyword in text for keyword in BAD_TITLE_KEYWORDS):
+        return False
+    return True
+
+
+def extract_title_fragments(text: str) -> list[str]:
+    raw = clean_text(text)
+    if not raw:
+        return []
+    raw = re.sub(r"\s*-\s*>\s*", " > ", raw)
+    candidates = [raw]
+    if ">" in raw:
+        candidates.extend(part.strip() for part in raw.split(">") if part.strip())
+    candidates.extend(re.split(r"[;;]", raw))
+    enriched = []
+    for item in candidates:
+        item = clean_text(item)
+        if not item:
+            continue
+        item = re.sub(r"^(?:\d+[.、)]\s*)+", "", item)
+        item = re.sub(r"^(?:操作|点击|选择|设置|显示|进入|打开|查看|发送|支持|增加|新增)[::]?\s*", "", item)
+        item = re.split(r"[,,。]", item, maxsplit=1)[0]
+        item = re.split(r"\s{2,}", item, maxsplit=1)[0]
+        item = rewrite_title(item)
+        if item and not item.startswith(BAD_TITLE_STARTS):
+            enriched.append(item)
+    result = []
+    seen = set()
+    for item in enriched:
+        if item in seen:
+            continue
+        seen.add(item)
+        result.append(item)
+    return result
+
+
 def normalize_rule(text: str) -> str:
     text = clean_text(text)
     text = re.sub(r"^[a-zA-ZivxIVX]+[.、)]\s*", "", text)
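To make the effect of the scope normalization in this hunk easier to see, here is a reduced, self-contained sketch. It is not the commit's code: `strip_scope_noise` is a hypothetical name, `CONTAINER_PREFIXES` is an assumed subset of `GENERIC_PREFIX_PATTERNS`, whitespace collapsing stands in for `clean_text`, and the sample inputs in the usage note are made up.

```python
import re

# Assumption: a subset of GENERIC_PREFIX_PATTERNS from the diff.
CONTAINER_PREFIXES = ("医师端", "患者端", "小程序端")


def strip_scope_noise(feature_scope: str) -> list[str]:
    """Split a scope path on '>' and drop version and container prefixes."""
    # Stand-in for clean_text: collapse runs of whitespace.
    text = re.sub(r"\s+", " ", feature_scope).strip()
    # Treat "->" as the same separator as ">".
    text = re.sub(r"\s*-\s*>\s*", " > ", text)
    # Drop a leading version such as "v3.2.1 > ".
    text = re.sub(r"^v?\d+(?:\.\d+)+(?:\s*>\s*)?", "", text, flags=re.I)
    cleaned = []
    for part in re.split(r"\s*>\s*", text):
        # Remove a container prefix followed by "-" or "/".
        for prefix in CONTAINER_PREFIXES:
            part = re.sub(rf"^{re.escape(prefix)}\s*[-/]\s*", "", part)
        part = part.strip()
        # Skip empty segments and segments that are only a version number.
        if part and not re.fullmatch(r"v?\d+(?:\.\d+)+", part, flags=re.I):
            cleaned.append(part)
    return cleaned
```

With these assumptions, `strip_scope_noise("v3.2.1 > 医师端-问诊 -> 图文问诊")` yields `["问诊", "图文问诊"]`: the version prefix and the client-container prefix are gone, so the same feature keys to the same group across versions.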
@@ -89,21 +313,40 @@ def normalize_rule(text: str) -> str:


 def choose_title(feature: str, atoms: list[dict]) -> str:
-    candidates = [display_feature_scope(feature)]
+    candidates: list[tuple[str, int]] = [
+        (rewrite_title(feature), 3),
+        (normalize_feature_key(feature), 2),
+        (display_feature_scope(feature), 1),
+    ]
     for atom in atoms:
-        for raw in (atom.get("A", ""), atom.get("C", "")):
-            value = display_feature_scope(raw)
-            if value and value != "未归类功能":
-                candidates.append(value)
-    filtered = []
+        for raw in (atom.get("feature_scope", ""),):
+            for value in extract_title_fragments(raw):
+                if value and value != "未归类功能":
+                    candidates.append((value, 3))
+        for raw in (atom.get("C", ""), atom.get("A", ""), atom.get("R", "")):
+            for value in extract_title_fragments(raw):
+                if value and value != "未归类功能":
+                    candidates.append((value, 1))
+    filtered: list[tuple[str, int]] = []
     seen = set()
-    for item in candidates:
+    for item, source_rank in candidates:
         if not item or item in seen:
             continue
         seen.add(item)
-        filtered.append(item)
-    filtered.sort(key=lambda x: (x == "未归类功能", len(x)))
-    return filtered[0] if filtered else "未归类功能"
+        filtered.append((item, source_rank))
+    if not filtered:
+        return "未归类功能"
+
+    def score(entry: tuple[str, int]) -> tuple[int, int, int, int, str]:
+        title, source_rank = entry
+        title = rewrite_title(title)
+        good = 1 if is_good_title(title) else 0
+        path_bonus = 1 if " > " in title and not any(title.startswith(f"{prefix} >") for prefix in GENERIC_PREFIX_PATTERNS) else 0
+        ideal_len = -abs(len(title) - 10)
+        return (good, source_rank, path_bonus, ideal_len, title)
+
+    filtered.sort(key=score, reverse=True)
+    return filtered[0][0]


 def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]:
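The new `choose_title` ranks candidates with a tuple key, so earlier fields dominate later ones. The sketch below isolates that idea under simplifying assumptions: `pick_title` and `looks_ok` are hypothetical names, `looks_ok` checks only the length bounds and stands in for `is_good_title`, and the path bonus omits the container-prefix exclusion.

```python
def pick_title(candidates: list[tuple[str, int]], ideal_len: int = 10) -> str:
    """Pick the best (title, source_rank) pair using a tuple sort key."""
    if not candidates:
        return "未归类功能"

    def looks_ok(title: str) -> bool:
        # Hypothetical stand-in for is_good_title: length bounds only.
        return 3 <= len(title) <= 40

    def score(entry: tuple[str, int]) -> tuple[int, int, int, int, str]:
        title, source_rank = entry
        good = 1 if looks_ok(title) else 0
        path_bonus = 1 if " > " in title else 0
        # Closer to the ideal length scores higher (less negative).
        closeness = -abs(len(title) - ideal_len)
        return (good, source_rank, path_bonus, closeness, title)

    # Tuples compare field by field, so `good` outranks `source_rank`, and so on.
    return max(candidates, key=score)[0]
```

Because `good` is the first field, an acceptable title from a low-ranked source still beats an unacceptable one from a high-ranked source; source rank only breaks ties among titles of equal quality.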
@@ -124,16 +367,69 @@ def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]:
     return rules


+def collect_rule_entries(atoms: list[dict]) -> list[dict]:
+    entries = []
+    seen = set()
+    for atom in sorted(
+        atoms,
+        key=lambda x: (
+            version_key(x.get("app_version", "")),
+            x.get("atom_type", ""),
+            x.get("merge_fingerprint", ""),
+            x.get("R", ""),
+            x.get("A", ""),
+        ),
+    ):
+        for raw in (atom.get("R", ""), atom.get("A", ""), atom.get("canon_text", "")):
+            text = normalize_rule(raw)
+            if not text or len(text) < 2:
+                continue
+            if text in GENERIC_RESULTS:
+                continue
+            key = (
+                atom.get("app_version", ""),
+                atom.get("atom_type", ""),
+                text,
+            )
+            if key in seen:
+                continue
+            seen.add(key)
+            entries.append(
+                {
+                    "version": atom.get("app_version", "") or "未知版本",
+                    "source": atom.get("atom_type", "") or "unknown",
+                    "text": text,
+                }
+            )
+            break
+    return entries
+
+
 def group_product_features(master_atoms: list[dict]) -> dict[str, dict]:
     grouped: dict[str, dict] = {}
     by_feature: dict[str, list[dict]] = defaultdict(list)
     for atom in master_atoms:
         if atom.get("atom_type") not in {"doc_rule", "definition", "rule", "case_rule"}:
             continue
-        by_feature[atom.get("feature_scope", "未归类功能")].append(atom)
+        normalized_feature = normalize_feature_key(atom.get("feature_scope", "未归类功能"))
+        by_feature[normalized_feature].append(atom)

     for feature, atoms in by_feature.items():
-        modules = sorted({m for atom in atoms for m in atom.get("modules", []) if m})
+        modules = sorted(
+            {
+                normalized
+                for atom in atoms
+                for normalized in [normalize_module(atom.get("primary_module", ""))]
+                if normalized
+            }
+            | {
+                normalized
+                for atom in atoms
+                for module in atom.get("modules", [])
+                for normalized in [normalize_module(module)]
+                if normalized
+            }
+        )
         primary = [a for a in atoms if a.get("atom_type") in {"doc_rule", "definition"}]
         supplement = [a for a in atoms if a.get("atom_type") in {"rule", "case_rule"}]
         versions = sorted({a.get("app_version", "") for a in atoms if a.get("app_version")}, key=version_key)
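The deduplication in `collect_rule_entries` keys each entry on version, atom type, and text, and takes the first usable field per atom. A reduced sketch of that behavior follows, assuming the `break` sits inside the inner loop as the control flow suggests. `collect_entries` and `GENERIC` are hypothetical names, sorting and `normalize_rule` are left out, and the sample atoms in the usage note are invented for illustration.

```python
GENERIC = {"成功", "失败", "无"}  # assumption: a subset of GENERIC_RESULTS


def collect_entries(atoms: list[dict]) -> list[dict]:
    """Keep one entry per atom, deduplicated on (version, atom_type, text)."""
    entries: list[dict] = []
    seen: set[tuple[str, str, str]] = set()
    for atom in atoms:
        for field in ("R", "A", "canon_text"):
            text = (atom.get(field) or "").strip()
            # Skip empty, one-character, and generic result strings.
            if len(text) < 2 or text in GENERIC:
                continue
            key = (atom.get("app_version", ""), atom.get("atom_type", ""), text)
            # A duplicate does not skip the atom; the next field is tried.
            if key in seen:
                continue
            seen.add(key)
            entries.append({"version": key[0] or "未知版本", "text": text})
            # First usable field wins for this atom.
            break
    return entries
```

One consequence worth noting: when an atom's `R` duplicates an earlier entry, the loop falls through to `A` instead of dropping the atom, so the atom can still contribute a different line. The same text under a different `app_version` is kept as a separate entry.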
@@ -243,13 +539,13 @@ def render_versions(product_features: dict[str, dict]) -> str:
         "",
     ]
     items = sorted(product_features.values(), key=lambda x: (-len(x["versions"]), x["title"].lower()))
-    for item in items[:220]:
+    for item in items:
         lines.append(f"## {item['title']}")
         lines.append("")
         lines.append(f"- 模块:{', '.join(item['modules'])}")
         lines.append(f"- 版本:{', '.join(item['versions']) or '无'}")
-        lines.append(f"- 主事实样例:{';'.join(sample_product_rules(item['primary'], 2)) or '无'}")
-        lines.append(f"- 补充样例:{';'.join(sample_product_rules(item['supplement'], 2)) or '无'}")
+        lines.append(f"- 主事实数:{len(collect_rule_entries(item['primary']))}")
+        lines.append(f"- 补充事实数:{len(collect_rule_entries(item['supplement']))}")
         lines.append("")
     return "\n".join(lines)

@@ -345,17 +641,33 @@ def render_module_file(module: str, items: list[dict], code_bucket: dict[str, li
         lines.append(f"- 约束样例:{';'.join(constraint_samples)}")
     lines.extend(["", "## 主题清单", ""])

-    for item in sorted(items, key=feature_rank)[:90]:
+    for item in sorted(items, key=feature_rank):
         lines.append(f"### {item['title']}")
         lines.append("")
         if item["touchpoints"]:
             lines.append(f"- 触点:{', '.join(item['touchpoints'])}")
         if item["versions"]:
             lines.append(f"- 涉及版本:{', '.join(item['versions'])}")
-        primary_rules = sample_product_rules(item["primary"], 3)
-        supplement_rules = sample_product_rules(item["supplement"], 3)
-        lines.append(f"- 产品主事实:{';'.join(primary_rules) or '无'}")
-        lines.append(f"- 交互/测试补充:{';'.join(supplement_rules) or '无'}")
+        primary_entries = collect_rule_entries(item["primary"])
+        supplement_entries = collect_rule_entries(item["supplement"])
+        lines.append(f"- 主事实条数:{len(primary_entries)}")
+        lines.append(f"- 补充事实条数:{len(supplement_entries)}")
+        lines.append("")
+        lines.append("#### 产品主事实")
+        lines.append("")
+        if primary_entries:
+            for entry in primary_entries:
+                lines.append(f"- [{entry['version']}] {entry['text']}")
+        else:
+            lines.append("- 无")
+        lines.append("")
+        lines.append("#### 交互/测试补充")
+        lines.append("")
+        if supplement_entries:
+            for entry in supplement_entries:
+                lines.append(f"- [{entry['version']}] {entry['text']}")
+        else:
+            lines.append("- 无")
         lines.append("")
     return "\n".join(lines)

@@ -17,10 +17,22 @@ Use this skill when the task is to continue maintaining the repository at `产品
    - new high-priority reference
    - backend repo update
    - full rebuild
+   - generator or export-rule update
 5. Run the matching script:
    - version rebuild: `bash scripts/rebuild_version_kb.sh <version> [backend_repo]`
    - full rebuild: `bash scripts/rebuild_all_kb.sh [backend_repo]`
    - Dify import pack only: `python3 scripts/build_dify_import_pack.py`
+   - if any generator / export / title-normalization logic changed, rebuild at least:
+     - `python3 scripts/build_usable_knowledge_pack.py`
+     - `python3 scripts/build_dify_import_pack.py`
+
+## Documentation Sync
+
+After any change to scripts, output structure, title-normalization logic, or maintenance behavior:
+
+1. Update the matching docs under `docs/`.
+2. Update this skill file if the workflow or rules changed.
+3. Treat doc and skill sync as mandatory follow-up work, not an optional reminder.

 ## File placement rules

@@ -61,8 +73,9 @@ After any update:

 1. Check `dist/dify_import/`, `dist/backend_code/`, `dist/final_kb/`.
 2. Check `dist/quality/atom_quality_summary.md`.
-3. Run Dify retrieval tests using the examples in `references/validation-queries.md`.
-4. After version updates, remind the user to sync the Feishu docs entry pages and version overview.
+3. If `build_usable_knowledge_pack.py` changed, verify the module files are still complete expanded knowledge files rather than truncated summaries.
+4. Run Dify retrieval tests using the examples in `references/validation-queries.md`.
+5. After version updates, remind the user to sync the Feishu docs entry pages and version overview.

 ## Notes

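The "complete expanded rather than truncated" check in step 3 could be automated along these lines. This is a sketch under assumptions, not a script from the repository: `count_topics` and `looks_expanded` are hypothetical helpers, and the check relies only on the headings that `render_module_file` emits in this commit (one `### ` heading per topic, each followed by a `#### 产品主事实` section, where the old format used a `- 产品主事实:` bullet instead).

```python
def count_topics(markdown: str) -> int:
    """Count level-3 headings, which the module renderer emits once per topic."""
    return sum(1 for line in markdown.splitlines() if line.startswith("### "))


def looks_expanded(markdown: str) -> bool:
    """True when every topic carries its own expanded primary-facts section."""
    topics = count_topics(markdown)
    fact_sections = sum(
        1 for line in markdown.splitlines() if line.strip() == "#### 产品主事实"
    )
    # The old summary format has topics but no per-topic fact sections.
    return topics > 0 and fact_sections == topics
```

Running `looks_expanded` over the text of each module main file in `dist/usable_kb/` gives a quick structural signal before the Dify retrieval tests; it says nothing about whether the facts themselves are correct.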