Showing 26 changed files with 399 additions and 31 deletions
| @@ -42,6 +42,8 @@ | @@ -42,6 +42,8 @@ | ||
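| 42 | - Scans the backend code repository and generates three kinds of supplementary implementation knowledge: API contracts, enum states, and implementation constraints | 42 | - Scans the backend code repository and generates three kinds of supplementary implementation knowledge: API contracts, enum states, and implementation constraints |
| 43 | - `scripts/build_usable_knowledge_pack.py` | 43 | - `scripts/build_usable_knowledge_pack.py` |
| 44 | - Generates a usable knowledge base pack, `dist/usable_kb/`, for day-to-day Q&A and pre-review | 44 | - Generates a usable knowledge base pack, `dist/usable_kb/`, for day-to-day Q&A and pre-review |
| 45 | + - The current output is the fully expanded topic version: the number of topics per module is no longer capped, and primary/supplementary facts are no longer sampled down to a handful | ||
| 46 | + - Normalizes `feature_scope`, module labels, and titles to reduce version prefixes, container prefixes, and dirty titles as far as possible | ||
| 45 | - `scripts/build_dify_import_pack.py` | 47 | - `scripts/build_dify_import_pack.py` |
| 46 | - Repackages `dist/usable_kb/` into a medium-granularity pack, `dist/dify_import/`, better suited for import into Dify / general-purpose RAG platforms | 48 | - Repackages `dist/usable_kb/` into a medium-granularity pack, `dist/dify_import/`, better suited for import into Dify / general-purpose RAG platforms |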
| 47 | - `scripts/rebuild_version_kb.sh` | 49 | - `scripts/rebuild_version_kb.sh` |
No preview for this file type
This diff could not be displayed because it is too large.
| @@ -20,16 +20,16 @@ | @@ -20,16 +20,16 @@ | ||
| 20 | - `16_BACKSTAGE_后台.md` | 20 | - `16_BACKSTAGE_后台.md` |
| 21 | - `17_GENERAL_通用.md` | 21 | - `17_GENERAL_通用.md` |
| 22 | 22 | ||
| 23 | -- Product topic count: 2330 | 23 | +- Product topic count: 2235 |
| 24 | - Backend implementation atom count: 4048 | 24 | - Backend implementation atom count: 4048 |
| 25 | 25 | ||
| 26 | ## Module coverage | 26 | ## Module coverage |
| 27 | 27 | ||
| 28 | -- AUTH / Authentication: 660 topics | ||
| 29 | -- INCOME / Income and withdrawal: 537 topics | ||
| 30 | -- INQUIRY / Consultation: 777 topics | ||
| 31 | -- CLINIC / Outpatient clinic: 573 topics | ||
| 32 | -- PATIENT / Patient: 973 topics | ||
| 33 | -- NOTIFICATION / Notification: 358 topics | ||
| 34 | -- BACKSTAGE / Admin backend: 297 topics | ||
| 35 | -- GENERAL / General: 357 topics | 28 | +- AUTH / Authentication: 668 topics |
| 29 | +- INCOME / Income and withdrawal: 558 topics | ||
| 30 | +- INQUIRY / Consultation: 768 topics | ||
| 31 | +- CLINIC / Outpatient clinic: 565 topics | ||
| 32 | +- PATIENT / Patient: 957 topics | ||
| 33 | +- NOTIFICATION / Notification: 347 topics | ||
| 34 | +- BACKSTAGE / Admin backend: 316 topics | ||
| 35 | +- GENERAL / General: 354 topics |
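The per-module counts in the new column add up to 4533, roughly twice the 2235 product topics. One plausible reading, given that each topic carries a list of modules in the generator script, is that a topic is counted once under every module it belongs to. The sketch below illustrates that arithmetic; the three sample topics are invented for illustration and are not taken from the knowledge base.

```python
from collections import Counter

# Per-module topic counts from the new side of the diff.
new_counts = {
    "AUTH": 668, "INCOME": 558, "INQUIRY": 768, "CLINIC": 565,
    "PATIENT": 957, "NOTIFICATION": 347, "BACKSTAGE": 316, "GENERAL": 354,
}
sum_of_new_counts = sum(new_counts.values())

# Hypothetical topics: each one may carry several module labels.
topics = [
    {"title": "Login", "modules": ["AUTH"]},
    {"title": "Withdrawal contract", "modules": ["AUTH", "INCOME"]},
    {"title": "Consultation reminder", "modules": ["INQUIRY", "NOTIFICATION", "PATIENT"]},
]

# Count a topic once for every module it is listed under.
per_module = Counter(module for topic in topics for module in topic["modules"])
total_topics = len(topics)
sum_of_modules = sum(per_module.values())
```

With the sample data, three topics produce six module entries, so a module sum larger than the unique topic count does not by itself indicate duplicated topics.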
| @@ -188,6 +188,11 @@ bash scripts/rebuild_all_kb.sh | @@ -188,6 +188,11 @@ bash scripts/rebuild_all_kb.sh | ||
| 188 | 188 | ||
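| 189 | ## 7. Minimum acceptance checks after every update | 189 | ## 7. Minimum acceptance checks after every update |
| 190 | 190 | ||
| 191 | +Start with structural acceptance: | ||
| 192 | +- Check that the module main files in `dist/usable_kb/` and `dist/dify_import/` have been rebuilt | ||
| 193 | +- Check that the module main files are the fully expanded topic version rather than the old handful of summary topics | ||
| 194 | +- If this change touched topic normalization or title-extraction rules, spot-check the `AUTH` and `INCOME` module titles to confirm they are more stable and easier to retrieve | ||
| 195 | + | ||
| 191 | ### Product main knowledge base | 196 | ### Product main knowledge base |
| 192 | 197 | ||
| 193 | Test at least: | 198 | Test at least: |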
| @@ -229,6 +234,17 @@ bash scripts/rebuild_all_kb.sh | @@ -229,6 +234,17 @@ bash scripts/rebuild_all_kb.sh | ||
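| 229 | - Replace the 4 files of the backend implementation supplementary knowledge base | 234 | - Replace the 4 files of the backend implementation supplementary knowledge base |
| 230 | - If implementation constraints changed significantly, also update the implementation notes in the main Feishu document | 235 | - If implementation constraints changed significantly, also update the implementation notes in the main Feishu document |
| 231 | 236 | ||
| 237 | +### Changes to generator scripts / export rules / title-normalization logic | ||
| 238 | + | ||
| 239 | +- Rerun the affected build scripts, including at least `python3 scripts/build_usable_knowledge_pack.py` | ||
| 240 | +- If `dist/dify_import/` is affected, also run `python3 scripts/build_dify_import_pack.py` | ||
| 241 | +- Also update: | ||
| 242 | + - `docs/产品研发RAG_总体方案与实施手册.md` | ||
| 243 | + - `docs/产品研发RAG_增量更新与Dify维护手册.md` | ||
| 244 | + - `docs/产品研发RAG_接手说明.md` | ||
| 245 | + - `skills/product-rag-maintainer/SKILL.md` | ||
| 246 | +- This step is mandatory by default; no extra reminder is needed | ||
| 247 | + | ||
| 232 | ## 9. Principles | 248 | ## 9. Principles |
| 233 | 249 | ||
| 234 | - The underlying knowledge assets form a single set | 250 | - The underlying knowledge assets form a single set |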
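The structural acceptance step in section 7 (confirm that module main files are fully expanded rather than summarized) can be partly automated. The sketch below is one possible check, not part of the repository: it assumes the module file layout produced by `render_module_file` in this change, where every `### ` topic is followed by a `#### 产品主事实` and a `#### 交互/测试补充` sub-section. The function name `find_unexpanded_topics` is hypothetical.

```python
import re

# Sub-section headings emitted for every topic by the new render_module_file.
PRIMARY_HEADING = "#### 产品主事实"
SUPPLEMENT_HEADING = "#### 交互/测试补充"


def find_unexpanded_topics(markdown: str) -> list[str]:
    """Return titles of `### ` topics that lack either expanded sub-section."""
    missing = []
    # Split on level-3 headings only; the first chunk is the file preamble.
    chunks = re.split(r"^### ", markdown, flags=re.M)[1:]
    for chunk in chunks:
        title, _, body = chunk.partition("\n")
        lines = {line.strip() for line in body.splitlines()}
        if PRIMARY_HEADING not in lines or SUPPLEMENT_HEADING not in lines:
            missing.append(title.strip())
    return missing


sample = (
    "# AUTH\n\n## 主题清单\n\n"
    "### Topic A\n\n- 主事实条数:1\n\n#### 产品主事实\n\n- [1.0] rule\n\n"
    "#### 交互/测试补充\n\n- 无\n\n"
    "### Topic B\n\n- 产品主事实:rule\n"
)
unexpanded = find_unexpanded_topics(sample)
```

In the sample, `Topic B` uses the old one-line summary layout, so it is the only title reported. Running the function over each file in `dist/usable_kb/` would give a quick signal before the manual spot-check of `AUTH` and `INCOME` titles.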
| @@ -571,6 +571,9 @@ flowchart TD | @@ -571,6 +571,9 @@ flowchart TD | ||
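| 571 | Purpose: | 571 | Purpose: |
| 572 | 572 | ||
| 573 | - Organizes primary facts, supplementary facts, and backend implementation information into a knowledge base pack better suited for direct use | 573 | - Organizes primary facts, supplementary facts, and backend implementation information into a knowledge base pack better suited for direct use |
| 574 | +- Currently outputs the "full main knowledge base version" by default; module files are no longer trimmed down to a handful of topic summaries | ||
| 575 | +- Each topic fully expands its product primary facts and its interaction/test supplementary facts | ||
| 576 | +- Before export, `feature_scope`, module labels, and topic titles are normalized to reduce version prefixes, client-side container prefixes, and dirty titles as far as possible | ||
| 574 | 577 | ||
| 575 | Output: | 578 | Output: |
| 576 | 579 | ||
| @@ -584,6 +587,7 @@ flowchart TD | @@ -584,6 +587,7 @@ flowchart TD | ||
| 584 | - Keeps the common files and the module main files | 587 | - Keeps the common files and the module main files |
| 585 | - Automatically picks up high-priority reference files such as `inputs/priority_refs/*.md` | 588 | - Automatically picks up high-priority reference files such as `inputs/priority_refs/*.md` |
| 586 | - Leaves further internal chunking to Dify at import time | 589 | - Leaves further internal chunking to Dify at import time |
| 590 | +- Currently does not compress the module main files into a summary version again; it copies the fully expanded main knowledge base files directly | ||
| 587 | 591 | ||
| 588 | Output: | 592 | Output: |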
| 589 | 593 | ||
| @@ -608,6 +612,12 @@ python3 scripts/build_usable_knowledge_pack.py | @@ -608,6 +612,12 @@ python3 scripts/build_usable_knowledge_pack.py | ||
| 608 | python3 scripts/build_dify_import_pack.py | 612 | python3 scripts/build_dify_import_pack.py |
| 609 | ``` | 613 | ``` |
| 610 | 614 | ||
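| 615 | +## 8.1.1 Maintenance constraints | ||
| 616 | + | ||
| 617 | +- Whenever the knowledge base generator scripts, export structure, topic-normalization rules, or Dify import rules are modified, the corresponding documentation under `docs/` must be updated in step | ||
| 618 | +- The in-repo maintenance skill must be updated at the same time: `skills/product-rag-maintainer/SKILL.md` | ||
| 619 | +- Do not change script behavior while leaving the docs and the skill on the old workflow | ||
| 620 | + | ||
| 611 | To also integrate the backend code repository, additionally run: | 621 | To also integrate the backend code repository, additionally run: |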
| 612 | 622 | ||
| 613 | ```bash | 623 | ```bash |
| @@ -83,6 +83,17 @@ bash scripts/rebuild_version_kb.sh <version> /Users/xwk/Downloads/studio-server2 | @@ -83,6 +83,17 @@ bash scripts/rebuild_version_kb.sh <version> /Users/xwk/Downloads/studio-server2 | ||
| 83 | python3 scripts/build_dify_import_pack.py | 83 | python3 scripts/build_dify_import_pack.py |
| 84 | ``` | 84 | ``` |
| 85 | 85 | ||
| 86 | +### Changing the knowledge base generation logic | ||
| 87 | + | ||
| 88 | +- If you touched `scripts/build_usable_knowledge_pack.py`, `scripts/build_dify_import_pack.py`, or any other script that changes the export structure: | ||
| 89 | + | ||
| 90 | +```bash | ||
| 91 | +python3 scripts/build_usable_knowledge_pack.py | ||
| 92 | +python3 scripts/build_dify_import_pack.py | ||
| 93 | +``` | ||
| 94 | + | ||
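| 95 | +- Then update the main document, the maintenance manual, and the in-repo skill in step; do not change the scripts without updating the documentation | ||
| 96 | + | ||
| 86 | ### Full rebuild | 97 | ### Full rebuild |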
| 87 | 98 | ||
| 88 | ```bash | 99 | ```bash |
| @@ -104,6 +115,10 @@ bash scripts/rebuild_all_kb.sh /Users/xwk/Downloads/studio-server2 | @@ -104,6 +115,10 @@ bash scripts/rebuild_all_kb.sh /Users/xwk/Downloads/studio-server2 | ||
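| 104 | If this update adds new special-case rules, also sync: | 115 | If this update adds new special-case rules, also sync: |
| 105 | - The corresponding `inputs/priority_refs/*.md` | 116 | - The corresponding `inputs/priority_refs/*.md` |
| 106 | 117 | ||
| 118 | +If this update changed the knowledge base generation logic, also sync: | ||
| 119 | +- `skills/product-rag-maintainer/SKILL.md` | ||
| 120 | +- The runbooks and artifact descriptions in the relevant `docs/*.md` | ||
| 121 | + | ||
| 107 | ## 6. Things not to do | 122 | ## 6. Things not to do |
| 108 | 123 | ||
| 109 | - Do not cram all content back into a single Dify knowledge base | 124 | - Do not cram all content back into a single Dify knowledge base |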
| @@ -43,6 +43,99 @@ MODULE_NAMES = { | @@ -43,6 +43,99 @@ MODULE_NAMES = { | ||
| 43 | "BACKSTAGE": "后台", | 43 | "BACKSTAGE": "后台", |
| 44 | "GENERAL": "通用", | 44 | "GENERAL": "通用", |
| 45 | } | 45 | } |
| 46 | +GENERIC_RESULTS = {"满足预期", "搜索出结果", "成功", "失败", "显示成功", "显示失败", "显示正常", "表现正常", "逻辑同上", "无"} | ||
| 47 | +MODULE_ALIASES = { | ||
| 48 | + "AUTH": "AUTH", | ||
| 49 | + "认证": "AUTH", | ||
| 50 | + "身份认证": "AUTH", | ||
| 51 | + "医生认证": "AUTH", | ||
| 52 | + "医师资质": "AUTH", | ||
| 53 | + "互联网医院备案": "AUTH", | ||
| 54 | + "用户注册": "AUTH", | ||
| 55 | + "用户登录": "AUTH", | ||
| 56 | + "INCOME": "INCOME", | ||
| 57 | + "收入": "INCOME", | ||
| 58 | + "收入提现": "INCOME", | ||
| 59 | + "签约": "INCOME", | ||
| 60 | + "签约提现": "INCOME", | ||
| 61 | + "税收": "INCOME", | ||
| 62 | + "税务": "INCOME", | ||
| 63 | + "收入税务": "INCOME", | ||
| 64 | + "缴税": "INCOME", | ||
| 65 | + "收税方式": "INCOME", | ||
| 66 | + "税源地": "INCOME", | ||
| 67 | + "结算": "INCOME", | ||
| 68 | + "费用结算": "INCOME", | ||
| 69 | + "绩效收入": "INCOME", | ||
| 70 | + "工猫": "INCOME", | ||
| 71 | + "安易发": "INCOME", | ||
| 72 | + "提现": "INCOME", | ||
| 73 | + "INQUIRY": "INQUIRY", | ||
| 74 | + "问诊": "INQUIRY", | ||
| 75 | + "图文问诊": "INQUIRY", | ||
| 76 | + "电话问诊": "INQUIRY", | ||
| 77 | + "视频问诊": "INQUIRY", | ||
| 78 | + "问诊单": "INQUIRY", | ||
| 79 | + "问诊定价": "INQUIRY", | ||
| 80 | + "待接诊": "INQUIRY", | ||
| 81 | + "聊天": "INQUIRY", | ||
| 82 | + "消息会话": "INQUIRY", | ||
| 83 | + "医患聊天": "INQUIRY", | ||
| 84 | + "CLINIC": "CLINIC", | ||
| 85 | + "门诊": "CLINIC", | ||
| 86 | + "预约挂号": "CLINIC", | ||
| 87 | + "PATIENT": "PATIENT", | ||
| 88 | + "患者": "PATIENT", | ||
| 89 | + "患者端": "PATIENT", | ||
| 90 | + "患者管理": "PATIENT", | ||
| 91 | + "患者档案": "PATIENT", | ||
| 92 | + "患者分组": "PATIENT", | ||
| 93 | + "患者互动": "PATIENT", | ||
| 94 | + "患者通讯录": "PATIENT", | ||
| 95 | + "患者搜索": "PATIENT", | ||
| 96 | + "病历": "PATIENT", | ||
| 97 | + "随访": "PATIENT", | ||
| 98 | + "评价": "PATIENT", | ||
| 99 | + "锦旗": "PATIENT", | ||
| 100 | + "电子锦旗": "PATIENT", | ||
| 101 | + "NOTIFICATION": "NOTIFICATION", | ||
| 102 | + "通知": "NOTIFICATION", | ||
| 103 | + "BACKSTAGE": "BACKSTAGE", | ||
| 104 | + "后台": "BACKSTAGE", | ||
| 105 | + "医生管理": "BACKSTAGE", | ||
| 106 | + "二维码管理": "BACKSTAGE", | ||
| 107 | + "工作室设置": "BACKSTAGE", | ||
| 108 | + "工作室开通": "BACKSTAGE", | ||
| 109 | + "GENERAL": "GENERAL", | ||
| 110 | +} | ||
| 111 | +GENERIC_FEATURE_SEGMENTS = { | ||
| 112 | + "功能描述", | ||
| 113 | + "需求背景", | ||
| 114 | + "背景", | ||
| 115 | + "说明", | ||
| 116 | + "场景", | ||
| 117 | + "兼容性", | ||
| 118 | + "新版本", | ||
| 119 | + "老版本", | ||
| 120 | + "医师端", | ||
| 121 | + "患者端", | ||
| 122 | + "医生App", | ||
| 123 | + "APP端", | ||
| 124 | + "小程序端", | ||
| 125 | + "PC端", | ||
| 126 | +} | ||
| 127 | +BAD_TITLE_KEYWORDS = {"目标", "背景", "说明", "场景", "功能描述", "需求背景", "兼容性"} | ||
| 128 | +BAD_TITLE_STARTS = ("如果", "当", "该", "给", "通知", "有", "无", "进入", "直接", "还是", "已经", "支持", "显示", "不显示") | ||
| 129 | +GENERIC_PREFIX_PATTERNS = ( | ||
| 130 | + "医师端", | ||
| 131 | + "患者端", | ||
| 132 | + "医生App", | ||
| 133 | + "APP端", | ||
| 134 | + "小程序端", | ||
| 135 | + "PC端", | ||
| 136 | + "猫头鹰端", | ||
| 137 | + "猫头鹰后台", | ||
| 138 | +) | ||
| 46 | 139 | ||
| 47 | 140 | ||
| 48 | def clean_text(text: str) -> str: | 141 | def clean_text(text: str) -> str: |
| @@ -78,6 +171,137 @@ def display_feature_scope(feature_scope: str) -> str: | @@ -78,6 +171,137 @@ def display_feature_scope(feature_scope: str) -> str: | ||
| 78 | return clean_text(scope) or "未归类功能" | 171 | return clean_text(scope) or "未归类功能" |
| 79 | 172 | ||
| 80 | 173 | ||
| 174 | +def normalize_module(value: str) -> str | None: | ||
| 175 | + text = clean_text(value) | ||
| 176 | + if not text: | ||
| 177 | + return None | ||
| 178 | + upper = text.upper() | ||
| 179 | + if upper in MODULE_ORDER: | ||
| 180 | + return upper | ||
| 181 | + return MODULE_ALIASES.get(text) | ||
| 182 | + | ||
| 183 | + | ||
| 184 | +def normalize_feature_segments(feature_scope: str) -> list[str]: | ||
| 185 | + text = clean_text(feature_scope) | ||
| 186 | + text = re.sub(r"\s*-\s*>\s*", " > ", text) | ||
| 187 | + text = re.sub(r"\s*>\s*", " > ", text) | ||
| 188 | + text = re.sub(r"^v?\d+(?:\.\d+)+(?:\s*>\s*)?", "", text, flags=re.I) | ||
| 189 | + parts = [display_feature_scope(part) for part in re.split(r"\s*>\s*", text) if display_feature_scope(part)] | ||
| 190 | + cleaned = [] | ||
| 191 | + for part in parts: | ||
| 192 | + part = re.sub(r"^[❤♥•◦■]+", "", part).strip() | ||
| 193 | + for prefix in GENERIC_PREFIX_PATTERNS: | ||
| 194 | + part = re.sub(rf"^{re.escape(prefix)}\s*[--/]\s*", "", part) | ||
| 195 | + if re.fullmatch(r"v?\d+(?:\.\d+)+", part, flags=re.I): | ||
| 196 | + continue | ||
| 197 | + part = re.sub(r"^(?:功能描述|需求背景|背景|说明|场景)[::]\s*", "", part) | ||
| 198 | + part = clean_text(part) | ||
| 199 | + if not part: | ||
| 200 | + continue | ||
| 201 | + cleaned.append(part) | ||
| 202 | + return cleaned | ||
| 203 | + | ||
| 204 | + | ||
| 205 | +def normalize_feature_key(feature_scope: str) -> str: | ||
| 206 | + parts = normalize_feature_segments(feature_scope) | ||
| 207 | + if not parts: | ||
| 208 | + return "未归类功能" | ||
| 209 | + if len(parts) == 1: | ||
| 210 | + return parts[0] | ||
| 211 | + tail = parts[-1] | ||
| 212 | + prev = parts[-2] | ||
| 213 | + if re.fullmatch(r"[\d.]+", tail): | ||
| 214 | + return prev | ||
| 215 | + if tail in GENERIC_FEATURE_SEGMENTS or len(tail) <= 2: | ||
| 216 | + return f"{prev} > {tail}" | ||
| 217 | + if len(prev) >= 18 and len(tail) <= 18: | ||
| 218 | + return tail | ||
| 219 | + if prev in GENERIC_FEATURE_SEGMENTS: | ||
| 220 | + return tail | ||
| 221 | + if len(tail) <= 12 or len(prev) <= 12: | ||
| 222 | + return f"{prev} > {tail}" | ||
| 223 | + return tail | ||
| 224 | + | ||
| 225 | + | ||
| 226 | +def normalize_title_candidate(text: str) -> str: | ||
| 227 | + text = normalize_feature_key(text) | ||
| 228 | + text = re.sub(r"\s*-\s*>\s*", " > ", text) | ||
| 229 | + for prefix in GENERIC_PREFIX_PATTERNS: | ||
| 230 | + text = re.sub(rf"^{re.escape(prefix)}\s*[--/]\s*", "", text) | ||
| 231 | + text = re.sub(r"^(?:目标|背景|说明|场景|功能描述|需求背景)[::]\s*", "", text) | ||
| 232 | + text = re.sub(r"^[•◦■\-]+\s*", "", text) | ||
| 233 | + text = clean_text(text) | ||
| 234 | + return text | ||
| 235 | + | ||
| 236 | + | ||
| 237 | +def rewrite_title(text: str) -> str: | ||
| 238 | + text = normalize_title_candidate(text) | ||
| 239 | + if not text: | ||
| 240 | + return text | ||
| 241 | + text = re.sub(r"^操作(?:切换)?", "", text).strip() | ||
| 242 | + text = re.sub(r"^点击(.+?) > (.+)$", r"\1 > \2", text) | ||
| 243 | + text = re.sub(r"^点击(.+)$", r"\1", text) | ||
| 244 | + text = re.sub(r"^去掉涉及到的(.+?)相关$", r"\1", text) | ||
| 245 | + text = re.sub(r"^去掉[“\"]?(.+?)[”\"]?$", r"\1", text) | ||
| 246 | + text = re.sub(r"^增加app的(.+)$", r"\1", text, flags=re.I) | ||
| 247 | + text = re.sub(r"^外治还是走原来的流程$", "外治流程", text) | ||
| 248 | + text = re.sub(r"^没有选择任何筛选条件$", "筛选条件为空", text) | ||
| 249 | + text = re.sub(r"^第四周放号数据生成$", "第四周放号", text) | ||
| 250 | + text = re.sub(r"^设置线下预约挂号时[::]\s*(.+)$", r"线下预约挂号设置", text) | ||
| 251 | + text = re.sub(r"^“我的-优惠券”.*$", "我的优惠券展示", text) | ||
| 252 | + text = re.sub(r"^(.+?)还是走原来的流程$", r"\1流程", text) | ||
| 253 | + text = clean_text(text.strip(" >-")) | ||
| 254 | + return text | ||
| 255 | + | ||
| 256 | + | ||
| 257 | +def is_good_title(text: str) -> bool: | ||
| 258 | + text = rewrite_title(text) | ||
| 259 | + if not text or text == "未归类功能": | ||
| 260 | + return False | ||
| 261 | + if len(text) < 3 or len(text) > 40: | ||
| 262 | + return False | ||
| 263 | + if text.startswith(BAD_TITLE_STARTS): | ||
| 264 | + return False | ||
| 265 | + if any(text.startswith(f"{prefix}-") or text.startswith(f"{prefix} >") for prefix in GENERIC_PREFIX_PATTERNS): | ||
| 266 | + return False | ||
| 267 | + if text in GENERIC_FEATURE_SEGMENTS: | ||
| 268 | + return False | ||
| 269 | + if any(keyword in text for keyword in BAD_TITLE_KEYWORDS): | ||
| 270 | + return False | ||
| 271 | + return True | ||
| 272 | + | ||
| 273 | + | ||
| 274 | +def extract_title_fragments(text: str) -> list[str]: | ||
| 275 | + raw = clean_text(text) | ||
| 276 | + if not raw: | ||
| 277 | + return [] | ||
| 278 | + raw = re.sub(r"\s*-\s*>\s*", " > ", raw) | ||
| 279 | + candidates = [raw] | ||
| 280 | + if ">" in raw: | ||
| 281 | + candidates.extend(part.strip() for part in raw.split(">") if part.strip()) | ||
| 282 | + candidates.extend(re.split(r"[;;]", raw)) | ||
| 283 | + enriched = [] | ||
| 284 | + for item in candidates: | ||
| 285 | + item = clean_text(item) | ||
| 286 | + if not item: | ||
| 287 | + continue | ||
| 288 | + item = re.sub(r"^(?:\d+[.、)]\s*)+", "", item) | ||
| 289 | + item = re.sub(r"^(?:操作|点击|选择|设置|显示|进入|打开|查看|发送|支持|增加|新增)[::]?\s*", "", item) | ||
| 290 | + item = re.split(r"[,,。]", item, maxsplit=1)[0] | ||
| 291 | + item = re.split(r"\s{2,}", item, maxsplit=1)[0] | ||
| 292 | + item = rewrite_title(item) | ||
| 293 | + if item and not item.startswith(BAD_TITLE_STARTS): | ||
| 294 | + enriched.append(item) | ||
| 295 | + result = [] | ||
| 296 | + seen = set() | ||
| 297 | + for item in enriched: | ||
| 298 | + if item in seen: | ||
| 299 | + continue | ||
| 300 | + seen.add(item) | ||
| 301 | + result.append(item) | ||
| 302 | + return result | ||
| 303 | + | ||
| 304 | + | ||
| 81 | def normalize_rule(text: str) -> str: | 305 | def normalize_rule(text: str) -> str: |
| 82 | text = clean_text(text) | 306 | text = clean_text(text) |
| 83 | text = re.sub(r"^[a-zA-ZivxIVX]+[.、)]\s*", "", text) | 307 | text = re.sub(r"^[a-zA-ZivxIVX]+[.、)]\s*", "", text) |
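The version-prefix and separator handling in `normalize_feature_segments` above can be tried in isolation. The sketch below is a simplified stand-in, not the function from the diff: it reuses the same three regular expressions but skips `clean_text`, `display_feature_scope`, and the container-prefix stripping. The name `strip_version_and_split` is hypothetical.

```python
import re

VERSION_ONLY = re.compile(r"v?\d+(?:\.\d+)+", flags=re.I)


def strip_version_and_split(feature_scope: str) -> list[str]:
    """Normalize path separators, drop version segments, return the parts."""
    text = feature_scope.strip()
    text = re.sub(r"\s*-\s*>\s*", " > ", text)  # "a -> b" becomes "a > b"
    text = re.sub(r"\s*>\s*", " > ", text)      # uniform spacing around ">"
    # Remove a leading version such as "v3.2.1 > " (case-insensitive).
    text = re.sub(r"^v?\d+(?:\.\d+)+(?:\s*>\s*)?", "", text, flags=re.I)
    parts = [part.strip() for part in text.split(">")]
    # Drop empty parts and parts that are nothing but a version number.
    return [p for p in parts if p and not VERSION_ONLY.fullmatch(p)]


with_arrows = strip_version_and_split("v3.2.1 -> 患者管理->分组")
with_inner_version = strip_version_and_split("V2.10 > Income > 1.2 > Withdrawal")
```

Both a leading version and a version-only segment in the middle of the path are dropped, which is what lets the same feature from different releases group under one key.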
| @@ -89,21 +313,40 @@ def normalize_rule(text: str) -> str: | @@ -89,21 +313,40 @@ def normalize_rule(text: str) -> str: | ||
| 89 | 313 | ||
| 90 | 314 | ||
| 91 | def choose_title(feature: str, atoms: list[dict]) -> str: | 315 | def choose_title(feature: str, atoms: list[dict]) -> str: |
| 92 | - candidates = [display_feature_scope(feature)] | 316 | + candidates: list[tuple[str, int]] = [ |
| 317 | + (rewrite_title(feature), 3), | ||
| 318 | + (normalize_feature_key(feature), 2), | ||
| 319 | + (display_feature_scope(feature), 1), | ||
| 320 | + ] | ||
| 93 | for atom in atoms: | 321 | for atom in atoms: |
| 94 | - for raw in (atom.get("A", ""), atom.get("C", "")): | ||
| 95 | - value = display_feature_scope(raw) | ||
| 96 | - if value and value != "未归类功能": | ||
| 97 | - candidates.append(value) | ||
| 98 | - filtered = [] | 322 | + for raw in (atom.get("feature_scope", ""),): |
| 323 | + for value in extract_title_fragments(raw): | ||
| 324 | + if value and value != "未归类功能": | ||
| 325 | + candidates.append((value, 3)) | ||
| 326 | + for raw in (atom.get("C", ""), atom.get("A", ""), atom.get("R", "")): | ||
| 327 | + for value in extract_title_fragments(raw): | ||
| 328 | + if value and value != "未归类功能": | ||
| 329 | + candidates.append((value, 1)) | ||
| 330 | + filtered: list[tuple[str, int]] = [] | ||
| 99 | seen = set() | 331 | seen = set() |
| 100 | - for item in candidates: | 332 | + for item, source_rank in candidates: |
| 101 | if not item or item in seen: | 333 | if not item or item in seen: |
| 102 | continue | 334 | continue |
| 103 | seen.add(item) | 335 | seen.add(item) |
| 104 | - filtered.append(item) | ||
| 105 | - filtered.sort(key=lambda x: (x == "未归类功能", len(x))) | ||
| 106 | - return filtered[0] if filtered else "未归类功能" | 336 | + filtered.append((item, source_rank)) |
| 337 | + if not filtered: | ||
| 338 | + return "未归类功能" | ||
| 339 | + | ||
| 340 | + def score(entry: tuple[str, int]) -> tuple[int, int, int, int, str]: | ||
| 341 | + title, source_rank = entry | ||
| 342 | + title = rewrite_title(title) | ||
| 343 | + good = 1 if is_good_title(title) else 0 | ||
| 344 | + path_bonus = 1 if " > " in title and not any(title.startswith(f"{prefix} >") for prefix in GENERIC_PREFIX_PATTERNS) else 0 | ||
| 345 | + ideal_len = -abs(len(title) - 10) | ||
| 346 | + return (good, source_rank, path_bonus, ideal_len, title) | ||
| 347 | + | ||
| 348 | + filtered.sort(key=score, reverse=True) | ||
| 349 | + return filtered[0][0] | ||
| 107 | 350 | ||
| 108 | 351 | ||
| 109 | def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]: | 352 | def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]: |
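The new `choose_title` ranks candidates with a tuple sort key, so earlier fields dominate later ones: a title that passes the quality check always beats one that does not, and source rank is compared before length. The sketch below shows that ordering with a simplified key. The length-only `good` test is a stand-in for `is_good_title`, and `rank_titles` is a hypothetical name.

```python
def rank_titles(candidates: list[tuple[str, int]]) -> list[str]:
    """Sort (title, source_rank) pairs best-first using a tuple key."""

    def score(entry: tuple[str, int]) -> tuple[int, int, int, int, str]:
        title, source_rank = entry
        good = 1 if 3 <= len(title) <= 40 else 0  # stand-in for is_good_title
        path_bonus = 1 if " > " in title else 0   # prefer "parent > child" titles
        ideal_len = -abs(len(title) - 10)         # closer to 10 characters is better
        return (good, source_rank, path_bonus, ideal_len, title)

    return [title for title, _ in sorted(candidates, key=score, reverse=True)]


ranked = rank_titles(
    [
        ("OK", 3),                     # too short, fails the quality stand-in
        ("Coupon list", 1),
        ("Withdrawal > Tax mode", 1),
        ("My coupons", 3),
    ]
)
```

Here `"OK"` comes last despite its high source rank because the quality flag is compared first, and among the two rank-1 titles the one with a path separator wins before length is considered.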
| @@ -124,16 +367,69 @@ def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]: | @@ -124,16 +367,69 @@ def sample_product_rules(atoms: list[dict], limit: int = 3) -> list[str]: | ||
| 124 | return rules | 367 | return rules |
| 125 | 368 | ||
| 126 | 369 | ||
| 370 | +def collect_rule_entries(atoms: list[dict]) -> list[dict]: | ||
| 371 | + entries = [] | ||
| 372 | + seen = set() | ||
| 373 | + for atom in sorted( | ||
| 374 | + atoms, | ||
| 375 | + key=lambda x: ( | ||
| 376 | + version_key(x.get("app_version", "")), | ||
| 377 | + x.get("atom_type", ""), | ||
| 378 | + x.get("merge_fingerprint", ""), | ||
| 379 | + x.get("R", ""), | ||
| 380 | + x.get("A", ""), | ||
| 381 | + ), | ||
| 382 | + ): | ||
| 383 | + for raw in (atom.get("R", ""), atom.get("A", ""), atom.get("canon_text", "")): | ||
| 384 | + text = normalize_rule(raw) | ||
| 385 | + if not text or len(text) < 2: | ||
| 386 | + continue | ||
| 387 | + if text in GENERIC_RESULTS: | ||
| 388 | + continue | ||
| 389 | + key = ( | ||
| 390 | + atom.get("app_version", ""), | ||
| 391 | + atom.get("atom_type", ""), | ||
| 392 | + text, | ||
| 393 | + ) | ||
| 394 | + if key in seen: | ||
| 395 | + continue | ||
| 396 | + seen.add(key) | ||
| 397 | + entries.append( | ||
| 398 | + { | ||
| 399 | + "version": atom.get("app_version", "") or "未知版本", | ||
| 400 | + "source": atom.get("atom_type", "") or "unknown", | ||
| 401 | + "text": text, | ||
| 402 | + } | ||
| 403 | + ) | ||
| 404 | + break | ||
| 405 | + return entries | ||
| 406 | + | ||
| 407 | + | ||
| 127 | def group_product_features(master_atoms: list[dict]) -> dict[str, dict]: | 408 | def group_product_features(master_atoms: list[dict]) -> dict[str, dict]: |
| 128 | grouped: dict[str, dict] = {} | 409 | grouped: dict[str, dict] = {} |
| 129 | by_feature: dict[str, list[dict]] = defaultdict(list) | 410 | by_feature: dict[str, list[dict]] = defaultdict(list) |
| 130 | for atom in master_atoms: | 411 | for atom in master_atoms: |
| 131 | if atom.get("atom_type") not in {"doc_rule", "definition", "rule", "case_rule"}: | 412 | if atom.get("atom_type") not in {"doc_rule", "definition", "rule", "case_rule"}: |
| 132 | continue | 413 | continue |
| 133 | - by_feature[atom.get("feature_scope", "未归类功能")].append(atom) | 414 | + normalized_feature = normalize_feature_key(atom.get("feature_scope", "未归类功能")) |
| 415 | + by_feature[normalized_feature].append(atom) | ||
| 134 | 416 | ||
| 135 | for feature, atoms in by_feature.items(): | 417 | for feature, atoms in by_feature.items(): |
| 136 | - modules = sorted({m for atom in atoms for m in atom.get("modules", []) if m}) | 418 | + modules = sorted( |
| 419 | + { | ||
| 420 | + normalized | ||
| 421 | + for atom in atoms | ||
| 422 | + for normalized in [normalize_module(atom.get("primary_module", ""))] | ||
| 423 | + if normalized | ||
| 424 | + } | ||
| 425 | + | { | ||
| 426 | + normalized | ||
| 427 | + for atom in atoms | ||
| 428 | + for module in atom.get("modules", []) | ||
| 429 | + for normalized in [normalize_module(module)] | ||
| 430 | + if normalized | ||
| 431 | + } | ||
| 432 | + ) | ||
| 137 | primary = [a for a in atoms if a.get("atom_type") in {"doc_rule", "definition"}] | 433 | primary = [a for a in atoms if a.get("atom_type") in {"doc_rule", "definition"}] |
| 138 | supplement = [a for a in atoms if a.get("atom_type") in {"rule", "case_rule"}] | 434 | supplement = [a for a in atoms if a.get("atom_type") in {"rule", "case_rule"}] |
| 139 | versions = sorted({a.get("app_version", "") for a in atoms if a.get("app_version")}, key=version_key) | 435 | versions = sorted({a.get("app_version", "") for a in atoms if a.get("app_version")}, key=version_key) |
| @@ -243,13 +539,13 @@ def render_versions(product_features: dict[str, dict]) -> str: | @@ -243,13 +539,13 @@ def render_versions(product_features: dict[str, dict]) -> str: | ||
| 243 | "", | 539 | "", |
| 244 | ] | 540 | ] |
| 245 | items = sorted(product_features.values(), key=lambda x: (-len(x["versions"]), x["title"].lower())) | 541 | items = sorted(product_features.values(), key=lambda x: (-len(x["versions"]), x["title"].lower())) |
| 246 | - for item in items[:220]: | 542 | + for item in items: |
| 247 | lines.append(f"## {item['title']}") | 543 | lines.append(f"## {item['title']}") |
| 248 | lines.append("") | 544 | lines.append("") |
| 249 | lines.append(f"- 模块:{', '.join(item['modules'])}") | 545 | lines.append(f"- 模块:{', '.join(item['modules'])}") |
| 250 | lines.append(f"- 版本:{', '.join(item['versions']) or '无'}") | 546 | lines.append(f"- 版本:{', '.join(item['versions']) or '无'}") |
| 251 | - lines.append(f"- 主事实样例:{';'.join(sample_product_rules(item['primary'], 2)) or '无'}") | ||
| 252 | - lines.append(f"- 补充样例:{';'.join(sample_product_rules(item['supplement'], 2)) or '无'}") | 547 | + lines.append(f"- 主事实数:{len(collect_rule_entries(item['primary']))}") |
| 548 | + lines.append(f"- 补充事实数:{len(collect_rule_entries(item['supplement']))}") | ||
| 253 | lines.append("") | 549 | lines.append("") |
| 254 | return "\n".join(lines) | 550 | return "\n".join(lines) |
| 255 | 551 | ||
| @@ -345,17 +641,33 @@ def render_module_file(module: str, items: list[dict], code_bucket: dict[str, li | @@ -345,17 +641,33 @@ def render_module_file(module: str, items: list[dict], code_bucket: dict[str, li | ||
| 345 | lines.append(f"- 约束样例:{';'.join(constraint_samples)}") | 641 | lines.append(f"- 约束样例:{';'.join(constraint_samples)}") |
| 346 | lines.extend(["", "## 主题清单", ""]) | 642 | lines.extend(["", "## 主题清单", ""]) |
| 347 | 643 | ||
| 348 | - for item in sorted(items, key=feature_rank)[:90]: | 644 | + for item in sorted(items, key=feature_rank): |
| 349 | lines.append(f"### {item['title']}") | 645 | lines.append(f"### {item['title']}") |
| 350 | lines.append("") | 646 | lines.append("") |
| 351 | if item["touchpoints"]: | 647 | if item["touchpoints"]: |
| 352 | lines.append(f"- 触点:{', '.join(item['touchpoints'])}") | 648 | lines.append(f"- 触点:{', '.join(item['touchpoints'])}") |
| 353 | if item["versions"]: | 649 | if item["versions"]: |
| 354 | lines.append(f"- 涉及版本:{', '.join(item['versions'])}") | 650 | lines.append(f"- 涉及版本:{', '.join(item['versions'])}") |
| 355 | - primary_rules = sample_product_rules(item["primary"], 3) | ||
| 356 | - supplement_rules = sample_product_rules(item["supplement"], 3) | ||
| 357 | - lines.append(f"- 产品主事实:{';'.join(primary_rules) or '无'}") | ||
| 358 | - lines.append(f"- 交互/测试补充:{';'.join(supplement_rules) or '无'}") | 651 | + primary_entries = collect_rule_entries(item["primary"]) |
| 652 | + supplement_entries = collect_rule_entries(item["supplement"]) | ||
| 653 | + lines.append(f"- 主事实条数:{len(primary_entries)}") | ||
| 654 | + lines.append(f"- 补充事实条数:{len(supplement_entries)}") | ||
| 655 | + lines.append("") | ||
| 656 | + lines.append("#### 产品主事实") | ||
| 657 | + lines.append("") | ||
| 658 | + if primary_entries: | ||
| 659 | + for entry in primary_entries: | ||
| 660 | + lines.append(f"- [{entry['version']}] {entry['text']}") | ||
| 661 | + else: | ||
| 662 | + lines.append("- 无") | ||
| 663 | + lines.append("") | ||
| 664 | + lines.append("#### 交互/测试补充") | ||
| 665 | + lines.append("") | ||
| 666 | + if supplement_entries: | ||
| 667 | + for entry in supplement_entries: | ||
| 668 | + lines.append(f"- [{entry['version']}] {entry['text']}") | ||
| 669 | + else: | ||
| 670 | + lines.append("- 无") | ||
| 359 | lines.append("") | 671 | lines.append("") |
| 360 | return "\n".join(lines) | 672 | return "\n".join(lines) |
| 361 | 673 |
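One detail of `collect_rule_entries` is easy to miss when reading the diff: the duplicate check sits inside the loop over `R`, `A`, and `canon_text`, so an atom whose `R` text was already emitted falls through to its `A` text instead of being dropped. The sketch below reproduces only that control flow; it omits the sorting and `normalize_rule` steps, and `collect_texts` is a hypothetical name.

```python
def collect_texts(atoms: list[dict], generic: set[str]) -> list[tuple[str, str]]:
    """Return (version, text) pairs, at most one per atom, deduplicated."""
    entries = []
    seen = set()
    for atom in atoms:
        version = atom.get("app_version", "")
        for field in ("R", "A", "canon_text"):
            text = atom.get(field, "").strip()
            if len(text) < 2 or text in generic:
                continue  # unusable text: try the next field
            key = (version, atom.get("atom_type", ""), text)
            if key in seen:
                continue  # duplicate: also falls through to the next field
            seen.add(key)
            entries.append((version or "未知版本", text))
            break  # at most one entry per atom
    return entries


atoms = [
    {"app_version": "1.0", "atom_type": "rule", "R": "成功", "A": "Tap submit"},
    {"app_version": "1.0", "atom_type": "rule", "R": "Tap submit", "A": "Show toast"},
    {"app_version": "", "atom_type": "rule", "R": "Tap submit"},
]
collected = collect_texts(atoms, generic={"成功"})
```

The second atom contributes `"Show toast"` rather than nothing, and the third is kept because the dedupe key includes the version, so the same text under a different (here missing) version is a separate entry.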
| @@ -17,10 +17,22 @@ Use this skill when the task is to continue maintaining the repository at `产品 | @@ -17,10 +17,22 @@ Use this skill when the task is to continue maintaining the repository at `产品 | ||
| 17 | - new high-priority reference | 17 | - new high-priority reference |
| 18 | - backend repo update | 18 | - backend repo update |
| 19 | - full rebuild | 19 | - full rebuild |
| 20 | + - generator or export-rule update | ||
| 20 | 5. Run the matching script: | 21 | 5. Run the matching script: |
| 21 | - version rebuild: `bash scripts/rebuild_version_kb.sh <version> [backend_repo]` | 22 | - version rebuild: `bash scripts/rebuild_version_kb.sh <version> [backend_repo]` |
| 22 | - full rebuild: `bash scripts/rebuild_all_kb.sh [backend_repo]` | 23 | - full rebuild: `bash scripts/rebuild_all_kb.sh [backend_repo]` |
| 23 | - Dify import pack only: `python3 scripts/build_dify_import_pack.py` | 24 | - Dify import pack only: `python3 scripts/build_dify_import_pack.py` |
| 25 | + - if any generator / export / title-normalization logic changed, rebuild at least: | ||
| 26 | + - `python3 scripts/build_usable_knowledge_pack.py` | ||
| 27 | + - `python3 scripts/build_dify_import_pack.py` | ||
| 28 | + | ||
| 29 | +## Documentation Sync | ||
| 30 | + | ||
| 31 | +After any change to scripts, output structure, title-normalization logic, or maintenance behavior: | ||
| 32 | + | ||
| 33 | +1. Update the matching docs under `docs/`. | ||
| 34 | +2. Update this skill file if the workflow or rules changed. | ||
| 35 | +3. Treat doc and skill sync as mandatory follow-up work, not an optional reminder. | ||
| 24 | 36 | ||
| 25 | ## File placement rules | 37 | ## File placement rules |
| 26 | 38 | ||
| @@ -61,8 +73,9 @@ After any update: | @@ -61,8 +73,9 @@ After any update: | ||
| 61 | 73 | ||
| 62 | 1. Check `dist/dify_import/`, `dist/backend_code/`, `dist/final_kb/`. | 74 | 1. Check `dist/dify_import/`, `dist/backend_code/`, `dist/final_kb/`. |
| 63 | 2. Check `dist/quality/atom_quality_summary.md`. | 75 | 2. Check `dist/quality/atom_quality_summary.md`. |
| 64 | -3. Run Dify retrieval tests using the examples in `references/validation-queries.md`. | ||
| 65 | -4. After version updates, remind the user to sync the Feishu docs entry pages and version overview. | 76 | +3. If `build_usable_knowledge_pack.py` changed, verify the module files are still complete expanded knowledge files rather than truncated summaries. |
| 77 | +4. Run Dify retrieval tests using the examples in `references/validation-queries.md`. | ||
| 78 | +5. After version updates, remind the user to sync the Feishu docs entry pages and version overview. | ||
| 66 | 79 | ||
| 67 | ## Notes | 80 | ## Notes |
| 68 | 81 |