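# aliyun-aps-sync

A Node.js sync tool for Alibaba Cloud APS.

The main flow is now unified as follows:

- Playwright scrapes the APS pages
- `current/history/delta/checkpoints/errors` data is kept locally
- Records are written directly to MySQL during the sync
- The scheduled task runs the daily incremental sync by default
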
## Sync scope

- customers
- customerDetails
- orders
- orderDetails
- bills
- messages

## Modes

### Full mode

Run:

```bash
npm run sync
```

To have the full sync continue from an existing checkpoint (covers customers / customerDetails / orders / orderDetails / bills / messages):

```bash
npm run sync -- --resume
```

Behavior:

- Fetches all customers and customerDetails
- Fetches orders / orderDetails / bills / messages
- Writes to the database during the sync

### Incremental mode

Run:

```bash
npm run incremental
```

Behavior:

- Does not fetch customers
- Fetches orders / orderDetails / bills / messages
- Uses the database watermark plus an overlap as the incremental window

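### Hot mode

Run:

```bash
npm run hot
```

Behavior:

- Each run fetches only **today's orders**
- Scanning starts from the first page of the order list
- The order list scan stops early based on "consecutive stable rows / consecutive stable pages / maximum pages"
- Order details are fetched only for: new orders, orders whose list fields changed, orders missing details, and non-final orders that have reached their fallback refresh time
- Messages are fetched starting from the latest timestamp in the database minus an overlap in minutes, and the scan stops early once it reaches old pages
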
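
The early-stop rule for the order list can be sketched roughly as below. This is a minimal illustration, not the tool's actual code: `scanOrderPages`, the `changed` flag, and the returned `reason` values are hypothetical names, and the pages are passed in as an array for simplicity, whereas a real scan would fetch them one at a time. The default limits mirror the recommended `ALIYUN_APS_HOT_ORDER_*` values in this README.

```javascript
// Minimal sketch (hypothetical names): when may a hot-mode order scan stop?
// Each page is an array of rows; `changed` is true for a new or modified row.
function scanOrderPages(pages, limits = {}) {
  const { stableRows = 100, stablePages = 2, maxPages = 20 } = limits;
  const changedIds = [];
  let stableRowRun = 0; // consecutive unchanged rows, carried across pages
  let stablePageRun = 0; // consecutive pages without any changed row

  for (let i = 0; i < pages.length; i += 1) {
    if (i >= maxPages) {
      return { changedIds, pagesScanned: i, reason: 'max-pages' };
    }
    let pageHasChange = false;
    for (const row of pages[i]) {
      if (row.changed) {
        changedIds.push(row.id);
        pageHasChange = true;
        stableRowRun = 0;
      } else {
        stableRowRun += 1;
      }
    }
    stablePageRun = pageHasChange ? 0 : stablePageRun + 1;

    if (stableRowRun >= stableRows) {
      return { changedIds, pagesScanned: i + 1, reason: 'stable-rows' };
    }
    if (stablePageRun >= stablePages) {
      return { changedIds, pagesScanned: i + 1, reason: 'stable-pages' };
    }
  }
  return { changedIds, pagesScanned: pages.length, reason: 'end-of-list' };
}
```

Whichever limit is reached first ends the scan, so a quiet period stops after a couple of unchanged pages while a busy period is still capped by the page maximum.
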
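Use cases:

- High-frequency tracking of today's orders during the day
- Order volume is large and you do not want to rescan every page of today's orders every 5 minutes
- You need to balance detail completeness against scraping efficiency
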
## Login

```bash
npm run login
```

This automatically verifies access to:

- My Customers (我的客户)
- Bill Query (账单查询)

and saves the login state to:

- `.browser/`
- `.browser/storage-state.json`

## Bills

### Fetch bills only

```bash
npm run bills
```

### Resume fetching bills from the latest checkpoint

```bash
npm run bills -- --resume
```

## Orders

Sync orders only:

```bash
npm run orders
```

Note: this command fetches both:

- orders (the order list)
- orderDetails (the order details)

Incremental order sync:

```bash
npm run orders -- --incremental
```

Resume orders from a checkpoint:

```bash
npm run orders -- --resume
```

## Messages

Fetch messages only:

```bash
npm run messages
```

## High-frequency sync

Run one high-frequency sync manually:

```bash
npm run hot
```

If PowerShell blocks `npm.ps1`, run the entry script directly:

```bash
node src/index.js hot
```

Notes:

- `hot` covers only today's orders, order details, and messages
- It does not fetch customer / customerDetails / bills
- It is suited to high-frequency polling during working hours

## Scheduled tasks

```bash
npm run schedule
```

By default the scheduler follows this `.env` setting:

```env
ALIYUN_APS_SCHEDULE_MODE=incremental
```

and runs the daily incremental sync.

To run the high-frequency sync every 5 minutes, set:

```env
ALIYUN_APS_SCHEDULE_MODE=hot
ALIYUN_APS_HOT_CRON=*/5 * * * *
```

Then run:

```bash
npm run schedule
```

Notes:

- `incremental`: fetches orders / orderDetails / bills / messages using the existing incremental strategy
- `full`: runs the full sync strategy
- `hot`: each round fetches only today's orders / orderDetails / messages
- Hot mode has a built-in task lock; if the previous round has not finished, the next round is skipped to avoid overlapping runs

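
The "skip the next round while the previous one is still running" behavior can be illustrated with a small in-process lock. This is only a sketch of the general technique under the assumption that the scheduler and the job share one Node.js process; `createLockedRunner` is a hypothetical name, and the tool's real lock may work differently (for example via a lock file).

```javascript
// Minimal sketch (hypothetical helper): wrap a job so overlapping runs are skipped.
function createLockedRunner(task) {
  let running = false; // true while a round is in progress

  return async function runOnce() {
    if (running) {
      // The previous round has not finished: skip this tick instead of queueing it.
      return { skipped: true };
    }
    running = true;
    try {
      return { skipped: false, result: await task() };
    } finally {
      running = false; // always release the lock, even if the task throws
    }
  };
}
```

A cron callback would call the returned `runOnce` on every tick; ticks that arrive mid-run resolve immediately with `{ skipped: true }`.
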
## Incremental window

### orders

Determined by the maximum `aps_order.order_time` in the database, moved back by:

```env
ALIYUN_APS_ORDER_INCREMENTAL_OVERLAP_DAYS=2
```

### bills

Determined by the maximum `aps_bill.consumption_time` in the database, moved back by:

```env
ALIYUN_APS_BILL_INCREMENTAL_OVERLAP_DAYS=7
```

### messages

Determined by the maximum `aliyun_aps_messages.gmt_modified/gmt_created` in the database, moved back by:

```env
ALIYUN_APS_MESSAGE_INCREMENTAL_OVERLAP_DAYS=7
```

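
As a rough illustration of "watermark minus overlap", the start of an incremental window could be computed as below. The function name `incrementalWindowStart` is hypothetical, and returning `null` when no watermark exists is an assumption of this sketch; the README does not say how the tool behaves on an empty table.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Minimal sketch (hypothetical helper).
// watermark: Date of the newest stored record, e.g. MAX(aps_order.order_time).
// overlapDays: the matching *_INCREMENTAL_OVERLAP_DAYS value.
function incrementalWindowStart(watermark, overlapDays) {
  if (!(watermark instanceof Date) || Number.isNaN(watermark.getTime())) {
    return null; // assumption: no usable watermark, the caller decides what to do
  }
  // Step back by the overlap so late-arriving or updated records are re-read.
  return new Date(watermark.getTime() - overlapDays * DAY_MS);
}

// Example: a watermark of 2024-05-10 with a 2-day overlap starts at 2024-05-08.
const start = incrementalWindowStart(new Date('2024-05-10T00:00:00Z'), 2);
console.log(start.toISOString()); // 2024-05-08T00:00:00.000Z
```

The overlap trades some repeated fetching for safety: records that changed shortly before the watermark are fetched again rather than missed.
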
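## Hot mode configuration

Recommended configuration:

```env
ALIYUN_APS_SCHEDULE_MODE=hot
ALIYUN_APS_HOT_CRON=*/5 * * * *
ALIYUN_APS_HOT_MESSAGE_OVERLAP_MINUTES=15
ALIYUN_APS_HOT_ORDER_STABLE_THRESHOLD=100
ALIYUN_APS_HOT_ORDER_STABLE_PAGE_THRESHOLD=2
ALIYUN_APS_HOT_ORDER_MAX_PAGES=20
ALIYUN_APS_HOT_MESSAGE_MAX_PAGES=10
ALIYUN_APS_HOT_ORDER_DETAIL_REFRESH_MINUTES=30
ALIYUN_APS_HOT_FINAL_STATUSES=已完成,已关闭,已取消,已退款完成
```

Meaning:

- `ALIYUN_APS_HOT_CRON`: cron expression for the high-frequency task; defaults to every 5 minutes
- `ALIYUN_APS_HOT_MESSAGE_OVERLAP_MINUTES`: how many minutes hot mode rescans backwards for messages
- `ALIYUN_APS_HOT_ORDER_STABLE_THRESHOLD`: stop the order scan after this many consecutive stable rows
- `ALIYUN_APS_HOT_ORDER_STABLE_PAGE_THRESHOLD`: stop the order scan after this many consecutive pages with no new or changed orders
- `ALIYUN_APS_HOT_ORDER_MAX_PAGES`: maximum number of order pages scanned per round, so that a peak-hour run does not take too long
- `ALIYUN_APS_HOT_MESSAGE_MAX_PAGES`: maximum number of message pages scanned per round
- `ALIYUN_APS_HOT_ORDER_DETAIL_REFRESH_MINUTES`: fallback refresh interval for the details of non-final orders
- `ALIYUN_APS_HOT_FINAL_STATUSES`: order statuses treated as final; unchanged final orders skip the detail fetch where possible

Default strategy:

- Orders are scanned from newest to oldest
- New orders and orders whose list fields changed go through detail fetching
- Final-status orders that were already fetched and have not changed skip detail fetching
- Non-final orders have their details re-fetched periodically, according to the fallback refresh interval
- Messages use watermark + overlap so that 5-minute polling does not miss messages at the window boundary
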
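
The detail-fetch rules above can be read as a single decision function. The sketch below is illustrative only: `shouldFetchDetail`, the `listHash` fingerprint, and the `detailFetchedAt` field are hypothetical names that do not come from the project, and the order of the checks is an assumption.

```javascript
// Minimal sketch (hypothetical field names) of the hot-mode detail decision.
// listRow: the row just scraped from the order list.
// stored:  what is already saved for that order, or undefined if it is new.
// Returns a reason string when details should be fetched, or null to skip.
function shouldFetchDetail(listRow, stored, options) {
  const { finalStatuses, refreshMinutes, now = new Date() } = options;

  if (!stored) return 'new';
  if (stored.listHash !== listRow.listHash) return 'changed';
  if (!stored.detailFetchedAt) return 'missing-detail';

  // Unchanged order in a final status: nothing more is expected to happen.
  if (finalStatuses.includes(listRow.status)) return null;

  // Non-final order: refresh once the fallback interval has elapsed.
  const ageMinutes = (now - stored.detailFetchedAt) / 60000;
  return ageMinutes >= refreshMinutes ? 'refresh-due' : null;
}
```

Here `finalStatuses` would come from splitting `ALIYUN_APS_HOT_FINAL_STATUSES` on commas and `refreshMinutes` from `ALIYUN_APS_HOT_ORDER_DETAIL_REFRESH_MINUTES`.
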
## Database configuration

`.env` must contain:

```env
ALIYUN_APS_SOURCE_ID=default
ALIYUN_APS_DB_HOST=
ALIYUN_APS_DB_PORT=3306
ALIYUN_APS_DB_USER=
ALIYUN_APS_DB_PASSWORD=
ALIYUN_APS_DB_NAME=
ALIYUN_APS_DB_CHARSET=utf8mb4
ALIYUN_APS_DB_CONNECTION_LIMIT=5
```

### Multiple accounts: source_id

If two APS accounts write to the same database, each account must be configured with a different `ALIYUN_APS_SOURCE_ID`:

```env
# Account A
ALIYUN_APS_SOURCE_ID=aliyun_account_a

# Account B
ALIYUN_APS_SOURCE_ID=aliyun_account_b
```

When writing to the database, the sync stores `source_id` in:

- `aps_customer`
- `aps_order`
- `aps_order_detail`
- `aps_bill`
- `aliyun_aps_messages`

Incremental watermarks are also queried per `source_id`, so the two accounts do not affect each other.

It is recommended that the two accounts use separate project directories, or separate `data` / `.browser` directories, so that local login state and checkpoints do not overwrite each other.

For production databases, it is recommended to change the unique keys to `source_id + business unique key`, for example:

```sql
-- Example only; the actual constraint names depend on the production database
-- aps_order: UNIQUE(source_id, order_id)
-- aps_order_detail: UNIQUE(source_id, order_id)
-- aliyun_aps_messages: UNIQUE(source_id, msg_id)
```

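
To illustrate what "watermarks are queried per `source_id`" might look like, here is a hedged sketch that builds a parameterized query. The helper `buildWatermarkQuery` and its lookup table are hypothetical; the table and column names are the ones mentioned in this README, and the assumption that each table has a column literally named `source_id` follows from the list above. Messages are left out because the README names two candidate columns for them.

```javascript
// Tables and time columns taken from the "Incremental window" section.
const WATERMARK_COLUMNS = {
  aps_order: 'order_time',
  aps_bill: 'consumption_time',
};

// Minimal sketch (hypothetical helper): build a per-account watermark query.
function buildWatermarkQuery(table, sourceId) {
  const column = WATERMARK_COLUMNS[table];
  if (!column) {
    // Only allow known identifiers; they are concatenated into the SQL text.
    throw new Error('Unknown watermark table: ' + table);
  }
  return {
    sql: 'SELECT MAX(' + column + ') AS watermark FROM ' + table + ' WHERE source_id = ?',
    params: [sourceId], // the account id is passed as a bound parameter
  };
}

console.log(buildWatermarkQuery('aps_order', 'aliyun_account_a').sql);
// SELECT MAX(order_time) AS watermark FROM aps_order WHERE source_id = ?
```

Filtering by `source_id` means account B's newer orders cannot push account A's window forward.
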
## Browser configuration

Google Chrome is no longer required by default.

Optional settings:

```env
ALIYUN_APS_BROWSER_MODE=launch
ALIYUN_APS_BROWSER_CHANNEL=
ALIYUN_APS_BROWSER_EXECUTABLE_PATH=
ALIYUN_APS_CDP_URL=http://127.0.0.1:9222
```

Notes:

- `ALIYUN_APS_BROWSER_MODE=launch`: Playwright starts the browser itself.
- `ALIYUN_APS_BROWSER_MODE=cdp`: attach to a Chrome/Edge instance that you opened manually.
- `ALIYUN_APS_BROWSER_CHANNEL` and `ALIYUN_APS_BROWSER_EXECUTABLE_PATH` both left empty: use the Chromium bundled with Playwright.
- `ALIYUN_APS_BROWSER_CHANNEL=chrome`: use the locally installed Chrome.
- `ALIYUN_APS_BROWSER_CHANNEL=msedge`: use the locally installed Edge.
- `ALIYUN_APS_BROWSER_EXECUTABLE_PATH=...`: specify the path to a local browser executable.

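
One way to picture how these settings combine is a small resolver that turns the environment into a launch plan. This is a sketch, not the project's code: `resolveBrowserPlan` is a hypothetical name, defaulting to `launch` and to the CDP URL shown above is an assumption, and so is preferring the executable path when both it and a channel are set. The plan maps onto Playwright's `chromium.launch(options)` and `chromium.connectOverCDP(endpointURL)`, which are not called here so that the example runs without Playwright installed.

```javascript
// Minimal sketch (hypothetical helper): map env settings to a browser plan.
function resolveBrowserPlan(env) {
  const mode = env.ALIYUN_APS_BROWSER_MODE || 'launch'; // assumed default

  if (mode === 'cdp') {
    // Attach to a browser that is already running with remote debugging on.
    return {
      method: 'connectOverCDP',
      endpointURL: env.ALIYUN_APS_CDP_URL || 'http://127.0.0.1:9222',
    };
  }

  const options = {};
  if (env.ALIYUN_APS_BROWSER_EXECUTABLE_PATH) {
    // Assumption of this sketch: an explicit path wins over a channel.
    options.executablePath = env.ALIYUN_APS_BROWSER_EXECUTABLE_PATH;
  } else if (env.ALIYUN_APS_BROWSER_CHANNEL) {
    options.channel = env.ALIYUN_APS_BROWSER_CHANNEL; // e.g. 'chrome' or 'msedge'
  }
  // Empty options: Playwright falls back to its bundled Chromium.
  return { method: 'launch', options };
}
```

With Playwright available, the caller would branch on `plan.method` and pass `plan.options` or `plan.endpointURL` to the matching `chromium` call.
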
### Open Chrome manually, then let the script attach

If Alibaba Cloud's risk control requires you to pass the slider CAPTCHA manually, switch to:

```env
ALIYUN_APS_BROWSER_MODE=cdp
ALIYUN_APS_CDP_URL=http://127.0.0.1:9222
```

Then start the browser yourself:

```powershell
chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\temp\aps-manual-profile"
```

After logging in and passing the CAPTCHA in the browser, run:

```bash
npm run sync
```

or:

```bash
npm run bills -- --resume
```

In attach mode the script does not close the browser you opened manually.

## Email alerts

On any runtime exception the tool attempts to:

- Save the error context as JSON
- Take a screenshot of the current page
- Send an alert email

`.env` settings:

```env
ALIYUN_APS_SMTP_HOST=
ALIYUN_APS_SMTP_PORT=465
ALIYUN_APS_SMTP_SECURE=true
ALIYUN_APS_SMTP_USER=
ALIYUN_APS_SMTP_PASS=
ALIYUN_APS_NOTIFY_EMAIL=
```

Error file directory:

```text
data/errors/<dataset>/
```

## Local data directories

```text
data/current/
data/history/
data/delta/
data/checkpoints/
data/runs/
data/errors/
```

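## Customer status rules

- When a full sync fetches a customer, it is written with `active=1` by default
- If a message is clearly identified as a "release" (释放), the customer is marked:
  - `active=0`
  - `status='released'`
- If a message is clearly identified as "associated / registration succeeded / new customer / customer bound" (关联/报备成功/新增客户/绑定客户), the customer is restored to:
  - `active=1`
  - `status='active'`
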
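
The message-driven rules above could be approximated with simple keyword matching, as sketched below. This is an assumption-heavy illustration: `customerStatusFromMessage` is a hypothetical name, plain substring matching may be looser than the "clearly identified" check the tool really performs, and letting a release keyword win when both kinds appear is a choice made only for this sketch. The keywords are kept in Chinese because they are matched against APS message text.

```javascript
// Keywords quoted from the rules above.
const RELEASE_KEYWORDS = ['释放'];
const RESTORE_KEYWORDS = ['关联', '报备成功', '新增客户', '绑定客户'];

// Minimal sketch (hypothetical helper): derive a customer status update
// from a message body, or return null when the message says nothing about it.
function customerStatusFromMessage(text) {
  if (RELEASE_KEYWORDS.some((keyword) => text.includes(keyword))) {
    return { active: 0, status: 'released' };
  }
  if (RESTORE_KEYWORDS.some((keyword) => text.includes(keyword))) {
    return { active: 1, status: 'active' };
  }
  return null; // unrelated message: leave the customer row untouched
}
```

Returning `null` for unrelated messages keeps the default `active=1` written by the full sync in place.
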
## Installation

```bash
npm install
```

## Runtime hotkeys

- `F7` pause
- `F8` resume
- `F9` terminate

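## Notes

- The Python database import script is no longer a dependency of the main flow.
- bills still supports checkpoint/resume.
- messages are currently fetched by paging through the list; if the page later turns out to need detail fetching, a detail flow will be added.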