一、一次å‰ç¼†æŒ–断让我们断了3小时2018年,我们在杭州有两个机房,有一天施工队挖断了连通两个机房的å‰ç¼†ã€‚那一瞬间,所有跨机房的服务调用å¨éƒ¨å¤±è´¥ï¼Œè®¢å•系统、支付系统å¨éƒ¨å´©æºƒã€‚后来我们花了3个小时才恢复,但用户流失率直接飙升至30%。从那以后,我们开始认真考虑异地多活架构,不再把鸡蛋放在一个篮子里。二、异地多活架构概述2.1 多活形态多活形态对比: 1. 主备(Active-Standby) - 主机房处理所有流量 - å¤‡æœºæˆ¿ä» åšå¤‡ä»½ - 切换时需要时间恢复 2. 同城双活(Active-Active) - 两个机房同时处理流量 - 延迟1msï¼Œç”¨æˆ·æ— æ„ŸçŸ¥ - 成本较高 3. 异地多活(Multi-Region) - 多个地域同时处理流量 - 延迟3-10ms,需要优化 - 成本最高,可用性最高 4. å ¨çƒå¤šæ´» - è¦†ç›–å ¨çƒç”¨æˆ· - 智能DNS就近访问 - 数据同步挑战大2.2 架构设计原则┌─────────────────────────────────────────────────────────────────┐ │ 异地多活设计原则 │ │ │ │ 1. å•å ƒåŒ– │ │ - ä¸šåŠ¡æŒ‰å•å ƒåˆ’åˆ† │ │ - å•å ƒå† é—­çŽ¯ï¼Œå‡å°‘è·¨å•å ƒä¾èµ– │ │ │ │ 2. 数据分区 │ │ - 按用户ID或地区划分数据 │ │ - æ¯ä¸ªå•å ƒè´Ÿè´£è‡ªå·±çš„æ•°æ® │ │ │ │ 3. åŒåŸŽä¼˜å ˆ │ │ - ä¼˜å ˆè®¿é—®åŒåŸŽå•å ƒ │ │ - 减少跨地域延迟 │ │ │ │ 4. 最终一致性 │ │ - å è®¸çŸ­æš‚æ•°æ®ä¸ä¸€è‡´ │ │ - 通过异步同步修复 │ │ │ │ 5. æ• éšœè‡ªæ„ˆ │ │ - è‡ªåŠ¨æ£€æµ‹æ• éšœ │ │ - 自动切换流量 │ │ │ └──────────────────────────────────────────────────────────────────┘三、数据同步方案3.1 MySQL主从同步/** * 跨机房数据同步服务 */ServiceSlf4jpublicclassDataSyncService{/** * 实时同步(Canal) */publicvoidsyncWithCanal(){// Canaléç½®CanalConnectorconnectorCanalConnectors.newSingleConnector(newInetSocketAddress(127.0.0.1,11111),example,,);connector.connect();connector.subscribe(.*\\..*);connectorrollback();while(running){Messagemessageconnector.get(100);for(CanalEntry.Entryentry:message.getEntries()){if(entry.getEntryType()CanalEntry.EntryType.ROWDATA){// 处理变更数据CanalEntry.RowChangerowChangeCanalEntry.RowChange.parseFrom(entry.getStoreValue());for(CanalEntry.RowDatarowData:rowChange.getRowDatasList()){handleChange(entry.getHeader(),rowData);}}}}}/** * 处理变更数据 */privatevoidhandleChange(CanalEntry.Headerheader,CanalEntry.RowDatarowData){StringtableNameheader.getTableName();StringeventTypeheader.getEventType().name();log.info(数据变更: table{}, type{},tableName,eventType);switch(tableName){caseorders:syncOrder(rowData,eventType);break;caseproducts:syncProduct(rowData,eventType);break;// ... å¶ä»–表}}/** * åŒæ­¥åˆ°å ¶ä»–æœºæˆ¿ */privatevoidsyncToRemote(Stringtable,MapString,Objectdata,Stringtype){// 发送到消息队列messageProducer.send(data-sync,SyncMessage.builder().table(table).type(type).data(data).timestamp(System.currentTimeMillis()).source机房(hz-primary).build());}}3.2 Redis同步# Redis Cluster跨机房éç½®# 方案1:主从复制cluster-enabled yes cluster-config-file nodes.conf cluster-node-timeout 15000# 机房Areplicaof 10.0.1.1 6379# 机房Breplicaof 10.0.2.1 6379# 方案2:双写# 应用层同时写å¥ä¸¤ä¸ªæœºæˆ¿/** * Redis双写服务 */ServiceSlf4jpublicclassRedisDualWriteService{AutowiredprivateRedisTemplateString,ObjectredisTemplateA;AutowiredprivateRedisTemplateString,ObjectredisTemplateB;/** * 双写 */publicvoiddualWrite(Stringkey,Objectvalue){ListFutureBooleanfuturesnewArrayList();// å¼‚æ­¥å†™å¥æœºæˆ¿Afutures.add(asyncExecute(redisTemplateA,()-{redisTemplateA.opsForValue().set(key,value);returntrue;}));// å¼‚æ­¥å†™å¥æœºæˆ¿Bfutures.add(asyncExecute(redisTemplateB,()-{redisTemplateB.opsForValue().set(key,value);returntrue;}));// ç­‰å¾ç»“æžœfor(FutureBooleanfuture:futures){try{if(!future.get(2,TimeUnit.SECONDS)){log.error(双写失败);}}catch(Exceptione){log.error(双写异常,e);}}}/** * 读取时降级 */publicObjectread(Stringkey){try{// 优åˆè¯»æœ¬åœ°ObjectresultredisTemplateA.opsForValue().get(key);if(result!null){returnresult;}}catch(Exceptione){log.warn(读机房A失败,尝试机房B,e);}try{// 降级读机房BreturnredisTemplateB.opsForValue().get(key);}catch(Exceptione){log.error(读机房B也失败,e);returnnull;}}}3.3 消息队列同步/** * 消息跨机房同步 */ServiceSlf4jpublicclassMQSyncService{AutowiredprivateRocketMQTemplaterocketMQTemplate;/** * 发送跨机房消息 */publicvoidsendCrossRegion(Stringtopic,Objectmessage){// 发送到所有机房ListStringregionsArrays.asList(hz,sh,bj);for(Stringregion:regions){try{rocketMQTemplate.asyncSend(topic_region,message,newSendCallback(){OverridepublicvoidonSuccess(SendResultsendResult){log.info(发送成功: region{}, msgId{},region,sendResult.getMsgId());}OverridepublicvoidonException(Throwablee){log.error(发送失败: region{},region,e);}});}catch(Exceptione){log.error(发送异常: region{},region,e);}}}}四、流量切换方案4.1 DNS切换/** * DNS切换服务 */ServiceSlf4jpublicclassDnsSwitchService{/** * 切换流量到备用机房 */publicvoidswitchTraffic(Stringregion){// æ›´æ–°DNS解析UpdateDomainRequestrequestnewUpdateDomainRequest();request.setDomainName(api.example.com);request.setAction(UPDATE_DNS_LOAD_BALANCE);request.setRegion(region);alidnsClient.updateDomainRecord(request);// 刷新本地DNS缓存refreshLocalDns();log.info(流量切换到机房: {},region);// 发送通知alertingService.alert(流量切换,API流量切换到region机房);}/** * 健康检查 */publicbooleanhealthCheck(Stringregion){Stringurlhttp://region-api.example.com/health;try{ResponseEntityStringresponserestTemplate.getForEntity(url,String.class);returnresponse.getStatusCode()HttpStatus.OK;}catch(Exceptione){log.error(健康检查失败: region{},region,e);returnfalse;}}/** * 自动切换 */Scheduled(fixedRate5000)publicvoidautoSwitch(){StringprimaryRegionhz;// 检查主机房健康状态if(!healthCheck(primaryRegion)){log.warn(主机房不可用,准备切换);// 切换到备用机房for(StringbackupRegion:Arrays.asList(sh,bj)){if(healthCheck(backupRegion)){switchTraffic(backupRegion);break;}}}}}4.2 Nginx切换# Nginx主动健康检查 upstream backend { server 10.0.1.1:8080 max_fails3 fail_timeout30s; server 10.0.1.2:8080 max_fails3 fail_timeout30s; server 10.0.2.1:8080 backup; # 杭州机房2 server 10.0.3.1:8080 backup; # 上海机房 } server { location / { proxy_pass http://backend; proxy_next_upstream error timeout http_502; proxy_connect_timeout 5s; proxy_send_timeout 30s; proxy_read_timeout 30s; } }/** * Nginxé ç½®çƒ­æ›´æ–° */ServiceSlf4jpublicclassNginxReloadService{/** * åŠ¨æ€æ·»åŠ ä¸Šæ¸¸æœåŠ¡å™¨ */publicvoidaddUpstreamServer(Stringip,intport){StringconfigString.format(upstream backend {\n server %s:%d max_fails3 fail_timeout30s;\n}\n,ip,port);// 写å¥ä¸´æ—¶é ç½®writeToFile(/tmp/nginx-upstream.conf,config);// 执行reloadexecCommand(nginx -s reload);log.info(æ·»åŠ ä¸Šæ¸¸æœåŠ¡å™¨: {}:{},ip,port);}/** * 切换主备 */publicvoidswitchMasterBackup(Stringregion){// æ›´æ–°nginxéç½®ï¼Œåˆ‡æ¢åˆ°å¤‡ç”¨æœºæˆ¿StringupstreamConfigbuildUpstreamConfig(region);writeToFile(/etc/nginx/conf.d/backend.conf,upstreamConfig);// reloadexecCommand(nginx -s reload);log.info(切换主备: region{},region);}}五、跨地域延迟优化5.1 读写分离/** * 跨机房读写分离 */ServiceSlf4jpublicclassCrossRegionReadWriteService{/** * 写请求 - 路由到主机房 */WriteConnectionpublicvoidwrite(Orderorder){// 强制写主库orderMapper.insert(order);// 异步同步到从库asyncSyncToSlave(order);}/** * 读请求 - ä¼˜å ˆæœ¬åœ°æœºæˆ¿ */ReadConnection(regionlocal)publicOrderread(LongorderId){// åˆè¯»æœ¬åœ°ä»Žåº“OrderorderorderMapper.selectById(orderId);if(ordernull){// 本地没有,跨机房读取orderremoteRead(orderId);}returnorder;}/** * è¯»å†™åˆ†ç¦»é ç½® */BeanpublicDataSourcedataSource(){HikariDataSourceprimarycreateDataSource(jdbc:mysql://10.0.1.1:3306,primary);HikariDataSourcereplica1createDataSource(jdbc:mysql://10.0.1.2:3306,replica1);HikariDataSourcereplica2createDataSource(jdbc:mysql://10.0.2.1:3306,replica2);returnnewReplicationDataSource(primary,Arrays.asList(replica1,replica2));}}5.2 本地缓存/** * 多级缓存 */ServiceSlf4jpublicclassLocalCacheService{AutowiredprivateCaffeineCacheManagercacheManager;/** * æœ¬åœ°ç¼“å­˜é ç½® */PostConstructpublicvoidinit(){cacheManager.setCaffeine(Caffeine.newBuilder().maximumSize(10000).expireAfterWrite(1,TimeUnit.MINUTES).recordStats());}/** * è¯»å–ï¼ˆæœ¬åœ°ç¼“å­˜ä¼˜å ˆï¼‰ */publicObjectget(Stringkey){// 1. åˆè¯»æœ¬åœ°ç¼“å­˜ObjectvaluelocalCache.getIfPresent(key);if(value!null){returnvalue;}// 2. 读Redis(本地机房)valueredisTemplate.opsForValue().get(key);if(value!null){// 写奿œ¬åœ°ç¼“å­˜ localCache.put(key,value);returnvalue;}// 3. 读数据库valueloadFromDb(key);if(value!null){localCache.put(key,value);redisTemplate.opsForValue().set(key,value);}returnvalue;}}å­ã€å®¹ç¾åˆ‡æ¢æ¼”练6.1 演练方案/** * 容灾演练服务 */ServiceSlf4lpublicclassDrDrillService{/** * 执行容灾演练 */publicvoidexecuteDrDrill(DrDrillPlanplan){log.info(开始容灾演练: {},plan.getName());try{// 1. 通知相å³å›¢é˜ŸnotifyTeams(plan);// 2. 记录基线状态RecordBaselineStatus();// 3. 模拟æ•éšœsimulateFailure(plan.getFailureType());// 4. ç­‰å¾åˆ‡æ¢waitForSwitch(plan.getTimeout());// 5. 验证业务verifyBusiness(plan.getCheckPoints());// 6. 恢复recover();// 7. 记录结果saveDrillResult(plan);}catch(Exceptione){log.error(容灾演练失败,e);emergencyRecover();}log.info(容灾演练完成: {},plan.getName());}/** * 验证业务可用性 */privatevoidverifyBusiness(ListCheckPointcheckPoints){for(CheckPointcheckPoint:checkPoints){try{ObjectresultexecuteCheck(checkPoint);if(!evaluate(checkPoint,result)){thrownewDrillException(检查点验证失败: checkPoint.getName());}}catch(Exceptione){log.error(检查点异常: {},checkPoint.getName(),e);throwe;}}}}七、踩坑实录坑1:数据同步延迟跨机房数据同步延迟太大,导致用户看到的数据不一致。解决:优化同步链路,缩短同步时间窗口,å¿è¦æ—¶å¼ºåˆ¶è¯»ä¸»åº“。坑2:切换失败æ•障时DNS切换失败,流量切不过去。解决:多通道切换(DNS Nginx 消息),提高切换成功率。坑3:脑裂问题两个机房都认为对方æ•障,同时写数据,导致数据冲突。解决:使用分布式锁或 Paxos/Raft 保证一致性。坑4ï¼šæˆæœ¬å¤±æŽ§å¼‚åœ°å¤šæ´»éœ€è¦é¢å¤–çš„æœºå™¨å’Œç½‘ç»œï¼Œæˆæœ¬å¤ªé«˜ã€‚è§£å†³ï¼šæŒ‰ä¸šåŠ¡é‡è¦æ€§åˆ†çº§ï¼Œæ ¸å¿ƒä¸šåŠ¡å¤šæ´»ï¼Œæ™®é€šä¸šåŠ¡å•æœºæˆ¿ã€‚å‘5:运维困难跨机房运维复杂,容易操作失误。解决:自动化运维工å·ï¼Œæ ‡å‡†åŒ–操作流程。å«ã€æ€»ç»“异地多活是保障业务高可用的终极方案:单åƒåŒ–:减少跨单åƒä¾èµ–数据同步:Canal MQ Redis流量切换:DNS Nginx延迟优化:读写分离 本地缓存最佳实践:åˆåšåŒåŸŽåŒæ´»ï¼Œå†æ‰©å¼‚åœ°æ ¸å¿ƒä¸šåŠ¡ä¼˜åˆå¤šæ´»å®šæœŸåšå®¹ç¾æ¼”练保持éç½®ä¸€è‡´æ€§è¡€çš„æ•™è®­ï¼šå¤šæ´»æž¶æž„不是银弹,复杂度很高。在决定做多活之前,åˆè¯„ä¼°ä¸šåŠ¡é‡è¦æ€§å’Œä½ æ„¿æ„ä»˜å‡ºçš„æˆæœ¬ã€‚æ€è€ƒé¢˜ï¼šä½ çš„ç³»ç»Ÿæœ‰æ²¡æœ‰åšå¤šæ´»ï¼Ÿé‡åˆ°äº†å“ªäº›æŒ‘æˆ˜ï¼Ÿä¸ªäººè§‚ç‚¹ï¼Œä»ä¾›å‚考