# Model evaluation and validation in Python
# Model evaluation is a key stage of the machine learning workflow.
# Cross-validation gives a more reliable estimate of generalization
# performance than a single train/test split.

# 1. Imports
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# 2. Load the data and hold out 30% as a test set
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Basic cross-validation
model = LogisticRegression(max_iter=5000, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("=== 5-fold cross-validation ===")
print(f"Per-fold scores: {np.round(cv_scores, 4)}")
print(f"Mean accuracy: {cv_scores.mean():.4f}")

# 4. StratifiedKFold: stratified cross-validation
# For a classifier, cv=5 already uses unshuffled stratified folds; the explicit
# splitter adds shuffling with a fixed seed.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_strat = cross_val_score(model, X_train, y_train, cv=skf, scoring="accuracy")
print(f"\nStratifiedKFold mean accuracy: {cv_strat.mean():.4f}")

# 5. Multiple evaluation metrics
print("\nMultiple metrics (5-fold CV):")
for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring=metric)
    print(f"  {metric}: {scores.mean():.4f}")

# 6. Confusion matrix
# Rows are actual classes, columns are predicted classes. In this dataset
# label 0 is "malignant" and label 1 (the positive class here) is "benign".
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("\n=== Confusion matrix ===")
print("                 Predicted neg  Predicted pos")
print(f"Actual negative  TN={cm[0, 0]:4d}        FP={cm[0, 1]:4d}")
print(f"Actual positive  FN={cm[1, 0]:4d}        TP={cm[1, 1]:4d}")

# 7. Precision, recall, F1
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"\nPrecision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 score: {f1:.4f}")
print("\nFull classification report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

# 8. ROC curve and AUC
# Use the predicted probability of the positive class, not the hard labels.
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc_score = roc_auc_score(y_test, y_prob)
print("\n=== ROC-AUC ===")
print(f"AUC: {auc_score:.4f}")

# 9. Comparing models
print("\nModel comparison (5-fold CV AUC):")
models = {
    "LR": LogisticRegression(max_iter=5000, random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, m in models.items():
    scores = cross_val_score(m, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"  {name}: {scores.mean():.4f}")

# 10. Choosing a validation strategy
# Large dataset: a simple train/test split is enough
# Small dataset: cross-validation is a must (K=5 or K=10)
# Imbalanced classes: use StratifiedKFold
# Time series: use TimeSeriesSplit
print(f"\nTest set accuracy: {model.score(X_test, y_test):.4f}")
print(f"Cross-validation accuracy: {cv_scores.mean():.4f}")
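
# 11. Sketch: TimeSeriesSplit on ordered data
# The strategy notes in section 10 recommend TimeSeriesSplit for time series,
# but the breast cancer data has no time order to demonstrate it on. The sketch
# below is a minimal illustration under stated assumptions: the series
# (X_ts, y_ts) is synthetic and hypothetical, and Ridge with mean absolute
# error is just one reasonable choice of model and metric, not part of the
# script above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical data: 120 ordered observations with a linear trend plus noise.
rng = np.random.default_rng(42)
t = np.arange(120)
X_ts = t.reshape(-1, 1).astype(float)
y_ts = 0.5 * t + rng.normal(scale=2.0, size=t.size)

tscv = TimeSeriesSplit(n_splits=5)

# Record the boundaries of each fold: the training window always starts at
# index 0 and grows, and the validation window follows it directly.
fold_bounds = []
for train_idx, val_idx in tscv.split(X_ts):
    bounds = (int(train_idx.max()), int(val_idx.min()), int(val_idx.max()))
    fold_bounds.append(bounds)
    print(f"train: 0..{bounds[0]:3d}   validate: {bounds[1]:3d}..{bounds[2]:3d}")

# The splitter plugs into cross_val_score the same way StratifiedKFold does.
ts_scores = cross_val_score(
    Ridge(alpha=1.0), X_ts, y_ts, cv=tscv, scoring="neg_mean_absolute_error"
)
print(f"Mean MAE across folds: {-ts_scores.mean():.4f}")
```

# Unlike shuffled K-fold, every training window here ends before its validation
# window begins, so no future observations leak into training.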