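# Statistical Analysis Methods and Hypothesis Testing

## 1. Technical Analysis

### 1.1 Overview of Statistical Analysis

Statistical analysis is a foundational method of data science.

Types of statistical analysis:

- Descriptive statistics: summarizing data
- Inferential statistics: hypothesis testing
- Regression analysis: relationships between variables
- Time series: time-ordered data

Statistical methods:

- Parametric tests: t-test, analysis of variance (ANOVA)
- Nonparametric tests: Mann-Whitney, chi-square test
- Correlation analysis: Pearson, Spearman

### 1.2 Hypothesis Testing

Steps of a hypothesis test:

1. State the hypotheses H0 and H1
2. Choose a test
3. Compute the test statistic
4. Determine the p-value
5. Make a decision

Common tests:

- t-test: difference in means
- ANOVA: means of multiple groups
- Chi-square test: test of independence

### 1.3 Statistical Distributions

| Distribution | Description | Typical use |
| --- | --- | --- |
| Normal distribution | Symmetric, bell-shaped | Natural phenomena |
| t distribution | Small samples | Sample means |
| Chi-square distribution | Non-negative, skewed | Variance tests |
| F distribution | Ratio distribution | ANOVA |

## 2. Core Implementation

### 2.1 Descriptive Statistics

```python
import numpy as np
import pandas as pd
from scipy import stats


class DescriptiveStatistics:
    def __init__(self, data):
        self.data = data

    def summary(self):
        """Summarize a DataFrame or a 1-D numeric sequence."""
        if isinstance(self.data, pd.DataFrame):
            return self.data.describe(include='all')
        # Convert lists/Series to an array so they do not fall through to None.
        values = np.asarray(self.data)
        return {
            'mean': np.mean(values),
            'median': np.median(values),
            'std': np.std(values),  # population std (ddof=0)
            'min': np.min(values),
            'max': np.max(values),
            'skew': stats.skew(values),
            'kurtosis': stats.kurtosis(values),  # excess (Fisher) kurtosis
        }

    def frequency_table(self, categorical_data):
        """Counts and percentages for each category."""
        freq = pd.Series(categorical_data).value_counts()
        freq_percent = freq / freq.sum() * 100
        return pd.DataFrame({'count': freq, 'percent': freq_percent})

    def confidence_interval(self, confidence=0.95):
        """t-based confidence interval for the mean."""
        mean = np.mean(self.data)
        std = np.std(self.data, ddof=1)  # sample std
        n = len(self.data)
        se = std / np.sqrt(n)
        margin = stats.t.ppf((1 + confidence) / 2, n - 1) * se
        return {
            'mean': mean,
            'lower': mean - margin,
            'upper': mean + margin,
            'confidence': confidence,
        }
```

### 2.2 Parametric Tests

```python
class ParametricTests:
    def one_sample_t_test(self, data, popmean=0):
        t_stat, p_value = stats.ttest_1samp(data, popmean)
        return {
            't_statistic': t_stat,
            'p_value': p_value,
            'df': len(data) - 1,
            'significant': p_value < 0.05,
        }

    def two_sample_t_test(self, group1, group2, equal_var=True):
        # equal_var=False switches to Welch's t-test.
        t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=equal_var)
        return {
            't_statistic': t_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }

    def paired_t_test(self, before, after):
        t_stat, p_value = stats.ttest_rel(before, after)
        return {
            't_statistic': t_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }

    def anova(self, *groups):
        f_stat, p_value = stats.f_oneway(*groups)
        return {
            'f_statistic': f_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }
```

### 2.3 Nonparametric Tests

```python
class NonParametricTests:
    def mann_whitney_u_test(self, group1, group2):
        u_stat, p_value = stats.mannwhitneyu(group1, group2)
        return {
            'u_statistic': u_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }

    def wilcoxon_signed_rank_test(self, before, after):
        w_stat, p_value = stats.wilcoxon(before, after)
        return {
            'w_statistic': w_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }

    def kruskal_wallis_test(self, *groups):
        h_stat, p_value = stats.kruskal(*groups)
        return {
            'h_statistic': h_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }

    def chi_square_test(self, observed, expected=None):
        # stats.chisquare is a goodness-of-fit test and returns only
        # (statistic, p-value); with expected=None it assumes equal frequencies.
        # For a test of independence on a contingency table,
        # use stats.chi2_contingency instead.
        chi2, p_value = stats.chisquare(observed, f_exp=expected)
        return {
            'chi2_statistic': chi2,
            'p_value': p_value,
            'degrees_of_freedom': len(observed) - 1,
            'significant': p_value < 0.05,
        }
```

### 2.4 Correlation Analysis

```python
class CorrelationAnalysis:
    def pearson_correlation(self, x, y):
        corr, p_value = stats.pearsonr(x, y)
        return {
            'correlation': corr,
            'p_value': p_value,
            'significant': p_value < 0.05,
            'strength': self._interpret_correlation(corr),
        }

    def spearman_correlation(self, x, y):
        corr, p_value = stats.spearmanr(x, y)
        return {
            'correlation': corr,
            'p_value': p_value,
            'significant': p_value < 0.05,
            'strength': self._interpret_correlation(corr),
        }

    def kendall_tau(self, x, y):
        tau, p_value = stats.kendalltau(x, y)
        return {
            'tau': tau,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }

    def _interpret_correlation(self, corr):
        """Map |r| to a rough verbal label."""
        abs_corr = abs(corr)
        if abs_corr > 0.7:
            return 'strong'
        elif abs_corr > 0.4:
            return 'moderate'
        elif abs_corr > 0.1:
            return 'weak'
        else:
            return 'negligible'
```

## 3. Performance Comparison

### 3.1 Parametric vs. Nonparametric Tests

| Property | Parametric tests | Nonparametric tests |
| --- | --- | --- |
| Assumptions | Normal distribution | No specific distribution assumed |
| Efficiency | High (when assumptions hold) | Lower |
| Applicable data | Continuous data | Any data |

### 3.2 Comparison of Correlation Methods

| Method | Applicable data | Measure | Notes |
| --- | --- | --- | --- |
| Pearson | Continuous data | Linear correlation | Most commonly used |
| Spearman | Ordinal data | Rank correlation | Nonparametric |
| Kendall | Ordinal data | Concordance | Robust |

### 3.3 Comparison of Statistical Power

| Test | Power | Use case |
| --- | --- | --- |
| t-test | High | Means of two groups |
| ANOVA | High | Means of multiple groups |
| Chi-square test | Medium | Categorical data |

## 4. Best Practices

### 4.1 Statistical Testing Workflow

```python
def statistical_test_pipeline(data, test_type='t_test'):
    # 1. Inspect the data
    print('=== Data description ===')
    # Named `desc` so it does not shadow the scipy `stats` module used below.
    desc = DescriptiveStatistics(data)
    print(desc.summary())

    # 2. Normality test (Shapiro-Wilk)
    _, p_value = stats.shapiro(data)
    print(f'\nNormality test p-value: {p_value}')
    is_normal = p_value > 0.05

    # 3. Choose the test
    if test_type == 't_test':
        if is_normal:
            result = ParametricTests().one_sample_t_test(data)
        else:
            result = NonParametricTests().wilcoxon_signed_rank_test(
                data, [0] * len(data)
            )
    else:
        raise ValueError(f'Unsupported test_type: {test_type}')

    print('\n=== Test result ===')
    print(result)
    return result
```

### 4.2 Multiple Comparison Correction

```python
def bonferroni_correction(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    corrected = [min(p * len(p_values), 1.0) for p in p_values]
    return corrected


def benjamini_hochberg_correction(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    sorted_indices = sorted(range(n), key=lambda i: p_values[i])
    sorted_p = [p_values[i] for i in sorted_indices]

    corrected = [min(p * n / (i + 1), 1.0) for i, p in enumerate(sorted_p)]
    # Step-up rule: an adjusted p-value may not exceed the one ranked after it.
    for i in range(n - 2, -1, -1):
        corrected[i] = min(corrected[i], corrected[i + 1])

    result = [0.0] * n
    for i, idx in enumerate(sorted_indices):
        result[idx] = corrected[i]
    return result
```

## 5. Summary

Statistical analysis is the foundation of data science:

- Descriptive statistics summarize the characteristics of the data
- Hypothesis testing verifies hypotheses
- Correlation analysis explores relationships between variables
- Nonparametric tests make no distributional assumption

Key points from the comparisons:

- Parametric tests are more efficient when their assumptions hold
- Pearson correlation is the most commonly used
- Multiple comparison correction is needed when several tests are run
- Checking the data distribution first is recommended

Statistical analysis provides the theoretical basis for subsequent modeling and must be applied correctly.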
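
As a worked illustration of the five hypothesis-testing steps listed in section 1.2, the sketch below walks through them once from start to finish. To keep it runnable on the standard library alone, it swaps in a two-sided one-sample z-test (population standard deviation assumed known) in place of the t-test used elsewhere in this article. The helper name `one_sample_z_test`, the sample values, and the choices `mu0=100` and `sigma=5` are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test; sigma is an assumed known population std."""
    n = len(sample)
    # Step 3: compute the test statistic.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 4: two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 5: reject H0 when the p-value falls below the significance level.
    return {'z_statistic': z, 'p_value': p_value, 'reject_h0': p_value < alpha}


# Step 1: H0: mu = 100, H1: mu != 100.
# Step 2: choose the z-test (illustrative assumption: sigma = 5 is known).
sample = [102, 98, 105, 110, 99, 104, 107, 101, 103, 106]  # made-up data
result = one_sample_z_test(sample, mu0=100, sigma=5)
print(result)
```

With a real data set the population standard deviation is rarely known, in which case the t-test from section 2.2 is the more appropriate choice; the sequence of steps stays the same.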