Stop Memorizing the Skip-gram Formulas! Implement One from Scratch in Python (Complete Code and Visualizations Included)
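Implementing Skip-gram from Scratch in Python: Understanding How Word Vectors Are Generated Through Visualization

When first encountering word vectors, many people are scared off by the dense math behind the Skip-gram model. In practice, the core idea of the model can be expressed directly in a small amount of code. This article builds a Skip-gram model from scratch in Python and uses visualization to make the abstract process of learning word vectors tangible.

1. Preparation and environment setup

Before writing any code, we need a clean Python environment. Creating a dedicated environment with Anaconda is recommended to avoid dependency conflicts. scikit-learn is installed as well because later sections import it for t-SNE and cosine similarity:

```bash
conda create -n skipgram python=3.8
conda activate skipgram
pip install numpy matplotlib seaborn scikit-learn
```

Next, prepare a simple text dataset. For demonstration purposes we use a small hand-built corpus:

```python
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "i love natural language processing",
    "deep learning is changing the world",
]
```

2. Text preprocessing and vocabulary construction

The first step of Skip-gram is to convert text into a numeric form the model can process. This involves the following key steps:

- Tokenization and low-frequency filtering: split sentences into lists of words and remove words that occur too rarely
- Vocabulary construction: assign a unique index to each distinct word
- Training sample generation: create (center word, context word) pairs with a sliding window

```python
from collections import defaultdict
import numpy as np

def build_vocab(corpus, min_count=1):
    # Count how often each word occurs across the whole corpus
    word_counts = defaultdict(int)
    for sentence in corpus:
        for word in sentence.split():
            word_counts[word] += 1
    # Filter first, then enumerate, so indices stay contiguous (0..V-1)
    # even when min_count removes some words
    kept_words = [word for word, count in word_counts.items() if count >= min_count]
    vocab = {word: idx for idx, word in enumerate(kept_words)}
    idx_to_word = {idx: word for word, idx in vocab.items()}
    return vocab, idx_to_word

vocab, idx_to_word = build_vocab(corpus)
vocab_size = len(vocab)
print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary sample: {list(vocab.items())[:5]}")
```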
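To make the effect of `min_count` concrete, here is a minimal, self-contained sketch that builds the vocabulary twice on the same three sentences. The names `build_filtered_vocab` and `toy_corpus` are illustrative assumptions, not part of the article's code. It filters before enumerating so that indices stay contiguous.

```python
from collections import Counter

def build_filtered_vocab(sentences, min_count=1):
    """Illustrative helper: count words, drop rare ones, index the rest."""
    counts = Counter(word for s in sentences for word in s.split())
    # Keep words in first-seen order; filter before assigning indices
    kept = [word for word, count in counts.items() if count >= min_count]
    return {word: idx for idx, word in enumerate(kept)}, counts

toy_corpus = [
    "the quick brown fox jumps over the lazy dog",
    "i love natural language processing",
    "deep learning is changing the world",
]

vocab_all, counts = build_filtered_vocab(toy_corpus, min_count=1)
vocab_frequent, _ = build_filtered_vocab(toy_corpus, min_count=2)

print(len(vocab_all))      # number of distinct words
print(vocab_frequent)      # only words seen at least twice survive
```

On a corpus this small almost every word occurs once, so any `min_count` above 1 removes nearly the whole vocabulary. That is why the article keeps the default of 1.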
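3. Core implementation of the Skip-gram model

The core of Skip-gram is predicting the surrounding context words from a center word. We implement each component of this process step by step.

3.1 Initializing the word vector matrices

The word vector matrices are the model's core parameters and come in two parts:

- Center word matrix W (vocab_size × embedding_dim)
- Context word matrix W' (embedding_dim × vocab_size), named W_prime in the code

```python
def initialize_weights(vocab_size, embedding_dim=2):
    np.random.seed(42)  # fixed seed so runs are reproducible
    # Small random values keep the initial softmax close to uniform
    W = np.random.randn(vocab_size, embedding_dim) * 0.01
    W_prime = np.random.randn(embedding_dim, vocab_size) * 0.01
    return W, W_prime

embedding_dim = 2  # 2 dimensions so the vectors can be plotted directly
W, W_prime = initialize_weights(vocab_size, embedding_dim)
```

3.2 Generating training samples

Skip-gram is trained on (center word, context word) pairs. The following function generates these samples with a sliding window:

```python
def generate_training_data(corpus, vocab, window_size=2):
    training_data = []
    for sentence in corpus:
        words = sentence.split()
        for center_pos, center_word in enumerate(words):
            if center_word not in vocab:
                continue
            center_idx = vocab[center_word]
            # Determine the context window boundaries
            start = max(0, center_pos - window_size)
            end = min(len(words), center_pos + window_size + 1)
            for context_pos in range(start, end):
                if context_pos == center_pos:
                    continue  # the center word is not its own context
                context_word = words[context_pos]
                if context_word in vocab:
                    context_idx = vocab[context_word]
                    training_data.append((center_idx, context_idx))
    return np.array(training_data)

train_data = generate_training_data(corpus, vocab)
print(f"Number of training samples: {len(train_data)}")
```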
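To see what the sliding window actually emits, here is a small sketch that lists the pairs as words instead of indices for a single short sentence. The helper name `window_pairs` is an illustrative assumption. It uses the same boundary logic as the sample generator but is otherwise standalone.

```python
def window_pairs(words, window_size=2):
    """Return (center, context) word pairs for one tokenized sentence."""
    pairs = []
    for center_pos, center in enumerate(words):
        start = max(0, center_pos - window_size)
        end = min(len(words), center_pos + window_size + 1)
        for context_pos in range(start, end):
            if context_pos != center_pos:
                pairs.append((center, words[context_pos]))
    return pairs

sentence = "the quick brown fox".split()
pairs_w1 = window_pairs(sentence, window_size=1)
pairs_w2 = window_pairs(sentence, window_size=2)

for center, context in pairs_w1:
    print(f"{center} -> {context}")
print(len(pairs_w1), len(pairs_w2))
```

Words at the sentence edges have fewer neighbors, so they contribute fewer pairs. Widening the window adds more distant pairs such as ("the", "brown").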
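4. Model training and visualization

Now we can train the model. To make the training process easy to follow, we implement:

- A forward pass that computes the loss
- Backpropagation that updates the parameters
- Live visualization of how the word vectors change

4.1 Implementing the training loop

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def train_skipgram(train_data, W, W_prime, learning_rate=0.01, epochs=100):
    losses = []
    for epoch in range(epochs):
        total_loss = 0
        np.random.shuffle(train_data)  # shuffles the sample order in place
        for center_idx, context_idx in train_data:
            # Forward pass
            h = W[center_idx]                 # center word vector
            u = np.dot(W_prime.T, h)          # unnormalized logits
            exp_u = np.exp(u - np.max(u))     # subtract max for numerical stability
            y_pred = exp_u / np.sum(exp_u)    # softmax
            # Loss (cross-entropy)
            loss = -np.log(y_pred[context_idx])
            total_loss += loss
            # Backward pass: gradient of the loss w.r.t. the logits
            grad = y_pred.copy()
            grad[context_idx] -= 1
            # Compute both gradients before updating either matrix
            dW_prime = np.outer(h, grad)
            dW = np.dot(W_prime, grad)
            W_prime -= learning_rate * dW_prime
            W[center_idx] -= learning_rate * dW
        losses.append(total_loss)
        # Plot the word vectors every 10 epochs and after the last one
        if epoch % 10 == 0 or epoch == epochs - 1:
            visualize_embeddings(W, idx_to_word, epoch)
    return W, W_prime, losses

def visualize_embeddings(W, idx_to_word, epoch):
    plt.figure(figsize=(10, 8))
    for i in range(len(W)):
        plt.scatter(W[i, 0], W[i, 1], alpha=0.5)
        plt.text(W[i, 0], W[i, 1], idx_to_word[i], fontsize=9)
    plt.title(f"Word vector space (Epoch {epoch})")
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 2")
    plt.show()
```

4.2 Starting the training run

```python
W_trained, W_prime_trained, losses = train_skipgram(train_data, W, W_prime)

# Plot the loss curve
plt.plot(losses)
plt.title("Training loss curve")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
```

5. Model optimization techniques

The basic Skip-gram implementation is intuitive, but it is inefficient in real applications because every training sample requires a softmax over the entire vocabulary. Here are three commonly used optimizations:

5.1 Negative sampling

Negative sampling speeds up training substantially by updating the parameters of only a few negative samples instead of the whole vocabulary:

```python
def negative_sampling_loss(center_idx, context_idx, W, W_prime, k=5):
    # Positive sample loss: -log(sigmoid(u_pos))
    h = W[center_idx]
    u_pos = np.dot(W_prime[:, context_idx], h)
    loss = -np.log(1 / (1 + np.exp(-u_pos)))
    # Negative sample loss: -log(sigmoid(-u_neg)) for each of k sampled words.
    # Negatives are drawn uniformly here for simplicity; word2vec draws them
    # from the unigram distribution raised to the 3/4 power.
    neg_indices = np.random.choice(
        [i for i in range(len(W)) if i != context_idx], size=k, replace=False
    )
    for neg_idx in neg_indices:
        u_neg = np.dot(W_prime[:, neg_idx], h)
        loss -= np.log(1 / (1 + np.exp(u_neg)))
    return loss
```

5.2 Subsampling frequent words

Probabilistically discarding frequent words balances out the influence of word frequency:

```python
def subsample_frequent_words(word_counts, threshold=1e-5):
    total_count = sum(word_counts.values())
    word_probs = {word: count / total_count for word, count in word_counts.items()}
    # Discard probability 1 - sqrt(t / f(w)); words rarer than the threshold
    # would get a negative value, so clamp at 0 (never discarded)
    discard_probs = {
        word: max(0.0, 1 - np.sqrt(threshold / word_probs[word]))
        for word in word_counts
    }
    return discard_probs
```

5.3 Phrase detection

Treat word pairs that frequently co-occur as a single token:

```python
def detect_phrases(corpus, threshold=10):
    word_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    # Count occurrences of words and of adjacent word pairs
    for sentence in corpus:
        words = sentence.split()
        for word in words:
            word_counts[word] += 1
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            pair_counts[pair] += 1
    # Score each pair; threshold acts as a discount on the pair count,
    # so pairs seen no more than threshold times are dropped
    phrase_scores = {}
    for pair, count in pair_counts.items():
        word1, word2 = pair
        score = (count - threshold) / (word_counts[word1] * word_counts[word2])
        if score > 0:
            phrase_scores[pair] = score
    return phrase_scores
```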
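The negative sampling function in 5.1 only computes a loss. As a complement, here is a minimal sketch of what one matching SGD update could look like. The function name `negative_sampling_step`, the demo matrices, and the fixed list of negatives are illustrative assumptions. It also assumes the negative indices are distinct and do not include the context word.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(center_idx, context_idx, neg_indices, W, W_prime, lr=0.1):
    """One in-place SGD update; returns the loss measured before the update.

    Assumption: neg_indices are distinct and exclude context_idx.
    """
    h = W[center_idx].copy()
    # Column 0 is the positive (context) word, the rest are negatives
    idxs = np.concatenate(([context_idx], neg_indices)).astype(int)
    labels = np.zeros(len(idxs))
    labels[0] = 1.0
    scores = sigmoid(W_prime[:, idxs].T @ h)
    loss = -np.log(scores[0]) - np.sum(np.log(1.0 - scores[1:]))
    errors = scores - labels                      # dLoss/dlogit for each sampled word
    grad_h = W_prime[:, idxs] @ errors            # computed before W_prime changes
    W_prime[:, idxs] -= lr * np.outer(h, errors)  # touches only k + 1 columns
    W[center_idx] -= lr * grad_h                  # touches only one row
    return loss

rng = np.random.default_rng(0)
W_demo = rng.normal(scale=0.1, size=(6, 3))        # 6 words, 3 dimensions
W_prime_demo = rng.normal(scale=0.1, size=(3, 6))
W_before = W_demo.copy()

demo_losses = [
    negative_sampling_step(0, 1, [2, 3], W_demo, W_prime_demo) for _ in range(200)
]
print(f"loss: {demo_losses[0]:.3f} -> {demo_losses[-1]:.3f}")
```

Each step modifies one row of `W` and `k + 1` columns of `W_prime`, whereas the full softmax update in 4.1 touches every column. That difference is where the speedup comes from.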
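6. Practical applications and extensions

Trained word vectors can be used in many NLP tasks. Here are some typical examples. Keep in mind that with an 18-word toy corpus and 2-dimensional vectors, the results are illustrative only.

6.1 Computing word similarity

```python
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(word, W, vocab, idx_to_word, topn=5):
    if word not in vocab:
        return []
    word_idx = vocab[word]
    word_vec = W[word_idx].reshape(1, -1)
    similarities = cosine_similarity(word_vec, W)[0]
    # Exclude the query word explicitly; slicing off the top hit is unreliable
    # when another word ties with it
    ranked = [idx for idx in np.argsort(-similarities) if idx != word_idx]
    return [(idx_to_word[idx], similarities[idx]) for idx in ranked[:topn]]

similar_words = most_similar("language", W_trained, vocab, idx_to_word)
print(f"Words most similar to 'language': {similar_words}")
```

6.2 Enhanced word vector visualization

Use t-SNE to reduce high-dimensional word vectors to two dimensions for plotting. With embedding_dim = 2 as used above this adds little; it pays off once the model is trained with more dimensions. The perplexity must be smaller than the number of words, so it is capped for small vocabularies:

```python
from sklearn.manifold import TSNE

def visualize_with_tsne(W, idx_to_word):
    # perplexity has to be below the number of samples (words)
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(W) - 1))
    W_tsne = tsne.fit_transform(W)
    plt.figure(figsize=(12, 10))
    for i in range(len(W_tsne)):
        plt.scatter(W_tsne[i, 0], W_tsne[i, 1], alpha=0.5)
        plt.text(W_tsne[i, 0], W_tsne[i, 1], idx_to_word[i], fontsize=9)
    plt.title("Word vector space after t-SNE")
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.show()

visualize_with_tsne(W_trained, idx_to_word)
```

6.3 Word analogy task

```python
def word_analogy(word_a, word_b, word_c, W, vocab, idx_to_word, topn=5):
    # All three query words must be in the vocabulary
    if any(word not in vocab for word in (word_a, word_b, word_c)):
        return []
    vec_a = W[vocab[word_a]]
    vec_b = W[vocab[word_b]]
    vec_c = W[vocab[word_c]]
    # a : b :: c : ?  is answered by the word closest to b - a + c
    target_vec = (vec_b - vec_a + vec_c).reshape(1, -1)
    similarities = cosine_similarity(target_vec, W)[0]
    # Skip the query words themselves so they do not fill the top results
    exclude = {vocab[word_a], vocab[word_b], vocab[word_c]}
    ranked = [idx for idx in np.argsort(-similarities) if idx not in exclude]
    return [(idx_to_word[idx], similarities[idx]) for idx in ranked[:topn]]

analogy_result = word_analogy("king", "man", "queen", W_trained, vocab, idx_to_word)
print(f"king:man :: queen:? result: {analogy_result}")
```

Note that king, man, and queen do not occur in the toy corpus from section 1, so this call returns an empty list here. The function becomes useful once the model is trained on a corpus that contains the query words.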
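Because the toy corpus cannot exercise the analogy function, the following sketch shows the vector arithmetic on hand-picked vectors instead. The vectors and the names `toy_vectors`, `cosine`, and `toy_analogy` are purely illustrative assumptions. They are not the output of any training run.

```python
import numpy as np

# Hand-picked 2-D vectors (hypothetical): the first coordinate loosely
# encodes "royalty", the second "gender". Not produced by training.
toy_vectors = {
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "apple": np.array([0.0, 3.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_analogy(a, b, c, vectors):
    """Rank candidates for a : b :: c : ? against the vector b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    scored = [(w, cosine(target, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda item: -item[1])

ranking = toy_analogy("king", "man", "queen", toy_vectors)
print(ranking[0][0])  # best candidate for king:man :: queen:?
```

Here man - king + queen lands exactly on the vector chosen for woman. Real embeddings only approximate this kind of offset structure, and only when trained on far more text than the corpus used in this article.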