Training Word2vec with gensim

2022-04-08 default word2vec

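Gensim (short for "generate similar") is a simple, efficient Python library for natural language processing, designed to extract semantic topics from documents. Its input is raw, unstructured digital text (plain text). Built-in algorithms include Word2Vec, FastText, Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA); they discover the semantic structure of documents automatically by examining statistical co-occurrence patterns in the training corpus. These algorithms are unsupervised: no human annotation is needed, only a corpus of plain-text documents. Once these statistical patterns are found, any plain text (a sentence, phrase, or word) can be expressed succinctly in the resulting semantic representation.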

`models.word2vec` – Word2vec embeddings

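This module implements the word2vec family of algorithms, using highly optimized C routines, data streaming, and Pythonic interfaces.

Usage:

Initializing a model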

from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")

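Training is streamed, so `sentences` can be any iterable that reads input data on the fly from disk or the network, without loading the entire corpus into RAM.

Note: the `sentences` iterable must be restartable (not just a generator), so that the algorithm can stream over your dataset multiple times.

Loading a model and continuing training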
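The difference between a restartable iterable and a one-shot generator can be sketched in plain Python. The class name `LineCorpus`, the helper `read_once`, and the temporary file are hypothetical names made up for this illustration; the sketch assumes one sentence per line with whitespace-separated tokens.

```python
import os
import tempfile


class LineCorpus:
    """Restartable corpus: every call to __iter__ re-opens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # naive whitespace tokenization
                if tokens:
                    yield tokens


def read_once(path):
    """Generator function: the object it returns can be consumed only once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


# Write a tiny two-sentence corpus to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("human interface computer\nsurvey user computer system\n")
    corpus_path = tmp.name

corpus = LineCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)   # works again: the file is re-opened

one_shot = read_once(corpus_path)
gen_first = list(one_shot)
gen_second = list(one_shot)  # empty: the generator is already exhausted

os.remove(corpus_path)
```

An object like `corpus` could then be passed as `sentences=` when constructing `Word2Vec`. Gensim also ships `gensim.models.word2vec.LineSentence`, which covers the same one-sentence-per-line case.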

model = Word2Vec.load("word2vec.model")
model.train([["hello", "world"]], total_examples=1, epochs=1)

The trained word vectors are stored in a KeyedVectors instance, accessible as model.wv:

vector = model.wv['computer']  # get numpy vector of a word
sims = model.wv.most_similar('computer', topn=10)  # get other similar words
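`most_similar` ranks the other words by cosine similarity to the query vector. The ranking idea can be sketched in plain Python as follows; the toy two-dimensional vectors and the helper names are invented for illustration, and this is not gensim's actual (vectorized) implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, made up for the example.
toy_vectors = {
    "computer": [1.0, 0.0],
    "laptop": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
ranking = most_similar(toy_vectors, "computer", topn=2)
```

Here `ranking` lists "laptop" first, because its vector points in nearly the same direction as the one for "computer", while "banana" is orthogonal to it.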

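The reason for separating the trained vectors into KeyedVectors is that if you no longer need the full model state (that is, you will not continue training), that state can be discarded, keeping only the vectors and their keys.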

from gensim.models import KeyedVectors
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")
# Load back with memory-mapping = read-only, shared across processes.
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')
vector = wv['computer']  # Get numpy vector of a word

Gensim can also load word vectors in the "word2vec C format" as a KeyedVectors instance:

from gensim.test.utils import datapath
# Load a word2vec model stored in the C *text* format.
wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)
# Load a word2vec model stored in the C *binary* format.
wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True)
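For orientation, the C *text* format is simple: a header line with the vocabulary size and the vector dimensionality, followed by one line per word with the word and its space-separated components. A minimal reader might look like the sketch below; `read_word2vec_text` is a hypothetical helper written for this illustration, and in practice `KeyedVectors.load_word2vec_format` should be used.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec C text format into a {word: [floats]} dict."""
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {word!r}")
        vectors[word] = values
    return vectors


# A two-word, three-dimensional sample made up for the example.
sample = io.StringIO("2 3\ncomputer 0.1 0.2 0.3\nhuman -0.4 0.5 0.6\n")
parsed = read_word2vec_text(sample)
```

The binary variant stores the same header as text but writes each vector as raw 32-bit floats, which is why `binary=True` must be passed for such files.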

If you have finished training the model (no more updates, only querying), you can switch to the KeyedVectors instance:

word_vectors = model.wv
del model

Importing the trained parameters into a torch embedding

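import torch
import torch.nn as nn

# pretrained_vec is assumed to be a trained gensim Word2Vec model,
# e.g. pretrained_vec = Word2Vec.load("word2vec.model")
vocab = {}  # maps each word to its row index in the weight matrix
for i, key in enumerate(pretrained_vec.wv.index_to_key):
    vocab[key] = i
# wv.vectors holds one row per word, in index_to_key order; torch.tensor copies it
vects = torch.tensor(pretrained_vec.wv.vectors, dtype=torch.float32)
del pretrained_vec
embedding = nn.Embedding(len(vocab), vects.shape[1])
embedding.weight = nn.Parameter(vects)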
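Before the embedding layer can be used, tokens have to be mapped to row indices through the word-to-index dictionary built above. The sketch below shows one possible way to do that in plain Python. The `encode` helper and the optional `unk_index` are assumptions added for illustration, not part of the original code; if an unknown-word index is used, the embedding matrix would need a matching extra row.

```python
def encode(tokens, vocab, unk_index=None):
    """Map tokens to indices; unknown tokens are dropped or sent to unk_index."""
    ids = []
    for token in tokens:
        if token in vocab:
            ids.append(vocab[token])
        elif unk_index is not None:
            ids.append(unk_index)
    return ids


# Toy dictionary standing in for the one built from index_to_key.
toy_vocab = {"human": 0, "computer": 1, "system": 2}

dropped = encode(["human", "quantum", "system"], toy_vocab)
mapped = encode(["human", "quantum", "system"], toy_vocab,
                unk_index=len(toy_vocab))
```

The resulting index lists can be wrapped in a `torch.LongTensor` and passed to the embedding layer.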

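Embeddings for multiword ngrams

The gensim.models.phrases module lets you automatically detect phrases longer than one word, using collocation statistics. With phrases, you can learn a word2vec model in which the "words" are actually multiword expressions, such as new_york_times or financial_crisis: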

from gensim.models import Phrases
# Train a bigram detector.
bigram_transformer = Phrases(common_texts)
# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
model = Word2Vec(bigram_transformer[common_texts], min_count=1)
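The collocation statistics behind phrase detection can be illustrated with the scoring formula from the original word2vec phrase paper, (count(a b) - min_count) * N / (count(a) * count(b)). The sketch below is a simplified illustration, not gensim's code: `bigram_scores` is a hypothetical helper, N counts distinct unigrams only, and the toy sentences are made up.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score every adjacent word pair by how strongly it collocates."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only
    scores = {}
    for (a, b), count_ab in bigrams.items():
        scores[(a, b)] = ((count_ab - min_count) * vocab_size
                          / (unigrams[a] * unigrams[b]))
    return scores


toy_sentences = [
    ["new", "york", "times"],
    ["new", "york", "city"],
    ["in", "new", "york"],
    ["the", "times"],
]
scores = bigram_scores(toy_sentences, min_count=1)
best_pair = max(scores, key=scores.get)
```

Pairs whose score exceeds a chosen threshold are merged into a single token such as new_york; pairs seen only `min_count` times score zero and stay separate.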

Pretrained models

The Gensim-data repository comes with several pretrained models:

import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')


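Permalink: https://www.yeahchen.cn/2022/04/08/使用gensim训练Word2vec/
Copyright notice: Unless otherwise stated, all posts on this blog are reposts, kept only as the author's personal study notes and not used for any other purpose. If anything here infringes on your rights, please contact me.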

Adam Chen
