How to Use Pretrained Word Embeddings

Machine Learning

Published: 2022-03-28

Updated: 2022-03-28

1. Read the pretrained Word Embeddings file

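import numpy as np

embeddings_index = {}  # maps each word to its embedding vector
with open('file-path/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its space-separated coefficients
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, dtype=float, sep=" ")
        embeddings_index[word] = coefs
# Result: the dictionary embeddings_index
# Keys are strings (single words)
# Values are vectors of shape (100,)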
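
To make the file format concrete, here is a minimal sketch that parses two made-up lines from an in-memory buffer instead of the real GloVe file. The helper name `load_embeddings` and the sample values are assumptions for illustration, and the vectors have 4 coefficients rather than 100 to keep the example short.

```python
import io
import numpy as np

# Two made-up lines in GloVe text format: a word, then its coefficients.
# Real files have 100 coefficients per line; 4 are used here for brevity.
sample = io.StringIO(
    "the 0.1 -0.2 0.3 0.4\n"
    "cat 0.5 0.6 -0.7 0.8\n"
)

def load_embeddings(lines):
    """Hypothetical helper: parse GloVe-style lines into a dict."""
    index = {}
    for line in lines:
        # Split off the word; the rest of the line is the vector
        word, coefs = line.rstrip().split(maxsplit=1)
        index[word] = np.asarray(coefs.split(), dtype=float)
    return index

embeddings_index = load_embeddings(sample)
print(len(embeddings_index))
print(embeddings_index["cat"].shape)
```

The same function should work on an open file handle, since it only iterates over lines.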

2. Build the vocabulary for our model

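import tensorflow as tf

encoder = tf.keras.layers.TextVectorization()
encoder.adapt(X)  # X is assumed to hold the training texts (strings)
voc = encoder.get_vocabulary()
word_to_index = dict(zip(voc, range(len(voc))))
# Result: the dictionary word_to_index
# Keys are strings (single words)
# Values are the index of each word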
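
As a rough illustration of what the vocabulary looks like, the sketch below adapts a `TextVectorization` layer on a tiny made-up corpus (the `texts` list is an assumption, standing in for `X`). With the layer's default settings, index 0 is reserved for padding (the empty string) and index 1 for out-of-vocabulary words (`[UNK]`); the remaining words are ordered by frequency.

```python
import tensorflow as tf

# Toy corpus, made up for illustration only.
texts = ["the cat sat", "the dog sat", "the bird flew"]

encoder = tf.keras.layers.TextVectorization()
encoder.adapt(texts)
voc = encoder.get_vocabulary()
word_to_index = dict(zip(voc, range(len(voc))))

# The first two entries are the padding token and the unknown token
print(voc[:3])

# Words never seen during adapt() are mapped to index 1 ([UNK])
encoded = encoder(tf.constant(["the cat barked"])).numpy()
print(encoded)
```

These two reserved entries matter in the next step: neither has a pretrained vector, so both are counted as misses.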

3. Extract the vectors for our vocabulary from the pretrained Word Embeddings

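The pretrained Word Embeddings cover a very large number of words, while our model's vocabulary is usually much smaller. We do not need all of the vectors, so we only extract those for words that appear in our own vocabulary.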

num_tokens = len(voc)
embedding_dim = 100  # must match the dimension of the pretrained vectors
hits = 0
misses = 0
cnt = 0
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_to_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Print only the first 10 missing words as a quick sanity check
        if cnt < 10:
            print(word)
            cnt += 1
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
# Result: the matrix embedding_matrix
# Some words in our vocabulary do not appear in the pretrained Word Embeddings;
# their rows are simply left initialized to 0
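
The lookup above can be tried on toy data without downloading anything. In this sketch the dictionary, the vocabulary, and the 3-dimensional vectors are all made up; it collects the missing words in a list and reports the share of the vocabulary that was found, which is one simple way to judge how well the pretrained vectors fit a corpus.

```python
import numpy as np

# Hypothetical toy data standing in for the real objects.
embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3]),
    "cat": np.array([0.4, 0.5, 0.6]),
}
voc = ["", "[UNK]", "the", "cat", "zyzzyva"]
word_to_index = {w: i for i, w in enumerate(voc)}

embedding_dim = 3
embedding_matrix = np.zeros((len(voc), embedding_dim))
missing = []
for word, i in word_to_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec
    else:
        # Row i stays all zeros for words without a pretrained vector
        missing.append(word)

coverage = 1 - len(missing) / len(voc)
print(missing)
print(f"coverage: {coverage:.0%}")
```

The padding and unknown tokens always show up among the misses, so a small nonzero miss count is expected even with a well-matched vocabulary.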

4. Build the Embedding layer from embedding_matrix

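embedding_layer = tf.keras.layers.Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    mask_zero=True,
    trainable=True,
)
# Depending on the situation, decide whether to freeze the Embedding layer
# (set trainable=False to keep the pretrained vectors fixed)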
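
For reference, a frozen variant might look like the sketch below. The 3x2 `embedding_matrix` is made up for illustration; the point is only that with `trainable=False` the layer returns the pretrained rows and exposes no trainable weights, so the optimizer will not update them.

```python
import numpy as np
import tensorflow as tf

# Made-up matrix: row 0 is padding, rows 1 and 2 are "pretrained" vectors.
embedding_matrix = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])
num_tokens, embedding_dim = embedding_matrix.shape

frozen = tf.keras.layers.Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    mask_zero=True,
    trainable=False,  # keep the pretrained vectors fixed during training
)

# Calling the layer builds its weights and looks up rows by token index
out = frozen(tf.constant([[1, 2, 0]]))
print(out.numpy())
print(len(frozen.trainable_weights))
```

Freezing tends to be the safer choice when the training set is small; leaving the layer trainable lets the vectors adapt to the task at the cost of more parameters to fit.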