The notes below are mainly about environment setup, which is a minefield on Tianchi: the official guide is not entirely correct. The configuration below works.
1. Environment Setup
name: hospital
channels:
- https://mirrors.ustc.edu.cn/anaconda/pkgs/free
- https://mirrors.ustc.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
- https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
- https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
- defaults
dependencies:
- argon2-cffi=20.1.0=py36h2bbff1b_1
- async_generator=1.10=py36h28b3542_0
- attrs=21.2.0=pyhd3eb1b0_0
- bleach=4.0.0=pyhd3eb1b0_0
- certifi=2021.5.30=py36haa95532_0
- cffi=1.14.6=py36h2bbff1b_0
- colorama=0.4.4=pyhd3eb1b0_0
- console_shortcut=0.1.1=4
- decorator=5.0.9=pyhd3eb1b0_0
- defusedxml=0.7.1=pyhd3eb1b0_0
- entrypoints=0.3=py36_0
- importlib-metadata=4.8.1=py36haa95532_0
- importlib_metadata=4.8.1=hd3eb1b0_0
- ipykernel=5.3.4=py36h5ca1d4c_0
- ipython=6.1.0=py36_0
- ipython_genutils=0.2.0=pyhd3eb1b0_1
- jedi=0.18.0=py36haa95532_1
- jinja2=3.0.1=pyhd3eb1b0_0
- jsonschema=3.2.0=pyhd3eb1b0_2
- jupyter_client=7.0.1=pyhd3eb1b0_0
- jupyter_core=4.7.1=py36haa95532_0
- jupyterlab_pygments=0.1.2=py_0
- m2w64-gcc-libgfortran=5.3.0=6
- m2w64-gcc-libs=5.3.0=7
- m2w64-gcc-libs-core=5.3.0=7
- m2w64-gmp=6.1.0=2
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
- markupsafe=2.0.1=py36h2bbff1b_0
- mistune=0.8.4=py36he774522_0
- msys2-conda-epoch=20160418=1
- nbclient=0.5.3=pyhd3eb1b0_0
- nbconvert=6.0.7=py36_0
- nbformat=5.1.3=pyhd3eb1b0_0
- nest-asyncio=1.5.1=pyhd3eb1b0_0
- notebook=6.4.3=py36haa95532_0
- packaging=21.0=pyhd3eb1b0_0
- pandoc=2.12=haa95532_0
- pandocfilters=1.4.3=py36haa95532_1
- parso=0.8.2=pyhd3eb1b0_0
- pickleshare=0.7.5=pyhd3eb1b0_1003
- pip=21.0.1=py36haa95532_0
- prometheus_client=0.11.0=pyhd3eb1b0_0
- prompt_toolkit=1.0.15=py36_0
- pycparser=2.20=py_2
- pygments=2.10.0=pyhd3eb1b0_0
- pyparsing=2.4.7=pyhd3eb1b0_0
- pyrsistent=0.17.3=py36he774522_0
- python=3.6.13=h3758d61_0
- python-dateutil=2.8.2=pyhd3eb1b0_0
- pywin32=228=py36hbaba5e8_1
- pywinpty=0.5.7=py36_0
- pyzmq=22.2.1=py36hd77b12b_1
- send2trash=1.5.0=pyhd3eb1b0_1
- setuptools=58.0.4=py36haa95532_0
- simplegeneric=0.8.1=py36_2
- six=1.16.0=pyhd3eb1b0_0
- sqlite=3.36.0=h2bbff1b_0
- terminado=0.9.4=py36haa95532_0
- testpath=0.5.0=pyhd3eb1b0_0
- tornado=6.1=py36h2bbff1b_0
- traitlets=4.3.3=py36_0
- typing_extensions=3.10.0.2=pyh06a4308_0
- vc=14.2=h21ff451_1
- vs2015_runtime=14.27.29016=h5e58377_2
- wcwidth=0.2.5=pyhd3eb1b0_0
- webencodings=0.5.1=py36_1
- wheel=0.37.0=pyhd3eb1b0_1
- wincertstore=0.2=py36h7fe50ca_0
- winpty=0.4.3=4
- zipp=3.5.0=pyhd3eb1b0_0
- pip:
- absl-py==0.13.0
- astor==0.8.1
- blis==0.7.4
- catalogue==2.0.6
- charset-normalizer==2.0.6
- click==7.1.2
- cnradical==0.1.0
- contextvars==2.4
- cymem==2.0.5
- dataclasses==0.8
- gast==0.5.2
- grpcio==1.40.0
- idna==3.2
- immutables==0.16
- jieba==0.42.1
- joblib==1.0.1
- keras==2.2.4
- keras-applications==1.0.8
- keras-contrib==2.0.8
- keras-preprocessing==1.1.2
- keras-self-attention==0.49.0
- markdown==3.3.4
- murmurhash==1.0.5
- numpy==1.19.5
- pandas==0.25.3
- pathy==0.6.0
- preshed==3.0.5
- protobuf==3.18.0
- pydantic==1.8.2
- pytz==2021.1
- pyyaml==5.4.1
- requests==2.26.0
- scikit-learn==0.24.2
- sklearn==0.0
- smart-open==5.2.1
- spacy==3.1.2
- spacy-legacy==3.0.8
- srsly==2.4.1
- tensorboard==1.12.2
- tensorflow==1.12.3
- termcolor==1.1.0
- thinc==8.0.10
- threadpoolctl==2.2.0
- tqdm==4.39.0
- typer==0.3.2
- urllib3==1.26.6
- wasabi==0.8.2
- werkzeug==2.0.1
prefix: D:\Anaconda\envs\hospital
Using the yml file:
conda env create -f <env-file>.yml --prefix <env-path>
Note that gensim (used below for Word2Vec) is not in the yml; install it separately with pip install gensim.
Installing keras-contrib (you may need a proxy to reach GitHub):
!pip install git+https://www.github.com/keras-team/keras-contrib.git
Alternatively, download the repository zip while proxied, unpack it locally, and run:
pip install ./keras-contrib-master
2. Data
Download the dataset yourself from Tianchi.
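For reference, here is a minimal sketch of how such a data directory is typically laid out, assuming the competition data ships as paired brat-style files: `<doc_id>.txt` with raw text and `<doc_id>.ann` with annotations. The file names and contents below are made up for illustration; the real loader lives in `data_utils`:

```python
import os
import tempfile

# Hypothetical example directory; the actual data goes in ./data/train
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, '0.txt'), 'w', encoding='utf-8') as f:
    f.write('患者既往有糖尿病史。')
with open(os.path.join(data_dir, '0.ann'), 'w', encoding='utf-8') as f:
    # brat-style line: id, category, start offset, end offset, surface text
    f.write('T1\tDisease 5 8\t糖尿病\n')

# Collect document ids that have both a text file and an annotation file
doc_ids = sorted(
    fname[:-4] for fname in os.listdir(data_dir)
    if fname.endswith('.txt')
    and os.path.exists(os.path.join(data_dir, fname[:-4] + '.ann'))
)
print(doc_ids)  # ['0']
```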
3. Code
Jupyter is recommended; run each block below in its own cell.
# Import required modules
import numpy as np
from sklearn.model_selection import ShuffleSplit
from data_utils import ENTITIES, Documents, Dataset, SentenceExtractor, make_predictions
from data_utils import Evaluator
from gensim.models import Word2Vec
# Read the data files
data_dir = "./data/train"
ent2idx = dict(zip(ENTITIES, range(1, len(ENTITIES) + 1)))
idx2ent = dict([(v, k) for k, v in ent2idx.items()])
# Shuffle and split into train and test sets
docs = Documents(data_dir=data_dir)
rs = ShuffleSplit(n_splits=1, test_size=20, random_state=2018)
train_doc_ids, test_doc_ids = next(rs.split(docs))
train_docs, test_docs = docs[train_doc_ids], docs[test_doc_ids]
# Model hyperparameters
num_cates = max(ent2idx.values()) + 1
sent_len = 64
vocab_size = 3000
emb_size = 100
sent_pad = 10
sent_extractor = SentenceExtractor(window_size=sent_len, pad_size=sent_pad)
train_sents = sent_extractor(train_docs)
test_sents = sent_extractor(test_docs)
train_data = Dataset(train_sents, cate2idx=ent2idx)
train_data.build_vocab_dict(vocab_size=vocab_size)
test_data = Dataset(test_sents, word2idx=train_data.word2idx, cate2idx=ent2idx)
vocab_size = len(train_data.word2idx)
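The windowing step above is easy to get wrong: each document is cut into fixed windows of `sent_len` characters, with `sent_pad` characters of context kept on each side, so the model actually sees sequences of `sent_len + 2 * sent_pad` tokens. A minimal sketch of that idea, with small toy sizes (this is an assumption about what `SentenceExtractor` does, not its actual code):

```python
def extract_windows(text, window_size=64, pad_size=10, pad_char='\0'):
    """Split text into fixed windows, keeping pad_size chars of context
    on each side (filled with pad_char at the document boundaries)."""
    padded = pad_char * pad_size + text + pad_char * pad_size
    windows = []
    for start in range(0, len(text), window_size):
        # window [start, start + window_size) in the original text maps to
        # [start, start + window_size + 2 * pad_size) in the padded text
        windows.append(padded[start:start + window_size + 2 * pad_size])
    return windows

wins = extract_windows('abcdefgh', window_size=4, pad_size=2, pad_char='_')
print(wins)  # ['__abcdef', 'cdefgh__']
```

Every window has the same length (`window_size + 2 * pad_size`), which is why `seq_len` is computed as `sent_len + 2 * sent_pad` when the model is built below.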
# Train a character-level Word2Vec model to build the embedding matrix
w2v_train_sents = []
for doc in docs:
    w2v_train_sents.append(list(doc.text))
w2v_model = Word2Vec(w2v_train_sents)
w2v_embeddings = np.zeros((vocab_size, emb_size))
for char, char_idx in train_data.word2idx.items():
    if char in w2v_model.wv:
        w2v_embeddings[char_idx] = w2v_model.wv[char]
# Build the BiLSTM + CRF model
import keras
from keras.layers import Input, Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF
from keras.models import Model

def build_lstm_crf_model(num_cates, seq_len, vocab_size, model_opts=dict()):
    opts = {
        'emb_size': 256,
        'emb_trainable': True,
        'emb_matrix': None,
        'lstm_units': 256,
        'optimizer': keras.optimizers.Adam()
    }
    opts.update(model_opts)

    input_seq = Input(shape=(seq_len,), dtype='int32')
    if opts.get('emb_matrix') is not None:
        embedding = Embedding(vocab_size, opts['emb_size'],
                              weights=[opts['emb_matrix']],
                              trainable=opts['emb_trainable'])
    else:
        embedding = Embedding(vocab_size, opts['emb_size'])
    x = embedding(input_seq)
    lstm = LSTM(opts['lstm_units'], return_sequences=True)
    x = Bidirectional(lstm)(x)
    crf = CRF(num_cates, sparse_target=True)
    output = crf(x)

    model = Model(input_seq, output)
    model.compile(opts['optimizer'], loss=crf.loss_function, metrics=[crf.accuracy])
    return model
# Instantiate the BiLSTM + CRF model
seq_len = sent_len + 2 * sent_pad
model = build_lstm_crf_model(num_cates, seq_len=seq_len, vocab_size=vocab_size,
                             model_opts={'emb_matrix': w2v_embeddings, 'emb_size': 100, 'emb_trainable': False})
model.summary()
# Training data shapes
train_X, train_y = train_data[:]
print('train_X.shape', train_X.shape)
print('train_y.shape', train_y.shape)
# Train the BiLSTM + CRF model
model.fit(train_X, train_y, batch_size=64, epochs=10)
# Predict on the test set
test_X, _ = test_data[:]
preds = model.predict(test_X, batch_size=64, verbose=True)
pred_docs = make_predictions(preds, test_data, sent_pad, docs, idx2ent)
# Print evaluation metrics
f_score, precision, recall = Evaluator.f1_score(test_docs, pred_docs)
print('f_score: ', f_score)
print('precision: ', precision)
print('recall: ', recall)
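For intuition, the entity-level scores printed above boil down to counting predicted spans that exactly match gold spans. A minimal sketch, assuming `Evaluator` compares `(category, start, end)` tuples (a common convention; the real `data_utils` implementation may differ):

```python
def entity_f1(gold_spans, pred_spans):
    """Entity-level F1/precision/recall over (category, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                      # exact span-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, precision, recall

# Made-up spans: one exact match, one boundary mismatch
gold = [('Disease', 5, 8), ('Drug', 12, 15)]
pred = [('Disease', 5, 8), ('Drug', 12, 16)]
f1, p, r = entity_f1(gold, pred)
print(f1, p, r)  # 0.5 0.5 0.5
```

Note that a span with the right category but wrong boundaries counts as both a false positive and a false negative, which is why boundary errors hurt the score twice.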
# Show a sample test document
sample_doc_id = list(pred_docs.keys())[3]
test_docs[sample_doc_id]
# Show the corresponding prediction
pred_docs[sample_doc_id]