The notes below are mainly about environment setup, which is a minefield on Tianchi: the official guide is not entirely correct. The configuration below works.
1. Environment Setup
name: hospital
channels:
- https://mirrors.ustc.edu.cn/anaconda/pkgs/free
- https://mirrors.ustc.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
- https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
- https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
- defaults
dependencies:
- argon2-cffi=20.1.0=py36h2bbff1b_1
- async_generator=1.10=py36h28b3542_0
- attrs=21.2.0=pyhd3eb1b0_0
- bleach=4.0.0=pyhd3eb1b0_0
- certifi=2021.5.30=py36haa95532_0
- cffi=1.14.6=py36h2bbff1b_0
- colorama=0.4.4=pyhd3eb1b0_0
- console_shortcut=0.1.1=4
- decorator=5.0.9=pyhd3eb1b0_0
- defusedxml=0.7.1=pyhd3eb1b0_0
- entrypoints=0.3=py36_0
- importlib-metadata=4.8.1=py36haa95532_0
- importlib_metadata=4.8.1=hd3eb1b0_0
- ipykernel=5.3.4=py36h5ca1d4c_0
- ipython=6.1.0=py36_0
- ipython_genutils=0.2.0=pyhd3eb1b0_1
- jedi=0.18.0=py36haa95532_1
- jinja2=3.0.1=pyhd3eb1b0_0
- jsonschema=3.2.0=pyhd3eb1b0_2
- jupyter_client=7.0.1=pyhd3eb1b0_0
- jupyter_core=4.7.1=py36haa95532_0
- jupyterlab_pygments=0.1.2=py_0
- m2w64-gcc-libgfortran=5.3.0=6
- m2w64-gcc-libs=5.3.0=7
- m2w64-gcc-libs-core=5.3.0=7
- m2w64-gmp=6.1.0=2
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
- markupsafe=2.0.1=py36h2bbff1b_0
- mistune=0.8.4=py36he774522_0
- msys2-conda-epoch=20160418=1
- nbclient=0.5.3=pyhd3eb1b0_0
- nbconvert=6.0.7=py36_0
- nbformat=5.1.3=pyhd3eb1b0_0
- nest-asyncio=1.5.1=pyhd3eb1b0_0
- notebook=6.4.3=py36haa95532_0
- packaging=21.0=pyhd3eb1b0_0
- pandoc=2.12=haa95532_0
- pandocfilters=1.4.3=py36haa95532_1
- parso=0.8.2=pyhd3eb1b0_0
- pickleshare=0.7.5=pyhd3eb1b0_1003
- pip=21.0.1=py36haa95532_0
- prometheus_client=0.11.0=pyhd3eb1b0_0
- prompt_toolkit=1.0.15=py36_0
- pycparser=2.20=py_2
- pygments=2.10.0=pyhd3eb1b0_0
- pyparsing=2.4.7=pyhd3eb1b0_0
- pyrsistent=0.17.3=py36he774522_0
- python=3.6.13=h3758d61_0
- python-dateutil=2.8.2=pyhd3eb1b0_0
- pywin32=228=py36hbaba5e8_1
- pywinpty=0.5.7=py36_0
- pyzmq=22.2.1=py36hd77b12b_1
- send2trash=1.5.0=pyhd3eb1b0_1
- setuptools=58.0.4=py36haa95532_0
- simplegeneric=0.8.1=py36_2
- six=1.16.0=pyhd3eb1b0_0
- sqlite=3.36.0=h2bbff1b_0
- terminado=0.9.4=py36haa95532_0
- testpath=0.5.0=pyhd3eb1b0_0
- tornado=6.1=py36h2bbff1b_0
- traitlets=4.3.3=py36_0
- typing_extensions=3.10.0.2=pyh06a4308_0
- vc=14.2=h21ff451_1
- vs2015_runtime=14.27.29016=h5e58377_2
- wcwidth=0.2.5=pyhd3eb1b0_0
- webencodings=0.5.1=py36_1
- wheel=0.37.0=pyhd3eb1b0_1
- wincertstore=0.2=py36h7fe50ca_0
- winpty=0.4.3=4
- zipp=3.5.0=pyhd3eb1b0_0
- pip:
- absl-py==0.13.0
- astor==0.8.1
- blis==0.7.4
- catalogue==2.0.6
- charset-normalizer==2.0.6
- click==7.1.2
- cnradical==0.1.0
- contextvars==2.4
- cymem==2.0.5
- dataclasses==0.8
- gast==0.5.2
- grpcio==1.40.0
- idna==3.2
- immutables==0.16
- jieba==0.42.1
- joblib==1.0.1
- keras==2.2.4
- keras-applications==1.0.8
- keras-contrib==2.0.8
- keras-preprocessing==1.1.2
- keras-self-attention==0.49.0
- markdown==3.3.4
- murmurhash==1.0.5
- numpy==1.19.5
- pandas==0.25.3
- pathy==0.6.0
- preshed==3.0.5
- protobuf==3.18.0
- pydantic==1.8.2
- pytz==2021.1
- pyyaml==5.4.1
- requests==2.26.0
- scikit-learn==0.24.2
- sklearn==0.0
- smart-open==5.2.1
- spacy==3.1.2
- spacy-legacy==3.0.8
- srsly==2.4.1
- tensorboard==1.12.2
- tensorflow==1.12.3
- termcolor==1.1.0
- thinc==8.0.10
- threadpoolctl==2.2.0
- tqdm==4.39.0
- typer==0.3.2
- urllib3==1.26.6
- wasabi==0.8.2
- werkzeug==2.0.1
prefix: D:\Anaconda\envs\hospital
Using the yml file:
conda env create -f <env-file>.yml --prefix <env-path>
Note that gensim (used below for Word2Vec) is not in the yml; install it separately with pip install gensim.
Installing keras-contrib (you may need a proxy to reach GitHub):
!pip install git+https://www.github.com/keras-team/keras-contrib.git
Alternatively, download the repository zip while proxied, unpack it locally, and run:
pip install ./keras-contrib-master
2. Data
Download the dataset yourself from Tianchi.
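For reference, here is a minimal sketch of how such a data directory is typically laid out, assuming the competition data ships as paired brat-style files: `<doc_id>.txt` with raw text and `<doc_id>.ann` with annotations. The file names and contents below are made up for illustration; the real loader lives in `data_utils`:

```python
import os
import tempfile

# Hypothetical example directory; the actual data goes in ./data/train
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, '0.txt'), 'w', encoding='utf-8') as f:
    f.write('患者既往有糖尿病史。')
with open(os.path.join(data_dir, '0.ann'), 'w', encoding='utf-8') as f:
    # brat-style line: id, category, start offset, end offset, surface text
    f.write('T1\tDisease 5 8\t糖尿病\n')

# Collect document ids that have both a text file and an annotation file
doc_ids = sorted(
    fname[:-4] for fname in os.listdir(data_dir)
    if fname.endswith('.txt')
    and os.path.exists(os.path.join(data_dir, fname[:-4] + '.ann'))
)
print(doc_ids)  # ['0']
```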
3. Code
Jupyter is recommended; run each block below in its own cell.
# Import required modules
import numpy as np
from sklearn.model_selection import ShuffleSplit
from data_utils import ENTITIES, Documents, Dataset, SentenceExtractor, make_predictions
from data_utils import Evaluator
from gensim.models import Word2Vec
# Read the data files
data_dir = "./data/train"
ent2idx = dict(zip(ENTITIES, range(1, len(ENTITIES) + 1)))
idx2ent = dict([(v, k) for k, v in ent2idx.items()])
# Shuffle and split into train and test sets
docs = Documents(data_dir=data_dir)
rs = ShuffleSplit(n_splits=1, test_size=20, random_state=2018)
train_doc_ids, test_doc_ids = next(rs.split(docs))
train_docs, test_docs = docs[train_doc_ids], docs[test_doc_ids]
# Model hyperparameters
num_cates = max(ent2idx.values()) + 1
sent_len = 64
vocab_size = 3000
emb_size = 100
sent_pad = 10
sent_extractor = SentenceExtractor(window_size=sent_len, pad_size=sent_pad)
train_sents = sent_extractor(train_docs)
test_sents = sent_extractor(test_docs)
train_data = Dataset(train_sents, cate2idx=ent2idx)
train_data.build_vocab_dict(vocab_size=vocab_size)
test_data = Dataset(test_sents, word2idx=train_data.word2idx, cate2idx=ent2idx)
vocab_size = len(train_data.word2idx)
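The windowing step above is easy to get wrong: each document is cut into fixed windows of `sent_len` characters, with `sent_pad` characters of context kept on each side, so the model actually sees sequences of `sent_len + 2 * sent_pad` tokens. A minimal sketch of that idea, with small toy sizes (this is an assumption about what `SentenceExtractor` does, not its actual code):

```python
def extract_windows(text, window_size=64, pad_size=10, pad_char='\0'):
    """Split text into fixed windows, keeping pad_size chars of context
    on each side (filled with pad_char at the document boundaries)."""
    padded = pad_char * pad_size + text + pad_char * pad_size
    windows = []
    for start in range(0, len(text), window_size):
        # window [start, start + window_size) in the original text maps to
        # [start, start + window_size + 2 * pad_size) in the padded text
        windows.append(padded[start:start + window_size + 2 * pad_size])
    return windows

wins = extract_windows('abcdefgh', window_size=4, pad_size=2, pad_char='_')
print(wins)  # ['__abcdef', 'cdefgh__']
```

Every window has the same length (`window_size + 2 * pad_size`), which is why `seq_len` is computed as `sent_len + 2 * sent_pad` when the model is built below.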
# Train a character-level Word2Vec model to build the embedding matrix
w2v_train_sents = []
for doc in docs:
    w2v_train_sents.append(list(doc.text))
w2v_model = Word2Vec(w2v_train_sents)
w2v_embeddings = np.zeros((vocab_size, emb_size))
for char, char_idx in train_data.word2idx.items():
    if char in w2v_model.wv:
        w2v_embeddings[char_idx] = w2v_model.wv[char]
# Build the BiLSTM + CRF model
import keras
from keras.layers import Input, Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF
from keras.models import Model

def build_lstm_crf_model(num_cates, seq_len, vocab_size, model_opts=dict()):
    opts = {
        'emb_size': 256,
        'emb_trainable': True,
        'emb_matrix': None,
        'lstm_units': 256,
        'optimizer': keras.optimizers.Adam()
    }
    opts.update(model_opts)

    input_seq = Input(shape=(seq_len,), dtype='int32')
    if opts.get('emb_matrix') is not None:
        embedding = Embedding(vocab_size, opts['emb_size'],
                              weights=[opts['emb_matrix']],
                              trainable=opts['emb_trainable'])
    else:
        embedding = Embedding(vocab_size, opts['emb_size'])
    x = embedding(input_seq)
    lstm = LSTM(opts['lstm_units'], return_sequences=True)
    x = Bidirectional(lstm)(x)
    crf = CRF(num_cates, sparse_target=True)
    output = crf(x)

    model = Model(input_seq, output)
    model.compile(opts['optimizer'], loss=crf.loss_function, metrics=[crf.accuracy])
    return model
# Instantiate the BiLSTM + CRF model
seq_len = sent_len + 2 * sent_pad
model = build_lstm_crf_model(num_cates, seq_len=seq_len, vocab_size=vocab_size,
                             model_opts={'emb_matrix': w2v_embeddings, 'emb_size': 100, 'emb_trainable': False})
model.summary()
# Training data shapes
train_X, train_y = train_data[:]
print('train_X.shape', train_X.shape)
print('train_y.shape', train_y.shape)
# Train the BiLSTM + CRF model
model.fit(train_X, train_y, batch_size=64, epochs=10)
# Predict on the test set
test_X, _ = test_data[:]
preds = model.predict(test_X, batch_size=64, verbose=True)
pred_docs = make_predictions(preds, test_data, sent_pad, docs, idx2ent)
# Print evaluation metrics
f_score, precision, recall = Evaluator.f1_score(test_docs, pred_docs)
print('f_score: ', f_score)
print('precision: ', precision)
print('recall: ', recall)
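For intuition, the entity-level scores printed above boil down to counting predicted spans that exactly match gold spans. A minimal sketch, assuming `Evaluator` compares `(category, start, end)` tuples (a common convention; the real `data_utils` implementation may differ):

```python
def entity_f1(gold_spans, pred_spans):
    """Entity-level F1/precision/recall over (category, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                      # exact span-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, precision, recall

# Made-up spans: one exact match, one boundary mismatch
gold = [('Disease', 5, 8), ('Drug', 12, 15)]
pred = [('Disease', 5, 8), ('Drug', 12, 16)]
f1, p, r = entity_f1(gold, pred)
print(f1, p, r)  # 0.5 0.5 0.5
```

Note that a span with the right category but wrong boundaries counts as both a false positive and a false negative, which is why boundary errors hurt the score twice.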
# Show a sample test document
sample_doc_id = list(pred_docs.keys())[3]
test_docs[sample_doc_id]
# Show the corresponding prediction
pred_docs[sample_doc_id]