Academic frontier trend analysis: add Notebook

Yuzhong Liu
2021-01-02 16:22:17 +08:00
parent a2fe7c5c38
commit aadb27f725
5 changed files with 3325 additions and 0 deletions

File diff suppressed because one or more lines are too long

@@ -0,0 +1,477 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
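"## Task description\n",
"\n",
"- Topic: paper classification (a data modeling task), i.e. building a model on existing data to classify new papers by category;\n",
"- Content: classify papers by category using their titles;\n",
"- Outcome: learn basic text classification methods, `TF-IDF`, etc.;"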
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
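"## Data processing steps\n",
"\n",
"In the raw arXiv data every paper has one or more categories, which are filled in by the authors. In this task we use each paper's title and abstract to:\n",
"\n",
"- process the paper titles and abstracts;\n",
"- process the paper categories;\n",
"- build a text classification model;"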
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
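"## Approaches to text classification\n",
"\n",
"- Approach 1: TF-IDF + machine learning classifier\n",
"\n",
"Extract features from the text directly with TF-IDF, then classify with a conventional classifier such as SVM, LR, or XGBoost.\n",
"\n",
"- Approach 2: FastText\n",
"\n",
"FastText is an entry-level word embedding; the FastText tool released by Facebook can be used to build a classifier quickly.\n",
"\n",
"- Approach 3: Word2Vec + deep learning classifier\n",
"\n",
"Word2Vec is a more advanced word embedding, used here as input to a deep learning classifier. The network architecture can be TextCNN, TextRNN, or BiLSTM.\n",
"\n",
"- Approach 4: BERT word embeddings\n",
"\n",
"BERT is the high-end word embedding, with strong modeling and learning capacity."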
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
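"## Code implementation and walkthrough\n",
"\n",
"To make it easier to get started with text classification, we walk through approach 1 and approach 2. First, read the required fields:"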
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:37:06.067689Z",
"start_time": "2021-01-02T07:37:05.413594Z"
}
},
"outputs": [],
"source": [
"# Import the required packages\n",
"import seaborn as sns  # plotting\n",
"from bs4 import BeautifulSoup  # scraping arXiv data\n",
"import re  # regular expressions for string pattern matching\n",
"import requests  # sending HTTP requests\n",
"import json  # reading data; our data is in JSON format\n",
"import pandas as pd  # data processing and analysis\n",
"import matplotlib.pyplot as plt  # plotting"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:38:47.791291Z",
"start_time": "2021-01-02T07:38:45.515867Z"
}
},
"outputs": [],
"source": [
"def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',\n",
"                                 'report-no', 'categories', 'license', 'abstract', 'versions',\n",
"                                 'update_date', 'authors_parsed'], count=None):\n",
"    '''\n",
"    Read the arXiv metadata file (one JSON object per line) into a DataFrame.\n",
"    path: path to the file\n",
"    columns: columns to keep\n",
"    count: number of lines to read (None reads the whole file)\n",
"    '''\n",
"\n",
"    data = []\n",
"    with open(path, 'r', encoding='utf-8') as f:\n",
"        for idx, line in enumerate(f):\n",
"            if idx == count:\n",
"                break\n",
"\n",
"            d = json.loads(line)\n",
"            d = {col: d[col] for col in columns}\n",
"            data.append(d)\n",
"\n",
"    data = pd.DataFrame(data)\n",
"    return data\n",
"\n",
"data = readArxivFile('arxiv-metadata-oai-snapshot.json',\n",
"                     ['id', 'title', 'categories', 'abstract'],\n",
"                     200000)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To simplify processing, we concatenate the title and abstract and classify on the combined text."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:39:04.746931Z",
"start_time": "2021-01-02T07:39:04.199655Z"
}
},
"outputs": [],
"source": [
"data['text'] = data['title'] + ' ' + data['abstract']\n",
"\n",
"data['text'] = data['text'].apply(lambda x: x.replace('\\n',' '))\n",
"data['text'] = data['text'].apply(lambda x: x.lower())\n",
"data = data.drop(['abstract', 'title'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A paper may have several categories, so these need processing as well:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:39:15.639828Z",
"start_time": "2021-01-02T07:39:15.214064Z"
}
},
"outputs": [],
"source": [
"# All categories of each paper, including subcategories\n",
"data['categories'] = data['categories'].apply(lambda x : x.split(' '))\n",
"\n",
"# Top-level categories only, with the subcategory suffix removed\n",
"data['categories_big'] = data['categories'].apply(lambda x : [xx.split('.')[0] for xx in x])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, encode the categories. Each paper can have several categories, so we use multi-label (multi-hot) encoding:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:39:32.136609Z",
"start_time": "2021-01-02T07:39:31.088518Z"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import MultiLabelBinarizer\n",
"mlb = MultiLabelBinarizer()\n",
"data_label = mlb.fit_transform(data['categories_big'].iloc[:])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Approach 1\n",
"\n",
"Approach 1 extracts features with TF-IDF, keeping at most 4000 words in the vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:40:19.903548Z",
"start_time": "2021-01-02T07:40:07.053896Z"
}
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"vectorizer = TfidfVectorizer(max_features=4000)\n",
"data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this is a multi-label classification problem, we can wrap a base classifier with sklearn's multi-output wrapper:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:41:42.359030Z",
"start_time": "2021-01-02T07:41:40.804323Z"
}
},
"outputs": [],
"source": [
"# Split into training and test sets\n",
"from sklearn.model_selection import train_test_split\n",
"x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,\n",
" test_size = 0.2,random_state = 1)\n",
"\n",
"# Build the multi-label classification model\n",
"from sklearn.multioutput import MultiOutputClassifier\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:41:48.342696Z",
"start_time": "2021-01-02T07:41:48.063639Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"              precision    recall  f1-score   support\n",
"\n",
"           0       0.95      0.85      0.89      7925\n",
"           1       0.85      0.79      0.82      7339\n",
"           2       0.77      0.72      0.74      2944\n",
"           3       0.00      0.00      0.00         4\n",
"           4       0.72      0.48      0.58      2123\n",
"           5       0.51      0.66      0.58       987\n",
"           6       0.86      0.38      0.52       544\n",
"           7       0.71      0.69      0.70      3649\n",
"           8       0.76      0.61      0.68      3388\n",
"           9       0.85      0.88      0.87     10745\n",
"          10       0.46      0.13      0.20      1757\n",
"          11       0.79      0.04      0.07       729\n",
"          12       0.45      0.35      0.39       507\n",
"          13       0.54      0.36      0.43      1083\n",
"          14       0.69      0.14      0.24      3441\n",
"          15       0.84      0.20      0.33       655\n",
"          16       0.93      0.16      0.27       268\n",
"          17       0.87      0.43      0.58      2484\n",
"          18       0.82      0.38      0.52       692\n",
"\n",
"   micro avg       0.81      0.65      0.72     51264\n",
"   macro avg       0.70      0.43      0.50     51264\n",
"weighted avg       0.80      0.65      0.69     51264\n",
" samples avg       0.72      0.72      0.70     51264\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/lyz/.local/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, msg_start, len(result))\n",
"/home/lyz/.local/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.\n",
" _warn_prf(average, modifier, msg_start, len(result))\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"print(classification_report(y_test, clf.predict(x_test)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
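"### Approach 2\n",
"\n",
"Approach 2 here uses a deep learning model with a trainable Keras embedding layer (not the FastText tool): each word is mapped to an embedding and the network is trained on the embedded sequences. First, encode the texts as integer sequences and pad or truncate them to a fixed length:"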
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T07:57:52.147577Z",
"start_time": "2021-01-02T07:57:52.122238Z"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"x_train, x_test, y_train, y_test = train_test_split(data['text'].iloc[:100000], \n",
" data_label[:100000],\n",
" test_size = 0.95,random_state = 1)\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T08:00:14.205263Z",
"start_time": "2021-01-02T08:00:03.246020Z"
}
},
"outputs": [],
"source": [
"# parameter\n",
"max_features= 500\n",
"max_len= 150\n",
"embed_size=100\n",
"batch_size = 128\n",
"epochs = 5\n",
"\n",
"from keras.preprocessing.text import Tokenizer\n",
"from keras.preprocessing import sequence\n",
"\n",
"tokens = Tokenizer(num_words = max_features)\n",
"tokens.fit_on_texts(list(data['text'].iloc[:100000]))\n",
"\n",
"y_train = data_label[:100000]\n",
"x_sub_train = tokens.texts_to_sequences(data['text'].iloc[:100000])\n",
"x_sub_train = sequence.pad_sequences(x_sub_train, maxlen=max_len)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the model and train it:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2021-01-02T08:08:55.690388Z",
"start_time": "2021-01-02T08:00:19.943791Z"
},
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/5\n",
"625/625 [==============================] - 103s 161ms/step - loss: 0.2149 - accuracy: 0.4019 - val_loss: 0.1167 - val_accuracy: 0.6583\n",
"Epoch 2/5\n",
"625/625 [==============================] - 102s 163ms/step - loss: 0.1141 - accuracy: 0.6699 - val_loss: 0.1058 - val_accuracy: 0.6883\n",
"Epoch 3/5\n",
"625/625 [==============================] - 103s 165ms/step - loss: 0.1059 - accuracy: 0.6923 - val_loss: 0.0998 - val_accuracy: 0.7059\n",
"Epoch 4/5\n",
"625/625 [==============================] - 103s 165ms/step - loss: 0.1000 - accuracy: 0.7019 - val_loss: 0.0962 - val_accuracy: 0.7171\n",
"Epoch 5/5\n",
"625/625 [==============================] - 105s 168ms/step - loss: 0.0961 - accuracy: 0.7143 - val_loss: 0.0950 - val_accuracy: 0.7214\n"
]
},
{
"data": {
"text/plain": [
"<tensorflow.python.keras.callbacks.History at 0x7f3c9f4ef6d8>"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Bidirectional GRU + Conv1D model\n",
"# Keras Layers:\n",
"from keras.layers import Dense,Input,LSTM,Bidirectional,Activation,Conv1D,GRU\n",
"from keras.layers import Dropout,Embedding,GlobalMaxPooling1D, MaxPooling1D, Add, Flatten\n",
"from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D\n",
"# Keras Callback Functions:\n",
"from keras.callbacks import Callback\n",
"from keras.callbacks import EarlyStopping,ModelCheckpoint\n",
"from keras import initializers, regularizers, constraints, optimizers, layers, callbacks\n",
"from keras.models import Model\n",
"from keras.optimizers import Adam\n",
"\n",
"sequence_input = Input(shape=(max_len, ))\n",
"x = Embedding(max_features, embed_size, trainable=True)(sequence_input)\n",
"x = SpatialDropout1D(0.2)(x)\n",
"x = Bidirectional(GRU(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)\n",
"x = Conv1D(64, kernel_size = 3, padding = \"valid\", kernel_initializer = \"glorot_uniform\")(x)\n",
"avg_pool = GlobalAveragePooling1D()(x)\n",
"max_pool = GlobalMaxPooling1D()(x)\n",
"x = concatenate([avg_pool, max_pool]) \n",
"preds = Dense(19, activation=\"sigmoid\")(x)\n",
"\n",
"model = Model(sequence_input, preds)\n",
"model.compile(loss='binary_crossentropy',optimizer=Adam(learning_rate=1e-3),metrics=['accuracy'])\n",
"model.fit(x_sub_train, y_train, \n",
" batch_size=batch_size, \n",
" validation_split=0.2,\n",
" epochs=epochs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}
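
The label-processing step in the notebook above (split the category string, keep the top-level category, then multi-hot encode) can be sketched in plain Python as follows. This is only an illustration of what the encoding produces: the notebook itself uses sklearn's `MultiLabelBinarizer`, and the three category strings and the helper names `top_level` and `multi_hot` are made up for this example.

```python
# Hypothetical category strings in the arXiv format (space-separated).
raw_categories = ["hep-ph", "math.CO cs.CG", "cs.LG stat.ML cs.AI"]


def top_level(cat_string):
    """Split a category string and keep the part before the dot, deduplicated."""
    return sorted({c.split(".")[0] for c in cat_string.split(" ")})


def multi_hot(label_lists):
    """Return (classes, rows) where rows[i][j] is 1 if sample i has classes[j]."""
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: j for j, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return classes, rows


categories_big = [top_level(s) for s in raw_categories]
classes, labels = multi_hot(categories_big)

print(classes)  # column order of the encoded matrix
for cats, row in zip(categories_big, labels):
    print(cats, "->", row)
```

Each row has one column per top-level category, which is why the model in the notebook predicts 19 independent binary outputs rather than a single class.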

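
To make the TF-IDF features used in approach 1 more concrete, here is a minimal sketch of the textbook weighting, term frequency times log inverse document frequency, computed by hand. The three-document corpus and the function name `tfidf` are assumptions for illustration; the notebook relies on sklearn's `TfidfVectorizer` instead.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for lower-cased title + abstract text.
docs = [
    "quantum field theory",
    "graph theory and combinatorics",
    "deep learning for graph data",
]


def tfidf(corpus):
    """Textbook TF-IDF: (term count / doc length) * log(N / document frequency)."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    weights = ", ".join(f"{term}={w:.3f}" for term, w in sorted(vec.items()))
    print(f"{doc!r}: {weights}")
```

Terms that occur in several documents (here `theory` and `graph`) receive lower weights than terms unique to one document. By default `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row, so its values differ from this textbook variant even though the ranking idea is the same.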