Entity Relation Extraction

Extracting entity relations from text

Posted by CHENEY WANG on November 21, 2018

Relation Extraction from Recruitment Posts via Piecewise Convolutional Neural Networks


Abstract

Online employment-oriented services have access to large amounts of text data, yet current recommender systems tend to collect users’ operations on the website and rely only on those features to satisfy the user’s current need. To improve the accuracy of recommendations, constructing a knowledge base by mining this text data plays an important role in helping an employment-oriented service attract users. In this report, we present a case study that builds a knowledge base consisting of relations such as job name and associated workplace, job name and associated company name, and so on. The system first extracts entity relations from post titles and user queries, which helps us establish a simple knowledge base and determine which kinds of relations are useful. Then word2vec and a piecewise convolutional neural network are applied to extract relations from post content. Finally, the knowledge base we establish serves not only the recommendation system but can also be used in the autonomous question answering system of our service.

Overview of our approach

At a high level, the main purpose of our design is to improve the accuracy of recommendations. In this report, we only discuss how to extract relations from the data source and establish a domain knowledge base. Knowledge base construction proceeds as follows:

  1. Collect information and data about recruitment.
  2. Define the set of features to process.
  3. Tokenize and label the data.
  4. Train a classifier/extractor on the labeled training data to extract relations from unseen data.
  5. Compute the coverage of these relations.

Data Collection

To collect data, we constructed a web crawler to visit the websites of the largest online blue-collar recruitment market in China. Below is a sample from our data set.
Post Content:

Design

In this section, we present a solution inspired by Zeng et al. (2015), in which domain relations are extracted via a piecewise convolutional neural network (PCNN). PCNNs were proposed for the automatic learning of features without complicated NLP preprocessing. Figure 1 shows the neural network architecture for relation extraction and illustrates the procedure that handles our data. This procedure includes four main parts: Vector Representation, Convolution, Piecewise Max Pooling, and Softmax Output.

Figure 1. The PCNN architecture used for relation extraction.

Vector Representation

The inputs of our model are raw word tokens. In our method, each input word token is transformed into a vector by looking up pre-trained word embeddings. Moreover, we use position features to specify entity pairs; these are also transformed into vectors by looking up position embeddings.

Word Embeddings

Word embeddings are distributed representations of words that map each word in a text to a k-dimensional real-valued vector. They have recently been shown to capture both semantic and syntactic information about words very well, setting performance records in several word similarity tasks (Mikolov et al., 2013; Pennington et al., 2014). Using word embeddings that have been trained a priori has become common practice for enhancing many other NLP tasks (Parikh et al., 2014; Huang et al., 2014). In this project, we used the Skip-gram model (Mikolov et al., 2013) to train the word embeddings.
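As a minimal sketch of this step (assuming gensim; the two sentences and all hyperparameters except the 200-dimensional vector size mentioned later in the Implementation are illustrative placeholders, not the project's real corpus or settings):

```python
from gensim.models import Word2Vec

# Illustrative pre-tokenized corpus: one list of word tokens per recruitment post.
sentences = [
    ["金光小区", "现在", "急需", "招聘", "小区巡逻", "的", "保安", "一名"],
    ["马云", "是", "阿里巴巴", "创始人"],
]

# sg=1 selects the Skip-gram model (Mikolov et al., 2013); vector_size=200
# matches the 200-dimensional word vectors described in the Implementation.
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)

print(model.wv["保安"].shape)  # (200,) -- one real-valued vector per word
```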

Position Embeddings

In relation extraction, we focus on assigning labels to entity pairs, so we use position features to specify them. A position feature is defined as the combination of the relative distances from the current word to e1 and e2. For instance, in the following example, the relative distances from “急需” to e1 (“金光小区”) and e2 (“保安”) are 2 and -4, respectively.

... 金光小区 现在 急需 招聘 小区巡逻 的 保安 一名 ...
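A minimal sketch of how these relative distances can be computed (the helper below is hypothetical, written only to reproduce the (2, -4) pair from the example above):

```python
def position_features(tokens, e1, e2):
    """For every token, the relative distances to e1 and e2.

    A positive value means the current word comes after the entity,
    a negative value means it comes before, as in the example above.
    """
    i1, i2 = tokens.index(e1), tokens.index(e2)
    return [(i - i1, i - i2) for i in range(len(tokens))]

tokens = ["金光小区", "现在", "急需", "招聘", "小区巡逻", "的", "保安", "一名"]
features = position_features(tokens, "金光小区", "保安")
print(features[tokens.index("急需")])  # (2, -4)
```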

Convolution

Convolution is one of the main building blocks of a PCNN. The term convolution refers to the mathematical combination of two functions to produce a third function; it merges two sets of information.
In the case of a PCNN, convolution is performed on the input data with a filter or kernel (the terms are used interchangeably) to produce a feature map. Convolution is executed by sliding the filter over the input, which results in different feature maps.
In the example shown in Figure 1, we assume that the length of the filter is w (w = 3); thus, the filter is a vector $\mathbf{w} \in \mathbb{R}^m$ (m = w∗d). We consider S to be the sequence {q1, q2, · · · , qs}, where $q_i \in \mathbb{R}^d$. In general, let $q_{i:j}$ refer to the concatenation of $q_i$ to $q_j$. The convolution operation involves taking the dot product of $\mathbf{w}$ with each w-gram in the sequence to obtain another sequence $c \in \mathbb{R}^{s+w-1}$:

$$c_j = \mathbf{w} \, q_{j-w+1:j}$$

where the index j ranges from 1 to s+w-1. Out-of-range input values $q_i$, where i < 1 or i > s, are taken to be zero. The ability to capture different features typically requires the use of multiple filters (or feature maps) in the convolution. Under the assumption that we use n filters (W = {$\mathbf{w}_1$, $\mathbf{w}_2$, …, $\mathbf{w}_n$}), the convolution operation can be expressed as follows:

$$c_{ij} = \mathbf{w}_i \, q_{j-w+1:j}, \quad 1 \le i \le n$$

The convolution result is a matrix $C = \{c_1, c_2, \dots, c_n\} \in \mathbb{R}^{n \times (s+w-1)}$. Figure 1 shows an example in which we use different filters in the convolution procedure.
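A numpy sketch of this wide convolution (the filter count n = 230 and the dimensions are illustrative; a real implementation would vectorize the loop):

```python
import numpy as np

def convolve(S, W, w=3):
    """Wide 1-D convolution as defined above.

    S : (s, d) matrix, one d-dimensional vector q_i per token.
    W : (n, w*d) matrix, n filters, each living in R^{w*d}.
    Returns C with shape (n, s + w - 1); out-of-range q_i are zero.
    """
    s, d = S.shape
    n = W.shape[0]
    # Zero-pad so the windows q_{j-w+1:j} exist for j = 1 .. s+w-1.
    padded = np.vstack([np.zeros((w - 1, d)), S, np.zeros((w - 1, d))])
    C = np.empty((n, s + w - 1))
    for j in range(s + w - 1):
        window = padded[j : j + w].reshape(-1)  # concatenation q_{j-w+1:j}
        C[:, j] = W @ window                    # dot product with every filter
    return C

S = np.random.randn(8, 4)     # s = 8 tokens, d = 4 for illustration
W = np.random.randn(230, 12)  # n = 230 filters, m = w*d = 12
print(convolve(S, W).shape)   # (230, 10) = (n, s + w - 1)
```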

Piecewise Max Pooling

The size of the convolution output matrix $C \in \mathbb{R}^{n×(s+w−1)}$ depends on the number of tokens s in the sentence that is fed into the network. To apply subsequent layers, the features extracted by the convolution layer must be combined so that they are independent of the sentence length. In traditional Convolutional Neural Networks (CNNs), max pooling operations are often applied for this purpose (Collobert et al., 2011; Zeng et al., 2014). The idea of max pooling is to capture the most significant feature (the one with the highest value) in each feature map produced by the convolution layer.
However, single max pooling is insufficient for relation extraction. In this project, an input sentence can be divided into three segments based on the two selected entities. Therefore, we use a piecewise max pooling procedure that returns the maximum value in each segment instead of a single maximum value. As shown in Figure 1 and the example sentence, the output of each convolutional filter $c_i$ is divided into three segments {$c_{i1}$, $c_{i2}$, $c_{i3}$} by “保安” and “金光小区”. The piecewise max pooling procedure can be expressed as follows:

$$p_{ij} = \max(c_{ij}), \quad 1 \le i \le n, \; 1 \le j \le 3$$

For the output of each convolutional filter, we thus obtain a 3-dimensional vector $p_i = \{p_{i1}, p_{i2}, p_{i3}\}$. We then concatenate all vectors $p_{1:n}$ and apply a non-linear function, such as the hyperbolic tangent. Finally, the piecewise max pooling procedure outputs the vector

$$g = \tanh(p_{1:n})$$

where $g \in \mathbb{R}^{3n}$. The size of g is fixed and no longer depends on the sentence length.
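A numpy sketch of this pooling step (entity positions i1 and i2 split the columns; the values are illustrative, and both entities are assumed to be interior so that all three segments are non-empty):

```python
import numpy as np

def piecewise_max_pool(C, i1, i2):
    """Piecewise max pooling as defined above.

    C      : (n, L) convolution output, one row c_i per filter.
    i1, i2 : column positions of the two entities, splitting each
             row into three segments c_{i1}, c_{i2}, c_{i3}.
    Returns g in R^{3n}, independent of the sentence length.
    """
    a, b = sorted((i1, i2))
    segments = [C[:, : a + 1], C[:, a + 1 : b + 1], C[:, b + 1 :]]
    # p_ij = max over segment j of filter i's output
    P = np.stack([seg.max(axis=1) for seg in segments], axis=1)  # (n, 3)
    return np.tanh(P.reshape(-1))                                # g in R^{3n}

C = np.random.randn(230, 10)   # n = 230 filters, 10 positions
g = piecewise_max_pool(C, i1=2, i2=6)
print(g.shape)                 # (690,) = 3n
```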

Softmax Output

To compute the confidence of each relation, the feature vector g is fed into a softmax classifier:

$$o = W_1 g$$

where $W_1 \in \mathbb{R}^{n_1 \times 3n}$ is the transformation matrix and $o \in \mathbb{R}^{n_1}$ is the final output of the network, with $n_1$ equal to the number of possible relation types for the relation extraction system.
Each output can then be interpreted as the confidence score of the corresponding relation; applying a softmax operation turns these scores into conditional probabilities.
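A numpy sketch of this output layer ($n_1$ = 4 is an illustrative choice, e.g. three relation types plus a "no relation" class; a bias term and dropout are omitted):

```python
import numpy as np

def softmax_scores(g, W1):
    """Confidence for each relation type, as described above.

    g  : feature vector in R^{3n} from the pooling layer.
    W1 : (n1, 3n) transformation matrix, n1 = number of relation types.
    """
    o = W1 @ g               # raw output o in R^{n1}
    e = np.exp(o - o.max())  # subtract the max for numerical stability
    return e / e.sum()       # conditional probability of each relation

g = np.random.randn(690)     # 3n with n = 230
W1 = np.random.randn(4, 690) # illustrative: 3 relations + "no relation"
print(softmax_scores(g, W1)) # non-negative scores summing to 1
```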

Implementation

We have a large amount of historical blue-collar recruitment data, consisting of job content posted by businesses.
The first step is to tokenize this content. Chinese differs from English here: an English sentence such as “Jobs is the founder of Apple Inc.” can easily be split into words by whitespace, and the part of speech of each word can be obtained with a Python library such as NLTK. The sentence is then converted into a form like “Jobs[name] is the founder[noun] of Apple[organization] Inc.”, in which “Jobs” is a person name and “Apple” is assigned an [organization] label as its part of speech. Commonly listed English parts of speech include noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, and sometimes numeral, article, or determiner. In Chinese, however, given a sentence like “马云是阿里巴巴创始人”, which means “Jack Ma is the founder of Alibaba”, we have to tokenize it into a form like “马云/[name] 是/[v] 阿里巴巴/[organization brand] 创始人/[noun]” and assign the corresponding POS (part of speech) to each word. Additionally, our training data (photo 2) contains many sentences like “金光小区招聘保安一名”, which means “Jinguang Community recruits one security guard”. We utilized a domain-specific dictionary built by the 58 research group to do the tokenization, so this sentence is converted into the form “金光[organization brand] 小区[work place] 招聘[verb] 保安[job name] 一名[x]”. Finally, we converted the text in photo 1 into the text in photo 2 using Jieba, a Chinese tokenizer, as sketched below.
photo 1 (text without tokenization)
photo 2 (text tokenized)
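A minimal sketch of this tokenization with Jieba (the dictionary entries and their tags below are hypothetical stand-ins for the domain dictionary built by the 58 research group, which is not public):

```python
import jieba
import jieba.posseg as pseg

# Hypothetical stand-in for the 58 research group's domain dictionary;
# jieba's user-dictionary format is one "word frequency tag" entry per line.
with open("recruit_dict.txt", "w", encoding="utf-8") as f:
    f.write("金光 100 brand\n")
    f.write("小区 100 workplace\n")
    f.write("保安 100 jobname\n")
jieba.load_userdict("recruit_dict.txt")

# Words found in the user dictionary keep their domain tags; the rest
# fall back to jieba's default POS tags.
for p in pseg.lcut("金光小区现在急需招聘小区巡逻的保安一名"):
    print(f"{p.word}[{p.flag}]", end=" ")
# e.g. 金光[brand] 小区[workplace] ... 保安[jobname] 一名[m]
```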
Each sample of our training data is a long text, so preprocessing is applied to divide it into separate sentences, each containing a pair of entities and the context between them.
In addition, word embedding is a very important part of our model. After feeding the tokenized text data into the Skip-gram model, we obtain a word-vector dictionary like the one in photo 3.
photo 3
In the above photo, each word (each term) is represented by a vector of length 200.
After preprocessing, I followed the paper “Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks” and implemented its PCNN model as our main model. The 2-dimensional matrix of each sample, which consists of position embeddings and word embeddings, is used as the input of the piecewise convolutional neural network; a sketch of how this matrix is assembled follows.
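A sketch of assembling one sample's input matrix (the position-embedding dimension, the distance-clipping scheme, and the random stand-in lookups are illustrative assumptions, not the project's actual values):

```python
import numpy as np

def build_input(tokens, wv, pos_emb, e1, e2, max_dist=50):
    """Stack per-token rows of [word embedding | position embedding to e1 |
    position embedding to e2], giving the 2-D matrix fed to the PCNN."""
    i1, i2 = tokens.index(e1), tokens.index(e2)
    rows = []
    for i, tok in enumerate(tokens):
        # Clip relative distances and shift them to non-negative indices.
        d1 = int(np.clip(i - i1, -max_dist, max_dist)) + max_dist
        d2 = int(np.clip(i - i2, -max_dist, max_dist)) + max_dist
        rows.append(np.concatenate([wv[tok], pos_emb[d1], pos_emb[d2]]))
    return np.stack(rows)

tokens = ["金光小区", "现在", "急需", "招聘", "小区巡逻", "的", "保安", "一名"]
wv = {t: np.random.randn(200) for t in tokens}  # stand-in for word2vec lookups
pos_emb = np.random.randn(101, 5)               # one 5-d vector per clipped distance
S = build_input(tokens, wv, pos_emb, "金光小区", "保安")
print(S.shape)  # (8, 210): s tokens, d = 200 + 2 * 5
```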
Finally, softmax is applied to perform the classification. In general, we could have many relations:

  1. Relations between company brand and workplace
  2. Relations between company brand and job name
  3. Relations between workplace and job name
  4. Relations between workplace and job salary
  5. Relations between company brand and job salary

However, because of the constraints of our data, we only consider binary classification and restrict ourselves to relations between company brand and workplace, between company brand and job name, and between workplace and job name.

Performance Evaluation and Result

Holdout evaluation is an approach to out-of-sample evaluation whereby the available data are partitioned into a training set and a test set. The test set is thus out-of-sample data and is sometimes called the holdout set or holdout data. In this project, 10% of the data is used for testing and 90% for training.
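A minimal sketch of this split (assuming scikit-learn; X and y are placeholders for the PCNN input matrices and relation labels):

```python
from sklearn.model_selection import train_test_split

X = list(range(1000))             # placeholder samples (the input matrices)
y = [i % 2 for i in range(1000)]  # placeholder binary relation labels

# 90% training, 10% held out for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)
print(len(X_train), len(X_test))  # 900 100
```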

To evaluate the PCNN method, we selected two methods for comparison. One is FP-Growth, a traditional method of association rule learning. The other is a CNN, the deep learning method proposed by Yann LeCun. Figure 2 shows the precision-recall curves for each method, where PCNNs denotes our method; it demonstrates that PCNNs achieve higher precision over the entire range of recall. FP-Growth is based on association rules and discovers regularities between entities in our sentences. Even though it performs well in this particular setting, the recruitment domain, it does not consider the semantics of sentences. Compared with the CNN, the PCNN extracts three main features of each sentence; in relation extraction, the context around the two entities is also necessary, whereas a CNN extracts only one main feature of the entire sentence in its max-pooling layer. As a result, in terms of both precision and recall, PCNNs outperform all other evaluated approaches.

Figure 2. Precision-recall curves of PCNNs, CNN, and FP-Growth.
In addition, to evaluate the entities in the extracted relations, we calculated the coverage of the extracted entities against the entire entity base (Table 1). Because these three labels differ in diversity, workplace entities achieve high coverage more easily than company names and job names given the same training data.

| Label | Entity base | Entities extracted | Coverage |
| --- | --- | --- | --- |
| WorkPlace | 498 | 481 | 96.5% |
| CompanyName | 3888 | 2307 | 59.3% |
| JobName | 24031 | 16304 | 67.8% |
Table 1
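The coverage column follows directly from the two counts; a quick check of Table 1:

```python
# Coverage = extracted entities / entities in the whole entity base.
base = {"WorkPlace": 498, "CompanyName": 3888, "JobName": 24031}
extracted = {"WorkPlace": 481, "CompanyName": 2307, "JobName": 16304}

for label in base:
    print(f"{label}: {extracted[label] / base[label]:.1%}")
# WorkPlace: 96.6%, CompanyName: 59.3%, JobName: 67.8%
# (matches Table 1 up to rounding)
```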
Finally, below is a graph of the relations extracted by our model and manually processed.

Conclusions

In this report, we employed Piecewise Convolutional Neural Networks (PCNNs) for relation extraction in the recruitment domain. In our method, features are learned automatically without complicated NLP preprocessing. We also devised a piecewise max pooling layer in the employed network to capture structural information between the two entities. Experimental results show that this approach offers significant improvements in this project compared with the simple FP-Growth algorithm.

References

1. Zeng, D., Liu, K., Chen, Y., and Zhao, J. (2015). Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. EMNLP 2015.
2. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
3. Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant Supervision for Relation Extraction without Labeled Data. ACL 2009.