Knowledge Graph — A Powerful Data Science Technique to Mine Information from Text (with Python code)
Overview
- Knowledge graphs are one of the most fascinating concepts in data science
- Learn how to build a knowledge graph using text from Wikipedia data
- We will be working hands-on in Python to build our knowledge graph using the popular spaCy library
Introduction
Lionel Messi needs no introduction. Even folks who don’t follow football have heard about the brilliance of one of the greatest players to have graced the sport. Here’s his Wikipedia page:
Quite a lot of information there! We have text, tons of hyperlinks, and even an audio clip. That’s a lot of relevant and potentially useful information on a single page. The possible use cases for it are endless.
However, there is a slight problem. This is not an ideal source of data to feed to our machines. Not in its current form anyway.
Can we find a way to make this text data readable for machines? Essentially, can we transform this text data into something that can be used by the machines and also can be interpreted easily by us?
Yes, we can! We can do it with the help of Knowledge Graphs (KG), one of the most fascinating concepts in data science. I have been blown away by the sheer potential and applications of knowledge graphs, and I am sure you will be as well.
In this article, you will learn what knowledge graphs are, why they’re useful, and then we’ll dive into code by building our own knowledge graph on data extracted from Wikipedia.
Table of Contents
- What is a Knowledge Graph?
- How to Represent Knowledge in a Graph?
- Sentence Segmentation
- Entities Extraction
- Relations Extraction
- Build a Knowledge Graph from Text Data
What is a Knowledge Graph?
Let’s get one thing out of the way: we will see the term “graphs” a lot in this article. We do not mean bar charts, pie charts, or line plots when we say graphs. Here, we are talking about interconnected entities, which can be people, locations, organizations, or even events.
We can define a graph as a set of nodes and edges.
Take a look at the figure below:
Node A and Node B here are two different entities. These nodes are connected by an edge that represents the relationship between the two nodes. Now, this is the smallest knowledge graph we can build — it is also known as a triple.
Knowledge graphs come in a variety of shapes and sizes. For example, the knowledge graph of Wikidata had 59,910,568 nodes by October 2019.
How to Represent Knowledge in a Graph?
Before we get started with building Knowledge Graphs, it is important to understand how information or knowledge is embedded in these graphs.
Let me explain this using an example. If Node A = Putin and Node B = Russia, then it is quite likely that the edge would be “president of”:
A node or an entity can have multiple relations as well. Putin is not only the President of Russia, he also worked for the Soviet Union’s security agency, KGB. But how do we incorporate this new information about Putin in the knowledge graph above?
It’s actually pretty simple. Just add one more node for the new entity, KGB:
New relationships can emerge not only from the first node but from any node in a knowledge graph, as shown below:
Russia is a member of the Asia Pacific Economic Cooperation (APEC).
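The three facts above can be written down as (subject, relation, object) triples. Here is a minimal sketch in plain Python. The relation labels “worked for” and “member of” are my own wording for the edges described in the text, and neighbours is a hypothetical helper name, not part of any library:

```python
# Each fact is stored as one (subject, relation, object) triple.
triples = [
    ("Putin", "president of", "Russia"),
    ("Putin", "worked for", "KGB"),
    ("Russia", "member of", "APEC"),
]

def neighbours(node, triples):
    """Return (relation, object) pairs for every edge that starts at `node`."""
    return [(rel, obj) for subj, rel, obj in triples if subj == node]

print(neighbours("Putin", triples))
print(neighbours("Russia", triples))
```

Adding a new fact is just appending one more triple, which is why the graph grows so easily from any node.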
Identifying the entities and the relation between them is not a difficult task for us. However, manually building a knowledge graph is not scalable. Nobody is going to go through thousands of documents and extract all the entities and the relations between them!
That’s why machines are more suitable to perform this task as going through even hundreds or thousands of documents is child’s play for them. But then there is another challenge — machines do not understand natural language. This is where Natural Language Processing (NLP) comes into the picture.
To build a knowledge graph from text, we need to make our machine understand natural language. This can be done by using NLP techniques such as sentence segmentation, dependency parsing, part-of-speech tagging, and entity recognition. Let’s discuss these in a bit more detail.
Sentence Segmentation
The first step in building a knowledge graph is to split the text document or article into sentences. Then, we will shortlist only those sentences in which there is exactly 1 subject and 1 object. Let’s look at a sample text below:
“Indian tennis player Sumit Nagal moved up six places from 135 to a career-best 129 in the latest men’s singles ranking. The 22-year-old recently won the ATP Challenger tournament. He made his Grand Slam debut against Federer in the 2019 US Open. Nagal won the first set.”
Let’s split the paragraph above into sentences:
- Indian tennis player Sumit Nagal moved up six places from 135 to a career-best 129 in the latest men’s singles ranking
- The 22-year-old recently won the ATP Challenger tournament
- He made his Grand Slam debut against Federer in the 2019 US Open
- Nagal won the first set
Out of these four sentences, we will shortlist the second and the fourth sentences because each of them contains 1 subject and 1 object.
In the second sentence, “22-year-old” is the subject and the object is “ATP Challenger tournament”. In the fourth sentence, the subject is “Nagal” and “first set” is the object:
The challenge is to make your machine understand the text, especially in the cases of multi-word objects and subjects. For example, extracting the objects in both the sentences above is a bit tricky. Can you think of any method to solve this problem?
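The splitting step at the start of this section can be sketched with a regular expression. This is a deliberately naive rule (break after a full stop, question mark, or exclamation mark followed by whitespace), and split_sentences is a hypothetical helper name. It will stumble on abbreviations such as “Dr.”, which is one reason to prefer a proper NLP library (spaCy, for instance, exposes the sentences of a parsed document as doc.sents):

```python
import re

text = (
    "Indian tennis player Sumit Nagal moved up six places from 135 to a "
    "career-best 129 in the latest men's singles ranking. The 22-year-old "
    "recently won the ATP Challenger tournament. He made his Grand Slam debut "
    "against Federer in the 2019 US Open. Nagal won the first set."
)

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]

sentences = split_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

On this paragraph the rule yields the four sentences listed above.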
Entities Extraction
The extraction of a single word entity from a sentence is not a tough task. We can easily do this with the help of parts of speech (POS) tags. The nouns and the proper nouns would be our entities.
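To make the POS idea concrete, here is a small sketch that works on hand-tagged (word, POS) pairs, so it runs without any model. The tags are typed in by me using the Universal POS labels that spaCy reports (NOUN, PROPN, and so on), and single_word_entities is a hypothetical helper name:

```python
# Hand-tagged tokens for "Nagal won the first set." (tags are an assumption).
tagged = [
    ("Nagal", "PROPN"), ("won", "VERB"), ("the", "DET"),
    ("first", "ADJ"), ("set", "NOUN"), (".", "PUNCT"),
]

def single_word_entities(tagged):
    """Keep only the words tagged as nouns or proper nouns."""
    return [word for word, pos in tagged if pos in ("NOUN", "PROPN")]

print(single_word_entities(tagged))
```

Notice that this picks up “set” rather than “first set”, which is exactly the limitation discussed next.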
However, when an entity spans across multiple words, then POS tags alone are not sufficient. We need to parse the dependency tree of the sentence. You can read more about dependency parsing in the following article.
Let’s get the dependency tags for one of the shortlisted sentences. I will use the popular spaCy library for this task:
Output:
The … det
22-year … amod
- … punct
old … nsubj
recently … advmod
won … ROOT
ATP … compound
Challenger … compound
tournament … dobj
. … punct
The subject (nsubj) in this sentence, as per the dependency parser, is “old”. That is not the desired entity. We wanted to extract “22-year-old” instead.
The dependency tag of “22-year” is amod which means it is a modifier of “old”. Hence, we should define a rule to extract such entities.
The rule can be something like this — extract the subject/object along with its modifiers and also extract the punctuation marks between them.
But then look at the object (dobj) in the sentence. It is just “tournament” instead of “ATP Challenger tournament”. Here, we don’t have the modifiers but we do have compound words.
Compound words are those words that collectively form a new term with a different meaning. Therefore, we can update the above rule to: extract the subject/object along with its modifiers and compound words, and also extract the punctuation marks between them.
In short, we will use dependency parsing to extract entities.
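The rule above can be sketched directly on the parser output shown earlier. The (word, tag) pairs below are copied from that output, with the hyphen token written as "-". The helper names expand_entity and find_entities are hypothetical, and this is only one way to implement the rule, not necessarily how a production pipeline would do it:

```python
# (word, dependency tag) pairs copied from the parser output shown above.
tagged = [
    ("The", "det"), ("22-year", "amod"), ("-", "punct"), ("old", "nsubj"),
    ("recently", "advmod"), ("won", "ROOT"), ("ATP", "compound"),
    ("Challenger", "compound"), ("tournament", "dobj"), (".", "punct"),
]

def expand_entity(tagged, head_index):
    """Grow a subject/object leftwards over its modifiers and compounds."""
    start = head_index
    i = head_index - 1
    while i >= 0:
        dep = tagged[i][1]
        if dep in ("amod", "compound"):
            start = i      # part of the entity, so extend the span
        elif dep != "punct":
            break          # any other tag ends the span
        i -= 1             # punctuation is kept only if a modifier precedes it
    text, glue = "", True
    for word, dep in tagged[start:head_index + 1]:
        if dep == "punct":
            text += word   # attach punctuation without surrounding spaces
            glue = True
        else:
            text += word if glue else " " + word
            glue = False
    return text

def find_entities(tagged):
    """Return the expanded subject and object of one tagged sentence."""
    subject = obj = ""
    for i, (_, dep) in enumerate(tagged):
        if "subj" in dep:
            subject = expand_entity(tagged, i)
        elif "obj" in dep:
            obj = expand_entity(tagged, i)
    return subject, obj

subject, obj = find_entities(tagged)
print(subject)
print(obj)
```

This prints “22-year-old” and “ATP Challenger tournament”, the two entities we wanted.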
Extract Relations
Entity extraction is half the job done. To build a knowledge graph, we need edges to connect the nodes (entities) to one another. These edges are the relations between a pair of nodes.
Let’s go back to the example in the last section. We shortlisted a couple of sentences to build a knowledge graph:
Can you guess the relation between the subject and the object in these two sentences?
Both sentences have the same relation — “won”. Let’s see how these relations can be extracted. We will again use dependency parsing:
Output:
Nagal … nsubj
won … ROOT
the … det
first … amod
set … dobj
. … punct
To extract the relation, we have to find the ROOT of the sentence, which in sentences like these is also the main verb. Hence, the relation extracted from this sentence would be “won”.
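As a quick illustration, the ROOT lookup can be run on the two tagged sentences from this section. The tag lists are copied from the outputs shown earlier, the entities are the ones identified in the text, and root_word is a hypothetical helper name:

```python
# Tagged tokens copied from the two parser outputs shown in the article.
sentence_2 = [
    ("The", "det"), ("22-year", "amod"), ("-", "punct"), ("old", "nsubj"),
    ("recently", "advmod"), ("won", "ROOT"), ("ATP", "compound"),
    ("Challenger", "compound"), ("tournament", "dobj"), (".", "punct"),
]
sentence_4 = [
    ("Nagal", "nsubj"), ("won", "ROOT"), ("the", "det"),
    ("first", "amod"), ("set", "dobj"), (".", "punct"),
]

def root_word(tagged):
    """Return the word tagged ROOT, or None if the sentence has none."""
    for word, dep in tagged:
        if dep == "ROOT":
            return word
    return None

# Combine each relation with the entities identified earlier in the text.
triples = [
    ("22-year-old", root_word(sentence_2), "ATP Challenger tournament"),
    ("Nagal", root_word(sentence_4), "first set"),
]
for triple in triples:
    print(triple)
```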
Finally, the knowledge graph from these two sentences will be like this:
Build a Knowledge Graph from Text Data
Time to get our hands on some code! Let’s fire up our Jupyter Notebooks (or whatever IDE you prefer).
We will build a knowledge graph from scratch using text from a set of Wikipedia articles related to movies and films. I have already extracted around 4,300 sentences from over 500 Wikipedia articles. Each of these sentences contains exactly two entities: one subject and one object. You can download these sentences from here.
I suggest using Google Colab for this implementation to speed up the computation time.
Import Libraries
Read Data
Read the CSV file containing the Wikipedia sentences:
Output: (4318, 1)
Let’s inspect a few sample sentences:
candidate_sentences['sentence'].sample(5)
Output:
Let’s check the subject and object of one of these sentences. Ideally, there should be one subject and one object in the sentence:
Output:
Perfect! There is only one subject (‘process’) and only one object (‘standard’). You can check for other sentences in a similar manner.
Entity Pairs Extraction
To build a knowledge graph, the most important things are the nodes and the edges between them.
These nodes are going to be the entities that are present in the Wikipedia sentences. Edges are the relationships connecting these entities to one another. We will extract these elements in an unsupervised manner, i.e., we will use the grammar of the sentences.
The main idea is to go through a sentence and extract the subject and the object as and when they are encountered. However, there are a few challenges: an entity can span multiple words, e.g., “red wine”, and dependency parsers tag only individual words as subjects or objects.
So, I have created a function below to extract the subject and the object (entities) from a sentence while also overcoming the challenges mentioned above. I have partitioned the code into multiple chunks for your convenience:
Let me explain the code chunks in the function above:
Chunk 1
I have defined a few empty variables in this chunk. prv_tok_dep and prv_tok_text will hold the dependency tag of the previous word in the sentence and that previous word itself, respectively.
prefix and modifier will hold the text that is associated with the subject or the object.
Chunk 2
Next, we will loop through the tokens in the sentence. We will first check if the token is a punctuation mark or not. If yes, then we will ignore it and move on to the next token.
If the token is a part of a compound word (dependency tag = “compound”), we will keep it in the prefix variable. A compound word is a combination of multiple words linked to form a word with a new meaning (example — “Football Stadium”, “animal lover”).
As and when we come across a subject or an object in the sentence, we will add this prefix to it. We will do the same thing with the modifier words, such as “nice shirt”, “big house”, etc.
Chunk 3
Here, if the token is the subject, then it will be captured as the first entity in the ent1 variable. Variables such as prefix, modifier, prv_tok_dep, and prv_tok_text will be reset.
Chunk 4
Here, if the token is the object, then it will be captured as the second entity in the ent2 variable. Variables such as prefix, modifier, prv_tok_dep, and prv_tok_text will again be reset.
Chunk 5
Once we have captured the subject and the object in the sentence, we will update the previous token and its dependency tag.
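The five chunks described above can be sketched as follows. This is my reconstruction from the description, not the original listing. To keep it runnable without a language model, it takes a list of (word, dependency tag) pairs instead of a raw sentence, and I have named it get_entities_from_tags to keep it distinct from the article’s get_entities. With spaCy, one would presumably feed it [(tok.text, tok.dep_) for tok in nlp(sentence)], assuming nlp is a loaded English pipeline:

```python
def get_entities_from_tags(tagged):
    # Chunk 1: empty variables for the entities and the running context.
    ent1 = ent2 = ""
    prv_tok_dep = prv_tok_text = ""
    prefix = modifier = ""

    for text, dep in tagged:
        # Chunk 2: skip punctuation, collect compound words and modifiers.
        if dep == "punct":
            continue
        if dep == "compound":
            prefix = f"{prv_tok_text} {text}" if prv_tok_dep == "compound" else text
        if dep.endswith("mod"):
            modifier = f"{prv_tok_text} {text}" if prv_tok_dep == "compound" else text

        # Chunk 3: the subject becomes the first entity, then reset.
        if "subj" in dep:
            ent1 = " ".join(part for part in (modifier, prefix, text) if part)
            prefix = modifier = prv_tok_dep = prv_tok_text = ""

        # Chunk 4: the object becomes the second entity, then reset.
        if "obj" in dep:
            ent2 = " ".join(part for part in (modifier, prefix, text) if part)
            prefix = modifier = prv_tok_dep = prv_tok_text = ""

        # Chunk 5: remember the current token for the next iteration.
        prv_tok_dep, prv_tok_text = dep, text

    return [ent1, ent2]

# Hand-written tags for "the film had 200 patents" (an assumption).
sample = [("the", "det"), ("film", "nsubj"), ("had", "ROOT"),
          ("200", "nummod"), ("patents", "dobj")]
print(get_entities_from_tags(sample))
```

Note that “200” is picked up as a modifier because its tag, nummod, ends in “mod”.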
Let’s test this function on a sentence:
get_entities("the film had 200 patents")
Output: ['film', '200 patents']
Great, it seems to be working as planned. In the above sentence, ‘film’ is the subject and ‘200 patents’ is the object.
Now we can use this function to extract these entity pairs for all the sentences in our data:
The list entity_pairs contains all the subject-object pairs from the Wikipedia sentences. Let’s have a look at a few of them:
entity_pairs[10:20]
Output:
As you can see, there are a few pronouns in these entity pairs such as ‘we’, ‘it’, ‘she’, etc. We’d like to have proper nouns or nouns instead. Perhaps we can further improve the get_entities() function to filter out pronouns. For the time being, let’s leave it as it is and move on to the relation extraction part.
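If you do want to drop the pronouns, one simple option is to filter the pairs after extraction rather than change the function itself. Here is a minimal sketch. The pronoun list is a small hand-picked set, drop_pronoun_pairs is a hypothetical helper name, and the sample pairs are made up for illustration (only the last one comes from the example above):

```python
# A small, hand-picked set of pronouns; extend it as needed.
PRONOUNS = {"i", "we", "you", "he", "she", "it", "they",
            "me", "us", "him", "her", "them", "who", "which"}

def drop_pronoun_pairs(pairs):
    """Keep only the pairs in which neither entity is a bare pronoun."""
    return [pair for pair in pairs
            if not any(entity.lower() in PRONOUNS for entity in pair)]

# Illustrative pairs, not taken from the real dataset.
sample_pairs = [["we", "film"], ["She", "music"], ["film", "200 patents"]]
print(drop_pronoun_pairs(sample_pairs))
```

The trade-off is that we lose those sentences entirely instead of resolving who “we” or “she” refers to.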
Relation / Predicate Extraction
This is going to be a very interesting aspect of this article. Our hypothesis is that the predicate is actually the main verb in a sentence.
For example, in the sentence — “Sixty Hollywood musicals were released in 1929”, the verb is “released in” and this is what we are going to use as the predicate for the triple generated from this sentence.
The function below is capable of capturing such predicates from the sentences. Here, I have used spaCy’s rule-based matching:
The pattern defined in the function tries to find the ROOT word or the main verb in the sentence. Once the ROOT is identified, then the pattern checks whether it is followed by a preposition (‘prep’) or an agent word. If yes, then it is added to the ROOT word.
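The pattern logic can be illustrated without spaCy’s matcher by using a plain loop over tagged tokens. This is a stand-in sketch, not the function referred to above: the name get_relation_from_tags is hypothetical, and the dependency tags below are typed in by hand as an assumption about what a parser would produce:

```python
def get_relation_from_tags(tagged):
    """Return the ROOT word, plus the next word if it is a prep or agent."""
    for i, (word, dep) in enumerate(tagged):
        if dep == "ROOT":
            parts = [word]
            if i + 1 < len(tagged) and tagged[i + 1][1] in ("prep", "agent"):
                parts.append(tagged[i + 1][0])
            return " ".join(parts)
    return ""

# Hand-written tags for the two example sentences (assumptions).
musicals = [("Sixty", "nummod"), ("Hollywood", "compound"),
            ("musicals", "nsubjpass"), ("were", "auxpass"),
            ("released", "ROOT"), ("in", "prep"), ("1929", "pobj")]
task = [("John", "nsubj"), ("completed", "ROOT"),
        ("the", "det"), ("task", "dobj")]

print(get_relation_from_tags(musicals))
print(get_relation_from_tags(task))
```

The first sentence gives “released in” because the ROOT is followed by a preposition, while the second gives just “completed”.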
Let me show you a glimpse of this function:
get_relation("John completed the task")
Output: completed
Similarly, let’s get the relations from all the Wikipedia sentences:
relations = [get_relation(i) for i in
tqdm(candidate_sentences['sentence'])]
Let’s take a look at the most frequent relations or predicates that we have just extracted:
pd.Series(relations).value_counts()[:50]
It turns out that relations like “A is B” and “A was B” are the most common relations. However, there are quite a few relations that are more associated with the overall theme — “the ecosystem around movies”. Some of the examples are “composed by”, “released in”, “produced”, “written by” and a few more.
Build a Knowledge Graph
We will finally create a knowledge graph from the extracted entities (subject-object pairs) and the predicates (relation between entities).
Let’s create a dataframe of entities and predicates:
Next, we will use the networkx library to create a network from this dataframe. The nodes will represent the entities and the edges or connections between the nodes will represent the relations between the nodes.
It is going to be a directed graph. In other words, the relation between any connected node pair is not two-way, it is only from one node to another. For example, “John eats pasta”:
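A directed edge such as this one can be sketched with a plain dictionary that maps each source node to its outgoing edges. This is a dependency-free stand-in for what a graph library does, and build_directed_graph and its argument names are my own assumptions:

```python
from collections import defaultdict

def build_directed_graph(sources, targets, edges):
    """Map each source node to a list of (relation, target) pairs."""
    graph = defaultdict(list)
    for source, target, edge in zip(sources, targets, edges):
        graph[source].append((edge, target))
    return graph

graph = build_directed_graph(["John"], ["pasta"], ["eats"])
print(dict(graph))
```

The edge is stored only under “John”, so there is no path back from “pasta” to “John”, which is what makes the graph directed.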
Let’s plot the network:
Output:
🥶 Well, this is not exactly what we were hoping for (still looks quite a sight though!).
It turns out that we have created a graph with all the relations that we had. It becomes really hard to visualize a graph with this many relations or predicates.
So, it’s advisable to use only a few important relations to visualize a graph. I will take one relation at a time. Let’s start with the relation “composed by”:
Output:
That’s a much cleaner graph. Here the arrows point towards the composers. For instance, A.R. Rahman, who is a renowned music composer, has entities like “soundtrack score”, “film score”, and “music” connected to him in the graph above.
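The filtering idea behind this plot can be sketched in a few lines: keep only the rows whose relation matches the one we care about. The rows below are the examples mentioned in this article, written as (source, relation, target), and edges_for_relation is a hypothetical helper name:

```python
# (source, relation, target) rows based on examples from the article.
rows = [
    ("soundtrack score", "composed by", "A.R. Rahman"),
    ("film score", "composed by", "A.R. Rahman"),
    ("music", "composed by", "A.R. Rahman"),
    ("pk", "released on", "4844 screens"),
]

def edges_for_relation(rows, relation):
    """Return the (source, target) pairs connected by `relation`."""
    return [(source, target) for source, edge, target in rows
            if edge == relation]

for source, target in edges_for_relation(rows, "composed by"):
    print(source, "->", target)
```

Each arrow points from the composed work to the composer, matching the direction of the edges in the graph.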
Let’s check out a few more relations.
Since writing is an important role in any movie, I would like to visualize the graph for the “written by” relation:
Output:
Awesome! This knowledge graph is giving us some extraordinary information. Guys like Javed Akhtar, Krishna Chaitanya, and Jaideep Sahni are all famous lyricists and this graph beautifully captures this relationship.
Let’s see the knowledge graph of another important predicate, i.e., the “released in”:
Output:
I can see quite a lot of interesting information in this graph. For example, look at these relationships: “several action horror movies released in the 1980s” and “pk released on 4844 screens”. These are facts, and they show that we can mine such facts from plain text. That’s quite amazing!
End Notes
In this article, we learned how to extract information from a given text in the form of triples and build a knowledge graph from it.
However, we restricted ourselves to using sentences with exactly 2 entities. Even then we were able to build quite informative knowledge graphs. Imagine the potential we have here!
I encourage you to explore this field of information extraction further and learn how to extract more complex relationships. If you have any questions or want to share your thoughts, please feel free to use the comments section below.
Feel free to reach out to me at topmate.io/data_science