48pm is a platform for collecting and processing articles from the world’s main online newspapers. Currently (April 2021) it has about 30 million archived articles, and around 40,000 new articles are added every day.
The collected articles are processed, and the processing has two tasks:
- Assign a general topic to an article, hereinafter called the thematic area, from a limited set of possibilities, such as Sport, Technology, Politics and more.
- Identify the subjects covered in the article, hereinafter called entities.
The name “entity” comes from the Wikidata entities from which they derive.
Purpose of the article
The article describes:
- The process of assigning a thematic area to an article
- The process to extract entities associated with an article
What is Natural language processing?
As described on Wikipedia:
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
Also on Wikipedia we find this statement:
Natural language understanding is often considered an AI-complete problem, as language recognition is thought to require extensive knowledge of the world and a great ability to manipulate it. For this reason, the definition of ‘understanding’ is one of the major problems of natural language processing.
Due to this complexity and the evident need for “extensive knowledge of the world”, the process uses an extended set of information.
The toolbox consists of:
- Wikidata definitions
- Meta information about the property-value pairs of Wikidata definitions
- Extended set of articles
- Statistics on Significant Patterns and Significant Entities
- Statistics on “context information” (information not directly attributable to the text but associated with it) as Authors (who publishes articles) and Writers (who writes articles)
As said, the process does not aim at “awareness” of the content of an article; more “simply”, it wants to assign a thematic area and extract entities using:
- Pattern recognition
- Statistical models
- Property-value models
- Relations between entities
The use of property-value information associated with entities allows a conceptual model, not just a statistical one. The value singer in the profession property is significant even without statistical data. However, the choice of which property-values are significant (and what they mean) should be the result of machine learning algorithms.
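To make the idea concrete, here is a minimal sketch of a conceptual (non-statistical) check. The rule table, property names and area names are illustrative assumptions, not the platform’s actual data:

```python
# Hypothetical sketch: entities as Wikidata-style property-value maps,
# where certain (property, value) pairs imply a thematic area by
# themselves, with no statistical evidence needed.
# All rules below are illustrative assumptions.

CONCEPTUAL_RULES = {
    ("profession", "singer"): "Music",
    ("profession", "footballer"): "Sport",
    ("instance_of", "political party"): "Politics",
}

def conceptual_areas(entity: dict) -> set:
    """Return the thematic areas implied by an entity's property-value pairs."""
    areas = set()
    for prop, values in entity.items():
        for value in values:
            area = CONCEPTUAL_RULES.get((prop, value))
            if area:
                areas.add(area)
    return areas

# The value "singer" links the entity to Music without any statistics.
entity = {"profession": ["singer", "actress"], "instance_of": ["human"]}
print(conceptual_areas(entity))  # {'Music'}
```

In a real system, the rule table itself would be learned rather than hand-written, which is exactly the point made above.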
Through data mining techniques, useful for example to identify significant patterns, the process learns from the acquired knowledge and continuously updates its predictive model.
analysis > result > learning
Therefore, the result of one analysis changes the models used in the subsequent analysis. This sequence has a problem: if a result is incorrect, the next model will be less accurate, and if an error is repeated without correction, it becomes “false knowledge”.
There are several strategies to overcome or at least contain this problem, including:
- Thematic markers
- Human supervision
Selection of significant patterns
In the learning phase, the best candidates for becoming “significant patterns” are the most common ones within a homogeneous group of data already analyzed. In addition, several strategies are applied during the selection, for example:
- Removing incomplete patterns (like those that start or end with a preposition)
- Removing “sub patterns” of more significant patterns
- Removing patterns that have more generic “sub patterns”
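Two of the strategies above can be sketched in a few lines. This is a minimal illustration, assuming patterns are word tuples counted over already-analyzed articles; the preposition list and frequency threshold are assumptions:

```python
# Illustrative sketch of significant-pattern selection: keep frequent,
# "complete" patterns and drop sub-patterns of more significant ones.
# Preposition list and min_count are assumptions.

PREPOSITIONS = {"of", "in", "on", "at", "to", "for", "with", "by"}

def is_complete(pattern: tuple) -> bool:
    """Discard incomplete patterns: those starting or ending with a preposition."""
    return pattern[0] not in PREPOSITIONS and pattern[-1] not in PREPOSITIONS

def is_subpattern(shorter: tuple, longer: tuple) -> bool:
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n, m = len(shorter), len(longer)
    return n < m and any(longer[i:i + n] == shorter for i in range(m - n + 1))

def select_significant(counts: dict, min_count: int = 3) -> set:
    """Keep frequent complete patterns; drop sub-patterns of kept ones."""
    frequent = [p for p, c in counts.items()
                if c >= min_count and is_complete(p)]
    return {p for p in frequent
            if not any(is_subpattern(p, q) for q in frequent)}

counts = {
    ("world", "cup"): 10,
    ("world",): 12,    # sub-pattern of ("world", "cup") -> dropped
    ("cup", "of"): 9,  # ends with a preposition -> dropped
    ("election",): 2,  # too rare -> dropped
}
print(select_significant(counts))  # {('world', 'cup')}
```

The opposite strategy (removing patterns that have more generic sub-patterns) would simply invert the containment test depending on which pattern carries more statistical weight.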
Avoid language containers
Linguistic diversity is an enormous human expressive heritage, but it is also an undeniable communication barrier. In the first experiments of text analysis, each language was an isolated container. This approach was abandoned in the belief that different expressions, or more simply the same word in different languages, should be treated as synonyms within the same vocabulary.
This does not mean that knowing the language of an article is irrelevant, but the language is used as a property or, better, as meta information. This approach has a huge advantage: knowledge about patterns becomes universal. On the other hand, it increases the possibility of collisions.
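The single shared vocabulary, with language demoted to meta information, can be sketched as follows. The data layout is an assumption; the collision example is real (“burro” is butter in Italian and donkey in Spanish):

```python
# Sketch of a language-agnostic vocabulary: the pattern text is the key,
# the language is stored only as a property of each sense. The structure
# is an illustrative assumption, not the platform's actual storage.

VOCABULARY: dict = {}

def add_pattern(text: str, language: str, meaning: str) -> None:
    """Register a sense of a pattern; all languages share one key space."""
    VOCABULARY.setdefault(text, []).append(
        {"lang": language, "meaning": meaning})

# Knowledge becomes universal... but collisions appear:
add_pattern("burro", "it", "butter")
add_pattern("burro", "es", "donkey")
print(VOCABULARY["burro"])
# [{'lang': 'it', 'meaning': 'butter'}, {'lang': 'es', 'meaning': 'donkey'}]
```

Here the language property is what later lets the analysis disambiguate the collision.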
As mentioned, the entities map directly onto the entities of Wikidata. The purpose of Wikidata is to describe objects, people, things, and also abstract or logical entities through a property-value list. This type of data representation is very useful for computational processing, but not all entities defined on Wikidata are significant for the purpose. In the learning phase, the significant entities derive from the significant patterns.
Although Wikidata provides an excellent query system (SPARQL), the information on the entities is saved in a new data structure optimized for the purpose.
Supervision makes it possible to report errors, for example wrong thematic area or entities.
To make supervision meaningful, in a context in which hundreds or thousands of automatic analyses correspond to a single correction, two strategies have been adopted:
- The weight of a correction is greater than the weight of an automatic analysis.
- After a learning session the weights are re-balanced and contained within threshold values.
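The two strategies can be sketched as follows. All the numbers (the correction-to-analysis ratio and the threshold values) are assumptions for illustration:

```python
# Sketch of the supervision strategies: a correction weighs far more
# than an automatic analysis, and after a learning session weights are
# contained within threshold values. All constants are assumptions.

CORRECTION_WEIGHT = 100.0   # assumed: one correction ~ 100 automatic analyses
W_MIN, W_MAX = 0.1, 10.0    # assumed threshold values

def update_weight(weight: float, from_correction: bool) -> float:
    """Reinforce a model weight after one observation."""
    return weight + (CORRECTION_WEIGHT if from_correction else 1.0)

def rebalance(weights: dict) -> dict:
    """After a learning session, clamp every weight within the thresholds."""
    return {k: min(max(w, W_MIN), W_MAX) for k, w in weights.items()}

w = update_weight(1.0, from_correction=True)   # a single human correction
print(w)                                        # 101.0
print(rebalance({"trusted pattern": w, "rare pattern": 0.01}))
# {'trusted pattern': 10.0, 'rare pattern': 0.1}
```

The clamp in `rebalance` is what keeps one heavily corrected pattern from dominating all future analyses.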
Let’s see the steps used to get the desired result.
Creation of patterns
The first processing step is the transformation of the text into patterns.
The operation is simple: a pattern is a variable-length sequence of words.
Let’s take the phrase “A B C”, the resulting patterns are:
“A”, “A B”, “A B C”, “B”, “B C” and “C”.
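The transformation above is just the enumeration of every contiguous word sequence, which can be written in a few lines (here without any cleaning or length cap):

```python
# Transform a text into patterns: every contiguous sequence of words,
# of every length, exactly as in the "A B C" example.

def text_to_patterns(text: str) -> list:
    """Return all variable-length contiguous word sequences of `text`."""
    words = text.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

print(text_to_patterns("A B C"))
# ['A', 'A B', 'A B C', 'B', 'B C', 'C']
```

For real articles, a cap on the pattern length would keep the quadratic growth of this enumeration under control.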
A pattern serves two purposes:
1. Helping to identify the thematic area using statistical information. The contribution of a pattern depends on:
- Size (number of words)
2. Provide a list of possible entities.
Selection of significant patterns
Not all the patterns obtained are significant. In the analysis, the significant patterns are those selected in the learning phase.
Search for entities associated with patterns
After the significant patterns have been identified, the associated entities are retrieved. Not all patterns point to an entity, and a single pattern can point to multiple entities.
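This lookup step can be sketched as a simple index from pattern to candidate entities. The index, the patterns and the Wikidata-style Q-ids below are purely illustrative:

```python
# Sketch of the pattern -> entities lookup, assuming an index built in
# the learning phase. Patterns and Q-ids are illustrative only.

PATTERN_INDEX = {
    "mercury": ["Q1", "Q2"],          # e.g. a planet or a chemical element
    "freddie mercury": ["Q3"],        # a single, unambiguous entity
}

def entities_for(patterns: list) -> dict:
    """Map each significant pattern to its candidate entities.

    Not every pattern points to an entity, and one pattern can point
    to several; the ambiguity is resolved in a later step."""
    return {p: PATTERN_INDEX[p] for p in patterns if p in PATTERN_INDEX}

print(entities_for(["mercury", "rock band"]))
# {'mercury': ['Q1', 'Q2']}  -- "rock band" points to no entity here
```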
Entities have two purposes:
1. Helping to identify the thematic area with:
- Statistical information
- Intrinsic properties
2. Proposing itself as a possible valid entity.
Adding context information
The context information is:
- The language in which an article is written
- The author who published the article
- The writer who wrote the article
We have already mentioned that language plays a marginal role, especially during analysis, while the other context information can play a very significant one.
To understand better, consider a satirical article. Out of context it might be impossible, even for a person, to understand that the article is satirical. The only way to get a laugh instead of a worry is to know that it comes from a satirical newspaper.
Authors and writers as Markers
Authors and writers have statistical characteristics just like patterns and entities. However, some of them are used as thematic area markers.
In the analysis, a marker reduces the possibilities of choosing between the active thematic areas.
The markers act as teachers who, in addition to solving subtle ambiguity problems like the satirical example above, are fundamental in a system where the incoming information is not quantitatively homogeneous.
It is a logical consequence that a system based on statistics will tend to polarize toward the quantitatively most represented thematic areas.
The markers, together with the value properties of the entities and human supervision, are critical to avoiding this drift.
Assignment of thematic macro-area
The information collected is used to deduce the thematic area of an article. Going into the details of how this happens would be too specific; in any case, the “decision tree” should also be the result of a generated model.
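Purely as an illustration of how the collected signals might combine, here is a naive weighted vote. The article states that the real decision logic should be a generated model, so everything here (weights, the way markers restrict the choice) is an assumption:

```python
# Illustrative only: a naive weighted vote combining pattern and entity
# signals, with markers restricting the set of active thematic areas.
# The real "decision tree" should be model-generated; all weights here
# are assumptions.

def assign_area(pattern_votes: dict, entity_votes: dict,
                allowed_areas: set = None) -> str:
    """Score each thematic area; markers restrict the allowed areas."""
    scores = {}
    for votes, weight in ((pattern_votes, 1.0), (entity_votes, 2.0)):
        for area, count in votes.items():
            if allowed_areas is None or area in allowed_areas:
                scores[area] = scores.get(area, 0.0) + weight * count
    return max(scores, key=scores.get) if scores else "Unknown"

# A marker (e.g. a satirical newspaper as author) narrows the choice:
print(assign_area({"Sport": 3, "Politics": 2},
                  {"Politics": 2},
                  allowed_areas={"Politics", "Satire"}))  # Politics
```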
Resolution of ambiguous entities
The final steps are:
- Verify the consistency of the entities with the thematic area
- Select the best entity in case of overlap
These choices are currently based on statistical information. In an earlier version, information about relationships between entities was also used.
This type of analysis was temporarily disabled because it slowed down data processing too much (it will be restored in the future).
The process is still far from absolute precision, if that is even attainable.
In one test, the answers given by the algorithm were compared with those given by a sample of people; the answers coincided in 69% of cases.
Checking the results against what could be considered valid answers, the computer answered correctly in 83% of cases, and people in 93% of cases.