48pm - Natural language processing to search for thematic areas and entities in newspaper articles

Introduction

48pm is a platform for collecting and processing articles from the main online newspapers in the world. As of April 2021 it holds about 30 million archived articles, and around 40,000 new articles are added every day.

The collected articles are processed, and the goals of the processing are to:

  1. Assign a general topic to the article, hereinafter called the thematic area, from a limited set of possibilities such as Sport, Technology, Politics and others.
  2. Identify the subjects covered in the article, hereinafter called entities.
    The name “entity” comes from the Wikidata entities from which they derive.

Articles are accessible through the 48pm Press review application and, for developers, through a set of APIs.

Purpose of the article

The article describes:

What is Natural language processing?

As described on Wikipedia:

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

Also on Wikipedia we find this statement:

Natural language understanding is often considered an AI-complete problem, as language recognition is thought to require extensive knowledge of the world and a great ability to manipulate it. For this reason, the definition of ‘understanding’ is one of the major problems of natural language processing.

The toolbox

Due to this complexity, and the evident need for “extensive knowledge of the world”, the process draws on an extended set of information.

The toolbox consists of:

As said, the process does not aim at “awareness” of the content of an article; more “simply”, it aims to assign a thematic area and extract entities using:

The use of property-value information associated with entities allows for a conceptual model, not just a statistical one. The value “singer” in the “profession” property is significant even without statistical data. However, the choice of which property-values are significant (and what they mean) should itself be the result of machine learning algorithms.
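As a concrete illustration of this conceptual use of property-values, the sketch below maps a few property-value pairs to thematic areas. The pairs, area names and mapping are purely illustrative and are not the actual 48pm configuration.

```python
# Minimal sketch: a conceptual (non-statistical) mapping from property-value
# pairs to thematic areas. All pairs and area names are illustrative.
SIGNIFICANT_PROPERTY_VALUES = {
    ("profession", "singer"): "Music",
    ("profession", "politician"): "Politics",
    ("instance of", "football club"): "Sport",
}

def thematic_hints(entity_claims):
    """Return the thematic areas suggested by an entity's property-value pairs."""
    hints = set()
    for prop, value in entity_claims:
        area = SIGNIFICANT_PROPERTY_VALUES.get((prop, value))
        if area:
            hints.add(area)
    return hints

# "profession = singer" hints at Music, with no statistical evidence required.
print(thematic_hints([("profession", "singer"), ("country", "Italy")]))  # {'Music'}
```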

Learning

Through data mining techniques, useful for example to identify significant patterns, the process learns from the acquired knowledge and continuously updates its predictive model.

analysis > result > learning

Therefore, the result of one analysis changes the models used in the subsequent analysis. This sequence has a problem: if a result is incorrect, the next model will be less accurate, and if an error is repeated without correction, the outcome is “false knowledge”.

There are several strategies to overcome or at least contain this problem, including:

Selection of significant patterns

In the learning phase, the best candidates for becoming “significant patterns” are the most common ones within a homogeneous group of data already analyzed. In addition, several strategies are applied during the selection, for example:

Avoid language containers

Linguistic diversity is an enormous human expressive heritage, but it is also an undeniable communication barrier. In the first text analysis experiments, each language was an isolated container; this approach was abandoned in favor of treating different expressions, or simply the same word in different languages, as synonyms within the same vocabulary.
This does not mean that knowing the language of an article is irrelevant; rather, the language is used as a property or, better, as meta-information. This approach has a huge advantage: knowledge about patterns becomes universal. On the other hand, it increases the possibility of collisions.
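A minimal sketch of this idea is shown below, assuming a shared vocabulary in which surface forms from different languages resolve to the same term id while the language travels only as meta-information; the term ids and word list are invented for the example.

```python
# Minimal sketch of a language-agnostic vocabulary: the same concept in
# different languages maps to one term id, and the article's language is
# kept only as meta-information. Ids and words are illustrative.
VOCABULARY = {
    "election": "T001",
    "elezione": "T001",   # Italian
    "élection": "T001",   # French
    "wahl": "T001",       # German
}

def normalize(words, language):
    """Map surface words to universal term ids; keep the language as metadata."""
    terms = [VOCABULARY.get(w.lower(), w.lower()) for w in words]
    return {"language": language, "terms": terms}

print(normalize(["Elezione", "anticipata"], language="it"))
# {'language': 'it', 'terms': ['T001', 'anticipata']}
```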

Entity selection

As mentioned, the entities map directly onto Wikidata entities. The purpose of Wikidata is to describe objects, people, things, and also abstract or logical entities through a list of property-value pairs. This type of data representation is very useful for computational processing, but not all entities defined on Wikidata are significant for our purpose. In the learning phase, the significant entities derive from the significant patterns.

Although Wikidata provides an excellent query system (SPARQL), the information on the entities is saved in a new data structure optimized for the purpose.
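For illustration, here is one way entity information could be fetched from Wikidata's public SPARQL endpoint before being cached in such a structure. The entity (Q254, Mozart) and property (P106, occupation) are just examples, and 48pm's internal storage format is not shown.

```python
import requests

# Fetch the occupation (P106) values of an example entity (Q254, Mozart)
# from Wikidata's public SPARQL endpoint. The results would then be stored
# in a purpose-built local structure rather than queried on the fly.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?occupationLabel WHERE {
  wd:Q254 wdt:P106 ?occupation .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def fetch_occupations():
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "entity-cache-example/0.1"},
        timeout=30,
    )
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    return [row["occupationLabel"]["value"] for row in rows]

print(fetch_occupations())  # e.g. ['composer', 'pianist', ...]
```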

Human supervision

Supervision makes it possible to report errors, for example a wrong thematic area or wrong entities.
To make supervision meaningful in a context in which hundreds or thousands of automatic analyses correspond to a single correction, two strategies have been adopted.

Data processing

Let’s see the steps used to get the desired result.

Creation of patterns

The first processing step is the transformation of the text into patterns.
The operation is simple: a pattern is a variable-length sequence of words.

Let’s take the phrase “A B C”, the resulting patterns are:
“A”, “A B”, “A B C”, “B”, “B C” and “C”.
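A minimal sketch of this operation in Python (the function name is arbitrary):

```python
def patterns(text):
    """Return every contiguous sequence of words in the text as a pattern."""
    words = text.split()
    result = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            result.append(" ".join(words[start:end]))
    return result

print(patterns("A B C"))
# ['A', 'A B', 'A B C', 'B', 'B C', 'C']
```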

A pattern serves two purposes:

1. Help to identify the thematic area using statistical information. The contribution of a pattern depends on:

2. Provide a list of possible entities.

Selection of significant patterns

Not all the patterns obtained are significant. In the analysis, the significant patterns are those selected in the learning phase.
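A minimal sketch of the selection criterion mentioned earlier (the most common patterns within a homogeneous group of already analyzed articles); the threshold is an arbitrary placeholder.

```python
from collections import Counter

def significant_patterns(pattern_lists, min_count=3):
    """Select patterns appearing in at least min_count articles of a
    homogeneous group (each pattern counted once per article)."""
    counts = Counter()
    for article_patterns in pattern_lists:
        counts.update(set(article_patterns))
    return {p for p, c in counts.items() if c >= min_count}

group = [["world cup", "goal"], ["goal", "penalty"], ["goal", "world cup"], ["goal"]]
print(significant_patterns(group))  # {'goal'}
```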

Search for entities associated with patterns

After the significant patterns have been identified, the associated entities are retrieved. Not all patterns point to an entity, and a single pattern can point to multiple entities.
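A minimal sketch of this lookup, with a purely illustrative pattern-to-entity table (the ids are placeholders, not real Wikidata ids):

```python
# Illustrative mapping: not every pattern points to an entity, and one
# pattern can point to several candidate entities. Ids are placeholders.
PATTERN_TO_ENTITIES = {
    "mercury": ["Q_PLANET", "Q_SINGER"],   # one pattern, several candidates
    "douglas adams": ["Q_WRITER"],         # one pattern, one entity
    "said yesterday": [],                  # significant pattern, no entity
}

def candidate_entities(significant_patterns):
    """Collect the candidate entities pointed to by the significant patterns."""
    candidates = {}
    for pattern in significant_patterns:
        entities = PATTERN_TO_ENTITIES.get(pattern, [])
        if entities:
            candidates[pattern] = entities
    return candidates

print(candidate_entities(["mercury", "said yesterday"]))
# {'mercury': ['Q_PLANET', 'Q_SINGER']}
```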

Entities have two purposes:

1. Help to identify the thematic area with:

2. Be candidates themselves as possible valid entities.

Adding context information

The context information is:

We have already mentioned that language plays a marginal role, especially during analysis, while other context information can play a very significant role.

To understand better, let’s use the example of a satirical article. Out of context it might be impossible, even for a person, to understand that it is a satirical article. The only way to get a laugh instead of worrying is to know it is from a satirical newspaper.

Authors and writers as Markers
Authors and writers have statistical characteristics just like patterns and entities. However, some of them are used as thematic area markers.

In the analysis, a marker narrows the choice among the active thematic areas.

The markers act as teachers who, in addition to solving subtle problems of ambiguity as in the example cited above, are fundamental in a system where the incoming information is not quantitatively homogeneous.

It is easy to foresee that a system based on statistics will tend to polarize toward the quantitatively most represented thematic areas.

The markers, together with the property values of the entities and human supervision, are critical to avoiding this drift.
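As a rough illustration of how a marker might narrow the choice (the marker table and area names are invented for the example):

```python
# Illustrative marker table: an author or source known to write only within
# certain thematic areas restricts the areas that remain active.
MARKERS = {
    "satirical-weekly.example": {"Satire"},
    "jane.doe": {"Sport"},
}

def apply_markers(active_areas, article_markers):
    """Intersect the active thematic areas with the areas allowed by markers."""
    for marker in article_markers:
        allowed = MARKERS.get(marker)
        if allowed:
            active_areas = active_areas & allowed
    return active_areas

print(apply_markers({"Politics", "Sport", "Satire"}, ["jane.doe"]))  # {'Sport'}
```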

Assignment of thematic macro-area

The information collected is used to deduce the thematic area of an article. Going into the details of how this happens would be too specific here; however, the “decision tree” should itself be the result of a generated model.

Resolution of ambiguous entities

The final steps are:

These choices are currently based on statistical information. In an earlier version, information about relationships between entities was also used.
This type of analysis was temporarily disabled because it slowed down data processing too much (it will be restored in the future).
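As a sketch of purely statistical disambiguation, the example below picks, among the candidate entities of an ambiguous pattern, the one most often confirmed in past analyses of the same thematic area. The ids and counts are invented, and this is only one plausible reading of “based on statistical information”.

```python
# Illustrative statistics: how often each candidate entity was confirmed in
# past analyses of a given thematic area. Ids and counts are placeholders.
ENTITY_STATS = {
    ("Q_PLANET", "Science"): 40,
    ("Q_SINGER", "Science"): 2,
    ("Q_SINGER", "Music"): 85,
}

def resolve(candidates, thematic_area):
    """Pick the candidate entity with the strongest statistics for the area."""
    return max(candidates, key=lambda e: ENTITY_STATS.get((e, thematic_area), 0))

print(resolve(["Q_PLANET", "Q_SINGER"], "Music"))    # Q_SINGER
print(resolve(["Q_PLANET", "Q_SINGER"], "Science"))  # Q_PLANET
```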

Results

The process is still far from absolute precision, if that is even attainable.

In a test, the answers given by the algorithm were compared with those given by a sample of people. The answers coincided in 69% of cases.

Checking the results against what could be considered valid answers, the computer answered correctly in 83% of cases, and people in 93% of cases.

48pm algorithm efficiency in recognizing thematic areas
