Extracting and processing LISN data for Cartolabe (LSA projection, HDBSCAN clustering)

In this example we will:

  • extract entities (authors, articles, labs, words) from a collection of scientific articles

  • project those entities in 2 dimensions

  • cluster them

  • find their nearest neighbors.

Download data

We will first download the CSV file containing all articles from HAL (https://hal.archives-ouvertes.fr/) published by authors from LISN (Laboratoire Interdisciplinaire des Sciences du Numérique) between 2000 and 2022.

from download import download  # noqa

# csv_url = "https://zenodo.org/record/7323538/files/lisn_2000_2022.csv"
csv_url = "https://zenodo.org/record/7323386/files/lisn_2000_2022.csv"

download(csv_url, "../datas/lisn_2000_2022.csv", kind='file',
         progressbar=True, replace=False)
Replace is False and data exists, so doing nothing. Use replace=True to re-download the data.

'../datas/lisn_2000_2022.csv'

Load data to dataframe

import pandas as pd  # noqa

df = pd.read_csv('../datas/lisn_2000_2022.csv', index_col=0)

df.head()
structId_i authFullName_s en_abstract_s en_keyword_s en_title_s structAcronym_s producedDateY_i producedDateM_i halId_s docid en_domainAllCodeLabel_fs
0 [2544, 92966, 411575, 441569] Frédéric Blanqui In the last twenty years, several approaches t... Higher-order rewriting,Termination,Confluence Termination and Confluence of Higher-Order Rew... LRI,UP11,CNRS,LISN 2000 7.0 inria-00105556 105556 Logic in Computer Science,Computer Science
1 [2544, 92966, 411575, 441569] Sébastien Tixeuil When a distributed system is subject to transi... Self-stabilization,Distributed Systems,Distrib... Efficient Self-stabilization LRI,UP11,CNRS,LISN 2000 1.0 tel-00124843 124843 Networking and Internet Architecture,Computer ...
2 [1167, 300340, 301492, 564132, 441569, 2544, 9... Michèle Sebag,Céline Rouveirol One of the obstacles to widely using first-ord... Bounded reasoning,First order logic,Inductive ... Resource-bounded relational reasoning: inducti... LMS,X,PSL,CNRS,LRI,UP11,CNRS,LISN 2000 NaN hal-00111312 2263842 Mechanics,Engineering Sciences,physics
3 [994, 15786, 301340, 303171, 441569, 34499, 81... Philippe Balbiani,Jean-François Condotta,Gérar... This paper organizes the topologic forms of th... Temporal reasoning,Constraint handling,Computa... Reasoning about generalized intervals : Horn r... LIPN,UP13,USPC,CNRS,IRIT,UT1,UT2J,UT3,CNRS,Tou... 2000 NaN hal-03300321 3300321 Artificial Intelligence,Computer Science
4 [1315, 25027, 59704, 564132, 300009, 441569, 4... Roberto Di Cosmo,Delia Kesner,Emmanuel Polonovski We refine the simulation technique introduced ... Linear logic,Proof nets,Lambda-calculus,Explic... Proof Nets and Explicit Substitutions LIENS,DI-ENS,ENS-PSL,PSL,Inria,CNRS,CNRS,LRI,U... 2000 NaN hal-00384955 384955 Logic in Computer Science,Computer Science


The dataframe that we just read consists of 4262 articles as rows.

print(df.shape[0])
4262

And their authors, abstract, keywords, title, research labs and domain as columns.

print(*df.columns, sep="\n")
structId_i
authFullName_s
en_abstract_s
en_keyword_s
en_title_s
structAcronym_s
producedDateY_i
producedDateM_i
halId_s
docid
en_domainAllCodeLabel_fs

Creating correspondence matrices for each entity type

From this table of articles, we want to extract matrices that map the correspondence between these articles and the entities we want to use.

Authors

Let's start with the authors, for example. We want to create a matrix where the rows represent the articles and the columns represent the authors. Each cell (n, m) contains a 1 if the nth article was written by the mth author.

from cartodata.loading import load_comma_separated_column  # noqa

authors_mat, authors_scores = load_comma_separated_column(df, 'authFullName_s')

The load_comma_separated_column function takes in a dataframe and the name of a column and returns two objects:

  • a sparse matrix

  • a pandas Series

Each column of the sparse matrix authors_mat corresponds to an author and each row corresponds to an article. We see that there are 7348 distinct authors for 4262 articles.

authors_mat.shape
(4262, 7348)

The series, which we named authors_scores, contains the list of authors extracted from the authFullName_s column, each with a score equal to the number of rows (articles) that author is mapped to in the authors_mat matrix.

authors_scores.head()
Frédéric Blanqui       4
Sébastien Tixeuil     47
Michèle Sebag        137
Céline Rouveirol       2
Philippe Balbiani      2
dtype: int64

If we look at the 2nd column of the matrix, which corresponds to the author Sébastien Tixeuil, we can see that it has 47 non-zero rows, each row indicating which articles he authored.

print(authors_mat[:, 1])
(1, 0)        1
(7, 0)        1
(22, 0)       1
(60, 0)       1
(128, 0)      1
(136, 0)      1
(150, 0)      1
(179, 0)      1
(205, 0)      1
(212, 0)      1
(233, 0)      1
(238, 0)      1
(241, 0)      1
(246, 0)      1
(262, 0)      1
(282, 0)      1
(294, 0)      1
(356, 0)      1
(358, 0)      1
(359, 0)      1
(363, 0)      1
(371, 0)      1
(372, 0)      1
(409, 0)      1
(498, 0)      1
(501, 0)      1
(536, 0)      1
(541, 0)      1
(542, 0)      1
(878, 0)      1
(893, 0)      1
(1600, 0)     1
(1717, 0)     1
(2037, 0)     1
(2075, 0)     1
(2116, 0)     1
(2222, 0)     1
(2373, 0)     1
(2449, 0)     1
(2450, 0)     1
(2611, 0)     1
(2732, 0)     1
(2976, 0)     1
(2986, 0)     1
(3107, 0)     1
(3221, 0)     1
(3791, 0)     1
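We can confirm this count programmatically. The following quick check is not part of the original workflow and assumes authors_mat is a SciPy sparse matrix, which the printed (row, column) coordinates above suggest.

# The number of stored entries in Sébastien Tixeuil's column should match
# his score of 47 in authors_scores.
assert authors_mat[:, 1].nnz == authors_scores.iloc[1] == 47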

Labs

Similarly, we can create matrices for the labs by simply passing the structAcronym_s column to the function.

labs_mat, labs_scores = load_comma_separated_column(df,
                                                    'structAcronym_s',
                                                    filter_acronyms=True)
labs_scores.head()
LRI      4789
UP11     6271
CNRS    10217
LISN     5203
LMS         1
dtype: int64

Checking the number of columns of the sparse matrix labs_mat, we see that there are 1818 distinct labs.

labs_mat.shape[1]
1818

Filtering low score entities

A lot of the authors and labs that we just extracted from the dataframe have a very low score, which means they are linked to only a few articles. To improve the quality of our data, we'll filter out the authors and labs that appear fewer than 4 times.

To do this, we’ll use the filter_min_score function.

from cartodata.operations import filter_min_score  # noqa

authors_before = len(authors_scores)
labs_before = len(labs_scores)

authors_mat, authors_scores = filter_min_score(authors_mat,
                                               authors_scores,
                                               4)
labs_mat, labs_scores = filter_min_score(labs_mat,
                                         labs_scores,
                                         4)

print(f"Removed {authors_before - len(authors_scores)} authors with less "
      f"than 4 articles from a total of {authors_before} authors.")
print(f"Working with {len(authors_scores)} authors.\n")

print(f"Removed {labs_before - len(labs_scores)} labs with less than "
      f"4 articles from a total of {labs_before}.")
print(f"Working with {len(labs_scores)} labs.")
Removed 6654 authors with less than 4 articles from a total of 7348 authors.
Working with 694 authors.

Removed 1255 labs with less than 4 articles from a total of 1818.
Working with 563 labs.
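Conceptually, filter_min_score keeps only the columns whose score reaches the threshold, while keeping the matrix and the scores series aligned. The sketch below illustrates the idea; it is an assumption about the behaviour, not cartodata's actual implementation.

import numpy as np  # noqa


def filter_min_score_sketch(mat, scores, min_score):
    # indices of the entities (columns) whose score reaches the threshold
    cols = np.flatnonzero(scores.to_numpy() >= min_score)
    # keep the corresponding columns and score entries, preserving alignment
    return mat[:, cols], scores.iloc[cols]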

Words

For the words, it's a bit trickier because we want to extract n-grams (groups of n terms) instead of just comma-separated values. We'll call the load_text_column function, which uses scikit-learn's CountVectorizer to create a vocabulary and map the tokens.

from cartodata.loading import load_text_column  # noqa
from sklearn.feature_extraction import text as sktxt  # noqa

with open('../datas/stopwords.txt', 'r') as stop_file:
    stopwords = sktxt.ENGLISH_STOP_WORDS.union(
        set(stop_file.read().splitlines()))

df['text'] = df['en_abstract_s'] + '.' \
    + df['en_keyword_s'].astype(str) + '.' \
    + df['en_title_s'].astype(str) + '.' \
    + df['en_domainAllCodeLabel_fs'].astype(str)

words_mat, words_scores = load_text_column(df['text'],
                                           4,
                                           10,
                                           0.05,
                                           stopwords=stopwords)

Here words_scores contains the list of all n-grams extracted from the documents, together with their scores,

words_scores.head()
abilities     21
ability      164
absence       53
absolute      19
abstract     174
dtype: int64

and the words_mat matrix counts the occurrences of each of the 4645 n-grams across all the articles.

words_mat.shape
(4262, 4645)
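As mentioned above, this step is based on scikit-learn's CountVectorizer. The sketch below shows roughly what such a vectorization looks like; mapping load_text_column's positional arguments to the maximum n-gram length, min_df and max_df is an assumption, not the documented signature.

from sklearn.feature_extraction.text import CountVectorizer  # noqa

# Build an n-gram vocabulary and count occurrences per article
# (assumed to be close to what load_text_column does internally).
vectorizer = CountVectorizer(ngram_range=(1, 4),  # n-gram length range (assumed)
                             min_df=10,           # drop very rare terms
                             max_df=0.05,         # drop very frequent terms
                             stop_words=list(stopwords))
counts = vectorizer.fit_transform(df['text'])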

To get a better representation of the importance of each term, we’ll also apply a TF-IDF (term-frequency times inverse document-frequency) normalization on the matrix.

The normalize_tfidf function simply calls scikit-learn's TfidfTransformer class.

from cartodata.operations import normalize_tfidf  # noqa

words_mat = normalize_tfidf(words_mat)

""
words_mat.shape
(4262, 4645)
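For reference, since normalize_tfidf is described as a thin wrapper around TfidfTransformer, the equivalent call on a raw count matrix (for example the counts matrix from the sketch above) would look like the following, assuming default TfidfTransformer parameters.

from sklearn.feature_extraction.text import TfidfTransformer  # noqa

# Re-weight raw counts: terms frequent in an article but rare across the
# corpus get a higher weight. Illustration only; words_mat above has already
# been normalized by normalize_tfidf.
counts_tfidf = TfidfTransformer().fit_transform(counts)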

Articles

Finally, we need to create a matrix that simply maps each article to itself.

from cartodata.loading import load_identity_column  # noqa

articles_mat, articles_scores = load_identity_column(df, 'en_title_s')
articles_scores.head()
Termination and Confluence of Higher-Order Rewrite Systems                                    1.0
Efficient Self-stabilization                                                                  1.0
Resource-bounded relational reasoning: induction and deduction through stochastic matching    1.0
Reasoning about generalized intervals : Horn representability and tractability                1.0
Proof Nets and Explicit Substitutions                                                         1.0
dtype: float64
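Conceptually, this correspondence matrix is just a sparse identity matrix: each article maps only to itself. A quick illustration, assuming the same sparse representation as the other matrices:

from scipy import sparse  # noqa

# Each of the 4262 articles corresponds only to itself.
identity_sketch = sparse.identity(df.shape[0], format='csr')
print(identity_sketch.shape)  # (4262, 4262)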

Dimension reduction

One way to look at the matrices that we created is as coordinates in the space of all articles. We want to reduce the dimensionality of this space to make it easier to work with and to visualize.

LSA projection

We’ll start by using the LSA (Latent Semantic Analysis) technique to identify keywords in our data and thus reduce the number of rows in our matrices. The lsa_projection method takes three arguments:

  • the number of dimensions you want to keep

  • the matrix of documents/words frequency

  • a list of matrices to project

It returns a list of the same length containing the matrices projected in the latent space.

We also apply an l2 normalization to each feature of the projected matrices.

from cartodata.projection import lsa_projection  # noqa
from cartodata.operations import normalize_l2  # noqa

lsa_matrices = lsa_projection(80,
                              words_mat,
                              [articles_mat, authors_mat, words_mat, labs_mat])
lsa_matrices = list(map(normalize_l2, lsa_matrices))

We’ve reduced the number of rows in each of articles_mat, authors_mat, words_mat and labs_mat to just 80.

print(f"articles_mat: {lsa_matrices[0].shape}")
print(f"authors_mat: {lsa_matrices[1].shape}")
print(f"words_mat: {lsa_matrices[2].shape}")
print(f"labs_mat: {lsa_matrices[3].shape}")
articles_mat: (80, 4262)
authors_mat: (80, 694)
words_mat: (80, 4645)
labs_mat: (80, 563)
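To build intuition, the LSA step can be sketched with scikit-learn's TruncatedSVD: fit the SVD on the articles × words matrix, compute a latent vector per article, and express every entity as a combination of the latent vectors of its articles. This is a conceptual sketch under those assumptions, not cartodata's actual implementation.

from sklearn.decomposition import TruncatedSVD  # noqa
from sklearn.preprocessing import normalize  # noqa

# Fit an 80-component SVD on the articles x words matrix.
svd = TruncatedSVD(n_components=80)
articles_latent = svd.fit_transform(words_mat)      # (4262, 80)

# Project an articles x entities matrix into the latent space: each entity
# (column) becomes the sum of the latent vectors of its articles, then the
# columns are l2-normalized.
authors_lsa_sketch = normalize(
    (authors_mat.T @ articles_latent).T, norm='l2', axis=0)
print(authors_lsa_sketch.shape)                     # (80, 694)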

This makes it easier to work with them for clustering or nearest neighbors tasks, but we also want to project them on a 2D space to be able to map them.

UMAP projection

UMAP (Uniform Manifold Approximation and Projection) is a dimension reduction technique that can be used for visualisation, similarly to t-SNE.

We use this algorithm to project our matrices in 2 dimensions.

from cartodata.projection import umap_projection  # noqa

umap_matrices = umap_projection(lsa_matrices)
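For reference, a joint projection with umap-learn could look like the sketch below: treat every entity (article, author, word, lab) as a sample in the 80-dimensional LSA space, embed them all in the same 2D map, then split the embedding back per entity type. This is an assumption about how umap_projection might work, not its actual implementation, and it assumes the LSA matrices are dense NumPy arrays.

import numpy as np  # noqa
import umap  # noqa

# Stack all entities as samples in the 80-dimensional latent space.
all_points = np.hstack(lsa_matrices).T          # (n_total_entities, 80)
# Embed everything in the same 2D map.
embedding_2d = umap.UMAP(n_components=2).fit_transform(all_points)
print(embedding_2d.shape)                       # (n_total_entities, 2)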

Now that we have 2D coordinates for our points, we can try to plot them to get a feel of the data’s shape.

import matplotlib.pyplot as plt  # noqa
import numpy as np  # noqa
import seaborn as sns  # noqa
# %matplotlib inline

sns.set(style='white', rc={'figure.figsize': (12, 8)})

labels = ('article', "auth", "words", "labs")
colors = ['g', 'r', 'b', 'y']


def plot(matrices):
    plt.close('all')
    fig, ax = plt.subplots()

    axes = []

    for i, m in enumerate(matrices):
        axes.append(ax.scatter(m[0, :], m[1, :],
                               color=colors[i], s=10, alpha=0.25,
                               label=labels[i]))

    leg = ax.legend(tuple(axes), labels, fancybox=True, shadow=True)

    return fig, ax


fig, ax = plot(umap_matrices)
[Figure: 2D UMAP scatter plot of articles, authors, words and labs]

On the plot above, articles are shown in green, authors in red, words in blue and labs in yellow. Because we don’t have labels for the points, it doesn’t make much sense as is. But we can see that the data shows some clusters which we could try to identify.

Clustering

In order to identify clusters, we use the HDBSCAN clustering technique on the articles. We’ll also try to label these clusters by selecting the most frequent words that appear in each cluster’s articles.

from cartodata.clustering import create_kmeans_clusters, create_hdbscan_clusters  # noqa

cluster_labels = []
c_lsa, c_umap, c_scores, c_knn, _, _, _ = create_hdbscan_clusters(
    8,                  # number of clusters to create
    umap_matrices[0],   # the 2D matrix of articles
    umap_matrices[2],   # the 2D matrix of words
    words_mat,          # the articles to words matrix
    words_scores,       # the word scores
    cluster_labels,     # a list of initial cluster labels
    lsa_matrices[2])    # the LSA space matrix of words
print(f"Number of clusters: {len(c_scores)}")
c_scores

""
import matplotlib.patheffects as pe # noqa

fig, ax = plot(umap_matrices)

for i in range(len(c_scores)):
    ax.text(c_umap[0, i], c_umap[1, i], c_scores.index[i],
            size=10, color='black',
            path_effects=[pe.withStroke(linewidth=4, foreground="white")])
    # ax.annotate(c_scores.index[i], (c_umap[0, i], c_umap[1, i]),
    #            color='red')
[Figure: 2D UMAP scatter plot with the 8 cluster labels overlaid]
Nothing in cache, initial Fitting with min_cluster_size=15 Found 98 clusters in 0.2594478300015908s
Max Fitting with min_cluster_size=30 Found 58 clusters in 0.10882276900156285s
Max Fitting with min_cluster_size=60 Found 15 clusters in 0.1059613020006509s
Max Fitting with min_cluster_size=120 Found 9 clusters in 0.09796786900187726s
Max Fitting with min_cluster_size=240 Found 3 clusters in 0.08980225200139103s
Midpoint Fitting with min_cluster_size=180 Found 6 clusters in 0.09148516000277596s
Midpoint Fitting with min_cluster_size=150 Found 7 clusters in 0.0915639389968419s
Midpoint Fitting with min_cluster_size=135 Found 8 clusters in 0.09265108899853658s
No need Re-Fitting with min_cluster_size=135
Clusters cached: [3, 6, 7, 8, 9, 15, 58, 98]
Number of clusters: 8
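The log above shows how create_hdbscan_clusters searches for a min_cluster_size that yields the requested number of clusters (here 135 gives 8 clusters). For comparison, a plain hdbscan call on the 2D article coordinates with that final value would look like the sketch below; it only assigns cluster ids and does not produce the word-based cluster labels.

import hdbscan  # noqa

# Cluster the articles directly in the 2D UMAP space.
clusterer = hdbscan.HDBSCAN(min_cluster_size=135)
article_cluster_ids = clusterer.fit_predict(umap_matrices[0].T)  # -1 means noise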

The 8 clusters that we created give us a general idea of what the big groups of data contain. But we'll probably want a finer level of detail when we zoom in and focus on smaller areas. So we'll also create a second, larger set of clusters. To do this, we simply increase the number of clusters we request.

mc_lsa, mc_umap, mc_scores, mc_knn, _, _, _ = create_hdbscan_clusters(
    32,
    umap_matrices[0],
    umap_matrices[2],
    words_mat,
    words_scores,
    cluster_labels,
    lsa_matrices[2])
print(f"Number of clusters: {len(mc_scores)}")
mc_scores
Nothing in cache, initial Fitting with min_cluster_size=15 Found 98 clusters in 0.112819390000368s
Max Fitting with min_cluster_size=30 Found 58 clusters in 0.11198548099855543s
Max Fitting with min_cluster_size=60 Found 15 clusters in 0.10821385600138456s
Midpoint Fitting with min_cluster_size=45 Found 32 clusters in 0.10785280299751321s
No need Re-Fitting with min_cluster_size=45
Clusters cached: [15, 32, 58, 98]
Number of clusters: 32

agent                                                  45
benchmarking, evolution strategies                     83
verification, semantics                               285
monte carlo search, bandit                            154
dynamics, mechanics                                   138
boolean networks, formal languages automata theory     48
floating point arithmetic, floating                    47
compiler, linear systems                              150
vertices, bounds                                      178
fault tolerance, stabilizing                           88
operations, mixed integer linear programming           60
belief propagation, causal                             62
covariance matrix adaptation, multi objective          82
secondary structure, sequence                         122
combinatorics, automata                               131
biology, protein                                      148
population, gradient                                   97
reinforcement learning, scheduling                     65
resource, adaptive                                     55
internet architecture, service                        173
french, natural language                              102
visual analytics                                       46
image, decision making                                 91
query, documents                                      124
papers, community                                      50
display, spatial                                      236
social sciences, social networks                      118
ontology, mining                                      108
interfaces, toolkit                                    63
movement, creative                                     45
information visualization, smartwatch                  57
abstraction, matrices                                  63
dtype: int64

Nearest neighbors

Another useful way to assess the quality of our data is to look at each point's nearest neighbors. If our data processing was done correctly, we expect related articles, labs, words and authors to be located close to each other.

Finding nearest neighbors is a common task, with various algorithms aiming to solve it. The get_neighbors method uses one of these algorithms to find the nearest points of each type. It takes an optional weight parameter that tweaks the distance calculation so that points with a higher score are favoured, even if they are slightly farther away, instead of simply selecting the closest neighbors.

Because we want to find the neighbors of each type (articles, authors, words, labs) for all of the entities, we call the get_neighbors method in a loop and store its results in an array.

from cartodata.neighbors import get_neighbors  # noqa

scores = [articles_scores, authors_scores, words_scores, labs_scores]
weights = [0, 0.5, 0.5, 0]
all_neighbors = []

# make sure that ../dumps directory exists
dump_dir = "../dumps/lisn"
natures = ['articles',
           'authors',
           'words',
           'labs',
           'hl_clusters',
           'ml_clusters']

for idx in range(len(lsa_matrices)):
    all_neighbors.append(get_neighbors(lsa_matrices[idx],
                                       scores[idx],
                                       lsa_matrices,
                                       weights[idx],
                                       dump_dir=dump_dir,
                                       neighbors_nature=natures[idx],
                                       natures=natures[:4]))
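As a simple cross-check of the idea (not cartodata's implementation, and without the score-based weighting), scikit-learn's NearestNeighbors can be used to look up, for example, the authors closest to each article in the LSA space.

from sklearn.neighbors import NearestNeighbors  # noqa

# Index the authors: each column of the LSA matrix becomes a sample.
nn = NearestNeighbors(n_neighbors=5, metric='cosine')
nn.fit(lsa_matrices[1].T)                       # (694 authors, 80 dims)
# Query with the articles to get each article's 5 closest authors.
distances, author_idx = nn.kneighbors(lsa_matrices[0].T)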

Export files using the exporter

First we should save the matrices.

from cartodata.operations import (
    dump_matrices, dump_scores, load_matrices_from_dumps, dump_objects
)  # noqa

matrices = [articles_mat, authors_mat, words_mat, labs_mat]
scores = [articles_scores, authors_scores, words_scores, labs_scores]

# add the clusters to list of 2d matrices and scores
umap_extended = list(umap_matrices)
umap_extended.extend([c_umap, mc_umap])
scores_extended = list(scores)
scores_extended.extend([c_scores, mc_scores])

dump_matrices(natures[:4], lsa_matrices, 'lsa', dump_dir)
dump_matrices(natures[:4], matrices, 'mat', dump_dir)

dump_matrices(natures, umap_extended, 'umap', dump_dir)
dump_scores(natures, scores_extended, dump_dir)

We can now export the data. We will first create the exporter.

from cartodata.exporting import Exporter  # noqa

exporter = Exporter(dump_dir, natures, natures[:4])

exporter.add_reference('articles', 'labs')
exporter.add_reference('articles', 'authors')
exporter.add_reference('authors', 'labs')

Export to json file

We can export the data to a json file.

export_json_file = 'lisn_workflow_lsa.json'

exporter.export_to_json(export_json_file)

This creates the lisn_workflow_lsa.json file, which contains a list of points ready to be imported into Cartolabe. Have a look at it to check that it contains everything.

import json  # noqa

with open(dump_dir + "/" + export_json_file, 'r') as f:
    data = json.load(f)

data[1]['position']
[7.8399810791015625, 7.9352192878723145]

Export to feather file

We can export the data to a feather file as well.

export_feather_file = 'lisn_workflow_lsa.feather'

exporter.export_to_feather(export_feather_file)

df = pd.read_feather(dump_dir + "/" + export_feather_file)

df['position'][1]
array([7.83998108, 7.93521929])

Total running time of the script: (0 minutes 52.309 seconds)
