Extracting and processing LISN data for Cartolabe with Aligned UMAP (Doc2vec projection)


In this example we will:

  • extract entities (authors, articles, labs, words) from a collection of scientific articles

  • project those entities in 2 dimensions using aligned UMAP

  • cluster them

  • find their nearest neighbors.

This example uses 6 datasets containing articles from HAL (https://hal.archives-ouvertes.fr/) published by authors from LISN (Laboratoire Interdisciplinaire des Sciences du Numérique). Each dataset is cumulative: the first covers 2010-2012 and each subsequent one adds two more years, up to 2010-2022 (inclusive).

Aligned UMAP lets us align the UMAP embeddings of consecutive datasets, so that entities keep comparable positions from one map to the next.

# %matplotlib widget

Download data

We will start by downloading 6 datasets from HAL.

from cartodata.scraping import scrape_hal, process_domain_column   # noqa
import os   # noqa


def fetch_data(from_year, to_year, struct_ids, struct='hal'):
    """
    Fetch scientific publications of struct from HAL.
    """

    filename = f"../datas/{struct.lower()}_{from_year}_{to_year - 1}.csv"

    if os.path.exists(filename):
        return

    filters = {}

    if struct:
        filters['structId_i'] = struct_ids
    years = range(from_year, to_year)
    df = scrape_hal(struct, filters, years, cool_down=2)
    process_domain_column(df)

    # Save the dataframe into a csv file
    df.to_csv(filename)
    return df


# Fetch LISN data from year 2010 to 2022 inclusive
for i in range(2, 14, 2):
    fetch_data(2010, 2011+i, "(1050003 2544)", 'lisn')
/builds/2mk6rsew/0/hgozukan/cartolabe-data/cartodata/scraping.py:116: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, year_df])

Load data

Now we will load each downloaded dataset into a dataframe.

import pandas as pd   # noqa


def load_data(from_year, total, struct):
    dataframes = []

    for i in range(total):
        file_name = f"../datas/{struct.lower()}_{from_year}_{from_year + ( 2 + 2 * i)}.csv"
        print(file_name)
        df = pd.read_csv(file_name, index_col=0)

        dataframes.append(df)

    return dataframes


dataframes = load_data(2010, 6, "lisn")
dataframes[0].head()
../datas/lisn_2010_2012.csv
../datas/lisn_2010_2014.csv
../datas/lisn_2010_2016.csv
../datas/lisn_2010_2018.csv
../datas/lisn_2010_2020.csv
../datas/lisn_2010_2022.csv
structId_i authFullName_s en_abstract_s en_keyword_s en_title_s structAcronym_s producedDateY_i producedDateM_i halId_s docid en_domainAllCodeLabel_fs
0 [2544, 92966, 411575, 441569, 56057, 2544, 929... Olivier Teytaud This document is devoted to artificial intelli... Evolutionary optimization,Parallelism Artificial Intelligence and Optimization with ... LRI,UP11,CNRS,TAO,LRI,UP11,CNRS,Inria,LISN 2010 11.0 tel-01078099 1078099 Optimization and Control,Mathematics
1 [2544, 92966, 411575, 441569, 56057, 2544, 929... J.-B Hoock,O Teytaud When looking for relevant mutations of a learn... Mots-clés Reinforcement learning,Monte-Carlo ... Bandit-Based Genetic Programming with Applicat... LRI,UP11,CNRS,TAO,LRI,UP11,CNRS,Inria,LISN 2010 5.0 hal-01098456 1098456 Computer Science and Game Theory,Computer Science
2 [16574, 179741, 300351, 56050, 2544, 92966, 41... Pierre Cubaud,Alexandre Topol,Emmanuel Pietriga This document provides feedback about the curr... NaN ALMA Graphical User Interfaces CEDRIC,ENSIIE,CNAM,IN-SITU,LRI,UP11,CNRS,Inria... 2010 3.0 hal-01126094 1126094 Computer Interaction,Computer Science
3 [2544, 92966, 411575, 441569, 3210, 301243, 30... Dominique Gouyou-Beauchamps,Cyril Nicaud Generalizing an idea used by Alonso to generat... Random generation,Binomial distribution Random Generation Using Binomial Approximations LRI,UP11,CNRS,LIGM,UPEM,ENPC,BEZOUT,CNRS,CNRS,... 2010 NaN hal-01185570 1185570 Computational Geometry,Computer Science
4 [391379, 233, 93591, 441569, 103784, 2071, 300... Olivier Bodini,Yann Ponty We address the uniform random generation of wo... Boltzmann sampling,Context-free languages,Rand... Multi-dimensional Boltzmann Sampling of Languages APR,LIP6,UPMC,CNRS,AMIB,LIX,X,IP Paris,CNRS,LR... 2010 6.0 hal-00450763 1185592 Computational Geometry,Computer Science


We can list the total number of articles in each cumulative dataset, i.e. from 2010 up to a given year.

for i in range(6):
    print(f"{2010 + (i*2) + 2} => {dataframes[i].shape[0]}")
2012 => 765
2014 => 1361
2016 => 1983
2018 => 2514
2020 => 2945
2022 => 2987

Creating correspondence matrices for each entity type

From this table of articles, we want to extract matrices that map the correspondence between these articles and the entities we want to use.

from cartodata.loading import load_comma_separated_column  # noqa
from cartodata.loading import load_identity_column  # noqa


def create_correspondance_matrices(func, dataframes, column_name):
    matrices = []
    scores = []

    for i in range(len(dataframes)):
        mat, score = func(dataframes[i], column_name)
        matrices.append(mat)
        scores.append(score)

    return matrices, scores

The load_comma_separated_column function takes in a dataframe and the name of a column and returns two objects:

  • a sparse matrix

  • a pandas Series

Each column of the sparse matrix corresponds to an entity found in column_name and each row corresponds to an article.

Authors

Let’s start with the authors for example. We want to create a matrix where the rows represent the articles and the columns represent the authors. Each cell (n, m) will have a 1 in it if the nth article was written by the mth author.
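
To make the structure concrete, here is a small illustrative sketch (not part of the original workflow) on a made-up toy dataframe; it assumes load_comma_separated_column only needs a dataframe and a column name, as in the calls below, and returns a scipy sparse matrix.

# Hypothetical toy dataframe with a comma-separated author column.
toy = pd.DataFrame({"authFullName_s": ["A. Ada,B. Babbage", "B. Babbage", "C. Curie,A. Ada"]})
toy_mat, toy_score = load_comma_separated_column(toy, "authFullName_s")
# Rows are articles, columns are authors; a 1 marks an article/author pair.
print(toy_mat.todense())
# The score series counts how many articles each author appears in.
print(toy_score)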

As we have multiple dataframes, the results will be lists containing one matrix and one score series per dataframe.

matrices_auth, scores_auth = create_correspondance_matrices(
    load_comma_separated_column, dataframes, 'authFullName_s')

We can see the number of (articles, authors) for each year:

for i in range(len(matrices_auth)):
    print(f"{2010 + (i*2) + 2} => {matrices_auth[i].shape}")
2012 => (765, 1666)
2014 => (1361, 2338)
2016 => (1983, 3135)
2018 => (2514, 3882)
2020 => (2945, 4554)
2022 => (2987, 4652)

Each series in scores_auth contains the authors extracted from the authFullName_s column, with a score equal to the number of rows (articles) that the author is mapped to in the corresponding matrix of matrices_auth.
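
As a quick sanity check (not in the original workflow), we can sort the last score series to see the most prolific authors over the full 2010-2022 period:

# Authors with the highest number of articles in the cumulative 2010-2022 dataset.
scores_auth[-1].sort_values(ascending=False).head(10)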

Let’s have a look at the number of articles per author for each year:

df_scores_auth = pd.concat(scores_auth, axis=1).reset_index().rename(
    columns={i: f"{2010 + (i*2) + 2}" for i in range(6)}
)

df_scores_auth.head(20)

""
df_scores_auth.describe()
2012 2014 2016 2018 2020 2022
count 1666.000000 2338.000000 3135.000000 3882.000000 4554.000000 4652.00000
mean 2.206483 2.465783 2.590112 2.685987 2.705314 2.69239
std 3.080791 4.044985 4.862247 5.312939 5.627630 5.61770
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
25% 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
50% 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000
75% 2.000000 3.000000 2.000000 2.000000 2.000000 2.00000
max 41.000000 60.000000 77.000000 94.000000 113.000000 113.00000


Labs

Similarly, we can create matrices for the labs by simply passing the structAcronym_s column to the function.

matrices_labs, scores_labs = create_correspondance_matrices(
    load_comma_separated_column, dataframes, 'structAcronym_s')

We can see the number of (articles, labs) for each year:

for i in range(len(matrices_labs)):
    print(f"{2010 + (i*2) + 2} => {matrices_labs[i].shape}")
2012 => (765, 602)
2014 => (1361, 838)
2016 => (1983, 1041)
2018 => (2514, 1254)
2020 => (2945, 1426)
2022 => (2987, 1447)

Let’s have a look at the number of articles per lab for each year:

df_scores_labs = pd.concat(scores_labs, axis=1).reset_index().rename(
    columns={i: f"{2010 + (i*2) + 2}" for i in range(10)}
)

df_scores_labs.head(20)

""
df_scores_labs.describe()
2012 2014 2016 2018 2020 2022
count 602.000000 838.00000 1041.000000 1254.000000 1426.000000 1447.000000
mean 15.131229 19.21599 22.337176 23.566986 24.131837 24.230822
std 106.106441 159.00098 201.821286 230.275677 248.591470 250.370755
min 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000
25% 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000
50% 2.000000 2.00000 2.000000 2.000000 2.000000 2.000000
75% 4.000000 5.00000 6.000000 6.000000 6.000000 7.000000
max 1712.000000 3001.00000 4218.000000 5300.000000 6183.000000 6297.000000


Filtering low score entities

A lot of the authors and labs that we just extracted from the dataframes have a very low score, which means they're only linked to one or two articles. To improve the quality of our data, we'll filter the entities by removing those that appear fewer than a certain number of times.

from cartodata.operations import filter_min_score  # noqa


def filter_low_score(matrices, scores, count, entity):

    filtered_matrices = []
    filtered_scores = []

    for i, (matrix, score) in enumerate(zip(matrices, scores)):
        before = len(score)
        filtered_mat, filtered_score = filter_min_score(matrix, score, count)

        filtered_matrices.append(filtered_mat)
        filtered_scores.append(filtered_score)

        print(f'Removed {before - len(filtered_score)} {entity}s with less than '
              f'{count} articles from a total of {before} {entity}s.')

        print(f'Working with {len(filtered_score)} {entity}s '
              f'for year {2010 + (i*2) + 2}.\n')

    return filtered_matrices, filtered_scores

Authors

We will remove the authors with less than 5 publications.

auth_pub_count = 5

filtered_auth_matrices, filtered_auth_scores = filter_low_score(
    matrices_auth, scores_auth, auth_pub_count, "author")
Removed 1578 authors with less than 5 articles from a total of 1666 authors.
Working with 88 authors for year 2012.

Removed 2179 authors with less than 5 articles from a total of 2338 authors.
Working with 159 authors for year 2014.

Removed 2887 authors with less than 5 articles from a total of 3135 authors.
Working with 248 authors for year 2016.

Removed 3566 authors with less than 5 articles from a total of 3882 authors.
Working with 316 authors for year 2018.

Removed 4168 authors with less than 5 articles from a total of 4554 authors.
Working with 386 authors for year 2020.

Removed 4258 authors with less than 5 articles from a total of 4652 authors.
Working with 394 authors for year 2022.

Labs

We will remove the labs with less than 20 publications.

lab_pub_count = 20

filtered_lab_matrices, filtered_lab_scores = filter_low_score(
    matrices_labs, scores_labs, lab_pub_count, "lab")
Removed 566 labs with less than 20 articles from a total of 602 labs.
Working with 36 labs for year 2012.

Removed 762 labs with less than 20 articles from a total of 838 labs.
Working with 76 labs for year 2014.

Removed 938 labs with less than 20 articles from a total of 1041 labs.
Working with 103 labs for year 2016.

Removed 1131 labs with less than 20 articles from a total of 1254 labs.
Working with 123 labs for year 2018.

Removed 1286 labs with less than 20 articles from a total of 1426 labs.
Working with 140 labs for year 2020.

Removed 1305 labs with less than 20 articles from a total of 1447 labs.
Working with 142 labs for year 2022.

Words

For the words, it's a bit trickier because we want to extract n-grams (groups of n terms) instead of just comma-separated values. We'll call the load_text_column function, which uses scikit-learn's CountVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to create a vocabulary and map the tokens.
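
As a hedged side illustration of what n-gram extraction means (the actual extraction below is delegated to load_text_column, and the parameters here are purely illustrative), this is scikit-learn's CountVectorizer applied directly to two toy sentences:

from sklearn.feature_extraction.text import CountVectorizer   # noqa

# ngram_range=(1, 2) keeps both single words and two-word phrases.
demo_vectorizer = CountVectorizer(ngram_range=(1, 2))
demo_vectorizer.fit(["aligned umap projection", "umap projection of articles"])
print(demo_vectorizer.get_feature_names_out())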

from cartodata.loading import load_text_column  # noqa
from sklearn.feature_extraction import text as sktxt  # noqa
from cartodata.operations import normalize_tfidf  # noqa


def process_words(dataframes, stopwords):

    words_matrices = []
    words_scores = []

    for df in dataframes:
        df['text'] = df['en_abstract_s'] + ' ' \
            + df['en_keyword_s'].astype(str) + ' ' \
            + df['en_title_s'].astype(str) + ' ' \
            + df['en_domainAllCodeLabel_fs'].astype(str)

        words_mat, words_score = load_text_column(df['text'],
                                                  4,
                                                  10,
                                                  0.05,
                                                  stopwords=stopwords)
        # apply term-frequency times inverse document-frequency normalization
        words_mat = normalize_tfidf(words_mat)

        words_matrices.append(words_mat)
        words_scores.append(words_score)

    return words_matrices, words_scores


with open('../datas/stopwords.txt', 'r') as stop_file:
    stopwords = sktxt.ENGLISH_STOP_WORDS.union(
        set(stop_file.read().splitlines()))

    words_matrices, words_scores = process_words(dataframes, stopwords)

We can list the number of (articles, terms) for each year:

for i in range(len(words_matrices)):
    print(f"{2010 + (i*2) + 2} => {words_matrices[i].shape}")
2012 => (765, 1015)
2014 => (1361, 1801)
2016 => (1983, 2483)
2018 => (2514, 3045)
2020 => (2945, 3512)
2022 => (2987, 3560)

Articles

Finally, we need to create a matrix that simply maps each article to itself.

matrices_article, scores_article = create_correspondance_matrices(
    load_identity_column, dataframes, 'en_title_s')

""
for i in range(len(matrices_article)):
    print(f"{2010 + (i*2) + 2} => {matrices_article[i].shape}")
2012 => (765, 765)
2014 => (1361, 1361)
2016 => (1983, 1983)
2018 => (2514, 2514)
2020 => (2945, 2945)
2022 => (2987, 2987)

Dimension reduction

One way to think of the matrices we created is as coordinates in the space of all articles. What we want to do now is reduce the dimensionality of this space to make it easier to work with and to visualize.

Doc2vec projection

This example uses the Doc2vec technique to learn a latent representation of our data and thus reduce the number of rows in our matrices. The doc2vec_projection method takes at least four arguments:

  • the number of dimensions you want to keep

  • the matrix identifier of the documents/words frequency

  • a list of matrices to project

  • the list of texts to index

It is possible to pass additional doc2vec parameters, using the same syntax as the gensim doc2vec function. The method returns a list of matrices projected in the latent space.

We also apply an l2 normalization to each feature of the projected matrices.
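
As a reminder of what this normalization does, here is a minimal numpy sketch, under the assumption that normalize_l2 rescales vectors to unit Euclidean length:

import numpy as np   # noqa


def l2_normalize(v):
    # Divide a vector by its Euclidean norm; leave zero vectors untouched.
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm


l2_normalize(np.array([3.0, 4.0]))   # -> array([0.6, 0.8])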

from cartodata.projection import doc2vec_projection  # noqa
from cartodata.operations import normalize_l2  # noqa

dimensions = [20, 20, 30, 30, 40, 40]


def run_doc2vec_projection(articles_matrices, auth_matrices, words_matrices,
                           lab_matrices, dimensions, dataframes, column_name):
    doc2vec_matrices = []

    for articles_mat, auth_mat, words_mat, labs_mat, dim, df in zip(
        articles_matrices, auth_matrices, words_matrices, lab_matrices, dimensions, dataframes):

        doc2vec_mat = doc2vec_projection(dim,
                                         3,
                                         [articles_mat, auth_mat, words_mat, labs_mat],
                                         df[column_name])

        doc2vec_mat = list(map(normalize_l2, doc2vec_mat))

        doc2vec_matrices.append(doc2vec_mat)

    return doc2vec_matrices

doc2vec_matrices = run_doc2vec_projection(matrices_article, filtered_auth_matrices,
                                          words_matrices, filtered_lab_matrices,
                                          dimensions, dataframes, 'text')

We can list the shape (dimensions, number_of_entities) of each projected matrix per year:

for j, type_ in enumerate(["ARTICLES", "AUTHORS", "WORDS", "LABS"]):
    print(type_)
    for i in range(len(doc2vec_matrices)):
        print(f"201{i} => {doc2vec_matrices[i][j].shape}")
ARTICLES
2012 => (20, 765)
2014 => (20, 1361)
2016 => (30, 1983)
2018 => (30, 2514)
2020 => (40, 2945)
2022 => (40, 2987)
AUTHORS
2012 => (20, 88)
2014 => (20, 159)
2016 => (30, 248)
2018 => (30, 316)
2020 => (40, 386)
2022 => (40, 394)
WORDS
2012 => (20, 1015)
2014 => (20, 1801)
2016 => (30, 2483)
2018 => (30, 3045)
2020 => (40, 3512)
2022 => (40, 3560)
LABS
2012 => (20, 36)
2014 => (20, 76)
2016 => (30, 103)
2018 => (30, 123)
2020 => (40, 140)
2022 => (40, 142)

Aligned UMAP projection

To use aligned UMAP, we need to create relations between each pair of consecutive datasets that map each entity index (article, author, word, lab) in one dataset to the corresponding entity index in the next dataset.

We will also create color mappings to view each entity in consecutive maps with the same color.

import numpy as np  # noqa


def make_relation(from_df, to_df):
    # create a new dataframe with index from from_df, and values as integers
    # starting from 0 to the length of from_df
    left = pd.DataFrame(data=np.arange(len(from_df)), index=from_df.index)

    # create a new dataframe with index from to_df, and values as integers
    # starting from 0 to the length of to_df
    right = pd.DataFrame(data=np.arange(len(to_df)), index=to_df.index)

    # merge left and right dataframes on the intersection of keys of both
    # dataframes preserving the order of left keys
    merge = pd.merge(left, right, how="inner",
                     left_index=True, right_index=True)

    return dict(merge.values)


def generate_relations(filtered_scores):

    relations = []

    for i in range(len(filtered_scores) - 1):
        relation = make_relation(filtered_scores[i], filtered_scores[i+1])

        relations.append(relation)

    return relations


def generate_colors(filtered_scores):

    colors = []

    for i in range(len(filtered_scores)):
        color = make_relation(filtered_scores[i], filtered_scores[-1])

        colors.append(color)

    return colors


def concat_scores(scores_article, scores_auth, scores_word, scores_lab):
    """Concatenates article, auth, words and labs score for each year and
       returns the concatenated frames in a list.
    """
    concatenated_scores = []

    for score_article, score_auth, score_word, score_lab in zip(
        scores_article, scores_auth, scores_word, scores_lab
    ):
        concatenated_scores.append(
            pd.concat([score_article, score_auth, score_word, score_lab]))

    return concatenated_scores


filtered_scores = concat_scores(
    scores_article, filtered_auth_scores, words_scores, filtered_lab_scores)


relations = generate_relations(filtered_scores)

colors_auth = generate_colors(filtered_auth_scores)
colors_words = generate_colors(words_scores)
colors_lab = generate_colors(filtered_lab_scores)
colors_articles = generate_colors(scores_article)
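
To make the relation format concrete, here is a small illustration with made-up toy series (not the real score series): entity "B" is entry 1 in the first series and entry 0 in the second, so aligned UMAP receives the mapping {1: 0}.

toy_from = pd.Series([3, 2], index=["A", "B"])
toy_to = pd.Series([5, 1], index=["B", "C"])
# Only "B" is present in both, at position 1 on the left and 0 on the right.
make_relation(toy_from, toy_to)   # -> {1: 0}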

For each dataset, we stack the four projected matrices and take the transpose so that UMAP receives one row per entity and one column per latent dimension.

def get_transpose(doc2vec_matrices):
    transposed_mats = []

    for doc2vec in doc2vec_matrices:
        transposed_mats.append(np.hstack(doc2vec).T)

    return transposed_mats


transposed_mats = get_transpose(doc2vec_matrices)

We create an AlignedUMAP instance to generate aligned maps for our 6 datasets.

from umap.aligned_umap import AlignedUMAP  # noqa

n_neighbors = [10, 10, 20, 20, 50, 80]
min_dists = [0.05, 0.01, 0.01, 0.01, 0.001, 0.001]
spreads = [1, 1, 1, 1, 1, 1]

n = 6

reducer = AlignedUMAP(
    n_neighbors=n_neighbors[:n],
    min_dist=min_dists[:n],
    #    n_neighbors=10,
    #    min_dist=0.05,
    init='random',
    n_epochs=300)

Aligned UMAP requires at least two datasets and the mappings between the entities of consecutive datasets. We can change the number of datasets used for the initial fit by changing the value of the variable n.

reducer.fit_transform(transposed_mats[:n], relations=relations[:n-1])
/usr/local/lib/python3.9/site-packages/umap/layouts.py:1044: RuntimeWarning: overflow encountered in cast
  epochs_per_sample[m].astype(np.float32) / negative_sample_rate
/usr/local/lib/python3.9/site-packages/umap/layouts.py:1049: RuntimeWarning: overflow encountered in cast
  epoch_of_next_sample.append(epochs_per_sample[m].astype(np.float32))

ListType[array(float32, 2d, C)]([[[ 1.636822   4.3104906]
 [-2.5236168 -2.0890348]
 [ 4.760189  -5.08321  ]
 ...
 [ 6.92317   -1.5047461]
 [ 6.876193  -1.5049655]
 [ 5.191547  -0.0401044]], [[-1.5608968 -2.107365 ]
 [-2.3126545 -2.7574205]
 [ 4.386353   2.9121335]
 ...
 [ 4.935302  -1.7496797]
 [ 3.0446644  0.941432 ]
 [ 3.0300503  1.4598998]], [[-1.9637237  -4.284528  ]
 [-1.5079978  -4.3302937 ]
 [ 6.923994    1.4484942 ]
 ...
 [ 0.76254576 -4.921924  ]
 [ 1.286736   -0.81693447]
 [ 1.5955446  -0.95411617]], [[-1.6925725  -4.697059  ]
 [-0.98491496 -3.441007  ]
 [ 6.1821823   0.7781479 ]
 ...
 [ 3.5565672  -2.5444643 ]
 [ 7.748843    0.28574288]
 [ 3.465931   -0.65967333]], [[-1.3731371  -4.6526    ]
 [-1.0955343  -4.0270705 ]
 [ 3.599194   -1.14507   ]
 ...
 [ 2.9202418  -0.47803032]
 [ 5.5627213  -1.3473547 ]
 [ 2.0311272  -1.689975  ]], [[-0.7290484  -4.4938498 ]
 [-0.9665915  -4.021041  ]
 [ 3.1820416  -0.44369522]
 ...
 [ 5.6325536  -1.2587452 ]
 [ 2.156493   -1.4376152 ]
 [ 5.267645   -0.9021303 ]]])

Then we could generate maps for any remaining datasets one by one by feeding the reducer with the corresponding matrix and relation dictionary. Since n is set to 6 here, all datasets are already included in the initial fit and the loop below does not add anything.

def update_reducer(reducer, matrices, relations,
                   n_neighbors, min_dists, spreads):

    for mat, rel, n_neighbor, min_dist, spread in zip(
            matrices, relations, n_neighbors, min_dists, spreads
    ):
        # reducer.update(mat, relations=rel)
        reducer.update(mat, relations=rel, n_neighbors=n_neighbor,
                       min_dist=min_dist, spread=spread, verbose=True)


update_reducer(reducer, transposed_mats[n:], relations[n-1:],
               n_neighbors[n:], min_dists[n:], spreads[n:])

""
for embedding in reducer.embeddings_:
    print(embedding.shape)
(1904, 2)
(3397, 2)
(4817, 2)
(5998, 2)
(6983, 2)
(7083, 2)

Plot maps

To plot the maps, we will reorganize the data.

We will first split the concatenated embeddings back into separate article, author, word and lab embeddings.

def update_embeddings_with_entities(embeddings, filtered_scores):
    r_embeddings = []

    for embedding, score in zip(embeddings, filtered_scores):
        score_indices = np.matrix(score.index).T
        embd = np.hstack((embedding, score_indices)).T
        r_embeddings.append(np.asarray(embd))

    return r_embeddings


def decompose_embeddings(embeddings):
    article_embeddings = []
    auth_embeddings = []
    word_embeddings = []
    lab_embeddings = []

    for i, embedding in enumerate(embeddings):
        len_article = len(scores_article[i])
        len_auth = len(filtered_auth_scores[i])
        len_word = len(words_scores[i])

        article_embeddings.append(embeddings[i][:len_article])
        auth_embeddings.append(
            embeddings[i][len_article: len_article + len_auth])
        word_embeddings.append(
            embeddings[i][len_article + len_auth:len_article + len_auth + len_word])
        lab_embeddings.append(
            embeddings[i][len_article + len_auth + len_word:])

    return article_embeddings, auth_embeddings, word_embeddings, lab_embeddings


article_embeddings, auth_embeddings, word_embeddings, lab_embeddings = decompose_embeddings(
    reducer.embeddings_)

We will add values to embeddings, so that we can visualize article title, author name, word or lab name when the entity is hovered on the plot.

article_embeddings_names = update_embeddings_with_entities(
    article_embeddings, scores_article
)

auth_embeddings_names = update_embeddings_with_entities(
    auth_embeddings, filtered_auth_scores
)

word_embeddings_names = update_embeddings_with_entities(
    word_embeddings, words_scores
)

lab_embeddings_names = update_embeddings_with_entities(
    lab_embeddings, filtered_lab_scores
)

We will create a colormap that maps each entity (article, author, word, lab) to the same color in every plot.

import matplotlib as mpl   # noqa


def create_color_map(max_list):
    # Resample the colormap so that there is one distinct color per entity.
    return mpl.colormaps['gist_ncar'].resampled(len(max_list))


cmap_article = create_color_map(scores_article[-1])
cmap_auth = create_color_map(filtered_auth_scores[-1])
cmap_word = create_color_map(words_scores[-1])
cmap_lab = create_color_map(filtered_lab_scores[-1])

Now we can plot all the maps.

import matplotlib.pyplot as plt  # noqa
import numpy as np   # noqa
import mplcursors    # noqa


plt.close('all')

def axis_bounds(embedding):
    left, right = embedding.T[0].min(), embedding.T[0].max()
    bottom, top = embedding.T[1].min(), embedding.T[1].max()
    adj_h, adj_v = (right - left) * 0.1, (top - bottom) * 0.1

    return [left - adj_h, right + adj_h, bottom - adj_v, top + adj_v]


ax_bound = axis_bounds(np.vstack(reducer.embeddings_))
ax_bound



def plot(title, embedding_article, embedding_auth, embedding_word, embedding_lab,
         colors_article, colors_auth, colors_word, colors_lab,
         c_scores=None, c_umap=None, n_clusters=None):

    colors_article_list = list(colors_article.values())
    colors_auth_list = list(colors_auth.values())
    colors_word_list = list(colors_word.values())
    colors_lab_list = list(colors_lab.values())

    fig, ax = plt.subplots(1, 1, figsize=(10, 9))

    ax.set_title(title)
    ax.axis(ax_bound)

    article = ax.scatter(embedding_article[0], embedding_article[1], c=colors_article_list, cmap=cmap_article,
                         vmin=0, vmax=len(scores_article[-1]), marker='x', label="article")

    auth = ax.scatter(embedding_auth[0], embedding_auth[1], c=colors_auth_list, cmap=cmap_auth,
                      vmin=0, vmax=len(filtered_auth_scores[-1]), label="auth")

    word = ax.scatter(embedding_word[0], embedding_word[1], c=colors_word_list, cmap=cmap_word,
                      vmin=0, vmax=len(words_scores[-1]), marker='+', label="word")

    lab = ax.scatter(embedding_lab[0], embedding_lab[1], c=colors_lab_list, cmap=cmap_lab,
                     vmin=0, vmax=len(filtered_lab_scores[-1]), marker='s', label="lab")

    # from https://matplotlib.org/stable/gallery/event_handling/legend_picking.html
    leg = ax.legend((article, auth, word, lab), ("article",
                                                 "auth", "words", "labs"), fancybox=True, shadow=True)

    if n_clusters is not None:
        for i in range(n_clusters):
            ax.annotate(c_scores.index[i], (c_umap[0, i], c_umap[1, i]))

#    crs_article = mplcursors.cursor(article, hover=True)

#    @crs_article.connect("add")
#    def on_add(sel):
#        sel.annotation.set(text=embedding_article[2][sel.target.index])

    crs_auth = mplcursors.cursor(auth, hover=True)

    @crs_auth.connect("add")
    def on_add(sel):
        sel.annotation.set(text=embedding_auth[2][sel.target.index])

#    crs_word = mplcursors.cursor(word, hover=True)

#    @crs_word.connect("add")
#    def on_add(sel):
#        sel.annotation.set(text=embedding_word[2][sel.target.index])

    crs_lab = mplcursors.cursor(lab, hover=True)

    @crs_lab.connect("add")
    def on_add(sel):
        sel.annotation.set(text=embedding_lab[2][sel.target.index])

    plt.show()


for i, (embedding_article, embedding_auth, embedding_word, embedding_lab,
        color_article, color_auth, color_word, color_lab) in enumerate(zip(
    article_embeddings_names, auth_embeddings_names, word_embeddings_names, lab_embeddings_names,
    colors_articles, colors_auth, colors_words, colors_lab)):

    plot(f"Year {2010 + (i * 2) + 2}", embedding_article, embedding_auth, embedding_word, embedding_lab,
         color_article, color_auth, color_word, color_lab)
(Six maps are generated and displayed, one per cumulative dataset: Year 2012, Year 2014, Year 2016, Year 2018, Year 2020, Year 2022.)

Clustering

In order to identify clusters, we use the KMeans clustering technique on the articles. We’ll also try to label these clusters by selecting the most frequent words that appear in each cluster’s articles.
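
As background (not part of the original workflow), KMeans partitions points into k groups by repeatedly assigning each point to the nearest cluster centre and updating the centres. Here is a tiny standalone scikit-learn sketch with made-up 2D points, distinct from the cartodata create_kmeans_clusters helper used below:

from sklearn.cluster import KMeans   # noqa

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # 2D coordinates of the cluster centres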

from cartodata.clustering import create_kmeans_clusters  # noqa
n_clusters = 8


def find_clusters(n_clusters, article_embeddings, word_embeddings,
                  words_matrices, words_scores, matrices):

    c_scores_all = []
    c_umap_all = []

    for i in range(6):
        cluster_labels = []
        c_lda, c_umap, c_scores, c_knn, _, _, _ = create_kmeans_clusters(
            n_clusters,               # number of clusters to create
            article_embeddings[i].T,  # 2D matrix of articles
            word_embeddings[i].T,     # the 2D matrix of words
            words_matrices[i],        # the articles to words matrix
            words_scores[i],          # word scores
            cluster_labels,           # a list of initial cluster labels
            matrices[i][2])           # space matrix of words
        c_scores_all.append(c_scores)
        c_umap_all.append(c_umap)

    return c_scores_all, c_umap_all


c_scores_all, c_umap_all = find_clusters(n_clusters, article_embeddings, word_embeddings,
                                         words_matrices, words_scores, doc2vec_matrices)

Let's plot the maps again, this time with the cluster labels.

for i, (embedding_article, embedding_auth, embedding_word, embedding_lab,
        color_article, color_auth, color_word, color_lab, c_scores, c_umap) in enumerate(zip(
    article_embeddings_names, auth_embeddings_names, word_embeddings_names, lab_embeddings_names,
    colors_articles, colors_auth, colors_words, colors_lab, c_scores_all, c_umap_all)):

    plot(f"Year {2010+ (i*2) + 2}", embedding_article, embedding_auth, embedding_word, embedding_lab,
         color_article, color_auth, color_word, color_lab, c_scores, c_umap, n_clusters)
(Six maps with cluster labels are generated, one per cumulative dataset: Year 2012, Year 2014, Year 2016, Year 2018, Year 2020, Year 2022.)

The 8 clusters that we created give us a general idea of what the big groups of data contain. But we'll probably want a finer level of detail when we zoom in and focus on smaller areas. So we'll also create a second, larger set of clusters; to do this, we simply increase the number of clusters we want.

n_clusters = 32

mc_scores_all, mc_umap_all = find_clusters(
    n_clusters, article_embeddings, word_embeddings, words_matrices, words_scores, doc2vec_matrices
)
Warning: Less than 2 words in cluster 5 with (0) words.
Warning: Less than 2 words in cluster 5 with (0) words.
Warning: Less than 2 words in cluster 10 with (0) words.
Warning: Less than 2 words in cluster 10 with (0) words.
Warning: Less than 2 words in cluster 29 with (1) words.
Warning: Less than 2 words in cluster 29 with (1) words.
Warning: Less than 2 words in cluster 8 with (0) words.
Warning: Less than 2 words in cluster 8 with (0) words.

Nearest neighbors

Another thing that can help us appreciate the quality of our data is each point's nearest neighbors. If our data processing is done correctly, we expect related articles, labs, words and authors to be located close to each other.

Finding nearest neighbors is a common task with various algorithms that aim to solve it. The get_neighbors method uses one of these algorithms to find the nearest points of each type. It takes an optional weight parameter to tweak the distance calculation so that it favors points with a higher score even if they are a bit farther away, instead of just selecting the closest neighbors.

Because we want to find the neighbors of each type (articles, authors, words, labs) for all of the entities, we call the get_neighbors method in a loop and store its results in an array.
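
As a side note (not part of the original workflow), this is what a plain nearest-neighbor query looks like with scikit-learn on made-up 2D points; the actual computation below is delegated to cartodata's get_neighbors:

from sklearn.neighbors import NearestNeighbors   # noqa

points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
nn = NearestNeighbors(n_neighbors=2).fit(points)
dist, idx = nn.kneighbors(points)
print(idx)   # for each point: itself and its closest neighbor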

from cartodata.neighbors import get_neighbors  # noqa


def find_neighbors(articles_scores, authors_scores, words_scores,
                   labs_scores, matrices):

    weights = [0, 0.5, 0.5, 0]

    all_neighbors = []
    all_scores = []

    for i in range(6):
        scores = [articles_scores[i], authors_scores[i],
                  words_scores[i], labs_scores[i]]
        neighbors = []

        matrix = matrices[i]
        for idx in range(len(matrix)):
            neighbors.append(get_neighbors(matrix[idx],
                                           scores[idx],
                                           matrix,
                                           weights[idx]))
        all_neighbors.append(neighbors)
        all_scores.append(scores)

    return all_neighbors, all_scores


all_neighbors, all_scores = find_neighbors(scores_article, filtered_auth_scores,
                                           words_scores, filtered_lab_scores, doc2vec_matrices)

Exporting

We now have sufficient data to create a meaningful visualization.

from cartodata.operations import export_to_json  # noqa

natures = ['articles',
           'authors',
           'words',
           'labs',
           'hl_clusters',
           'ml_clusters']


def export(from_year, struct, article_embeddings, authors_embeddings, word_embeddings, lab_embeddings,
           c_umap, mc_umap, c_scores, mc_scores, all_neighbors, all_scores):


    for i in range(6):
        export_file = f"../datas/{struct}_workflow_doc2vec_aligned_{from_year}_{from_year + (2 + 2 * i)}.json"

        # add the clusters to list of 2d matrices and scores
        umap_matrices = [article_embeddings[i].T,
                         authors_embeddings[i].T,
                         word_embeddings[i].T,
                         lab_embeddings[i].T,
                         c_umap[i],
                         mc_umap[i]]

        all_scores[i].extend([c_scores[i], mc_scores[i]])


        # create a json export file with all the infos
        export_to_json(natures,
                       umap_matrices,
                       all_scores[i],
                       export_file,
                       neighbors_natures=natures[:4],
                       neighbors=all_neighbors[i])


export(2010, "lisn", article_embeddings, auth_embeddings, word_embeddings, lab_embeddings,
           c_umap_all, mc_umap_all, c_scores_all, mc_scores_all, all_neighbors, all_scores)

This creates the files:

  • lisn_workflow_doc2vec_aligned_2010_2012.json

  • lisn_workflow_doc2vec_aligned_2010_2014.json

  • lisn_workflow_doc2vec_aligned_2010_2016.json

  • lisn_workflow_doc2vec_aligned_2010_2018.json

  • lisn_workflow_doc2vec_aligned_2010_2020.json

  • lisn_workflow_doc2vec_aligned_2010_2022.json

each of which contains a list of points ready to be imported into Cartolabe. Let's have a look at the last one to check that it contains everything we need.

import json  # noqa

export_file = "../datas/lisn_workflow_doc2vec_aligned_2010_2022.json"

with open(export_file, 'r') as f:
    data = json.load(f)

data[1]
{'position': [-0.966591477394104, -4.021040916442871], 'score': 1.0, 'rank': 1, 'nature': 'articles', 'label': 'Bandit-Based Genetic Programming with Application to Reinforcement Learning', 'neighbors': {'articles': [1, 1869, 1478, 834, 387, 1552, 2871, 1895, 2155, 422], 'authors': [2987, 3028, 3043, 3032, 3035, 3053, 3034, 3208, 3157, 3047], 'words': [5394, 3767, 5395, 6515, 3768, 5396, 3652, 4068, 6338, 5651], 'labs': [6944, 7011, 7032, 7066, 6985, 6951, 7030, 6983, 6994, 6996]}}

Total running time of the script: (24 minutes 33.673 seconds)
