.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/workflow_aligned_lisn_lsa_kmeans.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_workflow_aligned_lisn_lsa_kmeans.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_workflow_aligned_lisn_lsa_kmeans.py:


Extracting and processing LISN data for Cartolabe with Aligned UMAP (LSA projection)
=====================================================================================

In this example, we will:

- extract entities (authors, articles, labs, words) from a collection of scientific articles
- project those entities in 2 dimensions using aligned UMAP
- cluster them
- find their nearest neighbors.

This example uses 6 datasets containing articles from HAL (https://hal.archives-ouvertes.fr/) published by authors from LISN (Laboratoire Interdisciplinaire des Sciences du Numérique). Each dataset contains cumulative data starting in 2010; the end years range from 2012 to 2022 (inclusive) in two-year steps.

Using aligned UMAP will let us align the UMAP embeddings of consecutive datasets, so that entities keep comparable positions from one map to the next.

.. GENERATED FROM PYTHON SOURCE LINES 20-23

.. code-block:: Python


    # %matplotlib widget








.. GENERATED FROM PYTHON SOURCE LINES 24-28

Download data
================

We will start by downloading 6 datasets from HAL.

.. GENERATED FROM PYTHON SOURCE LINES 28-60

.. code-block:: Python


    from cartodata.scraping import scrape_hal, process_domain_column   # noqa
    import os   # noqa


    def fetch_data(from_year, to_year, struct_ids, struct='hal'):
        """
        Fetch scientific publications of struct from HAL.
        """

        filename = f"../datas/{struct.lower()}_{from_year}_{to_year - 1}.csv"

        if os.path.exists(filename):
            return

        filters = {}

        if struct:
            filters['structId_i'] = struct_ids
        years = range(from_year, to_year)
        df = scrape_hal(struct, filters, years, cool_down=2)
        process_domain_column(df)

        # Save the dataframe into a csv file
        df.to_csv(filename)
        return df


    # Fetch LISN data from year 2010 to 2022 inclusive
    # (cumulative datasets ending in 2012, 2014, ..., 2022)
    for i in range(2, 14, 2):
        fetch_data(2010, 2011+i, "(1050003 2544)", 'lisn')








.. GENERATED FROM PYTHON SOURCE LINES 61-65

Load data
================

Now we will load each downloaded dataset into a dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 65-85

.. code-block:: Python


    import pandas as pd   # noqa


    def load_data(from_year, total, struct):
        dataframes = []

        for i in range(total):
            file_name = f"../datas/{struct.lower()}_{from_year}_{from_year + 2 + 2 * i}.csv"
            print(file_name)
            df = pd.read_csv(file_name, index_col=0)

            dataframes.append(df)

        return dataframes


    dataframes = load_data(2010, 6, "lisn")
    dataframes[0].head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ../datas/lisn_2010_2012.csv
    ../datas/lisn_2010_2014.csv
    ../datas/lisn_2010_2016.csv
    ../datas/lisn_2010_2018.csv
    ../datas/lisn_2010_2020.csv
    ../datas/lisn_2010_2022.csv


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>structId_i</th>
          <th>authFullName_s</th>
          <th>en_abstract_s</th>
          <th>en_keyword_s</th>
          <th>en_title_s</th>
          <th>structAcronym_s</th>
          <th>producedDateY_i</th>
          <th>producedDateM_i</th>
          <th>halId_s</th>
          <th>docid</th>
          <th>en_domainAllCodeLabel_fs</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>[2544, 92966, 411575, 441569, 56057, 2544, 929...</td>
          <td>Olivier Teytaud</td>
          <td>This document is devoted to artificial intelli...</td>
          <td>Evolutionary optimization,Parallelism</td>
          <td>Artificial Intelligence and Optimization with ...</td>
          <td>LRI,UP11,CNRS,TAO,LRI,UP11,CNRS,Inria,LISN</td>
          <td>2010</td>
          <td>11.0</td>
          <td>tel-01078099</td>
          <td>1078099</td>
          <td>Optimization and Control,Mathematics</td>
        </tr>
        <tr>
          <th>1</th>
          <td>[2544, 92966, 411575, 441569, 56057, 2544, 929...</td>
          <td>J.-B Hoock,O Teytaud</td>
          <td>When looking for relevant mutations of a learn...</td>
          <td>Mots-clés  Reinforcement learning,Monte-Carlo ...</td>
          <td>Bandit-Based Genetic Programming with Applicat...</td>
          <td>LRI,UP11,CNRS,TAO,LRI,UP11,CNRS,Inria,LISN</td>
          <td>2010</td>
          <td>5.0</td>
          <td>hal-01098456</td>
          <td>1098456</td>
          <td>Computer Science and Game Theory,Computer Science</td>
        </tr>
        <tr>
          <th>2</th>
          <td>[16574, 179741, 300351, 56050, 2544, 92966, 41...</td>
          <td>Pierre Cubaud,Alexandre Topol,Emmanuel Pietriga</td>
          <td>This document provides feedback about the curr...</td>
          <td>NaN</td>
          <td>ALMA Graphical User Interfaces</td>
          <td>CEDRIC,ENSIIE,CNAM,IN-SITU,LRI,UP11,CNRS,Inria...</td>
          <td>2010</td>
          <td>3.0</td>
          <td>hal-01126094</td>
          <td>1126094</td>
          <td>Computer Interaction,Computer Science</td>
        </tr>
        <tr>
          <th>3</th>
          <td>[2544, 92966, 411575, 441569, 3210, 301243, 30...</td>
          <td>Dominique Gouyou-Beauchamps,Cyril Nicaud</td>
          <td>Generalizing an idea used by Alonso to generat...</td>
          <td>Random generation,Binomial distribution</td>
          <td>Random Generation Using Binomial Approximations</td>
          <td>LRI,UP11,CNRS,LIGM,UPEM,ENPC,BEZOUT,CNRS,CNRS,...</td>
          <td>2010</td>
          <td>NaN</td>
          <td>hal-01185570</td>
          <td>1185570</td>
          <td>Computational Geometry,Computer Science</td>
        </tr>
        <tr>
          <th>4</th>
          <td>[391379, 233, 93591, 441569, 103784, 2071, 300...</td>
          <td>Olivier Bodini,Yann Ponty</td>
          <td>We address the uniform random generation of wo...</td>
          <td>Boltzmann sampling,Context-free languages,Rand...</td>
          <td>Multi-dimensional Boltzmann Sampling of Languages</td>
          <td>APR,LIP6,UPMC,CNRS,AMIB,LIX,X,IP Paris,CNRS,LR...</td>
          <td>2010</td>
          <td>6.0</td>
          <td>hal-00450763</td>
          <td>1185592</td>
          <td>Computational Geometry,Computer Science</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 86-87

We can list the cumulative number of articles from 2010 up to each end year.

.. GENERATED FROM PYTHON SOURCE LINES 87-91

.. code-block:: Python


    for i in range(6):
        print(f"{2010 + (i*2) + 2} => {dataframes[i].shape[0]}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => 765
    2014 => 1361
    2016 => 1983
    2018 => 2514
    2020 => 2945
    2022 => 2987




.. GENERATED FROM PYTHON SOURCE LINES 92-97

Creating correspondence matrices for each entity type
================================================================

From this table of articles, we want to extract matrices that map the
correspondence between the articles and the entities we want to use.

.. GENERATED FROM PYTHON SOURCE LINES 97-114

.. code-block:: Python


    from cartodata.loading import load_comma_separated_column  # noqa
    from cartodata.loading import load_identity_column  # noqa


    def create_correspondance_matrices(func, dataframes, column_name):
        matrices = []
        scores = []

        for i in range(len(dataframes)):
            mat, score = func(dataframes[i], column_name)
            matrices.append(mat)
            scores.append(score)

        return matrices, scores









.. GENERATED FROM PYTHON SOURCE LINES 115-123

The `load_comma_separated_column` function takes a `dataframe` and a column
name and returns two objects:

- a sparse matrix
- a pandas Series

Each column of the sparse matrix corresponds to one entity extracted from
`column_name`, and each row corresponds to an article.

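As a minimal sketch, we can call the function directly on the first dataset
to inspect the two return values:

.. code-block:: Python

    # Sketch: `mat` is a sparse (articles x authors) matrix and `score` is a
    # pandas Series mapping each author to its article count.
    mat, score = load_comma_separated_column(dataframes[0], 'authFullName_s')
    print(mat.shape)
    print(score.head())
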
.. GENERATED FROM PYTHON SOURCE LINES 125-134

Authors
--------------

Let's start with the authors. We want to create a matrix where the rows
represent the articles and the columns represent the authors. Each cell
(n, m) contains a 1 if the *nth* article was written by the *mth* author.

As we have multiple dataframes, the result is a list with one matrix and one
score series per dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 134-139

.. code-block:: Python


    matrices_auth, scores_auth = create_correspondance_matrices(
        load_comma_separated_column, dataframes, 'authFullName_s'
    )




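As a quick sketch, we can verify the mapping for a single author; this
assumes the matrices are SciPy sparse matrices and that author names are
unique in the score index:

.. code-block:: Python

    # Locate the most prolific author of the first dataset and count the
    # articles selected by their column; it should equal the author's score.
    top_author = scores_auth[0].idxmax()
    col = scores_auth[0].index.get_loc(top_author)
    article_rows = matrices_auth[0][:, col].nonzero()[0]
    print(top_author, '->', len(article_rows), 'articles')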




.. GENERATED FROM PYTHON SOURCE LINES 140-141

We can see the number of (articles, authors) for each year:

.. GENERATED FROM PYTHON SOURCE LINES 141-145

.. code-block:: Python


    for i in range(len(matrices_auth)):
        print(f"{2010 + (i*2) + 2} => {matrices_auth[i].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (765, 1666)
    2014 => (1361, 2338)
    2016 => (1983, 3135)
    2018 => (2514, 3882)
    2020 => (2945, 4554)
    2022 => (2987, 4652)




.. GENERATED FROM PYTHON SOURCE LINES 146-151

Each series in `scores_auth` is indexed by the authors extracted from the
column `authFullName_s`; its values count the number of rows (articles) each
author is mapped to in the corresponding matrix of `matrices_auth`.

Let's have a look at the number of articles per author for each year:

.. GENERATED FROM PYTHON SOURCE LINES 151-161

.. code-block:: Python


    df_scores_auth = pd.concat(scores_auth, axis=1).reset_index().rename(
        columns={i: f"{2010 + (i*2) + 2}" for i in range(6)}
    )

    df_scores_auth.head(20)

    ""
    df_scores_auth.describe()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>2012</th>
          <th>2014</th>
          <th>2016</th>
          <th>2018</th>
          <th>2020</th>
          <th>2022</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>1666.000000</td>
          <td>2338.000000</td>
          <td>3135.000000</td>
          <td>3882.000000</td>
          <td>4554.000000</td>
          <td>4652.00000</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>2.206483</td>
          <td>2.465783</td>
          <td>2.590112</td>
          <td>2.685987</td>
          <td>2.705314</td>
          <td>2.69239</td>
        </tr>
        <tr>
          <th>std</th>
          <td>3.080791</td>
          <td>4.044985</td>
          <td>4.862247</td>
          <td>5.312939</td>
          <td>5.627630</td>
          <td>5.61770</td>
        </tr>
        <tr>
          <th>min</th>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.00000</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.00000</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.00000</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>2.000000</td>
          <td>3.000000</td>
          <td>2.000000</td>
          <td>2.000000</td>
          <td>2.000000</td>
          <td>2.00000</td>
        </tr>
        <tr>
          <th>max</th>
          <td>41.000000</td>
          <td>60.000000</td>
          <td>77.000000</td>
          <td>94.000000</td>
          <td>113.000000</td>
          <td>113.00000</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 162-167

Labs
----------

Similarly, we can create matrices for the labs by simply passing the
`structAcronym_s` column to the function.

.. GENERATED FROM PYTHON SOURCE LINES 167-171

.. code-block:: Python


    matrices_labs, scores_labs = create_correspondance_matrices(
        load_comma_separated_column, dataframes, 'structAcronym_s')








.. GENERATED FROM PYTHON SOURCE LINES 172-173

We can see the number of (articles, labs) for each year:

.. GENERATED FROM PYTHON SOURCE LINES 173-177

.. code-block:: Python


    for i in range(len(matrices_labs)):
        print(f"{2010 + (i*2) + 2} => {matrices_labs[i].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (765, 602)
    2014 => (1361, 838)
    2016 => (1983, 1041)
    2018 => (2514, 1254)
    2020 => (2945, 1426)
    2022 => (2987, 1447)




.. GENERATED FROM PYTHON SOURCE LINES 178-179

Let's have a look at the number of articles per lab for each year:

.. GENERATED FROM PYTHON SOURCE LINES 179-189

.. code-block:: Python


    df_scores_labs = pd.concat(scores_labs, axis=1).reset_index().rename(
        columns={i: f"{2010 + (i*2) + 2}" for i in range(6)}
    )

    df_scores_labs.head(20)

    ""
    df_scores_labs.describe()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>2012</th>
          <th>2014</th>
          <th>2016</th>
          <th>2018</th>
          <th>2020</th>
          <th>2022</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>count</th>
          <td>602.000000</td>
          <td>838.00000</td>
          <td>1041.000000</td>
          <td>1254.000000</td>
          <td>1426.000000</td>
          <td>1447.000000</td>
        </tr>
        <tr>
          <th>mean</th>
          <td>15.131229</td>
          <td>19.21599</td>
          <td>22.337176</td>
          <td>23.566986</td>
          <td>24.131837</td>
          <td>24.230822</td>
        </tr>
        <tr>
          <th>std</th>
          <td>106.106441</td>
          <td>159.00098</td>
          <td>201.821286</td>
          <td>230.275677</td>
          <td>248.591470</td>
          <td>250.370755</td>
        </tr>
        <tr>
          <th>min</th>
          <td>1.000000</td>
          <td>1.00000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
        </tr>
        <tr>
          <th>25%</th>
          <td>1.000000</td>
          <td>1.00000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
          <td>1.000000</td>
        </tr>
        <tr>
          <th>50%</th>
          <td>2.000000</td>
          <td>2.00000</td>
          <td>2.000000</td>
          <td>2.000000</td>
          <td>2.000000</td>
          <td>2.000000</td>
        </tr>
        <tr>
          <th>75%</th>
          <td>4.000000</td>
          <td>5.00000</td>
          <td>6.000000</td>
          <td>6.000000</td>
          <td>6.000000</td>
          <td>7.000000</td>
        </tr>
        <tr>
          <th>max</th>
          <td>1712.000000</td>
          <td>3001.00000</td>
          <td>4218.000000</td>
          <td>5300.000000</td>
          <td>6183.000000</td>
          <td>6297.000000</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 190-197

Filtering low score entities
===============================

A lot of the authors and labs that we just extracted from the dataframe have
a very low score, which means they are only linked to one or two articles. To
improve the quality of our data, we'll filter the entities by removing those
that appear fewer than a certain number of times.

.. GENERATED FROM PYTHON SOURCE LINES 197-221

.. code-block:: Python


    from cartodata.operations import filter_min_score  # noqa


    def filter_low_score(matrices, scores, count, entity):

        filtered_matrices = []
        filtered_scores = []

        for i, (matrix, score) in enumerate(zip(matrices, scores)):
            before = len(score)
            filtered_mat, filtered_score = filter_min_score(matrix, score, count)

            filtered_matrices.append(filtered_mat)
            filtered_scores.append(filtered_score)

            print(f'Removed {before - len(filtered_score)} {entity}s with less than ' +
                  f'{count} articles from a total of {before} {entity}s.')

            print(f'Working with {len(filtered_score)} {entity} for year {2010 + (i*2) + 2}.\n')

        return filtered_matrices, filtered_scores





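As a standalone sketch, `filter_min_score` can also be applied to a single
matrix/score pair; here we keep authors with at least 5 articles in the first
dataset (the matching matrix columns are dropped along with the scores):

.. code-block:: Python

    mat_f, score_f = filter_min_score(matrices_auth[0], scores_auth[0], 5)
    print(matrices_auth[0].shape, '->', mat_f.shape)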




.. GENERATED FROM PYTHON SOURCE LINES 222-226

Authors
-------------

We will remove the authors with fewer than 5 publications.

.. GENERATED FROM PYTHON SOURCE LINES 226-232

.. code-block:: Python


    auth_pub_count = 5

    filtered_auth_matrices, filtered_auth_scores = filter_low_score(
        matrices_auth, scores_auth, auth_pub_count, "author")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Removed 1578 authors with less than 5 articles from a total of 1666 authors.
    Working with 88 author for year 2012.

    Removed 2179 authors with less than 5 articles from a total of 2338 authors.
    Working with 159 author for year 2014.

    Removed 2887 authors with less than 5 articles from a total of 3135 authors.
    Working with 248 author for year 2016.

    Removed 3566 authors with less than 5 articles from a total of 3882 authors.
    Working with 316 author for year 2018.

    Removed 4168 authors with less than 5 articles from a total of 4554 authors.
    Working with 386 author for year 2020.

    Removed 4258 authors with less than 5 articles from a total of 4652 authors.
    Working with 394 author for year 2022.





.. GENERATED FROM PYTHON SOURCE LINES 233-237

Labs
---------

We will remove the labs with fewer than 20 publications.

.. GENERATED FROM PYTHON SOURCE LINES 237-243

.. code-block:: Python


    lab_pub_count = 20

    filtered_lab_matrices, filtered_lab_scores = filter_low_score(
        matrices_labs, scores_labs, lab_pub_count, "labs")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Removed 566 labss with less than 20 articles from a total of 602 labss.
    Working with 36 labs for year 2012.

    Removed 762 labss with less than 20 articles from a total of 838 labss.
    Working with 76 labs for year 2014.

    Removed 938 labss with less than 20 articles from a total of 1041 labss.
    Working with 103 labs for year 2016.

    Removed 1131 labss with less than 20 articles from a total of 1254 labss.
    Working with 123 labs for year 2018.

    Removed 1286 labss with less than 20 articles from a total of 1426 labss.
    Working with 140 labs for year 2020.

    Removed 1305 labss with less than 20 articles from a total of 1447 labss.
    Working with 142 labs for year 2022.





.. GENERATED FROM PYTHON SOURCE LINES 244-252

Words
--------------

For the words, it's a bit trickier because we want to extract n-grams (groups
of n terms) instead of just comma-separated values. We'll call
`load_text_column`, which uses scikit-learn's
`CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
to create a vocabulary and map the tokens.

.. GENERATED FROM PYTHON SOURCE LINES 252-289

.. code-block:: Python


    from cartodata.loading import load_text_column  # noqa
    from sklearn.feature_extraction import text as sktxt  # noqa
    from cartodata.operations import normalize_tfidf  # noqa


    def process_words(dataframes, stopwords):

        words_matrices = []
        words_scores = []

        for df in dataframes:
            df['text'] = df['en_abstract_s'] + ' ' \
                + df['en_keyword_s'].astype(str) + ' ' \
                + df['en_title_s'].astype(str) + ' ' \
                + df['en_domainAllCodeLabel_fs'].astype(str)

            words_mat, words_score = load_text_column(df['text'],
                                                      4,
                                                      10,
                                                      0.05,
                                                      stopwords=stopwords)
            # apply term-frequency times inverse document-frequency normalization
            words_mat = normalize_tfidf(words_mat)

            words_matrices.append(words_mat)
            words_scores.append(words_score)

        return words_matrices, words_scores


    with open('../datas/stopwords.txt', 'r') as stop_file:
        stopwords = sktxt.ENGLISH_STOP_WORDS.union(
            set(stop_file.read().splitlines()))

        words_matrices, words_scores = process_words(dataframes, stopwords)




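For reference, here is a standalone sketch of n-gram extraction with
scikit-learn's `CountVectorizer`; the parameters below are illustrative only,
not the ones `load_text_column` uses internally:

.. code-block:: Python

    from sklearn.feature_extraction.text import CountVectorizer  # noqa

    # Extract 1- to 4-grams from a toy document, dropping English stopwords
    demo = CountVectorizer(ngram_range=(1, 4), stop_words='english')
    demo.fit(["aligned UMAP projects articles, authors, words and labs"])
    print(demo.get_feature_names_out()[:5])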




.. GENERATED FROM PYTHON SOURCE LINES 290-291

We can list the shape (number of articles, number of terms) for each year:

.. GENERATED FROM PYTHON SOURCE LINES 291-295

.. code-block:: Python


    for i in range(len(words_matrices)):
        print(f"{2010 + (i*2) + 2} => {words_matrices[i].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (765, 1015)
    2014 => (1361, 1801)
    2016 => (1983, 2483)
    2018 => (2514, 3045)
    2020 => (2945, 3512)
    2022 => (2987, 3560)




.. GENERATED FROM PYTHON SOURCE LINES 296-300

Articles
------------

Finally, we need to create a matrix that simply maps each article to itself.

.. GENERATED FROM PYTHON SOURCE LINES 300-308

.. code-block:: Python


    matrices_article, scores_article = create_correspondance_matrices(
        load_identity_column, dataframes, 'en_title_s')

    ""
    for i in range(len(matrices_article)):
        print(f"{2010 + (i*2) + 2} => {matrices_article[i].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (765, 765)
    2014 => (1361, 1361)
    2016 => (1983, 1983)
    2018 => (2514, 2514)
    2020 => (2945, 2945)
    2022 => (2987, 2987)




.. GENERATED FROM PYTHON SOURCE LINES 309-331

Dimension reduction
====================

One way to see the matrices that we created is as coordinates in the space of
all articles. What we want to do is to reduce the dimension of this space to
make it easier to work with and see.

LSA projection
--------------------------

We'll start by using the LSA (Latent Semantic Analysis) technique to identify
keywords in our data and thus reduce the number of rows in our matrices. The
`lsa_projection` method takes three arguments:

- the number of dimensions you want to keep
- the matrix of documents/words frequency
- a list of matrices to project

It returns a list of the same length containing the matrices projected in the
latent space.

We also apply an l2 normalization to each feature of the projected matrices.

.. GENERATED FROM PYTHON SOURCE LINES 331-360

.. code-block:: Python


    from cartodata.projection import lsa_projection  # noqa
    from cartodata.operations import normalize_l2  # noqa

    dimensions = [12, 12, 14, 16, 20, 20]


    def run_lsa_projection(articles_matrices, auth_matrices, words_matrices, 
                           lab_matrices, dimensions):
        lsa_matrices = []

        for articles_mat, auth_mat, words_mat, labs_mat, dim in zip(
            articles_matrices, auth_matrices, words_matrices, lab_matrices, dimensions
        ):

            lsa_mat = lsa_projection(
                dim, words_mat, [articles_mat, auth_mat, words_mat, labs_mat]
            )
            lsa_mat = list(map(normalize_l2, lsa_mat))

            lsa_matrices.append(lsa_mat)

        return lsa_matrices


    lsa_matrices = run_lsa_projection(matrices_article, filtered_auth_matrices, 
                                      words_matrices, filtered_lab_matrices, 
                                      dimensions)




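For reference, the loop above amounts to one `lsa_projection` call per
dataset; a sketch of the single call for the first dataset:

.. code-block:: Python

    # Project the four entity matrices of the first dataset into a
    # 12-dimensional latent space, then l2-normalize each projection.
    lsa_first = lsa_projection(
        dimensions[0], words_matrices[0],
        [matrices_article[0], filtered_auth_matrices[0],
         words_matrices[0], filtered_lab_matrices[0]])
    lsa_first = list(map(normalize_l2, lsa_first))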




.. GENERATED FROM PYTHON SOURCE LINES 361-362

List (dimensions, number_of_articles) per year:

.. GENERATED FROM PYTHON SOURCE LINES 362-366

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][0].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (12, 765)
    2014 => (12, 1361)
    2016 => (14, 1983)
    2018 => (16, 2514)
    2020 => (20, 2945)
    2022 => (20, 2987)




.. GENERATED FROM PYTHON SOURCE LINES 367-368

List (dimensions, number_of_authors) per year:

.. GENERATED FROM PYTHON SOURCE LINES 368-372

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][1].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (12, 88)
    2014 => (12, 159)
    2016 => (14, 248)
    2018 => (16, 316)
    2020 => (20, 386)
    2022 => (20, 394)




.. GENERATED FROM PYTHON SOURCE LINES 373-374

List (dimensions, number_of_words) per year:

.. GENERATED FROM PYTHON SOURCE LINES 374-378

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][2].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (12, 1015)
    2014 => (12, 1801)
    2016 => (14, 2483)
    2018 => (16, 3045)
    2020 => (20, 3512)
    2022 => (20, 3560)




.. GENERATED FROM PYTHON SOURCE LINES 379-380

List (dimensions, number_of_labs) per year:

.. GENERATED FROM PYTHON SOURCE LINES 380-384

.. code-block:: Python


    for i in range(len(lsa_matrices)):
        print(f"{2010 + (i*2) + 2} => {lsa_matrices[i][3].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2012 => (12, 36)
    2014 => (12, 76)
    2016 => (14, 103)
    2018 => (16, 123)
    2020 => (20, 140)
    2022 => (20, 142)




.. GENERATED FROM PYTHON SOURCE LINES 385-391

Aligned UMAP projection
===========================

To use aligned UMAP, we need to create relations between each pair of consecutive datasets, mapping each entity index (article, author, word, lab) in one dataset to the corresponding entity index in the following dataset.

We will also create color mappings, so that each entity keeps the same color across consecutive maps.

.. GENERATED FROM PYTHON SOURCE LINES 391-463

.. code-block:: Python


    import numpy as np  # noqa


    def make_relation(from_df, to_df):
        # create a new dataframe with index from from_df, and values as integers 
        # starting from 0 to the length of from_df
        left = pd.DataFrame(data=np.arange(len(from_df)), index=from_df.index)
    
        # create a new dataframe with index from to_df, and values as integers 
        # starting from 0 to the length of to_df
        right = pd.DataFrame(data=np.arange(len(to_df)), index=to_df.index)
    
        # merge left and right dataframes on the intersection of keys of both
        # dataframes preserving the order of left keys
        merge = pd.merge(left, right, how="inner",
                         left_index=True, right_index=True)
    
        return dict(merge.values)


    def generate_relations(filtered_scores):

        relations = []

        for i in range(len(filtered_scores) - 1):
            relation = make_relation(filtered_scores[i], filtered_scores[i+1])

            relations.append(relation)

        return relations


    def generate_colors(filtered_scores):

        colors = []

        for i in range(len(filtered_scores)):
            color = make_relation(filtered_scores[i], filtered_scores[-1])

            colors.append(color)

        return colors


    def concat_scores(scores_article, scores_auth, scores_word, scores_lab):
        """Concatenate the article, author, word and lab scores for each year
           and return the concatenated series in a list.
        """
        concatenated_scores = []

        for score_article, score_auth, score_word, score_lab in zip(
            scores_article, scores_auth, scores_word, scores_lab
        ):
            concatenated_scores.append(
                pd.concat([score_article, score_auth, score_word, score_lab]))

        return concatenated_scores


    filtered_scores = concat_scores(
        scores_article, filtered_auth_scores, words_scores, filtered_lab_scores)


    relations = generate_relations(filtered_scores)

    colors_auth = generate_colors(filtered_auth_scores)
    colors_words = generate_colors(words_scores)
    colors_lab = generate_colors(filtered_lab_scores)
    colors_articles = generate_colors(scores_article)





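To see what a relation looks like, here is a toy sketch with hypothetical
index labels; the dictionary maps row positions in the first series to row
positions in the second, matched on shared labels:

.. code-block:: Python

    # Hypothetical toy data: 'alice' and 'bob' appear in both series
    a = pd.Series(1, index=['alice', 'bob'])
    b = pd.Series(1, index=['bob', 'carol', 'alice'])
    print(make_relation(a, b))  # {0: 2, 1: 0}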




.. GENERATED FROM PYTHON SOURCE LINES 464-465

We need to take the transpose of each LSA matrix so that UMAP can process it: UMAP expects one row per sample.

.. GENERATED FROM PYTHON SOURCE LINES 465-477

.. code-block:: Python


    def get_transpose(lsa_matrices):
        transposed_mats = []

        for lsa in lsa_matrices:
            transposed_mats.append(np.hstack(lsa).T)

        return transposed_mats


    transposed_mats = get_transpose(lsa_matrices)




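Each transposed matrix stacks the articles, authors, words and labs of one
dataset as rows; for the first dataset this gives 765 + 88 + 1015 + 36 = 1904
rows of dimension 12, matching the embedding shapes printed later:

.. code-block:: Python

    print(transposed_mats[0].shape)  # (1904, 12)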




.. GENERATED FROM PYTHON SOURCE LINES 478-479

We create an AlignedUMAP instance to generate 6 maps for the 6 datasets.

.. GENERATED FROM PYTHON SOURCE LINES 479-495

.. code-block:: Python


    from umap.aligned_umap import AlignedUMAP  # noqa

    n_neighbors = [10, 30, 30, 50, 70, 100]
    min_dists = [0.1, 0.3, 0.3, 0.5, 0.7, 0.7]
    spreads = [1, 1, 1, 1, 1, 1]

    n = 6

    reducer = AlignedUMAP(
        n_neighbors=n_neighbors[:n],
        min_dist=min_dists[:n],
        #    n_neighbors=10,
        #    min_dist=0.05,
        init='random',
        n_epochs=200)







.. GENERATED FROM PYTHON SOURCE LINES 496-497

Aligned UMAP requires at least two datasets and the mappings between the entities of consecutive datasets. We can change the number of initial datasets by changing the value of the variable `n`.

.. GENERATED FROM PYTHON SOURCE LINES 497-501

.. code-block:: Python


    reducer.fit_transform(transposed_mats[:n], relations=relations[:n-1])






.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    ListType[array(float32, 2d, C)]([[[-4.6312943   2.1897452 ]
     [-4.7019587   2.4623086 ]
     [ 3.172899   -1.254874  ]
     ...
     [-0.44537532  0.0364257 ]
     [-0.3615355   0.10424785]
     [ 0.9505725  -4.1422253 ]], [[-4.4378047  1.8940356]
     [-4.6877837  2.1463704]
     [ 1.8862357 -5.3941364]
     ...
     [ 3.8317637 -3.2522857]
     [ 1.7630607  2.5508578]
     [ 1.7112528  2.5630774]], [[-4.4434514  1.7681804]
     [-4.7892733  2.054415 ]
     [ 1.5745841 -5.694587 ]
     ...
     [-2.4695678  0.777273 ]
     [ 1.6377203  2.8798878]
     [ 1.9659495  2.442149 ]], [[-4.6376925   1.53457   ]
     [-4.694312    1.5876851 ]
     [ 2.19716    -5.3580184 ]
     ...
     [-3.5690544  -1.2518331 ]
     [ 2.4115925  -5.407163  ]
     [ 0.04822341 -5.1132517 ]], [[-4.5427794   1.2628535 ]
     [-4.617718    1.3386154 ]
     [ 2.430242   -5.260712  ]
     ...
     [ 0.05606826 -5.724693  ]
     [-1.1132172  -3.6751866 ]
     [-1.1428653   0.9819731 ]], [[-4.5396147   1.2146386 ]
     [-4.6235642   1.2929305 ]
     [ 2.5026178  -5.452615  ]
     ...
     [-0.9563716  -3.7903018 ]
     [ 1.0241965   0.85914576]
     [ 0.7459364  -3.1982822 ]]])



.. GENERATED FROM PYTHON SOURCE LINES 502-503

Then we will generate maps for any remaining datasets one by one, feeding the reducer the corresponding matrix and relation dictionary.

.. GENERATED FROM PYTHON SOURCE LINES 503-522

.. code-block:: Python


    def update_reducer(reducer, matrices, relations,
                       n_neighbors, min_dists, spreads):

        for mat, rel, n_neighbor, min_dist, spread in zip(
                matrices, relations, n_neighbors, min_dists, spreads
        ):
            # reducer.update(mat, relations=rel)
            reducer.update(mat, relations=rel, n_neighbors=n_neighbor,
                           min_dist=min_dist, spread=spread, verbose=True)


    update_reducer(reducer, transposed_mats[n:], relations[n-1:],
                   n_neighbors[n:], min_dists[n:], spreads[n:])

    ""
    for embedding in reducer.embeddings_:
        print(embedding.shape)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (1904, 2)
    (3397, 2)
    (4817, 2)
    (5998, 2)
    (6983, 2)
    (7083, 2)




.. GENERATED FROM PYTHON SOURCE LINES 523-527

Pickle reducer
===============

It is possible to serialize and save the reducer object for later use.

.. GENERATED FROM PYTHON SOURCE LINES 527-550

.. code-block:: Python


    # From https://github.com/lmcinnes/umap/issues/672

    from numba.typed import List  # noqa
    import pickle  # noqa

    params = reducer.get_params()
    attributes_names = [attr for attr in reducer.__dir__(
    ) if attr not in params and attr[0] != '_']
    attributes = {key: reducer.__getattribute__(key) for key in attributes_names}
    attributes['embeddings_'] = list(reducer.embeddings_)

    for x in ['fit', 'fit_transform', 'update', 'get_params', 
              'set_params', "get_metadata_routing"]:
        del attributes[x]

    all_params = {
        'umap_params': params,
        'umap_attributes': {key: value for key, value in attributes.items()}
    }

    pickle.dump(all_params, open('pickled_reducer.pkl', 'wb'))








.. GENERATED FROM PYTHON SOURCE LINES 551-555

Reload from pickle
====================

We can reload the reducer:

.. GENERATED FROM PYTHON SOURCE LINES 555-566

.. code-block:: Python


    params_new = pickle.load(open('pickled_reducer.pkl', 'rb'))
    reducer = AlignedUMAP()
    reducer.set_params(**params_new.get('umap_params'))

    for attr, value in params_new.get('umap_attributes').items():
        reducer.__setattr__(attr, value)
    reducer.__setattr__('embeddings_', List(
        params_new.get('umap_attributes').get('embeddings_')))





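As a quick sanity check that the reloaded reducer carries its state, we can
print the embedding shapes again; they should match the shapes obtained
before pickling:

.. code-block:: Python

    for embedding in reducer.embeddings_:
        print(embedding.shape)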




.. GENERATED FROM PYTHON SOURCE LINES 567-571

Plot maps
=============

To plot the maps, we will reorganize the data.

.. GENERATED FROM PYTHON SOURCE LINES 571-608

.. code-block:: Python


    def update_embeddings_with_entities(embeddings, filtered_scores):
        r_embeddings = []

        for embedding, score in zip(embeddings, filtered_scores):
            score_indices = np.matrix(score.index).T
            embd = np.hstack((embedding, score_indices)).T
            r_embeddings.append(np.asarray(embd))

        return r_embeddings


    def decompose_embeddings(embeddings):
        article_embeddings = []
        auth_embeddings = []
        word_embeddings = []
        lab_embeddings = []

        for i, embedding in enumerate(embeddings):
            len_article = len(scores_article[i])
            len_auth = len(filtered_auth_scores[i])
            len_word = len(words_scores[i])

            article_embeddings.append(embeddings[i][:len_article])
            auth_embeddings.append(
                embeddings[i][len_article: len_article + len_auth])
            word_embeddings.append(
                embeddings[i][len_article + len_auth:len_article + len_auth + len_word])
            lab_embeddings.append(
                embeddings[i][len_article + len_auth + len_word:])

        return article_embeddings, auth_embeddings, word_embeddings, lab_embeddings


    article_embeddings, auth_embeddings, word_embeddings, lab_embeddings = decompose_embeddings(
        reducer.embeddings_)








.. GENERATED FROM PYTHON SOURCE LINES 609-610

We will attach entity labels to the embeddings, so that the article title, author name, word or lab name can be displayed when a point is hovered over on the plot.

.. GENERATED FROM PYTHON SOURCE LINES 610-627

.. code-block:: Python


    article_embeddings_names = update_embeddings_with_entities(
        article_embeddings, scores_article
    )

    auth_embeddings_names = update_embeddings_with_entities(
        auth_embeddings, filtered_auth_scores
    )

    word_embeddings_names = update_embeddings_with_entities(
        word_embeddings, words_scores
    )

    lab_embeddings_names = update_embeddings_with_entities(
        lab_embeddings, filtered_lab_scores
    )








.. GENERATED FROM PYTHON SOURCE LINES 628-629

We will create a colormap that maps each entity (article, author, word, lab) to the same color in every plot.

.. GENERATED FROM PYTHON SOURCE LINES 629-643

.. code-block:: Python


    from matplotlib import cm   # noqa


    def create_color_map(max_list):
        cmap = cm.get_cmap('gist_ncar', len(max_list))
        return cmap


    cmap_article = create_color_map(scores_article[-1])
    cmap_auth = create_color_map(filtered_auth_scores[-1])
    cmap_word = create_color_map(words_scores[-1])
    cmap_lab = create_color_map(filtered_lab_scores[-1])





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /builds/2mk6rsew/0/hgozukan/cartolabe-data/examples/workflow_aligned_lisn_lsa_kmeans.py:634: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
      cmap = cm.get_cmap('gist_ncar', len(max_list))




.. GENERATED FROM PYTHON SOURCE LINES 644-645

Now we can plot all the maps.

.. GENERATED FROM PYTHON SOURCE LINES 645-735

.. code-block:: Python


    import matplotlib.pyplot as plt  # noqa
    import numpy as np   # noqa
    import mplcursors    # noqa


    plt.close('all')

    def axis_bounds(embedding):
        left, right = embedding.T[0].min(), embedding.T[0].max()
        bottom, top = embedding.T[1].min(), embedding.T[1].max()
        adj_h, adj_v = (right - left) * 0.1, (top - bottom) * 0.1
    
        return [left - adj_h, right + adj_h, bottom - adj_v, top + adj_v]


    ax_bound = axis_bounds(np.vstack(reducer.embeddings_))
    ax_bound



    def plot(title, embedding_article, embedding_auth, embedding_word, embedding_lab,
             colors_article, colors_auth, colors_word, colors_lab, 
             c_scores=None, c_umap=None, n_clusters=None):

        colors_article_list = list(colors_article.values())
        colors_auth_list = list(colors_auth.values())
        colors_word_list = list(colors_word.values())
        colors_lab_list = list(colors_lab.values())

        fig, ax = plt.subplots(1, 1, figsize=(10, 9))

        ax.set_title(title)
        ax.axis(ax_bound)

        article = ax.scatter(embedding_article[0], embedding_article[1], c=colors_article_list, cmap=cmap_article,
                             vmin=0, vmax=len(scores_article[-1]), marker='x', label="article")

        auth = ax.scatter(embedding_auth[0], embedding_auth[1], c=colors_auth_list, cmap=cmap_auth,
                          vmin=0, vmax=len(filtered_auth_scores[-1]), label="auth")

        word = ax.scatter(embedding_word[0], embedding_word[1], c=colors_word_list, cmap=cmap_word,
                          vmin=0, vmax=len(words_scores[-1]), marker='+', label="word")

        lab = ax.scatter(embedding_lab[0], embedding_lab[1], c=colors_lab_list, cmap=cmap_lab,
                         vmin=0, vmax=len(filtered_lab_scores[-1]), marker='s', label="lab")

        # from https://matplotlib.org/stable/gallery/event_handling/legend_picking.html
        leg = ax.legend((article, auth, word, lab), ("article",
                                                     "auth", "words", "labs"), fancybox=True, shadow=True)

        if n_clusters is not None:
            for i in range(n_clusters):
                ax.annotate(c_scores.index[i], (c_umap[0, i], c_umap[1, i]))
    
    #    crs_article = mplcursors.cursor(article, hover=True)

    #    @crs_article.connect("add")
    #    def on_add(sel):
    #        sel.annotation.set(text=embedding_article[2][sel.target.index])

        crs_auth = mplcursors.cursor(auth, hover=True)

        @crs_auth.connect("add")
        def on_add(sel):
            sel.annotation.set(text=embedding_auth[2][sel.target.index])

    #    crs_word = mplcursors.cursor(word, hover=True)

    #    @crs_word.connect("add")
    #    def on_add(sel):
    #        sel.annotation.set(text=embedding_word[2][sel.target.index])

        crs_lab = mplcursors.cursor(lab, hover=True)

        @crs_lab.connect("add")
        def on_add(sel):
            sel.annotation.set(text=embedding_lab[2][sel.target.index])

        plt.show()


    for i, (embedding_article, embedding_auth, embedding_word, embedding_lab, 
            color_article, color_auth, color_word, color_lab) in enumerate(zip(
        article_embeddings_names, auth_embeddings_names, word_embeddings_names, lab_embeddings_names, 
        colors_articles, colors_auth, colors_words, colors_lab)):
    
        plot(f"Year {2010 + (i * 2) + 2}", embedding_article, embedding_auth, embedding_word, embedding_lab, 
             color_article, color_auth, color_word, color_lab)




.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_001.png
         :alt: Year 2012
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_001.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_002.png
         :alt: Year 2014
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_002.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_003.png
         :alt: Year 2016
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_003.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_004.png
         :alt: Year 2018
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_004.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_005.png
         :alt: Year 2020
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_005.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_006.png
         :alt: Year 2022
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_006.png
         :class: sphx-glr-multi-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /usr/local/lib/python3.9/site-packages/mplcursors/_pick_info.py:55: UserWarning: No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored
      paths = scatter.__wrapped__(*args, **kwargs)




.. GENERATED FROM PYTHON SOURCE LINES 736-742

Clustering
==========

In order to identify clusters, we use the KMeans clustering technique on the
articles. We'll also try to label these clusters by selecting the most
frequent words that appear in each cluster's articles.

.. GENERATED FROM PYTHON SOURCE LINES 742-777

.. code-block:: Python


    from cartodata.clustering import create_kmeans_clusters  # noqa
    n_clusters = 8


    def find_clusters(n_clusters, article_embeddings, word_embeddings, 
                      words_matrices, words_scores, matrices):

        c_scores_all = []
        c_umap_all = []

        for i in range(6):
            cluster_labels = []
            c_lda, c_umap, c_scores, c_knn, _, _, _ = create_kmeans_clusters(n_clusters,  # number of clusters to create
                                                                    # 2D matrix of articles
                                                                    article_embeddings[i].T,
                                                                    # the 2D matrix of words
                                                                    word_embeddings[i].T,
                                                                    # the articles to words matrix
                                                                    words_matrices[i],
                                                                    # word scores
                                                                    words_scores[i],
                                                                    # a list of initial cluster labels
                                                                    cluster_labels,
                                                                    # space matrix of words
                                                                    matrices[i][2])
            c_scores_all.append(c_scores)
            c_umap_all.append(c_umap)
        
        return c_scores_all, c_umap_all


    c_scores_all, c_umap_all = find_clusters(n_clusters, article_embeddings, word_embeddings, 
                                             words_matrices, words_scores, lsa_matrices)




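The label chosen for each cluster is stored in the index of the corresponding
`c_scores` object; as a quick sketch, we can peek at the labels found for the
first dataset:

.. code-block:: Python

    print(c_scores_all[0].index.tolist())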




.. GENERATED FROM PYTHON SOURCE LINES 778-779

Let's plot with the clusters.

.. GENERATED FROM PYTHON SOURCE LINES 779-788

.. code-block:: Python


    for i, (embedding_article, embedding_auth, embedding_word, embedding_lab, 
            color_article, color_auth, color_word, color_lab, c_scores, c_umap) in enumerate(zip(
        article_embeddings_names, auth_embeddings_names, word_embeddings_names, lab_embeddings_names, 
        colors_articles, colors_auth, colors_words, colors_lab, c_scores_all, c_umap_all)):
    
        plot(f"Year {2010+ (i*2) + 2}", embedding_article, embedding_auth, embedding_word, embedding_lab, 
             color_article, color_auth, color_word, color_lab, c_scores, c_umap, n_clusters)




.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_007.png
         :alt: Year 2012
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_007.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_008.png
         :alt: Year 2014
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_008.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_009.png
         :alt: Year 2016
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_009.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_010.png
         :alt: Year 2018
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_010.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_011.png
         :alt: Year 2020
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_011.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_012.png
         :alt: Year 2022
         :srcset: /auto_examples/images/sphx_glr_workflow_aligned_lisn_lsa_kmeans_012.png
         :class: sphx-glr-multi-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /usr/local/lib/python3.9/site-packages/mplcursors/_pick_info.py:55: UserWarning: No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored
      paths = scatter.__wrapped__(*args, **kwargs)




.. GENERATED FROM PYTHON SOURCE LINES 789-794

The 8 clusters that we created give us a general idea of what the big
clusters of data contain. But we'll probably want a finer level of detail if
we start to zoom in and focus on smaller areas. So we'll also create a
second, larger set of clusters. To do this, we simply increase the number of
clusters.

.. GENERATED FROM PYTHON SOURCE LINES 794-801

.. code-block:: Python


    n_clusters = 32

    mc_scores_all, mc_umap_all = find_clusters(
        n_clusters, article_embeddings, word_embeddings, words_matrices, words_scores, lsa_matrices
    )








.. GENERATED FROM PYTHON SOURCE LINES 802-819

Nearest neighbors
------------------------

Another useful way to assess the quality of our data is to look at each
point's nearest neighbors. If our data processing is done correctly, we
expect related articles, labs, words and authors to be located close to each
other.

Finding nearest neighbors is a common task with various algorithms aiming to
solve it. The `get_neighbors` method uses one of these algorithms to find the
nearest points of each type. It takes an optional weight parameter that
tweaks the distance calculation to favor points with a higher score even if
they are a bit farther away, instead of just selecting the closest neighbors.

Because we want to find the neighbors of each type (articles, authors, words,
labs) for all of the entities, we call the `get_neighbors` method in a loop
and store its results in an array.

.. GENERATED FROM PYTHON SOURCE LINES 819-853

.. code-block:: Python


    from cartodata.neighbors import get_neighbors  # noqa


    def find_neighbors(articles_scores, authors_scores, words_scores, 
                       labs_scores, matrices):

        weights = [0, 0.5, 0.5, 0]

        all_neighbors = []
        all_scores = []
    
        for i in range(6):
            scores = [articles_scores[i], authors_scores[i], 
                      words_scores[i], labs_scores[i]]
            neighbors = []

            matrix = matrices[i]
            for idx in range(len(matrix)):
                neighbors.append(get_neighbors(matrix[idx],
                                               scores[idx],
                                               matrix,
                                               weights[idx])
                                     )
            all_neighbors.append(neighbors)
            all_scores.append(scores)
        
        return all_neighbors, all_scores


    all_neighbors, all_scores = find_neighbors(scores_article, filtered_auth_scores, 
                                               words_scores, filtered_lab_scores, lsa_matrices)









.. GENERATED FROM PYTHON SOURCE LINES 854-858

Exporting
-----------------

We now have sufficient data to create a meaningful visualization.

.. GENERATED FROM PYTHON SOURCE LINES 858-899

.. code-block:: Python


    from cartodata.operations import export_to_json  # noqa

    natures = ['articles',
               'authors',
               'words',
               'labs',
               'hl_clusters',
               'ml_clusters']


    def export(from_year, struct, article_embeddings, authors_embeddings, word_embeddings, lab_embeddings, 
               c_umap, mc_umap, c_scores, mc_scores, all_neighbors, all_scores):
    

        for i in range(6):
            export_file = f"../datas/{struct}_workflow_lsa_aligned_{from_year}_{from_year + (2 + 2 * i)}.json"
        
            # add the clusters to list of 2d matrices and scores
            umap_matrices = [article_embeddings[i].T, 
                             authors_embeddings[i].T, 
                             word_embeddings[i].T, 
                             lab_embeddings[i].T, 
                             c_umap[i], 
                             mc_umap[i]]
        
            all_scores[i].extend([c_scores[i], mc_scores[i]])


            # create a json export file with all the infos
            export_to_json(natures,
                           umap_matrices,
                           all_scores[i],
                           export_file,
                           neighbors_natures=natures[:4],
                           neighbors=all_neighbors[i])

    
    export(2010, "lisn", article_embeddings, auth_embeddings, word_embeddings, lab_embeddings, 
               c_umap_all, mc_umap_all, c_scores_all, mc_scores_all, all_neighbors, all_scores)








.. GENERATED FROM PYTHON SOURCE LINES 900-913

This creates the files:


- `lisn_workflow_lsa_aligned_2010_2012.json`
- `lisn_workflow_lsa_aligned_2010_2014.json`
- `lisn_workflow_lsa_aligned_2010_2016.json`
- `lisn_workflow_lsa_aligned_2010_2018.json`
- `lisn_workflow_lsa_aligned_2010_2020.json`
- `lisn_workflow_lsa_aligned_2010_2022.json`


each of which contains a list of points ready to be imported into Cartolabe.
Have a look at one of them to check that it contains everything.

.. GENERATED FROM PYTHON SOURCE LINES 913-922

.. code-block:: Python


    import json  # noqa

    export_file = "../datas/lisn_workflow_lsa_aligned_2010_2022.json"

    with open(export_file, 'r') as f:
        data = json.load(f)

    data[1]




.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    {'position': [-4.62356424331665, 1.2929304838180542], 'score': 1.0, 'rank': 1, 'nature': 'articles', 'label': 'Bandit-Based Genetic Programming with Application to Reinforcement Learning', 'neighbors': {'articles': [1, 216, 48, 443, 952, 1596, 2623, 78, 323, 1552], 'authors': [3044, 3034, 3026, 3046, 3152, 2987, 3086, 3088, 3043, 3028], 'words': [3767, 5394, 5395, 5396, 3768, 6285, 6340, 6339, 6303, 6230], 'labs': [7032, 6944, 7011, 6989, 7018, 7017, 6985, 7016, 7030, 7076]}}




.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (15 minutes 59.159 seconds)


.. _sphx_glr_download_auto_examples_workflow_aligned_lisn_lsa_kmeans.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: workflow_aligned_lisn_lsa_kmeans.ipynb <workflow_aligned_lisn_lsa_kmeans.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: workflow_aligned_lisn_lsa_kmeans.py <workflow_aligned_lisn_lsa_kmeans.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: workflow_aligned_lisn_lsa_kmeans.zip <workflow_aligned_lisn_lsa_kmeans.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_