.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/workflow_lisn_lda_kmeans.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_workflow_lisn_lda_kmeans.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_workflow_lisn_lda_kmeans.py:


Extracting and processing LISN data for Cartolabe (LDA projection)
=======================================================================

In this example we will:

- extract entities (authors, articles, labs, words) from a collection of
  scientific articles
- project those entities in 2 dimensions
- cluster them
- find their nearest neighbors.

.. GENERATED FROM PYTHON SOURCE LINES 16-20

Download data
=============

We will first download the CSV file that contains all articles from HAL (https://hal.archives-ouvertes.fr/) published by authors from LISN (Laboratoire Interdisciplinaire des Sciences du Numérique) between 2000 and 2022.

.. GENERATED FROM PYTHON SOURCE LINES 20-28

.. code-block:: Python


    from download import download

    csv_url = "https://zenodo.org/record/7323538/files/lisn_2000_2022.csv"

    download(csv_url, "../datas/lisn_2000_2022.csv", kind='file',
             progressbar=True, replace=False)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading data from https://zenodo.org/records/7323538/files/lisn_2000_2022.csv (6.3 MB)


    file_sizes:   0%|                                   | 0.00/6.59M [00:00<?, ?B/s]
    file_sizes:  39%|██████████▏               | 2.59M/6.59M [00:00<00:00, 25.3MB/s]
    file_sizes: 100%|██████████████████████████| 6.59M/6.59M [00:00<00:00, 44.2MB/s]
    Successfully downloaded file to ../datas/lisn_2000_2022.csv

    '../datas/lisn_2000_2022.csv'



.. GENERATED FROM PYTHON SOURCE LINES 29-31

Load data into a dataframe
==========================

.. GENERATED FROM PYTHON SOURCE LINES 31-38

.. code-block:: Python


    import pandas as pd  # noqa

    df = pd.read_csv('../datas/lisn_2000_2022.csv', index_col=0)

    df.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>structId_i</th>
          <th>authFullName_s</th>
          <th>en_abstract_s</th>
          <th>en_keyword_s</th>
          <th>en_title_s</th>
          <th>structAcronym_s</th>
          <th>producedDateY_i</th>
          <th>producedDateM_i</th>
          <th>halId_s</th>
          <th>docid</th>
          <th>en_domainAllCodeLabel_fs</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>[2544, 92966, 411575, 441569]</td>
          <td>Frédéric Blanqui</td>
          <td>In the last twenty years, several approaches t...</td>
          <td>Higher-order rewriting,Termination,Confluence</td>
          <td>Termination and Confluence of Higher-Order Rew...</td>
          <td>LRI,UP11,CNRS,LISN</td>
          <td>2000</td>
          <td>7.0</td>
          <td>inria-00105556</td>
          <td>105556</td>
          <td>Logic in Computer Science,Computer Science</td>
        </tr>
        <tr>
          <th>1</th>
          <td>[2544, 92966, 411575, 441569]</td>
          <td>Sébastien Tixeuil</td>
          <td>When a distributed system is subject to transi...</td>
          <td>Self-stabilization,Distributed Systems,Distrib...</td>
          <td>Efficient Self-stabilization</td>
          <td>LRI,UP11,CNRS,LISN</td>
          <td>2000</td>
          <td>1.0</td>
          <td>tel-00124843</td>
          <td>124843</td>
          <td>Networking and Internet Architecture,Computer ...</td>
        </tr>
        <tr>
          <th>2</th>
          <td>[1167, 300340, 301492, 564132, 441569, 2544, 9...</td>
          <td>Michèle Sebag,Céline Rouveirol</td>
          <td>One of the obstacles to widely using first-ord...</td>
          <td>Bounded reasoning,First order logic,Inductive ...</td>
          <td>Resource-bounded relational reasoning: inducti...</td>
          <td>LMS,X,PSL,CNRS,LRI,UP11,CNRS,LISN</td>
          <td>2000</td>
          <td>NaN</td>
          <td>hal-00111312</td>
          <td>2263842</td>
          <td>Mechanics,Engineering Sciences,physics</td>
        </tr>
        <tr>
          <th>3</th>
          <td>[994, 15786, 301340, 303171, 441569, 34499, 81...</td>
          <td>Philippe Balbiani,Jean-François Condotta,Gérar...</td>
          <td>This paper organizes the topologic forms of th...</td>
          <td>Temporal reasoning,Constraint handling,Computa...</td>
          <td>Reasoning about generalized intervals : Horn r...</td>
          <td>LIPN,UP13,USPC,CNRS,IRIT,UT1,UT2J,UT3,CNRS,Tou...</td>
          <td>2000</td>
          <td>NaN</td>
          <td>hal-03300321</td>
          <td>3300321</td>
          <td>Artificial Intelligence,Computer Science</td>
        </tr>
        <tr>
          <th>4</th>
          <td>[1315, 25027, 59704, 564132, 300009, 441569, 4...</td>
          <td>Roberto Di Cosmo,Delia Kesner,Emmanuel Polonovski</td>
          <td>We refine the simulation technique introduced ...</td>
          <td>Linear logic,Proof nets,Lambda-calculus,Explic...</td>
          <td>Proof Nets and Explicit Substitutions</td>
          <td>LIENS,DI-ENS,ENS-PSL,PSL,Inria,CNRS,CNRS,LRI,U...</td>
          <td>2000</td>
          <td>NaN</td>
          <td>hal-00384955</td>
          <td>384955</td>
          <td>Logic in Computer Science,Computer Science</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 39-40

The dataframe we just read contains 4262 articles, one per row.

.. GENERATED FROM PYTHON SOURCE LINES 40-43

.. code-block:: Python


    print(df.shape[0])





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    4262




.. GENERATED FROM PYTHON SOURCE LINES 44-45

Each article's authors, abstract, keywords, title, research labs and domain appear as columns.

.. GENERATED FROM PYTHON SOURCE LINES 45-48

.. code-block:: Python


    print(*df.columns, sep="\n")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    structId_i
    authFullName_s
    en_abstract_s
    en_keyword_s
    en_title_s
    structAcronym_s
    producedDateY_i
    producedDateM_i
    halId_s
    docid
    en_domainAllCodeLabel_fs




.. GENERATED FROM PYTHON SOURCE LINES 49-62

Creating correspondence matrices for each entity type
==============================================================

From this table of articles, we want to extract matrices that map the
correspondence between the articles and the entities we are interested in.

Authors
-------------

Let's start with the authors. We want a matrix whose rows represent the
articles and whose columns represent the authors: cell (n, m) contains 1 if
the *nth* article was written by the *mth* author.
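
To make this concrete, here is a minimal sketch of how such a matrix could
be built by hand with pandas and scipy. The `_sketch` variables are purely
illustrative; the `load_comma_separated_column` helper used below may differ
in details (for instance, column ordering).

.. code-block:: Python

    from scipy import sparse  # noqa

    # One-hot encode the comma-separated author names: one column per
    # distinct author (alphabetical here), one row per article.
    dummies = df['authFullName_s'].str.get_dummies(sep=',')
    authors_mat_sketch = sparse.csr_matrix(dummies.values)

    # The score of an author is the number of articles in their column.
    authors_scores_sketch = dummies.sum(axis=0)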

.. GENERATED FROM PYTHON SOURCE LINES 62-67

.. code-block:: Python


    from cartodata.loading import load_comma_separated_column  # noqa

    authors_mat, authors_scores = load_comma_separated_column(df, 'authFullName_s')








.. GENERATED FROM PYTHON SOURCE LINES 68-77

The `load_comma_separated_column` function takes in a dataframe and the name
of a column and returns two objects:

- a sparse matrix
- a pandas `Series`

Each column of the sparse matrix `authors_mat` corresponds to an author and
each row corresponds to an article. We see that there are 7348 distinct
authors for 4262 articles.

.. GENERATED FROM PYTHON SOURCE LINES 77-80

.. code-block:: Python


    authors_mat.shape





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (4262, 7348)



.. GENERATED FROM PYTHON SOURCE LINES 81-85

The series, which we named `authors_scores`, contains the list of authors
extracted from the column `authFullName_s`, each with a score equal to the
number of rows (articles) that author was mapped to in the `authors_mat`
matrix.

.. GENERATED FROM PYTHON SOURCE LINES 85-88

.. code-block:: Python


    authors_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Frédéric Blanqui       4
    Sébastien Tixeuil     47
    Michèle Sebag        137
    Céline Rouveirol       2
    Philippe Balbiani      2
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 89-92

If we look at the *2nd* column of the matrix, which corresponds to the author
**Sébastien Tixeuil**, we can see that it has 47 non-zero rows, each one
indicating an article he authored.

.. GENERATED FROM PYTHON SOURCE LINES 92-95

.. code-block:: Python


    print(authors_mat[:, 1])





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

      (1, 0)        1
      (7, 0)        1
      (22, 0)       1
      (60, 0)       1
      (128, 0)      1
      (136, 0)      1
      (150, 0)      1
      (179, 0)      1
      (205, 0)      1
      (212, 0)      1
      (233, 0)      1
      (238, 0)      1
      (241, 0)      1
      (246, 0)      1
      (262, 0)      1
      (282, 0)      1
      (294, 0)      1
      (356, 0)      1
      (358, 0)      1
      (359, 0)      1
      (363, 0)      1
      (371, 0)      1
      (372, 0)      1
      (409, 0)      1
      (498, 0)      1
      (501, 0)      1
      (536, 0)      1
      (541, 0)      1
      (542, 0)      1
      (878, 0)      1
      (893, 0)      1
      (1600, 0)     1
      (1717, 0)     1
      (2037, 0)     1
      (2075, 0)     1
      (2116, 0)     1
      (2222, 0)     1
      (2373, 0)     1
      (2449, 0)     1
      (2450, 0)     1
      (2611, 0)     1
      (2732, 0)     1
      (2976, 0)     1
      (2986, 0)     1
      (3107, 0)     1
      (3221, 0)     1
      (3791, 0)     1




.. GENERATED FROM PYTHON SOURCE LINES 96-101

Labs
--------

Similarly, we can create matrices for the labs by simply passing the
`structAcronym_s` column to the function.

.. GENERATED FROM PYTHON SOURCE LINES 101-107

.. code-block:: Python


    labs_mat, labs_scores = load_comma_separated_column(df,
                                                        'structAcronym_s',
                                                        filter_acronyms=True)
    labs_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    LRI      4789
    UP11     6271
    CNRS    10217
    LISN     5203
    LMS         1
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 108-110

Checking the number of columns of the sparse matrix `labs_mat`, we see that
there are 1818 distinct labs.

.. GENERATED FROM PYTHON SOURCE LINES 110-113

.. code-block:: Python


    labs_mat.shape[1]





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    1818



.. GENERATED FROM PYTHON SOURCE LINES 114-123

Filtering low-score entities
----------------------------------------

A lot of the authors and labs we just extracted from the dataframe have a
very low score, meaning they are linked to only one or two articles. To
improve the quality of our data, we'll filter the authors and labs, removing
those that appear fewer than 4 times.

To do this, we'll use the `filter_min_score` function.
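
In essence, this filtering boils down to masking the low-score columns.
Here is a rough sketch of the idea (an illustration only; the actual helper
may be implemented differently):

.. code-block:: Python

    import numpy as np  # noqa

    def filter_min_score_sketch(mat, scores, min_score):
        # Keep the columns (entities) whose score reaches min_score,
        # together with the matching entries of the scores series.
        keep = np.flatnonzero((scores >= min_score).to_numpy())
        return mat[:, keep], scores.iloc[keep]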

.. GENERATED FROM PYTHON SOURCE LINES 123-144

.. code-block:: Python


    from cartodata.operations import filter_min_score  # noqa

    authors_before = len(authors_scores)
    labs_before = len(labs_scores)

    authors_mat, authors_scores = filter_min_score(authors_mat, 
                                                   authors_scores, 
                                                   4)
    labs_mat, labs_scores = filter_min_score(labs_mat, 
                                             labs_scores, 
                                             4)

    print(f"Removed {authors_before - len(authors_scores)} authors with less "
          f"than 4 articles from a total of {authors_before} authors.")
    print(f"Working with {len(authors_scores)} authors.\n")

    print(f"Removed {labs_before - len(labs_scores)} labs with less than "
          f"4 articles from a total of {labs_before}.")
    print(f"Working with {len(labs_scores)} labs.")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Removed 6654 authors with less than 4 articles from a total of 7348 authors.
    Working with 694 authors.

    Removed 1255 labs with less than 4 articles from a total of 1818.
    Working with 563 labs.




.. GENERATED FROM PYTHON SOURCE LINES 145-153

Words
----------

For the words, it's a bit trickier because we want to extract n-grams (groups
of n terms) instead of just comma-separated values. We'll call
`load_text_column`, which uses scikit-learn's
`CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_
to create a vocabulary and map the tokens.
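
Roughly speaking, this amounts to a `CountVectorizer` call like the sketch
below. Note that the `ngram_range`, `min_df` and `max_df` values shown here,
and their mapping to `load_text_column`'s positional arguments, are
assumptions for illustration, not the helper's actual signature.

.. code-block:: Python

    from sklearn.feature_extraction import text as sktxt  # noqa
    from sklearn.feature_extraction.text import CountVectorizer  # noqa

    # Hedged approximation: parameter values here are illustrative
    # assumptions, not load_text_column's actual parameters.
    corpus = df['en_abstract_s'].fillna('') + ' ' + df['en_title_s'].fillna('')
    vectorizer = CountVectorizer(ngram_range=(1, 4), min_df=10, max_df=0.05,
                                 stop_words=list(sktxt.ENGLISH_STOP_WORDS))
    counts = vectorizer.fit_transform(corpus)       # articles x n-grams
    vocabulary = vectorizer.get_feature_names_out()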

.. GENERATED FROM PYTHON SOURCE LINES 153-172

.. code-block:: Python


    from cartodata.loading import load_text_column  # noqa
    from sklearn.feature_extraction import text as sktxt  # noqa

    with open('../datas/stopwords.txt', 'r') as stop_file:
        stopwords = sktxt.ENGLISH_STOP_WORDS.union(
            set(stop_file.read().splitlines()))

    df['text'] = df['en_abstract_s'] + ' ' \
        + df['en_keyword_s'].astype(str) + ' ' \
        + df['en_title_s'].astype(str) + ' ' \
        + df['en_domainAllCodeLabel_fs'].astype(str)

    words_mat, words_scores = load_text_column(df['text'],
                                               4,
                                               10,
                                               0.05,
                                               stopwords=stopwords)








.. GENERATED FROM PYTHON SOURCE LINES 173-175

Here `words_scores` contains the list of all the n-grams extracted from the
documents, each with its score,

.. GENERATED FROM PYTHON SOURCE LINES 175-178

.. code-block:: Python


    words_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    abilities     21
    ability      164
    absence       53
    absolute      19
    abstract     174
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 179-181

and the `words_mat` matrix counts the occurrences of each of the 4682 n-grams
for all the articles.

.. GENERATED FROM PYTHON SOURCE LINES 181-184

.. code-block:: Python


    words_mat.shape





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (4262, 4682)



.. GENERATED FROM PYTHON SOURCE LINES 185-192

To get a better representation of the importance of each term, we'll also
apply a TF-IDF (term-frequency times inverse document-frequency)
normalization on the matrix.

The `normalize_tfidf` function simply calls scikit-learn's
`TfidfTransformer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer>`_
class.
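
Assuming default transformer settings, the call is roughly equivalent to:

.. code-block:: Python

    from sklearn.feature_extraction.text import TfidfTransformer  # noqa

    # Reweight raw counts by term frequency * inverse document frequency
    # (assumes normalize_tfidf uses the transformer's default settings).
    words_mat_tfidf = TfidfTransformer().fit_transform(words_mat)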

.. GENERATED FROM PYTHON SOURCE LINES 192-197

.. code-block:: Python


    from cartodata.operations import normalize_tfidf  # noqa

    words_mat = normalize_tfidf(words_mat)








.. GENERATED FROM PYTHON SOURCE LINES 198-202

Articles
-------------

Finally, we need to create a matrix that simply maps each article to itself.
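
Conceptually, this is just an identity matrix with one row and column per
article, each article scoring 1. A sketch (not necessarily how
`load_identity_column` is implemented):

.. code-block:: Python

    from scipy import sparse  # noqa
    import pandas as pd  # noqa

    # One row/column per article; the "score" of an article is simply 1.
    articles_mat_sketch = sparse.identity(len(df), format='csr')
    articles_scores_sketch = pd.Series(1.0, index=df['en_title_s'])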

.. GENERATED FROM PYTHON SOURCE LINES 202-208

.. code-block:: Python


    from cartodata.loading import load_identity_column  # noqa

    articles_mat, articles_scores = load_identity_column(df, 'en_title_s')
    articles_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Termination and Confluence of Higher-Order Rewrite Systems                                    1.0
    Efficient Self-stabilization                                                                  1.0
    Resource-bounded relational reasoning: induction and deduction through stochastic matching    1.0
    Reasoning about generalized intervals : Horn representability and tractability                1.0
    Proof Nets and Explicit Substitutions                                                         1.0
    dtype: float64



.. GENERATED FROM PYTHON SOURCE LINES 209-231

Dimension reduction
===================

One way to see the matrices that we created is as coordinates in the space of
all articles. What we want to do is reduce the dimensionality of this space
to make it easier to work with and to visualize.

LDA projection
----------------------

We use the LDA (Latent Dirichlet Allocation) technique to identify latent
topics in our data and thus reduce the number of rows in our matrices. The
`lda_projection` method takes three arguments:

- the number of dimensions (topics) you want to keep
- the index of the documents/words matrix within the list passed as the
  third argument
- the list of matrices to project

It returns a list of the same length containing the matrices projected into
the latent space.

We also apply an l2 normalization to each feature of the projected matrices.
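
One plausible way to implement such a projection, using scikit-learn's
`LatentDirichletAllocation`, is sketched below; how `lda_projection` works
internally may differ.

.. code-block:: Python

    from sklearn.decomposition import LatentDirichletAllocation  # noqa
    from sklearn.preprocessing import normalize  # noqa

    # Fit a 50-topic model on the documents/words matrix ...
    lda = LatentDirichletAllocation(n_components=50, random_state=0)
    doc_topics = lda.fit_transform(words_mat)           # (4262, 50)

    # ... then push any articles-to-entity matrix into the topic space:
    # (n_entities, 4262) @ (4262, 50), transposed to (50, n_entities).
    authors_lda_sketch = (authors_mat.T @ doc_topics).T

    # l2-normalize each entity (column) of the projected matrix.
    authors_lda_sketch = normalize(authors_lda_sketch, norm='l2', axis=0)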

.. GENERATED FROM PYTHON SOURCE LINES 231-240

.. code-block:: Python


    from cartodata.projection import lda_projection  # noqa
    from cartodata.operations import normalize_l2  # noqa

    lda_matrices = lda_projection(50,
                                  2,
                                  [articles_mat, authors_mat, words_mat, labs_mat])
    lda_matrices = list(map(normalize_l2, lda_matrices))








.. GENERATED FROM PYTHON SOURCE LINES 241-243

We've reduced the number of rows in each of `articles_mat`, `authors_mat`,
`words_mat` and `labs_mat` to just 50.

.. GENERATED FROM PYTHON SOURCE LINES 243-249

.. code-block:: Python


    print(f"articles_mat: {lda_matrices[0].shape}")
    print(f"authors_mat: {lda_matrices[1].shape}")
    print(f"words_mat: {lda_matrices[2].shape}")
    print(f"labs_mat: {lda_matrices[3].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    articles_mat: (50, 4262)
    authors_mat: (50, 694)
    words_mat: (50, 4682)
    labs_mat: (50, 563)




.. GENERATED FROM PYTHON SOURCE LINES 250-261

This makes them easier to work with for clustering or nearest-neighbors
tasks, but we also want to project them onto a 2D plane to be able to map
them.

UMAP projection
---------------------------

`UMAP <https://github.com/lmcinnes/umap>`_ (Uniform Manifold Approximation
and Projection) is a dimension reduction technique that, like t-SNE, can be
used for visualisation.

We use this algorithm to project our matrices into 2 dimensions.
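
A sketch of the idea, assuming the projected matrices are dense numpy arrays
and that all entity types share a single embedding (which is what lets us
plot them together below); `umap_projection` itself may differ:

.. code-block:: Python

    import numpy as np  # noqa
    import umap  # noqa

    # Stack every entity of every type as one point cloud in the 50-d
    # latent space, embed it in 2D, then split back per entity type.
    sizes = [m.shape[1] for m in lda_matrices]
    points = np.hstack(lda_matrices).T                   # (n_points, 50)
    coords = umap.UMAP(n_components=2).fit_transform(points)
    umap_sketch = [c.T for c in np.split(coords, np.cumsum(sizes)[:-1])]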

.. GENERATED FROM PYTHON SOURCE LINES 261-266

.. code-block:: Python


    from cartodata.projection import umap_projection  # noqa

    umap_matrices = umap_projection(lda_matrices)








.. GENERATED FROM PYTHON SOURCE LINES 267-269

Now that we have 2D coordinates for our points, we can try to plot them to
get a feel of the data's shape.

.. GENERATED FROM PYTHON SOURCE LINES 269-301

.. code-block:: Python


    import matplotlib.pyplot as plt  # noqa
    import numpy as np  # noqa
    import seaborn as sns  # noqa
    # %matplotlib inline

    sns.set(style='white', rc={'figure.figsize': (12, 8)})

    labels = ('article', 'auth', 'words', 'labs')
    colors = ['g', 'r', 'b', 'y']
    markers = ['o', 's', '+', 'x']


    def plot(matrices):
        plt.close('all')
        fig, ax = plt.subplots()

        axes = []

        # Scatter each entity type with its own color and marker.
        for i, m in enumerate(matrices):
            axes.append(ax.scatter(m[0, :], m[1, :],
                                   color=colors[i], marker=markers[i],
                                   label=labels[i]))

        ax.legend(axes, labels, fancybox=True, shadow=True)

        return fig, ax


    fig, ax = plot(umap_matrices)




.. image-sg:: /auto_examples/images/sphx_glr_workflow_lisn_lda_kmeans_001.png
   :alt: workflow lisn lda kmeans
   :srcset: /auto_examples/images/sphx_glr_workflow_lisn_lda_kmeans_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 302-313

On the plot above, articles are shown in green, authors in red, words in blue
and labs in yellow. Because the points have no labels yet, the plot is hard
to interpret as is. But we can see that the data forms some clusters which we
could try to identify.

Clustering
==========

In order to identify clusters, we use the KMeans clustering technique on the
articles. We'll also try to label these clusters by selecting the most
frequent words that appear in each cluster's articles.
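
The core of the idea is sketched below: cluster the articles' 2D coordinates
with KMeans, then name each cluster after the heaviest words of its member
articles. This is an illustration only; `create_kmeans_clusters`, used next,
does more than this sketch.

.. code-block:: Python

    from sklearn.cluster import KMeans  # noqa
    import numpy as np  # noqa

    # Cluster the articles' 2D coordinates into 8 groups.
    points = umap_matrices[0].T                    # (4262, 2)
    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(points)

    # Label each cluster with the two heaviest words of its articles.
    for k in range(8):
        members = np.flatnonzero(km.labels_ == k)
        totals = np.asarray(words_mat[members].sum(axis=0)).ravel()
        top = words_scores.index[np.argsort(totals)[::-1][:2]]
        print(f"cluster {k}: {', '.join(top)}")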

.. GENERATED FROM PYTHON SOURCE LINES 313-339

.. code-block:: Python


    from cartodata.clustering import create_kmeans_clusters  # noqa

    cluster_labels = []
    c_lda, c_umap, c_scores, c_knn, _, _, _ = create_kmeans_clusters(
        8,                  # number of clusters to create
        umap_matrices[0],   # the 2D matrix of articles
        umap_matrices[2],   # the 2D matrix of words
        words_mat,          # the articles-to-words matrix
        words_scores,       # the word scores
        cluster_labels,     # a list of initial cluster labels
        lda_matrices[2],    # the LDA-space matrix of words
    )
    c_scores

    fig, ax = plot(umap_matrices)

    for i in range(8):
        ax.annotate(c_scores.index[i], (c_umap[0, i], c_umap[1, i]),
                    color='red')




.. image-sg:: /auto_examples/images/sphx_glr_workflow_lisn_lda_kmeans_002.png
   :alt: workflow lisn lda kmeans
   :srcset: /auto_examples/images/sphx_glr_workflow_lisn_lda_kmeans_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 340-345

The 8 clusters that we created give us a general idea of what the big
clusters of data contain. But we'll probably want a finer level of detail
when we zoom in and focus on smaller areas. So we'll also create a second,
larger set of clusters, simply by increasing the number of clusters we
request.

.. GENERATED FROM PYTHON SOURCE LINES 345-355

.. code-block:: Python


    mc_lda, mc_umap, mc_scores, mc_knn, _, _, _ = create_kmeans_clusters(
        32,
        umap_matrices[0],
        umap_matrices[2],
        words_mat,
        words_scores,
        cluster_labels,
        lda_matrices[2],
    )
    mc_scores





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    fault tolerant, fault tolerance                             72
    discrete event systems, touch                              262
    approximate bayesian, adjacency                             88
    evolutionary robotics, deductive program verification      157
    natural language processing, reinforcement learning        153
    means, architectures                                       101
    verification, floating point                               152
    belief propagation, number searchers                        76
    black optimization, distributed algorithm                   86
    documents, ontologies                                      251
    internet architecture, wireless networks                   177
    genomes, materialized views                                161
    adaptation, displays                                       202
    cognitive, visualization techniques                        130
    compiler, automata                                         226
    tangible, modulo                                           174
    mobile robots, lower bounds                                100
    computational complexity, population protocols              82
    numerical simulations, fluid mechanics                     113
    large scale, sequences                                     106
    neural evolutionary computing, large hadron collider        80
    analytics, social networks                                 196
    social sciences, internet                                  180
    cloud radio access networks, cloud radio access network     24
    metabolic, ontology alignment                               92
    regulatory network, gesture                                 49
    secondary structures, cloud computing                      196
    challenge, matter                                          182
    monte carlo search, black                                  110
    interfaces, supported                                      131
    molecular biology, protein protein                          63
    query, semantics                                            90
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 356-373

Nearest neighbors
----------------------------

One more thing that helps us assess the quality of our data is each point's
nearest neighbors. If our data processing was done correctly, we expect
related articles, labs, words and authors to be located close to each other.

Finding nearest neighbors is a common task with various algorithms aiming to
solve it. The `get_neighbors` method uses one of these algorithms to find the
nearest points of each type. It takes an optional weight parameter that
tweaks the distance calculation so that points with a higher score may be
selected even if they are a bit farther away, instead of just picking the
closest neighbors.
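
As an illustration, here is a plain, unweighted nearest-neighbor query in
the LDA space using scikit-learn; the score weighting of `get_neighbors` is
not reproduced in this sketch.

.. code-block:: Python

    from sklearn.neighbors import NearestNeighbors  # noqa

    # Treat columns as points: authors and words both live in the same
    # 50-dimensional latent space, so we can query across types.
    author_points = lda_matrices[1].T             # (694, 50)
    word_points = lda_matrices[2].T               # (4682, 50)

    nn = NearestNeighbors(n_neighbors=5).fit(word_points)
    _, idx = nn.kneighbors(author_points)         # idx[i]: words near author i
    print(list(words_scores.index[idx[0]]))       # words closest to author 0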

Because we want to find the neighbors of each type (articles, authors, words,
labs) for all of the entities, we call the `get_neighbors` method in a loop
and store its results in an array.

.. GENERATED FROM PYTHON SOURCE LINES 373-386

.. code-block:: Python


    from cartodata.neighbors import get_neighbors  # noqa

    scores = [articles_scores, authors_scores, words_scores, labs_scores]
    weights = [0, 0.5, 0.5, 0]
    all_neighbors = []

    for idx in range(len(lda_matrices)):
        all_neighbors.append(get_neighbors(lda_matrices[idx],
                                           scores[idx],
                                           lda_matrices,
                                           weights[idx]))








.. GENERATED FROM PYTHON SOURCE LINES 387-391

Exporting
-----------------

We now have enough data to create a meaningful visualization.

.. GENERATED FROM PYTHON SOURCE LINES 391-416

.. code-block:: Python


    from cartodata.operations import export_to_json  # noqa

    natures = ['articles',
               'authors',
               'words',
               'labs',
               'hl_clusters',
               'ml_clusters'
               ]
    export_file = '../datas/lisn_workflow_lda.json'

    # add the clusters to list of 2d matrices and scores
    matrices = list(umap_matrices)
    matrices.extend([c_umap, mc_umap])
    scores.extend([c_scores, mc_scores])

    # Create a json export file with all the infos
    export_to_json(natures,
                   matrices,
                   scores,
                   export_file,
                   neighbors_natures=natures[:4],
                   neighbors=all_neighbors)








.. GENERATED FROM PYTHON SOURCE LINES 417-420

This creates the `lisn_workflow_lda.json` file, which contains a list of
points ready to be imported into Cartolabe. Have a look at it to check that
it contains everything.

.. GENERATED FROM PYTHON SOURCE LINES 420-427

.. code-block:: Python


    import json  # noqa

    with open(export_file, 'r') as f:
        data = json.load(f)

    data[1]['position']




.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    [2.274456024169922, 9.282912254333496]




.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (2 minutes 0.487 seconds)


.. _sphx_glr_download_auto_examples_workflow_lisn_lda_kmeans.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: workflow_lisn_lda_kmeans.ipynb <workflow_lisn_lda_kmeans.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: workflow_lisn_lda_kmeans.py <workflow_lisn_lda_kmeans.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: workflow_lisn_lda_kmeans.zip <workflow_lisn_lda_kmeans.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_