.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/workflow_wikipedia_lsa_kmeans.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_workflow_wikipedia_lsa_kmeans.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_workflow_wikipedia_lsa_kmeans.py:


Creating a corpus from Wikipedia
================================

In this example, we'll describe how to run a text processing workflow on a dump
of Wikipedia. We'll be working with a dump of the `Simple English Wikipedia
<https://simple.wikipedia.org/wiki/Main_Page>`_ because it is both much smaller
than the full English Wikipedia (approximately 140,000 articles) and written
in a simpler syntax.

.. GENERATED FROM PYTHON SOURCE LINES 14-24

Prepare Simple English Wikipedia Data
======================================

Download data
------------------------

To begin, you must download the latest dump of the Simple English Wikipedia
from `here
<https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2>`_
(~287 MB at the time of writing, as shown in the output below).

.. GENERATED FROM PYTHON SOURCE LINES 24-30

.. code-block:: Python


    from download import download  # noqa

    download("https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2",
             "../datas/simplewiki-latest-pages-articles.xml.bz2")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading data from https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2 (286.7 MB)


    file_sizes: 100%|████████████████████████████| 301M/301M [00:03<00:00, 83.8MB/s]
    Successfully downloaded file to ../datas/simplewiki-latest-pages-articles.xml.bz2

    '../datas/simplewiki-latest-pages-articles.xml.bz2'



.. GENERATED FROM PYTHON SOURCE LINES 31-41

Extract the text from the Wikipedia dump
------------------------------------------------------------------

The file we just downloaded is a single dump of all the articles in the
Simple English Wikipedia, with templates and markup tags. We want to extract
plain text from it, discarding any other information or annotation present in
Wikipedia pages, such as images, tables, references and lists.

To do this, we'll use the
`WikiExtractor <https://github.com/attardi/wikiextractor>`_ Python package.

.. GENERATED FROM PYTHON SOURCE LINES 41-56

.. code-block:: Python


    import wikiextractor.WikiExtractor as w   # noqa

    w.expand_templates = 0          # do not expand wiki templates
    w.Extractor.keepLinks = False   # drop internal wiki links from the text
    w.Extractor.to_json = True      # write one JSON object per article

    w.process_dump(input_file="../datas/simplewiki-latest-pages-articles.xml.bz2",
                   template_file=None,
                   out_file="../datas/simple_wikipedia",
                   file_size=(500 * 1024 * 1024),  # max size of each output file
                   file_compress=False,
                   process_count=2,                # number of extraction processes
                   html_safe=False)








.. GENERATED FROM PYTHON SOURCE LINES 57-61

This creates the `datas/simple_wikipedia/AA/wiki_00
<../datas/simple_wikipedia/AA/wiki_00>`_ file, which contains one JSON object
per line.

We can now read that file and start working on it.

.. GENERATED FROM PYTHON SOURCE LINES 61-74

.. code-block:: Python


    import pandas as pd   # noqa
    import json   # noqa

    # the extracted file contains one JSON object per line, one per article
    with open('../datas/simple_wikipedia/AA/wiki_00', 'r') as f:
        docs = f.readlines()

    data = [json.loads(line) for line in docs]
    df = pd.DataFrame(data)
    # replace newlines inside the article text with spaces
    df['text'] = df['text'].str.replace('\\n', ' ', regex=True)

    df.head()






.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>id</th>
          <th>revid</th>
          <th>url</th>
          <th>title</th>
          <th>text</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>1</td>
          <td>22027</td>
          <td>https://simple.wikipedia.org/wiki?curid=1</td>
          <td>April</td>
          <td>April (Apr.) is the fourth month of the year i...</td>
        </tr>
        <tr>
          <th>1</th>
          <td>2</td>
          <td>1477024</td>
          <td>https://simple.wikipedia.org/wiki?curid=2</td>
          <td>August</td>
          <td>August (Aug.) is the eighth month of the year ...</td>
        </tr>
        <tr>
          <th>2</th>
          <td>6</td>
          <td>863768</td>
          <td>https://simple.wikipedia.org/wiki?curid=6</td>
          <td>Art</td>
          <td>Art is a creative activity. It produces a prod...</td>
        </tr>
        <tr>
          <th>3</th>
          <td>8</td>
          <td>9712910</td>
          <td>https://simple.wikipedia.org/wiki?curid=8</td>
          <td>A</td>
          <td>A is the first letter of the English alphabet....</td>
        </tr>
        <tr>
          <th>4</th>
          <td>9</td>
          <td>1288841</td>
          <td>https://simple.wikipedia.org/wiki?curid=9</td>
          <td>Air</td>
          <td>Air is the Earth's atmosphere. Air is a mixtur...</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />
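
As an aside, pandas can read JSON-lines files directly; assuming the same
extracted file, the loading step above could also be written as the following
sketch:

.. code-block:: Python

    # equivalent loading using pandas' JSON-lines reader
    df_alt = pd.read_json('../datas/simple_wikipedia/AA/wiki_00', lines=True)
    df_alt['text'] = df_alt['text'].str.replace('\\n', ' ', regex=True)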

.. GENERATED FROM PYTHON SOURCE LINES 75-83

Creating correspondence matrices for each entity type
================================================================

The dataframe that we just read consists of articles as rows and their id,
revid, url, title and text as columns.

From this table of articles, we want to extract two matrices: one representing
the articles and one representing the words (n-grams) they contain.

.. GENERATED FROM PYTHON SOURCE LINES 83-96

.. code-block:: Python


    from cartodata.loading import load_text_column   # noqa
    from sklearn.feature_extraction import text as sktxt   # noqa

    with open('../datas/stopwords.txt', 'r') as stop_file:
        stopwords = sktxt.ENGLISH_STOP_WORDS.union(
            set(stop_file.read().splitlines()))

    words_mat, words_scores = load_text_column(df['text'],
                                               4,
                                               10,
                                               0.05,
                                               stopwords=stopwords)
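
For intuition, a comparable articles-by-n-grams count matrix can be built with
scikit-learn's `CountVectorizer`. The parameter values below (n-gram range and
frequency cut-offs) are illustrative assumptions, not necessarily what
`load_text_column` does with its numeric arguments:

.. code-block:: Python

    from sklearn.feature_extraction.text import CountVectorizer  # noqa

    # assumed settings, for illustration only
    vectorizer = CountVectorizer(ngram_range=(1, 2),   # unigrams and bigrams
                                 min_df=10,            # drop very rare terms
                                 max_df=0.05,          # drop very common terms
                                 stop_words=list(stopwords))
    counts_sketch = vectorizer.fit_transform(df['text'])  # (n_articles, n_ngrams)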







.. GENERATED FROM PYTHON SOURCE LINES 97-99

Here `words_scores` contains all the n-grams extracted from the documents
together with their score,

.. GENERATED FROM PYTHON SOURCE LINES 99-102

.. code-block:: Python


    words_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    000th                 30
    10000                 43
    1000th                23
    100th                192
    100th anniversary     51
    dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 103-105

and the `words_mat` matrix counts the occurrences of each extracted n-gram
in every article.

.. GENERATED FROM PYTHON SOURCE LINES 105-108

.. code-block:: Python


    words_mat.shape





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    (357729, 124472)



.. GENERATED FROM PYTHON SOURCE LINES 109-116

To get a better representation of the importance of each term, we'll also
apply a TF-IDF (term-frequency times inverse document-frequency)
normalization on the matrix.

The `normalize_tfidf` function simply calls scikit-learn's
`TfidfTransformer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer>`_
class.

.. GENERATED FROM PYTHON SOURCE LINES 116-121

.. code-block:: Python


    from cartodata.operations import normalize_tfidf  # noqa

    words_mat = normalize_tfidf(words_mat)
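
For reference, here is a minimal sketch of the equivalent scikit-learn call,
assuming `TfidfTransformer`'s default parameters (`normalize_tfidf` may
configure it differently):

.. code-block:: Python

    from sklearn.feature_extraction.text import TfidfTransformer  # noqa

    transformer = TfidfTransformer()   # default TF-IDF settings (assumed)
    words_mat_tfidf = transformer.fit_transform(words_mat)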








.. GENERATED FROM PYTHON SOURCE LINES 122-126

Articles
----------------

Finally, we need to create a matrix that simply maps each article to itself.

.. GENERATED FROM PYTHON SOURCE LINES 126-132

.. code-block:: Python


    from cartodata.loading import load_identity_column  # noqa

    articles_mat, articles_scores = load_identity_column(df, 'title')
    articles_scores.head()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    April     1.0
    August    1.0
    Art       1.0
    A         1.0
    Air       1.0
    dtype: float64
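
Conceptually, this article-to-article matrix is just a sparse identity matrix
over the rows of the dataframe; a minimal sketch (leaving aside the scores
that `load_identity_column` also returns):

.. code-block:: Python

    import scipy.sparse as sp  # noqa

    # one row and one column per article, with 1.0 on the diagonal
    articles_identity_sketch = sp.identity(len(df), format='csr')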



.. GENERATED FROM PYTHON SOURCE LINES 133-155

Dimension reduction
===================

One way to see the matrices that we created is as coordinates in the space of
all articles. What we want to do is to reduce the dimension of this space to
make it easier to work with and see.

LSA projection
-----------------------

We'll start by using the LSA (Latent Semantic Analysis) technique to identify
the main latent dimensions in our data and thus reduce the number of rows in
our matrices. The `lsa_projection` method takes three arguments:

- the number of dimensions you want to keep
- the document/word frequency matrix
- a list of matrices to project

It returns a list of the same length containing the matrices projected into
the latent space.

We also apply an l2 normalization to each feature of the projected matrices.

.. GENERATED FROM PYTHON SOURCE LINES 155-166

.. code-block:: Python


    from cartodata.projection import lsa_projection  # noqa
    from cartodata.operations import normalize_l2  # noqa


    lsa_matrices = lsa_projection(80,
                                  words_mat,
                                  [articles_mat, words_mat])
    lsa_matrices = list(map(normalize_l2, lsa_matrices))
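
As a point of comparison, a similar 80-dimensional latent space can be
obtained directly with scikit-learn. This is a hedged sketch of the general
technique, assuming `words_mat` has articles as rows and n-grams as columns;
it is not necessarily the exact computation performed by `lsa_projection`:

.. code-block:: Python

    from sklearn.decomposition import TruncatedSVD  # noqa
    from sklearn.preprocessing import normalize  # noqa

    svd = TruncatedSVD(n_components=80)
    # transpose so that entities are columns, as in the cartodata convention
    articles_latent = svd.fit_transform(words_mat).T   # (80, n_articles)
    words_latent = svd.components_                     # (80, n_words)

    # l2-normalise each column (each entity) of the projected matrices
    articles_latent = normalize(articles_latent, norm='l2', axis=0)
    words_latent = normalize(words_latent, norm='l2', axis=0)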









.. GENERATED FROM PYTHON SOURCE LINES 167-169

We've reduced the number of rows in each of `articles_mat` and `words_mat` to
just 80.

.. GENERATED FROM PYTHON SOURCE LINES 169-173

.. code-block:: Python


    print(f"articles_mat: {lsa_matrices[0].shape}")
    print(f"words_mat: {lsa_matrices[1].shape}")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    articles_mat: (80, 357729)
    words_mat: (80, 124472)




.. GENERATED FROM PYTHON SOURCE LINES 174-185

This makes the matrices easier to work with for clustering or nearest
neighbors tasks, but we also want to project them onto a 2D space to be able
to map them.

UMAP projection
---------------------------

`UMAP <https://github.com/lmcinnes/umap>`_ (Uniform Manifold Approximation
and Projection) is a dimension reduction technique that can be used for
visualisation, similarly to t-SNE.

We use this algorithm to project our matrices into 2 dimensions.

.. GENERATED FROM PYTHON SOURCE LINES 185-190

.. code-block:: Python


    from cartodata.projection import umap_projection  # noqa

    umap_matrices = umap_projection(lsa_matrices)
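
For reference, a comparable 2D embedding could be computed with the
`umap-learn <https://github.com/lmcinnes/umap>`_ package directly. The sketch
below assumes that all entities are embedded together and the result is split
back per matrix; `umap_projection` itself may proceed differently:

.. code-block:: Python

    import numpy as np  # noqa
    import umap  # noqa

    # stack all entities (articles then words) as rows of a single matrix
    points = np.hstack(lsa_matrices).T
    embedding = umap.UMAP(n_components=2).fit_transform(points)

    # split the embedding back into one (2, n_entities) matrix per input
    sizes = [m.shape[1] for m in lsa_matrices]
    umap_sketch = [part.T for part in np.split(embedding, np.cumsum(sizes)[:-1])]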








.. GENERATED FROM PYTHON SOURCE LINES 191-193

Now that we have 2D coordinates for our points, we can try to plot them to
get a feel for the data's shape.

.. GENERATED FROM PYTHON SOURCE LINES 193-225

.. code-block:: Python


    import matplotlib.pyplot as plt   # noqa
    from mpl_toolkits.mplot3d import Axes3D   # noqa
    import numpy as np   # noqa
    import seaborn as sns   # noqa
    # %matplotlib inline
    sns.set(style='white', rc={'figure.figsize': (12, 8)})

    labels = ('articles', 'words')
    colors = ['g', 'r']
    markers = ['x', '+']


    def plot(matrices):
        plt.close('all')
        fig, ax = plt.subplots()

        axes = []

        # scatter each 2D matrix, one point per column (i.e. per entity)
        for i, m in enumerate(matrices):
            axes.append(ax.scatter(m[0, :], m[1, :],
                                   color=colors[i], marker=markers[i],
                                   label=labels[i]))

        ax.legend(axes, labels, fancybox=True, shadow=True)

        return fig, ax


    fig, ax = plot(umap_matrices)




.. image-sg:: /auto_examples/images/sphx_glr_workflow_wikipedia_lsa_kmeans_001.png
   :alt: workflow wikipedia lsa kmeans
   :srcset: /auto_examples/images/sphx_glr_workflow_wikipedia_lsa_kmeans_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 226-234

On the plot above, articles are shown in green and words in red. Without
labels for the points the plot is hard to interpret as is, but we can already
see that the data forms clusters which we could try to identify.

Clustering
==========

In order to identify clusters, we use the KMeans clustering technique on the
articles. We'll also try to label these clusters by selecting the most
frequent words that appear in each cluster's articles.

.. GENERATED FROM PYTHON SOURCE LINES 234-259

.. code-block:: Python


    from cartodata.clustering import create_kmeans_clusters  # noqa

    cluster_labels = []
    c_lsa, c_umap, c_scores, c_knn, _, _, _ = create_kmeans_clusters(
        8,                 # number of clusters to create
        umap_matrices[0],  # the 2D matrix of articles
        umap_matrices[1],  # the 2D matrix of words
        words_mat,         # the articles-to-words matrix
        words_scores,      # word scores
        cluster_labels,    # a list of initial cluster labels
        lsa_matrices[1],   # the LSA space matrix of words
    )
    c_scores

    fig, ax = plot(umap_matrices)

    # annotate each cluster with its label at the cluster centre
    for i in range(8):
        ax.annotate(c_scores.index[i], (c_umap[0, i], c_umap[1, i]),
                    color='blue')



.. image-sg:: /auto_examples/images/sphx_glr_workflow_wikipedia_lsa_kmeans_002.png
   :alt: workflow wikipedia lsa kmeans
   :srcset: /auto_examples/images/sphx_glr_workflow_wikipedia_lsa_kmeans_002.png
   :class: sphx-glr-single-img
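
For reference, the clustering idea itself can be sketched with scikit-learn's
`KMeans` applied to the 2D article coordinates. This illustrates the technique
only; `create_kmeans_clusters` additionally derives cluster labels from the
word scores and returns the extra matrices used above:

.. code-block:: Python

    from sklearn.cluster import KMeans  # noqa

    article_points = umap_matrices[0].T               # one row per article
    kmeans = KMeans(n_clusters=8, n_init=10).fit(article_points)
    cluster_of_article = kmeans.labels_               # cluster id per article
    cluster_centers_2d = kmeans.cluster_centers_.T    # (2, 8), for annotation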






.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (23 minutes 19.103 seconds)


.. _sphx_glr_download_auto_examples_workflow_wikipedia_lsa_kmeans.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: workflow_wikipedia_lsa_kmeans.ipynb <workflow_wikipedia_lsa_kmeans.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: workflow_wikipedia_lsa_kmeans.py <workflow_wikipedia_lsa_kmeans.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: workflow_wikipedia_lsa_kmeans.zip <workflow_wikipedia_lsa_kmeans.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_