Processing visual data (Fashion-MNIST) for Cartolabe

In this example we'll create a 2D representation of an image dataset. We use the Fashion-MNIST dataset, a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples.

Loading the data

We will download the training set images and labels (train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz) from the GitHub page and save them in the datas directory of the cartolabe-data project. We then use the load_mnist function below to load the training set.

from download import download  # noqa

download("http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz",
         "../datas/train-images-idx3-ubyte.gz")

download("http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz",
         "../datas/train-labels-idx1-ubyte.gz")

""
import os
import gzip
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, f"{kind}-labels-idx1-ubyte.gz")
    images_path = os.path.join(path, f"{kind}-images-idx3-ubyte.gz")

    # Skip the 8-byte IDX header (magic number + item count); the rest is
    # one uint8 label per image.
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)

    # Skip the 16-byte IDX header; each image is 28x28 = 784 uint8 pixels,
    # stored flattened as one row per image.
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels
Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz (25.2 MB)

file_sizes: 100%|██████████████████████████| 26.4M/26.4M [00:00<00:00, 62.3MB/s]
Successfully downloaded file to ../datas/train-images-idx3-ubyte.gz
Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz (29 kB)

file_sizes: 100%|██████████████████████████| 29.5k/29.5k [00:00<00:00, 30.0MB/s]
Successfully downloaded file to ../datas/train-labels-idx1-ubyte.gz

Load the Fashion-MNIST dataset

X_train, y_train = load_mnist('../datas', kind='train')

The training set has 60,000 sample points and 784 features (the 28 × 28 pixel images, flattened).

X_train.shape
(60000, 784)

The variable y_train contains the label of each sample, encoded as an integer from 0 to 9:

Label   Description
0       T-shirt/top
1       Trouser
2       Pullover
3       Dress
4       Coat
5       Sandal
6       Shirt
7       Sneaker
8       Bag
9       Ankle boot

y_train
array([9, 0, 0, ..., 3, 0, 5], dtype=uint8)
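
As a quick sanity check (an optional addition, not part of the original script), we can count the samples per class: Fashion-MNIST is balanced, with 6,000 images for each of the 10 classes.

# Count samples per class; each of the 10 classes should hold 6,000 images.
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique.tolist(), counts.tolist())))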

Create a 2D projection

To create a 2D projection of the dataset, we'll first run PCA on the X_train array to reduce the dimensions from 784 to 50; this removes noise and makes the next step considerably faster. Then, we'll use UMAP to project the dataset to 2D.

This step can take a while to complete, as there are 60,000 samples in the dataset.

from sklearn.decomposition import PCA  # noqa
from umap import UMAP  # noqa

pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(X_train)

fashion_pca_umap = UMAP(init='random').fit_transform(pca_result_50)
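
Before visualizing, it can be worth checking how much variance the 50 principal components retain and confirming that the UMAP embedding is two-dimensional. This is an optional check, not part of the original workflow:

# Fraction of the total variance kept by the first 50 PCA components.
print(pca_50.explained_variance_ratio_.sum())
# Shape of the UMAP embedding; expected: (60000, 2).
print(fashion_pca_umap.shape)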

Visualize the results

The following function creates a visualization of the results with matplotlib.

import numpy as np  # noqa
import matplotlib.pyplot as plt  # noqa
import matplotlib.patheffects as PathEffects  # noqa

import seaborn as sns  # noqa

sns.set_style('darkgrid')
sns.set_palette('muted')
sns.set_context("notebook", font_scale=1.5,
                rc={"lines.linewidth": 2.5})


""
# Utility function to visualize the outputs of PCA and UMAP.
# https://www.datacamp.com/community/tutorials/introduction-t-sne

def fashion_scatter(x, colors):
    # choose a color palette with seaborn.
    num_classes = len(np.unique(colors))
    palette = np.array(sns.color_palette("hls", num_classes))

    # create a scatter plot.
    f = plt.figure(figsize=(8, 8))
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(x[:, 0], x[:, 1], lw=0, s=40,
                    c=palette[colors.astype(int)])
    plt.xlim(-25, 25)
    plt.ylim(-25, 25)
    ax.axis('off')
    ax.axis('tight')

    # add a text label at the median position of each class
    txts = []

    for i in range(num_classes):

        # Position of each label at median of data points.

        xtext, ytext = np.median(x[colors == i, :], axis=0)
        txt = ax.text(xtext, ytext, str(i), fontsize=24)
        txt.set_path_effects([
            PathEffects.Stroke(linewidth=5, foreground="w"),
            PathEffects.Normal()])
        txts.append(txt)

    return f, ax, sc, txts

""
(f, ax, sc, txts) = fashion_scatter(fashion_pca_umap, y_train)
(Figure: 2D UMAP projection of the Fashion-MNIST training set, one color per class, with the class index drawn at the median position of each class)
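
If you want to keep the figure for later inspection, you can save it to disk. This is an optional step, and the file name below is just an example:

# Optional: persist the scatter plot next to the data files.
f.savefig('../datas/fashion_umap.png', bbox_inches='tight')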

Export the results for Cartolabe

We'll take the results of the UMAP projection and build an array of points with their positions. For each point, we also include an img_data field containing a base64-encoded data URI, so that the point's image can be displayed in an SVG image tag. We additionally export one highlighted item per class ('hl_clusters'), placed at the label positions computed by fashion_scatter.

import base64  # noqa
from PIL import Image  # noqa
from io import BytesIO  # noqa

labels = ['T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
          'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
items = []

for idx, data in enumerate(X_train):
    item = {'nature': 'fashion',
            'rank': idx,
            'score': 1.0,
            'label': labels[y_train[idx]],
            'position': fashion_pca_umap[idx].tolist()}
    im = Image.fromarray(data.reshape(28, 28))

    with BytesIO() as output:
        im.save(output, format='png')
        encoded_string = base64.b64encode(output.getvalue()).decode()
        item['img_data'] = 'data:image/png;base64,{}'.format(encoded_string)
    items.append(item)

# Export one 'hl_clusters' item per class, placed at the class label position
# computed by fashion_scatter and illustrated by a random image of that class.
for txt in txts:
    pos = txt.get_position()
    item = {'nature': 'hl_clusters',
            'rank': len(items),
            'score': 1.0,
            'label': labels[int(txt.get_text())],
            'position': [float(pos[0]), float(pos[1])]}
    # pick one random sample belonging to this class
    sample_idx = np.random.choice(
        np.argwhere(y_train == int(txt.get_text())).reshape(-1), 1)[0]
    im = Image.fromarray(X_train[sample_idx].reshape(28, 28))

    with BytesIO() as output:
        im.save(output, format='png')
        encoded_string = base64.b64encode(output.getvalue()).decode()
        item['img_data'] = 'data:image/png;base64,{}'.format(encoded_string)
    items.append(item)

""
import pprint  # noqa

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(items[10])
{   'img_data': 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABwAAAAcCAAAAABXZoBIAAACA0lEQVR4nFWQu24TQRiFz1x31+u1TWQ7IQZZEVESRUghUgouAolIFJQgwQNQIKBAKA1vAS2vwgsgCqChSxGBAnGcZLHsOBvvzM4Mxa5v08yv8+nMmf8A+Qk/fH1TjM8+vysmkl+fHrDu5tnh/mDhrqwclQ9fHkzh7vu4QoJG6fj7jt/vNv/Vzp8AAAUAPPqltDvrHKSbo7jX0vQv7k3hsoK2nop7g9iWknPD3H0A4ABAK/2+D87LXAnKqbiEXZs4V2ige6pEJfN9DWJdctGaOJdSS36zIQkzq3liVacUxY3TwlnveI/ZUepGyWWaWN29fTPBxvjZRvlhe6ffsIpaGfiLKtnalluTPdsf375oRQMJSrJw0Tzfa71Kpw0BT1//UZxwimSF7RYaBQDCBH4OHRWwlrikCjA2gc5kGEKJjIE6/0ID1kwbAoPINDecEBvY/XEaHWe2tI8wgxOa4No8dLijJU09GGEQNAu5gBY3MinBGeBrrMPNQOp403MEHnXUUL0OOwMJKnGgpKbghHG+Cktm4fXIedJJj7gAmRK5Pv7tRjUpK84DqNBzfhVkFi54uppxR0VS5UwsYS5zRZGwL1ExKgooieadJg2UdmHDSlDD3HymMuLEpsL03JBajPJdeAHXarp2RdYrq83tLxFRuciLgr7VT9JTs3z1h9d2ya3j3Pkfp1jNMwZc6WUAAAAASUVORK5CYII=',
    'label': 'T-shirt',
    'nature': 'fashion',
    'position': [7.637689113616943, -3.356983184814453],
    'rank': 10,
    'score': 1.0}
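
To verify that the encoding round-trips, we can decode an item's img_data URI back into a PIL image (an optional sketch, not part of the original script):

# Split off the 'data:image/png;base64,' prefix and decode the payload.
header, b64_data = items[10]['img_data'].split(',', 1)
decoded = Image.open(BytesIO(base64.b64decode(b64_data)))
print(decoded.size)  # expected: (28, 28)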

All that's left to do is save the items to a JSON file:

import json  # noqa

export_file_name = '../datas/mnist.json'

with open(export_file_name, 'w') as f:
    json.dump(items, f)
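
As an optional final check (not part of the original script), we can reload the file and confirm that all 60,000 samples plus the 10 class markers were written:

# Reload the exported JSON and verify the item count.
with open(export_file_name) as fp:
    exported = json.load(fp)
print(len(exported))  # expected: 60010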

Total running time of the script: (0 minutes 52.182 seconds)
