Spoiler: They are squiggly!

Dimension Reduction: why use 3 numbers when 2 will do?

Artificial neural nets (ANNs) can be used to compress data. ANNs used this way are referred to as “autoencoders”.

Here is a diagram of a basic autoencoder from Wikipedia:

If an autoencoder can compress 10 numbers into 9 and then reconstruct the original 10 from these 9 numbers, there is redundant information among the 10 original.

There will be errors, since the compression is not perfectly lossless, but we can tolerate this to an extent.
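To make that concrete, here is a minimal sketch of a 10 -> 9 -> 10 autoencoder. Keras is used here purely for illustration, and the layer sizes and random data are placeholders, not the setup used later in this post.

```python
# Minimal autoencoder sketch: squeeze 10 numbers through a 9-unit bottleneck,
# then try to reconstruct the original 10.
import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(10,))
code = keras.layers.Dense(9, activation="relu")(inputs)        # encoder: 10 -> 9
outputs = keras.layers.Dense(10, activation="sigmoid")(code)   # decoder: 9 -> 10
autoencoder = keras.Model(inputs, outputs)

# The network is trained to reproduce its own input, so the 9-unit code
# has to capture (almost) everything in the 10 inputs.
autoencoder.compile(optimizer="adam", loss="mae")
x = np.random.rand(1000, 10).astype("float32")                 # placeholder data in [0, 1]
autoencoder.fit(x, x, epochs=5, verbose=0)
```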

Cool stuff! Why do we care?

This compression can be utilized for dimension reduction, a technique in the realm of statistical learning. Dimension reduction is very useful for understanding the underlying (latent) relationships among many individual variables.

Anyone would struggle to understand the relationships among 100 different aspects of a concept.

But if I could boil those 100 down to a representative 2 or 3, the concept would be a lot easier to understand.

This is the power of dimension reduction!

This framework can be applied to many different types of data, basically as long as you can represent the data as a vector.

A vector is one-dimensional -> [0010011011]. With a bit of thought, lots of things can be represented as a vector.

A classic example is reducing a 28 by 28 pixel image of a handwritten digit (a 784-value-long vector) to just two values.

With just two numbers we can preserve a lot of the information in the 784-dimensional vector!
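As a quick illustration of the "represent it as a vector" step (numpy is my choice here, not necessarily what was used for the original figures):

```python
# Turning a 28x28 image into a 784-long vector that an autoencoder can consume.
import numpy as np

image = np.random.rand(28, 28)    # stand-in for one handwritten-digit image
vector = image.reshape(-1)        # flatten row by row
print(vector.shape)               # (784,)
```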

(This image is the result of a similar but more advanced technique: a variational autoencoder.) Not my image, link to source

The field of dimension reduction is an open and burgeoning endeavor; an ANN autoencoder is just one way of approaching it.

A simple ANN autoencoder is likely not the best way to approach a dimension reduction problem.

This write-up is an exploration of curiosity rather than a search for state-of-the-art techniques.

What does the encoded data look like as the structure of the autoencoder changes?

This is the thought that inspired this exploration.

How would differently structured autoencoders encode the same data?

From other work, I had a data set of how much two countries trade in each of 96 goods for a given year.

These trade baskets tend to be sparse: a given pair of countries trades only a small number of the 96 possible goods.

This can make dimension reduction more difficult, which is why I started exploring autoencoders after trying some other methods such as PCA, tSNE, and UMAP.

I tried three different autoencoders of increasing depth (number of hidden layers).

  1. 96 -> 8 -> 96
  2. 96 -> 48 -> 24 -> 8 -> 24 -> 48 -> 96
  3. 96 -> 88 -> 80 -> 72 -> 64 -> 56 -> 48 -> 40 -> 32 -> 24 -> 16 -> 8 (and back to 96 symmetrically)
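As a sketch of how these three could be wired up (Keras again, and the `build_autoencoder` helper is my own framing for illustration, not necessarily the exact code behind this post):

```python
from tensorflow import keras

def build_autoencoder(input_dim, encoder_widths):
    """Stack Dense ReLU layers down to the bottleneck, then mirror them back up."""
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for width in encoder_widths:                    # e.g. [48, 24, 8]
        x = keras.layers.Dense(width, activation="relu")(x)
    for width in reversed(encoder_widths[:-1]):     # mirror the encoder, skipping the bottleneck
        x = keras.layers.Dense(width, activation="relu")(x)
    outputs = keras.layers.Dense(input_dim, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)

shallow = build_autoencoder(96, [8])                        # 96 -> 8 -> 96
deep    = build_autoencoder(96, [48, 24, 8])                # 96 -> 48 -> 24 -> 8 -> 24 -> 48 -> 96
deepest = build_autoencoder(96, list(range(88, 0, -8)))     # 96 -> 88 -> 80 -> ... -> 8 -> ... -> 96
```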

All three are symmetrical; I wanted to keep that consistent so that only the depth of the networks differed.

Testing symmetric vs asymmetric would be interesting as well.

Creating an autoencoder that went from 96 dimensions to 2 incurred fairly high decoding errors.

I found that a bottleneck of 8 greatly reduced the dimensionality while keeping the decoding error palatable.

I’d like to see the encoded data in scatterplots, so PCA and tSNE were used to go from 8 dimensions down to 2.

PCA and tSNE are other dimension reduction techniques, whose explanations are outside the scope of this brief blog post.
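For reference, the 8 -> 2 step might look something like this with scikit-learn (my library choice; `encoded` is a placeholder for the 8-dimensional codes produced by a trained encoder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

encoded = np.random.rand(5000, 8)    # placeholder for the 8-D encoded trade baskets

pca_2d  = PCA(n_components=2).fit_transform(encoded)                  # linear projection
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(encoded)  # nonlinear embedding
```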

Technical details: each autoencoder was trained for 10 epochs, with ReLU activations in the hidden layers and a sigmoid on the final layer.

Loss didn’t improve after 10 epochs, and I don’t have the hardware/funds to train long enough to grok.
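A sketch of the training setup, shown here for the shallow model (the optimizer, batch size, and data scaling are my assumptions; only the MAE loss, ReLU/sigmoid activations, and 10 epochs come from the details above):

```python
import numpy as np
from tensorflow import keras

# Placeholder for the real data: one row per country-pair trade basket over 96 goods,
# assumed scaled to [0, 1] to match the sigmoid output layer.
trade_baskets = np.random.rand(5000, 96).astype("float32")

inputs = keras.Input(shape=(96,))
code = keras.layers.Dense(8, activation="relu")(inputs)       # shallow 96 -> 8 encoder
outputs = keras.layers.Dense(96, activation="sigmoid")(code)  # 8 -> 96 decoder
autoencoder = keras.Model(inputs, outputs)

autoencoder.compile(optimizer="adam", loss="mae")
history = autoencoder.fit(trade_baskets, trade_baskets, epochs=10, batch_size=32, verbose=0)
print(f"MAE: {history.history['loss'][0]:.4f} -> {history.history['loss'][-1]:.4f}")
```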

Mean Absolute Error (MAE) from the first epoch to the last for each model:

  1. 0.0249 -> 0.0082
  2. 0.0129 -> 0.0069
  3. 0.0121 -> 0.0104

While the shallow ANN started with roughly double the loss of the others, it came down significantly.

Conversely, the deepest ANN started with less loss but improved the least.

After training each ANN, I show the distributions of the 8 encoded variables.

The distributions give us a sense of how the ANN packages the information from 96 variables into 8.

I also show scatterplots of the 8-to-2 dimension reduction for both PCA and tSNE.

Each point is a trade basket. These scatterplots communicate the similarity between the trade baskets.
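Roughly, the plots below were made like this (matplotlib is my assumption; `encoded` and `tsne_2d` are placeholders for the arrays from the earlier steps):

```python
import numpy as np
import matplotlib.pyplot as plt

encoded = np.random.rand(5000, 8)    # 8-D codes from one of the autoencoders
tsne_2d = np.random.rand(5000, 2)    # the 8 -> 2 projection (PCA or tSNE)

# Distributions of the 8 encoded variables.
fig, axes = plt.subplots(2, 4, figsize=(12, 5))
for i, ax in enumerate(axes.ravel()):
    ax.hist(encoded[:, i], bins=50)
    ax.set_title(f"encoded dim {i + 1}")
fig.tight_layout()

# Scatterplot of the 2-D projection; each point is one trade basket.
plt.figure(figsize=(5, 5))
plt.scatter(tsne_2d[:, 0], tsne_2d[:, 1], s=2)
plt.show()
```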

1. Shallow ANN

2. Deep ANN

3. Deepest ANN

Some interesting takeaways I see:

  1. The shape of the distributions differs across the ANNs.
  2. PCA has trouble further reducing the dimensions: only a small amount of information (variance) is explained by the first two components.
  3. tSNE for the deep ANN creates clearer groups than for the shallow network, and the deepest ANN is very odd. The deep ANN’s tSNE results form some intelligible clusters, which is promising, while the deepest ANN produces these squiggles.

After some googling, it appears that tSNE applied to the latent spaces of ANNs tends to produce squiggles: