The Future of Big Data is Small Data

Imagine you want to create software to do a task. There are two main ways to do it and one intermediate way.

The code method

The first way is to program the software to do the task, which I will call the “code method”. In this case, you know all the interactions that can occur and anticipate them in your program. The most basic example is the “if-then” rule, in which you tell the computer: if this happens, then do that. For example, if this button is clicked, then show this screen.

The code method uses no training data and involves no training. The programmed sequence runs deterministically: the same input always triggers the same behavior. To be clear, a program built with the code method can be written in actual code or assembled with visual or similar tools.
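
As a minimal sketch of the code method, the button example could look like the following. The screen and event names are hypothetical and chosen only for illustration. Every rule is written by hand and nothing is learned.

```python
def next_screen(current_screen: str, event: str) -> str:
    """Return the screen to show after an event, using fixed if-then rules.

    The screen and event names are hypothetical, for illustration only.
    """
    if current_screen == "home" and event == "settings_button_clicked":
        return "settings"
    if current_screen == "settings" and event == "back_button_clicked":
        return "home"
    # No rule matches this interaction, so the screen does not change.
    return current_screen


print(next_screen("home", "settings_button_clicked"))  # settings
```

Any interaction the developer did not anticipate falls through to the last line, which is exactly the limitation of this method.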

The train method

The second way is to train the software to do the task using big data, which I will call the “train method”. In this case, you write an algorithm, such as a neural network, to allow the software to be trained on the data. You then unleash the software on the data and the computer learns, via a feedback loop, how to do the task. For example, you can teach the software to recognize cats by training it on cat pictures.

The train method is, of course, the essence of artificial intelligence. It is known as the big data approach because the more high-quality training data is available, the better the result will be. A minimum amount of training data is required to get reasonable results.
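
To make the feedback loop concrete, here is a minimal sketch that uses a perceptron, one of the simplest trainable algorithms, rather than a full neural network. It is a supervised example, and the two features and six data points are made up for illustration. A real cat recognizer would learn from image pixels and need far more data.

```python
def train_perceptron(examples, epochs=20, learning_rate=0.1):
    """Learn weights for a linear classifier from (features, label) pairs."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            activation = bias + sum(w * x for w, x in zip(weights, features))
            prediction = 1 if activation > 0 else 0
            # Feedback signal: 0 when the guess was right, +1 or -1 when wrong.
            error = label - prediction
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


def predict(weights, bias, features):
    """Return 1 (cat) or 0 (not a cat) for one feature vector."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


# Made-up toy data: (ear_pointiness, whisker_length) -> 1 for cat, 0 otherwise.
toy_examples = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1),
    ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0),
]

weights, bias = train_perceptron(toy_examples)
print(predict(weights, bias, (0.85, 0.85)))  # 1
print(predict(weights, bias, (0.15, 0.15)))  # 0
```

Nobody wrote a rule saying what a cat is. The weights were adjusted step by step whenever a prediction was wrong.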

In its purest form, the train method can use unstructured data and the learning itself happens in an unsupervised way. This means that humans do not help the software with the training at all.

The algorithm will learn from the implicit context (not from explicit labeling) whether a given picture is of a cat or not. For example, ordinary users of a platform might spontaneously put the word “cat” in the description of their photo, the word “cat” may be in the article in which the picture appears, or, if it’s a video, people might say the word “cat” when the cat appears. All of this user data is of course totally unstructured (which means messy), and the algorithm would need to figure out what a cat is from this messy data.
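
The kind of implicit signal described above can be sketched as follows. This is only an illustration of how a noisy hint might be pulled out of surrounding text. The function names and the three text sources are assumptions, and a real system would have to learn such associations itself rather than have them coded in.

```python
import re


def mentions_cat(text: str) -> bool:
    """True if the text contains the word "cat" or "cats" as a whole word."""
    return re.search(r"\bcats?\b", text.lower()) is not None


def implicit_cat_score(description: str, article_text: str, transcript: str) -> float:
    """Fraction of the available text sources that mention a cat.

    The three sources are hypothetical stand-ins for messy user data.
    """
    available = [s for s in (description, article_text, transcript) if s]
    if not available:
        return 0.0
    return sum(mentions_cat(s) for s in available) / len(available)


print(implicit_cat_score("My cat asleep on the sofa", "", "Look at the cat!"))  # 1.0
print(implicit_cat_score("Sunset over the bay", "The category winners", ""))   # 0.0
```

The score is a noisy hint, not a label: a photo can show a cat that nobody mentions, and a caption can mention a cat that is not in the picture.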

Of course, there are potentially big advantages to an algorithm being unsupervised. This potentially means a huge amount of effort will be saved on the part of humans in terms of labeling and categorizing the data. It is not trivial (or stimulating for that matter) to accurately label one hundred thousand pictures of cats.

The problems

One major issue with the unstructured train method is that it needs a lot more data than a supervised one. If the data isn’t available, the software can’t be trained this way. Supervised approaches also need a lot of data, so they suffer from the same problem.

This is, of course, the reason why people look for suitable opportunities to apply AI rather than applying AI to everything. AI algorithms work best when a lot of data is available for training, or when they can generate a lot of data themselves, as in the case of games.

Another problem with the unstructured, unsupervised approach is that it is much harder to write and test the algorithm on the data at hand. The algorithm needs to be more sophisticated to deal with unstructured data than it would need to be when the data is neat and categorized.

A supervised approach adds human intentionality to the process through the way the data is categorized. However, it is still very much a train method, a big data approach. Humans with an understanding of the algorithms can label the data and, by doing so, reduce the work of the algorithm.

The small data method

There is a method that blends the code and train methods, which I will call the “small data method”. This is the small data approach that I alluded to in the title. By combining the code and train techniques, it is possible to massively reduce the amount of data needed to train an algorithm.

For the small data method, a developer codes up a model of the interactions, and this model is then trained on a much smaller data set than a big data approach would require. As a result, the model is trained much faster than it would be with the pure train method.
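
One way to picture this blend is the sketch below, which assumes a tiny chatbot-style intent classifier. The intents, seed keywords, and the four training sentences are all hypothetical. This is not a description of how any particular product works. The developer codes the structure (the intents and a few keywords), and training on a handful of examples fills in the rest.

```python
from collections import Counter

# Coded part: intents and seed keywords written by the developer (hypothetical).
SEED_KEYWORDS = {
    "book_flight": {"flight", "fly"},
    "cancel_booking": {"cancel", "refund"},
}
SEED_WEIGHT = 2.0  # hand-coded knowledge counts more than one observed example


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Refine per-intent word weights from a few (utterance, intent) pairs."""
    weights = {intent: Counter({word: SEED_WEIGHT for word in keywords})
               for intent, keywords in SEED_KEYWORDS.items()}
    for utterance, intent in examples:
        for word in set(tokenize(utterance)):
            weights[intent][word] += 1.0
    return weights


def classify(weights, utterance):
    """Return the intent whose word weights best match the utterance."""
    words = tokenize(utterance)
    scores = {intent: sum(w[word] for word in words)
              for intent, w in weights.items()}
    return max(scores, key=scores.get)


# Trained part: a small, manually created data set.
small_data = [
    ("i want to fly to paris", "book_flight"),
    ("get me a ticket to rome", "book_flight"),
    ("please cancel my trip", "cancel_booking"),
    ("i need my money back", "cancel_booking"),
]

model = train(small_data)
print(classify(model, "a ticket to berlin please"))  # book_flight
print(classify(model, "money back now"))             # cancel_booking
```

Neither “ticket” nor “money” appears in the coded keywords. Both were picked up from four training sentences, which is only possible because the coded structure already narrows down what has to be learned.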

Of course, this small data approach only makes sense if the time needed to code up the model and train it on the small data set is much less than the time it would take to gather the big data set and train the algorithm on it.
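
This condition can be written compactly. The symbols are shorthand introduced only for this illustration: the left side is the time to code the model plus the time to train it on the small data set, and the right side is the time to gather the big data set plus the time to train on it.

```latex
T_{\text{code}} + T_{\text{train}}^{\text{small}} \ll T_{\text{gather}} + T_{\text{train}}^{\text{big}}
```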

There are scenarios where the small data solution would allow us to do things that are not currently feasible. The small data method is the only method available if the data required for the train method doesn’t exist in the first place. In this case the algorithm needs to be trained on data that is manually created. It is not normally practical to create tens of thousands of records by hand.

The small data approach is currently being researched by AI companies including botpress.io, and I expect it will become a mainstream technique in the years to come.
