Absolutely everything. Data is the driver of predictive models. In my opinion, a field of science goes through various stages, eventually culminating in a quantitative phase. Specifically, the scientific method comprises the following:
· Observation, where interesting behaviour is identified via monitoring. This allows one to build a hypothesis and to test it;
· Experimentation, where, once a hypothesis is put forward, an experiment is devised to obtain evidence for or against its validity; and
· Recordation, where the results of experiments are meticulously persisted. This also involves quantification, where the natural world is converted to values to gather the necessary data.
Data recordation enables not only analysis but also prediction. In the world of machine learning, the wealth of gathered data is used to train models to predict certain events based on a set of input values. However, a prediction is only as good as the data used to train the model behind it. This implies that a sufficient amount of data is required for a model to perform well when predicting outcomes. Sufficiency is gauged on a case-by-case basis, but let’s consider the following example to illustrate what happens when there is not enough data.
Suppose you devise an experiment in which you perform a certain set of actions (the input data). You run the experiment with two different input values, which gives you two output values. At this point, a model is required to relate the inputs to the outputs. Faced with such limited data, the only real option is to connect the two points with a straight line. However, as illustrated in the figure below, subsequent experiments can provide additional values, and a new (and more accurate) model is then needed to capture the relationship.
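To make the two-point example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the "true" relationship (y = x squared), the input values, and the function names are hypothetical and do not come from a real experiment. The curve fit uses Lagrange interpolation simply because it needs nothing beyond the standard library; it stands in for whatever richer model the extra data would justify.

```python
# Hypothetical ground truth, chosen only for illustration: y = x^2.
def true_relationship(x):
    return x ** 2


def fit_line(p0, p1):
    """Return the straight-line model passing through two (x, y) points."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)


def fit_curve(points):
    """Return the Lagrange polynomial passing through all given (x, y) points."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


# First round: only two experiments, so a straight line is the only option.
first_runs = [(x, true_relationship(x)) for x in (0.0, 4.0)]
line_model = fit_line(*first_runs)

# Later round: one additional experiment reveals that the relationship curves.
more_runs = [(x, true_relationship(x)) for x in (0.0, 2.0, 4.0)]
curve_model = fit_curve(more_runs)

# Compare both models on an input that neither has seen.
x_new = 3.0
print("actual:     ", true_relationship(x_new))
print("line model: ", line_model(x_new))
print("curve model:", curve_model(x_new))
```

With these assumed values, the straight line overshoots at the unseen input, while the model built from three observations recovers the underlying curve. Real data is noisy, so in practice the gain from extra observations is rarely this clean.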
Now the above may be a simple example when considering today’s types of experiments, especially those in the AI space, but it nonetheless illustrates the need for a wealth of data.
Today’s AI models take as input perhaps hundreds of data values and perform thousands of computations to provide a predictive response. As such, the demand for data to build these models is even greater.
Thankfully, volunteers and organisations are helping with this by providing and maintaining open-access data repositories on every conceivable subject, from textbooks to cutting-edge scientific experiments. Open data initiatives are one example.
Do you have an important problem you want to solve? Even if you don’t have the data for it now, knock on our door and we can help you get the answers you need.
April 2018 Published in Forbes