Identifying and collecting all relevant data available internally and externally.

Data Sources

Careful consideration should be given to the data that has been earmarked as useful for analysis, regardless of whether it is for a new project or is already in use. If data is already used for analysis, there is a good opportunity to determine whether its true value is being extracted. For example, data used for diagnostic analysis can explain why something happened, but it may hold enough information to support predictive analysis as well, and so indicate what might happen next.

All available data sources that have been identified should be assessed to determine how they can be used. This type of analysis frequently leads to new business questions that the data can answer. It is also a good measure of a company's culture for data-driven decision making.

The data mentioned above can be classified as the known knowns. In many cases there are also unknown unknowns:

  • transient usage - data sources used once and then discarded because they are thought to have reached the end of their usefulness
  • data leakage - data that is discarded immediately, such as automatically truncated logs
  • new data sources - possible sources of data that are not even ingested or stored
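
One way to make such an assessment repeatable is to keep a simple inventory of sources and classify each entry against the three categories above. The following is only a minimal sketch: the DataSource fields, the Gap names and the classification rules are assumptions made for illustration, and a real assessment would use whatever attributes the organisation actually records.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gap(Enum):
    """Hypothetical labels mirroring the categories in the list above."""
    NONE = "retained"
    TRANSIENT = "transient usage"
    LEAKAGE = "data leakage"
    NOT_INGESTED = "new data source"


@dataclass
class DataSource:
    """Hypothetical inventory record for one data source."""
    name: str
    ingested: bool                 # is the data captured at all?
    retention_days: Optional[int]  # None means it is kept indefinitely
    used_once: bool = False        # used for one analysis, then dropped


def classify(source: DataSource) -> Gap:
    """Assumed rules: check ingestion first, then usage, then retention."""
    if not source.ingested:
        return Gap.NOT_INGESTED
    if source.used_once:
        return Gap.TRANSIENT
    if source.retention_days is not None:
        return Gap.LEAKAGE
    return Gap.NONE


# Example inventory; the names and values are made up.
inventory = [
    DataSource("sales ledger", ingested=True, retention_days=None),
    DataSource("web server logs", ingested=True, retention_days=7),
    DataSource("campaign export", ingested=True, retention_days=None, used_once=True),
    DataSource("sensor feed", ingested=False, retention_days=None),
]

for source in inventory:
    print(f"{source.name}: {classify(source).value}")
```

Running the same classification at every assessment makes it easy to see which sources have moved between categories since the last one.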


As briefly mentioned above, assessments provide structure and process to the data discovery process. These assessments have to be repeatable, and repeated often, because environments are in constant change. In most cases it helps to start with an external party, because their initial approach is not shaped by internal bias. Asking some 'obvious' questions during an assessment often uncovers the not-so-obvious.

Data Leakage

The most important data sources to identify are those with useful data that is not currently captured and stored indefinitely. Even data that is captured and stored for a while needs to be locked down. Discarding any data as unfit for use has to be a very specific and analytical process.

The transient use of data is a typical example of internal company bias being the only reason for labelling data as useless. Data is often discarded after use because it has always been done this way, or because it is thought to have no further value.

Identifying the data leakage (whether by accident or on purpose) and starting to store the data as soon as possible begins building history early. It will still take some time before enough data is captured to see longer-term trends, but the sooner the better.
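
For the truncated-log case, a first step could be to copy the log into an archive before it is emptied. The sketch below is an illustration under stated assumptions, not a recommended implementation: the function name archive_then_truncate, the file paths and the archive naming scheme are all hypothetical, and it uses only the Python standard library.

```python
import gzip
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_then_truncate(log_path: Path, archive_dir: Path) -> Path:
    """Copy a log into a compressed, timestamped archive, then empty it."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    # UTC timestamp (with microseconds) keeps successive archives distinct.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = archive_dir / f"{log_path.name}.{stamp}.gz"
    with log_path.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Empty the log only after the archive has been written.
    log_path.write_bytes(b"")
    return target


if __name__ == "__main__":
    # Hypothetical paths, used purely for demonstration.
    log = Path("app.log")
    log.write_text("first event\nsecond event\n")
    print(archive_then_truncate(log, Path("log_archive")))
```

Anything written to the log between the copy and the truncation would still be lost, so a production setup would more likely rely on the logging system's own rotation and retention settings.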

New Sources

Any undiscovered new sources of data could hold the most value. In the new world of IoT there are so many potential sources of new data that special consideration has to be given to what value they may hold.

For example, if you consider electronic messages as virtual IoT devices, a new way to analyse communication is exposed. Other examples like this exist, where a source is not traditionally identified as IoT but the same principles apply.

In larger organisations it should be taken as a given that data exists in silos. Information used by one department could very well be useful to another, without the other department being aware of it.