AI Data Done Differently

By Phil Tee, EVP AI Innovations at Zscaler.


No matter the region or industry, every business today is looking for its next AI use case – for the best way to deploy AI to unlock efficiencies or gain a business advantage. And while the focus of each use case may differ, the one thing they all have in common is a reliance on data. The adage of “rubbish in, rubbish out” has found a new lease of life in AI scenarios – a reminder that AI is only as good as the data that trains and then continues to feed it.

Until now, the prevailing school of thought was that the more data you captured to feed your AI models, the better. But as data sets climb into the trillions, we may have reached a turning point in this attitude. After all, when you are talking about 15 trillion versus 5 trillion data points, the difference in size matters far less than the quality of the data you have and what you do with it. Given this, is it time for us to rethink how we’re approaching data for AI?

The Rise of Agentic Workflows & SLMs

After several years where Large Language Models (LLMs) were the primary AI pursuit, one of the key industry trends we’re now witnessing is a move towards using agentic workflows and Small Language Models (SLMs). Unlike their multi-functional LLM counterparts, SLMs can be trained on more focused datasets, making them highly effective for specific tasks or domains. 

In part, this shift comes in recognition of the cost and latency issues inherent in LLMs – as well as the security implications. With an LLM chatbot, for example, people expect their question answered in a matter of seconds. When you consider, however, that doing so means throwing the entirety of an LLM’s hardware footprint at the question, you can understand how matching 11,000 logs per second to a few seconds of latency could prove a tall ask. Instead, the latest thinking is that if you want to use AI in production, you need smaller models – whether off the shelf or fine-tuned.

SLMs’ rise also reflects a more targeted approach by companies to AI queries – one where, rather than starting with a question and collecting everything that might possibly relate to it, you consider what answer you need and then form a workflow to bring back only the necessary data, in order of usefulness.
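
As a rough sketch of that answer-first pattern (the required fields, scoring function and sources below are illustrative assumptions, not a description of any particular product or framework), a workflow might look something like this:

```python
# Minimal sketch of an answer-first retrieval step in an agentic workflow.
# The required fields, scoring and sources are illustrative assumptions.

def plan_required_fields(question: str) -> list[str]:
    # In practice an SLM would decide what the answer actually needs;
    # here the fields are hard-coded for illustration.
    return ["error_rate", "affected_service", "recent_deploys"]

def relevance(snippet: str, fields: list[str]) -> float:
    # Toy score: fraction of required fields mentioned in the snippet.
    return sum(field in snippet for field in fields) / len(fields)

def gather(question: str, sources: list, top_k: int = 5) -> list[str]:
    """Pull back only the most useful snippets, in order of usefulness."""
    fields = plan_required_fields(question)
    scored = [
        (relevance(snippet, fields), snippet)
        for fetch in sources          # each source is a callable returning snippets
        for snippet in fetch()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for score, snippet in scored[:top_k] if score > 0]
```

The point is the ordering of the steps: the workflow decides what the answer needs first, and only then fetches and ranks data against that need.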

A Focus on Depth of Data

This strategic shift toward targeted data acquisition naturally leads us to reconsider the quality versus quantity argument as it relates to data. Indeed, not all data is created equal. Its value stems not from its volume, but from a combination of its depth and relevance – plus how you condition it. 

Machine data in the form of logs is a classic example of volume being the enemy of quality. A log file is usually a collection of unstructured debug messages written by engineers who have long since moved roles or companies. As a result, logs are highly sparse and information-light. Simply put, most of the content is junk, but buried in that junk is AI gold. Clearly, “wasting parameters” on the junk is not a great choice, so a pre-conditioning step that “densifies” the logs by removing the junk is a far better strategy.
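
To make that concrete, here is a minimal sketch of what such a pre-conditioning step could look like; the noise patterns, severity filter and extracted fields are simplistic assumptions for illustration, not a real pipeline:

```python
# Minimal log "densifier" sketch: drop junk, keep the signal.
# Noise patterns and extracted fields are illustrative assumptions.

import re

NOISE = [
    re.compile(r"\bDEBUG\b"),                 # debug chatter
    re.compile(r"heartbeat|keep-?alive", re.I),
    re.compile(r"^\s*$"),                     # blank lines
]

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
SEVERITY = re.compile(r"\b(ERROR|WARN|FATAL)\b")

def densify(lines):
    """Yield compact, de-duplicated records from a raw log stream."""
    seen = set()
    for line in lines:
        if any(pattern.search(line) for pattern in NOISE):
            continue                          # discard junk outright
        severity = SEVERITY.search(line)
        if not severity:
            continue                          # keep only actionable severities
        # Collapse near-duplicates: strip timestamps before de-duplicating.
        key = TIMESTAMP.sub("<ts>", line).strip()
        if key in seen:
            continue
        seen.add(key)
        yield {"severity": severity.group(1), "message": key}
```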

Of course, the ideal scenario would be to have a high volume of quality data. But even then, you don’t want to over-train your models on a massive sample set, as this can actually work against you in the form of overfitting. By trying to connect too many data dots, an overfitted model ends up producing results that are less accurate and more random on anything it hasn’t seen before. This is familiar to all data scientists as the bias-variance trade-off, where endlessly refining the model against its training data causes novel data point “shock”.
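
A compact way to see that trade-off for yourself (the synthetic data and polynomial models below are purely illustrative) is to fit increasingly flexible models and watch error on held-out data climb even as training error keeps falling:

```python
# Illustrative bias-variance sketch: as model capacity grows,
# training error keeps falling while error on held-out data rises.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # noisy signal

x_train, y_train = x[::2], y[::2]      # half for training
x_val, y_val = x[1::2], y[1::2]        # half held out

for degree in (1, 3, 9, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, validation MSE {val_err:.3f}")
```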

And when it comes to source information, too many dots is definitely where we are heading. To put it into perspective, my guess is that within a few years the total amount of data traffic on the network will exceed the entire production of data on planet Earth to date.

Sustainability Side-Effects of a Lean into Quality 

As many of today’s leading thinkers will tell you, technology is an extractive economy. We tend to think of technology as clean and value-adding – as moving intangible things and creating magical outputs. But this is sadly not at all the case. In fact, AI in particular is massively data (and compute) hungry – requiring huge amounts of power and water to collect, process, train on and store the data that feeds its models. To give you a sense of what this means for storage alone: if you keep a terabyte of data in the cloud for a year, it has a bigger carbon footprint than a single plane ticket from Schiphol to New York. And a terabyte is nothing.

Because of this, anything you can do to recycle data, or to extract more value from it during the AI process, has significant sustainability implications. Returning to the push for quality over quantity, part of the log densifier process turns data from ‘at rest’ to ‘in motion’ – you extract the metadata and discard the rest – meaning that once the data has been used, it can be thrown away instead of stored.
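
A rough sketch of that ‘in motion’ idea (the field names and summary shape are illustrative assumptions) is a streaming step that emits only compact metadata and never persists the raw lines:

```python
# Sketch: process logs "in motion" – keep aggregate metadata, never store raw lines.
# Field names and the summary shape are illustrative assumptions.

from collections import Counter

def summarise(stream):
    """Consume a raw log stream once and retain only aggregate metadata."""
    counts = Counter()
    for line in stream:
        parts = line.split()
        service = parts[0] if parts else "unknown"
        level = "ERROR" if "ERROR" in line else "other"
        counts[(service, level)] += 1
        # The raw line goes out of scope here – nothing is written to storage.
    return [
        {"service": service, "level": level, "count": n}
        for (service, level), n in counts.items()
    ]
```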

Beyond storage, this huge data reduction exercise also reduces latency – helping your engine handle those tens of thousands of logs per second and still deliver a three-to-four second GenAI response. It also ties into the coming challenge of data sovereignty: with more and more companies concerned about data being moved and stored outside their home country, the less data you use and keep, the smaller that issue becomes.

Here, data classification – the process of identifying and categorizing sensitive data based on predefined criteria – has a vital role to play in helping organizations avoid sending too much data to AI tools unnecessarily (or, worse, wrongly). It will also, of course, give you a sense of what data you have to work with in the first place.
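
As a simple sketch of that gatekeeping role (the categories and patterns below are simplistic placeholders, not a real classification policy), a pre-filter can tag records against predefined criteria and hold back anything sensitive before it is sent to an AI tool:

```python
# Minimal data-classification sketch: tag records against predefined criteria
# and hold back anything sensitive before it reaches an AI tool.
# The categories and regex patterns are simplistic, illustrative assumptions.

import re

RULES = {
    "email":       re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key":     re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
}

def classify(record: str) -> set[str]:
    """Return the set of sensitive categories detected in a record."""
    return {label for label, pattern in RULES.items() if pattern.search(record)}

def safe_to_send(record: str) -> bool:
    # Only unclassified (non-sensitive) records are forwarded to the AI tool.
    return not classify(record)

if __name__ == "__main__":
    sample = ["cpu at 93% on node-7", "contact jane.doe@example.com for access"]
    print([r for r in sample if safe_to_send(r)])   # keeps only the first record
```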

AI Data Done Differently 

As data volumes continue to grow exponentially, the companies that thrive in the AI economy will be those that master the art of refinement—extracting maximum value from minimal data. This approach delivers a powerful combination of benefits: improved response times, reduced operational costs, enhanced sustainability, stronger data sovereignty, and better security posture.

By embracing this "AI data done differently" philosophy, organizations can position themselves at the forefront of the next generation of AI innovation, while simultaneously addressing some of today’s most pressing technological challenges.
