Seeing, hearing, speaking - envisioning VAST Data’s Thinking Machines roadmap
Building Databricks-like software and geo-site linking
Comments made by VAST Data co-founders, CEO Renen Hallak and CMO Jeff Denworth, indicate the company is going to develop data infrastructure software: data lake software that can help computing fulfil the Thinking Machines seeing, hearing and speaking vision.
This is a long way on from just storing bits of data in the clever way that VAST’s Universal Storage accomplishes. Let’s look at the statements the two have made, examine Thinking Machines’ ideas and then look at what they suggest.
Hallak’s thinking

In May last year Hallak told a writer from Protocol Enterprise that VAST wants to run its own data science platform, with a trajectory that he says will pit it against vendors like Databricks. He said: “We think that five years from now ... that infrastructure stack needs to be very different. It needs to enable AI supercomputers rather than the applications that we have in the past. … Vertical integration adds to simplicity. But more than that, it allows you to take full advantage of the underlying technology.” Hallak implied that VAST Data would seek to build most of the platform itself: “It would not be possible for us to just buy someone else … and strap it on top of our system. We always lean towards doing the critical parts ourselves. And if there are any peripherals that aren't as important … then maybe there would be room for acquisitions.” He said: “There is a massive opportunity to compile different data services into one product suite.”

Denworth’s prognostications
Denworth told a Computer Weekly writer in November last year: “In the next 20 years we’ll see a new class of application. It won’t be about transactional data, it won’t be about digital transformation. What we’ll see is computing coming to humans; seeing, hearing and analysing that natural data.”
He said: “We are conscious of not computing in one data center, and using unstructured and structured data. We are also conscious that data has gravity, but so also does compute when you get to the high end.”
This will mean a new computing framework with “very ambitious products” to be announced by VAST.
In January Denworth told The Next Platform that Thinking Machines “was a very bespoke supercomputing company that endeavored to make some really interesting systems” over time. He said: “That’s ultimately what we’re going to aim to make: a system that can ultimately think for itself.”
He added: “We realized that we could take that far beyond the classic definitions of a file system, but the realization was that the architecture that has the most intimate understanding of data can make the best decisions about what to do with that data. First by determining what’s inside of it. Second of all, by either moving the data to where the compute is or the compute to where the data is, depending upon what the most optimized decision is at any given time.”
Denworth talked with Brian Beeler on a Storage Review podcast, and said: “The next 20 years could [see] something that we call the natural transformation, or computers start to adapt to people.” He explained: “Our realisation is if you rethink everything at the infrastructure level, there are gains that can be realised higher up the stack that we will take the world to over the next couple years.”
And then he said: “Computers are definitely at a point where they can now do the sensory part of what humans could do before; they can see, they can hear, they can probably not smell so much, but understand natural information closer and closer to the way that humans understand it. And I think the leap again, from that to having thinking machines, may be a big one, maybe a smaller one. But once you get to a thinking machine, it's game over, you don't need anything beyond that.
“And so I think it's justified, that we're putting all of our resources at building infrastructure that is enabling that next wave. And I think we will be surprised at how far we can take this in terms of what's possible.”
He talked about organisations working in different parts of the stack: “We have, obviously the hardware vendors working on GPUs, we have vendors like us working on that middle of the sandwich, infrastructure, part and software, we have the application vendors working on life science, genomics, medical imaging, we have financial institutions, taking advantage of all types of information coming into their systems, it's really exciting.”
Data arrival is going to be driving activity: “I think things are getting flipped on their head if before you had an application, and it was reading data, either from memory or from storage, in order to manipulate it, and then it was writing the result that it understood to be the case, I think the more and more we're going to see data driven applications, the data itself as it flows into the system will trigger functions that need to be run on it based on different characteristics of that information.
“And then you'll have recursion of more and more functions that need to be run as a result of what we understand on this specific piece of information as we compare it to the rest of the data that we already have stored specifically with respect to GPUs.” He said: “I think the fact that we're called VAST data is a big clue. We are trying to build that next generation of data infrastructure.”
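Denworth’s picture of data-driven applications, where arriving data triggers functions based on its characteristics and results recursively trigger further functions, can be sketched as a simple dispatch loop. This is an illustrative assumption, not VAST’s actual design: the `on_arrival` registry, handler functions and item shapes below are all hypothetical stand-ins.

```python
# Hypothetical sketch of the "data-driven application" idea: arriving data
# triggers functions chosen by the data's characteristics, and each result
# can recursively trigger further functions. Not VAST's real architecture.
from typing import Callable

# Registry mapping a data characteristic (here, a 'kind' tag) to handlers.
handlers: dict[str, list[Callable[[dict], list[dict]]]] = {}

def on_arrival(kind: str):
    """Register a function to run when data of this kind arrives."""
    def register(fn):
        handlers.setdefault(kind, []).append(fn)
        return fn
    return register

def ingest(item: dict, log: list[str]) -> None:
    """Run every handler for the item's kind; handlers may emit derived
    items, which are ingested recursively (the 'recursion of functions')."""
    for fn in handlers.get(item["kind"], []):
        for derived in fn(item):
            ingest(derived, log)
    log.append(item["kind"])

@on_arrival("audio")
def transcribe(item):
    # Pretend speech-to-text: emits a derived 'text' item for further work.
    return [{"kind": "text", "payload": f"transcript of {item['payload']}"}]

@on_arrival("text")
def index_text(item):
    # Terminal step: nothing further to trigger.
    return []

log: list[str] = []
ingest({"kind": "audio", "payload": "call.wav"}, log)
print(log)  # the derived text item completes before the audio item
```

The point of the sketch is the inversion Denworth describes: no application reads the data; the data’s arrival drives which functions run.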
Denworth said that “over the next few years … people will see us expand in the storage space and get closer and closer to realising the true vision of universal storage: of our customers not needing to think about where they placed their data and how much access they have to it, and what can be done with it anymore.

“And in parallel, you'll see more and more not necessarily storage parts coming from us as well, based on feedback that we get from customers.”
VAST will “essentially work to help customers solve the whole of their data processing, machine learning, deep learning problem in a hybrid cloud world, in a way where we take the complexity of tiering and things like that as considerations and take them off the table. … And this seems to be becoming more and more popular, as people start to understand some of these natural language processing models, some of these new computer vision, or computer audio models. And so that's, that's pretty exciting. We've got a lot that we're doing with Nvidia.”
Note the overall repetition of references to computers seeing, hearing and speaking and to Thinking Machines and also to Databricks.
Thinking Machines and Databricks
Thinking Machines was a supercomputing company started in 1983 to build highly parallel systems using that era’s artificial intelligence technology. The aim was to chew through masses of data very much more quickly than serial computing could, and so arrive at decisions involving massive amounts of data in seconds or minutes instead of days or weeks. The company over-reached itself and crashed, with parts being bought by Sun Microsystems. Its architecture typically required a frontend server, with backend SPARC CPUs and vector processors.
In February last year Blocks & Files wrote: “Databricks enables fast SQL querying and analysis of data lakes without having to first extract, transform and load data into data warehouses. The company claims its “Data Lakehouse” technology delivers 9X better price performance than traditional data warehouses. Databricks supports AWS and Azure clouds and is typically seen as a competitor to Snowflake. … Databricks’ open source Delta Lake software is built atop Apache Spark.”
The VAST future
My conclusion is that VAST Data is going to build a data infrastructure layer vertically integrated with its existing storage platform. This layer will provide data lake capabilities and be able to initiate analytics processing itself: data, as it flows into the system, will trigger functions that need to be run on it.
VAST CTO Sven Breuner previously confirmed this, saying VAST will link customers’ separate VAST systems together: “It’s now time to start moving up the stack by integrating more layers around database-like features and around seamlessly connecting geo-distributed datacenters.”
We think that VAST will use a lot of Apache open source software: Spark (as Databricks does), Druid (as Imply does) and Kafka (as Confluent does).
VAST is looking at hearing, speech and vision applications and will use Nvidia hardware, such as the Grace and Hopper chip systems. I am sure that penta-level cell flash and the CXL bus will play a part in VAST’s roadmap.
It will present its IT infrastructure systems both on-premises, in geo-linked clusters, and in the public cloud, working to help customers solve the whole of their data processing, machine learning and deep learning problem in a hybrid cloud world. We think that VAST will not port its Universal Storage software to the public cloud. The CNode software could be ported easily, but the DNode structure (Storage Class Memory drive front end with NVMe QLC SSD backend drives) could be hard to replicate with the appropriate storage instances in the public cloud.
We think it’s likelier that there will be a VAST system installed in a public cloud, available to the CSP’s customers directly or indirectly.
My understanding is that VAST will announce its 10-year roadmap direction at an event later this year.