TileDB raises $4 million in Seed funding led by Nexus Venture Partners to reshape data management for next-generation data science

We are thrilled to announce that TileDB, Inc., an advanced storage and analytics platform, has raised $4 million in Seed funding to support continued technology development and marketing. A new approach is needed to organize, store, and access the massive quantities of structured data being generated in fields such as genomics, imaging, finance and IoT. The TileDB technology originated from research at Intel Labs and MIT to solve this problem for scientists, developers and businesses.

The funding was led by Nexus Venture Partners, with participation from Intel Capital and Big Pi Ventures. We are pleased to welcome Abhishek Sharma, Principal at Nexus Venture Partners, along with Sam Madden, Professor of Big Data and Computer Science at MIT, to our board.

“The explosion of data is posing challenges around how to optimally manage it. TileDB is a robust new technology that seeks to disrupt scalable data organization and analysis, in cloud and on-prem settings, for a wide variety of big data applications. We have had a great partnership with the TileDB team for the past eighteen months and we are excited to continue to work with them and share their vision.” — Abhishek Sharma, Principal, Nexus Venture Partners

The Problem
Database management systems (DBMS) have been the de facto solution for managing data for decades, and SQL is the most popular language to query and analyze data. Users would typically ingest and access/process their data exclusively inside the DBMS.

The advent of “big data” transformed the data management landscape. Ingestion into traditional databases increasingly became a bottleneck, and these systems could not handle the variety of data and analytics that users wanted to run. Data scientists started to use a wide range of tools in addition to a database, such as popular Python/R libraries, Spark, TensorFlow, and more. Enormous volumes of data called for cheaper and more scalable ways to provide data storage and access, pushing users and companies towards cloud storage solutions (e.g., AWS S3). All these factors motivated the decoupling of storage from processing in the database, and led to storing data in formats that can be accessed and operated on by multiple different computational libraries and systems. Over time, new SQL engines were developed to query and process this externally stored data directly, while established databases evolved to adapt to this new reality.

Popular formats like Parquet or ORC were developed to store data in an efficient way that multiple “big data” systems could understand. Unfortunately, these formats were designed around write-once relational data and single-dimensional storage and access. They were not designed to model the multidimensional array data ubiquitous in scientific applications, where data is ordered over more than one dimension. Moreover, these formats were not created with the limitations of cloud object storage in mind, which forces them to rely on external services for handling important metadata and to deal with the implications of eventually consistent storage backends. Traditional scientific data formats for multidimensional data storage (including HDF5) are poorly adapted to the new cloud storage landscape and do not generalize to the “sparse” data being generated in vast quantities in genomics, time series, and geospatial applications. Most importantly, binary data formats themselves do not address fundamental issues around data management, such as access control, logging and auditing. Users who need these features have to build them into their higher-level applications.

It is clear that a new storage solution is needed for users generating and operating on massive quantities of structured data. Such a solution must be efficient on both on-prem and cloud architectures, must offer data management features such as access control and logging, and must integrate efficiently with the rapidly changing data science ecosystem.

What is TileDB?
TileDB offers a new, modern solution to structured data storage. It relies on the fact that structured data can be modeled as either dense or sparse multi-dimensional arrays (where sparsity indicates that the majority of array domain values are undefined). From genomics and imaging to sensing, tabular and time series data, arrays can capture the majority of today’s big data workloads. TileDB is fundamentally a sparse and dense array data storage engine. It adopts the best ideas from columnar and spatial database research and introduces a novel format supporting parallelism, fast/concurrent updates and excellent compression. TileDB’s design, which leverages immutability and log-structured updates, is well suited to cheap and scalable cloud storage backends such as AWS S3.
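
To make the ideas of sparse arrays and immutable, log-structured updates more concrete, here is a minimal toy sketch in plain Python. It is not the TileDB format or API: the names `Fragment`, `ToyArray`, `write`, `read`, `consolidate` and `to_dense` are hypothetical and chosen only for illustration. The assumption is simply that each write produces an immutable batch of cells that is never modified in place, and that a read returns the value from the most recent batch containing the requested coordinate.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]  # (row, col) coordinate in a 2D array domain


class Fragment:
    """Hypothetical immutable batch of cells produced by one write."""

    def __init__(self, cells: Dict[Coord, float]) -> None:
        # Copy the input so later changes by the caller cannot alter the batch.
        self._cells = dict(cells)

    def get(self, coord: Coord) -> Optional[float]:
        return self._cells.get(coord)

    def cells(self) -> Dict[Coord, float]:
        return dict(self._cells)


class ToyArray:
    """Toy sparse 2D array: only defined cells are stored, as (coord -> value)."""

    def __init__(self) -> None:
        self._fragments: List[Fragment] = []

    def write(self, cells: Dict[Coord, float]) -> None:
        # An update appends a new batch; existing batches are never rewritten.
        self._fragments.append(Fragment(cells))

    def read(self, coord: Coord) -> Optional[float]:
        # The newest batch wins; undefined cells return None.
        for fragment in reversed(self._fragments):
            value = fragment.get(coord)
            if value is not None:
                return value
        return None

    def consolidate(self) -> None:
        # Merge all batches (oldest first, so newer values overwrite older ones).
        merged: Dict[Coord, float] = {}
        for fragment in self._fragments:
            merged.update(fragment.cells())
        self._fragments = [Fragment(merged)]

    def fragment_count(self) -> int:
        return len(self._fragments)

    def to_dense(self, rows: int, cols: int, fill: float = 0.0) -> List[List[float]]:
        # Materialize a dense view in which every cell in the domain has a value.
        dense = [[fill] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                value = self.read((r, c))
                if value is not None:
                    dense[r][c] = value
        return dense


arr = ToyArray()
arr.write({(0, 0): 1.0, (2, 3): 5.0})  # first write: two defined cells
arr.write({(2, 3): 7.0})               # update arrives as a second batch
latest = arr.read((2, 3))              # value from the newer batch
missing = arr.read((1, 1))             # this cell was never written
dense_view = arr.to_dense(3, 4)
```

Because batches are only ever appended, a write never has to modify an existing object in place, which is one reason this style of update is a natural fit for object stores where objects are written whole. The dense view shows the trade-off with sparsity: a 3x4 domain needs 12 stored values, while the sparse form stores only the two defined cells.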

TileDB offers high-level APIs to analysis environments such as Python, R, and Spark. It allows executing SQL queries by interfacing with query engines such as PrestoDB. TileDB functions as an efficient substrate for these environments to store, access, and compute on multi-dimensional array data. Through TileDB, advanced analytics can be scaled out to solve larger problems.

TileDB handles all access control and logging at the storage level. This enables data management, security and access control features of a traditional DBMS to be shared across the applications that access data via the TileDB storage engine, eliminating the need for re-implementing those features at the application layer.

“TileDB provides a unique combination of high performance reads for analytics workloads with excellent update performance. Coupled with its generic format for structured data and rich APIs, it can be used in a large variety of applications running both on-prem and in the cloud.” — Sam Madden, Professor of Computer Science at MIT, Co-Founder of Vertica and Cambridge Mobile Telematics.

History and vision
TileDB started its life as a research project under the umbrella of the Intel Science and Technology Center (ISTC) for Big Data at MIT. It was a project to scratch an itch. The original goal was to perform computations on large sparse arrays, but there was no existing off-the-shelf solution to persist and query large sparse datasets. From that point onwards, TileDB was generalized into a system for storing any data that can be modeled as sparse or dense arrays. It was used to achieve state-of-the-art performance for array workloads and to accelerate genomic variant access at the Broad Institute. After the sunset of the ISTC, TileDB, Inc. was spun out as an independent entity to continue developing the TileDB open source project.

Over the past 18 months we have been hard at work, transitioning the project from a research prototype to production software, adapting TileDB to work with cloud object store backends, and adding many exciting features. Along the way the company has attracted talented team members with diverse backgrounds who help realize TileDB’s nascent potential.

With this new round we hope to build out a solution to enhance cloud interoperability, compute on TileDB data more easily, and improve areas such as data consistency, access control and sharing. We plan to deliver a service to enable users and organizations to easily share and monetize access to their data, and to provide a platform for unique capabilities such as serverless computation and distributed SQL queries as a service.

Try TileDB today!
The TileDB library is open-sourced under the permissive MIT license, and you can get up and running with examples quickly. A growing list of high-level APIs is available, including Python, R, Java and Go. Several optimized connectors to scalable SQL engines such as PrestoDB and Spark SQL, which push operations down to the TileDB storage engine, are in development. This makes it easy to query and do out-of-core computation on your data with SQL, Python, R, Spark, C, C++ or any combination of the above in a familiar environment.

Today marks an important milestone in the trajectory of the company. We are now far from the early days of an academic research prototype and are entirely focused on building out exciting new features and capabilities to empower our users and customers. We hope to use this blog to share some of the technical aspects of array data storage and management.

We’re hiring!
We are looking to hire highly skilled and passionate individuals. For those interested in the ideas and ecosystem surrounding TileDB, apply today at https://tiledb.workable.com.

About Nexus Venture Partners
Nexus Venture Partners is a leading early stage venture capital firm that operates as a single team with operations in the US and India. With decades of experience in building and funding globally leading companies, Nexus’ footprint in the world’s two leading markets positions it uniquely with global insights and the ability to serve entrepreneurs. The Nexus family includes H2O.ai, Druva, Headspin, Aryaka, Kaltura, Postman, Pubmatic, Infoworks, Cloud.com, Minio, Mezi, Quandl, Gluster, Biz2Credit, Rancher, Helpshift, Clover, Delhivery, Snapdeal, Shopclues, OLX, and Unacademy. For more information, visit https://nexusvp.com/.

About Intel Capital
Intel Capital invests in innovative startups targeting artificial intelligence, autonomous vehicles, datacenter and cloud, 5G, next-generation compute and a wide range of other disruptive technologies. Since 1991, Intel Capital has invested US $12.3 billion in 1,544 companies worldwide, and more than 660 portfolio companies have gone public or participated in a merger. Intel Capital curates thousands of business development introductions each year between its portfolio companies and the Global 2000. For more information on what makes Intel Capital one of the world’s most powerful venture capital firms, visit www.intelcapital.com or follow @Intelcapital.

About Big Pi Ventures
Big Pi is a new early-stage venture capital firm based in Luxembourg and Greece. The team consists of successful technology entrepreneurs and seasoned investors, with local and international experience. They have been involved in most of the success stories of the Greek ecosystem, such as Beat, Workable, Upstream and Persado. Big Pi Ventures invests in science-based and deep-tech ventures that use Greek talent to capture a global market. For more information, visit https://bigpi.vc/.

Note: This post was originally published on the TileDB blog.