Trino: the open source query engine that split from Facebook


Internet users generate 2.5 quintillion bytes of data every day. Are your organization’s data management tools up to scratch? Trino, an open source distributed SQL query engine, can give you better data processing and analysis. This can speed up your queries. Maybe you’ve heard of Trino but want to learn more before you switch.

Is Trino a database? Is it OLAP? How does it improve query processing? What should I know about Trino? Learn more here about Trino and how it can improve query performance.

What is Trino?

Trino is an open source distributed SQL query engine. Engineers designed Trino for ad hoc and batch ETL queries on multiple types of data sources.

Trino supports both relational and non-relational sources. Trino can handle standard and semi-structured data types.

Some people mistakenly think that Trino is a database. You’re using Trino to run SQL on data, but that doesn’t make it a database. Trino does not actually store any data.

The split between Trino and Facebook's Presto

Trino and Presto are closely related. Trino parted ways with Facebook’s Presto project. Facebook engineers developed Presto to process the petabytes of data that Facebook was trying to analyze. The creators wanted Presto to remain open source and community-based. Facebook wanted more control, which caused a split.

Facebook applied for a trademark on the name Presto, so the original creators renamed their project Trino. Since Trino parted ways with Presto, the functionality of the two systems has begun to diverge.

How does Trino speed up queries?

Trino started as a way to manage the incredibly large data sets that Facebook needed to analyze. It can process such queries faster than many other engines. Several factors contribute to this speed.

Three-part architecture

The Trino architecture is similar to massively parallel processing (MPP) databases. A coordinator node manages multiple worker nodes to process all the work.

A user submits a SQL statement, which goes to the coordinator. The coordinator parses, plans, and schedules a distributed query.

This partitions the data into smaller chunks to distribute across nodes. When blocks of data arrive at a particular machine, the machine processes them in parallel. Processing occurs on multiple threads within a particular node.
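The coordinator and worker pattern described above can be sketched in a few lines of Python. This is only a toy model, not Trino's implementation: the function names, the split size, and the use of a thread pool to stand in for worker nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, split_size):
    """Coordinator role: cut the input into fixed-size chunks ("splits")."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def worker_partial_sum(split):
    """Worker role: compute a partial aggregate over a single split."""
    return sum(split)


def run_query(rows, split_size=1000, workers=4):
    """Toy equivalent of SELECT sum(x): fan out the splits, then merge partials."""
    splits = make_splits(rows, split_size)
    # The thread pool is a stand-in for worker nodes processing splits in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    # Final stage: combine the partial results into one answer.
    return sum(partials)


total = run_query(list(range(10_000)))
```

The point of the sketch is the shape of the work: one component splits and schedules, many components compute partial results, and a final step merges them.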

Trino supports standard ANSI SQL, including complex queries, joins, aggregations, and outer joins. Users can also perform more complex operations such as JSON and MAP transformations and parsing.

Reduced latency

One reason for Trino’s speed is that it does not rely on checkpointing and similar fault tolerance methods. Fault tolerance adds resiliency, but it also adds a large amount of latency.

The removal of the fault tolerance requirement is a major change from older big data systems. This makes Trino ideal for queries where the cost of recovering from a failure is less than the cost of checkpoints.
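A rough back-of-the-envelope model can make this trade-off concrete. The sketch below is a deliberate simplification under stated assumptions: failed attempts are independent and each costs the full runtime, checkpointing adds a fixed fractional overhead, and recovery from a checkpoint is free. The numbers are made up for illustration and are not measurements of Trino.

```python
def expected_time_rerun(runtime_s, failure_prob):
    """Expected time when a failed query is simply rerun from the start.

    Pessimistic simplification: every failed attempt costs the full runtime
    and attempts fail independently, so the number of attempts is geometric
    with mean 1 / (1 - failure_prob).
    """
    if not 0 <= failure_prob < 1:
        raise ValueError("failure_prob must be in [0, 1)")
    return runtime_s / (1.0 - failure_prob)


def expected_time_checkpointed(runtime_s, checkpoint_overhead):
    """Expected time when checkpoints add a fixed fractional overhead.

    Optimistic simplification: recovering from a checkpoint is treated as free.
    """
    return runtime_s * (1.0 + checkpoint_overhead)


# Hypothetical short interactive query: 30 s runtime, 1% chance of failure.
short_rerun = expected_time_rerun(30, 0.01)
short_checkpointed = expected_time_checkpointed(30, 0.20)

# Hypothetical long job: 1 hour runtime, 30% chance of failure.
long_rerun = expected_time_rerun(3600, 0.30)
long_checkpointed = expected_time_checkpointed(3600, 0.20)
```

Under these assumptions the short query is cheaper to rerun than to checkpoint, while the long, failure-prone job is not. That is the kind of workload boundary the paragraph above describes.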

Pushdown

Trino can push query processing to the connected data source. The operation goes to the source system where custom indexes on the data already exist.

Pushdown improves overall query performance. It reduces the amount of data read from storage files. Trino can use several types of pushdown, including:

  • Predicate

  • Projection

  • Dereference

  • Aggregation

  • Join

  • Limit

  • Top-N

These forms of pushdown reduce network traffic between Trino and the data source. They also reduce the load on the remote data source. Pushdown support depends on each connector and the underlying database or storage system.
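The effect of predicate pushdown on data transfer can be illustrated with a small, self-contained Python example. Here an in-memory SQLite database stands in for the remote data source; the table, columns, and function names are hypothetical, and the code does not use Trino or any Trino connector.

```python
import sqlite3


def build_source():
    """Create an in-memory table standing in for a remote data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    rows = [(i, "EU" if i % 10 == 0 else "US", float(i)) for i in range(1000)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


def without_pushdown(conn):
    """Fetch every row from the source, then filter on the engine side."""
    fetched = conn.execute("SELECT id, region, amount FROM orders").fetchall()
    result = [row for row in fetched if row[1] == "EU"]
    return len(fetched), result


def with_pushdown(conn):
    """Send the predicate to the source so only matching rows are transferred."""
    fetched = conn.execute(
        "SELECT id, region, amount FROM orders WHERE region = ?", ("EU",)
    ).fetchall()
    return len(fetched), fetched


source = build_source()
moved_all, rows_a = without_pushdown(source)
moved_filtered, rows_b = with_pushdown(source)
```

Both functions return the same result rows, but the second moves a tenth of the data out of the source. The source can also use its own indexes to evaluate the filter.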

Cost-based optimizer

Trino’s cost-based optimizer (CBO) uses table and column statistics to create the most efficient query plan. It takes into account the three main factors that contribute to the duration of a query:

  1. CPU time

  2. Memory requirements

  3. Network bandwidth usage

The CBO has to balance these competing demands. You can really only optimize for one of these priorities at a time, so the CBO creates and compares different variants of a query execution plan to find the option with the lowest overall cost.
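The idea of generating plan variants and comparing their estimated cost can be shown with a toy join-order search. This sketch is not Trino's optimizer. The row counts and join selectivities are invented statistics, and the cost is a single proxy (the total size of intermediate results) rather than separate CPU, memory, and network estimates.

```python
from itertools import permutations

# Hypothetical table statistics: row counts only.
ROW_COUNTS = {"orders": 1_000_000, "customers": 50_000, "nations": 25}

# Hypothetical join selectivities: fraction of the cross product that survives.
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "nations"}): 1 / 25,
    frozenset({"orders", "nations"}): 1.0,  # no join condition: a cross join
}


def plan_cost(order):
    """Sum the estimated sizes of all intermediate results of a left-deep plan."""
    joined = [order[0]]
    rows = ROW_COUNTS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Apply every join condition linking the new table to those already joined.
        selectivity = 1.0
        for previous in joined:
            selectivity *= SELECTIVITY[frozenset({previous, table})]
        rows = rows * ROW_COUNTS[table] * selectivity
        cost += rows
        joined.append(table)
    return cost


def cheapest_plan(tables):
    """Enumerate every join order and keep the one with the lowest estimate."""
    return min(permutations(tables), key=plan_cost)


best = cheapest_plan(["orders", "customers", "nations"])
```

With these made-up statistics, the cheapest plans join the two small tables first and bring in the large `orders` table last. Joining `orders` to `nations` first would create a large intermediate result. A real optimizer prunes the search space instead of enumerating every order.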

Agile approach to data access

Trino manages storage and computing separately. It works easily in cloud environments. The Trino cluster does not store your data, so it can scale automatically based on load without losing data.

Use cases for Trino

Trino is an online analytical processing (OLAP) system. Trino extends the traditional OLAP data warehouse solution by running as a query engine for a data lake or data mesh.

You can run queries interactively against various data sources. Unlike a traditional data warehouse, you don’t need to move data in advance.

Trino’s power and flexibility make it well suited for many use cases. You can use it for all of them or to solve one particular problem. As Trino users in your organization gain experience with its benefits and features, you’ll likely discover other uses as well.

Ad hoc queries and reports

End users can use SQL to run ad hoc queries where the data resides. You don’t need to move the data to a separate system. You can quickly access the datasets analysts need.

You can query data from many sources to create reports and dashboards for business intelligence. Data scientists and analysts can build queries without having to rely on data operations engineers.

Data lake analysis

A common use case for Trino is to directly query data on a data lake without requiring transformation. You can query structured or semi-structured data from multiple sources. This streamlines the process of creating operational dashboards.

Trino can use the Hive connector against HDFS and object storage systems. You can get SQL-based analytics on your data lake, however it stores the data.

Batch ETL

Trino is a great engine for your batch extract, transform, and load (ETL) queries. It can quickly process a large amount of data. It can read data from different sources in place, without always needing to extract it first from systems like MySQL.

An ETL via Trino is a standard SQL statement. ETL is easy to implement. End users can perform other ad hoc transformations.

The extensive Trino connector framework means that any connector can be the source of an ETL. Most connectors can also be a sink.
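To show what "an ETL as a single SQL statement" looks like, here is a minimal sketch in Python. It uses SQLite rather than Trino: two attached in-memory databases stand in for two catalogs, one acting as the source and one as the sink. The database, table, and column names are hypothetical.

```python
import sqlite3


def run_etl():
    """Extract from one 'catalog', aggregate, and load into another in one statement."""
    conn = sqlite3.connect(":memory:")
    # Two attached in-memory databases stand in for a source and a sink catalog.
    conn.execute("ATTACH DATABASE ':memory:' AS source_db")
    conn.execute("ATTACH DATABASE ':memory:' AS lake_db")
    conn.execute(
        "CREATE TABLE source_db.orders (id INTEGER, region TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE lake_db.revenue_by_region (region TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO source_db.orders VALUES (?, ?, ?)",
        [(1, "EU", 10.0), (2, "US", 5.0), (3, "EU", 2.5)],
    )
    # The whole ETL step: read from the source, transform, write to the sink.
    conn.execute(
        """
        INSERT INTO lake_db.revenue_by_region
        SELECT region, SUM(amount) FROM source_db.orders GROUP BY region
        """
    )
    rows = conn.execute("SELECT region, total FROM lake_db.revenue_by_region")
    return dict(rows)


result = run_etl()
```

In Trino the same pattern would be written against catalog-qualified table names, with the connector for each catalog acting as the source or the sink.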

Improve query performance with Trino

If you’re weighing Trino against Presto, you’re considering two state-of-the-art query engines. Several features of Trino can improve the performance of your queries.

The Trino architecture uses massively parallel processing. Reduced latency, pushdown, and the cost-based optimizer also speed up the query lifecycle.

