Big Data

Converting Hive UDTFs to BigQuery: A Practical Guide with BigQuery SQL and UNNEST

Published on
February 6, 2024
Author
Csaba Kassai
Head of Data

In the evolving landscape of cloud data warehousing, migrating from Apache Hive to Google BigQuery presents numerous challenges and opportunities, especially when it comes to user-defined functions. While previous posts in this series have delved into migrating Hive UDFs and UDAFs to BigQuery, this installment focuses on User-Defined Table Functions (UDTFs) and how they can be transitioned using BigQuery's UDFs and the UNNEST operator.

The challenge with Hive UDTFs

Just to recap: Hive UDTFs are functions that produce multiple rows of output for each input row, an essential feature for transforming and expanding datasets in flexible ways. For example, a UDTF might take a single row with a delimited string and expand it into multiple rows with individual split elements. You can find more information about the different types of Hive custom functions and their migration approach here.

The challenge is that BigQuery doesn't directly support UDTFs the way Hive does. However, by combining a native or User-Defined Function (UDF) with the UNNEST operator, similar functionality can be achieved, allowing rows to be transformed and expanded within BigQuery's SQL environment.

Hive UDTF Example: Split String with Index

This UDTF takes a string and a delimiter as input and outputs each element of the split string along with its index in the original string.

You can find the full Java class here. (Disclaimer: I have not tested this Hive function; it is for demonstration purposes only.)

You can use this function in a Hive query like this:
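A sketch of how such a UDTF is typically registered and invoked in Hive — note that the jar path, Java class name, table, and column names below are all assumptions for illustration:

```sql
-- All names here are hypothetical, for illustration only.
ADD JAR /path/to/split-string-udtf.jar;
CREATE TEMPORARY FUNCTION split_string_with_index
  AS 'com.example.SplitStringWithIndexUDTF';

-- LATERAL VIEW applies the UDTF to each row, producing one output row
-- per element of the delimited string, together with its index.
SELECT id, part, idx
FROM my_table
LATERAL VIEW split_string_with_index(csv_column, ',') exploded AS part, idx;
```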

Moving to the BigQuery side

BigQuery offers a powerful alternative through SQL UDFs and the UNNEST operator. By combining these features, we can mimic the row-expanding capabilities of Hive UDTFs. Migrating the Hive UDTF described above involves a two-step process:

Step 1: Replicating Functionality with BigQuery UDF

We start by creating a SQL UDF that splits the input string and retains the index of each element:
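A minimal version of such a UDF might look like the sketch below. The dataset name `mydataset` is an assumption; the core of the technique is `SPLIT` plus `WITH OFFSET`, which exposes each array element's position:

```sql
-- Dataset name `mydataset` is an assumption for illustration.
CREATE OR REPLACE FUNCTION mydataset.SPLIT_STRING_WITH_INDEX(
  input STRING, delimiter STRING)
RETURNS ARRAY<STRUCT<part STRING, index INT64>>
AS (
  ARRAY(
    -- WITH OFFSET exposes each element's zero-based position in the array.
    SELECT AS STRUCT part, idx AS index
    FROM UNNEST(SPLIT(input, delimiter)) AS part WITH OFFSET AS idx
  )
);
```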

This UDF, SPLIT_STRING_WITH_INDEX, divides the input string into an array of structs, each comprising the split element (part) and its corresponding index. You can find the full Terraform version here.

Step 2: Expanding the Array into Rows

With our UDF in place, the next step involves expanding the resulting array back into rows, akin to the original UDTF's functionality:
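Assuming an input table with an id column and a delimited string column (table and column names are illustrative), the expansion looks like this:

```sql
-- Table and column names are assumptions for illustration.
SELECT
  t.id,
  s.part,
  s.index
FROM mydataset.my_table AS t
CROSS JOIN UNNEST(mydataset.SPLIT_STRING_WITH_INDEX(t.csv_column, ',')) AS s;
```

Each element of the array returned by the UDF becomes its own output row, joined back to the columns of the originating input row.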

Here, UNNEST combined with CROSS JOIN unfolds the array produced by our UDF into individual rows, each containing the element and its index. The result will look like this; note how 7 output rows are generated from the 2 input rows:
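For example, assuming (for illustration) two input rows whose delimited columns are 'a,b,c' (id 1) and 'd,e,f,g' (id 2), the query yields 7 rows:

| id | part | index |
|----|------|-------|
| 1  | a    | 0     |
| 1  | b    | 1     |
| 1  | c    | 2     |
| 2  | d    | 0     |
| 2  | e    | 1     |
| 2  | f    | 2     |
| 2  | g    | 3     |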

Next Up: Migrating to BigQuery Remote Functions

Migrating from Hive to BigQuery requires rethinking how we approach custom functions like UDTFs. By leveraging BigQuery's SQL UDFs and UNNEST, we can successfully replicate the functionality of Hive UDTFs, ensuring a smooth transition to BigQuery's scalable and efficient data warehousing environment.

As we continue to navigate the nuances of transitioning from Hive to BigQuery, our journey into custom function migration is far from over. In the next installment of this series, we will explore BigQuery Remote Functions. I'll provide a working example to demonstrate how we can replace more complex Hive UDFs with BigQuery's powerful remote function capabilities, offering insights into leveraging cloud-based solutions for even the most intricate data processing needs. Stay tuned for more practical solutions and expert tips to enhance your data migration strategy.
