One of the first tasks during a Hive to BigQuery Data Warehouse migration project is to transfer Hive’s custom functions, including UDFs (User-Defined Functions), UDTFs (User-Defined Table-Generating Functions), and UDAFs (User-Defined Aggregate Functions), to the BigQuery side. Since there are significant differences between Hive and BigQuery in this area, this migration step is far from trivial.
In this post, I share battle-tested strategies and best practices for this crucial process.
What are UDFs, UDAFs, and UDTFs in Hive?
Apache Hive enhances its querying capabilities with custom user-defined functions. These functions allow you to extend Hive's standard functionality to meet specific, custom data processing requirements. In Hive, these custom functions are typically written in Java by implementing specific interfaces and abstract classes.
The main differentiator among these functions is the cardinality of the input and output rows.
UDF (one input row, one output row): for example, a function that takes a timestamp column and extracts the hour component, or one that converts text to uppercase.
UDAF (many input rows, one output row): for example, a function that calculates the median of a set of numerical values, or a custom aggregation that combines various columns into a single weighted score.
UDTF (one input row, many output rows): for example, a function that takes a column containing comma-separated values and splits it into multiple rows, one for each value.
What does BigQuery offer for adding custom functionality?
When it comes to UDFs, Google BigQuery provides a versatile and powerful platform for adding custom functionality and adapting to various data processing needs. Unlike Hive, where custom functions are predominantly Java-based, BigQuery offers three primary ways to extend its capabilities: SQL UDFs, JavaScript UDFs, and Remote Functions backed by Cloud Functions or Cloud Run. Remote Functions in particular come with notable trade-offs:
Flexibility: you can implement the function in any language supported by Cloud Run or Cloud Functions
Standard maintenance: your function is a REST API backed by a microservice. You can test it, set up CI/CD, and so on, just as you would with any normal microservice
ARRAY, STRUCT, INTERVAL, and GEOGRAPHY types are not supported out of the box
One more moving part: you need to add Cloud Functions or Cloud Run to your architecture, with everything that entails
Extra cost: you will need to pay the extra cost of Cloud Run or Cloud Functions
With the right setup, all of these options can be used seamlessly from SQL, the same way as you would use a native BQ function.
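As a concrete illustration, a simple Hive UDF that upper-cases text could be replaced by a BigQuery JavaScript UDF. The logic below is what would go inside the body of a `CREATE TEMP FUNCTION ... LANGUAGE js` statement; the function name is illustrative, not taken from any real project:

```javascript
// Sketch of the logic for a BigQuery JS UDF that upper-cases text.
// In BigQuery, the body of the function would be wrapped as:
//   CREATE TEMP FUNCTION to_upper(s STRING) RETURNS STRING LANGUAGE js
//   AS "if (s === null) return null; return s.toUpperCase();";
function toUpper(s) {
  // BigQuery passes SQL NULL as JavaScript null; preserve it.
  if (s === null || s === undefined) return null;
  return s.toUpperCase();
}
```

The same logic could equally live in a SQL UDF or behind a Remote Function; this JS form is just the most direct drop-in for procedural Hive UDF code.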
UDAFs and UDTFs
At the time of writing, BigQuery does not give you the ability to implement custom UDAFs or UDTFs, so if you have any of these custom functions on the Hive side, you will need to be a little creative to replace them.
Which option to use?
During our migration projects, the decision tree below serves as a guide to identify the most suitable path for migrating Hive UDFs to BigQuery, considering the capabilities and limitations of BigQuery in comparison to Hive's custom functions. The goal is to ensure that all necessary transformations and computations performed by the original Hive UDFs are preserved and optimized in the new BigQuery environment.
The journey begins with the fundamental question of whether the UDF is needed in the new BigQuery data warehouse. We check during the assessment if the function is used in any transformation that will be migrated to the BQ side. If it's not, we're met with a simple resolution—no migration necessary, and we can bask in the simplicity of this outcome. However, if the UDF is indeed needed, we next explore BigQuery's native function library to see if there's an existing feature that can replicate our Hive UDF's capabilities. When a native BigQuery function is available, we adopt it, embracing BigQuery’s built-in efficiency.
In instances where BigQuery doesn't offer a native alternative, we delve into the type of function we're dealing with. For a standard UDF, we evaluate whether it's feasible to reimplement it in BigQuery SQL; if so, we proceed with this SQL-centric approach. If not, we turn to Google Cloud's serverless solutions and leverage the BQ Remote UDFs feature. This option is appealing because we can use Java and keep the core function code as is. If, for some reason, Remote Functions are not available in our case, we can always fall back to using JS UDFs.
When it comes to UDAFs, the decision hinges on the volume of data—specifically, whether the aggregation operates within a bounded scope. For manageable data groupings, we can craft custom aggregations using BigQuery's ARRAY_AGG function. For more unwieldy aggregations, we may need to refactor our approach entirely or shift processing to Google Cloud Dataflow or Dataproc, ensuring scalability and performance.
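For example, a Hive median UDAF can often be emulated by collecting each group's values with ARRAY_AGG and handing the resulting array to a function like the one below (a sketch; the function name is illustrative). This only works while each group's array fits in memory, which is exactly the bounded-scope condition:

```javascript
// Median over an array of numbers, as a BigQuery JS UDF might compute it
// after the values were gathered per group with ARRAY_AGG(...).
function median(values) {
  const nums = values
    .filter((v) => v !== null)   // ignore SQL NULLs within the group
    .map(Number)
    .sort((a, b) => a - b);
  if (nums.length === 0) return null;
  const mid = Math.floor(nums.length / 2);
  // Odd count: middle element; even count: mean of the two middle elements.
  return nums.length % 2 === 1 ? nums[mid] : (nums[mid - 1] + nums[mid]) / 2;
}
```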
Lastly, for UDTFs, if we would like to stay in the realm of SQL, the path is straightforward: we transition the function to generate elements as an array, utilizing BigQuery’s UNNEST function to flatten the arrays into multiple rows. If this approach does not work, we can always go back to using Cloud Dataflow or Dataproc to implement our functionality.
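As a sketch of that UDTF replacement for the comma-separated-values example, the row-generating half becomes a function returning an array, which UNNEST then flattens, e.g. `SELECT v FROM t, UNNEST(split_csv(t.col)) AS v` (names illustrative). The array-producing logic could be a JS UDF like:

```javascript
// Turns a comma-separated string into an array of trimmed, non-empty values.
// In BigQuery this would be a JS UDF returning ARRAY<STRING>, with UNNEST
// turning the array elements into rows.
function splitCsv(value) {
  if (value === null || value === undefined) return [];
  return value
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

For this simple case, BigQuery's native SPLIT function would actually suffice; a custom function is only warranted for more elaborate row-generation logic.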
This decision tree not only aids in methodically migrating Hive's custom functions to BigQuery but also ensures that each step taken is in alignment with BigQuery’s optimal practices and architecture, guaranteeing a smooth and efficient transition.
And that's a wrap on our guide to shifting Hive's custom functions over to BigQuery. It’s not exactly a walk in the park, but with the roadmap we've laid out, it's definitely doable. Think of the decision tree as your trusty GPS, helping you navigate the twists and turns of migration without getting lost.
In the posts to come, I'll be breaking down real migration stories—no fluff, just the straight goods on how these tools work in the trenches. We’ll share the hits, the misses, and the hacks that made each migration a win.
So, if you're gearing up for the big move to BigQuery, stay locked in. We've got the insider tips that’ll help you cut through the complexity and make the most of Google Cloud's powerhouse.
Aliz is a proud Google Cloud Partner with specializations in Infrastructure, Data Analytics, Cloud Migration and Machine Learning. We deliver data analytics, machine learning, and infrastructure solutions, off the shelf, or custom-built on GCP using an agile, holistic approach.