Often there is a need to check and monitor data quality in your data warehouse. Since most of our projects use BigQuery as the DWH solution, our requirements for a data quality toolkit were as follows:
- BigQuery should be supported.
- Quality checks should be flexible. It’s preferable to have an option to define rules in SQL.
- We need open, detailed, historical quality results that can be used in third party integrations.
- Alerting should not be limited to email notifications; it should be possible to build automation on top of alerts.
- We also need some dashboarding features.
This case study uses Dataplex data quality tasks to fulfill these requirements.
Architecture
Dataplex data quality tasks use the open-source CloudDQ engine under the hood. Validation results are stored in a target BigQuery dataset so that they can be easily accessed.
Dashboards in Data Studio can be used to visualize the results. Cloud Monitoring can send alerts to various channels, including email for employee notifications and Pub/Sub for automation.
CloudDQ config
We used a subset of the publicly available NYC taxi trips dataset.
CloudDQ uses YAML config files to define validation rules.
Please check the reference documentation.
In this example, we created a config file to find the following issues (a minimal config sketch follows the list):
- Taxi trips where the pick-up date was later than the drop-off date.
- Journeys where the fare was zero or a negative number.
- Records where the pick-up location equaled the drop-off location, but the trip distance was not zero.
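As a minimal sketch, such a config could look like the YAML below. The entity points at a hypothetical `my-project.dq_demo.taxi_trips` copy of the public data; the entity, rule, and binding names are ours, and only the column names follow the public NYC taxi trips schema.

```yaml
# Sketch of a CloudDQ config; project/dataset/table names are placeholders.
entities:
  TAXI_TRIPS:
    source_database: BIGQUERY
    project_name: my-project   # hypothetical project
    dataset_name: dq_demo      # hypothetical dataset holding the subset
    table_name: taxi_trips
    columns:
      PICKUP_DATETIME:
        name: pickup_datetime
        data_type: TIMESTAMP
      FARE_AMOUNT:
        name: fare_amount
        data_type: NUMERIC

row_filters:
  NONE:
    filter_sql_expr: "True"

rules:
  PICKUP_BEFORE_DROPOFF:
    rule_type: CUSTOM_SQL_EXPR
    params:
      custom_sql_expr: pickup_datetime <= dropoff_datetime
  POSITIVE_FARE:
    rule_type: CUSTOM_SQL_EXPR
    params:
      custom_sql_expr: fare_amount > 0
  CONSISTENT_DISTANCE:
    rule_type: CUSTOM_SQL_EXPR
    params:
      custom_sql_expr: NOT (pickup_location_id = dropoff_location_id AND trip_distance <> 0)

rule_bindings:
  # A binding attaches rules to an entity; CloudDQ requires a column
  # binding even for row-level custom SQL expressions.
  T1_PICKUP_BEFORE_DROPOFF:
    entity_id: TAXI_TRIPS
    column_id: PICKUP_DATETIME
    row_filter_id: NONE
    rule_ids:
      - PICKUP_BEFORE_DROPOFF
  T2_POSITIVE_FARE:
    entity_id: TAXI_TRIPS
    column_id: FARE_AMOUNT
    row_filter_id: NONE
    rule_ids:
      - POSITIVE_FARE
  T3_CONSISTENT_DISTANCE:
    entity_id: TAXI_TRIPS
    column_id: PICKUP_DATETIME
    row_filter_id: NONE
    rule_ids:
      - CONSISTENT_DISTANCE
```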
Scheduling
A scheduled task can be created in Dataplex to perform regular data quality validation (e.g., daily). Please refer to this section of the documentation regarding task creation.
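For illustration, creating a recurring task with gcloud might look roughly like the command below. All names, the schedule, and the artifact paths are placeholders following the pattern in the documentation; verify the flags and current CloudDQ driver locations against the docs before use.

```bash
# Sketch only: names, schedule, and artifact paths are placeholders.
gcloud dataplex tasks create dq-taxi-trips \
  --project=my-project \
  --location=us-central1 \
  --lake=my-lake \
  --trigger-type=RECURRING \
  --trigger-schedule="0 3 * * *" \
  --execution-service-account=dq-runner@my-project.iam.gserviceaccount.com \
  --spark-python-script-file=gs://dataplex-clouddq-artifacts-us-central1/clouddq_pyspark_driver.py \
  --spark-file-uris=gs://my-bucket/clouddq-executable.zip,gs://my-bucket/taxi-trips-config.yml \
  --execution-args=^::^TASK_ARGS="clouddq-executable.zip, ALL, gs://my-bucket/taxi-trips-config.yml, --gcp_project_id=my-project, --gcp_region_id=us-central1, --gcp_bq_dataset_id=dq_results, --target_bigquery_summary_table=my-project.dq_results.dq_summary"
```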
There is also an option to trigger a Dataplex task from Cloud Composer. This may be useful if there is a requirement to implement a quality gate instead of monitoring data quality with a schedule.
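A rough sketch of the Composer path, using the google provider's DataplexCreateTaskOperator with an on-demand trigger; the body layout mirrors the Dataplex Task API, and all names and paths are illustrative:

```python
# Sketch: an on-demand Dataplex task created from a Cloud Composer DAG,
# so a pipeline step can act as a data quality gate.
# All names and paths below are placeholders.
from airflow.providers.google.cloud.operators.dataplex import DataplexCreateTaskOperator

DQ_TASK_BODY = {
    "trigger_spec": {"type_": "ON_DEMAND"},
    "execution_spec": {
        "service_account": "dq-runner@my-project.iam.gserviceaccount.com",
    },
    "spark": {
        "python_script_file": "gs://dataplex-clouddq-artifacts-us-central1/clouddq_pyspark_driver.py",
    },
}

run_dq_gate = DataplexCreateTaskOperator(
    task_id="run_dq_gate",
    project_id="my-project",
    region="us-central1",
    lake_id="my-lake",
    dataplex_task_id="dq-taxi-trips-gate",
    body=DQ_TASK_BODY,
)
```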
Review results
Validation results are stored in a summary table in BigQuery.
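For example, a quick way to inspect recent failures is a query along these lines; the dataset name is a placeholder, and the column names follow CloudDQ's dq_summary schema:

```sql
-- Sketch: recent rule failures from the CloudDQ summary table.
-- Dataset/table names are placeholders.
SELECT
  execution_ts,
  rule_binding_id,
  rule_id,
  table_id,
  column_id,
  rows_validated,
  failed_count
FROM `my-project.dq_results.dq_summary`
WHERE failed_count > 0
  AND execution_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY execution_ts DESC;
```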
Data Studio can be used to build dashboards on top of this table.
Alerts
There is no alerting functionality in Dataplex data quality tasks. However, you can use Cloud Monitoring to send notifications to various channels.
One workaround is to create a scheduled query that regularly checks dq_summary for the check failures we want to alert on. Once a failure is identified, the query appends a row to an alerting table, as sketched below.
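A minimal sketch of such a scheduled query, assuming a hypothetical `dq_alerts` table and an hourly schedule:

```sql
-- Sketch: append newly failed checks to an alerting table.
-- Table names are placeholders; the interval should match the schedule.
INSERT INTO `my-project.dq_results.dq_alerts`
  (alert_ts, rule_binding_id, rule_id, table_id, failed_count)
SELECT
  CURRENT_TIMESTAMP() AS alert_ts,
  rule_binding_id,
  rule_id,
  table_id,
  failed_count
FROM `my-project.dq_results.dq_summary`
WHERE failed_count > 0
  AND execution_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```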
Then you can create a log-based alert policy that fires on appends to that table. The filter sketch below assumes the hypothetical dq_alerts table from the previous step and the legacy BigQuery audit-log schema; adjust it to your project's logs:
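```
resource.type="bigquery_resource"
protoPayload.methodName="jobservice.jobcompleted"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId="dq_results"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId="dq_alerts"
```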
Another option is to use the --summary_to_stdout flag in a Dataplex DQ task to publish the validation summary to stdout. Then you can again create a log-based alerting policy, this time on the task's log output.
Conclusion
Dataplex DQ is a simple yet powerful option to implement data quality checks and quality gates. In addition, it is flexible and provides easily accessible results in BigQuery. Another benefit is that it’s serverless, so you don’t have to worry about managing infrastructure.