Niall's Data Blog

A Data Engineer / Architect writing about Tech, Data and the Community

This is my New Blog!

Migrating from WordPress to Hugo

I’ve had a blog for years at (no link, it’s dead now) that I very occasionally wrote on. I was on WordPress, and every time I wanted to just write a post, there seemed to be a hundred things that needed fixing or updating. Most of the time it honestly felt like more effort than it was worth. My blog had always been hosted by an old colleague, and when that was coming to an end it was move-it-or-lose-it time.

Associative Grouping using Spark - Part 3

This is part of a series of posts about associative grouping:
Part 1 - Associative Grouping using tSQL Recursive CTEs
Part 2 - Associative Grouping using tSQL Graph

In the first two parts of this series we looked at how we could use recursive CTEs and SQL Server’s graph functionality to find overlapping groups in two columns of a table, in order to put them into a new super group of associated groups.
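To make the idea of a "super group" concrete, here is a minimal sketch in plain Python. It is not the Spark implementation from the post, and it swaps in a union-find structure as a stand-in technique. The function name `associate_groups` and the sample rows are hypothetical, chosen only for illustration.

```python
def associate_groups(pairs):
    """Give every (group_a, group_b) row a super group id.

    Rows that share a value in either column, directly or through a
    chain of other rows, end up in the same super group.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the
        # root, shortening the path as we go (path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag each value with its column so equal values in different
    # columns are not confused with each other.
    for a, b in pairs:
        union(("a", a), ("b", b))

    # Number the super groups in order of first appearance.
    super_ids = {}
    result = []
    for a, b in pairs:
        root = find(("a", a))
        super_id = super_ids.setdefault(root, len(super_ids) + 1)
        result.append((a, b, super_id))
    return result


# Hypothetical sample data: groups 1 and 2 overlap via "x", so rows
# containing 1, 2, "x" or "y" are associated; group 3 stands alone.
rows = [(1, "x"), (2, "x"), (2, "y"), (3, "z")]
grouped = associate_groups(rows)
print(grouped)
```

In this sketch the first three rows share one super group id and the last row gets its own, which is the kind of result the recursive CTE and graph approaches in Parts 1 and 2 aim for.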

Introducing AzureDataPipelineTools

A few months ago my friend Richard Swinbank posted a blog, More Get Metadata in ADF, about the limitations of using the Get Metadata activity in ADF to get information about files in a data lake. This led to a Twitter conversation, as a bunch of other data engineers had been building the same tools for different companies.
Due to "popular" demand I've released the definition of my #Azure #DataFactory pipeline to Get Metadata recursively https://t.

Azure Data Factory: Dev Mode vs Published Code

I’ve worked with quite a few people new to Azure Data Factory, and one thing that seems to confuse new users is the difference between the developer sandbox where we build pipelines, and the published/deployed code. Understanding this is key to working with Git and using CI/CD pipelines to deploy your code, and getting other Azure services to integrate nicely to call your pipelines.

Connecting to ADF

A good first place to start is to understand the different ways we can interact with a data factory.

Azure DevOps: SqlPackage Deployment Timeouts

I recently looked at an Azure DevOps pipeline for a client that was timing out whenever an index change or other long-running task was deployed using a DacPac. All deployments that ran within a few minutes completed successfully, but those taking longer than 10 minutes were failing. A quick Google search for the issue shows a few helpful pages: a SqlPackage.exe bug from 2016 and a blog post from 2018. These suggested either passing a couple of parameters when running SqlPackage, or changing a registry setting.