Category Archives: Data Quality

Why is Data Quality important

In recent years, Data quality is becoming more and more crucial for companies and business around the world, since it directly impacts decision-making, operational efficiency, and business outcomes. High-quality data ensures that businesses make informed decisions based on accurate and reliable information. Poor data can lead to errors, misinterpretations, and costly mistakes. Also, clean and consistent data reduces time spent correcting errors, resolving discrepancies, and reprocessing transactions. This improves productivity and reduces costs.

In regard to the Customer Satisfaction, accurate data helps businesses understand and serve their customers better, leading to improved customer experiences and stronger relationships. In the area of Regulatory Compliances, many industries are subject to strict regulations regarding data usage and reporting. High-quality data ensures compliance and reduces the risk of legal penalties or reputational damage.

Last but not least, the most recent revolution of AI is heavily related to Data Quality as well. AI models and machine learning algorithms depend on quality data for training and prediction. Poor data quality can lead to inaccurate models and flawed outputs.

By investing in data quality, organizations ensure their data is accurate, complete, consistent, and up to date, enabling them to operate efficiently and achieve their goals.

Finding duplicates in SQL Server

In a single column:

To get duplicates on a specific column, you can use the GROUP BY and COUNT(*) statements. For example, to find duplicates on the “ModifiedDate” column of the “Sales.SalesOrderDetails” table from the “AdventureWorks” sample database, you can use the following code:

SELECT ModifiedDate, COUNT(*) duplicates 
 FROM Sales.SalesOrderDetail
GROUP BY ModifiedDate
HAVING COUNT(*) > 1
ORDER BY COUNT(*) desc

Result:

In multiple columns:

Similarly, GROUP BY and COUNT(*) can be used too, but, in this case, we could convert each column to VARCHAR (so that they have all the same data type) and then concatenating them. For example, let’s use the “ModifiedDate” and “ProductID” fields from the same table:

select CONVERT(VARCHAR(100), ModifiedDate) + ' ' + CONVERT(VARCHAR(100),ProductID) columns,
  count(*) duplicates
from Sales.SalesOrderDetail
group by CONVERT(VARCHAR(100), ModifiedDate) + ' ' + CONVERT(VARCHAR(100),ProductID)
having count(*) > 1
order by count(*) desc

Result:

How to get the current running queries in SQL Server

To get the list of current running queries, you can use the following script:

SELECT sqltext.TEXT,
req.session_id,
req.status,
req.command,
req.cpu_time,
req.total_elapsed_time,
req.total_elapsed_time/1000 AS total_elapsed_time_in_secs,
req.total_elapsed_time/60000 AS total_elapsed_time_in_mins,
users.name
FROM sys.dm_exec_requests req
INNER JOIN sys.sysusers users on users.uid = req.user_id
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext
order by sqltext.text