
I was reading up on the database engine's Always Encrypted feature while preparing for the 70-473 exam, and bumped into binary collations, which I had somehow never heard of before.
While configuring Always Encrypted, one of the choices you have to make is whether to use Deterministic or Randomized encryption (and you have to know the differences very well for the exam). One of the caveats of Deterministic encryption is that character columns must use a collation with a binary2 sort order. More specifically, the documentation states: Deterministic encryption requires a column to have one of the binary2 collations. If you use the SSMS Encrypt Columns wizard, it will convert your column collation to a binary2 collation (which is case sensitive).
This statement prompted me to investigate the topic of binary collations a little.
First of all, collations can be set at several levels: the SQL Server instance (i.e. its system databases), the database, the column, and the individual expression.
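As a rough sketch of what this requirement looks like in a table definition, the statement below declares a deterministically encrypted character column with an explicit BIN2 collation. The table `dbo.Patients`, the column names, and the column encryption key `CEK_Auto1` are hypothetical, and the key is assumed to already exist in the database.

```sql
-- Hypothetical table; CEK_Auto1 is assumed to be an existing column encryption key.
CREATE TABLE dbo.Patients
(
    PatientId int IDENTITY(1, 1) PRIMARY KEY,
    -- Deterministic encryption on a character column needs a BIN2 collation.
    SSN char(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (
            COLUMN_ENCRYPTION_KEY = CEK_Auto1,
            ENCRYPTION_TYPE = DETERMINISTIC,
            ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
        ) NOT NULL
);
```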
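The queries below are one way to check the collation at each of these levels. The table `dbo.Customers` and its column `LastName` are hypothetical placeholders; substitute your own objects.

```sql
-- Instance (server) level collation
SELECT SERVERPROPERTY('Collation') AS server_collation;

-- Collation of the current database
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;

-- Column level collations (dbo.Customers is a hypothetical table)
SELECT c.name AS column_name, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.Customers')
  AND c.collation_name IS NOT NULL;

-- Expression level: override the collation for a single ORDER BY
SELECT LastName
FROM dbo.Customers
ORDER BY LastName COLLATE Latin1_General_BIN2;
```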
To list all the collations available on your instance of SQL Server, you can query the sys.fn_helpcollations() system function.
On Azure SQL Database you will get back 3955 rows, i.e. 3955 possible collations. Understanding collations requires you to understand a set of related terms, such as collation, locale, code page, and sort order. You should also know that there are three major sets of collations available to you:
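A minimal form of such a statement, using the built-in `sys.fn_helpcollations()` function, could look like this (the `ORDER BY` is only there for readability):

```sql
-- Returns one row per collation supported by the instance.
SELECT name, description
FROM sys.fn_helpcollations()
ORDER BY name;
```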
- Windows collations
- Binary collations
- SQL Server collations
These collation groups sort data differently. In the past, my standard explanation of Windows vs. SQL Server collations was that Windows collations are updated more frequently, are more compatible, and are therefore preferable to SQL Server collations. Technically speaking it is more about how sorting works, but as per the MSFT documentation: “SQL Server supports a limited number (<80) of collations called SQL Server collations which were developed before SQL Server supported Windows collations. SQL Server collations are still supported for backward compatibility, but should not be used for new development work.” So what I was saying seems to be correct.
I won’t be covering all the details and differences between these sets of collations, as I only want to focus here on binary collations, which are a requirement for Always Encrypted Deterministic encryption.
Binary collations sort data based on the sequence of coded values that are defined by the locale and data type. They are case sensitive. A binary collation in SQL Server defines the locale and the ANSI code page that is used. This enforces a binary sort order. Because they are relatively simple, binary collations help improve application performance. For non-Unicode data types, data comparisons are based on the code points that are defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin1_General_BIN and Japanese_BIN yield identical sorting results when they are used on Unicode data.
There are two types of binary collations in SQL Server: the older BIN collations and the newer BIN2 collations. In a BIN2 collation all characters are sorted according to their code points. In a BIN collation only the first character is sorted according to the code point, and the remaining characters are sorted according to their byte values. (Because the Intel platform is a little-endian architecture, Unicode characters are always stored byte-swapped.)
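To see this for yourself, the sketch below sorts two nvarchar values that share the first character and differ in the second: U+00FF is stored as bytes FF 00 and U+0100 as bytes 00 01. The specific values are my own illustration, chosen so that code point order and byte order disagree. Run both queries and compare the row order.

```sql
-- Two strings with the same first character and a different second character.
DECLARE @t TABLE (v nvarchar(10));
INSERT INTO @t (v)
VALUES (N'a' + NCHAR(255)),   -- second character U+00FF, stored as FF 00
       (N'a' + NCHAR(256));   -- second character U+0100, stored as 00 01

-- Older BIN collation: characters after the first are compared byte by byte.
SELECT v, CAST(v AS varbinary(20)) AS stored_bytes
FROM @t
ORDER BY v COLLATE Latin1_General_BIN;

-- BIN2 collation: every character is compared by its code point.
SELECT v, CAST(v AS varbinary(20)) AS stored_bytes
FROM @t
ORDER BY v COLLATE Latin1_General_BIN2;
```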
So for Always Encrypted Deterministic encryption, any of the BIN2 collations listed by sys.fn_helpcollations() will do.
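A query along these lines lists the candidates. It assumes that BIN2 collations can be recognised by the `_BIN2` token in their name, so the exact row count will depend on your SQL Server version.

```sql
-- [_] matches a literal underscore; a bare _ is a single-character wildcard in LIKE.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE '%[_]BIN2%'
ORDER BY name;
```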
This leaves you with 133 collations to choose from 🙂 BIN and BIN2 collations use different sorting algorithms, and BIN2 is generally preferable, not only for columns that use Always Encrypted Deterministic encryption. Another interesting question is why there are BIN/BIN2 collations for different languages, such as Arabic_BIN2 or French_BIN2. The reason is that each of them uses a different code page to encode characters in the varchar type. The language part of the collation name therefore matters only for varchar data, which is sorted according to the code page of the selected language; it makes no difference for nvarchar data.
All collations that are not binary collations are linguistic collations. For example, Latin1_General_CI_AS is a linguistic collation, and it uses a sorting algorithm compatible with English and many Western European languages. Don’t be confused by the name Latin1_General: it can actually sort all characters defined in the Unicode 3.2 character set, and it can sort many other languages correctly as well (as long as the language has no sorting conflict with the Latin1_General sorting rules).
Binary collations perform better than linguistic collations, and that is the main advantage of using them. A binary collation is always case sensitive and accent sensitive. A BIN2 collation is generally preferable to a BIN collation.
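One way to check the code page behind each language-specific binary collation is the built-in `COLLATIONPROPERTY` function. The four collation names below are just a sample picked for illustration.

```sql
-- Shows the code page used for varchar data under each collation.
SELECT name,
       COLLATIONPROPERTY(name, 'CodePage') AS code_page
FROM sys.fn_helpcollations()
WHERE name IN (N'Arabic_BIN2', N'French_BIN2', N'Japanese_BIN2', N'Latin1_General_BIN2');
```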
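The case and accent sensitivity is easy to check with a small comparison. This is a sketch using literal values of my own choosing and two linguistic collations (Latin1_General_CI_AS and Latin1_General_CI_AI) as the contrast.

```sql
SELECT
    -- Case: the linguistic case-insensitive collation treats a and A as equal.
    CASE WHEN N'a' = N'A' COLLATE Latin1_General_CI_AS
         THEN 'equal' ELSE 'different' END AS case_linguistic,
    CASE WHEN N'a' = N'A' COLLATE Latin1_General_BIN2
         THEN 'equal' ELSE 'different' END AS case_binary2,
    -- Accent: the accent-insensitive collation treats e and é as equal.
    CASE WHEN N'e' = N'é' COLLATE Latin1_General_CI_AI
         THEN 'equal' ELSE 'different' END AS accent_linguistic,
    CASE WHEN N'e' = N'é' COLLATE Latin1_General_BIN2
         THEN 'equal' ELSE 'different' END AS accent_binary2;
```

The two linguistic columns should report 'equal', while the two BIN2 columns report 'different', because a binary collation compares the underlying code points and nothing else.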