Character composition conflicts on NAS volumes

This documentation is for an older version of CCC. You can find the latest version here.
Last updated on December 21, 2022

If you copy folders that have accented characters in their names (e.g. é, ö) to your NAS device from a Windows system or via SSH (e.g. using rsync), then you can later run into file or folder name conflicts when you try to access those folders via SMB file sharing. When these conflicts affect a CCC backup task, you'll see errors in CCC suggesting that there is a permissions problem on the NAS volume, or that you should try restarting the NAS device. This article explains how these conflicts arise, how to spot them in the Finder, and how to ultimately resolve them to achieve error-free backups.

Some brief background about character encoding

The "ASCII" character set is composed of 128 characters, each of which fits in a single byte: all of the characters that you'd find in a typical English word. Other languages have numerous other characters, however, that can't possibly fit in a set that small. These other characters are defined in the Unicode standard, and in the UTF-8 encoding they typically consume 2 or 3 bytes (and at most 4) per "code point". Most modern filesystems support the Unicode standard; however, there are some characters within the Unicode standard that can present challenges to filesystems, and can lead to conflicts when transferring content between filesystems or across a network filesystem protocol.

Let's take the character "é" as an example that can lead to conflicts. This character is described as "Latin small letter e with acute". In UTF-8, this character can be represented as a single code point (U+00E9) that is encoded in two bytes (0xC3A9), or it can be generated by composition, i.e. by combining the "Latin small letter e" code point (ASCII, 0x65) with a "combining acute accent" code point (U+0301, encoded as 0xCC81). This article calls the first variant "single-character" and the second variant "composed"; the Unicode standard calls them the precomposed (NFC) and decomposed (NFD) forms, respectively. What individual filesystems do when faced with these ambiguous characters is a potential source of conflict. Some filesystems normalize the characters (i.e. choose one variant when storing file names, e.g. HFS+), some accept both but treat the characters as identical (composition-preserving [usually], composition-insensitive, e.g. APFS), and other filesystems accept both and treat the variants as unique/different characters (composition-sensitive, e.g. EXT4, a common format used on NAS devices).
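To make the two variants concrete, here is a minimal Python sketch. It uses only the standard unicodedata module; the variable names are ours, chosen for illustration. It shows that the two spellings of "é" produce different bytes, that a byte-for-byte comparison treats them as different, and that normalizing both to one form makes them match.

```python
import unicodedata

# Two spellings of "é" that look identical on screen.
single = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
combined = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

single_bytes = single.encode("utf-8")
combined_bytes = combined.encode("utf-8")
print(single_bytes.hex(), combined_bytes.hex())   # c3a9 65cc81

# A composition-sensitive comparison sees two different names.
print(single == combined)   # False

# Normalizing both spellings to one form makes them compare equal.
print(unicodedata.normalize("NFC", combined) == single)   # True
print(unicodedata.normalize("NFD", single) == combined)   # True
```

A composition-sensitive filesystem behaves like the plain == comparison; a normalizing or composition-insensitive filesystem behaves roughly like the last two comparisons.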

Network filesystems (AFP, SMB) are in an awkward middle place — they can't dictate how the underlying filesystem behaves, so composition conflicts can place them in an unsupportable position.

Creating conflict

Let's suppose you have a folder named Beyoncé in your Music library. Long ago (e.g. prior to macOS High Sierra), your library was on an HFS+ filesystem, so that é character was stored in the composed form, 0x65CC81. Way back then, let's suppose you used rsync to copy this library directly to your NAS via SSH. On the NAS, the backend filesystem is EXT4, which is composition-sensitive. The EXT4 filesystem stored the folder name using the same encoding as on the source: the composed variant. Fast-forward many years. You have a new Mac and your startup disk is now APFS formatted. You migrated content from an HFS+ volume to an APFS volume, and the é in that Beyoncé folder name was "normalized" to the two-byte, single-character variant. You still have the same NAS, but now you're preparing to use CCC to make backups to that NAS via SMB. Many factors have changed!

If you were to navigate to this Beyoncé folder on the SMB-mounted volume in the Finder, you might be surprised to find that the folder appears to be empty. In fact, the Finder is failing to query the content of that folder, because the macOS SMB client queries the content of the folder using the normalized variant of the name (which the NAS correctly reports as "not there"). If you try to copy content into that folder, Finder will ask you to authenticate, then present an error indicating that you don't have permission to make the change. This is not actually a permissions problem! It's not necessarily a Finder bug either; rather, it is an unsupportable configuration in which that folder can't be effectively accessed by SMB or AFP. You'll see the same problem if you try to delete that folder in the Finder.

Resolving character encoding conflicts

The correct solution in a case like this is to delete the "old" folder from the NAS. You won't be able to do this in the Finder (nor CCC for that matter), though, because the macOS SMB client normalizes folder names when it makes requests to the NAS. So even though the SMB client can see the composed variant of a name in the parent folder listing, if we subsequently ask the SMB volume to remove the composed variant of a folder, the SMB client relays that request to the NAS using the normalized variant of the folder name, which doesn't exist on the NAS.

Solution: Log in to the NAS device's web admin interface, or connect to it via SSH to remove the affected folders.

Workaround: Alternatively, you can configure CCC to back up to a new folder on the NAS. This alternative approach is ideal if you have non-Mac clients that access the content in the original folders (and therefore tend to just re-introduce the same problem).
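If you go the SSH route, you first have to find the affected folders. The following is a rough sketch rather than a CCC feature: it assumes Python 3 is available on the NAS, and the function names are hypothetical. It lists a directory using raw byte names and flags every name that is not stored in the single-character form, which per the description above are the names that the macOS SMB client cannot address.

```python
import os
import unicodedata

def names_needing_attention(raw_names):
    """Return the byte names that are not stored in the single-character (NFC) form."""
    flagged = []
    for raw in raw_names:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not UTF-8 at all; outside the scope of this sketch
        if unicodedata.normalize("NFC", text) != text:
            flagged.append(raw)
    return flagged

def scan(directory: bytes):
    # Passing a bytes path to os.listdir returns the names as raw bytes,
    # so nothing gets normalized on the way in.
    return names_needing_attention(os.listdir(directory))

# Example call on the NAS (path taken from the listing later in this article):
# print(scan(b"/volume2/SynBackup6TB/FunWithEncoding"))
```

The sketch only reports names. Review the flagged folders and confirm that their content exists elsewhere before you remove anything.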

For the Terminally-curious

Here is what a pair of composition-conflicting folder names would look like on the backend EXT4 filesystem (i.e. logged in to the NAS via SSH):

admin@baltar:/volume2/SynBackup6TB/FunWithEncoding$ ls -li
total 16
30421978 drwxrwxrwx+ 2 admin users 4096 Dec 20 17:31 Beyoncé
30421986 drwxrwxrwx+ 2 admin users 4096 Dec 20 17:31 Beyoncé

This would appear to be illegal — two folders cannot coexist in the same folder having the same name. But if we pipe the listing to xxd to see the hexadecimal representation of the characters, we can see that the é characters do actually differ (note, this output is slightly massaged for easier reading):

admin@baltar:/volume2/SynBackup6TB/FunWithEncoding$ ls | xxd
4265 796f 6e63 65cc 81  Beyonce..
4265 796f 6e63 c3a9     Beyonc..

The first item has the composed é character; the second item has the single-character, two-byte variant. Now suppose each of these folders has a different file within it. Here is the NAS perspective:
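If xxd isn't handy, the same inspection can be done in Python. This sketch simply decodes the two byte sequences shown in the xxd output above and reports how many bytes and code points each name has, and what its final code point is; the variable names are ours.

```python
import unicodedata

# Hex copied from the xxd output above, with the spaces removed.
first = bytes.fromhex("4265796f6e6365cc81")
second = bytes.fromhex("4265796f6e63c3a9")

for raw in (first, second):
    text = raw.decode("utf-8")
    last = text[-1]
    # Byte count, code point count, and the name of the final code point.
    print(len(raw), len(text), f"U+{ord(last):04X}", unicodedata.name(last))

# 9 8 U+0301 COMBINING ACUTE ACCENT
# 8 7 U+00E9 LATIN SMALL LETTER E WITH ACUTE
```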

admin@baltar:/volume2/SynBackup6TB/FunWithEncoding$ ls -l Beyonc*
Beyoncé:
total 0
-rwxrwxrwx+ 1 admin users 0 Dec 20 17:31 composed

Beyoncé:
total 0
-rwxrwxrwx+ 1 admin users 0 Dec 20 17:31 single

But the macOS SMB client normalizes both the folder listing results and its requests, so we see different results from the macOS perspective:

[bombich:/Volumes/SynBackup6TB/FunWithEncoding] ls | xxd
4265 796f 6e63 65cc 81   Beyonce..
4265 796f 6e63 65cc 81   Beyonce..

[bombich:/Volumes/SynBackup6TB/FunWithEncoding] ls -l Beyonc*
Beyoncé:
total 0
-rwx------ 1 bombich staff 0 Dec 20 17:31 single

Beyoncé:
total 0
-rwx------ 1 bombich staff 0 Dec 20 17:31 single

This last result is the most curious. We can see from the parent folder that two separate "Beyoncé" folders exist here, but when we ask for details about each folder and a folder listing of each folder, we only get results pertaining to the folder that has the normalized name. Stranger still, Finder only presents one of these (although you might catch a glimpse of both folders right before Finder removes one from view!). This is why requests to add files to the folder named with the composed character will fail, and it's also why attempts to delete the folder with the composed character will fail: the SMB client simply will not make the request correctly using the composed variant of the character.
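The behavior above can be reproduced with a toy model. The sketch below is not the real SMB client: the dictionary stands in for the composition-sensitive EXT4 directory, and the two functions encode our assumptions, taken from the output above, that names in listings reach macOS in the composed form while requests go out in the single-character form.

```python
import unicodedata

# Hypothetical model of the NAS directory: byte-exact names, as on EXT4.
nas = {
    "Beyonce\u0301".encode("utf-8"): ["composed"],
    "Beyonc\u00e9".encode("utf-8"): ["single"],
}

def client_list_parent():
    # Assumption: names in listings are presented to macOS in the composed form.
    return [unicodedata.normalize("NFD", raw.decode("utf-8")) for raw in nas]

def client_list_folder(name):
    # Assumption: requests are sent using the single-character form.
    wire_name = unicodedata.normalize("NFC", name).encode("utf-8")
    return nas.get(wire_name)

listing = client_list_parent()
contents = [client_list_folder(name) for name in listing]
print(len(listing), contents)   # 2 [['single'], ['single']]
```

Both entries in the parent listing resolve to the folder stored under the single-character name, so in this model the folder stored under the composed name can never be reached, no matter which entry you pick.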