SVN: Invalid UTF-8 Sequence

We maintain SVN copies of all of our production websites. Many of these sites use WordPress and get updated several times a day, including images and PDF documents being added. The process is fairly simple: take a snapshot of the database and copy all the files from the production server over the top of an SVN working copy – and let SVN figure out which files are new and which have changed.

While doing this for a particularly large and troublesome site, SVN reported the error below during the “svn add” operation.

svn: Valid UTF-8 data
(hex: 77 69 6e 2d 61 6e 64 2d 68 6f 70 65 73 2d 66 6f 72 2d 73 74 6f 6e 65 72)
followed by invalid UTF-8 sequence
(hex: d4 c7 d6 73)

Essentially, some of the image filenames uploaded onto the production server, which runs on Windows, contained bytes that are not valid UTF-8. The import to SVN was being done on a Linux machine.

The hex code above is actually part of the filename. Identifying which file was causing the problem was as simple as converting the hex to ASCII. It takes all of five minutes to write a script to do the conversion or 10 seconds Googling to find an online Hex to ASCII converter.
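As a rough illustration of the "five minute script", here is a minimal sketch in POSIX shell. The function name hex_to_text is my own choice, and it assumes the hex bytes are separated by spaces, as they are in the SVN error message.

```shell
# hex_to_text: convert space-separated hex bytes (as printed by svn)
# into the characters they represent.
hex_to_text() {
    # $1 is deliberately unquoted so the shell splits it on spaces.
    for byte in $1; do
        # Turn the hex byte into a 3-digit octal escape and let printf emit it.
        printf "\\$(printf '%03o' "0x$byte")"
    done
    printf '\n'
}

# The "valid UTF-8 data" part of the error message above.
hex_to_text "77 69 6e 2d 61 6e 64 2d 68 6f 70 65 73 2d 66 6f 72 2d 73 74 6f 6e 65 72"
# prints: win-and-hopes-for-stoner
```

Only the valid part of the message is worth decoding this way; the invalid bytes (d4 c7 d6 73) will not display sensibly in a UTF-8 terminal.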

The ASCII may or may not be the first part of the filename. Usually SVN will have reported the directory where the problem file is. Using

ls -il *

gives the full name of the file and its inode number. As the filename contains ‘illegal’ characters, rm is unlikely to be able to delete it directly; a file can be deleted by inode number using find with -exec rm (replace INODE with the number reported by ls):

find . -inum INODE -exec rm -i {} \;
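If a directory holds many files, picking the bad names out of the ls output by eye gets tedious. The sketch below is one possible way to list them along with their inode numbers. It is not what I ran at the time: the function names is_valid_utf8 and list_bad_names are made up for this example, and it assumes the iconv command is installed.

```shell
# is_valid_utf8: succeed only if the argument decodes cleanly as UTF-8.
# Assumes iconv is available; it exits non-zero on an invalid sequence.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# list_bad_names: for each entry directly inside directory $1, print its
# inode number and name (via ls -id) if the name is not valid UTF-8.
list_bad_names() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue          # skip the unexpanded glob in an empty directory
        # Check only the final path component, not the directory part.
        is_valid_utf8 "${path##*/}" || ls -id "$path"
    done
}

# Check the current directory (not recursive).
list_bad_names .
```

The inode numbers it prints can be passed straight to the find command above. Note that it only looks at the directory it is given, which matches the case where SVN has already reported where the problem file is.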

I found several versions of the troublesome files on the server with variations of their filenames. It seems that the versions with the ‘broken’ filenames could not be used by the web server, and no reference to them was found in the database, meaning that they were safe to delete.