# The Magic 0xC2

*Written 2017-04-09*

I built a web application with file upload functionality: some Vue.js in the front and a CouchDB in the back. Everything should have been pretty simple and straightforward. But…

When I uploaded image files, they somehow got mangled. The uploaded file was bigger than the original, and the new "file format" was not readable by any means.

I got intrigued. What was happening to the files? The changes seemed random but were reproducible, so I created a few test files to see exactly what changes and when.

My first file looked like this:

```
0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
```

To my surprise, the file stayed the same! My curiosity grew.

In the meantime I found a very intriguing pattern in the hexdump of an upload: `C3 BF C3`. It was everywhere. In another file, I found similar patterns with `C2`.

So I wrote my next test file. This time a binary file:

```
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 |................|
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |.... !"#$%&'()01|
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |23456789@ABCDEFG|
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |defghipqrstuvwxy|
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |................|
96 97 98 99 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab |................|
ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb |................|
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
```

**EDIT**: As you probably already noticed, I counted up as if in base 10, but the values are actually base 16. So I skipped A-F until reaching A0. This might look weird but didn't affect the test.

The result after uploading was:

```
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 |................|
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |.... !"#$%&'()01|
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |23456789@ABCDEFG|
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |defghipqrstuvwxy|
c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
c2 88 c2 89 c2 90 c2 91 c2 92 c2 93 c2 94 c2 95 |................|
c2 96 c2 97 c2 98 c2 99 c2 a0 c2 a1 c2 a2 c2 a3 |................|
c2 a4 c2 a5 c2 a6 c2 a7 c2 a8 c2 a9 c2 aa c2 ab |................|
c2 ac c2 ad c2 ae c2 af c2 b0 c2 b1 c2 b2 c2 b3 |................|
c2 b4 c2 b5 c2 b6 c2 b7 c2 b8 c2 b9 c2 ba c2 bb |................|
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
```

There it was again: the magic **0xC2**!

So all bytes with a value higher than *0x79* got followed by a *0xC2*. *0x79* is the ASCII code for *y*. At least that is what I thought. It is actually the other way around: all bytes with a value of *0x80* or higher got prefixed with a *0xC2*! That is when the scales fell from my eyes: **UTF-8 encoding**!

In *UTF-8*, all characters above *0x7F* are at least two bytes long. They get prefixed with *0xC2* up to *0xC2BF* (which is the inverted question mark `¿`), which is then followed by *0xC380*. That also explains the `C3 BF` pattern in the image uploads: it is what the byte *0xFF* turns into.

So what happened is that, on the way to the server, every byte of the file was treated as a character and the file got encoded to UTF-8 ¯\\\_(ツ)\_/¯

**EDIT:** Corrected some mistakes after some comments on [Hacker News](https://news.ycombinator.com/item?id=14089827)
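
To double-check the explanation, the effect can be reproduced in a few lines of Python. This is only a sketch: it assumes that each byte was read as a Latin-1 (ISO-8859-1) character and then written out as UTF-8, which matches the hexdumps above but is not confirmed for the actual upload path. The function names `mangle` and `unmangle` are made up for illustration.

```python
def mangle(data: bytes) -> bytes:
    """Treat every byte as a Latin-1 code point, then encode it as UTF-8.

    Assumption: this is what happened to the upload; the post only shows
    the resulting bytes, not the code that produced them.
    """
    return data.decode("latin-1").encode("utf-8")


def unmangle(data: bytes) -> bytes:
    """Reverse mangle(), assuming nothing else modified the data."""
    return data.decode("utf-8").encode("latin-1")


if __name__ == "__main__":
    # 0x79 is plain ASCII and stays a single byte; 0x80..0xBF get a C2 prefix;
    # 0xC0..0xFF get a C3 prefix with the second byte shifted down by 0x40.
    sample = bytes([0x79, 0x80, 0xBF, 0xC0, 0xFF])
    print(mangle(sample).hex(" "))  # 79 c2 80 c2 bf c3 80 c3 bf
```

Pure ASCII input passes through unchanged, which is why the first text test file survived the upload. Every byte from *0x80* upward grows to two bytes, which is why the mangled images were bigger than the originals.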