A bit about file encoding

When we read in a file, we use f = open(filename, encoding="latin1").

Here's what encoding="latin1" means. The file we are reading, like any file, is just a bunch of 0's and 1's:

    1011100110111100011111010000011101101011001100111000100001111001101110101000100100100101000101010100111010110001001001110000101001011110000110011001001011100001000000110111100010100110000001101001011111111111010100100001010111010111100111001110010111110101111001100000111010011101111010111110010010000101110101100111011011001010010011100101100011100010011001111100110101101011010111100011110001011110101011010111110011011011110101111111000001000111010010011011011000000101010001011000101011000010111
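
To see for yourself that a file really is just a sequence of bits, you can open it in binary mode and print its contents as 0's and 1's. Here is a minimal sketch; the filename example.txt is just a placeholder:

    # Open the file in binary mode, so Python gives us the raw bytes
    # instead of decoding them into characters.
    with open("example.txt", "rb") as f:
        raw = f.read()

    # Each byte is an integer from 0 to 255; format it as 8 binary digits.
    bits = "".join(format(byte, "08b") for byte in raw)
    print(bits)   # e.g. 101110011011110001111101...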

When we specify that the file is encoded using the "latin1" encoding, Python reads the file 8 digits (bits) at a time:

    10111001

    10111100

    01111101

    00000111

    01101011
    ...

Each group of 8 bits corresponds to a character (so there are 256 possible characters in total). You can read about which 8-bit sequences correspond to which characters here.
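
To check this, here is a small sketch that takes the first few 8-bit groups listed above, turns them into bytes, and decodes them with the "latin1" encoding, one character per byte:

    # The first few 8-bit groups from the listing above.
    groups = ["10111001", "10111100", "01111101", "00000111", "01101011"]

    # int(g, 2) interprets each group as a binary number between 0 and 255,
    # and bytes(...) packs those numbers into a byte string.
    data = bytes(int(g, 2) for g in groups)

    # Decoding with latin1 maps each byte to exactly one character.
    print(repr(data.decode("latin1")))   # '¹¼}\x07k'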

Obviously, not all of the world's languages can be expressed using only 256 characters. For example, there are many tens of thousands of Chinese characters. In order to encode them, more complex encoding schemes are needed. One of them is Unicode, which is also able to encode the alphabets of languages such as Japanese, Arabic, Hebrew, Russian, Korean, etc.
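
As a small illustration (assuming UTF-8, the most widely used encoding of Unicode), characters outside the 256-character range take more than one byte each:

    # Two Chinese characters encoded with UTF-8.
    text = "漢字"
    encoded = text.encode("utf-8")

    print(len(text))      # 2 characters
    print(len(encoded))   # 6 bytes -- each of these characters needs 3 bytes
    print(encoded)        # b'\xe6\xbc\xa2\xe5\xad\x97'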