Strings
Page content
- Encodings convert code points into bytes.
- There are many different character encodings.
- UTF-8 is the current gold standard of Unicode encoding.
ASCII
- It stores English characters as numbers ranging from 0 to 127.
- A := 65
- a:= 97, etc.
Unicode
- It is a system designed to represent every character from every language.
- It represents each letter, character, or ideograph as a 4-byte number.
- An ideogram or ideograph is a graphic symbol that represents an idea or concept, independent of any particular language, and specific words or phrases.
UTF-32
- 32 bits:= 4 bytes per each character
- Finding the Nth character in constant time.
UTF-16
- 16 bits:= 2 bytes per each character
- Finding the Nth character in constant time.
UTF-8
- A variable-length encoding system for Unicode.
- Different characters take up a different number of bytes.
- ASCII:= 1 byte per each character
- Extended Latin:= 2 bytes per each character
- Chinese:= 3 bytes per each character, etc.
- Finding the Nth character in O(N) time.
val = "español"
# Encoding is like a kind of decryption key.
val_utf8 = val.encode('utf-8') # b'espa\xc3\xb1ol'
type(val_utf8) # bytes
# Decoding bytes into characters.
val_utf8.decode('utf-8') # español
string
module
string.ascii_lowercase
- abc…z
Rabin Karp
- A linear expected-time algorithm for string matching
- Based on hashing
- A good example of a randomized algorithm is if we pick $M$ in some random way
String matching
- Some auxiliary information must be maintained to mark the end of a string
- a special end-of-string character OR
- a count n of the characters in the string.