Splitting strings: reminder

Here is how we can split strings:

In [1]:
"abcengsci1234engsciABCabc".split("engsci")
Out[1]:
['abc', '1234', 'ABCabc']

We obtained a list that contains strings that are separated by "engsci" in the original string. We can use str.split() in order to compute the number of words and the number of sentences (approximately) in a text.

Here is how:

In [2]:
def num_words(text):
    return len(text.split(" "))

Here is how this works:

In [3]:
"Engineers rule the world".split(" ")
Out[3]:
['Engineers', 'rule', 'the', 'world']
In [4]:
len("Engineers rule the world".split(" "))
Out[4]:
4
In [5]:
num_words("Engineers rule the world")
Out[5]:
4

It's a little trickier to compute the number of sentences, since they can be separated by all of "!", ".", and "?". Here is an idea: let's replace all the exclamation points and question marks by periods in a text. We'll use that to compute the average number of words per sentence in the first part Marcel Proust's celebrated (and famously difficult to read) Remembrance of Things Past (also known as In Search of Lost Time.)

In [6]:
def num_sentences(text):
    text = text.replace("!", ".")
    text = text.replace("!", ".")
    return len(text.split("."))


f = open("losttime.txt", encoding = "latin1")
text = f.read()
print(num_words(text)/num_sentences(text))
32.97321751719146

We can do the same for the French text of the same novel, and compare the results:

In [7]:
f = open("losttime_fr.txt", encoding = "latin1")
text = f.read()
print(num_words(text)/num_sentences(text))
30.66328358208955

The French text is slightly easier to read!