

Input: "tableapplechairtablecupboard..." (many words, no spaces)

What would be an efficient algorithm to split such text into the list of words? The word "cupboard" can be "cup" and "board"; select the longest.

First thing that comes to mind is to go through all possible words (starting with the first letter) and find the longest word possible, then continue from position = word_position + len(word).

Language: Python, but the main thing is the algorithm itself.

Answer rating: 230

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text. (If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it's easy to compute with dynamic programming. Instead of directly using the probability, we use a cost defined as the logarithm of the inverse of the probability, to avoid overflows.

```python
from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i + 1) * log(len(words)))) for i, k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], 9e999), k + 1) for k, c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k

    return " ".join(reversed(out))
```

Which you can use with:

```python
s = "thumbgreenappleactiveassignmentweeklymetaphor"
print(infer_spaces(s))
```

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor
After: thumb green apple active assignment weekly metaphor

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho…
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metapho…
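For contrast, the greedy longest-match idea from the question can be sketched as below. The helper name and the toy dictionaries are my own, not from the original post; the sketch shows how greedy matching can paint itself into a dead end that the frequency-based dynamic programming avoids.

```python
def greedy_split(s, dictionary):
    """Greedy longest-match: repeatedly take the longest dictionary word
    that prefixes the remaining text. Returns None when stuck."""
    out, i = [], 0
    while i < len(s):
        match = None
        # Try the longest possible prefix first, shrinking toward length 1.
        for j in range(len(s), i, -1):
            if s[i:j] in dictionary:
                match = s[i:j]
                break
        if match is None:
            return None  # dead end: no dictionary word starts here
        out.append(match)
        i += len(match)
    return out

# Greedy grabs "at" and then cannot segment "urtle", even though
# the split "a turtle" exists.
print(greedy_split("aturtle", {"a", "at", "turtle"}))  # → None
print(greedy_split("aturtle", {"a", "turtle"}))        # → ['a', 'turtle']
```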

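As a small illustration of the cost construction (using a hypothetical three-word frequency list, not the 125k dictionary): under Zipf's law the rank-n word has probability about 1/(n log N), so its cost is -log of that, i.e. log(n log N), which grows with rank. The dynamic program therefore prefers segmentations built from frequent words.

```python
from math import log

# Toy frequency-ranked word list (hypothetical, for illustration only).
words = ["the", "of", "apple"]
N = len(words)

# cost(word of rank n) = -log(1 / (n * log N)) = log(n * log N)
wordcost = {w: log((i + 1) * log(N)) for i, w in enumerate(words)}

# More frequent words (lower rank) get a lower cost.
print(sorted(wordcost, key=wordcost.get))  # → ['the', 'of', 'apple']
```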