Chinese Word Segmentation for Search Engines (Java Version)
A few days ago I read the series by Google researcher Wu Jun on the beauty of mathematics and was quite moved by it. Some time ago I happened to build a Chinese word segmentation system based on a statistical language model as a course project, so I am posting it here so that we can learn together.
Word segmentation plays an important role in search engines, information extraction, machine translation and other fields; I will not elaborate on its applications. On to the main topic :)
1. Overview
The corpus used to gather statistics for this segmentation system is a portion that our school has made public; it can be downloaded here. It contains about 1.84 million Chinese characters and is, of course, already segmented, so it can be used to build a relatively small corpus. I completed the system in the following four main steps:
1. Corpus preprocessing
2. Building the 2-gram model (bigram statistical model)
3. Implementing full segmentation
4. Evaluation and testing
I describe these four steps one by one below.
1. Corpus preprocessing
The downloaded, already segmented corpus has a form such as "19980131-04-012-001 / m reality / n / u Insight / vn but / d by the / p depiction / out v / v-shaped / Ng to / v. / w, ", where each word is followed by a slash and a part-of-speech tag. A date-and-number identifier is also kept at the front of each line, because the source of the segmented corpus is the People's Daily. Preprocessing mainly splits the text into sentences by punctuation. A sentence is simply defined as a string of words that ends with one of five punctuation marks (period, question mark, exclamation mark, colon, semicolon), and a marker is added at both ends of each sentence:
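Writer out; // output stream
Reader in; // input stream
StringBuffer s1 = new StringBuffer(); // buffer for the current sentence
int a = in.read(); // read() returns an int so that -1 can signal the end of the stream
while (a != -1) // loop until the end of the input
{
    if (a == '。' || a == '？' || a == '！' || a == '：' || a == '；') // end of a sentence (full-width marks assumed)
    {
        String s2 = new String(s1);
        out.write("<BOS> "); // sentence-begin marker
        out.write(s2);
        out.write("<EOS>"); // sentence-end marker (marker text assumed, the original string was lost)
        out.write('\n'); // newline
        s1 = new StringBuffer();
    }
    else if (a == '/')
        s1 = s1.append((char) 32); // a space marks the word boundary
    else if (a > 256)
        s1 = s1.append((char) a); // keep only non-ASCII characters, which drops the tags
    a = in.read();
}
out.close();
in.close();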
2. Building the 2-gram model (bigram statistical model)
First, a brief introduction to the n-gram model and the 2-gram model.
A probability distribution P over a language L, estimated from samples of the language, is called a language model of L. Take a given sentence s = w1w2…wn (the digits, n and i are subscripts; wi is a word of sentence s). By the chain rule, P(s) = p(w1) p(w2|w1) p(w3|w1w2) … p(wn|w1w2w3…w(n-1)). In p(wi|w1w2…w(i-1)), the string w1w2…w(i-1) is called the history of wi. A model that takes only the preceding n-1 words as the history is an n-gram model. The larger n is, the more context information is available, but the higher the cost and the more training corpus is needed; with a small n, less information is available, but the cost is low and not much training corpus is needed.
Let c(w1,…,wi) denote the number of times the word string w1, w2…wi occurs in the training corpus. By maximum likelihood estimation, P(wn|w1,…,w(n-1)) = c(w1,…,wn) / c(w1,…,w(n-1)). Likewise, for the 2-gram model, P(wn|w(n-1)) = c(w(n-1),wn) / c(w(n-1)).
For more background, you can look up the relevant material yourself; any university probability and statistics textbook will do. To say it once more, mathematics is a good thing :)
Back to the topic :) The training corpus contains more than 50,000 distinct words in total. Building the 2-gram statistical model means counting the frequency of every word in the training corpus, and also the frequency of every 2-gram formed by a word and the word that follows it. Because segmentation looks up the 2-gram model data very frequently, the data structure that stores the model must support efficient lookups. I therefore chose a hash table, which provides constant-time lookups. The Java class library provides a HashMap class, and since the 2-gram data is not very large, it can be used directly. In this storage scheme, each key is a word that occurred in the training corpus, and the value for each key is itself a HashMap, which I will call the sub-hashmap for now. The structure is somewhat like a two-level index. The related code is as follows:
I will not go into detail on how the words are read from the preprocessed file. The method: read each line, split it on spaces into a String array, and match with a regular expression.
// Adds one 2-gram to the model: prewd is the preceding word, currwd is the word that follows it
public static void add(String prewd, String currwd) {
    String key = prewd;
    String curr = currwd;
    // HMap is an existing HashMap that stores the 2-gram statistical model.
    // Check whether prewd is already in the main map.
    boolean bb = HMap.containsKey(key);
    if (bb == false) { // not in the main map yet, so add it
        HashMap hm = new HashMap(); // first build a new sub-map
        hm.put(key, new Integer(1)); // store the frequency of the main key
        hm.put(curr, new Integer(1)); // store the frequency of the word that follows the key
        HMap.put(key, hm); // add the key and its sub-map to the main map
    }
    else // the main map already contains the key
    {
        HashMap temp = (HashMap) HMap.get(key); // fetch the sub-map for the key and update its values
        int count = ((Integer) temp.get(key)).intValue() + 1; // increase the count of the key itself by 1
        temp.put(key, new Integer(count));
        if (temp.containsKey(curr)) // check whether the following word is already in the sub-map
        {
            int value = ((Integer) temp.get(curr)).intValue() + 1;
            temp.put(curr, new Integer(value));
        }
        else
            temp.put(curr, new Integer(1)); // if not, store it in the sub-map
        HMap.put(key, temp); // put the updated sub-map back into the main map
    }
}
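As a rough sketch of how the preprocessed file might feed a method like add, the class below splits each line on whitespace and counts every adjacent pair of words. The class name BigramCounter, the method names addLine and mle, and the toy tokens are made up for illustration; it also assumes Java 8 or later. Note one deliberate difference from the code above: the count of the preceding word is kept in a separate map rather than inside the sub-hashmap under its own key, so that a word followed by itself does not collide with its own count.

```java
import java.util.HashMap;
import java.util.Map;

public class BigramCounter {
    // word -> number of times it occurs as the first word of a 2-gram
    final Map<String, Integer> unigrams = new HashMap<>();
    // preceding word -> (following word -> count of that 2-gram)
    final Map<String, Map<String, Integer>> bigrams = new HashMap<>();

    // Count one 2-gram (prewd, currwd).
    void add(String prewd, String currwd) {
        unigrams.merge(prewd, 1, Integer::sum);
        bigrams.computeIfAbsent(prewd, k -> new HashMap<>())
               .merge(currwd, 1, Integer::sum);
    }

    // Split one preprocessed line on whitespace and count adjacent pairs.
    void addLine(String line) {
        String[] words = line.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            add(words[i], words[i + 1]);
        }
    }

    // Maximum likelihood estimate c(prev, curr) / c(prev); 0 if prev was never seen.
    double mle(String prev, String curr) {
        Integer history = unigrams.get(prev);
        if (history == null) {
            return 0.0;
        }
        int pair = bigrams.get(prev).getOrDefault(curr, 0);
        return (double) pair / history;
    }

    public static void main(String[] args) {
        BigramCounter model = new BigramCounter();
        model.addLine("<BOS> a b <EOS>"); // toy tokens standing in for Chinese words
        model.addLine("<BOS> a c <EOS>");
        System.out.println(model.mle("a", "b")); // c(a, b) = 1, c(a) = 2
    }
}
```

An unseen pair such as ("a", "z") gets probability 0 here, which is exactly the sparseness problem that smoothing has to deal with.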
Because most words in a language are low-frequency words, data sparseness is certain to be a problem. MLE (maximum likelihood estimation) assigns a probability of 0 to every 2-gram that does not appear in the training corpus, so the 2-gram model must be smoothed in order to obtain better parameters. There are many smoothing techniques, such as Add-one, Add-delta, Witten-Bell and held-out smoothing. This system mainly uses two of them, Add-delta and held-out. Below, Add-delta smoothing is taken as the example for smoothing the 2-gram model. For the 2-gram model, the smoothing formula is:
P(wn|w(n-1)) = [c(w(n-1),wn) + delta] / (N + delta * V), where delta is taken to be 0.5
N: the number of all 2-grams in the training corpus
V: the number of all possible distinct 2-grams
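To make the arithmetic concrete, here is a minimal sketch of the formula as a single function. The class name AddDelta and the values of N and V in main are hypothetical and chosen only to keep the numbers easy to check; they are not taken from the corpus.

```java
public class AddDelta {
    // Smoothed estimate (c + delta) / (N + delta * V), following the formula above.
    static double smooth(long count, long n, long v, double delta) {
        return (count + delta) / (n + delta * v);
    }

    public static void main(String[] args) {
        long n = 1000; // hypothetical number of 2-grams in the training corpus
        long v = 4000; // hypothetical number of possible distinct 2-grams
        double seen = smooth(3, n, v, 0.5);   // (3 + 0.5) / (1000 + 2000)
        double unseen = smooth(0, n, v, 0.5); // 0.5 / (1000 + 2000)
        System.out.println(seen + " " + unseen);
    }
}
```

The point to notice is that a 2-gram with a count of zero no longer gets probability 0, and that a smaller delta moves less probability mass away from the 2-grams that were actually observed.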
Smoothing approach: 1. Create an iterator over the main hashmap and read its keys one after another;
2. For each key, read its value, which is a sub-hashmap;
3. Update the value of every key-value pair in the sub-hashmap according to the smoothing formula
Algorithm framework:
Iterator it = mainHashmap.keySet().iterator();
while (it.hasNext())
{
    Object mainKey = it.next();
    HashMap subHashmap = (HashMap) mainHashmap.get(mainKey);
    Iterator itr = subHashmap.keySet().iterator();
    while (itr.hasNext())
    {
        Object subKey = itr.next();
        // recompute the value stored under subKey according to the smoothing formula
    }
}
Note: 1. Because the computed probability values are very small, take the negative of their logarithm to prevent underflow.
2. For each main key, all the 2-grams that never occurred with it, that is, the zero-frequency 2-grams, can be represented by one shared key stored in the corresponding sub-hashmap.
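The traversal and the two notes can be put together roughly as follows. This is only a sketch under stated assumptions: the class name SmoothModel and the shared key "<UNSEEN>" are invented for illustration, the sub-maps are assumed to hold only 2-gram counts (not the count of the main key itself), and the result is written to a new map rather than updated in place. It assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

public class SmoothModel {
    // Hypothetical shared key that stands for every zero-frequency 2-gram of a main key.
    static final String UNSEEN = "<UNSEEN>";

    // Turn raw 2-gram counts into negative log probabilities under Add-delta smoothing.
    static Map<String, Map<String, Double>> smooth(
            Map<String, Map<String, Integer>> counts, long n, long v, double delta) {
        Map<String, Map<String, Double>> model = new HashMap<>();
        double denominator = n + delta * v;
        for (Map.Entry<String, Map<String, Integer>> outer : counts.entrySet()) {
            Map<String, Double> sub = new HashMap<>();
            for (Map.Entry<String, Integer> e : outer.getValue().entrySet()) {
                // Note 1: store -log(p) so that tiny probabilities do not underflow.
                sub.put(e.getKey(), -Math.log((e.getValue() + delta) / denominator));
            }
            // Note 2: one entry covers all 2-grams that never followed this key.
            sub.put(UNSEEN, -Math.log(delta / denominator));
            model.put(outer.getKey(), sub);
        }
        return model;
    }
}
```

A lookup for a pair (prev, curr) would then read `sub.getOrDefault(curr, sub.get(SmoothModel.UNSEEN))` on the sub-map of prev. Because the stored values are negative logarithms, a smaller value means a more probable 2-gram, and the score of a sentence becomes a sum instead of a product.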
When this is done, serialize the object. When the system is used, the model is lazily loaded into memory and then kept resident there, which speeds things up considerably.
At this point the 2-gram model is complete.
3. Implementing full segmentation
Segmentation methods generally include maximum matching (MM, RMM), rule-based methods and statistics-based methods. I will not go into the first two. So-called full segmentation produces every segmentation of a sentence that is possible according to the dictionary. Ambiguity can be resolved with rule-based methods or statistics-based methods; here, of course, the method is based on the 2-gram statistical model :) To avoid spending extra time on a separate ambiguity analysis after segmentation, segmentation and evaluation are done side by side, that is, candidates are evaluated while they are being cut.
All the possible segmentations of a sentence form a solution space tree
Therefore the optimal solution can be found by backtracking search
Cutting out every combination first and then choosing the best one by the 2-gram model would obviously waste time, because the search repeats a lot of work and backtracking search has exponential time complexity
Therefore pruning has to be combined with the search to avoid useless branches, which greatly improves efficiency
The tree is searched depth-first, and the optimal solution can be found this way
The specific algorithm is as follows:
stack.push(BOS) // root node of the tree
while stack is not empty
    x := stack.pop()
    pos := x.pos, w := x.w, oldvalue := x.value, preword := x.preword
    if m > 0 then // m is the number of words in the first-word string FW at pos
        for j := 1 to m do
            FWj := the j-th element of FW
            if length(w + FWj) = length(c) and its probability is the greatest so far then // c is the sentence
                output w + FWj and record it as the most probable segmentation of the sentence
            else
                pos1 := pos + length(FWj)
                if probability(w + FWj, pos1, newstate) > maxValue(pos1) then
                    stack.push(newstate)
                endif
            endif
        endfor
    endif
endwhile
end.
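The pseudocode above can be sketched in Java roughly as follows. This is an illustration under several assumptions, not the original implementation: the class FullSegmenter and all of its member names are invented; 2-gram scores are taken to be negative log probabilities, so a smaller cost is better; a single character is always accepted as a candidate word so that the search cannot dead-end; and pruning is keyed on the position together with the previous word rather than on the position alone, because under a 2-gram model the cost of the rest of the sentence depends on both. It assumes Java 8 or later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FullSegmenter {
    static final String BOS = "<BOS>";

    // One node of the solution space tree: a partial segmentation that ends at pos.
    static final class Node {
        final int pos; final List<String> words; final double cost; final String prev;
        Node(int pos, List<String> words, double cost, String prev) {
            this.pos = pos; this.words = words; this.cost = cost; this.prev = prev;
        }
    }

    final Set<String> dict;
    final Map<String, Map<String, Double>> costs; // prev -> (curr -> negative log probability)
    final double unseenCost;                      // cost of any 2-gram missing from costs
    final int maxWordLen;

    FullSegmenter(Set<String> dict, Map<String, Map<String, Double>> costs, double unseenCost) {
        this.dict = dict; this.costs = costs; this.unseenCost = unseenCost;
        int max = 1;
        for (String w : dict) max = Math.max(max, w.length());
        this.maxWordLen = max;
    }

    double cost(String prev, String curr) {
        Map<String, Double> sub = costs.get(prev);
        return sub == null ? unseenCost : sub.getOrDefault(curr, unseenCost);
    }

    // Candidate words that start at pos (the "first-word string" of the pseudocode).
    List<String> candidates(String s, int pos) {
        List<String> out = new ArrayList<>();
        out.add(s.substring(pos, pos + 1)); // single-character fallback (assumption)
        for (int end = pos + 2; end <= Math.min(s.length(), pos + maxWordLen); end++) {
            String w = s.substring(pos, end);
            if (dict.contains(w)) out.add(w);
        }
        return out;
    }

    List<String> segment(String s) {
        Deque<Node> stack = new ArrayDeque<>();
        Map<String, Double> best = new HashMap<>(); // lowest cost seen per (position, previous word)
        stack.push(new Node(0, new ArrayList<>(), 0.0, BOS));
        Node bestFull = null;
        while (!stack.isEmpty()) {
            Node x = stack.pop();
            if (x.pos == s.length()) { // a complete segmentation
                if (bestFull == null || x.cost < bestFull.cost) bestFull = x;
                continue;
            }
            for (String w : candidates(s, x.pos)) {
                double c = x.cost + cost(x.prev, w);
                int pos1 = x.pos + w.length();
                String key = pos1 + "|" + w;
                Double seen = best.get(key);
                if (seen != null && seen <= c) continue; // prune: this state was reached more cheaply
                best.put(key, c);
                List<String> words = new ArrayList<>(x.words);
                words.add(w);
                stack.push(new Node(pos1, words, c, w));
            }
        }
        return bestFull == null ? new ArrayList<>() : bestFull.words;
    }
}
```

Copying the word list at every node keeps the sketch short; a real implementation would more likely store a parent pointer in each node and rebuild the word sequence only for the best complete path.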
Implementing the algorithm requires dealing with several issues, such as how the tree nodes are stored and how the first-word string is handled.
4. Evaluation and testing
Environment: Windows XP2, AMD Athlon 1800+, 768 MB of memory, JDK 1.5
Delta smoothing: the accuracy rate rises as the value of delta gets smaller
Delta:          0.5     0.01    0.0001
Recall rate:    0.9756  0.9826  0.9928
Accuracy rate:  0.9638  0.9710  0.9883
Held-out smoothing
Recall rate: 0.9946
Accuracy rate: 0.9902
Under normal circumstances, held-out smoothing should do better than delta smoothing
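The post does not say how the two rates were computed. A common way to score a segmenter, assumed here, is to compare word spans against a reference segmentation: precision (taken here to correspond to the accuracy rate) is the share of output words that are correct, and recall is the share of reference words that were found. The class name SegEval and its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegEval {
    // Represent each word by its character span "start-end" in the unsegmented sentence.
    static Set<String> spans(List<String> words) {
        Set<String> out = new HashSet<>();
        int pos = 0;
        for (String w : words) {
            out.add(pos + "-" + (pos + w.length()));
            pos += w.length();
        }
        return out;
    }

    // Returns {precision, recall} of a system segmentation against a reference one.
    // Both lists are assumed to be non-empty and to spell out the same sentence.
    static double[] evaluate(List<String> system, List<String> gold) {
        Set<String> correct = spans(system);
        correct.retainAll(spans(gold)); // spans present in both segmentations
        return new double[] {
            (double) correct.size() / system.size(),
            (double) correct.size() / gold.size()
        };
    }
}
```

For example, scoring the output [ab, c, d] against the reference [ab, cd] finds one shared span out of three output words and two reference words, so precision is 1/3 and recall is 1/2.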
The whole modeling and smoothing process completes within one minute.
Segmentation time and efficiency:
Test corpus of 17,455 characters (17,287 Chinese), average sentence length 41 characters, time: 340 ms, average segmentation speed: 51,000 characters/s
Test corpus of 205,000 characters (from Xiaoaojianghu), 174,600 characters after preprocessing, preprocessing time 110 ms, 24,945 sentences (text lines), average sentence length 7, segmentation time 1,300 ms, average 134,600 characters/s.
Test corpus of 205,000 characters (from Xiaoaojianghu), without preprocessing, average sentence length 239, segmentation time 40 s, average 5,000 characters/s
The backtracking algorithm takes O(N!) time, so the length of the sentence being cut directly determines the segmentation speed: the longer the sentence, the more words it contains
After preprocessing the sentences are short, 7 characters on average, so the backtracking is shallow and segmentation is much faster.
With that, the system is basically complete and this post comes to an end. It feels like I have written this rather chaotically, heh
I am now working on another assignment, a simple search engine, and plan to build this thing into it to provide the segmentation function :)
Tags: java Chinese, search






