Algorithm of similarity invariant estimation of two string variables by the Three-Sets method

[Russian] Author: Smirnov M.V.
Algorithm of similarity invariant estimation of two string variables by the Three-Sets method
Annotation

An algorithm is proposed for calculating the similarity metric in comparison and text search problems. The algorithm is invariant to the permutation of words. The algorithm is based on the use of relations between three finite sets, and the calculation of the similarity metric of two compared string variables. A comparison of the results of calculations by the "three-sets" method and calculations by known algorithms is presented, such as the Levenshtein metric, the Oliver algorithm, the Jacquard similarity coefficient.

One of the most popular functions in text search systems is the function of comparing two lines [1]. Judging by the posts and publications, many programmers prefer the Levenshtein algorithm, which ensures the implementation of fuzzy search, i.e. search when not strictly specifying the search phrase [2,3,7]. The presence of the corresponding levenshtein () subroutine in the PHP language [4] plays an important role in the choice of the Levenshtein function. The metric or Levenshtein distance is the minimum number (integer) of the operations of inserting, removing, and replacing one character for another, necessary to convert one string to another. One more thing to note is another subroutine in the specification of PHP functions, the function similar_text () , which finds the degree of similarity of two lines by Oliver"s algorithm. In the Oliver algorithm, PHP searches for the number of common symbols in the compared rows. The subroutine similar_text () returns the numerical value of the similarity of two lines in percent [5].

The author of this article used the named subroutines (and a number of others) in the formation of the semantic kernel of multi-page sites. The problem of automated selection of unique or quasi-unique names in the semantics of sites (online stores) is known. The main drawback of the levenshtein () and similar_text () subroutines is the sensitivity to the permutation of words in the compared phrases. Let"s consider a real example. As the first line, we will use the phrase " artistic stone carving ", and in the second line we will do a permutation of the words without changing them - " carving artistic stone . " The result of applying the similar_text () function produced a similarity value of 49.12% , and the return value of the levenshtein () function was 38, which corresponds in relative measure to the length of the lines - 29.63% .

With the latter problem, a binary measure of similarity, proposed by Paul Jaccard, copes well. At the same time, an essential shortcoming of the Jakkard algorithm is the criticality to errors in writing words in the compared texts: so if in one of the lines the word "corresponding" will have an error - "coresponding", the result of calculating the Jacquard coefficient is, on average, 59% , with an accuracy of 0.1 (100 iterations) [8]. As rightly noted in [9] - "the instability of the Jakkard algorithm manifests itself in the presence of spelling errors and synonymous replacements." The
The
The algorithm considered below provides insensitivity (invariance) to word rearrangement in compared phrases and is based on calculating the strict inclusion and the intersection between three finite sets. One set of A is a collection of symbols of the Latin alphabet, which must strictly include the intersection of two sets S 1 and S 2 < / sub> , formed by the character sequences of two compared phrases of the text. In terms of mathematical sets, we can write:
A = {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z},
S1 = {s1, s2, s3 ... sn},
S2 = {s1, s2, s3 ... sm},

wherein, S 1 ⊆ A and S 2 ⊆ A - must always be executed. Direct calculation of the intersection S 1 ∩ S 2 is not enough for us. The desired similarity metric is defined as the power difference of the intersections S 1 ∩ A and S 2 ∩ A : The
Sim (S 1 , S 2 ) = | S 1 ∩ A | - | S 2 ∩ A | .
The
The normalized metric in the positive range of numbers from 0 to 1 takes the form: The
The
Sim (S 1 , S 2 ) = || S 1 ∩ A | - | S 2 ∩ A || / | S 1 | + | S 2 | .
The
For clarity in the figure below, using the Euler-Venn diagram, we show the desired relation of three sets:
The Euler-Venn diagram for the relation of three sets

The Euler-Venn diagram for the relation of three sets.


In a PHP program, the power of the sets | S 1 | and | S 2 | , two compared sequences of symbols str1 and str2 , we compute as hashes of the characters $hschar1 and $hschar2 :
  $str = preg_replace('/[^a-z]/', '', $str);
  $char1 = count_chars($str, 1);
  foreach ($char1 as $key => $val) {
                    $ch = chr($key);
                    $hschar1{$ch} = $val;
                                   }
  $str = preg_replace('/[^a-z]/', '', $str);
  $char2 = count_chars($str, 1);
  foreach ($char2 as $key => $val) {
                    $ch = chr($key);
                    $hschar2{$ch} = $val;
                                    },
where count_chars () is the standard PHP subroutine, and the strings are $str and $str , the character sequences are in lowercase. The last procedure is known as the transliteration of the Russian alphabet in Latin in lowercase, for example: The
$streng = str_replace ($r, $l, strtolower ($str, 'UTF-8')),
where strtolower () is the standard PHP function. The regular expression / [^ a-z] / ensures strict inclusion S 1 ⊆ A and S 2 ⊆ A . The
The
The set of Latin characters A is defined by the variable $alfb . The calculation of the similarity metric $sim is combined with the procedure for calculating the crossing powers | S 1 ∩ A | and | S 2 sub> ∩ A | is represented by the code:
$alfb = "abcdefghijklmnopqrstuvwxyz";
$alfbarr = preg_split("//",$alfb);
$sim = 0;
foreach($alfbarr as $aw){
            if($aw!=""){
                   $val =  $hschar1{$aw};
                   if($val){ 
                            $vv=$val;   
                             }else{ $vv=0; }
                   $val2 =  $hschar2{$aw};
                   if($val2){ 
                             $vv2=$val2;   
                             }else{ $vv2=0;}
                   $sim += abs($vv - $vv2);
            }
}       
$percent = 100*(1 - $sim/$lengtotal),
where $percent is the similarity of the compared rows in percent. A graphical example of a histogram of the power values ​​of sets of two lines in the space of symbols of the Latin alphabet: The
$str1 = "Comparison of two strings with the help of an invariant metric"
and
$str2 = "Comparison of two strings using a metric that is invariant to the permutation of words"
is presented in Table 1:
src = "images / invar_table1.gif" border = 0 alt = "Histogram of power values ​​of sets of two lines" title = "Histogram of power values ​​of sets of two lines">

Table 1. The histogram of power values ​​of sets of text variables in the space of symbols of the Latin alphabet.
The
The upper table graph refers to the line $str1 . The proximity measure is calculated using the invarDistance () subroutine, which returns the following values:
Array
(
    [lengtotal] => 111
    [Error.count] => 17
    [Similarity] => 84.68
),
где lengtotal – total length of lines, Error.count - the number of "erroneous" characters, a measure of proximity Similarity in percent. The graphs are plotted using the invarDistanceGraph () subroutine. You can download the archive of the routines invarDistance () and invarDistanceGraph () , as well as an example of comparing two strings in PHP at three_sets.zip [6]. In the archive, the variant of the set of symbols A is implemented:
$alfb = "abcdefghijklmnopqrstuvwxyz0123456789".
Finally, consider another example of comparing two strings with an arbitrary permutation of words and one " error ":
$str1 = "Search algorithms and text analysis";                     
        &
$str2 = "Algorithm analysis text and search";                    
Result of the calculation of the metric:
Array
(
    [lengtotal] => 55
    [Error.count] => 1
    [Similarity] => 98.18
)
The result shows the registration of one error and almost 100% similarity of compared rows, despite the permutations. It is interesting to compare this result with the calculation of the Jacquard coefficient, which also ensures invariance to permutations of words. The similarity estimate for Jakkard gives a significantly understated result: 0.55-0.63 (accuracy of 0.05, 400 iterations). The
The
Conclusions:
The "three-sets" method provides the invariability to word rearrangements and demonstrates resistance to errors in writing texts. An experimental comparison with the Levenshtein metric, the Oliver algorithm, and the Jacquard similarity coefficient has shown the advantage of the "three-sets" method. To disadvantages of the algorithm can be attributed a decrease in efficiency when comparing string variables of a large length. When working with string variables of more than 200-256 characters, the best results are demonstrated by the algorithm " mirror regular expression " [ 10 ]. The
The
References:
  1. String comparison functions. PHP Reference.
  2. Distance of Levenshtein.
  3. Application of fuzzy search algorithms in PHP
  4. The function levenshtein (). A guide to PHP.
  5. Similar_text () function. A guide to PHP.
  6. Calculation of the metric is invariant to a permutation of words. The program in PHP.
  7. Fuzzy search in the text and in the dictionary
  8. MinHash — we identify similar sets
  9. Н.В. Неелова, Сычугов А. А. The calculation of fuzzy takes by the Jacquard formula taking into account synonymous substitutions and stop words [russian]. Тула: Известия ТулГУ, 2009. To article
  10. Оценка алгоритмов сходства строковых переменных. Зеркальное регулярное выражение
Статьи:
1. Программное обеспечение. Цифровая голография. Скрытые водяные знаки. Распознавание образов.
2. Инвариантная оценка сходства двух строковых переменных методом "Трех-Множеств"
3. An invariant the similarity estimate of two string variables by the "Three-Sets" method
4. On-line вычисление коэффициента Джаккарда (javascript)
5. Регистрация и восстановление цифровых голограмм в традиционных носителях информации на бумажной и пластиковой основе
6. Распределение вероятностей частоты поисковых запросов