目录

1. SequenceMatcher FlowChart

1.1 get_matching_blocks() 

1.2 find_longest_match()

1.3 ratio()

2. 例子说明

3. 项目需求函数更改


参考

SequenceMatcher的基本思想是找到不包含“junk”元素的最长连续匹配子序列(LCS)。这不会产生最小的编辑序列,但是会产生对人“看起来正确”的匹配。 “junk”是不希望算法与之匹配的东西:例如普通文本文件中的空行,或者HTML文件中的“ <P>”行,等等。SequenceMatcher的输出很友好,例如:

Input Strings: my stackoverflow mysteries AND mystery

SequenceMatcher algorithm returns:to anyone, the natural match is "myster" as follows

my stackoverflow mysteries
.................
mystery..

However, the LCS will output "mystery"

my stackoverflow mysteries
my.st.....er......y.......

1. SequenceMatcher FlowChart

                    

Given two input strings a and b,

  • ratio( ) returns the similarity score ( float in [0,1] ) between input strings. It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences
  • get_matching_blocks( ) return list of triples describing matching subsequences. The last triple is a dummy, (len(a), len(b), 0). It works by repeated application of find_longest_match( )
  • find_longest_match( ) returns a triple containing the longest matching block in a[aLow:aHigh] and b[bLow:bHigh]

                         

1.1 get_matching_blocks() 

该函数可根据自己项目需求进行修改:

def get_matching_blocks(self):
    if self.matching_blocks is not None:
        return self.matching_blocks
    la, lb = len(self.a), len(self.b)

    queue = [(0, la, 0, lb)]
    matching_blocks = []
    while queue:
        alo, ahi, blo, bhi = queue.pop()
        i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
        # a[alo:i] vs b[blo:j] unknown
        # a[i:i+k] same as b[j:j+k]
        # a[i+k:ahi] vs b[j+k:bhi] unknown
        if k:   # if k is 0, there was no matching block
            matching_blocks.append(x)
            if alo < i and blo < j:
                queue.append((alo, i, blo, j))
            if i+k < ahi and j+k < bhi:
                queue.append((i+k, ahi, j+k, bhi))
    matching_blocks.sort()

新建一个队列queue,用包含两个输入字符串上下限索引的四元组初始化。当队列中存在四元组时,将其弹出并传递给find_longest_match()函数,该函数返回描述匹配子序列的三元组。三元组添加到matching_blocks列表中。

 三元组格式为: (i, j, n),即 a[i:i+n] == b[j:j+n]

匹配子序列左侧和右侧的序列片段将进一步添加到队列queue中。重复该过程,直到队列为空。

然后对matching_blocks列表进行排序,并作为输出返回。

1.2 find_longest_match()

   def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
        a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
        besti, bestj, bestsize = alo, blo, 0
        # find longest junk-free match
        # during an iteration of the loop, j2len[j] = length of longest
        # junk-free match ending with a[i-1] and b[j]
        j2len = {}    # {位置:重复最长长度}
        nothing = []
        for i in range(alo, ahi):
            # look at all instances of a[i] in b; note that because
            # b2j has no junk keys, the loop is skipped if a[i] is junk
            j2lenget = j2len.get
            newj2len = {}    # newj2len 记录到目前为止匹配的字符串的长度
            for j in b2j.get(a[i], nothing):
                # a[i] matches b[j]
                if j < blo:   # 当前匹配子串[]之前的元素
                    continue
                if j >= bhi:  # 当前匹配子串之后的元素
                    break
                k = newj2len[j] = j2lenget(j-1, 0) + 1   # 当前位置j的前一个位置对应的重复最长长度,使其+1
                if k > bestsize:    # 若当前最长长度 > bestsize,则将bestsize更新为当前最长长度,同时更新besti、bestj
                    besti, bestj, bestsize = i-k+1, j-k+1, k
            j2len = newj2len

输入为包含字符串上下限索引的四元组,输出为包含最长匹配块的三元组。

首先,定义字典b2j,其中对于字符串b中的x,b2j [x]是x出现的索引(到b)的列表。

在外循环中按字符扫描第一个字符串a,我们使用b2j检查字符串b中该字符的出现。如果存在匹配项,我们将更新另一个字典newj2len,这有助于确保到目前为止匹配的字符串的长度。因此,变量besti,bestj和bestsize进行了更新,其中考虑了迄今为止获得的最长的匹配块数据。

在所有最大匹配块中,该算法返回最早在a中开始的那个,而在所有最大匹配中最早在a中开始的那个,它返回最早在b中开始的那个。

1.3 ratio()

    def ratio(self):
        """Return a measure of the sequences' similarity (float in [0,1]).
        Where T is the total number of elements in both sequences, and
        M is the number of matches, this is 2.0*M / T.
        Note that this is 1 if the sequences are identical, and 0 if
        they have nothing in common."""

        matches = sum(triple[-1] for triple in self.get_matching_blocks())
        return _calculate_ratio(matches, len(self.a) + len(self.b))

retio()函数计算序列a和b的相似度,ratio = 2.0*M / T,M为匹配的字符数,T为两个序列的总字符数。相似度的计算可根据实际情况进行修改。

根据经验,ratio()值超过0.6表示序列是紧密匹配的。在这里,根据计算,我们获得了0.8的相似性得分比,因此输入序列对被视为相似。

2. 例子说明

Let’s describe the step by step procedure of the algorithm by implementing it using an input pair of strings.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3. 项目需求函数更改

get_matching_blocks():

  • 计算a在b中的抄袭率,即a序列在b序列中的重复度,b保持不动,遍历a的子序列去跟b比较,即(alo, i, 0, lb)或(i+k, ahi, 0, lb)。
  • 当重复字数k达到一定值时,视为抄袭,即k >= duplicateNumber。
  • 相似度radio:ratio = M / len(a),即重复率 = 重复字符数/a的长度
    def get_matching_blocks(self, duplicateNumber=6):
        """
        Return list of triples describing matching subsequences.
        参数:
            duplicateNumber: 重复多少字数才算抄袭的下限值
        """

        if self.matching_blocks is not None:
            return self.matching_blocks
        la, lb = len(self.a), len(self.b)

        queue = [(0, la, 0, lb)]
        matching_blocks = []
        while queue:
            alo, ahi, blo, bhi = queue.pop()
            i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
            # a[alo:i] vs b[blo:j] unknown
            # a[i:i+k] same as b[j:j+k]
            # a[i+k:ahi] vs b[j+k:bhi] unknown
            if k:   # if k is 0, there was no matching block
                # 匹配的size长度, k > duplicateNumber, duplicateNumber值根据具体情况而定
                if k >= duplicateNumber:
                    matching_blocks.append(x)
                if alo < i:
                    queue.append((alo, i, 0, lb))
                if i+k < ahi:
                    queue.append((i+k, ahi, 0, lb))
        matching_blocks.sort()
    def ratio(self, only_a=True):
        matches = sum(triple[-1] for triple in self.get_matching_blocks())
        return matches/len(self.a) if only_a else _calculate_ratio(matches, len(self.a) + len(self.b))

 

Logo

开放原子开发者工作坊旨在鼓励更多人参与开源活动,与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动,如meetup、训练营等,主打技术交流,干货满满,真诚地邀请各位开发者共同参与!

更多推荐