文本查重：difflib.SequenceMatcher

参考：SequenceMatcher in PythondifflibSequenceMatcher的基本思想是找到不包含“垃圾”元素的最长连续匹配子序列（LCS）。这不会产生最小的编辑序列，但是会产生对人“看起来正确”的匹配。

满腹的小不甘_静静

12736人浏览 · 2020-05-11 15:53:49

满腹的小不甘_静静 · 2020-05-11 15:53:49 发布

1. SequenceMatcher FlowChart

1.1 get_matching_blocks()

1.2 find_longest_match()

1.3 ratio()

2. 例子说明

3. 项目需求函数更改

参考

SequenceMatcher的基本思想是找到不包含“junk”元素的最长连续匹配子序列（LCS）。这不会产生最小的编辑序列，但是会产生对人“看起来正确”的匹配。 “junk”是不希望算法与之匹配的东西：例如普通文本文件中的空行，或者HTML文件中的“ <P>”行，等等。SequenceMatcher的输出很友好，例如：

Input Strings: my stackoverflow mysteries AND mystery

SequenceMatcher algorithm returns：to anyone, the natural match is "myster" as follows

my stackoverflow mysteries .................mystery..

However, the LCS will output "mystery"

my stackoverflow mysteries my.st.....er......y.......

1. SequenceMatcher FlowChart

Given two input strings a and b,

ratio( ) returns the similarity score ( float in [0,1] ) between input strings. It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences
get_matching_blocks( ) return list of triples describing matching subsequences. The last triple is a dummy, (len(a), len(b), 0). It works by repeated application of find_longest_match( )
find_longest_match( ) returns a triple containing the longest matching block in a[aLow:aHigh] and b[bLow:bHigh]

1.1 get_matching_blocks()

该函数可根据自己项目需求进行修改：

def get_matching_blocks(self):
    if self.matching_blocks is not None:
        return self.matching_blocks
    la, lb = len(self.a), len(self.b)

    queue = [(0, la, 0, lb)]
    matching_blocks = []
    while queue:
        alo, ahi, blo, bhi = queue.pop()
        i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
        # a[alo:i] vs b[blo:j] unknown
        # a[i:i+k] same as b[j:j+k]
        # a[i+k:ahi] vs b[j+k:bhi] unknown
        if k:   # if k is 0, there was no matching block
            matching_blocks.append(x)
            if alo < i and blo < j:
                queue.append((alo, i, blo, j))
            if i+k < ahi and j+k < bhi:
                queue.append((i+k, ahi, j+k, bhi))
    matching_blocks.sort()

新建一个队列queue，用包含两个输入字符串上下限索引的四元组初始化。当队列中存在四元组时，将其弹出并传递给find_longest_match()函数，该函数返回描述匹配子序列的三元组。三元组添加到matching_blocks列表中。

三元组格式为： (i, j, n)，即 a[i:i+n] == b[j:j+n]

匹配子序列左侧和右侧的序列片段将进一步添加到队列queue中。重复该过程，直到队列为空。

然后对matching_blocks列表进行排序，并作为输出返回。

1.2 find_longest_match()

   def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
        a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
        besti, bestj, bestsize = alo, blo, 0
        # find longest junk-free match
        # during an iteration of the loop, j2len[j] = length of longest
        # junk-free match ending with a[i-1] and b[j]
        j2len = {}    # {位置：重复最长长度}
        nothing = []
        for i in range(alo, ahi):
            # look at all instances of a[i] in b; note that because
            # b2j has no junk keys, the loop is skipped if a[i] is junk
            j2lenget = j2len.get
            newj2len = {}    # newj2len 记录到目前为止匹配的字符串的长度
            for j in b2j.get(a[i], nothing):
                # a[i] matches b[j]
                if j < blo:   # 当前匹配子串[]之前的元素
                    continue
                if j >= bhi:  # 当前匹配子串之后的元素
                    break
                k = newj2len[j] = j2lenget(j-1, 0) + 1   # 当前位置j的前一个位置对应的重复最长长度，使其+1
                if k > bestsize:    # 若当前最长长度 > bestsize，则将bestsize更新为当前最长长度，同时更新besti、bestj
                    besti, bestj, bestsize = i-k+1, j-k+1, k
            j2len = newj2len

输入为包含字符串上下限索引的四元组，输出为包含最长匹配块的三元组。

首先，定义字典b2j，其中对于字符串b中的x，b2j [x]是x出现的索引（到b）的列表。

在外循环中按字符扫描第一个字符串a，我们使用b2j检查字符串b中该字符的出现。如果存在匹配项，我们将更新另一个字典newj2len，这有助于确保到目前为止匹配的字符串的长度。因此，变量besti，bestj和bestsize进行了更新，其中考虑了迄今为止获得的最长的匹配块数据。

在所有最大匹配块中，该算法返回最早在a中开始的那个，而在所有最大匹配中最早在a中开始的那个，它返回最早在b中开始的那个。

1.3 ratio()

    def ratio(self):
        """Return a measure of the sequences' similarity (float in [0,1]).
        Where T is the total number of elements in both sequences, and
        M is the number of matches, this is 2.0*M / T.
        Note that this is 1 if the sequences are identical, and 0 if
        they have nothing in common."""

        matches = sum(triple[-1] for triple in self.get_matching_blocks())
        return _calculate_ratio(matches, len(self.a) + len(self.b))

retio()函数计算序列a和b的相似度，ratio = 2.0*M / T，M为匹配的字符数，T为两个序列的总字符数。相似度的计算可根据实际情况进行修改。

根据经验，ratio（）值超过0.6表示序列是紧密匹配的。在这里，根据计算，我们获得了0.8的相似性得分比，因此输入序列对被视为相似。

2. 例子说明

Let’s describe the step by step procedure of the algorithm by implementing it using an input pair of strings.

3. 项目需求函数更改

get_matching_blocks()：

计算a在b中的抄袭率，即a序列在b序列中的重复度，b保持不动，遍历a的子序列去跟b比较，即（alo, i, 0, lb）或（i+k, ahi, 0, lb）。
当重复字数k达到一定值时，视为抄袭，即k >= duplicateNumber。
相似度radio：ratio = M / len(a)，即重复率 = 重复字符数/a的长度

    def get_matching_blocks(self, duplicateNumber=6):
        """
        Return list of triples describing matching subsequences.
        参数：
            duplicateNumber: 重复多少字数才算抄袭的下限值
        """

        if self.matching_blocks is not None:
            return self.matching_blocks
        la, lb = len(self.a), len(self.b)

        queue = [(0, la, 0, lb)]
        matching_blocks = []
        while queue:
            alo, ahi, blo, bhi = queue.pop()
            i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
            # a[alo:i] vs b[blo:j] unknown
            # a[i:i+k] same as b[j:j+k]
            # a[i+k:ahi] vs b[j+k:bhi] unknown
            if k:   # if k is 0, there was no matching block
                # 匹配的size长度， k > duplicateNumber， duplicateNumber值根据具体情况而定
                if k >= duplicateNumber:
                    matching_blocks.append(x)
                if alo < i:
                    queue.append((alo, i, 0, lb))
                if i+k < ahi:
                    queue.append((i+k, ahi, 0, lb))
        matching_blocks.sort()

    def ratio(self, only_a=True):
        matches = sum(triple[-1] for triple in self.get_matching_blocks())
        return matches/len(self.a) if only_a else _calculate_ratio(matches, len(self.a) + len(self.b))