文本查重:difflib.SequenceMatcher
参考:SequenceMatcher in PythondifflibSequenceMatcher的基本思想是找到不包含“垃圾”元素的最长连续匹配子序列(LCS)。这不会产生最小的编辑序列,但是会产生对人“看起来正确”的匹配。
目录
参考
SequenceMatcher的基本思想是找到不包含“junk”元素的最长连续匹配子序列(LCS)。这不会产生最小的编辑序列,但是会产生对人“看起来正确”的匹配。 “junk”是不希望算法与之匹配的东西:例如普通文本文件中的空行,或者HTML文件中的“ <P>”行,等等。SequenceMatcher的输出很友好,例如:
Input Strings: my stackoverflow mysteries
AND mystery
SequenceMatcher algorithm returns:to anyone, the natural match is "myster"
as follows
my stackoverflow mysteries
.................mystery..
However, the LCS will output "mystery"
my stackoverflow mysteries
my.st.....er......y.......
1. SequenceMatcher FlowChart
Given two input strings a and b,
- ratio( ) returns the similarity score ( float in [0,1] ) between input strings. It sums the sizes of all matched sequences returned by function get_matching_blocks and calculates the ratio as: ratio = 2.0*M / T , where M = matches , T = total number of elements in both sequences
- get_matching_blocks( ) return list of triples describing matching subsequences. The last triple is a dummy, (len(a), len(b), 0). It works by repeated application of find_longest_match( )
- find_longest_match( ) returns a triple containing the longest matching block in a[aLow:aHigh] and b[bLow:bHigh]
1.1 get_matching_blocks()
该函数可根据自己项目需求进行修改:
def get_matching_blocks(self):
if self.matching_blocks is not None:
return self.matching_blocks
la, lb = len(self.a), len(self.b)
queue = [(0, la, 0, lb)]
matching_blocks = []
while queue:
alo, ahi, blo, bhi = queue.pop()
i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
# a[alo:i] vs b[blo:j] unknown
# a[i:i+k] same as b[j:j+k]
# a[i+k:ahi] vs b[j+k:bhi] unknown
if k: # if k is 0, there was no matching block
matching_blocks.append(x)
if alo < i and blo < j:
queue.append((alo, i, blo, j))
if i+k < ahi and j+k < bhi:
queue.append((i+k, ahi, j+k, bhi))
matching_blocks.sort()
新建一个队列queue,用包含两个输入字符串上下限索引的四元组初始化。当队列中存在四元组时,将其弹出并传递给find_longest_match()函数,该函数返回描述匹配子序列的三元组。三元组添加到matching_blocks列表中。
三元组格式为: (i, j, n),即 a[i:i+n] == b[j:j+n]
匹配子序列左侧和右侧的序列片段将进一步添加到队列queue中。重复该过程,直到队列为空。
然后对matching_blocks列表进行排序,并作为输出返回。
1.2 find_longest_match()
def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
besti, bestj, bestsize = alo, blo, 0
# find longest junk-free match
# during an iteration of the loop, j2len[j] = length of longest
# junk-free match ending with a[i-1] and b[j]
j2len = {} # {位置:重复最长长度}
nothing = []
for i in range(alo, ahi):
# look at all instances of a[i] in b; note that because
# b2j has no junk keys, the loop is skipped if a[i] is junk
j2lenget = j2len.get
newj2len = {} # newj2len 记录到目前为止匹配的字符串的长度
for j in b2j.get(a[i], nothing):
# a[i] matches b[j]
if j < blo: # 当前匹配子串[]之前的元素
continue
if j >= bhi: # 当前匹配子串之后的元素
break
k = newj2len[j] = j2lenget(j-1, 0) + 1 # 当前位置j的前一个位置对应的重复最长长度,使其+1
if k > bestsize: # 若当前最长长度 > bestsize,则将bestsize更新为当前最长长度,同时更新besti、bestj
besti, bestj, bestsize = i-k+1, j-k+1, k
j2len = newj2len
输入为包含字符串上下限索引的四元组,输出为包含最长匹配块的三元组。
首先,定义字典b2j,其中对于字符串b中的x,b2j [x]是x出现的索引(到b)的列表。
在外循环中按字符扫描第一个字符串a,我们使用b2j检查字符串b中该字符的出现。如果存在匹配项,我们将更新另一个字典newj2len,这有助于确保到目前为止匹配的字符串的长度。因此,变量besti,bestj和bestsize进行了更新,其中考虑了迄今为止获得的最长的匹配块数据。
在所有最大匹配块中,该算法返回最早在a中开始的那个,而在所有最大匹配中最早在a中开始的那个,它返回最早在b中开始的那个。
1.3 ratio()
def ratio(self):
"""Return a measure of the sequences' similarity (float in [0,1]).
Where T is the total number of elements in both sequences, and
M is the number of matches, this is 2.0*M / T.
Note that this is 1 if the sequences are identical, and 0 if
they have nothing in common."""
matches = sum(triple[-1] for triple in self.get_matching_blocks())
return _calculate_ratio(matches, len(self.a) + len(self.b))
retio()函数计算序列a和b的相似度,ratio = 2.0*M / T,M为匹配的字符数,T为两个序列的总字符数。相似度的计算可根据实际情况进行修改。
根据经验,ratio()值超过0.6表示序列是紧密匹配的。在这里,根据计算,我们获得了0.8的相似性得分比,因此输入序列对被视为相似。
2. 例子说明
Let’s describe the step by step procedure of the algorithm by implementing it using an input pair of strings.
3. 项目需求函数更改
get_matching_blocks():
- 计算a在b中的抄袭率,即a序列在b序列中的重复度,b保持不动,遍历a的子序列去跟b比较,即(alo, i, 0, lb)或(i+k, ahi, 0, lb)。
- 当重复字数k达到一定值时,视为抄袭,即k >= duplicateNumber。
- 相似度radio:ratio = M / len(a),即重复率 = 重复字符数/a的长度
def get_matching_blocks(self, duplicateNumber=6):
"""
Return list of triples describing matching subsequences.
参数:
duplicateNumber: 重复多少字数才算抄袭的下限值
"""
if self.matching_blocks is not None:
return self.matching_blocks
la, lb = len(self.a), len(self.b)
queue = [(0, la, 0, lb)]
matching_blocks = []
while queue:
alo, ahi, blo, bhi = queue.pop()
i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
# a[alo:i] vs b[blo:j] unknown
# a[i:i+k] same as b[j:j+k]
# a[i+k:ahi] vs b[j+k:bhi] unknown
if k: # if k is 0, there was no matching block
# 匹配的size长度, k > duplicateNumber, duplicateNumber值根据具体情况而定
if k >= duplicateNumber:
matching_blocks.append(x)
if alo < i:
queue.append((alo, i, 0, lb))
if i+k < ahi:
queue.append((i+k, ahi, 0, lb))
matching_blocks.sort()
def ratio(self, only_a=True):
matches = sum(triple[-1] for triple in self.get_matching_blocks())
return matches/len(self.a) if only_a else _calculate_ratio(matches, len(self.a) + len(self.b))
开放原子开发者工作坊旨在鼓励更多人参与开源活动,与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动,如meetup、训练营等,主打技术交流,干货满满,真诚地邀请各位开发者共同参与!
更多推荐
所有评论(0)