最近闲着没事学学rust;正好公司之前用来分析账号相似度的模块是用python写的,于是想到用rust重写底层算法提高运行效率,顺便练练手。

稍微翻了一下github,在字符串相似度方面现成的开源rust轮子不多,一个手数的过来,而且质量普遍不算很高(无脑递归,或者直接将jaro_winkler的p因子固定为0.1,前者仅是运行性能层面的不足,而后者则会导致运算结果与预期不一致。另外某高star的项目是不支持中文编码的。)

先把levenshtein、jaro、jaro_winkler的轮子造一下,基本够用,有空再把其他基础算法补上吧。

gitee:https://gitee.com/DontBeProud/str_sim

github:https://github.com/DontBeProud/str_sim

 

levenshtein.rs

use std::cmp::min;


/// levenshtein edit distance
pub fn levenshtein_distance(s1: &str, s2:&str) -> usize{
    let row = s1.chars().count() + 1;
    let col = s2.chars().count() + 1;
    let mut matrix = vec![vec![0; col].into_boxed_slice(); row].into_boxed_slice();
    for i in 0..row{
        for j in 0..col{
            if i * j == 0{
                matrix[i][j] = i + j;
            }
            else {
                let b_not_equal = if s1.chars().nth(i - 1) != s2.chars().nth(j - 1) {1} else {0};
                matrix[i][j] = min(min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1), matrix[i - 1][j - 1] + b_not_equal);
            }
        }
    }
    matrix[row - 1][col - 1]
}

jaro.rs

use std::cmp;


/// jaro similarity
pub fn sim_jaro(s1: &str, s2: &str) -> f64 {
    let s1_len = s1.chars().count();
    let s2_len = s2.chars().count();
    if s1_len == 0 && s2_len == 0 { return 1.0; }

    let match_distance = cmp::max(s1_len, s2_len) / 2 - 1;
    let mut s1_matches = vec![false; s1_len];
    let mut s2_matches = vec![false; s2_len];
    let mut m: isize = 0;
    for i in 0..s1_len {
        let start = cmp::max(0, i as isize - match_distance as isize) as usize;
        let end = cmp::min(i + match_distance + 1, s2_len);
        for j in start..end {
            if !s2_matches[j] && s1.chars().nth(i) == s2.chars().nth(j) {
                s1_matches[i] = true;
                s2_matches[j] = true;
                m += 1;
                break;
            }
        }
    }
    if m == 0 { return 0.0; }
    let mut t = 0.0;
    let mut k = 0;
    for i in 0..s1_len {
        if s1_matches[i] {
            while !s2_matches[k] { k += 1; }
            if s1.chars().nth(i) != s2.chars().nth(k) { t += 0.5; }
            k += 1;
        }
    }

    let m = m as f64;
    (m / s1_len as f64 + m / s2_len as f64 + (m  - t) / m) / 3.0
}

jaro_winkler.rs

use crate::jaro::sim_jaro;
use std::cmp::min;


/// jaro-winkler similarity
pub fn sim_jaro_winkler(s1: &str, s2: &str, prefix_weight: f64) -> f64 {
    let jaro_sim = sim_jaro(s1, s2);
    let prefix_len = get_longest_prefix_length(s1, s2);
    let mut prefix_factor_coefficient = prefix_weight * prefix_len as f64;
    prefix_factor_coefficient = if prefix_factor_coefficient < 0.0 { 0.0 } else if prefix_factor_coefficient > 1.0 { 1.0 } else { prefix_factor_coefficient };
    jaro_sim + prefix_factor_coefficient * (1.0 - jaro_sim)
}


fn get_longest_prefix_length(s1: &str, s2: &str) -> usize{
    let min_len = min(s1.chars().count(), s2.chars().count());
    let mut longest_prefix_length = min_len;
    for i in 0..min_len{
        if s1.chars().nth(i) != s2.chars().nth(i){
            longest_prefix_length = i;
            break;
        }
    }
    longest_prefix_length
}

 

Logo

瓜分20万奖金 获得内推名额 丰厚实物奖励 易参与易上手

更多推荐