7-4 Document Distance （PAT ADSAA） (24/35)

这道题只拿了24/35分，我猜问题出在stemming函数里，但是暂时不知道该怎么处理。

天天AZ

817人浏览 · 2023-06-21 14:14:33

天天AZ · 2023-06-21 14:14:33 发布

这道题只拿了24/35分，我猜问题出在stemming函数里，但是暂时不知道该怎么处理。

#include <string>
#include <iostream>
#include <cstdio>
#include <map>
#include <set>
#include <vector>
#include <cmath>
const int MAXN = 100;

int N, M, sz, t;
char ch;
double result;
std::string title, tmp, a, b;
std::map<std::string, int> titleNbr;
std::map<std::string, int> wordCnt[MAXN];
std::set<std::string> words;
std::vector<int> frequency[MAXN];
int normSquare[MAXN];

std::string stemming(std::string str){
    if(str.size() >= 2 && str.substr(str.size() - 2) == "ed"){
        return str.substr(0, str.size() - 2);
    }
    if(str.size() >= 3 && str.substr(str.size() - 3) == "ies"){
        return str.substr(0, str.size() - 3) + "y";
    }
    if(str.size() >= 3 && str.substr(str.size() - 3) == "ing"){
        return str.substr(0, str.size() - 3);
    }
    if(str.size() >= 2 && str.substr(str.size() - 2) == "es"){
        return str.substr(0, str.size() - 2);
    }
    return str;
}

double calDistance(int m, int n){
    int sum = 0;
    for(int i = 0; i < sz; ++i){
        sum += frequency[m][i] * frequency[n][i];
    }
    return acos(sum * 1.0 / sqrt(normSquare[m] * 1.0) / sqrt(normSquare[n] * 1.0));
}

int main(){
    scanf("%d", &N);
    for(int i = 0; i < N; ++i){
        std::cin >> title;
        titleNbr[title] = i;
        do{
            scanf("%c", &ch);
            if(isalnum(ch)){
                tmp += std::tolower(ch);
            } else{
                if(!tmp.empty()){
                    tmp = stemming(tmp);
                    wordCnt[i][tmp]++;
                    words.insert(tmp);
                    tmp.clear();
                }
            }
        } while(ch != '#');
    }
    sz = words.size();
    for(int i = 0; i < N; ++i){
        normSquare[i] = 0;
        frequency[i].resize(sz);
        int j = 0;
        for(auto it = words.begin(); it != words.end(); ++it, ++j){
            t = wordCnt[i][*it];
            frequency[i][j] = t;
            normSquare[i] += t * t;
        }
    }
    scanf("%d", &M);
    for(int i = 1; i <= M; ++i){
        std::cin >> a >> b;
        printf("Case %d: %.3f\n", i, calDistance(titleNbr[a], titleNbr[b]));
    }
    return 0;
}

题目如下：

Plagiarism is a form of academic dishonesty. To fight with such misconducts, plagiarism checkers are developed to compare any submitted article with all the articles stored in a database. The key is that given two documents, how to judge their similarity, or in other words, document distance? The document distance can be measured with their word metrics. In this project, you are supposed to write a program to calculate the distance of any given pair of documents.

Some important terms are defined as follows:

Word: is a continuous sequence of alphanumeric characters. For example, the phrase "the course data structure is fun" consists of 6 distinct words. Here we can assume that no word contains more than 20 characters.
Word frequency: denoted by FD(w) is the number of times the word w occurs in a document D.
Document distance metric: is defined to be the inner product of the two vectors containing the word frequencies for all words in two documents. Denote the frequency vector for document Di by F(Di)=(Fi(w1),⋯,Fi(wn))T, then the metric is given by (F(D1),F(D2))=∑i=1nF1(wi)F2(wi). In other words, the document distance metric is the projection of vector F(D1) onto F(D2), or vice versa.
Angle metric: is the angle between the vectors F(D1) and F(D2) which gives an indication of overlap between the two documents D1 and D2. The angle can be computed by the following formula:

θ(D1,D2)=arccos(∣∣F(D1)∣∣⋅∣∣F(D2)∣∣(F(D1),F(D2)))

where θ∈[0,π/2] and ∣∣⋅∣∣ can be any norm of a vector. To be more specific, here we use the 2-norm which is defined by ∣∣F(D)∣∣=(F(D),F(D)).

Input Specification:

Each input file contains one test case. For each case, the first line of input contains a positive integer N (≤100) which is the number of text files to be processed. Then N blocks follow, each contains a file. The first line of each file contains the title of the article (string of no more than 6 characters without any space), and the last line contains a single character #. After the file blocks, there is a line containing a positive integer M (≤100,000), followed by M lines of inquiries. Each inquiry gives a pair of file names separated by a space. The size of the largest test case is about 1 MB.

Output Specification:

For each inquiry, print in a line Case #: *, where # is the test case number starting from 1 and * is the document distance value accurate up to 3 decimal places.

Note: Your stemming algorithm must be able to handle "es", "ed", "ing", "ies" and must be case-insensitive. Stop words are not supposed to be ignored and must be treated as normal words.

Sample Input:

3
A00
A B C
#
A01
B C D
#
A02
A C
D A
#
2
A00 A01
A00 A02

Sample Output:

Case 1: 0.841
Case 2: 0.785

开放原子开发者工作坊

开放原子开发者工作坊旨在鼓励更多人参与开源活动，与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动，如meetup、训练营等，主打技术交流，干货满满，真诚地邀请各位开发者共同参与！

更多推荐

第二届开放原子大赛首批创新成果集结武汉，诚邀广大开发者共鉴开源技术盛宴

开放原子开发者工作坊

诚邀报名 | 开源基础设施能力建设分论坛：打造开源生态的“心脏”

开放原子开发者工作坊

诚邀报名 | 编程语言分论坛：AI时代的技术革新与开源实践

开放原子开发者工作坊

所有评论(0)

查看更多评论

天天AZ

@linh2006

已为社区贡献1条内容