Content de-duplication #1496
-
Hi - I am working on a content de-duplication project. My document repository contains a large number of files (English only) in varying formats (PDF, Excel, Word), and many of the contents are semantically similar. Is it possible to use any of the pre-trained Haystack models and pipelines for this de-duplication use case? If I have to train a new model for this use case, what approach would be best? The question can be as simple as: "Show all documents that are semantically similar." Hope someone in this group can assist. Thanks!
Replies: 4 comments 22 replies
-
Hi @sekh77! We have a MostSimilarDocumentsPipeline (see here) that allows you to find the most similar documents given one document. For creating document embeddings, you might want to use a sentence-transformers model (see here for details). I hope this answers your question :)
-
Great stuff! Thank you @bogdankostic. What are "document_ids"? Are these filenames? For the use case that I am developing, only the document repository is passed as input to the system. In other words, the repository is likely to contain thousands of documents in varying formats (PDF, Word, Excel, plain text). What the end-user expects is that the model automatically takes one document (file) at a time from this repository, compares it against the remaining (n-1) documents, repeats this for every document in the repository, and eventually publishes a report (in JSON format) showing the following information:
Is it possible to update your code snippet to achieve this? A feature like this would be an excellent addition to Haystack, because this is a very common use case for most enterprises with respect to their data migration activities. Such a report would help migrate only the latest and greatest version of each document, ignoring duplicates. Truly appreciate your wonderful support! Cheers!
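Independently of Haystack, the all-pairs comparison and JSON report described above can be sketched in plain Python. This is a minimal illustration, assuming each document has already been converted to an embedding vector (e.g. by a sentence-transformers model); the file names, vectors, and similarity threshold below are hypothetical.

```python
import json
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def duplicate_report(embeddings, threshold=0.9):
    """Compare every document against the remaining n-1 documents and
    report, per file, which others exceed the similarity threshold."""
    report = []
    for name, vec in embeddings.items():
        dupes = [
            {"file": other, "score": round(cosine(vec, other_vec), 4)}
            for other, other_vec in embeddings.items()
            if other != name and cosine(vec, other_vec) >= threshold
        ]
        report.append({"file": name, "duplicates": dupes})
    return report

# Toy embeddings (hypothetical); real ones would come from an embedding model.
embeddings = {
    "file_0673.txt": [1.0, 0.0, 0.1],
    "file_0781.txt": [0.9, 0.1, 0.1],
    "file_0042.txt": [0.0, 1.0, 0.0],
}
print(json.dumps(duplicate_report(embeddings), indent=2))
```

Note this brute-force loop is O(n²) in the number of documents; for thousands of files you would normally let a vector store with an approximate-nearest-neighbour index (as Haystack's document stores do) handle the search instead.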
-
Hi @bogdankostic - I managed to get this running for my document store and could see duplicates being reported. Here's an example of the Document result object:

most_similar_docs = [[
    {'text': '<>', 'score': 1.0, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0673.txt'}, 'embedding': None, 'id': 'e1eccfd26a6354b493a601bf966d2b2a'},
    {'text': '<>', 'score': 0.93728964, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0781.txt'}, 'embedding': None, 'id': 'ea119020fb1dad657dbbef87e7419894'},
]]

I have 10 different entries, most_similar_docs[0] to most_similar_docs[9], in the result object. Each entry has top_k=4, so most_similar_docs[0] has 4 entries. How do I loop through most_similar_docs and generate a CSV report with the columns: File name, Score, Duplicate Files? I tried this so far:

print(list(map(lambda item: item.get('score', 'default value'), most_similar_docs)))

Any help would be greatly appreciated. Thanks!
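One way to flatten that nested structure into a CSV is sketched below. It assumes most_similar_docs is a list of lists of result dicts shaped like the example above, where the first entry of each inner list is the query document itself (score 1.0) and the rest are its nearest neighbours; the column layout and toy data are illustrative, not Haystack's own output format.

```python
import csv
import io

def to_csv(most_similar_docs):
    """Write one CSV row per query document: its file name, the best
    duplicate's score, and the names of all duplicate files."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["File name", "Score", "Duplicate Files"])
    for group in most_similar_docs:
        # First entry is the query document itself; the rest are matches.
        query, duplicates = group[0], group[1:]
        writer.writerow([
            query["meta"]["name"],
            duplicates[0]["score"] if duplicates else "",
            "; ".join(d["meta"]["name"] for d in duplicates),
        ])
    return buf.getvalue()

# Toy data shaped like the result object above (texts and extra fields elided).
most_similar_docs = [[
    {"score": 1.0, "meta": {"name": "file_0673.txt"}},
    {"score": 0.93728964, "meta": {"name": "file_0781.txt"}},
]]
print(to_csv(most_similar_docs), end="")
```

Writing to an io.StringIO buffer keeps the sketch self-contained; in practice you would pass an open file handle to csv.writer instead.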
-
You can just follow the example in the link, but you will need to pass in your own annotated dataset for a custom evaluation, or you can compare how your model does against a public dataset. At the bottom of that link, it says the dataset should be in SQuAD format and mentions checking out the SquadData object in haystack/squad_data.py. That should have code examples. There aren't any other examples to my knowledge.
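As background for the metrics question quoted below, Recall and mean Average Precision can also be computed by hand from ranked retrieval results. This is a minimal sketch using the standard definitions (the queries, document IDs, and gold labels are hypothetical), not Haystack's own evaluation code.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of queries for which at least one relevant document
    appears in the top-k retrieved results."""
    hits = sum(1 for q in ranked if set(ranked[q][:k]) & relevant[q])
    return hits / len(ranked)

def mean_average_precision(ranked, relevant):
    """Mean over queries of average precision: precision at each rank
    where a relevant document was retrieved, averaged per query."""
    ap_values = []
    for q, docs in ranked.items():
        hits, precisions = 0, []
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant[q]:
                hits += 1
                precisions.append(hits / rank)
        ap_values.append(sum(precisions) / len(relevant[q]) if relevant[q] else 0.0)
    return sum(ap_values) / len(ap_values)

# Hypothetical ranked retrieval results and gold relevance labels.
ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
relevant = {"q1": {"d2"}, "q2": {"d4", "d6"}}
print(recall_at_k(ranked, relevant, 2))        # recall within top 2
print(mean_average_precision(ranked, relevant))
```

The annotated dataset mentioned above supplies exactly the `relevant` side of this calculation: gold labels saying which documents truly answer each query.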
…On Fri, Sep 30, 2022 at 12:59 PM Sankalp ***@***.***> wrote:
@JoeREISys <https://github.com/JoeREISys> thanks a lot, I got it working. 🎉
A quick follow-up question - I also want to evaluate the retrieved sentences (documents) using the mAP and Recall metrics mentioned here <https://haystack.deepset.ai/guides/evaluation#metrics-retrieval>. How can I achieve it? Is there an example? Thanks!