Content de-duplication #1496
-
Hi - I am working on a content de-duplication project. My document repository contains a large number of files (English only) in varying formats (PDF, Excel, Word), and many of the contents are semantically similar. Is it possible to use any of the pre-trained Haystack models and pipelines for this de-duplication use case? If I have to train a new model for this use case, what approach would be best? The question can be as simple as: "Show all documents that are semantically similar." Hope someone in this group can assist. Thanks!
Replies: 4 comments 22 replies
-
Hi @sekh77! We have a MostSimilarDocumentsPipeline (see here) that allows you to find the most similar documents given one document. For creating document embeddings, you might want to use a sentence-transformers model (see here for details). I hope this answers your question :)
-
Great stuff! Thank you @bogdankostic. What are "document_ids"? Are these filenames? For the use case that I am developing, only the document repository is passed as input to the system. In other words, the repository is likely to contain thousands of documents in varying formats (PDF, Word, Excel, plain text). What the end-user expects is that the model automatically takes one document (file) at a time from this repository, compares it against the remaining (n-1) documents, repeats this for every document in the repository, and eventually publishes a report (in JSON format) showing the following information:
Is it possible to update your code snippet to achieve this? A feature like this would be an excellent addition to Haystack, because this is a very common use case for most enterprises with respect to their data migration activities. Such a report would help migrate only the latest and greatest version of each document, ignoring duplicates. Truly appreciate your wonderful support! Cheers!
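Independently of Haystack, the all-pairs comparison and JSON report described above can be sketched in plain Python. This is a minimal illustration, assuming each document has already been converted to an embedding vector (e.g. by a sentence-transformers model); the file names, vectors, and similarity threshold below are hypothetical.

```python
import json
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def duplicate_report(embeddings, threshold=0.9):
    """Compare every document against the remaining n-1 documents and
    report, per file, which others exceed the similarity threshold."""
    report = []
    for name, vec in embeddings.items():
        dupes = [
            {"file": other, "score": round(cosine(vec, other_vec), 4)}
            for other, other_vec in embeddings.items()
            if other != name and cosine(vec, other_vec) >= threshold
        ]
        report.append({"file": name, "duplicates": dupes})
    return report

# Toy embeddings (hypothetical); real ones would come from an embedding model.
embeddings = {
    "file_0673.txt": [1.0, 0.0, 0.1],
    "file_0781.txt": [0.9, 0.1, 0.1],
    "file_0042.txt": [0.0, 1.0, 0.0],
}
print(json.dumps(duplicate_report(embeddings), indent=2))
```

Note this brute-force loop is O(n²) in the number of documents; for thousands of files you would normally let a vector store with an approximate-nearest-neighbour index (as Haystack's document stores do) handle the search instead.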
-
Hi @bogdankostic - I managed to get this running for my document store and could see duplicates being reported. Here's an example of the Document result object:

most_similar_docs = [[
    {'text': '<>', 'score': 1.0, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0673.txt'}, 'embedding': None, 'id': 'e1eccfd26a6354b493a601bf966d2b2a'},
    {'text': '<>', 'score': 0.93728964, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0781.txt'}, 'embedding': None, 'id': 'ea119020fb1dad657dbbef87e7419894'},
]]

I have 10 different entries, most_similar_docs[0] to most_similar_docs[9], in the result object. Each entry has top_k=4, so most_similar_docs[0] has 4 entries. How do I loop through most_similar_docs and generate a CSV report with the columns: File name, Score, Duplicate Files? I tried this so far:

print(list(map(lambda item: item.get('score', 'default value'), most_similar_docs)))

Any help would be greatly appreciated. Thanks!
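One way to flatten that nested structure into a CSV is sketched below. It assumes most_similar_docs is a list of lists of result dicts shaped like the example above, where the first entry of each inner list is the query document itself (score 1.0) and the rest are its nearest neighbours; the column layout and toy data are illustrative, not Haystack's own output format.

```python
import csv
import io

def to_csv(most_similar_docs):
    """Write one CSV row per query document: its file name, the best
    duplicate's score, and the names of all duplicate files."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["File name", "Score", "Duplicate Files"])
    for group in most_similar_docs:
        # First entry is the query document itself; the rest are matches.
        query, duplicates = group[0], group[1:]
        writer.writerow([
            query["meta"]["name"],
            duplicates[0]["score"] if duplicates else "",
            "; ".join(d["meta"]["name"] for d in duplicates),
        ])
    return buf.getvalue()

# Toy data shaped like the result object above (texts and extra fields elided).
most_similar_docs = [[
    {"score": 1.0, "meta": {"name": "file_0673.txt"}},
    {"score": 0.93728964, "meta": {"name": "file_0781.txt"}},
]]
print(to_csv(most_similar_docs), end="")
```

Writing to an io.StringIO buffer keeps the sketch self-contained; in practice you would pass an open file handle to csv.writer instead.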
-
You can just follow the example in the link, but you will need to pass in your own annotated dataset for a custom evaluation, or you can compare how your model does against a public dataset. At the bottom of that link, it says the dataset should be in SQuAD format and mentions checking out the SquadData object in haystack/squad_data.py. That should have code examples. There aren't any other examples to my knowledge.
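As background for the metrics question quoted below, Recall and mean Average Precision can also be computed by hand from ranked retrieval results. This is a minimal sketch using the standard definitions (the queries, document IDs, and gold labels are hypothetical), not Haystack's own evaluation code.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of queries for which at least one relevant document
    appears in the top-k retrieved results."""
    hits = sum(1 for q in ranked if set(ranked[q][:k]) & relevant[q])
    return hits / len(ranked)

def mean_average_precision(ranked, relevant):
    """Mean over queries of average precision: precision at each rank
    where a relevant document was retrieved, averaged per query."""
    ap_values = []
    for q, docs in ranked.items():
        hits, precisions = 0, []
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant[q]:
                hits += 1
                precisions.append(hits / rank)
        ap_values.append(sum(precisions) / len(relevant[q]) if relevant[q] else 0.0)
    return sum(ap_values) / len(ap_values)

# Hypothetical ranked retrieval results and gold relevance labels.
ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
relevant = {"q1": {"d2"}, "q2": {"d4", "d6"}}
print(recall_at_k(ranked, relevant, 2))        # recall within top 2
print(mean_average_precision(ranked, relevant))
```

The annotated dataset mentioned above supplies exactly the `relevant` side of this calculation: gold labels saying which documents truly answer each query.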
…On Fri, Sep 30, 2022 at 12:59 PM Sankalp ***@***.***> wrote:
@JoeREISys <https://github.com/JoeREISys> thanks a lot, I got it working. 🎉
A quick follow-up question - I also want to evaluate the retrieved sentences (documents) using the mAP and Recall metrics mentioned here <https://haystack.deepset.ai/guides/evaluation#metrics-retrieval>. How can I achieve it? Is there an example? Thanks!