A New Approach for Multi-pattern String Matching in Large Text Corpora
Paper ID : 1127-IST
1Mojgan Farhoodi *, 2Ehsan Sherkat, 3Alireza Yari
1ICT Research Institute (ITRC)
2Education & Research Institute for ICT
3Iran Telecom Research Center
Multi-pattern string matching with a large set of patterns is a key problem in many of today's text retrieval applications. Examples include filtering undesirable URLs, finding quotations from famous holy books in text, extracting specific patterns from DNA sequences, antivirus scanning, intrusion detection, and even music retrieval. As the size of corpora and the number of patterns increase, the need to find multiple patterns efficiently also increases. In this paper, a new approach for multi-pattern string matching is introduced. The proposed approach filters out useless parts of the text and uses indexing techniques alongside finite state machines in order to achieve better performance. Besides its efficiency, the proposed approach has the advantage of being scalable and accurate even in noisy text corpora. Multiple experiments have been conducted on three real-life datasets in order to evaluate the efficiency, flexibility, and scalability of the proposed approach.
Multiple Pattern Matching; String Matching; Partial Pattern Matching; Text Retrieval