An efficient length-segmented inverted index-based set similarity query algorithm Online publication date: Mon, 07-Nov-2022
by Mengjuan Li; Lianyin Jia; Juntao Hu; Ruiqi Zhang; Shoulin Wei; Mengni Pan
International Journal of Computing Science and Mathematics (IJCSM), Vol. 16, No. 1, 2022
Abstract: Set Similarity query is extensively used in recommendation systems, near duplicate detection and many other fields. Existing approaches usually scan all relevant inverted lists or trie indexes, thus leading to low efficiency. To solve this problem, in this paper, an efficient length segmentation-based inverted index, LenSegII, is designed. LenSegII imposes a length segmentation table (LST) on inverted indexes by exploiting length filtering, thus logically dividing an inverted index into segments. Based on LenSegII, LenSC-SSQ, an efficient set similarity query algorithm is proposed. LenSC-SSQ only needs to access a certain range of inverted index segments and locates the segments efficiently by LST, thus a large number of unqualified sets can be filtered, which significantly decreases query time. Besides, LenSC-SSQ can improve space efficiency by using a more compact count array. Experiments on 2 real and 1 synthetic datasets show LenSC-SSQ outperforms the other competitors significantly and can reach a performance improvement up to 5.8X over ScanCount.
Existing subscribers:
Go to Inderscience Online Journals to access the Full Text of this article.
If you are not a subscriber and you just want to read the full contents of this article, buy online access here.Complimentary Subscribers, Editors or Members of the Editorial Board of the International Journal of Computing Science and Mathematics (IJCSM):
Login with your Inderscience username and password:
Want to subscribe?
A subscription gives you complete access to all articles in the current issue, as well as to all articles in the previous three years (where applicable). See our Orders page to subscribe.
If you still need assistance, please email subs@inderscience.com