2021-11-28

Is there a way to optimise this query?

I am trying to calculate various tf-idf measures for a database with 4 million documents. I've created three tables:

CREATE TABLE document
(
    id INTEGER NOT NULL,
    title VARCHAR(20) NOT NULL,
    body VARCHAR NOT NULL
);
CREATE TABLE token
(
    id serial NOT NULL,
    word VARCHAR NOT NULL,
    idf INTEGER NOT NULL
);
CREATE TABLE token_count
(
    docId INTEGER NOT NULL,
    tokenId INTEGER NOT NULL,
    amount INTEGER NOT NULL
);

I am using the following query to populate the token_count table. It works, but it's pretty slow. Is there a way to optimise it?

with temp_data as (
    select id,
           (ts_stat('select to_tsvector(''english'', body) from document where id=' || id)).*
    from document
)
insert into token_count (docid, tokenid, amount)
select
    id,
    (select id from token where word = temp_data.word limit 1),
    nentry
from temp_data;
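One likely cost above is running ts_stat() once per document, each call executing its own dynamic SQL. A set-based alternative is to unnest each document's tsvector directly and join against token once. A sketch, assuming PostgreSQL 9.6+ (for unnest on tsvector) and the schema above, not a tested drop-in replacement:

```sql
-- Sketch: count occurrences per (document, lexeme) from the position
-- arrays that to_tsvector records, instead of one ts_stat() call with
-- dynamic SQL per document. Assumes PostgreSQL 9.6 or later.
insert into token_count (docid, tokenid, amount)
select d.id,
       t.id,
       array_length(lex.positions, 1)
from document d
cross join lateral unnest(to_tsvector('english', d.body)) as lex
join token t on t.word = lex.lexeme;
```

The join on token.word also benefits from an index on that column, and the whole statement can be batched over ranges of document.id if a single pass over 4 million rows is too large a transaction.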


from Recent Questions - Stack Overflow https://ift.tt/3nYVLXI
