I am working on a PostgreSQL 8.1 SQL script which needs to delete a large number of rows from a table.

Let's say the table I need to delete from is Employees (~260K rows). Its primary key is named id.

The rows to delete from this table are stored in a separate temporary table called EmployeesToDelete (~10K rows), which has a foreign key reference to Employees.id called employee_id.

Is there an efficient way of doing this?

Initially, I came up with the following:

    DELETE  FROM    Employees
    WHERE   id IN
            (
            SELECT  employee_id
            FROM    EmployeesToDelete
            )

However, I've heard that using the "IN" clause with a subquery can be inefficient, especially with larger tables.

I've checked the PostgreSQL 8.1 documentation, and it mentions DELETE FROM ... USING, but it doesn't have good examples, so I'm not sure how to use it.

I'm wondering if the following works, and whether it is more efficient?

    DELETE  FROM    Employees
    USING   Employees e
    INNER JOIN
            EmployeesToDelete ed
    ON      e.id = ed.employee_id

Your thoughts are greatly appreciated.

Edit: I ran EXPLAIN ANALYZE, and the strange thing is that the first DELETE ran quite quickly (within a few seconds), while the second DELETE took so long (over 20 min) that I eventually cancelled it.

Adding an index to the temp table helped performance a great deal.
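For reference, the index I added looks something like this (the index name is just illustrative; the column is the employee_id foreign key from the question):

    -- Index the join column on the temporary table so the planner
    -- can look up matching rows instead of scanning the whole table.
    CREATE INDEX idx_employeestodelete_employee_id
        ON EmployeesToDelete (employee_id);

    -- Refresh the planner's statistics for the temp table.
    ANALYZE EmployeesToDelete;

Running ANALYZE matters for temp tables, since autovacuum statistics are often missing for them and the planner's row estimates can be badly off (note the rows=200 estimate vs. rows=10731 actual in the plan below).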

Here is the query plan from the first DELETE for anyone interested:

 Hash Join  (cost=184.64..7854.69 rows=256482 width=6) (actual time=54.089..660.788 rows=27295 loops=1)
   Hash Cond: ("outer".id = "inner".employee_id)
   ->  Seq Scan on Employees  (cost=0.00..3822.82 rows=256482 width=10) (actual time=15.218..351.978 rows=256482 loops=1)
   ->  Hash  (cost=184.14..184.14 rows=200 width=4) (actual time=38.807..38.807 rows=10731 loops=1)
         ->  HashAggregate  (cost=182.14..184.14 rows=200 width=4) (actual time=19.801..28.773 rows=10731 loops=1)
               ->  Seq Scan on EmployeesToDelete  (cost=0.00..155.31 rows=10731 width=4) (actual time=0.005..9.062 rows=10731 loops=1)

 Total runtime: 935.316 ms
(7 rows)

For now, I'll stick with the first DELETE unless I can find a better way of writing it.

Don't guess, measure. Try the different techniques and see which one is the fastest to complete. Also, use EXPLAIN to understand what PostgreSQL will do and see where you can optimize. Very few PostgreSQL users can correctly guess the fastest query...
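As a sketch (using the table names from the question): EXPLAIN ANALYZE actually executes the statement, so when timing a DELETE you can wrap it in a transaction and roll back, keeping the rows intact:

    BEGIN;

    -- EXPLAIN ANALYZE runs the DELETE for real and reports
    -- actual row counts and timings for each plan node.
    EXPLAIN ANALYZE
    DELETE  FROM    Employees
    WHERE   id IN
            (
            SELECT  employee_id
            FROM    EmployeesToDelete
            );

    -- Undo the deletion so the measurement is repeatable.
    ROLLBACK;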

I'm not sure about the DELETE FROM ... USING syntax, but in general, a subquery should logically be the same thing as an INNER JOIN anyway. The database query optimizer should be capable (though this is only a guess) of producing the same query plan for both.

I'm wondering if the following works and is more efficient?

    DELETE  FROM    Employees e
    USING   EmployeesToDelete ed
    WHERE   id = ed.employee_id;

This depends entirely on your index selectivity.

PostgreSQL tends to employ a MERGE IN JOIN for IN predicates, which has stable execution time.

It isn't affected by how many rows satisfy the condition, as long as you have an ordered resultset.

An ordered resultset requires either a sort operation or an index. A full index traversal is quite inefficient in PostgreSQL compared to a SEQ SCAN.

The JOIN predicate, on the other hand, can benefit from using NESTED LOOPS if your index is highly selective, and from using a HASH JOIN if it is non-selective.

PostgreSQL should choose the right one by estimating the row count.

Since you have 10K rows against 260K rows, I expect a HASH JOIN to be more efficient, and you should try to build a plan on a DELETE ... USING query.

To make sure, please post the execution plans for both queries.

Why can't you delete the rows in the first place, instead of adding them to the EmployeesToDelete table?

Or if you need to be able to undo, just add a "deleted" flag to Employees, so you can reverse the deletion, or make it permanent, all in one table?
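A minimal sketch of that soft-delete approach (the column name and the id value are illustrative, not from the question):

    -- Add a soft-delete flag instead of physically removing rows.
    ALTER TABLE Employees ADD COLUMN deleted boolean NOT NULL DEFAULT false;

    -- "Delete" a row by flagging it.
    UPDATE Employees SET deleted = true WHERE id = 42;

    -- Undoing is just flipping the flag back.
    UPDATE Employees SET deleted = false WHERE id = 42;

    -- Later, make the deletions permanent in one pass.
    DELETE FROM Employees WHERE deleted;

The trade-off is that every query against Employees then needs a WHERE NOT deleted filter (or a view that applies it for you).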