PostgreSQL: The lower the LIMIT, the slower the query

I have the following query:

SELECT translation.id 
FROM "TRANSLATION" translation 
    INNER JOIN "UNIT" unit 
    ON translation.fk_id_translation_unit = unit.id 
    INNER JOIN "DOCUMENT" document 
    ON unit.fk_id_document = document.id 
WHERE document.fk_id_job = 3665 
ORDER BY translation.id asc 
LIMIT 50

It runs for a terrible 110 seconds.

Table sizes:

+-------------+-----------+
| Table       | Records   |
+-------------+-----------+
| TRANSLATION | 6,906,679 |
| UNIT        | 6,906,679 |
| DOCUMENT    |    42,321 |
+-------------+-----------+

However, when I change the LIMIT parameter from 50 to 1000, the query finishes in 2 seconds.

Here is the query plan for the slow one:

Limit (cost=0.00..146071.52 rows=50 width=8) (actual time=111916.180..111917.626 rows=50 loops=1)
  -> Nested Loop (cost=0.00..50748166.14 rows=17371 width=8) (actual time=111916.179..111917.624 rows=50 loops=1)
        Join Filter: (unit.fk_id_document = document.id)
        -> Nested Loop (cost=0.00..39720545.91 rows=5655119 width=16) (actual time=0.051..15292.943 rows=5624514 loops=1)
              -> Index Scan using "TRANSLATION_pkey" on "TRANSLATION" translation (cost=0.00..7052806.78 rows=5655119 width=16) (actual time=0.039..1887.757 rows=5624514 loops=1)
              -> Index Scan using "UNIT_pkey" on "UNIT" unit (cost=0.00..5.76 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=5624514)
                    Index Cond: (unit.id = translation.fk_id_translation_unit)
        -> Materialize (cost=0.00..138.51 rows=130 width=8) (actual time=0.000..0.006 rows=119 loops=5624514)
              -> Index Scan using "DOCUMENT_idx_job" on "DOCUMENT" document (cost=0.00..137.86 rows=130 width=8) (actual time=0.025..0.184 rows=119 loops=1)
                    Index Cond: (fk_id_job = 3665)

and for the fast one:

Limit (cost=523198.17..523200.67 rows=1000 width=8) (actual time=2274.830..2274.988 rows=1000 loops=1) 
    -> Sort (cost=523198.17..523241.60 rows=17371 width=8) (actual time=2274.829..2274.895 rows=1000 loops=1) 
     Sort Key: translation.id 
     Sort Method: top-N heapsort Memory: 95kB 
     -> Nested Loop (cost=139.48..522245.74 rows=17371 width=8) (actual time=0.095..2252.710 rows=97915 loops=1) 
      -> Hash Join (cost=139.48..420861.93 rows=17551 width=8) (actual time=0.079..2005.238 rows=97915 loops=1) 
       Hash Cond: (unit.fk_id_document = document.id) 
       -> Seq Scan on "UNIT" unit (cost=0.00..399120.41 rows=5713741 width=16) (actual time=0.008..1200.547 rows=6908070 loops=1) 
       -> Hash (cost=137.86..137.86 rows=130 width=8) (actual time=0.065..0.065 rows=119 loops=1) 
        Buckets: 1024 Batches: 1 Memory Usage: 5kB 
        -> Index Scan using "DOCUMENT_idx_job" on "DOCUMENT" document (cost=0.00..137.86 rows=130 width=8) (actual time=0.009..0.041 rows=119 loops=1) 
         Index Cond: (fk_id_job = 3665) 
      -> Index Scan using "TRANSLATION_idx_unit" on "TRANSLATION" translation (cost=0.00..5.76 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=97915) 
       Index Cond: (translation.fk_id_translation_unit = unit.id)

The execution plans are clearly very different, and the second one makes the query about 50 times faster.

I have indexes on all fields involved in the query, and I ran ANALYZE on all tables just before running the queries.

Can anyone see what is wrong with the first query?

UPDATE: table definitions

CREATE TABLE "public"."TRANSLATION" (
    "id" BIGINT NOT NULL, 
    "fk_id_translation_unit" BIGINT NOT NULL, 
    "translation" TEXT NOT NULL, 
    "fk_id_language" INTEGER NOT NULL, 
    "relevance" INTEGER, 
    CONSTRAINT "TRANSLATION_pkey" PRIMARY KEY("id"), 
    CONSTRAINT "TRANSLATION_fk" FOREIGN KEY ("fk_id_translation_unit") 
    REFERENCES "public"."UNIT"("id") 
    ON DELETE CASCADE 
    ON UPDATE NO ACTION 
    DEFERRABLE 
    INITIALLY DEFERRED, 
    CONSTRAINT "TRANSLATION_fk1" FOREIGN KEY ("fk_id_language") 
    REFERENCES "public"."LANGUAGE"("id") 
    ON DELETE NO ACTION 
    ON UPDATE NO ACTION 
    NOT DEFERRABLE 
) WITHOUT OIDS; 

CREATE INDEX "TRANSLATION_idx_unit" ON "public"."TRANSLATION" 
    USING btree ("fk_id_translation_unit"); 

CREATE INDEX "TRANSLATION_language_idx" ON "public"."TRANSLATION" 
    USING hash ("translation");

CREATE TABLE "public"."UNIT" (
    "id" BIGINT NOT NULL, 
    "text" TEXT NOT NULL, 
    "fk_id_language" INTEGER NOT NULL, 
    "fk_id_document" BIGINT NOT NULL, 
    "word_count" INTEGER DEFAULT 0, 
    CONSTRAINT "UNIT_pkey" PRIMARY KEY("id"), 
    CONSTRAINT "UNIT_fk" FOREIGN KEY ("fk_id_document") 
    REFERENCES "public"."DOCUMENT"("id") 
    ON DELETE CASCADE 
    ON UPDATE NO ACTION 
    NOT DEFERRABLE, 
    CONSTRAINT "UNIT_fk1" FOREIGN KEY ("fk_id_language") 
    REFERENCES "public"."LANGUAGE"("id") 
    ON DELETE NO ACTION 
    ON UPDATE NO ACTION 
    NOT DEFERRABLE 
) WITHOUT OIDS; 

CREATE INDEX "UNIT_idx_document" ON "public"."UNIT" 
    USING btree ("fk_id_document"); 

CREATE INDEX "UNIT_text_idx" ON "public"."UNIT" 
    USING hash ("text");

CREATE TABLE "public"."DOCUMENT" (
    "id" BIGINT NOT NULL, 
    "fk_id_job" BIGINT, 
    CONSTRAINT "DOCUMENT_pkey" PRIMARY KEY("id"), 
    CONSTRAINT "DOCUMENT_fk" FOREIGN KEY ("fk_id_job") 
    REFERENCES "public"."JOB"("id") 
    ON DELETE SET NULL 
    ON UPDATE NO ACTION 
    NOT DEFERRABLE 
) WITHOUT OIDS;

UPDATE: database parameters

shared_buffers = 2048MB 
effective_cache_size = 4096MB 
work_mem = 32MB 

Total memory: 32GB 
CPU: Intel Xeon X3470 @ 2.93 GHz, 8MB cache


2012-07-26 twoflower

Can you post the table definitions? –

@JohnTotetWoo Updated – twoflower

Is your installation tuned at all? What are the settings for shared_buffers, effective_cache_size and work_mem, and what are your system specs? – eevar

Here is an interesting part of the official ANALYZE documentation.

For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. [...] The extent of analysis can be controlled by adjusting the default_statistics_target configuration variable, or on a column-by-column basis by setting the per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS.

Apparently, this is a common way to improve a bad query plan. ANALYZE will be a bit slower, but the resulting query plan may be better.
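
As a rough sketch of what that could look like for the tables in the question: the choice of columns (the join column on "UNIT" and the filtered column on "DOCUMENT") and the target value of 1000 are assumptions made for illustration, not values taken from the question, and whether they change the plan would have to be checked with EXPLAIN ANALYZE.

```sql
-- Assumed columns and target value; raise the per-column sample size
-- so the planner gets finer-grained statistics for these columns.
ALTER TABLE "UNIT" ALTER COLUMN "fk_id_document" SET STATISTICS 1000;
ALTER TABLE "DOCUMENT" ALTER COLUMN "fk_id_job" SET STATISTICS 1000;

-- The new target only takes effect once statistics are gathered again.
ANALYZE "UNIT";
ANALYZE "DOCUMENT";
```

After that, re-running the LIMIT 50 query under EXPLAIN ANALYZE would show whether the row estimates and the chosen plan have changed.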

ALTER TABLE


2012-07-29 00:21:57 basgys

In the first query, the optimizer chooses a strategy that avoids the sort by scanning "TRANSLATION" in primary key order. The problem is that rows satisfying the condition on document.fk_id_job turn up very rarely in that order, so the index scan and the nested loop join have to go a very long way before they fill the result set. The slow plan shows this: about 5.6 million "TRANSLATION" rows are read before 50 matching rows are found.
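
One commonly used way to test this explanation is to make the primary key index useless for ordering, so the planner has to collect the matching rows first and sort them afterwards. The sketch below is an assumption for illustration, not something taken from the question; it uses the column name fk_id_translation_unit from the table definitions, and its effect on this data set has not been measured here.

```sql
SELECT translation.id
FROM "TRANSLATION" translation
    INNER JOIN "UNIT" unit
    ON translation.fk_id_translation_unit = unit.id
    INNER JOIN "DOCUMENT" document
    ON unit.fk_id_document = document.id
WHERE document.fk_id_job = 3665
-- "+ 0" turns the sort key into an expression, so an index scan on
-- "TRANSLATION_pkey" no longer delivers rows in the requested order.
ORDER BY translation.id + 0 ASC
LIMIT 50;
```

The result set is the same as for the original query, because adding zero does not change the ordering. If the explanation holds, EXPLAIN ANALYZE on this variant should show a Sort node above the joins instead of the scan on "TRANSLATION_pkey".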


2012-08-06 10:35:44 dialogbox
