The idea of the repository is to have a streamlined version of the teachings I had per chapter1, focused on the Postgres itself. I found it interesting to change the database to avoid a simple copy/paste by myself. I chose a public dataset from the Brazilian government regarding confiscated cigarettes (and destroyed)2, that I may (or may not) change in the future.
In this repository we will use docker-compose to run a local postgres instance, to start run:
docker compose up --remove-orphans [-d]
# or
sh scripts/00_start.sh
To use the psql CLI3:
sh scripts/10_psql.sh
# or directly commands
sh scripts/10_psql.sh <<< 'select 1;'
sh scripts/10_psql.sh < my_sql.sql
- Respect the ETL (extract, transform, load)
- Before a
create
must have adrop * if exists [cascade]
- Create a schema just for importing the data with text type
- Put everything in an idempotent script (same input same output)
- Create lookup tables: reduce repetitions and could speedup search. A new table with all distinct values from the given column (of the import.* table), with a primary key (that will be used as foreign key)
- Date in Postgres is stored as a UTC
- TIMESTAMPT: values in UTC
- TIMESTAMPTZ: converts TIMESTAMP values (UTC) to the client's session time zone
- make clean && make
- A query is
Sargable
(Search ARGument ABLE) if it can take advantage of an index to speed up the execution of the query - Avoid
LIKE %*
as it isnon-sargable
ILIKE
is case-insensitive
- Create a view to make querying easier (and drop if exists)
- \x: toggle expanded display (table vs record by record)
- \H: change output to html layout
- \o file_name: change output to a file
- Full-text indexing:
to_tsvector(my_text_column)
andsearch @@ to_tsquery('my_search_query')
- GIN: Generalized Inverted Index (key, posting list)
- Materialized View: a view does not store data, a
Materialized View
is stored in disk (and so, allowed to create an index). Once the original data is updated, it is necessary to update the materialized view:REFRESH MATERIALIZED VIEW my_view_name
- Don't burn yourself out
- Count how many files follow the pattern:
ls */*.txt | wc -l
- csvkit: command-line tools for converting to and working with CSV, such as:
- in2csv: convert * to CSV
- csvcut -n: print column names
- csvsql: import into PostgreSQL (remember to replace varchar to text)
csvsql my_csv_path -i postgresql --tables "import.my_name" --noconstraints –overwrite | sed 's/VARCHAR/text/g' > import.sql
- Using
cut
command to retrieve the desired columns:cut -d ';' -f 1-2,4-6 user_data/cigarros.csv
- CTE (Common Table Expression):
with temp_table_name as ( select ...), other_name as (select ... where temp_table_name.x = 2) select * from other_name
with count_events as (select count(*) from events),
count_regions as (select count(*) from regions)
select * from count_events, count_regions;
- PostgreSQL support the following languages: PL/pgSQL, PL/Tcl, PL/Perl, and PL/Python.
drop function if exists my_function(arguments_types);
create function my_function (arguments)
returns return_datatype
IMMUTABLE -- if pure function (no side effects, same input same output)
as $$
declare
...
begin
...
return my_return_value
end;
$$ language sql;
Or similar to the one presented in the book
drop function if exists max_price_per_region(int);
create function max_price_per_region(rid int, out numeric)
as $$
select max(price) from events where events.region_id=rid
-- implicity return the first row of this select
$$ language sql;
To other languages, such as python, it is necessary to run CREATE LANGUAGE [plpython3u]
that requires installing python itself along with Postgres.
- Window function: similar to aggregation without grouping rows
count(1) over (partition by ...)
(you may want to adddistinct
in after theselect
)
as example, the percentage of entries per year with max/min price per year.
select distinct
year,
round(100.0 * (count(1) over (partition by year)::numeric
/ (count(1) over())::numeric), 2) as percentage_of_entries,
(max(price) over (partition by year)::numeric
/ (count(1) over())::numeric) as max_price,
(min(price) over (partition by year)::numeric
/ (count(1) over())::numeric) as min_price
from events;
- Chapter dedicated to explaining 'why to choose postgres'
- PostgreSQL License: liberal Open Source license, similar to the BSD or MIT licenses. You can distribute, modify and commercial use.
- Price per hour comparison on AWS-RDS with db.t3.xlarge (16GB, 4vCPU): $0.60 oracle vs $0.29 Postgres
- MVCC (Multi-Version Concurrency Control): locks acquired for querying (reading) data don't conflict with locks acquired for writing data. (reading doesn't blocks writing, and writing doesn't blocks reading).
- For writing: the data is lifted into virtual memory, only when the transaction is complete is the entire change written to disk in one operation.
- You (unfortunately) may add bias to the data (but avoid it)
- Occam's razor: the simplest explanation is usually the best one
- The Temporary table is automatically dropped at the end of a session / current transaction
CREATE TEMPORARY TABLE my_temp_table_name
- Use
explain [analyze] my_query
to shows the execution plan of a query, it is important to check itscost
and itsscans
- The
cost
is a relative measurement, which you may want to minimize. - There are several ways to scan a table such as
sequential scan
,index scan
, ... \d my_element
describes a element (table, view, index, ...)- In a many-to-many (N:M) relationship, a
junction table
is used to support relationship between two tables. The junction table contains the two foreign keys and creates its own primary key using those two foreign keys. - In postgres is possible to index with
where
(e.g.: index where a column is not null), and alsoconcurrently
(without paying) - BTREE index (balanced tree)
- Also the
tsrange
(timestamp range) in which you define a window (start, end) and inclusive/exclusive options. - Other range options:
int4range
,numrange
,daterange
... - Ranges Types: Your Life Will Never Be The Same
- To query ranges, use
@>
e.g:SELECT int4range(10, 30) @> 20
, to compute the intersection between two ranges use*
, and to check if the ranges overlaps, use&&
- Remember to create a new index using GIST (Generalized Search Tree) on the ranges (if you wish)
- However, if you are using a range to constrain results it may not help (as Dee problem)
- Time for plot
- Overall, I loved this book! I never thought a tutorial could be so well done
- Please, feel free to contribute to more lessons :D
Footnotes
-
Some notes here were not directly found in the book, but the book led me to research about. ↩
-
Data From: https://dados.gov.br/dados/conjuntos-dados/destinacoes-de-mercadorias-apreendidas ↩