Loading Data into your PGD Cluster v6.0.1
PGD is, at its core, a Postgres database, so you can use the same tools and methods to load data into your PGD cluster as you would with any PostgreSQL database. To get you started, this guide will walk you through the process of loading data into your PGD cluster.
Online CSV Importing
First, we are going to show how you can import data from an online CSV file into your PGD cluster. In this case, it's some historical baseball data from Baseball Databank. We are going to use the \COPY
command in psql to import directly from a URL. One thing \COPY
doesn't do is create the table for you, so we will need to create the table first.
Connect to your PGD cluster using psql
, either using docker compose exec host-1 psql
or if you have psql
installed locally, using that to connect to port 6432 on your host machine.
CREATE TABLE batters ( id SERIAL, playerid VARCHAR(9), yearid INTEGER, stint INTEGER, teamid VARCHAR(3), lgid VARCHAR(2), g INTEGER, ab INTEGER, r INTEGER, h INTEGER, "2b" INTEGER, "3b" INTEGER, hr INTEGER, rbi INTEGER, sb INTEGER, cs INTEGER, bb INTEGER, so INTEGER, ibb INTEGER, hbp INTEGER, sh INTEGER, sf INTEGER, gidp INTEGER, PRIMARY KEY (id) );
Now we can import the CSV data into the batters
table using the \COPY
command:
\COPY batters(playerid,yearid,stint,teamid,lgid,g,ab,r,h,"2b","3b",hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp) FROM PROGRAM 'curl "https://raw.githubusercontent.com/cbwinslow/baseballdatabank/master/core/Batting.csv"' DELIMITER ',' CSV HEADER
This command uses curl
to fetch the CSV file from the URL and pipes it directly into the \COPY
command, which imports the data into the batters
table. The batters(...) entry defines which fields in the row the CSV data should go to.
The DELIMITER ',' CSV HEADER
options specify that the file is a CSV, using commas, with a header row, that gets skipped.
Copy and the command and paste it into your psql
session. If everything is set up correctly, you should see the data being imported without any errors. You should see output indicating the number of rows copied, like this:
COPY 110495
To verify that the data has been loaded correctly, you can run a simple query:
SELECT COUNT(*) FROM batters;
You should see a result like this:
count
-------
110495
(1 row)
This confirms that 110,495 rows have been successfully imported into the batters
table.
Let's quickly user it to work out who 1998's home run leader was
SELECT playerid, yearid, teamid, hr FROM batters WHERE yearid = 1998 ORDER BY hr DESC LIMIT 1;
You should see output like this:
playerid | yearid | teamid | hr
-----------+--------+--------+----
mcgwima01 | 1998 | SLN | 70
(1 row)
And if we want to put that into the context of the top 5 highest ranked home run hitters in 1998, we can do:
SELECT playerid, yearid, teamid, rank() OVER (PARTITION BY yearid ORDER BY hr desc) hr_rank, hr FROM batters WHERE yearid = 1998 ORDER BY hr_rank LIMIT 5;
You should see output like this:
playerid | yearid | teamid | hr_rank | hr
-----------+--------+--------+---------+----
mcgwima01 | 1998 | SLN | 1 | 70
sosasa01 | 1998 | CHN | 2 | 66
griffke02 | 1998 | SEA | 3 | 56
vaughgr01 | 1998 | SDN | 4 | 50
belleal01 | 1998 | CHA | 5 | 49
(5 rows)
With PGD, you can enjoy the full power of PostgreSQL, including advanced SQL features like window functions, to analyze your data, but with the added benefit of it being fully replicated and highly available across multiple nodes when a node fails.
Next Steps
Now that you have loaded some data into your PGD cluster, you can explore the following topics:
- Using the PGD CLI to manage your PGD cluster from the command line.
- On this page
- Online CSV Importing
- Next Steps