Loading Data into your PGD Cluster v6.1.0

Suggest edits

PGD is, at its core, a Postgres database, so you can use the same tools and methods to load data into your PGD cluster as you would with any PostgreSQL database. To get you started, this guide will walk you through the process of loading data into your PGD cluster.

Online CSV Importing

First, we are going to show how you can import data from an online CSV file into your PGD cluster. In this case, it's some historical baseball data from Baseball Databank. We are going to use the \COPY command in psql to import directly from a URL. One thing \COPY doesn't do is create the table for you, so we will need to create the table first.

Connect to your PGD cluster using psql, either using docker compose exec host-1 psql or if you have psql installed locally, using that to connect to port 6432 on your host machine.

CREATE TABLE batters (
                      id SERIAL,
                      playerid VARCHAR(9),
                      yearid INTEGER,
                      stint INTEGER,
                      teamid VARCHAR(3),
                      lgid VARCHAR(2),
                      g INTEGER,
                      ab INTEGER,
                      r INTEGER,
                      h INTEGER,
                      "2b" INTEGER,
                      "3b" INTEGER,
                      hr INTEGER,
                      rbi INTEGER,
                      sb INTEGER,
                      cs INTEGER,
                      bb INTEGER,
                      so INTEGER,
                      ibb INTEGER,
                      hbp INTEGER,
                      sh INTEGER,
                      sf INTEGER,
                      gidp INTEGER,
                      PRIMARY KEY (id)
);

Now we can import the CSV data into the batters table using the \COPY command:

\COPY batters(playerid,yearid,stint,teamid,lgid,g,ab,r,h,"2b","3b",hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp) FROM PROGRAM 'curl "https://raw.githubusercontent.com/cbwinslow/baseballdatabank/master/core/Batting.csv"' DELIMITER ',' CSV HEADER

This command uses curl to fetch the CSV file from the URL and pipes it directly into the \COPY command, which imports the data into the batters table. The batters(...) entry defines which fields in the row the CSV data should go to. The DELIMITER ',' CSV HEADER options specify that the file is a CSV, using commas, with a header row, that gets skipped.

Copy and the command and paste it into your psql session. If everything is set up correctly, you should see the data being imported without any errors. You should see output indicating the number of rows copied, like this:

COPY 110495

To verify that the data has been loaded correctly, you can run a simple query:

SELECT COUNT(*) FROM batters;

You should see a result like this:

 count
-------
 110495
(1 row)

This confirms that 110,495 rows have been successfully imported into the batters table.

Let's quickly user it to work out who 1998's home run leader was

SELECT playerid, yearid, teamid, hr
FROM batters
WHERE yearid = 1998
ORDER BY hr DESC
LIMIT 1;

You should see output like this:

 playerid  | yearid | teamid | hr
-----------+--------+--------+----
 mcgwima01 |   1998 | SLN    | 70
(1 row)

And if we want to put that into the context of the top 5 highest ranked home run hitters in 1998, we can do:

SELECT playerid, yearid, teamid,
       rank() OVER (PARTITION BY yearid ORDER BY hr desc) hr_rank,
       hr
FROM batters
WHERE yearid = 1998
ORDER BY hr_rank LIMIT 5;

You should see output like this:

 playerid  | yearid | teamid | hr_rank | hr
-----------+--------+--------+---------+----
 mcgwima01 |   1998 | SLN    |       1 | 70
 sosasa01  |   1998 | CHN    |       2 | 66
 griffke02 |   1998 | SEA    |       3 | 56
 vaughgr01 |   1998 | SDN    |       4 | 50
 belleal01 |   1998 | CHA    |       5 | 49
(5 rows)

With PGD, you can enjoy the full power of PostgreSQL, including advanced SQL features like window functions, to analyze your data, but with the added benefit of it being fully replicated and highly available across multiple nodes even when a node goes offline.

Next Steps

Now that you have loaded some data into your PGD cluster, you can explore the following topics:

Using the PGD CLI to manage your PGD cluster from the command line.

← Prev

Working with SQL and the PGD Cluster

↑ Up

First steps with your Quickstart PGD Cluster

Using PGD CLI

Loading Data into your PGD Cluster v6.1.0

Online CSV Importing

Next Steps

← Prev

↑ Up

Next →