Open Source Software Technical Articles

Want the Best of the Wazi Blogs Delivered Directly to your Inbox?

Subscribe to Wazi by Email

Your email:

Connect with Us!

Current Articles | RSS Feed RSS Feed

Take Advantage of Database Field Arrays: Examples Using PostgreSQL

  
  
  

Storing arrays in database fields provides a number of challenges, but also a number of opportunities for RDBMS developers. We'll illustrate some of those issues using PostgreSQL as an example, but the principles can be applied to other database systems as well, including MySQL.

Beginners studying PostgreSQL database design are often familiar with boolean, integer, char, and so on as field data types, but they may be puzzled by notations such as integer[]. The addition of the brackets is a mystery. When you are given a list of field types to select from, for example in a dropdown list in PgAdmin while designing a table containing both the plain character varying and the alternate character varying[], it is easy to select one when you mean the other, and wind up spending several hours doing painful debugging.

The brackets signify that the column, instead of holding just one item, such as a name, will hold multiple items. Let's see some data that illustrates the idea. The players table in Data set 1 below has three fields: pid (serial type, an integer, one only), name (character varying, maximum length 100, a person's name, one only), and teams (character varying[], an array of character varying strings, with multiple elements in the array):

Data set 1
pid - serial name (character varying) teams (character varying[])
1 Alan {badminton,squash}
2 Alice {badminton,hockey}
3 Ben {soccer,rugby,cricket}
4 Betty {cricket,soccer,hockey}

Suppose you want to know which people are on the badminton team. The SQL statement

select name from players where 'badminton' = any(teams)

returns Alan and Alice.

Note that while the teams field has a type of character varying[], the actual data you see in the field is a string enclosed in opening and closing braces. Yes, I agree, it is confusing.

Now consider what happens if you change Alan's teams to '{badminton,squash,[hockey,cricket]}'. One reason to do this might be that you wish to indicate a primary sport, a secondary sport, and separately those sports the person is qualified to referee. If you repeat the query but this time specify cricket as the team, the query still only returns Ben and Betty, since the reference to cricket in Alan's record is one level down from the top level. The change in Alan's teams is somewhat easier to read since you have used both brackets and braces. The braces define the field, the brackets define a separate array or list in the field, which is not found unless you programmatically tease it out in a script.

At the beginner level, arrays in fields provide a simple solution to the problem of storage of multiple items that are the same type of thing, in this case the names of a sport activity.

Arrays or Junction Tables?

Intermediate users will immediately pounce on the teams field, complaining that this table design is not good database practice since the data is not normalized. When the data is not normalized or broken up into components that don't have to be repeated, there's a good possibility of errors such as spelling inconsistency creeping in. This is much more likely using the arrays approach.

It is better database practice to split up or normalize the data. You might consider three tables: persons, sport activities, and a junction table. The players and sport activity tables would contain just the name of the item and a serial number, and the junction table would contain the serial numbers of linked items:

Data set 2 - persons
pid (serial) name (character varying)
1 Alan
2 Alice

 

Data set 3 - sports
sid (serial) sport (character varying)
1 Badminton
2 Cricket

 

Data set 4 - junction
jid (serial) pid (integer) sid (integer)
1 1 1
2 1 2

Now, to find out what sports Alan is involved in, use the SQL statement:

select sport from sports
inner join junction using (sid)
inner join persons using (pid)
where name = 'Alan'

The result shows that Alan plays Badminton and Cricket.

The advantage of the junction table approach is that you now only specify the spelling of "Cricket" once, which minimizes the chances of entering it incorrectly elsewhere. The big disadvantage is that you now have three tables to manage and not just one. The larger and more complex the project, the more junction tables help; the smaller the project, the converse applies.

While the the junction table approach has many advantages, field arrays can still be useful. For example, there is also a hybrid possibility, that of separate tables for players and teams, but in this case you replace the junction table with an array of integers in the teams field of the players table. The field teams would be defined as type integer[] and contain data such as {1,2,3}, where each integer refers to a sport activity whose ID number is defined in the sports table. In addition, you can specify subscripts for the array in your SQL, such as:

select player,sport from players
inner join sports on players.teams[1] = sports.sid

In this statement you are creating a join between the players and sports tables using element one of the teams array to connect to the related id in the sports table. This implies that there is something significant about the role of element 1 of the teams array that is common to all records.

This hybrid approach solves the issue of potential spelling errors in the sports table, but does not provide as much richness of functionality as junction tables. Field arrays provide maximum readability directly from the table and ease of SQL statement construction, while the junction table approach provides maximum control but least readability, at least directly from the tables on those odd occasions when you need to be able to read the tables without referring to other data through queries.

Scripting

Perhaps the best reason for using arrays in database fields is to take advantage of the power of scripting. When you do so, you rely on the script to keep all the data normalized. Here is a sample Python 2.7 script that illustrates the use of a field array using Data set 1 above. The script asks the question, "What sports does Ben play?"

#
from __future__ import print_function
import sys
import psycopg2
pqkb = psycopg2.connect("host='server' user='user' password='password' dbname='dbname'")
pqcur = pqkb.cursor()
#
sql = "select teams from players where name = %s"
data = ('Ben',)
pqcur.execute(sql,data)
row = pqcur.fetchone()
sports = row[0]
for sport in sports:
    print("Ben plays %s" % sport)

The important line to note is where the value from the database field is assigned directly to a Python list in the line sports = row[0]. The ability to assign a field directly to a Python list is a convenient feature of database field arrays. You can of course go on to pop items from the sports list, or push items onto the same list and store the new array back in the field as required.

19a98812-f823-48dc-841e-bf029c63c6d7

More Advanced Usage

While PostgreSQL field arrays may contain loose data, they really come into their own when dealing with highly organized data.

Consider the important philosophical question: "What is reddish?" Substitute greenish, bluish, sort of violet, or vaguely beige if you prefer, it comes down to the same thing in database terms. Let's stick with reddish for now.

In RGB terms, pure red is {255,0,0}, which indicates lots of red and no green or blue at all. But {255,1,1} is also pretty solidly red, and so is {250,10,10}. So the question then becomes, when does red stop being red and become another color? Somebody has to decide the upper and lower boundaries of values given specific circumstances, whether it is for, say, a study of bee vision or how, forensically, the color of dried blood changes over time in the atmosphere.

You will note that I used the brace notation to continue the context of PostgreSQL field arrays. Let's change the length of the array to take into account upper and lower bounds of each color. Thus in the specification {245,255,0,10,0,10} someone has decided that reddish in this case can lose 10 units of red and gain either 10 units of green or blue or both and still remain reddish for the purposes.

"Reddish" can now be stored in one record in a database, along with greenish and bluish and all the others. All of them will have a field with an array of six components (the upper and lower bounds for each of the three colors). And you can run a test to decide on truth or not. If you are examining an image and a pixel reports RGB(250,11,5), then the test for reddish fails since there is too much green.

Here is a Python 2.7 project that illustrates the database usage:

#
from __future__ import print_function
import sys
import psycopg2
pqkb = psycopg2.connect("host='server' user='user' password='password' dbname='dbname'")
pqish = pqkb.cursor()
#
(r,g,b,hue) = int(sys.argv[1]),int(sys.argv[2]), int(sys.argv[3]),sys.argv[4]
print("the RGB pixel info is:",r,g,b)
sql = "select rgb_lower_upper from colorish where colorlabel = %s"
data = (hue,)
pqish.execute(sql,data)
row = pqish.fetchone()
rgb = row[0]
print("%s is defined in the db as %s" % (hue,rgb))
if r in range(rgb[0],rgb[1]+1):
    print('red passes')
    red = True
else:
    print('red fails')
    red = False
if g in range(rgb[2],rgb[3]+1):
    print('green passes')
    green = True
else:
    print('green fails')
    green = False
if b in range(rgb[4],rgb[5]+1):
    print('blue passes')
    blue = True
else:
    print('blue fails')
    blue = False
if red and green and blue:
    print("The color is %s" % hue)
else:
    print("The color is not %s" % hue)

In this code, you fetch the parameters for reddish from the database, but read the test parameters on the command line, and where necessary convert them to integers to make them comparable to the upper and lower bounds of reddish.

Here is the output when the script runs:

$ python pgsqlarrays.py 235 2 3 reddish
the RGB pixel info is: 235 2 3
reddish is defined in the db as [245, 255, 0, 10, 0, 10]
red fails
green passes
blue passes
The color is not reddish

Note that you don't have to follow this structure. The lower and upper bound idea defines a three-dimensional "box" in RGB space. The edges of the box are sharp: A red value of 245 is reddish, but, according to this definition, 244 is not. An interesting alternate structure would be to store floats, such as {5.0,250.0,10.0,5.0,10.0,5.0} where the first number in the pairs is the standard deviation and the second is the mean, and apply a probability function to the result. Then, instead of a sharp true or false answer, you could generate "This color is 90% reddish," with the boundaries a lot more vague. Additionally, you can compare two hues and decide which of them is more reddish.

As you can see, adding [] to a field type has enormous implications that can either lead to errors or be used to great advantage under the appropriate circumstances.




This work is licensed under a Creative Commons Attribution 3.0 Unported License
Creative Commons License.


This work is licensed under a Creative Commons Attribution 3.0 Unported License
Creative Commons License.

Comments

Currently, there are no comments. Be the first to post one!
Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics