eXtropia: the open web technology company
Intro to the Web Application Development Environment
Introduction to Data  


[The Data Layer]

I suppose the most basic part of a web application is the data itself. As I said earlier, all web applications allow a user to submit instructions on how the web application should massage a bit of data on the web server. This might involve searching a database, creating a shopping cart of products, or emailing some information to someone.

Regardless, all of the data being manipulated share some basic characteristics.

  1. Data have values
  2. Data have types
  3. Data have descriptions
  4. Data have formats

    Data Values
    To say that data have values may seem a bit obvious at first. After all, what is data if not a representation of something? That is, if there is no value, there is no data, right? Without a value there is just a blank page or an empty database.

    Well okay, perhaps the statement is a bit obvious when taken at face value.

    What is crucial to understand however, is the deeper significance of the statement. That is, data, to be useful, must have value to the consumer. It must be "information".

    It cannot be said enough that on the web and within web applications, content is king. Whatever technology you use for storing, describing, searching, or modifying your data, it is crucial that the technology helps you to create information.

    Data Types
    Data types represent ways of categorizing different bits of data. At a most basic level, you can imagine two types of data: letters and numbers. Like any methodology for categorizing things, 'typing' your data helps make sense of it. In some cases, it also helps you store it more efficiently.

    Actual data types include more useful categories (from a programmer's perspective) such as dates, ints, floats, and strings. For example, the moment you tell a program that a set of data consists of numbers and not just strings, the program knows it can perform numeric operations. If the program is an accounting package and it knows that the accounts store numeric data representing money, then debits and credits may be performed on the account data using standard number-based operations like addition or subtraction.
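To make the point about types concrete, here is a minimal sketch in Python. The `apply_entry` function and the account figures are hypothetical and exist only for illustration; the point is simply that the same characters behave differently once the program knows their type.

```python
from decimal import Decimal

# The same characters mean different things depending on their type.
as_text = "100.50" + "25.25"                      # strings: concatenation
as_money = Decimal("100.50") + Decimal("25.25")   # numbers: addition

def apply_entry(balance: Decimal, amount: Decimal, kind: str) -> Decimal:
    """Apply a credit or a debit to a (hypothetical) account balance."""
    if kind == "credit":
        return balance + amount
    if kind == "debit":
        return balance - amount
    raise ValueError(f"unknown entry kind: {kind}")

print(as_text)    # 100.5025.25
print(as_money)   # 125.75
print(apply_entry(Decimal("10.00"), Decimal("2.50"), "debit"))   # 7.50
```

`Decimal` is used here rather than a float because money is usually better served by exact decimal arithmetic, though that is a design choice of this sketch and not something the discussion above requires.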

    Data Descriptions
    As anyone knows, one of the biggest bummers about the information age is the fact that it tends to create information glut rather than information itself. That is, data quickly overwhelms the user's ability to make sense of it and the consumer is buried.

    One way that data harvesters help to solve this problem is to include meta descriptions which help consumers quickly place data into categories that can be filed away so that they may be found more quickly on the basis of the clues that the meta descriptions provide.

    One example of a meta description is a set of cross-reference keywords. If you have the title of a book in a database, you might consider also applying keywords to the title that would help it match a search even if the exact search word does not appear in it.

    Consider a possible query a person might want to make. Let's say that person wants to find all the books on cooking in a web store. The truth of the matter is that many books on cooking do not have the word "Cooking" actually in the title. So a title like "Scandinavian Cuisine" would have keywords such as "Cooking" applied to the data. Thus, keywords that describe the title are an appropriate piece of information to store because they make searches on the database more useful.
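The keyword idea can be sketched in a few lines of Python. The catalog below and the `search` function are made up for illustration; a real store would keep this in a database, but the matching logic is the same in spirit.

```python
# Hypothetical catalog: each record carries a title plus descriptive keywords.
books = [
    {"title": "Scandinavian Cuisine", "keywords": {"cooking", "food"}},
    {"title": "Cooking for Two", "keywords": {"cooking"}},
    {"title": "Learning Perl", "keywords": {"programming"}},
]

def search(term: str) -> list[str]:
    """Return titles whose title text or keywords match the search term."""
    term = term.lower()
    return [
        book["title"]
        for book in books
        if term in book["title"].lower() or term in book["keywords"]
    ]

print(search("cooking"))   # ['Scandinavian Cuisine', 'Cooking for Two']
```

Without the keywords, the first title would never match a search for "cooking".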

    Data Formats
    Finally, data may be encapsulated within a format that helps the consumer understand how to display it. Formats continue to define the data beyond the actual data type we described earlier. A data type tells a program what type of data something is so that basic functional operations may be performed such as adding a day to a date or adding a list of numbers together.

    A format determines how that representative data type should display itself. For example, if a number is represented as Money in US Dollars, the format might be coded in such a way as to include a $ symbol in front with two decimal places for the cents in the dollars and cents that make up a US dollar figure. Likewise, in Europe, dates tend to be formatted as DD/MM/YYYY format whereas the same data might be displayed in MM/DD/YYYY format in the United States.
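As a small sketch of the difference between a type and a format, the Python below takes one number and one date and displays each according to a format. The sample values are arbitrary.

```python
from datetime import date
from decimal import Decimal

amount = Decimal("1234.5")   # the data type is a number
day = date(1999, 3, 4)       # the data type is a date

# The format decides how each value is displayed.
us_money = f"${amount:,.2f}"              # $ sign, thousands separator, two decimals
european_date = day.strftime("%d/%m/%Y")  # DD/MM/YYYY
us_date = day.strftime("%m/%d/%Y")        # MM/DD/YYYY

print(us_money)        # $1,234.50
print(european_date)   # 04/03/1999
print(us_date)         # 03/04/1999
```

Note that the underlying values never change; only their presentation does.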

    Summary of Data Characteristics
    All of the web technologies at the Data Layer define, describe, or standardize one or more of these characteristics.

    In the next few weeks we will review some of the most important data layer technologies to give you a sense of how they fit into the bigger picture. We will begin with raw data, move into databases, add a section on HTML, and conclude with an overview of XML.

    We begin with a discussion of raw data.

Raw Data  
Raw data is perhaps the most basic form of data that you will come across when designing web applications, but is often the best choice.

Typically, you will see raw data as delimited rows, with one record per line and the fields separated by a character such as a comma or a pipe.


Raw data is easy to parse, easy to access and easy to write to. Usually, data will be stored in some data file on the same file system as your web application. Accessing data is as simple as opening, reading, writing and closing local files. Unfortunately, as you will see in the next section, raw data is hard to maintain and manage.
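Here is a minimal sketch of reading and writing delimited rows in Python. The pipe delimiter, the field layout, and the helper names `read_rows` and `append_row` are all assumptions made for this example, and an in-memory string stands in for a local data file.

```python
import csv
import io

# Hypothetical pipe-delimited raw data; in practice this would be a local file.
RAW = """Eric|Tachibana|erict@eff.org
Selena|Sol|selena@eff.org
"""

def read_rows(handle):
    """Parse delimited rows into lists of fields."""
    return list(csv.reader(handle, delimiter="|"))

def append_row(handle, fields):
    """Write one record as a delimited row."""
    csv.writer(handle, delimiter="|", lineterminator="\n").writerow(fields)

rows = read_rows(io.StringIO(RAW))
print(rows[1])   # ['Selena', 'Sol', 'selena@eff.org']

out = io.StringIO()
append_row(out, ["Li Hsien", "Lim", "hsien@somedomain.com"])
print(out.getvalue(), end="")   # Li Hsien|Lim|hsien@somedomain.com
```

With a real file you would pass the result of `open(...)` instead of the `io.StringIO` objects.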

Once upon a time, in the primitive and barbarian days before computers, the amount of information shepherded by a group of people could be collected in the wisdom and the stories of its older members. In this world, storytellers, magicians, and grandparents were considered great and honored storehouses for all that was known.

Apparently, and according to vast archeological data, campfires were used (like command-line middleware) by the younger members of the community to access the information stored in the minds of the elders using APIs such as

public String TellUsAboutTheTimeWhen(String s);.

And then of course, like a sweeping and rapidly-encompassing viral infection, came agriculture, over-production of foodstuffs, and the origins of modern-day commerce.

Dealing with vast storehouses of wheat, rice, and maize became quite a chore for the monarchs and emperors that developed along with the new economy. There was simply too much data to be managed in the minds of the elders (who by now were feeling the effects of hardware obsolescence as they were being pushed quietly into the background).

And so, in order to store all the new information, humanity invented the technology of writing. And though great thinkers like Socrates warned that the invention of the alphabet would lead to the subtle but total demise of the creativity and sensibility of humanity, data began to be stored in voluminous data repositories, called books.

As we know, eventually books propagated with great speed and soon, whole communities of books migrated to the first real "databases", libraries.

Unlike previous versions of data warehouses (people and books), that might be considered the australopithecines of the database lineage, libraries crossed over into the modern-day species, though they were incredibly primitive of course.

Specifically, libraries introduced "standards" by which data could be stored and retrieved.

After all, without standards for accessing data, libraries would be like my closet, endless and engulfing swarms of chaos. Books, and the data within books, had to be quickly accessible by anyone if they were to be useful.

In fact, the usefulness of a library, or any base of data, is proportional to its data storage and retrieval efficiency. This one corollary would drive the evolution of databases over the next 2000 years to its current state.

Thus, early librarians defined standardized filing and retrieval protocols. Perhaps, if you have ever made it off the web, you will have seen an old library with its cute little indexing system (card catalog) and pointers (Dewey decimal system).

And for the next couple thousand years libraries grew, and grew, and grew along with associated storage/retrieval technologies such as the filing cabinet, colored tabs, and three ring binders.

All this until one day about half a century ago, some really bright folks working for the British government were asked to invent an advanced tool for breaking German cryptographic codes and aiming missiles.

That day the world changed again. That day the computer was born.

The computer was an intensely revolutionary technology of course, but as with any technology, people took it and applied it to old problems instead of using it to its revolutionary potential.

Almost instantly, the computer was applied to the age-old problem of information storage and retrieval. After all, by World War Two, information was already accumulating at rates beyond the space available in publicly supported libraries. And besides, it seemed somehow cheap and tawdry to store the entire archives of "The Three Stooges" in the Library of Congress. Information was seeping out of every crack and pore of modern day society.

Thus, the first attempts at information storage and retrieval followed traditional lines and metaphors. The first systems were based on discrete files in a virtual library. In this file-oriented system, a bunch of files would be stored on a computer and could be accessed by a computer operator. Files of archived data were called "tables" because they looked like tables used in traditional file keeping. Rows in the table were called "records" and columns were called "fields".

Consider the following example:

First Name | Last Name  | Email                  | Phone
Eric       | Tachibana  | erict@eff.org          | 213-456-0987
Selena     | Sol        | selena@eff.org         | 987-765-4321
Li Hsien   | Lim        | hsien@somedomain.com   | 65-777-9876
Jordan     | Ramacciato | nadroj@otherdomain.com | 222-3456-123

The "flat file" system was a start. However, it was seriously inefficient.

Essentially, in order to find a record, someone would have to read through the entire file and hope it was not the last record. With a hundred thousand records, you can imagine the dilemma.

What was needed, computer scientists thought (using existing metaphors again), was a card catalog: a means to achieve random access processing, that is, the ability to efficiently access a single record without searching the entire file to find it.

The result was the indexed file-oriented system in which a single index file stored "key" words and pointers to records that were stored elsewhere. This made retrieval much more efficient. It worked just like a card catalog in a library. To find data, one needed only search for keys rather than reading entire records.
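The card catalog idea can be sketched in Python as follows. This is only an illustration: the record layout, the choice of the email address as the key, and the `build_index` and `fetch` helpers are assumptions of this example, and the "index file" is just a dictionary in memory.

```python
import io

# Hypothetical flat file: one pipe-delimited record per line, keyed by email.
DATA = (
    b"erict@eff.org|Eric|Tachibana\n"
    b"selena@eff.org|Selena|Sol\n"
    b"hsien@somedomain.com|Li Hsien|Lim\n"
)

def build_index(handle):
    """Scan the file once, recording the byte offset where each key's record starts."""
    index = {}
    offset = handle.tell()
    for line in iter(handle.readline, b""):
        key = line.split(b"|", 1)[0].decode()
        index[key] = offset
        offset = handle.tell()
    return index

def fetch(handle, index, key):
    """Jump straight to a record via the index instead of reading the whole file."""
    handle.seek(index[key])
    return handle.readline().decode().rstrip("\n").split("|")

data_file = io.BytesIO(DATA)
index = build_index(data_file)
record = fetch(data_file, index, "selena@eff.org")
print(record)   # ['selena@eff.org', 'Selena', 'Sol']
```

The index maps each key to a pointer (here, a byte offset), so a lookup costs one dictionary access and one seek no matter how many records the data file holds.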

However, even with the benefits of indexing, the file-oriented system still suffered from problems including:

  • Data Redundancy - the same data might be stored in different places
  • Poor Data Control - redundant data might be slightly different such as in the case when Ms. Jones changes her name to Mrs. Johnson and the change is only reflected in some of the files containing her data
  • Inability to Easily Manipulate Data - it was a tedious and error prone activity to modify files by hand
  • Cryptic Work Flows - accessing the data could take excessive programming effort and was too difficult for real-users (as opposed to programmers).

Consider how troublesome the following data file would be to maintain.

Name                | Address         | Course               | Grade
Mr. Eric Tachibana  | 123 Kensigton   | Chemistry 102        | C+
Mr. Eric Tachibana  | 123 Kensigton   | Chinese 3            | A
Mr. Eric Tachibana  | 122 Kensigton   | Data Structures      | B
Mr. Eric Tachibana  | 123 Kensigton   | English 101          | A
Ms. Tonya Lippert   | 88 West 1st St. | Psychology 101       | A
Mrs. Tonya Ducovney | 100 Capitol Ln. | Psychology 102       | A
Ms. Tonya Lippert   | 88 West 1st St. | Human Cultures       | A
Ms. Tonya Lippert   | 88 West 1st St. | European Governments | A
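To see why redundancy leads to poor data control, consider this rough Python sketch that hunts for people who appear with more than one address. It uses a handful of the rows from the file above; the `conflicting_addresses` helper is hypothetical.

```python
from collections import defaultdict

# A few rows from the file above as (name, address, course, grade) tuples.
rows = [
    ("Mr. Eric Tachibana", "123 Kensigton", "Chemistry 102", "C+"),
    ("Mr. Eric Tachibana", "123 Kensigton", "Chinese 3", "A"),
    ("Mr. Eric Tachibana", "122 Kensigton", "Data Structures", "B"),
    ("Ms. Tonya Lippert", "88 West 1st St.", "Psychology 101", "A"),
    ("Ms. Tonya Lippert", "88 West 1st St.", "Human Cultures", "A"),
]

def conflicting_addresses(records):
    """Return the names that appear with more than one distinct address."""
    seen = defaultdict(set)
    for name, address, _course, _grade in records:
        seen[name].add(address)
    return {name: sorted(addrs) for name, addrs in seen.items() if len(addrs) > 1}

conflicts = conflicting_addresses(rows)
print(conflicts)   # {'Mr. Eric Tachibana': ['122 Kensigton', '123 Kensigton']}
```

Notice that the script can flag the conflict but cannot tell which address is right, and it has no way at all of knowing that a renamed person is the same person. That is exactly the maintenance headache described above.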

What was needed was a truly unique way to deal with the age-old problem, a way that reflected the medium of the computer rather than the tools and metaphors it was replacing.

Enter the database.

Simply put, a database is a computerized record keeping system. More completely, it is a system involving data, the hardware that physically stores that data, the software that utilizes the hardware's file system in order to 1) store the data and 2) provide a standardized method for retrieving or changing the data, and finally, the users who turn the data into information.

Databases, another creature of the 60s, were created to solve the problems with file-oriented systems in that they were compact, fast, easy to use, current, accurate, allowed the easy sharing of data between multiple users, and were secure.

A database might be as complex and demanding as an account tracking system used by a bank to manage the constantly changing accounts of thousands of bank customers, or it could be as simple as a collection of electronic business cards on your laptop.

The important thing is that a database allows you to store data and to retrieve or modify it easily and efficiently whenever you need to, regardless of the amount of data being manipulated. What the data is and how demanding you will be when retrieving and modifying that data is simply a matter of scale.

Traditionally, databases ran on large, powerful mainframes for business applications. You will probably have heard of such packages as Oracle 8 or Sybase SQL Server for example.

However, with the advent of small, powerful personal computers, databases have become more readily usable by the average computer user. Microsoft's Access and Inprise's (formerly Borland's) Paradox are two popular PC-based engines.

More importantly for our focus, databases have quickly become integral to the design, development, and services offered by web sites.

Consider a site like Amazon.com that must be able to allow users to quickly jump through a vast virtual warehouse of books and compact disks.

[Screen Shot of Amazon.com Page]

How could Amazon.com create web pages for every single item in their inventory, and how could they keep all those pages up to date? Well, the answer is that their web pages are created on-the-fly by a program that "queries" a database of inventory items and produces an HTML page based on the results of that query.
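The general technique of building a page on-the-fly from a query can be sketched with Python and its built-in sqlite3 module. To be clear, this is not how Amazon.com actually does it; the `inventory` table, its columns, and the `render_page` function are all invented for the example.

```python
import sqlite3
from html import escape

def render_page(conn, keyword):
    """Query the inventory and build an HTML page from the matching rows."""
    rows = conn.execute(
        "SELECT title, price FROM inventory WHERE title LIKE ? ORDER BY title",
        (f"%{keyword}%",),
    ).fetchall()
    items = "\n".join(
        f"<LI>{escape(title)}: ${price:.2f}" for title, price in rows
    )
    return f"<H2>Results for {escape(keyword)}</H2>\n<UL>\n{items}\n</UL>"

# Hypothetical in-memory inventory standing in for a real product database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("Learning Perl", 29.95), ("Java in a Nutshell", 24.5), ("Perl Cookbook", 39.95)],
)

page = render_page(conn, "Perl")
print(page)
```

No page exists until someone asks for it, so a price change in the database shows up in the very next page that is generated.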

For more information, check out my Introduction to Databases for Web Developers

HTML, as its name implies, is a markup language. As such, it is used to mark up text. But what exactly does it mean to mark up text?

Abstractly, marking up text is a methodology for encoding data with information about itself. Examples of markups (encoded data) are ubiquitous in the real world.

For example, back when you were slogging through high school, you probably used a bright yellow highlighter pen to highlight sentences in your schoolbooks (or at least you knew someone who did!). You did so because you thought that the highlighted sentences would be useful to review around exam time and you wanted a quick way to skim through the important points. Just like you, thousands of kids around the world did the exact same thing for the exact same reason.

By highlighting certain bits of text, you were effectively "marking-up" the data. Essentially, you specified that certain sentences (data) were important by marking them in yellow. These sentences became encoded with the fact that they were important.

And what's more, since everyone followed the same standard of marking up, you could easily pick up a used textbook and get a good idea, just from reading the highlighted sections, of what the core points of the book were.

There are two crucial points to take away from this example. For markups to transmit useful information about data to a pool of users...

  1. a standard must be in place to define what a valid markup is - In the example above, markup is defined as a bit of yellow ink atop text. In HTML a markup is a tag.
  2. a standard must be in place to define what markup means - In the example above, a yellow highlight means the highlighted text represents an important point. In HTML each tag communicates its own layout or formatting meaning.

Markups are also ubiquitous in the world of computers. They are used by word processors to specify formatting and layout, by communications programs to express the meaning of data sent over the wires, by database applications that must associate meaning and relationships with the data they serve, and by multimedia processing programs which must express meta-data about images or sound.

As data is sent through dumb computers and programs, it is essential that the data carries with it information necessary to communicate what the data means and/or what the receiver should do with that data.

Data with no context is meaningless just as an unhighlighted book is bad news around exam time!

HTML is one of the more famous computer markup systems. HTML defines a set of tags that associate formatting rules with bits of text. Documents which have been marked up (which contain plain text as well as the tags that specify the rules for formatting that text) are read by an HTML processing application (a web browser for example) that knows how to display the text according to the rules.

For example, the <B> tag specifies a rule which instructs an HTML processing application to bold a specific bit of text. Similarly, the <CENTER> tag instructs the HTML processing application to center the text.

Thus <CENTER><B>BOLD</B></CENTER> would be displayed by an HTML processing application as the word BOLD in bold type, centered on the line.


You might imagine a client contact list which could look like the following bit of HTML code:

<UL>
<LI>Gunther Birznieks
  <UL>
  <LI>Client ID: 001
  <LI>Company: Bob's Fish Store
  <LI>Email: gunther@bobsfishstore.com
  <LI>Phone: 662-9999
  <LI>Street Address: 1234 4th St.
  <LI>City: New York
  <LI>State: New York
  <LI>Zip: 10024
  </UL>
<LI>Susan Czigonu
  <UL>
  <LI>Client ID: 002
  <LI>Company: Netscape
  <LI>Email: susan@eudora.org
  <LI>Phone: 555-1234
  <LI>Street Address: 9876 Hazen Blvd.
  <LI>City: San Jose
  <LI>State: California
  <LI>Zip: 90034
  </UL>
</UL>

The above HTML-encoded data would be displayed by an HTML processing application as:

  • Gunther Birznieks
    • Client ID: 001
    • Company: Bob's Fish Store
    • Email: gunther@bobsfishstore.com
    • Phone: 662-9999
    • Street Address: 1234 4th St.
    • City: New York
    • State: New York
    • Zip: 10024
  • Susan Czigonu
    • Client ID: 002
    • Company: Netscape
    • Email: susan@eudora.org
    • Phone: 555-1234
    • Street Address: 9876 Hazen Blvd.
    • City: San Jose
    • State: California
    • Zip: 90034

Is HTML a Programming Language?
Actually, though HTML is often called a programming language, it really is not. Programming languages are 'Turing-complete', or 'computable'. That is, programming languages can be used to compute something such as the square root of pi or some other such task. Typically programming languages use conditional branches and loops and operate on data contained in abstract data structures. HTML is much easier than all of that. HTML is simply a 'markup language' used to define a logical structure rather than compute anything. It is sort of a semantic issue, but it is one which you should be aware of.

The language itself is fairly simple and follows a few important standards.

Firstly, document description is defined by "HTML tags" that are instructions embedded within a less-than (<) and a greater-than (>) sign. To begin formatting, you specify a format type within the < and the >. Most tags in HTML are ended with a similar tag with a slash in it to specify an end to the formatting. For example, to emphasize some text, you would use the following HTML code:

this text is not emphasized
<EM>this text is emphasized</EM>
this text is not emphasized

It is important to note that the formatting codes within an HTML tag are case-insensitive. Thus, the following two versions of the emphasis tag would both be understood by a web browser:

<em>this text is emphasized</em>
this text is not
<EM>this text is emphasized</EM>

You can also compound formatting styles together in HTML. However, you should be very careful to "nest" your code correctly. For example, the following HTML code shows correct and incorrect nesting:

Correct:
<CENTER><EM>this text is emphasized and centered</EM></CENTER>

Incorrect:
<EM><CENTER>this text is emphasized and centered</EM></CENTER>

In the incorrect version, notice that the emphasis tag was closed before the center tag, even though the emphasis tag was opened first. The general rule is that tags on the inside should be closed before tags on the outside.
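The nesting rule is really a "last opened, first closed" rule, which is easy to check with a stack. The Python below is a rough sketch, not a real HTML validator: `is_properly_nested` is a made-up helper, and it assumes every tag has a closing partner, which is not true of tags like <BR> or of the unclosed <LI> tags used elsewhere in this tutorial.

```python
import re

# Matches an opening or closing tag and captures the slash and the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def is_properly_nested(html: str) -> bool:
    """Check that every closing tag matches the most recently opened tag."""
    stack = []
    for closing, name in TAG.findall(html):
        name = name.upper()   # tag names are case-insensitive
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack          # anything left on the stack was never closed

print(is_properly_nested("<CENTER><EM>some text</EM></CENTER>"))   # True
print(is_properly_nested("<EM><CENTER>some text</EM></CENTER>"))   # False
```

In the second call, the stack holds EM and then CENTER, so the first closing tag the checker expects is </CENTER>. It finds </EM> instead and reports the bad nesting.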

HTML tags can not only define a formatting option; they can also define attributes for that option. To do so, you specify an attribute and an attribute value within the HTML tag. For example, the following tag creates a heading style aligned to the left:

<H2 ALIGN = "LEFT">this text has a heading
level two style and is 
aligned to the left </H2>

There are a few things to note about attributes however. First, it is not necessary to enclose attribute values within quotes unless white space is included in the value. Secondly, it is not necessary to have a space before or after the equal sign that matches an attribute to its value. Finally, when you close an HTML tag with an attribute, you should not include attribute information in the closing.
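One way to convince yourself of the first two points is to feed both spellings of a tag to an HTML parser and compare what it reports. The sketch below uses the `HTMLParser` class from Python's standard library; the `StartTagCollector` class and `start_tags` function are names invented for this example.

```python
from html.parser import HTMLParser

class StartTagCollector(HTMLParser):
    """Record each start tag with its attributes as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))

def start_tags(html):
    """Return a list of (tag, attributes) pairs found in the HTML text."""
    parser = StartTagCollector()
    parser.feed(html)
    parser.close()
    return parser.seen

quoted = start_tags('<H2 ALIGN = "LEFT">heading</H2>')
bare = start_tags("<H2 ALIGN=LEFT>heading</H2>")
print(quoted)
print(bare)
```

Both calls should report the same tag and the same attribute, because the quotes and the spaces around the equal sign are part of the syntax rather than part of the value. (This particular parser also lowercases tag and attribute names for you.)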

Finally, you should know that web browsers do not care about white space that you use in your HTML document. For example, the following two bits of HTML will be displayed the exact same way:

This is some text that is displayed
as you would expect

This     is  some     text
that is displayed in a way
you would not expect:
exactly the same as the above
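A rough model of what the browser is doing with ordinary text can be written in one line of Python. The `collapse_whitespace` helper is hypothetical, and it only approximates browser behavior for normal text (it ignores special cases such as preformatted <PRE> blocks).

```python
def collapse_whitespace(text: str) -> str:
    """Treat any run of spaces, tabs, or line breaks as a single space."""
    return " ".join(text.split())

first = "This is some text\nthat wraps"
second = "This     is  some     text\n\n   that wraps"

print(collapse_whitespace(first))    # This is some text that wraps
print(collapse_whitespace(second))   # This is some text that wraps
```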

Like HTML, XML (also known as Extensible Markup Language) is a markup language which relies on the concept of rule-specifying tags and the use of a tag-processing application that knows how to deal with the tags.

"The correct title of this specification, and the correct full name of XML, is "Extensible Markup Language". "eXtensible Markup Language" is just a spelling error. However, the abbreviation "XML" is not only correct but, appearing as it does in the title of the specification, an official name of the Extensible Markup Language.

The name and abbreviation were invented by James Clark; other options under consideration had included MGML, (Minimal Generalized Markup Language), MAGMA (Minimal Architecture For Generalized Markup Applications), and SLIM (Structured Language for Internet Markup)" - Extensible Markup Language (XML) 1.0 Specs, The Annotated Version.

However, XML is far more powerful than HTML.

This is because of the "X". XML is "eXtensible". Specifically, rather than providing a set of pre-defined tags, as in the case of HTML, XML specifies the standards with which you can define your own markup languages with their own sets of tags. XML is a meta-markup language which allows you to define an infinite number of markup languages based upon the standards defined by XML.

"The design goals for XML are:
  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance."
- Extensible Markup Language (XML) 1.0 Specs, The Annotated Version.

Let's consider a very simple example. Let's create a new markup language called SCLML (Selena's Client List Markup Language). This language will define tags to represent contact people and information about contact people.

The set of tags will be simple. However, they will be expressive. Unlike <UL> and <LI>, these tags can be immediately understood just by reading the document.

<NAME>Gunther Birznieks</NAME>
<COMPANY>Bob's Fish Store</COMPANY>
<STREET>1234 4th St.</STREET>
<CITY>New York</CITY>
<ZIP>Zip: 10024</ZIP>

<NAME>Susan Czigonu</NAME>
<STREET>9876 Hazen Blvd.</STREET>
<CITY>San Jose</CITY>

Note that the use of XML is not limited to text markup. The very extensibility of XML means that it could just as easily be applied to sound markup or image markup. A tag such as <EMPHASIZE> might be displayed textually as being bold but audibly as a louder voice!

What you see above is a very simple "XML document". As you can see, it looks pretty similar to an HTML document.
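To get a feel for how a program might read such a document, here is a minimal sketch using the ElementTree module from Python's standard library. The <CONTACTS> root element is an assumption added for this example, since a well-formed XML document needs exactly one root, and each person is wrapped in a <CONTACT> element.

```python
import xml.etree.ElementTree as ET

# The <CONTACTS> root is an assumption added here; a well-formed XML
# document needs exactly one root element.
SCLML = """
<CONTACTS>
  <CONTACT>
    <NAME>Gunther Birznieks</NAME>
    <COMPANY>Bob's Fish Store</COMPANY>
    <CITY>New York</CITY>
  </CONTACT>
  <CONTACT>
    <NAME>Susan Czigonu</NAME>
    <CITY>San Jose</CITY>
  </CONTACT>
</CONTACTS>
"""

root = ET.fromstring(SCLML)

# Turn each contact into a dictionary of tag name -> text.
contacts = [
    {field.tag: field.text for field in contact}
    for contact in root.findall("CONTACT")
]

print(contacts[0]["NAME"])   # Gunther Birznieks
print(contacts[1]["CITY"])   # San Jose
```

Because the tags describe the data, the program gets field names for free instead of having to guess them from the text.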

But don't forget, as we said before, it is not enough to simply encode (markup) the data. For the data to be decoded by someone or something else, the encoding markup languages must follow standard rules including:

  1. The syntax for marking up
  2. The meaning behind the markup

In other words, a processing application must know what a valid markup is (perhaps a tag) and what to do with it if it is valid. After all, how would Netscape know what to do with the above document? What in the world is a <PHONE> tag? Is it a legal tag? How should it be displayed? Our markup language must somehow communicate the syntax of the markup so that the processing application will know what to do with it.

In XML, the definition of a valid markup is handled by a Document Type Definition (DTD) which communicates the structure of the markup language. The DTD specifies what it means to be a valid tag (the syntax for marking up).

We'll discuss the details of DTDs later. For now, just get comfortable with the idea of a DTD as a separate component to the equation.

Yet we must also communicate the meaning of the markup as well as the syntax.

To specify what valid tags mean, XML documents are also associated with "style sheets" which provide GUI instructions for a processing application like a web browser. A style sheet, the details of which we will discuss later, might specify display instructions such as:

  1. Anytime you see a <CONTACT> tag, display it using a <UL> tag. Similarly, </CONTACT> tags should be converted to </UL> tags.
  2. All <NAME> tags should be replaced with <LI> tags, and </NAME> tags should be ignored.
  3. All <EMAIL> tags should be replaced with <LI> tags, and </EMAIL> tags should be ignored.

In this example, the style sheet utilizes the functionality of HTML to define the formatting of SCLML. But if the XML document was being processed by a program other than a web browser, the HTML translation step might be bypassed.

Processing applications combine the logic of the style sheet, the DTD, and the data of the SCLML document and display it according to the rules and the data.
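The kind of translation described by the three rules above can be sketched in Python. This is a toy stand-in for a real style sheet language: the `STYLE` table and the `to_html` function are invented for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Toy "style sheet" in the spirit of the rules above:
# SCLML tag -> (HTML opening tag, HTML closing tag).
# An empty closing string means the closing tag is simply dropped, as with </NAME>.
STYLE = {
    "CONTACT": ("<UL>", "</UL>"),
    "NAME": ("<LI>", ""),
    "EMAIL": ("<LI>", ""),
}

def to_html(element):
    """Recursively rewrite an SCLML element and its children as HTML."""
    open_tag, close_tag = STYLE.get(element.tag, ("", ""))
    text = escape((element.text or "").strip())
    children = "".join(to_html(child) for child in element)
    return f"{open_tag}{text}{children}{close_tag}"

contact = ET.fromstring(
    "<CONTACT><NAME>Gunther Birznieks</NAME>"
    "<EMAIL>gunther@bobsfishstore.com</EMAIL></CONTACT>"
)
page = to_html(contact)
print(page)   # <UL><LI>Gunther Birznieks<LI>gunther@bobsfishstore.com</UL>
```

Swapping in a different `STYLE` table would change the display without touching the data at all, which is the whole point of keeping the two apart.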

But wait, isn't this quite complex? Now instead of a single HTML document which defines the data and the rules to display the data, we have an SCLML document, a DTD, AND a style sheet. That's three pieces as opposed to just one.

Further, we need a processing agent that can do the work of putting the DTD, style sheet, and SCLML document together. Remember, web browsers are made to read a specific markup language (like HTML), not any markup language. That means we have three documents to pull together plus one processing program to write or buy. What a mess.

"A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application." - Extensible Markup Language (XML) 1.0 Specs, The Annotated Version.

Actually however, though there are a few more hurdles to jump in order to use XML, there are several reasons why all this is worth it. Let's take a look at them. . . .

Advantages of XML: Breaking the Tag Monopoly  
The first benefit of XML is that because you are writing your own markup language, you are not restricted to a limited set of tags defined by proprietary vendors.

Rather than waiting for standards bodies to adopt tag set enhancements (a process which can take quite some time), or for browser companies to adopt each other's standards (yeah right!), with XML, you can create your own set of tags at your own pace.

Of course, not only are you free to develop at your own pace, but you are free to develop tools that meet your needs exactly.

By defining your own tags, you create the markup language in terms of your specific problem set! Rather than relying on a generic set of tags which suits everyone's needs adequately, XML allows every person/organization to build their own tag library which suits their needs perfectly.

"From the earliest days of the Web, we've been using essentially the same set of tags in our documents....There's a significant benefit to a fixed tag set with fixed semantics: portability. However, HTML is very confining. Web designers want more control over presentation. Enter XML" - Norman Walsh

That is, though the majority of web designers do not need tags to format musical notation, medical formulas, or architectural specifications, musicians, doctors, and architects might.

XML allows each specific industry to develop its own tag sets to meet its unique needs without forcing everyone's browser to incorporate the functionality of zillions of tag sets, and without forcing the developers to settle for a generic tag set that is too generic to be useful.

Advantages of XML: Moving Beyond Format  
However cool the idea of escaping the limitations of a basic tag set (like HTML) sounds, it isn't even close to the best thing about XML.

The real power of XML comes from the fact that with XML, not only can you define your own set of tags, but the rules specified by those tags need not be limited to formatting rules. XML allows you to define all sorts of tags with all sorts of rules, such as tags representing business rules or tags representing data description or data relationships.

Consider again the case of the contact list in SCLML. Using standard HTML, a developer might use something like the following:

<UL>
<LI>Gunther Birznieks
  <UL>
  <LI>Client ID: 001
  <LI>Company: Bob's Fish Store
  <LI>Email: gunther@bobsfishstore.com
  <LI>Phone: 662-9999
  <LI>Street Address: 1234 4th St.
  <LI>City: New York
  <LI>State: New York
  <LI>Zip: 10024
  </UL>
<LI>Susan Czigonu
  <UL>
  <LI>Client ID: 002
  <LI>Company: Netscape
  <LI>Email: susan@eudora.org
  <LI>Phone: 555-1234
  <LI>Street Address: 9876 Hazen Blvd.
  <LI>City: San Jose
  <LI>State: California
  <LI>Zip: 90034
  </UL>
</UL>

While this may be an acceptable way to store and display your data, it is hardly the most efficient or powerful. As you are probably aware, there are many potential problems associated with marking up your data using HTML. Three particularly serious problems come to mind:

  1. The GUI is embedded in the data. What happens if you decide that you like a table-based presentation better than a list-based presentation? In order to change to a table-based presentation, you must recode all your HTML! This could mean editing many pages.
  2. Searching for information in the data is tough. How would you get a quick list of only the clients in California? Certainly, some type of script would be necessary. But how would that script work? It would probably have to search through the file word for word looking for the string "California". And even if it found matches, it would have no way of knowing that California might have a relationship to "New York" - that they are both states. Forget about the relationships between pieces of data which are crucial to power searching.
  3. The data is tied to the logic and language of HTML. What happens if you want to present your data in a Java applet? Well, unfortunately, your Java applet would have to parse through the HTML document, stripping out tags and reformatting the data. Non-HTML processing applications should not be burdened with extraneous work.
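
The search problem in item 2 can be sketched in a few lines of Python. This is only an illustration, not part of the original tutorial: the `naive_search` function is a hypothetical name, and the sample is an abridged copy of the HTML list above.

```python
# Abridged copy of the HTML contact list shown above (illustrative sample).
HTML_LIST = """\
<LI>Gunther Birznieks
<LI>Client ID: 001
<LI>Company: Bob's Fish Store
<LI>City: New York
<LI>State: New York
<LI>Susan Czigonu
<LI>Client ID: 002
<LI>Company: Netscape
<LI>City: San Jose
<LI>State: California
"""

def naive_search(text, needle):
    """Return every line containing the needle (hypothetical helper).

    This is all a plain text search can do with HTML-marked-up data.
    """
    return [line for line in text.splitlines() if needle in line]

california_hits = naive_search(HTML_LIST, "California")
new_york_hits = naive_search(HTML_LIST, "New York")

print(california_hits)
print(new_york_hits)
```

The match for "California" is a bare line of text with nothing tying it to Susan Czigonu, and the search for "New York" returns both a city line and a state line because the markup cannot tell them apart.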

With XML, these problems and similar problems are solved. In XML, the same page would look like the following:

<ID>001</ID>
<NAME>Gunther Birznieks</NAME>
<COMPANY>Bob's Fish Store</COMPANY>
<STREET>1234 4th St.</STREET>
<CITY>New York</CITY>
<ZIP>10024</ZIP>

<ID>002</ID>
<NAME>Susan Czigonu</NAME>
<STREET>9876 Hazen Blvd.</STREET>
<CITY>San Jose</CITY>

As you can see, custom tags are used to bring meaning to the data being displayed. When stored this way, data becomes extremely portable because it carries with it its description rather than its display. Display is "extracted" from the data and, as we will see later, incorporated into a "style sheet".

Let's consider some of the benefits.

  1. With XML, the GUI is extracted. Thus, changes to display do not require futzing with the data. Instead, a separate style sheet will specify a table display or a list display.
  2. Searching the data is easy and efficient. Search engines can simply parse the description-bearing tags rather than muddling in the data. Tags provide the search engines with the intelligence they lack.
  3. Complex relationships like trees and inheritance can be communicated.
  4. The code is much more legible to a person coming into the environment with no prior knowledge. With a tag such as <ID>002</ID>, it is obvious that the value represents an ID, whereas <LI>002 says nothing about what the value means. XML is self-describing.
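
As a rough sketch of the search benefit in item 2, here is how a script might query description-bearing tags using Python's standard `xml.etree.ElementTree` module. The `<CONTACTS>` and `<CONTACT>` wrapper elements, the `<STATE>` tag, and the function names are assumptions made for this illustration; they are not defined by the tutorial.

```python
import xml.etree.ElementTree as ET

# Assumptions (not from the tutorial): the <CONTACTS> and <CONTACT> wrapper
# elements and the <STATE> tag are hypothetical names chosen for this sketch.
CONTACTS_XML = """\
<CONTACTS>
  <CONTACT>
    <ID>001</ID>
    <NAME>Gunther Birznieks</NAME>
    <CITY>New York</CITY>
    <STATE>New York</STATE>
  </CONTACT>
  <CONTACT>
    <ID>002</ID>
    <NAME>Susan Czigonu</NAME>
    <CITY>San Jose</CITY>
    <STATE>California</STATE>
  </CONTACT>
</CONTACTS>
"""

def names_in_state(xml_text, state):
    """Return the NAME of every contact whose STATE element equals `state`."""
    root = ET.fromstring(xml_text)
    return [
        contact.findtext("NAME")
        for contact in root.findall("CONTACT")
        if contact.findtext("STATE") == state
    ]

def known_states(xml_text):
    """Return the distinct STATE values, sorted; the tag says what each value is."""
    root = ET.fromstring(xml_text)
    return sorted({state.text for state in root.iter("STATE")})

print(names_in_state(CONTACTS_XML, "California"))
print(known_states(CONTACTS_XML))
```

Because the tags describe the data, the script can ask for contacts by state without confusing the city "New York" with the state "New York", and it can list every value that is a state.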
Disadvantages of XML  
However awesome XML is, there are some drawbacks that have hindered it from gaining widespread use since its inception. Let's look at the biggest drawback: the lack of adequate processing applications.

For one, XML requires a processing application. That is, the nice thing about HTML was that you knew that if you wrote an HTML document, anyone, anywhere in the world, could read your document using Netscape. Well, with XML documents, that is not yet the case. There are no XML browsers on the market yet (although the latest version of IE does a pretty good job of incorporating XSL and XML documents provided HTML is the output).

Thus, XML documents must either be converted into HTML before distribution or converted to HTML on the fly by middleware. Barring translation, developers must code their own processing applications.

The most common tactic used now is to write parsing routines in DHTML, Java, or server-side Perl that parse through an XML document, apply the formatting rules specified by the style sheet, and "convert" it all to HTML.
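
The conversion tactic described above might look something like the following Python sketch. It is not the tutorial's own code: the `<CONTACTS>` and `<CONTACT>` wrapper elements and all function names are hypothetical, and the "style sheet" is reduced to a choice between two formatting functions rather than a real XSL or CSS document.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumption: <CONTACTS> and <CONTACT> are hypothetical wrapper elements.
CONTACTS_XML = """\
<CONTACTS>
  <CONTACT>
    <NAME>Gunther Birznieks</NAME>
    <CITY>New York</CITY>
  </CONTACT>
  <CONTACT>
    <NAME>Susan Czigonu</NAME>
    <CITY>San Jose</CITY>
  </CONTACT>
</CONTACTS>
"""

def as_list(root):
    """Format every field of every contact as an HTML list item."""
    items = [
        f"<LI>{escape(field.tag.title())}: {escape(field.text or '')}</LI>"
        for contact in root.findall("CONTACT")
        for field in contact
    ]
    return "<UL>\n" + "\n".join(items) + "\n</UL>"

def as_table(root):
    """Format each contact as one HTML table row, one cell per field."""
    rows = []
    for contact in root.findall("CONTACT"):
        cells = "".join(f"<TD>{escape(field.text or '')}</TD>" for field in contact)
        rows.append(f"<TR>{cells}</TR>")
    return "<TABLE>\n" + "\n".join(rows) + "\n</TABLE>"

# Stand-in for a style sheet: the display rules live here, not in the data.
FORMATTERS = {"list": as_list, "table": as_table}

def xml_to_html(xml_text, style):
    """Parse the XML, then apply the formatting rules selected by `style`."""
    root = ET.fromstring(xml_text)
    return FORMATTERS[style](root)

list_html = xml_to_html(CONTACTS_XML, "list")
table_html = xml_to_html(CONTACTS_XML, "table")
print(table_html)
```

Switching from a list to a table means passing a different `style` value; the XML data itself is never touched.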

"While it's true that browser support is limited, IE 5 and Netscape 5 are expected to fully support XML. Also, W3C's Amaya browser supports it today, as does the JUMBO browser that was created for the Chemical Markup Language.

XML isn't about display -- it's about structure. This has implications that make the browser question secondary. So the whole issue of what is to be displayed and by what means is intentionally left to other applications. You can target the same XML (with different XSL) for different devices (standard web browser, palm pilot, printer, etc.). You should not get the impression that XML is useless until browsers support it. This is definitely not true -- we are using it at NASA in ways where no browser plays any role." - Ken Sall

However, this takes some magic, and the amount of work necessary even to print "hello world" is sometimes enough to dissuade developers from adopting the technology.

Nevertheless, parsing algorithms and tools continue to improve over time as more and more people see the long-term benefits of migrating their data to XML. The backend part of XML will continue to become simpler and simpler. Already Internet Explorer and Netscape provide a decent number of built-in XML parsing tools.

Style Sheets (XSL, CSS)  
Essentially, style sheets are written instructions explaining how a certain document should be displayed. Style sheets are as old as the printing press and probably older.

Frank Boumphrey, in "Style Sheets for HTML and XML", puts style sheets into historical perspective:

"In the days of manual type-setting, style sheets were nothing more than a set of written instructions from the publisher to the printer telling the printer what kind of style to use when printing up the publisher's manuscript. Traditionally, the editor would deliver a "marked-up" manuscript full of terse notations like dele and stet. The printer would then consult the style sheet of that particular publishing house for a range of specifications. These specifications would encompass such details as what size the pages were to be, what size and family of font to use for chapter titles, sub-headings, body text, and so on. Plus, how much leading to put between the lines, what margins to leave, and whether, and by how much, to indent the paragraphs."

As we have said above, HTML is a markup language that helps define how the web browser should display a given HTML marked up document.

However, we have also said that it is dangerous to embed too much style into your HTML code. What happens when the company decides to change its corporate font? What about the company colors? If you have hardcoded style throughout your website, it is very difficult to change.

This is where style sheets come into play. Style sheets allow you to specify generic styles that apply broadly but are defined in a single place. In other words, a single style sheet can be referenced by every web page on a site (or every component on a page if the style sheet is defined in the page header).

If you want to change an aspect of style such as color, font, spacing, or whatever, you simply change the style sheet and that change is propagated to every page that references the style sheet.

There are several languages/specifications to help you create style sheets for your site, but the two most popular are CSS (Cascading Style Sheets) and XSL (eXtensible Style Sheet Language). Of the two, CSS certainly has the larger market share.

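CSS rules are typically placed inside a <STYLE> tag in the page header or in a separate file that the page references. The rules themselves are declarative; scripting languages like JavaScript or VBScript can then change the styles dynamically. Consider the following small example: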

          <STYLE TYPE="text/css">
          BODY {FONT-FAMILY: "Arial";}
          TD {FONT-FAMILY: "TimesRoman", "SANS-SERIF";}
          </STYLE>
          This is in Arial
          <TABLE><TR><TD>This is in TimesRoman</TD></TR></TABLE>

As you might imagine, BODY text will be displayed in Arial and table cell data will be displayed in Times Roman.
