LESSON 1 - 2 -- SPECIAL CHARACTERS AND TAGS

This lesson doesn't add much to your HTML knowledge directly but you need more detail on what we are talking about before we move on.

The number one thing to do is get some terms straight. You'll notice in Lesson 1 that I made a distinction between naming a tag in caps without the brackets and tags in brackets. The deal is that "tag" is technically a specific term. Anything between angle brackets is mark-up, as I already told you. A single piece of mark-up is a tag. A tag pair, comprising an opening tag and a closing tag, is just a special case of the term "element". When I refer to "<head>" I'm speaking of a particular tag but when I speak of "HEAD" I am referring to the element that happens to include the two tags and whatever they contain.
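To see the difference between tags and elements, here is a minimal sketch in Python (not HTML) that uses the standard library's `html.parser` module to list each piece of mark-up as it goes by. The class name `TagLister`, the helper `list_events`, and the sample "TITLE" text are just made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record each tag and each run of text in the order the parser meets it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("opening tag", tag))

    def handle_endtag(self, tag):
        self.events.append(("closing tag", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def list_events(source):
    """Parse a snippet of HTML and return the recorded events."""
    lister = TagLister()
    lister.feed(source)
    lister.close()
    return lister.events


# One HEAD element: two tags plus everything between them.
events = list_events("<head><title>My Page</title></head>")
for kind, value in events:
    print(kind, value)
```

The "HEAD" element here is everything from the first event to the last one. The "<head>" tag is only the first event.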

What I called an option before -- the 'href="address"' part of an "A" element -- is technically called an attribute. An attribute that has only letters, numbers, the period, the colon, and the underscore in its value -- the part after the "=" -- can be left unquoted. That is a bad idea, though. You don't want to get into that habit because it'll catch up to you the first time a value has a space in it. Closing tags, those that have a slash at the beginning like "</head>", do not ever have attributes. The attributes are found in single tags and opening tags. That way, when an attribute modifies the behavior of an element, the browser finds out what to do with the marked-up text before it has displayed it.
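Here is a rough sketch of why the quotes matter, again using Python's `html.parser` as a stand-in for a browser. A real browser may recover differently, so treat this as one parser's opinion. The names `AttributeReader` and `read_attributes` and the file name "my page.html" are hypothetical.

```python
from html.parser import HTMLParser


class AttributeReader(HTMLParser):
    """Collect the attributes of every opening tag as (name, value) pairs."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        self.attributes.extend(attrs)


def read_attributes(source):
    """Return the attributes found in a snippet of HTML as a dictionary."""
    reader = AttributeReader()
    reader.feed(source)
    reader.close()
    return dict(reader.attributes)


# The same link, with and without quotes around a value that contains a space.
quoted = read_attributes('<a href="my page.html">link</a>')
unquoted = read_attributes('<a href=my page.html>link</a>')

print(repr(quoted["href"]))
print(repr(unquoted["href"]))
```

With the quotes, the whole address survives. Without them, this parser ends the value at the space and treats the rest as a separate attribute with no value.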

Since we are being picky at this stage, you'll notice that it might be kind of hard to put a doublequote into a value. Haha, good eyes! We have made another character special so we need another ampersand entity.

The new entity for the doublequote is "&quot;". This one and the other three are the only ones required by HTML's special character needs. There are a load of others out there and some are handy enough to mention here. These are characters that are available in modern fonts but aren't easily typed on the keyboard.
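If you ever build tags from a program, the four required entities can be filled in for you. This is a small sketch using Python's `html.escape`. The helper name `attribute` and the "title" attribute in the example are assumptions for illustration, not something from these lessons.

```python
from html import escape


def attribute(name, value):
    """Return name="value" with the special characters in value escaped."""
    # quote=True makes escape() turn " into &quot; as well as
    # handling &, < and >.
    return '{}="{}"'.format(name, escape(value, quote=True))


result = attribute("title", 'The "special" characters & tags')
print(result)
```

The doublequotes inside the value come out as "&quot;" and the ampersand as "&amp;", so the only real doublequotes left are the two that wrap the value.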

First there is "&nbsp;" which is a non-breaking space character. This is a space that doesn't count as whitespace. It acts like a letter but displays as blank. It is occasionally useful when you don't want a line break to happen between two words or when you need invisible content in an element that requires non-whitespace.

Another couple that you will see a lot of are "&copy;" and "&reg;" which are the copyright "©" and registered trademark "®" characters. Also, "&trade;" works sometimes as a basic trademark sign "™". You'll just see the spelled out "&trade;" if your browser doesn't support it yet.

A few that you will see a lot in foreign text are "&iexcl;" and "&iquest;" which are the inverted exclamation "¡" and the inverted question "¿". You'll also see "&szlig;" or "&THORN;" and other non-English letters expressed this way sometimes, like this SZ ligature "ß" or the Latin capital Thorn "Þ". Note that entity names -- unlike HTML tags -- are case sensitive! For example, "&euml;" and "&Euml;" are different; see the umlauted e in lowercase "ë" and uppercase "Ë". One last handy pair are "&oelig;" and "&OElig;". These Latin OE ligatures in lower "œ" and upper "Œ" are nice once in a while. Like "&trade;" they don't work in Netscape yet though.
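To double check which character a named entity stands for, you can decode it outside the browser. This sketch uses Python's `html.unescape` and prints each character's code number rather than the character itself, so it works on any terminal. The variable names are made up.

```python
from html import unescape

# The named entities mentioned in this lesson.
samples = [
    "&nbsp;", "&copy;", "&reg;", "&trade;", "&iexcl;", "&iquest;",
    "&szlig;", "&THORN;", "&euml;", "&Euml;", "&oelig;", "&OElig;",
]

# Decode each entity into the single character it stands for.
decoded = {name: unescape(name) for name in samples}

for name, char in decoded.items():
    print(name, "U+{:04X}".format(ord(char)))

# Entity names are case sensitive: these two are different letters.
print(decoded["&euml;"] == decoded["&Euml;"])
```

The last line prints False, because "&euml;" and "&Euml;" decode to two different characters.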

Another note on case sensitivity: the attribute names of elements aren't sensitive but their values CAN be. The URL addresses in the element "A"'s "HREF=" attribute are case sensitive after the third slash "/". The "http://" part of a URL isn't sensitive, the domain name -- mine is "www.hostile.org" -- isn't sensitive, but the document name after that -- this document is "/tut/lesson1-2.html" -- USUALLY is. Some operating systems don't care about filename case; MS-DOS and Windows don't, but UNIX does. Just assume that the name is case sensitive and be careful to reproduce it exactly.
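The rule about where case matters in a URL can be sketched with Python's `urllib.parse`. The function name `same_address` is hypothetical, and this simplified comparison ignores ports, queries, and anything else beyond the three parts discussed above.

```python
from urllib.parse import urlsplit


def same_address(url_a, url_b):
    """Compare two URLs, ignoring case only in the scheme and domain name."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (
        a.scheme == b.scheme          # urlsplit() lowercases the scheme
        and a.hostname == b.hostname  # .hostname is lowercased as well
        and a.path == b.path          # the path is compared exactly
    )


# Case differs only before the third slash: treated as the same address.
print(same_address("HTTP://WWW.HOSTILE.ORG/tut/lesson1-2.html",
                   "http://www.hostile.org/tut/lesson1-2.html"))

# Case differs in the document name: assume it is a different address.
print(same_address("http://www.hostile.org/tut/Lesson1-2.html",
                   "http://www.hostile.org/tut/lesson1-2.html"))
```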

Now, for being good little boys and girls, a new HTML element so that you feel like you learned something here! The new tag is the "<!--" tag and its closing tag the "-->". Now you are probably mad at me and saying "Hey, those aren't tags! They don't have BOTH angle brackets in them!"

Well, you are kinda right. In a way this is one big special element called the "COMMENT" element. This special form allows it to encompass almost any text INCLUDING text with other HTML mark-up in it! Anything between the two tags will not be rendered by the browser. Nice for sticking in notes to yourself about what you still need to work on in the file. I'll start using it in the source of these documents so that you can get extra notes from view source!

The only thing to warn you about is that the standard allows for white space between the last two hyphens "--" and the closing angle bracket. The standard recommends that you avoid stringing hyphens together in comments to prevent "problems". This is a sucky legacy of SGML and most people have no idea this trap is waiting to accidentally expose comments in some browsers or cause some test scripts to fail with non-existent errors.
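Here is a small sketch of a comment hiding mark-up, using Python's `html.parser` in place of a browser. The class `CommentFinder`, the helper `scan`, and the sample text are made up for illustration. The comment deliberately has no extra hyphens inside it.

```python
from html.parser import HTMLParser


class CommentFinder(HTMLParser):
    """Keep the comments and the opening tags found in a document apart."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.tags = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def scan(source):
    """Return (comments, opening tag names) for a snippet of HTML."""
    finder = CommentFinder()
    finder.feed(source)
    finder.close()
    return finder.comments, finder.tags


source = ("<p>Visible text</p>"
          "<!-- TODO: add a <b>bold</b> warning here -->"
          "<p>More text</p>")
comments, tags = scan(source)
print(comments)
print(tags)
```

The "<b>" inside the comment never shows up in the list of tags. The parser hands it over as part of the comment text instead.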

MORE DETAIL!

We're going to define a number of things with different character sets coming up soon, so right now would be a good time to make that easier. Keep in mind this ISN'T HTML but it is in common usage in programming. In these documents (and pretty much all over the web) we are going to use [] notation to mark character sets. Later on I'll put them in a tag that will make them stand out better too. Basically the list of characters above would have been written [a-zA-Z0-9:._]. Notice that the "-" is special in character sets and so is the "]". As a rule a hyphen between two characters means the whole range but a hyphen at the end or beginning is just a hyphen. A "]" as the first character is just a "]" since an empty set is meaningless. Also, we are going to need one more special character. If the set has a "^" as the first character it means "NOT". In other words a set of [^!#@<>$%^&*-] means it can be any character BUT those. If you are wondering how to specify JUST a "^" then you should roll your eyes at yourself since you have noticed this sentence answers the question; single characters don't need fancy set notation, I can just put them in quotes. =)
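The [] notation is the same one that regular expressions use, so the sets above can be tried out directly. This is a sketch with Python's `re` module. The names `SAFE_UNQUOTED`, `NOT_LISTED`, `BRACKET_OR_X`, and `can_skip_quotes` are made up for illustration.

```python
import re

# One or more characters from the "can be left unquoted" set.
SAFE_UNQUOTED = re.compile(r"[a-zA-Z0-9:._]+")

# A leading "^" means NOT: any single character except the ones listed.
# The "^" in the middle and the "-" at the end are just themselves.
NOT_LISTED = re.compile(r"[^!#@<>$%^&*-]")

# A "]" as the first character is just a "]".
BRACKET_OR_X = re.compile(r"[]x]")


def can_skip_quotes(value):
    """True if every character of value is in the unquoted-safe set."""
    return SAFE_UNQUOTED.fullmatch(value) is not None


print(can_skip_quotes("page.html"))       # only letters and a period
print(can_skip_quotes("my page.html"))    # the space is not in the set
print(NOT_LISTED.fullmatch("a") is not None)
print(NOT_LISTED.fullmatch("^") is not None)
print(BRACKET_OR_X.fullmatch("]") is not None)
```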

Now I need to define another bit of weirdness. There are characters that I need to use in many sets that aren't normally visible. The whitespace set is [ \t\n\r\f&#x200B;]. The "\t" means a tab -- a tab is actually CTRL-I --, the "\f" is the form-feed character, the "&#x200B;" is a weird zero-width whitespace thingy, "\n" is a new line, and the "\r" is a return. Surprise, you are about to learn a dirty little computer secret! On MS-DOS and Windows, text documents have a two character sequence at the end of each line, "\r\n", while UNIX boxes only use the "\n" and Macintoshes only use the "\r"! Every geek in the world knows it was a mistake but there is nothing anyone can do now. Part of HTML is designed to fix this. They made sure that HTML eats all the returns as whitespace so that if you put a Windows text file on a UNIX box or a Macintosh file on a Windows box and read it in a browser, it won't matter.
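Here is a rough sketch of the line-ending mess and of the way whitespace gets eaten, written with Python's `re` module. It assumes the Windows sequence is a return followed by a new line ("\r\n"). The function names are hypothetical and a real browser's whitespace handling has more rules than this.

```python
import re


def normalize_line_endings(text):
    """Turn Windows (\\r\\n) and old Macintosh (\\r) line ends into \\n."""
    return re.sub(r"\r\n|\r", "\n", text)


def collapse_whitespace(text):
    """Squeeze any run of whitespace characters into one space."""
    # The set is: space, tab, new line, return, form feed, zero-width space.
    return re.sub(r"[ \t\n\r\f\u200b]+", " ", text)


windows_text = "one\r\ntwo"
unix_text = "one\ntwo"
mac_text = "one\rtwo"

# All three files end up identical once the line endings are normalized.
print(repr(normalize_line_endings(windows_text)))
print(repr(normalize_line_endings(mac_text)))

# With every run of whitespace collapsed, the line ending no longer matters.
print(collapse_whitespace(windows_text) == collapse_whitespace(unix_text))
```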

REVIEW SUMMARY

* Tag means an individual piece of mark-up.

* Element means a complete mark-up set. It may be one tag, two tags, or two tags and the data they contain.

* I put element names in doublequotes and all caps.

* I put tags in lowercase, they aren't case sensitive so do as you like.

* Attributes are the parts of a tag that modify an element's behavior.

* Attributes are only found in the first tag if there are two in an element.

* An attribute has an attribute name and a value.

* Values must be in quotes, well not MUST but at least SHOULD!

* Entities are the characters we create with the "&wordhere;" trick.

* There are scads of characters, some can only be expressed as a number.

* Numbered characters can be made with &#123; or &#xA6; ("{", "¦")

* Plenty of new named characters like "&quot;" -- '"' are out there.

* We REALLY needed "&quot;" since we made doublequote special in values.

* "&nbsp;" non-breaking space " "

* "&copy;" copyright "©"

* "&reg;" registered trademark "®"

* "&trade;" trademark "™"

* "&iexcl;" inverted exclamation "¡"

* "&iquest;" inverted question "¿"

* "&szlig;" SZ ligature "ß"

* "&THORN;" Latin capital Thorn "Þ"

* "&oelig;" lowercase oe ligature "œ"

* "&OElig;" uppercase OE ligature "Œ"

* "&euml;" umlauted lowercase e "ë"

* "&Euml;" umlauted uppercase E "Ë"

* URLs are webpage addresses.

* URLs are case sensitive after the domain name because most operating systems are.

* "COMMENT" elements have weird opening "<!--" and closing "-->" tags.

* "COMMENT" elements can even hide other HTML tags from browsers.

* More than one hyphen "-" in a row is dangerous in comments.

* SGML still sucks.

* Sometimes, when talking about what characters are allowed in certain cases, we'll use a character set like [a-zA-Z].

* The whitespace set is [ \t\n\r\f&#x200B;] (the "\t" is a tab).

* Text files on UNIX, Windows, and Macintoshes are all screwed up.

* HTML's whitespace rules save you from worrying about weird text formats.

* [^0-9a-zA-Z] means any character BUT the numbers and letters.

* Considering we only learned one new HTML element, and it's normally invisible, we sure did cover a lot.

* The next Lesson has lots of tags!

Now, head back to the Main tutorial page, back to Lesson 1, or move on to Lesson 3.