An HTML document is a file containing HyperText Markup Language, and its filename most often ends in the .html extension. An HTML document is a text document that a web browser reads and then renders on the screen. To extract data from webpages in Octoparse more effectively, it helps to know a little about HTML documents. Once you understand the main structure and the inner content of an HTML document, it will be easier to grab more data from webpages.
Main structure
So, what does an HTML document look like? You can check the HTML source code of any webpage in any web browser. Just open your browser and navigate to any webpage.
Right-click the page and choose “View Page Source”, or simply press CTRL + U on your keyboard, to see the source code of the page.
The simplest HTML document contains only the basic structure described below. You can add more styles and information to enrich the HTML document.
A complete HTML document contains at least the <html>, <head>, <title> and <body> tags. They are all paired tags. So what do these tags mean?
1. <!DOCTYPE html> This is a declaration that tells the browser which version of HTML the document uses. It must be placed before the <html> tag.
2. The <html> element is the root of an HTML document. Every HTML document, whether the website is dynamic or static, starts with <html> and ends with </html>. All the elements that make up the content of a web page are placed between the <html> tag and the </html> tag.
3. The <head> tag, which comes right after the <html> tag, holds information that describes the document, such as its title. Its content is not displayed on the web page. The <title> tag must be placed within the <head> tag and defines the title of the HTML document. The content of the <title> tag is shown in the browser's title bar or tab.
4. The <body> tag, which comes after the closing </head> tag, represents the main content of an HTML document, such as text, images, hyperlinks, tables and lists. The content between the opening <body> tag and the closing </body> tag is what the browser presents to users.
5. The HTML document ends with the </html> tag.
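The structure described in the five points above can be checked with a short script. This is only a minimal sketch, assuming Python 3 and its standard html.parser module. The sample page in MINIMAL_PAGE and the names OutlineParser and outline_of are made up for illustration and are not part of Octoparse.

```python
from html.parser import HTMLParser

# Hypothetical sample page: an assumption of what a minimal document looks like.
MINIMAL_PAGE = """<!DOCTYPE html>
<html>
<head>
<title>My first page</title>
</head>
<body>
<p>Hello, world!</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Records the doctype and every opening tag with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def outline_of(html_text):
    """Returns (doctype, outline) for a piece of HTML text."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.doctype, parser.outline


doctype, outline = outline_of(MINIMAL_PAGE)
for depth, tag in outline:
    # Indent each tag by its nesting level to make the structure visible.
    print("  " * depth + "<" + tag + ">")
```

The indentation of the printed tags mirrors the nesting: <head> and <body> sit inside <html>, <title> sits inside <head>, and <p> sits inside <body>. The sketch assumes every opening tag has a matching closing tag, which holds for the sample page but not for every real webpage.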
Concepts
Tag
A tag begins with (<) and ends with (>), such as <html>.
The content of an element is placed between a <> tag and a </> tag. The first tag <> is called the opening tag and the second one </> is called the closing tag. In the closing tag, a slash is placed before the tag name. For example,
<p> This book is for you. </p>
Here <p> is the opening tag and </p> is the closing tag.
Element
The content of an HTML document consists of all kinds of elements. An element is all the code from the opening tag to the closing tag, including both tags. Elements are marked up using opening tags and closing tags.
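To see how one element breaks down into an opening tag, its content and a closing tag, the paragraph example from the Tag section can be fed to a parser. This is a rough sketch, assuming Python 3 and its standard html.parser module. The class name EventLogger is invented for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Logs each piece of an element in the order the parser meets it."""

    def __init__(self):
        super().__init__()
        self.events = []  # list of (kind, value) pairs

    def handle_starttag(self, tag, attrs):
        self.events.append(("opening tag", tag))

    def handle_data(self, data):
        # Surrounding spaces are trimmed so only the text itself is kept.
        self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("closing tag", tag))


logger = EventLogger()
logger.feed("<p> This book is for you. </p>")
logger.close()
for kind, value in logger.events:
    print(kind + ": " + value)
```

The three logged pieces together make up one <p> element: the opening tag, the text between the tags, and the closing tag.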
Attribute
HTML attributes can be used to modify HTML elements or to provide additional information about them. They are placed inside the opening tag of an HTML element. An HTML attribute has two parts: a name and a value.
The format is name="value". For example:
<html lang="en-US">
Here, inside the <html> tag, lang is the name of the attribute, which declares the language, and en-US is the value of the attribute, which specifies which language it is.
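The name and value parts of an attribute can be read back from an opening tag with a few lines of code. This is a minimal sketch, assuming Python 3 and its standard html.parser module. The class name AttributeReader is hypothetical.

```python
from html.parser import HTMLParser


class AttributeReader(HTMLParser):
    """Stores the attributes of each opening tag as a name -> value dict."""

    def __init__(self):
        super().__init__()
        self.attributes = {}  # tag name -> {attribute name: value}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.attributes[tag] = dict(attrs)


reader = AttributeReader()
reader.feed('<html lang="en-US">')
reader.close()
print(reader.attributes["html"]["lang"])
```

Note that the sketch keeps only the last occurrence of each tag name, which is enough for a single tag but not for a whole page.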
Basic HTML tags
<a> </a>: The anchor tag. This tag is mainly used for defining a hyperlink. Without the href attribute, it would just be a placeholder.
<h1> </h1>: The heading tag. This tag indicates a heading, that is, the title of a section of the document.
There are six heading tags, from <h1> to <h6>, and they serve to divide the page into sections. <h1> is the most important heading, and the other heading tags have progressively less importance on the page.
<p> </p>: The paragraph tag. It’s used to define a paragraph. The <br/> tag can be used to create a line break within a paragraph.
<div> </div>: The division tag. It’s used to define a division/section of an HTML document. You can use this tag to divide an HTML document into several structured sections/divisions.
<li> </li>: The list item tag. It can be used in unordered lists (<ul>) and ordered lists (<ol>). Description lists (<dl>) use the <dt> and <dd> tags instead.
<ul> </ul>: The unordered list tag. It’s used to create an unordered list that contains one or more <li> tags. Each list item inside the <ul> tag is displayed with a bullet.
<input>: The input tag. As the name suggests, users can enter data in the input field. It can be a text field, a check box or a button. It doesn’t have an end tag.
<img>: The image tag. It’s used to insert an image in an HTML document and needs two attributes. One is the ‘src’ attribute, which specifies the URL of the image, and the other is the ‘alt’ (alternate) attribute, which provides alternate text in place of the image when the image is not displayed. Like <input>, it doesn’t have an end tag.
<option> </option>: The option tag. It represents an option item within a list of choices, such as a drop-down list (<select>).
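Knowing these tags is what makes it possible to pick data out of a page. As an illustration, the sketch below collects the text of every <li> in a list, together with the href of a link inside it if there is one. It assumes Python 3 and its standard html.parser module. The sample markup, its URLs and the class name ListItemParser are all made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup; the link targets are placeholders.
SAMPLE_LIST = """
<ul>
  <li><a href="/pricing">Pricing</a></li>
  <li><a href="/blog">Blog</a></li>
  <li>Contact us</li>
</ul>
"""


class ListItemParser(HTMLParser):
    """Collects (text, href) for each <li>; href is None without a link."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new list item starts: reset what was collected so far.
            self._in_item = True
            self._text = []
            self._href = None
        elif tag == "a" and self._in_item:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_item:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.items.append(("".join(self._text).strip(), self._href))
            self._in_item = False


list_parser = ListItemParser()
list_parser.feed(SAMPLE_LIST)
list_parser.close()
for text, href in list_parser.items:
    print(text, href)
```

The sketch handles flat lists only. A list nested inside another list item would need extra bookkeeping.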
A simple HTML table consists of the <table> tag and other related tags such as <th>, <tr> and <td>. The relationship among these tags is as follows: all <td> and <th> tags are inside a <tr> tag, and all <tr> tags are inside the <table> tag.
<table> </table>: The table tag. It’s used to define an HTML table.
<tr> </tr>: The table row tag. It defines a row of cells in an HTML table and contains <th> or <td> tags.
<td> </td>: The table data tag. It defines a regular data cell in a row.
<th> </th>: The table heading tag. It defines a header cell in a row.
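border: This is an attribute of the <table> tag rather than a tag of its own. It’s used to specify the border of the table. In current HTML the border is usually set with CSS instead.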
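Because every cell sits inside a row and every row sits inside the table, a table can be turned into a list of rows quite mechanically. The sketch below shows one way to do that, assuming Python 3 and its standard html.parser module. The sample table and the class name TableParser are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical table with one header row and two data rows.
SAMPLE_TABLE = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


table_parser = TableParser()
table_parser.feed(SAMPLE_TABLE)
table_parser.close()
for row in table_parser.rows:
    print(row)
```

Header cells and data cells are treated the same way here, so the first row of the result holds the column names. Tables nested inside cells are not handled.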
Common attributes
An attribute is used to modify an HTML element or to provide extra information about it.
class: specifies one or more class names for an HTML element. Several elements can share the same class.
<div class="header">
id: defines a unique identifier for an HTML element within the HTML document.
<style id="firepath-matching-node-style" type="text/css"> </style>
href: specifies the destination address, that is, the target URL of a link. Often used in the <a> tag.
<a href="http://www.octoparse.com">This is our website!</a>
src: specifies the URL of an external resource, such as an image or a script file.
<script src="/scripts/app.js"></script>
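These four attributes are the ones most often used to locate data on a page. The sketch below indexes a small fragment by class and id and collects its href and src values. It assumes Python 3 and its standard html.parser module. The fragment itself, the image path /images/logo.png and the class name AttributeIndex are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical fragment; the image path is a placeholder.
SAMPLE_FRAGMENT = """
<div class="header main" id="top">
  <a href="http://www.octoparse.com">This is our website!</a>
  <img src="/images/logo.png" alt="Logo">
</div>
<script src="/scripts/app.js"></script>
"""


class AttributeIndex(HTMLParser):
    """Indexes tags by class and id, and collects href and src values."""

    def __init__(self):
        super().__init__()
        self.by_class = {}  # class name -> list of tag names
        self.by_id = {}     # id value -> tag name
        self.links = []     # href values of <a> tags
        self.sources = []   # src values of any tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several names separated by spaces.
        for name in (attrs.get("class") or "").split():
            self.by_class.setdefault(name, []).append(tag)
        if attrs.get("id"):
            self.by_id[attrs["id"]] = tag
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if attrs.get("src"):
            self.sources.append(attrs["src"])


index = AttributeIndex()
index.feed(SAMPLE_FRAGMENT)
index.close()
print(index.by_class)
print(index.by_id)
print(index.links)
print(index.sources)
```

The <div> in the fragment carries two class names, so it is listed under both, while its id maps to exactly one element.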
For more information about HTML, you can go to W3Schools.