XPath

When you cannot select the HTML elements you want directly on the web page, you will need XPath expressions to locate them in the page’s source code. XPath, the XML Path Language, is a query language for selecting nodes from an XML document. An individual path expression is often referred to simply as “an XPath”, and it is used to navigate through the elements and attributes of an XML document.

XML and HTML

Before we dive into XPath, we will briefly introduce what XML and HTML are, and the difference between these two languages.

According to Wikipedia, XML (Extensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is designed to store and transport data, and it is also used to represent arbitrary data structures.

HTML (HyperText Markup Language) is the standard markup language used to create web pages. Together with CSS and JavaScript, it is used to build web pages and user interfaces for mobile and web applications. Web browsers read HTML files and render them into visible or audible web pages. HTML describes the structure of a web page semantically and also carries cues for the presentation or appearance of the document.

HTML can be regarded as a looser, non-standard relative of XML: it uses the same tag-and-attribute syntax but does not follow XML’s strict well-formedness rules. XML focuses on carrying data, while HTML focuses on displaying data.

XPath is used to navigate through elements and attributes in an XML document. All web pages are, in essence, HTML documents. Octoparse provides an XPath engine for HTML documents so that we can use XPath to locate data on a web page precisely.

Here are two examples of XPath expressions that Octoparse generated automatically on the Customize Current Action pane:

//UL[@class='nav navbar-nav center-nav']

//*[@id='gdp']

So what do these path expressions mean?

XPath uses path expressions to select nodes. A node is selected by following a path, that is, a sequence of steps. (For more detailed information, please visit https://en.wikipedia.org/wiki/XPath.)

Below, we’ve listed the most useful path expressions, as published on w3schools.com:

Expression	Description
nodename	Selects all nodes with the name “nodename”
/	Selects from the root node
//	Selects nodes in the document from the current node that match the selection, no matter where they are
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes
*	Matches any element node
@*	Matches any attribute node
node()	Matches any node of any kind
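
To make the table more concrete, here is a minimal sketch that tries several of these expressions with Python’s standard xml.etree.ElementTree module. The bookstore document is a made-up sample, not data from Octoparse, and ElementTree implements only a subset of XPath: paths are written relative to the element you search from (hence the leading "."), and attributes are read from the element rather than selected as nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
SAMPLE = """
<bookstore>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book>
    <title lang="fr">Le Petit Prince</title>
    <price>12.50</price>
  </book>
</bookstore>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the <bookstore> element

# nodename: child elements of the current node that are named "book"
books = root.findall("book")

# //: matching elements at any depth below the current node
titles = root.findall(".//title")

# . : the current node itself
current = root.find(".")

# ..: the parent of each selected node (here, the parents of the titles)
title_parents = root.findall(".//title/..")

# *: any child element, whatever its name
first_book_children = [child.tag for child in books[0].findall("*")]

# @: ElementTree does not return attribute nodes, so the attribute
# value is read from the element with get() instead
languages = [title.get("lang") for title in titles]

print(len(books), [title.text for title in titles], languages)
```

A full XPath 1.0 engine accepts the expressions exactly as they appear in the table; the adjustments above are specific to ElementTree.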

XPath expressions can also contain predicates, which are used to find a specific node or a node that contains a specific value. Predicates are always embedded in square brackets. The table below, also published on w3schools.com, lists some path expressions with predicates and the corresponding results:

XPath Expression	Result
/bookstore/book[last()]	Selects the last book element that is the child of the bookstore element
/bookstore/book[position()<3]	Selects the first two book elements that are children of the bookstore element
//title[@lang='en']	Selects all the title elements that have a “lang” attribute with a value of “en”
/bookstore/book[price>35.00]/title	Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00
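
The predicates in this table can be tried out in the same way. The sketch below again uses Python’s standard xml.etree.ElementTree module and a made-up bookstore sample. ElementTree understands [last()] and [@lang='en'], but not position()<3 or numeric comparisons such as price>35.00, so those two rows are emulated in plain Python here; this is a workaround for the illustration, not how a full XPath engine evaluates them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
SAMPLE = """
<bookstore>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>12.50</price></book>
  <book><title lang="en">XQuery Basics</title><price>49.99</price></book>
</bookstore>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the <bookstore> element

# /bookstore/book[last()]: the last book child of bookstore
last_book = root.find("book[last()]")

# //title[@lang='en']: every title whose lang attribute equals "en"
english_titles = [t.text for t in root.findall(".//title[@lang='en']")]

# /bookstore/book[position()<3]: not supported by ElementTree,
# so the first two book children are taken with a list slice
first_two = root.findall("book")[:2]

# /bookstore/book[price>35.00]/title: not supported by ElementTree,
# so the price comparison is done in Python
expensive_titles = [
    book.findtext("title")
    for book in root.findall("book")
    if float(book.findtext("price")) > 35.00
]

print(last_book.findtext("title"), english_titles, expensive_titles)
```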

Now we know that //UL[@class='nav navbar-nav center-nav'] selects all the UL elements that have a “class” attribute with a value of “nav navbar-nav center-nav”, and that //*[@id='gdp'] selects all elements in the document that have an “id” attribute with a value of “gdp”.
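
As a rough check of that reading, the two expressions can be run against a small page. The sketch below uses Python’s standard xml.etree.ElementTree module, not Octoparse’s own XPath engine, and the page markup is a made-up, well-formed sample. The tag name is written in lower case (ul) because ElementTree matches tag names case-sensitively, and the paths start with "." because ElementTree searches relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment made up for this illustration.
PAGE = """
<html>
  <body>
    <ul class="nav navbar-nav center-nav">
      <li>Home</li>
      <li>Pricing</li>
    </ul>
    <ul class="footer-links">
      <li>Contact</li>
    </ul>
    <div id="gdp">GDP figures</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Counterpart of //UL[@class='nav navbar-nav center-nav']: only the list
# whose class attribute is exactly this string is selected
nav_lists = root.findall(".//ul[@class='nav navbar-nav center-nav']")
nav_items = [li.text for li in nav_lists[0].findall("li")]

# Counterpart of //*[@id='gdp']: any element, whatever its tag name,
# whose id attribute equals "gdp"
gdp_elements = root.findall(".//*[@id='gdp']")

print(nav_items, [(e.tag, e.text) for e in gdp_elements])
```

Note that the attribute test compares the whole attribute value, so a list whose class is only “nav” would not match the first expression.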

Sometimes we need to edit the XPath manually with the XPath tools in Octoparse in order to fetch the data we want from a web page.