Jsoup Selectors

From WikiOD

Remarks

A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).
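
To make the case-insensitivity rule concrete, here is a minimal sketch. The class name, the `count` helper, and the sample markup are hypothetical and exist only for illustration. Each upper-case selector is expected to match the same single element as its lower-case form, and `div > a` shows two simple selectors chained with the child combinator.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CaseInsensitiveSelectors {

    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<div class=\"left\"><a rel=\"nofollow\" href=\"/a\">A</a></div>";

    // Counts the elements in the sample markup matched by a selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        // Tag and class name in lower case, then in upper case.
        System.out.println(count("div.left"));
        System.out.println(count("DIV.LEFT"));

        // Attribute name and value in lower case, then in upper case.
        System.out.println(count("a[rel=nofollow]"));
        System.out.println(count("A[REL=NOFOLLOW]"));

        // Two simple selectors joined by the child combinator ">".
        System.out.println(count("div > a"));
    }
}
```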

The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).
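
The equivalence can be checked with a small sketch like the following. The class name, the `tagNames` helper, and the sample markup are assumptions made up for this example. Both selectors should return the same elements in the same order.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImplicitUniversalSelector {

    // Hypothetical sample markup: two elements carry the "header" class.
    static final String HTML =
            "<div class=\"header\">Top</div>"
          + "<p class=\"header\">Intro</p>"
          + "<p>Body</p>";

    // Returns the tag names of all elements matched by the selector, in document order.
    static List<String> tagNames(String selector) {
        Document doc = Jsoup.parse(HTML);
        List<String> names = new ArrayList<>();
        for (Element element : doc.select(selector)) {
            names.add(element.tagName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tagNames("*.header")); // explicit universal selector
        System.out.println(tagNames(".header"));  // implicit universal selector
    }
}
```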

| Pattern | Matches | Example |
| --- | --- | --- |
| `*` | any element | `*` |
| `tag` | elements with the given tag name | `div` |
| `ns\|E` | elements of type E in the namespace ns | `fb\|name` finds `<fb:name>` elements |
| `#id` | elements with attribute ID of "id" | `div#wrap`, `#logo` |
| `.class` | elements with a class name of "class" | `div.left`, `.result` |
| `[attr]` | elements with an attribute named "attr" (with any value) | `a[href]`, `[title]` |
| `[^attrPrefix]` | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | `[^data-]`, `div[^data-]` |
| `[attr=val]` | elements with an attribute named "attr", and value equal to "val" | `img[width=500]`, `a[rel=nofollow]` |
| `[attr="val"]` | elements with an attribute named "attr", and value equal to "val" | `span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]` |
| `[attr^=valPrefix]` | elements with an attribute named "attr", and value starting with "valPrefix" | `a[href^=http:]` |
| `[attr$=valSuffix]` | elements with an attribute named "attr", and value ending with "valSuffix" | `img[src$=.png]` |
| `[attr*=valContaining]` | elements with an attribute named "attr", and value containing "valContaining" | `a[href*=/search/]` |
| `[attr~=regex]` | elements with an attribute named "attr", and value matching the regular expression | `img[src~=(?i)\\.(png\|jpe?g)]` |
| | The above may be combined in any order | `div.header[title]` |
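
The attribute patterns can be tried out with a short sketch such as the one below. The class name, the `count` helper, and the sample markup (including the example.com URL) are hypothetical. With this markup, each selector is expected to match one element, except the regular-expression selector, which should match both the `.png` and the `.JPEG` image.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AttributeSelectorDemo {

    // Hypothetical sample markup; none of these names come from a real page.
    static final String HTML =
            "<div class=\"header\" title=\"Top\" data-id=\"7\">"
          + "<a href=\"http://example.com/search/q\">external</a>"
          + "<a href=\"/local\" rel=\"nofollow\">local</a>"
          + "<img src=\"logo.png\"><img src=\"photo.JPEG\"><img src=\"anim.gif\">"
          + "</div>";

    // Counts the elements in the sample markup matched by a selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "[^data-]",                      // attribute name starts with "data-"
            "a[href^=http:]",                // value starts with "http:"
            "img[src$=.png]",                // value ends with ".png"
            "a[href*=/search/]",             // value contains "/search/"
            "a[rel=\"nofollow\"]",           // quoted value, exact match
            "img[src~=(?i)\\.(png|jpe?g)]",  // value matches a regular expression
            "div.header[title]"              // tag, class, and attribute combined
        };
        for (String selector : selectors) {
            System.out.printf("%-30s -> %d match(es)%n", selector, count(selector));
        }
    }
}
```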

Selector full reference

Selecting elements using CSS selectors

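import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<!DOCTYPE html>" +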
              "<html>" +
                "<head>" +
                  "<title>Hello world!</title>" +
                "</head>" +
                "<body>" +
                  "<h1>Hello there!</h1>" +
                  "<p>First paragraph</p>" +
                  "<p class=\"not-first\">Second paragraph</p>" +
                  "<p class=\"not-first third\">Third <a href=\"page.html\">paragraph</a></p>" +
                "</body>" +
              "</html>";

// Parse the document
Document doc = Jsoup.parse(html);

// Get document title
String title = doc.select("head > title").first().text();
System.out.println(title); // Hello world!

Element firstParagraph = doc.select("p").first();

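// Get all paragraphs except the first
Elements otherParagraphs = doc.select("p.not-first");
// Same result, obtained by selecting every paragraph and dropping the first.
// Note: depending on the jsoup version, remove(0) may also detach the
// element from the document, not just from this list.
otherParagraphs = doc.select("p");
otherParagraphs.remove(0);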

// Get the third paragraph (second in the list otherParagraphs which
// excludes the first paragraph)
Element thirdParagraph = otherParagraphs.get(1);
// Alternative: select() returns Elements, so take the first match
thirdParagraph = doc.select("p.third").first();

// You can also select within elements, e.g. anchors with an href attribute
// within the third paragraph. select() returns Elements, so take the first match.
Element link = thirdParagraph.select("a[href]").first();
// or the first <h1> element in the document body
Element headline = doc.select("body").first().select("h1").first();

A detailed overview of supported selectors is available in the jsoup Selector API documentation.
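
Calling `first()` on an empty result returns null, so chained calls such as `doc.select(...).first().text()` throw a NullPointerException when nothing matches. The following is a minimal null-safe sketch. It assumes a jsoup version that provides `Element.selectFirst`, and the class name, the `textOrDefault` helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NullSafeSelection {

    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<html><head><title>Hello world!</title></head>"
          + "<body><h1>Hello there!</h1></body></html>";

    // Returns the text of the first element matching the selector,
    // or the fallback when the selector matches nothing.
    static String textOrDefault(String html, String selector, String fallback) {
        Document doc = Jsoup.parse(html);
        Element match = doc.selectFirst(selector); // null when there is no match
        return match != null ? match.text() : fallback;
    }

    public static void main(String[] args) {
        System.out.println(textOrDefault(HTML, "head > title", "(none)"));
        System.out.println(textOrDefault(HTML, "p.third a[href]", "(none)"));
    }
}
```

With this markup, the first call should find the title text, while the second falls back to "(none)" because the sample contains no paragraphs.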

Extract Twitter Markup

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Twitter markup documentation:
    // https://dev.twitter.com/cards/markup
    String[] twitterTags = {
            "twitter:site", 
            "twitter:site:id", 
            "twitter:creator", 
            "twitter:creator:id", 
            "twitter:description", 
            "twitter:title", 
            "twitter:image", 
            "twitter:image:alt", 
            "twitter:player", 
            "twitter:player:width", 
            "twitter:player:height", 
            "twitter:player:stream", 
            "twitter:app:name:iphone", 
            "twitter:app:id:iphone", 
            "twitter:app:url:iphone", 
            "twitter:app:name:ipad", 
            "twitter:app:id:ipad", 
            "twitter:app:url:ipad",
            "twitter:app:name:googleplay", 
            "twitter:app:id:googleplay", 
            "twitter:app:url:googleplay"
    };

    // Connect to the URL and parse the response
    // (get() throws IOException if the request fails)
    Document doc = Jsoup.connect("http://stackoverflow.com/").get();

    for (String twitterTag : twitterTags) {

        // find a matching meta tag
        Element meta = doc.select("meta[name=" + twitterTag + "]").first();

        // if found, get the value of the content attribute
        String content = meta != null ? meta.attr("content") : "";

        // display results
        System.out.printf("%s = %s%n", twitterTag, content);
    }

Output

twitter:site = 
twitter:site:id = 
twitter:creator = 
twitter:creator:id = 
twitter:description = Q&A for professional and enthusiast programmers
twitter:title = Stack Overflow
twitter:image = 
twitter:image:alt = 
twitter:player = 
twitter:player:width = 
twitter:player:height = 
twitter:player:stream = 
twitter:app:name:iphone = 
twitter:app:id:iphone = 
twitter:app:url:iphone = 
twitter:app:name:ipad = 
twitter:app:id:ipad = 
twitter:app:url:ipad = 
twitter:app:name:googleplay = 
twitter:app:id:googleplay = 
twitter:app:url:googleplay =
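
As an alternative to looking up each tag name separately, a single prefix selector can collect every `twitter:` meta tag that is actually present. The sketch below is one possible approach, not the method used above. It parses a hypothetical local HTML string instead of fetching a page, and the class name, the `twitterTags` helper, and the `SAMPLE` markup are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TwitterMetaByPrefix {

    // Hypothetical sample markup standing in for a fetched page.
    static final String SAMPLE =
            "<html><head>"
          + "<meta name=\"twitter:title\" content=\"Stack Overflow\">"
          + "<meta name=\"twitter:description\" content=\"Q&amp;A for programmers\">"
          + "<meta name=\"description\" content=\"not a Twitter tag\">"
          + "</head><body></body></html>";

    // Maps each twitter:* meta name to its content attribute, in document order.
    static Map<String, String> twitterTags(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> tags = new LinkedHashMap<>();
        // [name^=twitter:] matches meta elements whose name starts with "twitter:"
        for (Element meta : doc.select("meta[name^=twitter:]")) {
            tags.put(meta.attr("name"), meta.attr("content"));
        }
        return tags;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> tag : twitterTags(SAMPLE).entrySet()) {
            System.out.printf("%s = %s%n", tag.getKey(), tag.getValue());
        }
    }
}
```

Unlike the loop over a fixed list, this only reports tags that exist in the document, so absent tags do not appear with empty values.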

Credit: Stack Overflow Documentation