Jsoup is an HTML parsing and data extraction library for Java, focused on flexibility and ease of use. It can be used to extract specific data from HTML pages (commonly known as “web scraping”), to modify the content of HTML pages, and to “clean” untrusted HTML with a whitelist of allowed tags and attributes.
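As a rough illustration of the extraction and modification use cases, here is a minimal sketch that parses an HTML string, collects link targets, and edits the parsed document. The class and method names (`JsoupExtractSketch`, `extractLinks`, `markLinksNofollow`) and the sample HTML are made up for this example; only the `Jsoup.parse`, `select`, and `attr` calls come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupExtractSketch {

    // Parses an HTML string and returns the href of every <a> element that has one.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Parses an HTML string, sets rel="nofollow" on every link,
    // and returns the modified document.
    static Document markLinksNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc;
    }

    public static void main(String[] args) {
        // Sample markup, invented for illustration only.
        String html = "<p>See <a href=\"http://jsoup.org/\">jsoup</a> and "
                + "<a href=\"http://example.com/\">an example</a>.</p>";

        System.out.println(extractLinks(html));
        System.out.println(markLinksNofollow(html).body().html());
    }
}
```

The same selector syntax works on documents fetched over HTTP, so the parsing step can be swapped for a network request without changing the extraction code.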
Reverse engineer how the page loads its data. Web pages that load data dynamically typically do so via AJAX, so you can look at the network tab of your browser’s developer tools to see where the data is being loaded from, and then use those URLs in your own code. See how to scrape AJAX pages for more details.
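Once such a URL has been found, requesting it directly might look like the sketch below. The endpoint URL, the class name `AjaxEndpointSketch`, and the `X-Requested-With` header are assumptions for illustration; whether a given site needs that header (or cookies, or other parameters) has to be checked in the network tab.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class AjaxEndpointSketch {

    // Builds (but does not send) a request for a data endpoint.
    static Connection prepare(String endpointUrl) {
        return Jsoup.connect(endpointUrl)
                // Without this, jsoup refuses responses that are not HTML or XML, such as JSON.
                .ignoreContentType(true)
                // Assumption: some endpoints only answer requests that look like AJAX calls.
                .header("X-Requested-With", "XMLHttpRequest")
                .timeout(10000); // milliseconds
    }

    // Sends the request and returns the raw response body as a string.
    static String fetchBody(String endpointUrl) throws IOException {
        return prepare(endpointUrl).execute().body();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint; replace with the URL seen in the network tab.
        System.out.println(fetchBody("http://example.com/api/items?page=1"));
    }
}
```

Jsoup does not parse JSON itself, so if the endpoint returns JSON the body would normally be handed to a separate JSON library.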
Official website & documentation
You can find various Jsoup-related resources at jsoup.org, including the Javadoc, usage examples in the Jsoup cookbook, and JAR downloads. See the GitHub repository for the source code, issues, and pull requests.
Jsoup is available on Maven as org.jsoup:jsoup. If you’re using Gradle (e.g. with Android Studio), you can add it to your project by adding the following to the dependencies section of your build.gradle:
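A sketch of that dependency line, assuming the same 1.8.3 version as in the Maven snippet and the `compile` configuration (newer Gradle versions use `implementation` instead):

```
dependencies {
    // jsoup HTML parser library @ http://jsoup.org/
    compile 'org.jsoup:jsoup:1.8.3'
}
```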
If you’re using Maven (e.g. with Eclipse), add the following to the dependencies section of your POM:
<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.3</version>
</dependency>
Jsoup is also available as a downloadable JAR for other environments.