Retrieve song lyrics in Java using Screenscraping with JSoup

0

Last year I wrote about JSoup, a Java library that helps with screenscraping: Screenscraping from Java using jsoup – effective data gathering from websites (http://technology.amis.nl/blog/13121/screenscraping-from-java-using-jsoup-effective-data-gathering-from-websites). Last month I had another opportunity for using JSoup, this time to gather song lyrics for the songs on a CD. The context in this case was the internal SOA for Java Professionals training program at AMIS. The students did an assignment to complete the second block in this three-piece program. Their assignment required them to implement a Web Service that produced the CD Booklet for a certain CD – returned as PDF document with illustration, song titles and song lyrics. One of the resources we made available to the students was a Java Class that returned song lyrics. It was their challenge to integrate this class in a proper way in their application (be it PL/SQL, SOA Suite 11g or OSB based).

The LyricsGatherer is easily constructed using JSoup and the website http://www.songlyrics.com/ (that suffers from periodic and unfortunate loss of service) :

Image

Downdrilling on the search results brings us to the actual song lyrics:

Image

 

And if a browser can do this, so can a Java program (generally speaking and definitely true in this case).

The Java code – leveraging JSoup – to retrieve song lyrics looks like this:

package nl.amis.music.lyrics;

import java.io.IOException;

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LyricsGatherer {

   private final static String songLyricsURL = "http://www.songlyrics.com";


   public static List<String> getSongLyrics( String band, String songTitle) throws IOException {
     List<String> lyrics= new ArrayList<String>();

     Document doc = Jsoup.connect(songLyricsURL+ "/"+band.replace(" ", "-").toLowerCase()+"/"+songTitle.replace(" ", "-").toLowerCase()+"-lyrics/").get();
     String title = doc.title();
     System.out.println(title);
     Element p = doc.select("p.songLyricsV14").get(0);
      for (Node e: p.childNodes()) {
          if (e instanceof TextNode) {
            lyrics.add(((TextNode)e).getWholeText());
          }
      }
     return lyrics;
   }

   public static void main(String[] args) throws IOException {
      System.out.println(LyricsGatherer.getSongLyrics("U2", "With or Without You"));
      System.out.println(LyricsGatherer.getSongLyrics("Billy Joel", "Allentown"));
      System.out.println(LyricsGatherer.getSongLyrics("Tori Amos", "Winter"));
    }
}

The results

Image

can easily be returned in a Web Service style fashion.

Resources

Download the source discussed in this article:SongLyricsProvider.zip .

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

Leave a Reply