Apache Tika: A Handy Tool for Extracting Text Content

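Apache Tika is an essential tool for anyone doing content analysis. It is a toolkit for extracting text that bundles libraries such as POI and PDFBox to read many file formats. Its biggest advantage is a single extraction interface: in a few lines of code, it detects the file type automatically and returns the text.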

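Before I discovered Apache Tika, I had to determine the file type myself and write separate code for each format before I could read the contents. Reading Microsoft Office documents alone was a headache, because .doc and .docx follow almost completely different format specifications. After trying Apache Tika over the past few days, I find it much more convenient, and I can throw away the code I wrote before.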

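import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
 public static void main(String[] args) throws Exception {
  File file = new File("your/file");
  // Detects the file type and returns all extracted text as one String
  String content = new Tika().parseToString(file);
  System.out.println(content);
 }
}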

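Extracting a file's content really is that simple. However, holding everything in a String is a serious drawback for large files, because it uses too much memory. Apache Tika can instead return a Reader that streams the content character by character; wrap it in a BufferedReader and you can process the text one small buffered piece at a time.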

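import java.io.BufferedReader;
import java.io.File;
import java.io.Reader;
import org.apache.tika.Tika;

public class StreamText {
 public static void main(String[] args) throws Exception {
  File file = new File("your/file");
  // parse() returns a Reader that streams the extracted text
  Reader reader = new Tika().parse(file);
  // try-with-resources closes the reader even if reading fails
  try (BufferedReader br = new BufferedReader(reader)) {
   String line;
   while ((line = br.readLine()) != null) {
    System.out.println(line);
   }
  }
 }
}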
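
If the text has no useful line breaks, readLine() may still pull a very long line into memory. A minimal sketch of processing a Reader in fixed-size chunks follows. It uses only the JDK; the class name ChunkedReaderDemo and the method countWords are hypothetical, and a StringReader stands in for the Reader that Tika would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: processes a Reader chunk by chunk instead of loading it all.
class ChunkedReaderDemo {

    // Counts whitespace-separated words while holding at most bufferSize chars in memory.
    static long countWords(Reader reader, int bufferSize) throws IOException {
        char[] buffer = new char[bufferSize];
        long words = 0;
        boolean inWord = false; // carries the "inside a word" state across chunk boundaries
        int read;
        while ((read = reader.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                boolean isSpace = Character.isWhitespace(buffer[i]);
                if (!isSpace && !inWord) {
                    words++; // first character of a new word
                }
                inWord = !isSpace;
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: StringReader stands in for the Reader from new Tika().parse(file).
        try (Reader reader = new StringReader("Apache Tika extracts text\nfrom many formats")) {
            System.out.println(countWords(reader, 8)); // prints 7
        }
    }
}
```

The buffer here is deliberately tiny so that words straddle chunk boundaries; in practice a buffer of a few kilobytes would be a more typical choice.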

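The examples above show the simplest way to use Apache Tika, but it also offers more advanced options that let you filter the data further. One approach is to use a different Parser for a specific kind of document; another is to choose a specific ContentHandler for the content you want. The two approaches can of course be combined and extended.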

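import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ExtractBody {
 public static void main(String[] args) throws Exception {
  // BodyContentHandler keeps only the text inside the document body
  ContentHandler handler = new BodyContentHandler();
  Parser parser = new AutoDetectParser();
  // try-with-resources closes the stream even if parsing fails
  try (InputStream input = new FileInputStream("your/html/file")) {
   parser.parse(input, handler, new Metadata(), new ParseContext());
  }
  String bodyContent = handler.toString();
  System.out.println(bodyContent);
 }
}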

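Finally, here is an example that automatically extracts the main body text of a web page, which is probably the most important step in web content research. For this example you also need the HttpClient library (part of the Apache HttpComponents project), which is used to fetch the page over HTTP.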

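import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
// Package as in Tika 1.x; it may differ in other Tika versions
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractMainContent {
 public static void main(String[] args) throws Exception {
  HttpGet httpget = new HttpGet("http://kuanming-style.blogspot.tw/");
  HttpClient client = new DefaultHttpClient();
  HttpResponse response = client.execute(httpget);
  HttpEntity entity = response.getEntity();
  if (entity != null) {
   // try-with-resources closes the response stream when parsing is done
   try (InputStream instream = entity.getContent()) {
    // Wrap the body handler so boilerpipe can filter out navigation and other boilerplate
    BodyContentHandler handler = new BodyContentHandler();
    BoilerpipeContentHandler boilerpipeHandler =
      new BoilerpipeContentHandler(handler);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    parser.parse(instream, boilerpipeHandler,
      metadata, new ParseContext());

    String content =
      boilerpipeHandler.toTextDocument().getContent();
    System.out.println(content);
   }
  }
 }
}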

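The library that extracts a page's main content is boilerpipe, which in principle is installed along with Tika. For its API Javadocs, however, you still have to go to the boilerpipe project site. If you find this ContentHandler is not good enough, you may need to write your own.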
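
As a rough idea of what writing your own handler might look like: Tika's ContentHandler is the standard SAX interface org.xml.sax.ContentHandler, and its parsers report document content as XHTML SAX events. The sketch below keeps only the text inside <p> elements. The class name ParagraphHandler and the method extract are hypothetical, and the JDK's own SAX parser stands in for Tika so the example runs without extra libraries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each <p> element and ignores everything else.
class ParagraphHandler extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length); // includes text of nested inline tags
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName) && current != null) {
            paragraphs.add(current.toString().trim());
            current = null;
        }
    }

    List<String> getParagraphs() {
        return paragraphs;
    }

    // Assumption: the JDK SAX parser plays the role of a Tika parser here.
    static List<String> extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        SAXParser parser = factory.newSAXParser();
        ParagraphHandler handler = new ParagraphHandler();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getParagraphs();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Title</h1><p>First paragraph.</p><div>menu</div>"
                + "<p>Second <b>bold</b> one.</p></body></html>";
        for (String p : extract(xhtml)) {
            System.out.println(p);
        }
    }
}
```

With Tika, an instance of such a handler would be passed to parser.parse(...) in place of the BodyContentHandler used earlier, and the collected paragraphs read back afterwards.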
