Tech Transfer

Classifying and Comparing Information by Content

IB-2597

APPLICATIONS OF TECHNOLOGY:

  • Classifying
    • websites
    • books
    • medical, legal, regulatory, and government documents
    • genetic codes, genomes, and proteomes
    • digitized media: music, books, video
  • Database construction and search

ADVANTAGES:

  • Provides more accurate classification
  • Provides information about objects’ proximity and relationships
  • Allows for regrouping of objects based on different characteristics

ABSTRACT:

Berkeley Lab scientists Sung-Hou Kim and Gregory E. Sims have developed a computational method that compares, categorizes, and indexes information-bearing objects according to their content. Objects with linear or linearizable information, including books, genetic codes, and digitized audio or video recordings, are compared based on the frequency of certain predefined features, such as a string of letters or numbers. The Berkeley Lab technology provides eigenvalues and eigenvectors, which not only convey the degree of difference or similarity between objects’ content but also identify the characteristics that make certain objects different or similar. In addition, the eigenvalues provide an objective and simple way of indexing. The relationships can then be depicted with a multidimensional matrix or diagrammatic tree.
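
The feature-frequency comparison described above can be illustrated with a minimal sketch. It assumes the features are all overlapping substrings of a fixed length k and that two profiles are compared with the Jensen-Shannon distance; both choices, along with the function names, are assumptions made for illustration rather than details given in this summary. The eigenvalue and eigenvector step is not shown.

```python
from collections import Counter
from math import log2, sqrt


def feature_frequency_profile(text, k):
    """Return the relative frequency of every overlapping length-k substring."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence (0 = same, 1 = disjoint)."""
    divergence = 0.0
    for feature in set(p) | set(q):
        pi, qi = p.get(feature, 0.0), q.get(feature, 0.0)
        mi = 0.5 * (pi + qi)  # mixture of the two profiles
        if pi > 0:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * log2(qi / mi)
    return sqrt(max(divergence, 0.0))


def distance_matrix(objects, k):
    """Pairwise distances between the profiles of named sequences (assumed helper)."""
    names = sorted(objects)
    profiles = {name: feature_frequency_profile(objects[name], k) for name in names}
    matrix = [[jensen_shannon_distance(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix
```

Calling `distance_matrix({"x": "ACGTACGT", "y": "TTTTTTTT"}, 2)` returns the sorted names and a symmetric matrix with zeros on the diagonal. The same functions apply to any linearized object, such as the text of a book, once it is represented as a string.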

As a result, the new method surpasses techniques that group objects by analyzing subjectively chosen characteristics or segments of data. For example, indices and website search engines rely on the presence or absence of specific keywords, on traffic, and on the frequency at which sites are accessed. Existing textual comparisons, such as those used to detect plagiarism, depend on the frequency of certain words yet cannot account for word order or syntax. Comparisons of genetic code used to classify organisms and identify targets for new medications rely on aligning a tiny, subjectively chosen fraction of the DNA (1% or less).
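
The limitation of word-frequency comparison can be seen in a small example. The two sentences and the use of word pairs as order-sensitive features are hypothetical choices made here for illustration; the summary itself does not specify which features the method uses for text.

```python
from collections import Counter


def word_counts(text):
    """Bag-of-words profile: word order is discarded."""
    return Counter(text.lower().split())


def word_ngram_counts(text, n):
    """Counts of n consecutive words: local word order is retained."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


first = "the dog bit the man"
second = "the man bit the dog"

# Same words with the same frequencies, so the two sentences are indistinguishable.
same_words = word_counts(first) == word_counts(second)

# Different word pairs, so the sentences can be told apart.
same_pairs = word_ngram_counts(first, 2) == word_ngram_counts(second, 2)
```

Here `same_words` is `True` while `same_pairs` is `False`: features that span more than one word capture ordering that single-word counts cannot.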

The Berkeley Lab technology has been tested in several domains. It categorized works of literature by genre, using the books’ content, more accurately than the traditional method based on word frequency. In another demonstration, the whole genomes of mammals were compared to produce a classification tree that matched the established phylogeny based on morphology. The technology was also used to produce phylogenetic trees of bacteria and viruses, which led to the classification of previously unclassified genomes.
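
As a rough sketch of how a classification tree can be derived from pairwise distances, the following uses average-linkage (UPGMA) clustering. This is a simple stand-in chosen for illustration; the summary does not state which tree-building procedure was used, and the function name and nested-tuple output are assumptions.

```python
def upgma_tree(names, matrix):
    """Average-linkage clustering of a distance matrix into a nested-tuple tree."""
    # Each cluster maps an id to (subtree, indices of the leaves it contains).
    clusters = {i: (name, [i]) for i, name in enumerate(names)}
    next_id = len(names)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                leaves_a = clusters[ids[x]][1]
                leaves_b = clusters[ids[y]][1]
                # Mean distance over all leaf pairs drawn from the two clusters.
                mean = sum(matrix[i][j] for i in leaves_a for j in leaves_b)
                mean /= len(leaves_a) * len(leaves_b)
                if best is None or mean < best[0]:
                    best = (mean, ids[x], ids[y])
        _, a, b = best
        tree_a, leaves_a = clusters.pop(a)
        tree_b, leaves_b = clusters.pop(b)
        # Merge the closest pair into a new internal node.
        clusters[next_id] = ((tree_a, tree_b), leaves_a + leaves_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree
```

Given names `["a", "b", "c", "d"]` and a matrix in which `a` and `b` are close to each other and `c` and `d` are close to each other, the function groups each close pair first and then joins the two groups at the root.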

STATUS:

  • Patent pending. Available for licensing or collaborative research.

To learn more about licensing a technology from LBNL see http://www.lbl.gov/Tech-Transfer/licensing/index.html.

FOR MORE INFORMATION:

Sims, G.E., S.-R. Jun, G.A. Wu, and S.-H. Kim, “Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions,” Proceedings of the National Academy of Sciences of the USA 106(8), 2677-2682 (February 24, 2009).

REFERENCE NUMBER: IB-2597

Last updated: 01/15/2013