Cluster-Based Patent Retrieval for Scientists and Technologists


Patents are a rich resource of scientific knowledge, as each invention that is patented, or for which a patent is sought, embodies several technological concepts besides the key concept on which it is based. As in scientific literature, it is customary to cite previous patents and other sources of information when describing the invention in a patent application.

Retrieval of past patents relevant to a target technology, whether or not they have been cited, is a sine qua non in the filing and prosecution of patent applications. The main purpose of retrieving patent documents is to validate the genuineness of the technology claimed in a patent application. This practice is assiduously followed by large companies that maintain sizable patent portfolios, and by their Patent Agents and Patent Attorneys. Competitive intelligence, that is, keeping track of competitors’ activities, is usually their main objective in conducting patent searches continuously. Patent Examiners in patent offices around the world also conduct extensive patent searches to establish novelty and, in the context of an invalidity search (i.e., finding prior patents that contain conflicting claims), to check for any infringement, which is often critical for a newly filed patent application.

For academic researchers, the purpose of a patent search is usually to gain insight and knowledge for new leads in research. Most often such searches rely on keyword queries in free patent databases, followed by laboriously building a set of patents by hand-picking from the maze of results thus obtained. These are fairly painstaking exercises and fail to inspire many scientists to conduct patent searches in their areas of interest on a regular basis. As we shall see in the following sections, patent retrieval, especially the emerging mode of cluster-based patent search, is a potentially new tool for gathering relevant existing knowledge in the hands of technology developers.

On-line Search of Patents

With the growing harmonization of patent laws across countries, the taxonomy of patent documents and the respective fields of their content have also become quite comparable. This follows the adoption, since the 1970s, of the ‘Internationally agreed Numbers for the Identification of (bibliographic) Data (INID)’ codes to identify bibliographic data on the front page of patent documents. These INID codes allow patent documents collected from different sources to be collated in a common database for bibliographic analysis on a global basis. At present there are approximately 60 INID codes, each representing a distinct item of bibliographic data. They are widely used on the first page of patent documents and in Patent Gazettes.

Some key INID codes pertain to patent classification codes, which indicate the nature of the technical information contained in a patent document. Various classification systems are in use, mostly developed by individual countries to suit their own convenience and requirements. The International Patent Classification (IPC), introduced by the World Intellectual Property Organization (WIPO), is updated regularly. Fortunately, the IPC is universally accepted, and most Patent Offices assign IPC codes, in addition to their own codes, to all patents they grant. Another redeeming feature of the IPC is that when old codes are changed as a result of its expansion policy, all old patents residing in various online databases are also updated with the new codes, so that a patent search based on a valid IPC code never becomes dysfunctional.

The IPC system divides all fields of technology into hierarchical sets of sections, classes, subclasses and groups. A technical area, which may be any technical matter (e.g., a process, product, technique or apparatus), is defined and suitably differentiated at the level of a class or a subclass in the IPC. At present the total number of codes in this classification exceeds 70,000. It is an indispensable tool for industrial property offices the world over in conducting searches to establish the novelty of an invention or to determine the state of the art in a particular area of technology.
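To make the hierarchy concrete, the short Python sketch below splits a full IPC symbol into its section, class, subclass, main group and subgroup. It is only an illustration: the function name parse_ipc, the IPCSymbol type and the regular expression are assumptions introduced here, and the pattern covers only the common written form of a symbol (one letter, two digits, one letter, then group/subgroup), not every variant a database may return.

```python
import re
from typing import NamedTuple


class IPCSymbol(NamedTuple):
    """Levels of the IPC hierarchy for one symbol (illustrative type)."""
    section: str
    ipc_class: str
    subclass: str
    main_group: str
    subgroup: str


# Assumed pattern: section letter A-H, two-digit class, subclass letter,
# main-group number, a slash, and a subgroup number of two or more digits.
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol: str) -> IPCSymbol:
    """Split a full IPC symbol such as 'B09C 1/08' into its hierarchy levels."""
    match = _IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, cls, sub, group, subgroup = match.groups()
    subclass = section + cls + sub
    return IPCSymbol(
        section=section,
        ipc_class=section + cls,
        subclass=subclass,
        # By convention the main group itself is written with '/00'.
        main_group=f"{subclass} {group}/00",
        subgroup=f"{subclass} {group}/{subgroup}",
    )
```

Calling parse_ipc("B09C 1/08") yields the section "B", the class "B09", the subclass "B09C" and the main group "B09C 1/00", which are the levels at which a searcher would typically widen or narrow a code-based query.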

Several patent classification schemes are in use by different patent authorities, but the International Patent Classification (IPC) is by far the most popular and the most universally applied. For example, the UK Patent Office discontinued its own classification system (UKC) in favour of the IPC in July 2007. More than 70 patent-granting authorities are believed to be using IPC codes, and this number is growing. Moreover, there are now several comprehensive online patent databases offering free search facilities for patent data from a large number of countries. For example, ‘esp@cenet’, the worldwide patent database of the European Patent Office, provides free search over a database pooling patent documents from more than 90 countries.

This presupposes that a scientist or technologist fully understands the IPC system or, more precisely, knows the IPC codes covering his / her areas of interest, in order to benefit optimally from such a comprehensive source through code-based searches and to build a meaningful patent inventory.

Cluster Based Patent Retrieval & Visualisation

Searches based on IPC codes are, by the very nature of IPC codes, essentially cluster-based on ‘topic’ rather than on ‘search terms’. During the prosecution stage, Patent Examiners hand-assign each document to its appropriate IPC codes, so that the document becomes part of several pre-defined clusters. Many automated software tools for patent search and mapping are based on IPC clusters, with further segmentation and clustering on specific terms before the resulting trends and profiles are visualized from the refined data. Many commercial vendors of mapping and analytical software maintain their own databases of processed patents, which are intuitively categorized and classified into different scientific fields according to their own schemes instead of using IPC clusters of raw patents. Clearly, such a facility provides significant value addition, but at a cost that may be out of reach for many scientists.
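The idea that one document sits in several pre-defined IPC clusters can be sketched in a few lines of Python. The sketch assumes that each document already carries a list of IPC symbols and that the first four characters of a symbol identify its subclass; the function name cluster_by_subclass and the document identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_by_subclass(doc_codes):
    """Group document ids by the IPC subclass of each assigned code.

    doc_codes maps a document id to the list of IPC symbols assigned to it.
    A document with codes in several subclasses appears in several clusters.
    """
    clusters = defaultdict(set)
    for doc_id, codes in doc_codes.items():
        for code in codes:
            # Assumption: the first four characters (e.g. 'B09C') are the subclass.
            subclass = code.replace(" ", "").upper()[:4]
            clusters[subclass].add(doc_id)
    return dict(clusters)


# Hypothetical documents and code assignments, for illustration only.
sample = {
    "doc-1": ["B09C 1/08", "C02F 1/00"],
    "doc-2": ["B09C 1/10"],
    "doc-3": ["C02F 3/34"],
}
ipc_clusters = cluster_by_subclass(sample)
```

Here doc-1 lands in both the B09C and the C02F cluster, which is the multiple membership described above; term-based refinement within a cluster would be a separate, later step.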


Prior IP: Patent Cluster Visualization

Prior IP is one of the latest to join the bandwagon of patent search engines on the internet, with a novel concept of clustering based on patent citations and of visually displaying technology areas with patent data. Although the methodology is not clearly explained in the available information, the site claims to have evolved over 50,000 clusters, and a patent or application can belong to more than one cluster. Three distinct modes of visualization of related clusters are provided: cluster maps, cluster landscapes and cluster neighborhoods.

The first step for a user is to enter the relevant term(s) in the search box(es) for a ‘technology’, an ‘organization’ (i.e., assignee or applicant), an ‘inventor’, a ‘document number’, etc. This returns a relevancy-ranked list of patent documents (grants and applications) available within the database. Clicking on a listed patent or application displays the selected document in a familiar format, with the text of the document alongside a thumbnail of the front page. Next to the list of patents / applications, a small window on the right side shows a thumbnail of a network of clusters, labelled “Visualize IP Search Results”. Clicking on this thumbnail opens a ‘Cluster Landscape’ with a series of ‘Cluster Maps’, each bearing a set of specific top terms for its cluster and indicating the number of patents and applications in that cluster. How these clusters bring forth relevant patents and applications from the maze of patent clusters, and allow users to hand-pick the most relevant ones by surfing through a visually friendly and spatially oriented landscape of familiar technological areas, can be demonstrated through an example of an actual search.

Contamination of the soil environment is a hot topic, and a variety of technologies based on different approaches have been developed, with new ones still being developed in many countries. We therefore chose the term ‘soil contamination’ for our initial search, which returned a simple list of 1529 patents and applications. As anticipated, a thumbnail for “Visualize IP Search Results” appears on the right side. Running through the list, we can see patents from various countries, namely USA, Germany, Taiwan, Korea, Japan, etc. Clicking on a document title gives access to the bibliographic data with a facsimile of the first page of the document. Links are also provided to the source database for the full text of the document and for downloading a pdf version of it. Each entry in the list carries a ‘relevancy score’, presumably based on the intensity of the search term within the document. Please note that we ran a simple search with the given terms; it is also possible to search within specific sections of the document, viz., title, abstract, description, etc., which can return a different number of documents and different scores.

The most interesting part of this search exercise, however, is the cluster maps rather than the running list. When we click on “Visualize IP Search Results”, we are presented with a cluster landscape consisting of a network of thumbnail views of the cluster maps. In this case, we get as many as 43 cluster maps (even though elsewhere the site indicates 114 clusters!). Each cluster map has a unique name (with top terms in parentheses) referring to a specific area of research and technology. Many of the cluster names may appear only remotely connected with the key area of interest, but some are very close to it. Table 1 below shows some of these cluster titles, e.g., i) ddt contaminated soil, ii) removing soil contaminants, iii) remediating contaminated soil, and so on.

The patents and applications in any of the clusters can be viewed by clicking on the links provided. Similarly, clicking on the cluster image or link brings up the spread of the network of clusters. Thus, by clicking on the link for Cluster 3 in Table 1 above, we can see the desired cluster map, as in Figure 1.

Once again, not all cluster titles may appear relevant, but some could attract the attention of the user. Some of the other clusters, accessed at successive stages through a previous cluster, are shown in Table 1. It may be noted that the successive clusters are not sub-groups of the previous cluster; all are independent clusters networked with each other through a pre-determined relationship (citations!). This can be understood from the fact that the numbers of patents / applications in these clusters are highly variable and bear no hierarchical relationship of any kind. The details of patents and applications in any of these clusters can be seen in pop-up windows by clicking on the relevant link. It is interesting to note that a separate link is also provided to access details of some patents / applications which, for some reason, do not form part of the cluster.
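Since the site does not document how its clusters are linked, the following Python sketch shows only one plausible way in which independent, possibly overlapping clusters could be networked through citations: count the citations that run between documents of two different clusters. The function name cluster_links, the cluster names and the document identifiers are all hypothetical, and this is not presented as Prior IP's method.

```python
from collections import Counter
from itertools import product


def cluster_links(membership, citations):
    """Count citations that connect documents in two different clusters.

    membership maps a cluster name to the set of document ids in it
    (a document may appear in more than one cluster).
    citations is an iterable of (citing_doc, cited_doc) pairs.
    Returns a Counter keyed by an unordered pair of cluster names.
    """
    doc_to_clusters = {}
    for name, docs in membership.items():
        for doc in docs:
            doc_to_clusters.setdefault(doc, set()).add(name)

    links = Counter()
    for citing, cited in citations:
        pairs = product(doc_to_clusters.get(citing, ()),
                        doc_to_clusters.get(cited, ()))
        for a, b in pairs:
            if a != b:  # citations inside one cluster do not link clusters
                links[frozenset((a, b))] += 1
    return links


# Hypothetical clusters and citation pairs, for illustration only.
membership = {"X": {"p1", "p2"}, "Y": {"p2", "p3"}, "Z": {"p4"}}
citations = [("p1", "p3"), ("p4", "p1"), ("p2", "p3")]
links = cluster_links(membership, citations)
```

In such a network the link weights say nothing about cluster size, which is consistent with the observation that neighbouring clusters vary widely in the number of documents and show no parent-child relationship.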

The pop-up of cluster details appears in an interactive template, as shown in Figure 2, and is quite informative. The figure shows details of the cluster ‘remediating contaminated soil (water, removing, verfahren)’, which we reached through Cluster 3 shown in Table 1 above. This cluster contains 1780 patents and 1689 applications. Besides a patent time-line, which essentially shows the growth profile over time of the patents applied for / granted in the cluster, the template shows, in separate windows, the top assignees, top inventors, top patents and top applications. It is understood that while top patents and top applications are both ranked with reference to the citations of the patents / applications listed in the relevant cluster, the top assignees and top inventors should be based on their absolute numbers in each cluster. Attempts to view top inventors, however, did not return any results. Nonetheless, the window for top assignees presents very interesting information. The active links for the top assignees point to a comprehensive profile of each assignee, with details of the patents, applications, technologies and clusters attributed to it. The profile also provides the assignee’s most recent patents, applications and licensable technologies, which are not restricted to the field of the cluster. Thus one can easily comprehend the overall technological strength and backing of the assignee beyond one’s narrow area of interest.
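A user who has the records of a cluster in hand can reproduce two of these views with simple counting. The Python sketch below assumes each record is a dictionary with hypothetical "assignee" and "year" fields, and ranks assignees by absolute number of documents, as the text above supposes the site does; the function name summarize_cluster and the sample records are illustrative.

```python
from collections import Counter


def summarize_cluster(records, top_n=3):
    """Return (top assignees, time-line) for the records of one cluster.

    Top assignees are ranked by their absolute number of documents;
    the time-line is a list of (year, document count) pairs sorted by year.
    """
    assignee_counts = Counter(record["assignee"] for record in records)
    year_counts = Counter(record["year"] for record in records)
    return assignee_counts.most_common(top_n), sorted(year_counts.items())


# Hypothetical records, for illustration only.
records = [
    {"assignee": "Alpha Corp", "year": 2005},
    {"assignee": "Alpha Corp", "year": 2006},
    {"assignee": "Alpha Corp", "year": 2006},
    {"assignee": "Beta Ltd", "year": 2006},
    {"assignee": "Beta Ltd", "year": 2007},
    {"assignee": "Gamma GmbH", "year": 2007},
]
top_assignees, timeline = summarize_cluster(records, top_n=2)
```

Ranking top patents by citations would need citation counts per document, which are not part of this sketch.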


While experts in information science are still vigorously developing algorithms for cluster-based patent search, a very exciting tool is already on the horizon, albeit with a few bugs as yet. The cluster-based patent search facility offered by Prior IP is certainly very useful for everyone, and above all for scientists and technologists, who are likely to make the greatest use of it. Scientists keenly interested in building a patent inventory in their narrow subject of specialization should feel quite at home searching patents through the available clusters, without having to learn the basics of patent systems. An added attraction of the facility is that the user can download all patents of interest to his / her desktop in csv format to build a personal inventory of relevant patents in Excel.
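For readers who prefer a script to a spreadsheet, the downloaded csv files can also be merged into a single inventory with Python's standard csv module. The sketch below is a minimal illustration: the column names (document_number, title) are assumptions, since the layout of the exported file is not described here, and the function name merge_exports is hypothetical.

```python
import csv
import io


def merge_exports(csv_texts, key="document_number"):
    """Merge several csv exports into one inventory without duplicates.

    csv_texts is an iterable of csv file contents (as strings), each with a
    header row. Rows are de-duplicated on the assumed `key` column, keeping
    the first occurrence of each document.
    """
    inventory = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            inventory.setdefault(row[key], row)
    return list(inventory.values())


# Two hypothetical exports from different clusters, sharing one document.
export_a = "document_number,title\nD-001,Soil washing method\nD-002,Vapor extraction\n"
export_b = "document_number,title\nD-002,Vapor extraction\nD-003,Bioremediation process\n"
inventory = merge_exports([export_a, export_b])
```

Because a patent can belong to more than one cluster, exports taken from neighbouring clusters are likely to overlap, so de-duplicating on the document number keeps the personal inventory clean.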
