Refining Patent Model Searches for Optimal Results
Google has developed a curated dataset of phrases intended to improve the accuracy of patent search models. Google does not, however, publicly provide this dataset as a direct download for patent search model training.
The dataset contains approximately 50,000 phrase-to-phrase pairs and is designed to improve the search capabilities of patent databases. Each pair is labeled to show how its two phrases relate to each other, for example as synonyms, exact matches, or unrelated terms.
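To make the idea of labeled phrase pairs concrete, here is a minimal sketch of how such records might be represented in Python. The class and field names (PhrasePair, anchor, target, relation) and the three sample rows are assumptions made for illustration; they are not taken from Google's dataset, whose actual schema is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # The three relationship types mentioned in the text.
    EXACT = "exact match"
    SYNONYM = "synonym"
    UNRELATED = "unrelated"


@dataclass(frozen=True)
class PhrasePair:
    # Hypothetical field names, chosen for this sketch only.
    anchor: str
    target: str
    relation: Relation


# Made-up example rows, not real dataset entries.
pairs = [
    PhrasePair("soccer ball", "soccer ball", Relation.EXACT),
    PhrasePair("soccer ball", "spherical recreation device", Relation.SYNONYM),
    PhrasePair("soccer ball", "circuit board", Relation.UNRELATED),
]


def label_counts(items):
    """Count how many pairs carry each relationship label."""
    return Counter(p.relation for p in items)


print(label_counts(pairs))
```

A summary like label_counts is a quick way to check whether a hand-labeled collection is badly skewed toward one relationship type before it is used for training.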
Such a dataset is particularly useful because many patent owners describe the subject of their patents in non-standard language, which makes search results vary widely and limits their practical value. An example of non-standard language would be describing a soccer ball as a "spherical recreation device."
To access the patent data needed to build a comparable dataset, users can turn to resources like Google Patents or the Global Dossier. Google Patents provides a searchable interface to U.S. and worldwide patents, including full texts, images, and metadata. Users can query titles, abstracts, claims, and descriptions to extract phrases relevant to patent search tasks. The Global Dossier service, offered by the USPTO and cooperating patent offices, provides access to related patent family documents from multiple jurisdictions, including file histories and classifications.
While Google has not released a pre-packaged dataset for public use, the availability of patent data through these resources enables users to generate their own phrase datasets for patent search model training. This process involves collecting patent text data, using text processing techniques to extract and analyze phrases, and then labeling the relationships between these phrases.
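The collect, extract, and label steps described above can be sketched roughly as follows. This is a minimal illustration and not a production pipeline: it assumes the patent text has already been gathered into plain strings, uses simple word n-grams as a stand-in for phrase extraction, and leaves the relationship label empty for a human annotator. The function names, the stopword list, and the two sample sentences are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Small illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "for",
             "in", "is", "with", "on", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def extract_phrases(text, n_min=2, n_max=3):
    """Return word n-grams that neither start nor end with a stopword."""
    tokens = tokenize(text)
    phrases = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrases.append(" ".join(gram))
    return phrases


def candidate_pairs(documents, min_count=2):
    """Pair up phrases seen in at least min_count documents.

    The third element of each tuple is the relationship label,
    left as None until someone assigns it.
    """
    counts = Counter()
    for doc in documents:
        counts.update(set(extract_phrases(doc)))
    frequent = sorted(p for p, c in counts.items() if c >= min_count)
    return [(a, b, None) for a, b in combinations(frequent, 2)]


# Made-up sentences standing in for collected patent text.
docs = [
    "A spherical recreation device with an inflatable bladder.",
    "The inflatable bladder of the spherical recreation device is sealed.",
]

for pair in candidate_pairs(docs):
    print(pair)
```

Pairing every frequent phrase with every other grows quadratically, so a real effort would likely restrict candidates further, for instance to phrases drawn from the same patent classification, before sending them for labeling.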
For those seeking more open or bulk access, specialized patent search tools such as PatentLens allow programmatic or bulk querying of international patent data, which can be useful for compiling phrase datasets.
It's worth noting that Google has patented technologies for phrase-based thematic search that use information gain to identify predictive phrases for indexing and retrieval. This suggests that such phrase extraction is done dynamically from actual patent data rather than via a static dataset.
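As a rough illustration of how an information-gain style score could flag predictive phrases, the sketch below compares how often two phrases actually appear in the same document with how often they would be expected to if they were independent. This ratio of observed to expected co-occurrence is an assumption chosen for the example; it is not presented as the method in Google's patents, and the function name and sample phrase sets are hypothetical.

```python
from collections import Counter
from itertools import permutations


def information_gain(doc_phrases):
    """Score ordered phrase pairs by observed / expected co-occurrence.

    doc_phrases is a list with one collection of phrases per document.
    A score above 1 means the two phrases appear together more often
    than independence would predict.
    """
    n_docs = len(doc_phrases)
    doc_freq = Counter()  # number of documents containing each phrase
    co_freq = Counter()   # number of documents containing both phrases
    for phrases in doc_phrases:
        unique = set(phrases)
        doc_freq.update(unique)
        co_freq.update(permutations(sorted(unique), 2))

    gain = {}
    for (a, b), actual in co_freq.items():
        expected = doc_freq[a] * doc_freq[b] / n_docs
        gain[(a, b)] = actual / expected
    return gain


# Made-up phrase sets, one per document.
sample = [
    ["soccer ball", "inflatable bladder"],
    ["soccer ball", "inflatable bladder"],
    ["soccer ball", "circuit board"],
    ["circuit board", "solder mask"],
]

scores = information_gain(sample)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In this toy sample, "soccer ball" and "inflatable bladder" score above 1 while "soccer ball" and "circuit board" score below 1, which is the kind of signal that could be used to decide which phrases are worth indexing together.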
In summary, while Google has not released an official, pre-packaged dataset for patent search model training, full patent documents and their metadata are publicly accessible through Google Patents and related services. Users can generate their own phrase datasets by collecting patent text data, using text processing techniques to extract and analyze phrases, and then labeling the relationships between these phrases. This can help improve the accuracy of patent search models and make the patent search process more efficient.
AI-driven patent search models may benefit from datasets like the one Google developed, which can improve the search capabilities of patent databases. Because Google does not publicly provide a direct download, those interested can collect patent text from resources such as Google Patents, the Global Dossier, or PatentLens and build their own dataset for patent search model training.