Please cite Wäschle and Riezler (2012b) if you use the corpus in your work, or use the data citation specified in the heiDATA entry.

The corpus is sorted by language pairs and by text sections of a patent document, namely title, abstract, claims and description.

For a detailed description of the corpus construction process, please see the publications.

PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. Since there are no multilingual descriptions, data from the description section were collected by exploiting patent families to align German and French documents from the European Patent Office (EPO) corpus to English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama and Isahara (2007). Sentence boundaries were detected using the Europarl processing tools, and all sections were sentence-aligned using the Gargantua aligner.

To prevent overlap, make sure the family ids of the test and training sets are disjoint. Furthermore, about 7% of the description data are duplicates.
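The two caveats about family ids and duplicates can be handled in a small preprocessing step. The following is a minimal sketch only, not part of the corpus tooling: it assumes each sentence pair has already been loaded into memory together with its patent family id. The `SentencePair` record and the functions `deduplicate` and `split_by_family` are hypothetical names introduced for illustration, and the actual file layout of the release is not assumed here.

```python
import hashlib
from typing import Iterable, List, NamedTuple, Tuple


class SentencePair(NamedTuple):
    """Hypothetical in-memory record; the field names are assumptions."""
    family_id: str  # patent family id of the originating document(s)
    source: str     # source-language sentence
    target: str     # target-language sentence


def deduplicate(pairs: Iterable[SentencePair]) -> List[SentencePair]:
    """Keep the first occurrence of each exact (source, target) text pair."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair.source, pair.target)
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def split_by_family(
    pairs: Iterable[SentencePair], test_fraction: float = 0.1
) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Assign whole patent families to either the training or the test set.

    The family id is hashed to a stable bucket in [0, 1), so every pair
    from the same family ends up on the same side of the split.
    """
    train, test = [], []
    for pair in pairs:
        digest = hashlib.sha1(pair.family_id.encode("utf-8")).hexdigest()
        bucket = (int(digest, 16) % 1000) / 1000.0
        if bucket < test_fraction:
            test.append(pair)
        else:
            train.append(pair)
    return train, test
```

Deduplicating before splitting avoids identical sentence pairs appearing on both sides of the split. Because the assignment depends only on the hash of the family id, the resulting test fraction is approximate rather than exact.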

The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections, as well as 5 million German-French sentence pairs from patent titles, abstracts and claims. The numbers for de-en differ slightly from those reported in Wäschle and Riezler (2012b) due to some additional processing steps that were performed before the release.

Metadata, such as inventor or company, can be found in the original patent indicated by the document id. For description data, where the bitext has been collected from two separate documents, metadata is given for both original patents.

PatTR is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

