Pravin Paratey

Natural Language Processing, Data mining and Information Extraction consultant based in London.

Jul 23 2009

Script to generate URS from Wikipedia

A person's URS is a phrase that can be used in place of his or her usual name in any circumstance and that makes it absolutely clear who he or she is. A good URS for a person should meet the following criteria:

analyzer.py
#!/usr/bin/python
"""
Script to generate URS from the starting paragraph of Wikipedia
articles about persons.

by Pravin Paratey (pravinp -at- gmail.com)

Current Implementation:
----------------------
1. Extract first sentence
2. Clean wiki markup
3. Observing given data, and the data on wikipedia, shows that there
   is a pattern that is followed while writing wikipedia entries for
   persons. Replacing (was/is)(an/a/the/) with (/the) does the trick
4. Output sentence formed

Ideally:
--------
Ideally, the piece of code should identify the following concepts:
1. Name of person
2. Time period
3. Son/Daughter/Father/Mother of (in case of famous personality)
4. Renowned for

How do we go about it?
1 and 2 - straight forward. Wikipedia gives cues through its markup
3 - straight forward. String matching using "son of", "daughter of", etc
4 - will need to match against a database.

For 3, we only keep the "son of", "daughter of", "X of Y" if Y is a prominent
person. An easy way of doing this is using incoming links on wikipedia OR
to search for X and Y individually on google and noting the number of results.
"""

import re, sys, codecs

def cleanUri(m):
    """ Cleans Uri wiki markup """
    word = m.group(1)
    # For [[target|label]] links, keep the visible label
    if '|' in word:
        word = word.split('|')[1]
    return word.strip()

def dotRemove(m):
    """ Replaces . by # inside tags """
    return m.group(0).replace('.', '#')

def cleanMarkup(text):
    """ Removes
    1. wiki markup
    2. sanitize html entities
    3. HTML comments """
    #text = re.sub(r"\[\[[\w\s\-,]+\|(\w+)\]\]", r"\1", text)
    text = re.sub(r"\[\[(.*?)\]\]", cleanUri, text)
    text = re.sub(r"\{\{.*?\}\}", r"", text)
    text = re.sub(r"<ref>.*?<\/ref>", r"", text)
    text = re.sub(r"<!--.*?-->", r"", text)
    text = re.sub(r"\[.*?\]", r"", text)
    text = text.replace("'''", "").replace("''", "'")
    text = text.replace("[[", "").replace("]]", "")
    text = text.replace("&ndash;", "-").replace("&amp;", "&")
    return text

def getFirstSentence(text):
    """ Returns the text until first instance of '.'
    It also makes sure that the '.' isn't part of a wiki link
    or name"""
    # Mask dots inside markup with '#' (same length), so the index found
    # in tmp is also valid in the original text
    tmp = re.sub(r"\[\[.*?\]\]", dotRemove, text)
    tmp = re.sub(r"\[.*?\]", dotRemove, tmp)
    tmp = re.sub(r"<ref>.*?<\/ref>", dotRemove, tmp)
    tmp = re.sub(r"<!--.*?-->", dotRemove, tmp)
    tmp = re.sub(r"'''.*?'''", dotRemove, tmp)
    tmp = re.sub(r"''.*?''", dotRemove, tmp)
    index = tmp.find('.')

    if index == -1:
        return text
    else:
        return text[:index]

def makeArticle(m):
    """ Changes a, an to the when appropriate """
    retval = ', the'
    if len(m.group(2)) == 0:
        retval = ' '
    return retval

def extractURS(text):
    """ The function to call. Returns the URS """
    text = getFirstSentence(text)
    text = cleanMarkup(text)
    text = re.sub(r",?\s+(was|is)\s+(an|the|a|)", makeArticle, text)
    return text

if __name__ == '__main__':
    #fp = open(sys.argv[1])
    fp = codecs.open("input.txt", "r", "utf-8")
    fp2 = codecs.open("output.txt", "w", "utf-8")
    fp2.write(codecs.BOM_UTF8.decode("utf-8"))  # Add BOM for UTF-8
    for line in fp:
        line = line.rstrip()
        if len(line) == 0 or line.startswith("#"):  # For debugging
            continue
        urs = extractURS(line)
        fp2.write(urs + '\r\n')
    fp.close()
    fp2.close()
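The "Ideally" notes in the docstring describe step 3 (spotting "son of", "daughter of" and similar cues, then keeping them only when the relative is prominent) without implementing it. A minimal sketch of that idea might look like the following. The names `find_relations`, `keep_prominent` and the `is_prominent` callback are hypothetical and not part of analyzer.py, the cue list is an assumption based on the four relations the docstring names, and the sample sentence is invented. The name pattern is deliberately naive (runs of capitalised words) and will miss names with initials.

```python
import re

# Hypothetical cue pattern: the docstring only names "son of" and
# "daughter of"; father/mother are added from its concept list.
RELATION_CUE = re.compile(
    r"\b(son|daughter|father|mother) of ([A-Z]\w*(?: [A-Z]\w*)*)"
)

def find_relations(sentence):
    """Return (relation, name) pairs found by plain string matching."""
    return RELATION_CUE.findall(sentence)

def keep_prominent(relations, is_prominent):
    """Keep only relations whose target passes the caller's prominence test.

    is_prominent is a caller-supplied function (an assumption); it could be
    backed by incoming-link counts or search result counts, as the
    docstring suggests.
    """
    return [(rel, name) for rel, name in relations if is_prominent(name)]

if __name__ == "__main__":
    # Invented sentence, for illustration only
    rels = find_relations("Jane Doe was the daughter of John Doe, a painter.")
    print(keep_prominent(rels, lambda name: name in {"John Doe"}))
```

Passing the prominence test in as a function keeps the string matching separate from whichever lookup (link counts, search hits) ends up being used.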

Example Inputs and Outputs

The inputs are the raw wiki markup of Wikipedia articles (open the article and click Edit), e.g. Lala Lajpat Rai. The script above outputs the URS.

Example Input: '''B. S. Johnson ''' (Bryan Stanley Johnson) ([[5 February]],[[1933]] - [[13 November]],[[1973]]) was an English experimental novelist, poet, literary critic and film-maker.

Script Output: B. S. Johnson (Bryan Stanley Johnson) (5 February,1933 - 13 November,1973), the English experimental novelist, poet, literary critic and film-maker
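The step that turns "was an English ..." into ", the English ..." in the output above can be tried in isolation. The sketch below is a variant of the script's pattern, not the script itself: the function name `to_appositive` is hypothetical, the added `\b` (so that "a"/"an" cannot match the start of a longer word such as "another") and `count=1` (rewrite only the first match) are assumptions about the intended behaviour, and it expects text whose markup has already been cleaned.

```python
import re

# Variant of the script's pattern with a word boundary after the article.
COPULA = re.compile(r",?\s+(?:was|is)\s+(?:(an|the|a)\b)?")

def to_appositive(sentence):
    """Rewrite the first '<name> was/is a ...' into '<name>, the ...'."""
    def repl(m):
        # With an article present, emit ", the"; otherwise just drop the verb
        return ", the" if m.group(1) else " "
    return COPULA.sub(repl, sentence, count=1)

if __name__ == "__main__":
    print(to_appositive("B. S. Johnson was an English experimental novelist"))
```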

How are URS used?

A URS can be substituted directly into a sentence in place of that person's name. (Hover over Bhagat Singh to see his URS.)

ex. Bhagat Singh was executed by the British in 1931.

This way, a reader who had no idea who Bhagat Singh was now has more context about him.
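The substitution itself can be sketched in a few lines. This is only an illustration under stated assumptions: `substitute_urs` is a hypothetical helper, not part of analyzer.py, and replacing only the first mention of the name is a design choice made here, not something the post specifies. The sample sentence is invented; the URS is the script output shown earlier.

```python
import re

def substitute_urs(sentence, name, urs):
    """Replace the first whole-word occurrence of name with urs."""
    pattern = r"\b" + re.escape(name) + r"\b"
    # A function replacement avoids backslash handling inside urs
    return re.sub(pattern, lambda m: urs, sentence, count=1)

if __name__ == "__main__":
    urs = ("B. S. Johnson (Bryan Stanley Johnson) "
           "(5 February,1933 - 13 November,1973), the English experimental "
           "novelist, poet, literary critic and film-maker")
    print(substitute_urs("I read a book by B. S. Johnson last week.",
                         "B. S. Johnson", urs))
```

Replacing only the first mention keeps later mentions short, since the reader already has the context by then.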
