rss
logo

I provide consulting and custom development for Natural Language Processing, Information Extraction and Search solutions.Self Picture


 learn more   get in touch 

Logo - I Build Search
Jul 23
2009

Script to generate URS from Wikipedia digg

A person’s URS is a phrase that could be used instead of his/her usual name in all circumstances, which makes it absolutely clear who he/she is. A good URS for a person should meet the following criteria:

  • Everyone familiar with the person will confidently recognise him/her from the URS.
  • There is no possibility that the URS could also describe anyone other than the person.
  • Even someone who isn’t familiar with the person will have some understanding of who he/she is from the URS.
analyzer.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
#!/usr/bin/python
""" 
Script to generate URS from the starting paragraph of Wikipedia 
articles about persons.
 
by Pravin Paratey (pravinp -at- gmail.com)
 
Current Implementation:
----------------------
1. Extract first sentence
2. Clean wiki markup
3. Observing given data, and the data on wikipedia, shows that there 
   is a pattern that is followed while writing wikipedia entries for
   persons. Replacing (was/is)(an/a/the/) with (/the) does the trick
4. Output sentence formed
 
Ideally:
--------
Ideally, the piece of code should identify the following concepts:
1. Name of person
2. Time period
3. Son/Daughter/Father/Mother of (in case of famous personality)
4. Renowned for
 
How do we go about it?
1 and 2 - straight forward. Wikipedia gives cues through its markup
3 - straight forward. String matching using "son of", "daughter of", etc
4 - will need to match against a database.
 
For 3, we only keep the "son of", "daughter of", "X of Y" if Y is a prominent
person. An easy way of doing this is using incoming links on wikipedia OR
to search for X and Y individually on google and noting the number of results.
"""
 
import re, sys, codecs
 
def cleanUri(m):
    """ Cleans Uri wiki markup """
    word = m.group(1)
    if '|' in word: word = word.split('|')[1]
    return word.strip()
 
def dotRemove(m):
    """ Replaces . by # inside tags """
    return m.group(0).replace('.', '#')
 
def cleanMarkup(text):
    """ Removes
    1. wiki markup
    2. sanitize html entities 
    3. comments """
    #text = re.sub(r"\[\[[\w\s\-,]+\|(\w+)\]\]", r"\1", text)
    text = re.sub(r"\[\[(.*?)\]\]", cleanUri, text)
    text = re.sub(r"\{\{.*?\}\}", r"", text)
    text = re.sub(r"<ref>.*?<\/ref>", r"", text)
    text = re.sub(r"<!--.*?-->", r"", text)
    text = re.sub(r"\[.*?\]", r"", text)
    text = text.replace("'''", "").replace("''", "'")
    text = text.replace("[[", "").replace("]]", "")
    text = text.replace("&ndash;", "-").replace("&amp;", "&")
    return text
 
def getFirstSentence(text):
    """ Returns the text until first instance of '.'
    It also makes sure that the '.' isn't part of a wiki link
    or name"""
    tmp = re.sub(r"\[\[.*?\]\]", dotRemove, text)
    tmp = re.sub(r"\[.*?\]", dotRemove, tmp)
    tmp = re.sub(r"<ref>.*?<\/ref>", dotRemove, tmp)
    tmp = re.sub(r"<!--.*?-->", dotRemove, tmp)
    tmp = re.sub(r"'''.*?'''", dotRemove, tmp)
    tmp = re.sub(r"''.*?''", dotRemove, tmp)
    index = tmp.find('.')
 
    if index == -1: 
        return text
    else:
        return text[:index]
 
def makeArticle(m):
    """ Changes a, an to the when appropriate """
    retval = ', the'
    if len(m.group(2)) == 0:
        retval = ' '
    return retval
 
def extractURS(text):
    """ The function to call. Returns the URS """
    text = getFirstSentence(text)
    text = cleanMarkup(text)
    text = re.sub(r",?\s+(was|is)\s+(an|the|a|)", makeArticle, text)
    return text
 
if __name__ == '__main__':
    #fp = open(sys.argv[1])
    fp = codecs.open("input.txt", "r", "utf-8")
    fp2 = codecs.open("output.txt", "w", "utf-8")
    fp2.write(codecs.BOM_UTF8.decode("utf-8")), # Add BOM for UTF-8
    for line in fp:
        line = line.rstrip()
        if len(line) == 0 or line.startswith("#"): # For debugging
            continue
        urs = extractURS(line)
        fp2.write(urs + '\r\n')
    fp.close()
    fp2.close()

Example Inputs and Outputs

These are inputs from Wikipedia (Click on the article and then Edit). Ex Lala Lajpat Rai. The above script outputs the URS.

Example Input: ”’B. S. Johnson ”’ (Bryan Stanley Johnson) ([[5 February]],[[1933]] – [[13 November]],[[1973]]) was an English experimental novelist, poet, literary critic and film-maker.

Script Output: B. S. Johnson (Bryan Stanley Johnson) (5 February,1933 – 13 November,1973), the English experimental novelist, poet, literary critic and film-maker

How are URS used?

URS can be directly substituted in a sentence containing that persons’ name. (Hover over Bhagat Singh to see this URS.

ex. Bhagat Singh was executed by the British in 1931.

This way, a person who had no idea who Bhagat Singh was, now has more context about the person.

2 Responses (rss) (trackback)

#1

Shashi

July 23rd, 2009 at 11:18 am

What’s URS?

#2

Pravin Paratey

July 24th, 2009 at 12:27 am

Unique Recognition Strings – A phrase that will uniquely identify the person.

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">

Latest Articles

Apr
07

Palindromic sub-sequences in python

This bit of python code returns all palindromic subsequences in the input string.

[Read More]
Feb
19

Join a list of integers in Python

How do you run a string join on a list of integers in Python? After googling for about 10 mins, I gave up and did this. I am sure there is a better way of doing it!

[Read More]

Featured Projects

Document Tagger

Document Tagger

DocTagger lets you automatically classify text documents. Use this as a starting point to write apps that can sort through volumes of unorganized data.

[Read More]

Deebot

Deebot

Deeb0t is an IRC chat bot capable of making meaningful conversation with other users. It also responds to commands issued by its owner.

[Read More]

This page and its contents are copyright © 2010, Pravin Paratey.