I provide consulting and custom development for Natural Language Processing, Information Extraction and Search solutions.


I Build Search

Feb 19

Join a list of integers in Python

Today, I had to pretty print a list of integers for debugging. This does not work:

>>> t = [1, 2, 3, 4]
>>> ' '.join(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: sequence item 0: expected string, int found

So I came up with this:

>>> def concat(x, y): return str(x) + ' ' + str(y)
>>> reduce(concat, t)
'1 2 3 4'

I am sure there is a better way of doing this!
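One commonly used alternative is to convert each item to a string before joining. The sketch below assumes Python 3 (where reduce is no longer a builtin); join_ints is a hypothetical helper name, not something from the standard library.

```python
def join_ints(values, sep=' '):
    """Join any iterable of non-string items by converting each one with str()."""
    return sep.join(str(v) for v in values)


t = [1, 2, 3, 4]

# Generator-expression form, wrapped in a helper
joined = join_ints(t)

# Equivalent one-liner using map()
joined_with_map = ' '.join(map(str, t))
```

Both forms avoid building intermediate strings pair by pair, which is what the reduce-based version does.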

Jan 21

Writing a spider in 10 mins using Scrapy

I came across Scrapy a few days back and have grown to really love it. This tutorial will illustrate how you can write a simple spider using Scrapy to scrape data off Paul Smith. All this in 10 minutes.

Let's begin

  1. Download and install Scrapy and its dependencies.
  2. Once that is done, open a terminal and run python scrapy-ctl.py startproject paul_smith. This creates a Scrapy project.
  3. Navigate to ~/paul_smith/paul_smith/spiders and create the file paul_smith.py with the following contents:

    paul_smith.py
    
    from scrapy.spider import BaseSpider
     
    class PaulSmithSpider(BaseSpider):
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]
     
      def parse(self, response):
        open('paulsmith.html', 'wb').write(response.body)
     
    SPIDER = PaulSmithSpider()
  4. To run the spider, go to ~/paul_smith and type python scrapy-ctl.py crawl paulsmith.co.uk on the command line. This will fetch the page and save it to paulsmith.html.
  5. The next step is to parse the contents of the page. Open the page in your favourite editor and try to understand the pattern of the items we want to capture. You can see that each <div class="yui-u"> contains the required information. We are going to modify our code like so:

    paul_smith.py
    
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
     
    class PaulSmithSpider(BaseSpider):
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]
     
      def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="yui-u"]')
        for site in sites:
          print site.extract()
     
    SPIDER = PaulSmithSpider()

    You can read more on XPath Selectors here.

  6. Finally, looking at the HTML again, we can extract title, link, img-src & sale-price like so:

    paul_smith.py
    
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import random
     
    class PaulSmithSpider(BaseSpider):
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]
     
      def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="yui-u"]')
        random.shuffle(sites)
        for site in sites:
          title = site.select('a/strong[@class="thumbnail-text"]/text()').extract()
          hlink = site.select('a/@href').extract()
          price = site.select('a/strong[@class="sale"]/text()').extract()
          image = site.select('a/img/@src').extract()
     
          print title, hlink, image, price
     
    SPIDER = PaulSmithSpider()

    You can save this data to your datastore in whatever way you wish.

  7. The output of 3 random items scraped using the above code can be seen below.

Output

Shawl Collar Block Stripe Jumper
Sale: £ 74.00

Crew Neck Placement Stripe Jumper
Sale: £ 67.00

Tailored Fit, Organic Cotton Cravat Print Shirt
Sale: £ 74.00
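
If you want to experiment with the XPath expressions from step 6 without installing Scrapy, here is a rough sketch using only the Python 3 standard library. It is a stand-in, not Scrapy's selector API: ElementTree supports only a subset of XPath and needs well-formed markup, and the SAMPLE fragment (including its URLs) is made up to mirror the structure described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the category page markup
SAMPLE = """
<html><body>
  <div class="yui-u">
    <a href="/jumper-1.html">
      <img src="/img/jumper-1.jpg" />
      <strong class="thumbnail-text">Shawl Collar Block Stripe Jumper</strong>
      <strong class="sale">Sale: £ 74.00</strong>
    </a>
  </div>
  <div class="yui-u">
    <a href="/jumper-2.html">
      <img src="/img/jumper-2.jpg" />
      <strong class="thumbnail-text">Crew Neck Placement Stripe Jumper</strong>
      <strong class="sale">Sale: £ 67.00</strong>
    </a>
  </div>
</body></html>
"""


def extract_items(markup):
    """Return one dict per <div class="yui-u"> with title, hlink, image and price."""
    root = ET.fromstring(markup.strip())
    items = []
    for div in root.findall('.//div[@class="yui-u"]'):
        link = div.find('a')
        if link is None:
            continue  # skip blocks that do not wrap a product link
        img = link.find('img')
        items.append({
            'title': link.findtext('strong[@class="thumbnail-text"]', default='').strip(),
            'hlink': link.get('href'),
            'image': img.get('src') if img is not None else None,
            'price': link.findtext('strong[@class="sale"]', default='').strip(),
        })
    return items


items = extract_items(SAMPLE)
```

Returning a list of dicts, rather than printing inside the loop, makes it easier to hand the results to whatever datastore you choose.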

Sep 09

Using PHP and ImageMagick to resize images

Today, I had to write some code to generate thumbnails in PHP. The php-gd library wasn’t installed and I had to work with ImageMagick. Not the most elegant of solutions, but it works:

functions.php
define('PRAVIN_THUMBNAIL_DIR', '/home/firedev/public_html/wp-content/cache/thumbnails/');
function pravin_resize($img_path, $width, $height) {
    // Keep quote characters out of the geometry so the cached file name
    // matches the file that convert actually writes to disk
    $resolution = intval($width) . 'x' . intval($height);
    $output_path = PRAVIN_THUMBNAIL_DIR . md5($img_path) . "-$resolution.jpg";
    // If the file does not exist OR the thumbnail was generated more than
    // 5 mins (5 x 60 sec) ago, re-create the thumbnail
    if(!file_exists($output_path) || (time() - filemtime($output_path)) > (5 * 60)) {
        // escapeshellarg() protects against spaces and shell metacharacters in paths
        system('/usr/bin/convert -resize ' . escapeshellarg($resolution) . ' '
            . escapeshellarg($img_path) . ' ' . escapeshellarg($output_path));
    }
    return $output_path;
}
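
The caching rule in the function above (regenerate when the thumbnail is missing or older than five minutes) can be isolated and tested on its own. Here is a minimal Python 3 sketch of just that check; needs_refresh and its now parameter are hypothetical names introduced for illustration, and the resizing itself is left out.

```python
import os
import time

MAX_AGE_SECONDS = 5 * 60  # same 5-minute window as the PHP function


def needs_refresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if path is missing or was last modified more than max_age seconds ago."""
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) > max_age
```

Passing now explicitly makes the age check easy to exercise without waiting five minutes.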

Jul 23

Script to generate URS from Wikipedia

A person’s URS is a phrase that could be used instead of his/her usual name in all circumstances, which makes it absolutely clear who he/she is. A good URS for a person should meet the following criteria:

  • Everyone familiar with the person will confidently recognise him/her from the URS.
  • There is no possibility that the URS could also describe anyone other than the person.
  • Even someone who isn’t familiar with the person will have some understanding of who he/she is from the URS.
analyzer.py
#!/usr/bin/python
""" 
Script to generate URS from the starting paragraph of Wikipedia 
articles about persons.
 
by Pravin Paratey (pravinp -at- gmail.com)
 
Current Implementation:
----------------------
1. Extract first sentence
2. Clean wiki markup
3. Observing given data, and the data on wikipedia, shows that there 
   is a pattern that is followed while writing wikipedia entries for
   persons. Replacing (was/is)(an/a/the/) with (/the) does the trick
4. Output sentence formed
 
Ideally:
--------
Ideally, the piece of code should identify the following concepts:
1. Name of person
2. Time period
3. Son/Daughter/Father/Mother of (in case of famous personality)
4. Renowned for
 
How do we go about it?
1 and 2 - straight forward. Wikipedia gives cues through its markup
3 - straight forward. String matching using "son of", "daughter of", etc
4 - will need to match against a database.
 
For 3, we only keep the "son of", "daughter of", "X of Y" if Y is a prominent
person. An easy way of doing this is using incoming links on wikipedia OR
to search for X and Y individually on google and noting the number of results.
"""
 
import re, sys, codecs
 
def cleanUri(m):
    """ Cleans Uri wiki markup """
    word = m.group(1)
    if '|' in word: word = word.split('|')[1]
    return word.strip()
 
def dotRemove(m):
    """ Replaces . by # inside tags """
    return m.group(0).replace('.', '#')
 
def cleanMarkup(text):
    """ Removes
    1. wiki markup
    2. sanitize html entities 
    3. comments """
    #text = re.sub(r"\[\[[\w\s\-,]+\|(\w+)\]\]", r"\1", text)
    text = re.sub(r"\[\[(.*?)\]\]", cleanUri, text)
    text = re.sub(r"\{\{.*?\}\}", r"", text)
    text = re.sub(r"<ref>.*?<\/ref>", r"", text)
    text = re.sub(r"<!--.*?-->", r"", text)
    text = re.sub(r"\[.*?\]", r"", text)
    text = text.replace("'''", "").replace("''", "'")
    text = text.replace("[[", "").replace("]]", "")
    text = text.replace("&ndash;", "-").replace("&amp;", "&")
    return text
 
def getFirstSentence(text):
    """ Returns the text until first instance of '.'
    It also makes sure that the '.' isn't part of a wiki link
    or name"""
    tmp = re.sub(r"\[\[.*?\]\]", dotRemove, text)
    tmp = re.sub(r"\[.*?\]", dotRemove, tmp)
    tmp = re.sub(r"<ref>.*?<\/ref>", dotRemove, tmp)
    tmp = re.sub(r"<!--.*?-->", dotRemove, tmp)
    tmp = re.sub(r"'''.*?'''", dotRemove, tmp)
    tmp = re.sub(r"''.*?''", dotRemove, tmp)
    index = tmp.find('.')
 
    if index == -1: 
        return text
    else:
        return text[:index]
 
def makeArticle(m):
    """ Changes a, an to the when appropriate """
    retval = ', the'
    if len(m.group(2)) == 0:
        retval = ' '
    return retval
 
def extractURS(text):
    """ The function to call. Returns the URS """
    text = getFirstSentence(text)
    text = cleanMarkup(text)
    text = re.sub(r",?\s+(was|is)\s+(an|the|a|)", makeArticle, text)
    return text
 
if __name__ == '__main__':
    #fp = open(sys.argv[1])
    fp = codecs.open("input.txt", "r", "utf-8")
    fp2 = codecs.open("output.txt", "w", "utf-8")
    fp2.write(codecs.BOM_UTF8.decode("utf-8")) # Add BOM for UTF-8
    for line in fp:
        line = line.rstrip()
        if len(line) == 0 or line.startswith("#"): # For debugging
            continue
        urs = extractURS(line)
        fp2.write(urs + '\r\n')
    fp.close()
    fp2.close()

Example Inputs and Outputs

The inputs are the raw wiki markup of Wikipedia articles (open an article, e.g. Lala Lajpat Rai, and click Edit). The script above outputs the URS.

Example Input: '''B. S. Johnson ''' (Bryan Stanley Johnson) ([[5 February]],[[1933]] – [[13 November]],[[1973]]) was an English experimental novelist, poet, literary critic and film-maker.

Script Output: B. S. Johnson (Bryan Stanley Johnson) (5 February,1933 – 13 November,1973), the English experimental novelist, poet, literary critic and film-maker
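
To see step 3 of the script in isolation, here is a small Python 3 sketch that applies only the was/is substitution to an already cleaned sentence. The regular expression is copied from extractURS; make_article and apply_article_rule are names chosen for this illustration, and the input is the example above with the wiki markup removed by hand.

```python
import re


def make_article(m):
    """Mirror makeArticle: ', the' when an article followed was/is, else a single space."""
    return ', the' if m.group(2) else ' '


def apply_article_rule(sentence):
    """Apply only the (was|is)(an|the|a|) replacement from extractURS."""
    return re.sub(r",?\s+(was|is)\s+(an|the|a|)", make_article, sentence)


cleaned = ("B. S. Johnson (Bryan Stanley Johnson) (5 February,1933 – 13 November,1973) "
           "was an English experimental novelist, poet, literary critic and film-maker")
result = apply_article_rule(cleaned)
```

When no article follows the verb, the second group is empty and the verb is simply dropped, so "X was born in 1900" becomes "X born in 1900".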

How are URS used?

A URS can be directly substituted in a sentence containing that person's name. (On the original page, hovering over Bhagat Singh in the example shows his URS.)

e.g. Bhagat Singh was executed by the British in 1931.

This way, a person who had no idea who Bhagat Singh was, now has more context about the person.

Apr 26

Snippet to generate a random word in Python

Since I haven’t posted here in a while, I figured I’d whip up this example real quick. It illustrates the usage of the random module.

#!/usr/bin/python
import random
from time import time
 
def generateWord():
	char_array = 'abcdefghijklmnopqrstuvwxyz'
	random.seed(time())
	word = ''
	for i in range(0, 8): # 8 letter word
		word += char_array[random.randint(0, 25)]
	return word
 
if __name__ == '__main__':
	print generateWord()
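
For comparison, a Python 3 sketch of the same idea using random.choice and string.ascii_lowercase. The length and rng parameters are additions for illustration; the random module seeds itself on import, so no explicit seed call is needed.

```python
import random
import string


def generate_word(length=8, rng=random):
    """Return a word of `length` random lowercase ASCII letters.

    rng can be the random module or a random.Random instance; passing a
    seeded instance makes the output reproducible.
    """
    return ''.join(rng.choice(string.ascii_lowercase) for _ in range(length))


word = generate_word()
```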

Apr 16

Java Regex Matching

Took me half a day to figure out that matcher.find() has to be called before matcher.group(). Gaah!

RegexTest.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
 
public class RegexTest {
    public static void main(String args[]) 
    {
        Pattern pattern = Pattern.compile("name=\"QTime\">(\\d+)</int>");
        Matcher matcher = pattern.matcher("<response><lst name=\"responseHeader\">" + 
            "<int name=\"status\">0</int><int name=\"QTime\">2</int></lst></response>");
 
        // find() must run before group(); calling group() without a
        // successful match throws IllegalStateException
        if(matcher.find()) {
            System.out.println(matcher.group(1));
        }
        else {
            // find() returns false rather than throwing when nothing matches
            System.out.println("Couldn't find QTime");
        }
    }
}
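
For comparison, a sketch of the same extraction in Python 3. Here search() returns None when nothing matches, so there is no separate find step to forget. extract_qtime is a hypothetical helper name; the pattern and sample response are taken from the Java code above.

```python
import re

QTIME = re.compile(r'name="QTime">(\d+)</int>')

RESPONSE = ('<response><lst name="responseHeader">'
            '<int name="status">0</int><int name="QTime">2</int></lst></response>')


def extract_qtime(xml):
    """Return QTime as an int, or None if the response does not contain it."""
    m = QTIME.search(xml)
    return int(m.group(1)) if m else None


qtime = extract_qtime(RESPONSE)
```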

P.S. I love Python

Apr 15

Simple Webserver in Python

This snippet illustrates how one can easily build an HTTP web server in Python. self.args will contain the query parameters.

WebServer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Simple WebServer Illustration
# Pravin Paratey (April 15, 2009) [pravinp at gmail dot com]
 
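from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from urlparse import urlparse, parse_qs
 
class MyHandler(BaseHTTPRequestHandler):
    binaryExtensions = ['.gif', '.png', '.jpg']
    contentTypes = {
        '.css': 'text/css',
        '.gif': 'image/gif',
        '.jpg': 'image/jpeg',
        '.png': 'image/png',
        'html': 'text/html', # matches the last 4 characters of '.html'
    }
 
    def do_GET(self):
        """ Implementing the GET method """
        try:
            # Split off the query string so it is not treated as part of
            # the file name, and keep the parameters in self.args
            url = urlparse(self.path)
            self.args = parse_qs(url.query)
            self.path = url.path
            if self.path == '/': self.path = '/index.html'
            mode = 'r'
            if self.path[-4:] in self.binaryExtensions: mode = 'rb'
            fp = open(self.path[1:], mode)
            data = fp.read()
            fp.close()
            # Send response
            self.send_response(200)
            self.send_header('Content-Type', self.__getContentType())
            # The body is written in one piece, so announce its length
            # instead of claiming chunked transfer encoding
            self.send_header('Content-Length', str(len(data)))
            self.end_headers()
            self.wfile.write(data)
        except IOError:
            self.send_error(404, "File not found: %s" % self.path)
 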
    def __getContentType(self):
        """ Function to figure out content types """
        content_type = 'text/plain'
        extension = self.path[-4:]
        if extension in self.contentTypes:
            content_type = self.contentTypes[extension]            
        return content_type
 
if __name__ == '__main__':
    server = HTTPServer(('', 8000), MyHandler)
    server.serve_forever()
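
The post mentions that self.args holds the query parameters. One way to derive such a mapping from a request path is sketched below in Python 3, where the relevant functions live in urllib.parse. split_request_path is a hypothetical helper name, and the example paths are made up.

```python
from urllib.parse import urlparse, parse_qs


def split_request_path(raw_path):
    """Split a request path into (file path, query parameters).

    parse_qs maps each parameter name to a list of values, because a
    name may appear more than once in a query string.
    """
    url = urlparse(raw_path)
    return url.path, parse_qs(url.query)


path, args = split_request_path('/search.html?q=scrapy&page=2')
```

Inside a handler, the second element is what would be stored as self.args before the file is opened.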

Mar 29

XML DOM in Java

Of late, I have been working with Java, and one of the issues I faced was XML parsing. With so many libraries available, I decided to stick to JAXP. What follows is sample code that walks the DOM tree node by node:

TreeWalk.java
import java.io.File;
 
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
 
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
 
 
public class TreeWalk {
    public static void main(String args[]) 
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(false);
 
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            org.w3c.dom.Document doc = builder.parse(new File(args[0]));
            NodeList nodes1 = doc.getChildNodes();
            for(int i=0; i<nodes1.getLength(); i++) {
                treeWalk(nodes1.item(i), 0);
            }
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    }
 
    // Prints each non-text node name indented by depth, followed by its text
    private static void treeWalk(Node n, int level) 
    {
        if(n.getNodeType() != Node.TEXT_NODE) {
            for(int i=0; i<level; i++)
                System.out.print("  ");
            System.out.print(n.getNodeName() + ":");
        }
        else {
            System.out.println(n.getNodeValue().trim());
        }
        NodeList list = n.getChildNodes();
        for(int i=0; i<list.getLength(); i++) {
            treeWalk(list.item(i), level+1);
        }
    }
}
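
The same recursive walk can be sketched in Python 3 with xml.dom.minidom, which exposes a similar DOM interface. This version differs from the Java code in two ways that are my own choices: it collects the lines in a list instead of printing them, and it skips whitespace-only text nodes. The sample document is made up.

```python
from xml.dom import minidom, Node


def walk(node, level=0, lines=None):
    """Collect one indented line per element name and per non-empty text node."""
    if lines is None:
        lines = []
    if node.nodeType == Node.TEXT_NODE:
        text = node.nodeValue.strip()
        if text:  # ignore whitespace-only text nodes
            lines.append("  " * level + text)
    else:
        lines.append("  " * level + node.nodeName + ":")
    for child in node.childNodes:
        walk(child, level + 1, lines)
    return lines


doc = minidom.parseString("<response><int name='status'>0</int></response>")
lines = walk(doc.documentElement)
```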

Mar 28

Hacking wp-syntax plugin to show header

I was recently asked how I got the wp-syntax plugin to show a header like so:

test.cpp
int main() {
	return 0;
}

To show the test.cpp file name, I modified the wp-syntax.php file (present in /wp-content/plugins/wp-syntax/) like so:

Changed the regular expression in the wp_syntax_before_filter function from:

wp-syntax.php
function wp_syntax_before_filter($content)
{
    return preg_replace_callback(
        "/\s*<pre(?:lang=[\"']([\w-]*)[\"']|line=[\"'](\d*)[\"']|escaped=[\"'](true|false)?[\"']|\s)+>(.*)<\/pre>\s*/siU",
        "wp_syntax_substitute",
        $content
    );
}

to

wp-syntax.php
function wp_syntax_before_filter($content)
{
    return preg_replace_callback(
        "/\s*<pre(?:lang=[\"']([\w-]*)[\"']|line=[\"'](\d*)[\"']|escaped=[\"'](true|false)?[\"']|header=[\"']([\w-\. ]*)[\"']|\s)+>(.*)<\/pre>\s*/siU",
        "wp_syntax_substitute",
        $content
    );
}

And the wp_syntax_highlight function to:

function wp_syntax_highlight($match)
{
    global $wp_syntax_matches;
 
    $i = intval($match[1]);
    $match = $wp_syntax_matches[$i];
 
    $language = strtolower(trim($match[1]));
    $line = trim($match[2]);
    $escaped = trim($match[3]);
    $header = trim($match[4]); // added: the new header capture group
    $code = wp_syntax_code_trim($match[5]);
    if ($escaped == "true") $code = htmlspecialchars_decode($code);
 
    $geshi = new GeSHi($code, $language);
    $geshi->enable_keyword_links(false);
    do_action_ref_array('wp_syntax_init_geshi', array(&$geshi));
 
    $output = "\n<div class=\"wp_syntax\">";
 
    // added: emit the header above the highlighted code
    if($header) {
        $output .= "<div class=\"wp_syn_hdr\">" . $header . "</div>";
    }

Note the addition of the $header line and the if($header) block, both marked with comments (lines 104 and 114-116 of the modified file). Because header is now the fourth capture group, the code moves to $match[5].

All you have to do is add another attribute header="header-text" in your pre tag, e.g. <pre lang="php" line="1" header="wp-syntax.php">
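
To check how the modified expression captures the new attribute, here is a rough Python 3 translation. It is an approximation rather than the plugin's code: PHP's U (ungreedy) modifier is replaced by a lazy (.*?) for the code group, and parse_pre is a hypothetical helper name.

```python
import re

# Capture groups, in order: lang, line, escaped, header, code
PRE_TAG = re.compile(
    r"""\s*<pre(?:lang=["']([\w-]*)["']|line=["'](\d*)["']|escaped=["'](true|false)?["']|header=["']([\w\-. ]*)["']|\s)+>(.*?)</pre>\s*""",
    re.S | re.I,
)


def parse_pre(html):
    """Return the attributes and code of the first <pre> block, or None."""
    m = PRE_TAG.search(html)
    if m is None:
        return None
    lang, line, escaped, header, code = m.groups()
    return {'lang': lang, 'line': line, 'escaped': escaped,
            'header': header, 'code': code}


parsed = parse_pre('<pre lang="php" line="1" header="wp-syntax.php">echo 1;</pre>')
```

Attributes that are absent from the tag come back as None, which corresponds to the empty header check in wp_syntax_highlight.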

Jan 19

Exporting Opera email to mbox format

The following snippet combines the various Opera .mbs files into a single mbox file, which other email clients such as Evolution can then import.

#!/usr/bin/python
# Quick hack to merge all opera mbs files into mbox format which can
# then be used by other email clients to import Opera email.
# by Pravin Paratey (January 19, 2009)
 
import os
 
# Change this value
folder = '/home/pravin/.opera/mail/store/account1/'
 
fp = open('combined.mbox', 'a')
 
for d0 in os.listdir(folder):
	p0 = os.path.join(folder,d0)
	if os.path.isfile(p0): continue
	for d1 in os.listdir(p0):
		p1 = os.path.join(p0, d1)
		for d2 in os.listdir(p1):
			p2 = os.path.join(p1, d2)
			for f in os.listdir(p2):
				fp2 = open(os.path.join(p2, f), 'r')
				fp.write(fp2.read())
				fp2.close()
fp.close()
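
A variation on the same idea, sketched in Python 3 with os.walk so it does not depend on the store being exactly three directory levels deep. Two things here are my assumptions rather than part of the original script: only files ending in .mbs are merged, and files are read in binary mode so message bytes are copied unchanged. merge_mbs is a hypothetical helper name.

```python
import os


def merge_mbs(store_dir, output_path):
    """Append every .mbs file under store_dir to output_path and return the file count."""
    count = 0
    with open(output_path, 'ab') as out:
        for root, dirs, files in os.walk(store_dir):
            dirs.sort()  # visit subdirectories in a stable order
            for name in sorted(files):
                if not name.endswith('.mbs'):
                    continue  # assumption: messages are stored as *.mbs
                with open(os.path.join(root, name), 'rb') as src:
                    out.write(src.read())
                count += 1
    return count
```

Usage would look like merge_mbs('/home/pravin/.opera/mail/store/account1/', 'combined.mbox'), with the folder changed to match your own account.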


Featured Projects

Document Tagger

DocTagger lets you automatically classify text documents. Use this as a starting point to write apps that can sort through volumes of unorganized data.

Yahoo Messenger Client for *nix

Yux is an alternative Yahoo Messenger client for *nix systems that attempts to match the look and feel of the original Windows client.

This page and its contents are copyright © 2010, Pravin Paratey.