- Title: Reddit Crawler
- Author: Guillem Nicolau Alomar Sitjes
- Initial release: September 1st, 2017
- Code version: 0.1
- Availability: Public
Index
- Requirements
- Running the server
- Executing the client application
- Application Architecture
- Files and Folders

- Requirements
- Python 2.7 (not tested on Python 3; at the very least, the print statements would need to be adapted)
- lxml: pip install lxml
- Web.py: pip install web.py
- PRAW: pip install praw==3.6.0
- Unidecode: pip install unidecode

This project is all about obtaining data from Reddit, processing it, and showing it to the user. It follows a client-server architecture, featuring a REST API server backed by a SQLite database.
- I recommend creating a virtualenv for this project. After creating it, you should run:
~/RedditCrawler$ pip install -r requirements.txt
This installs all of the required pip packages.
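Based on the dependency list above, requirements.txt should contain roughly the following (only the PRAW version is pinned in this README, so the exact pins in the real file may differ):

    lxml
    web.py
    praw==3.6.0
    unidecode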
- Running the server
First of all, the user should start the server. This is done as follows:
~/RedditCrawler$ python src/rest_api/rest_api.py
If this has worked correctly, the output should be:
http://localhost:8080/
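That URL is printed by web.py itself when the application starts. As a minimal sketch of what a web.py service looks like (the real routes and handlers live in src/rest_api/rest_api.py and are not reproduced here):

    import web

    # Hypothetical route table: maps the root URL to the Index handler class.
    urls = ('/', 'Index')

    class Index:
        def GET(self):
            # A real handler would query the SQLite database and return the results.
            return 'Reddit Crawler REST API'

    if __name__ == '__main__':
        # web.py starts an HTTP server on port 8080 by default and prints its URL.
        app = web.application(urls, globals())
        app.run()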
- Executing the client application
Now that the server is running, we can execute the application. This is done by typing:
~/RedditCrawler$ python src/reddit_crawler.py
Some tests have been added to the 'tests' folder. To run them, simply type the following from the main project folder:
nosetests tests
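nose collects any file and function whose name starts with test_, so a test in that folder can be as simple as this sketch (the file and function names are made up for illustration; they are not the project's actual tests):

    # tests/test_example.py (hypothetical)
    def test_example():
        # A real test would call the crawler or the REST API and assert on the result.
        assert True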
To do this project I followed some basic instructions, but the specific components and architecture were left for me to choose. This wasn't the first time I implemented a RESTful API (see https://github.com/guillemnicolau/SongsPlatform), nor the first time I used a SQL database.
- Database
When choosing the database I decided on SQLite3, after weighing it against MongoDB (those two seemed to be the most frequently mentioned options for small-scale projects). After reading this StackOverflow comparison (https://stackoverflow.com/questions/9702643/mysql-vs-mongodb-1000-reads), I considered that in this case inserts would only happen once in a while, maybe once every 6 hours (subreddit top pages don't change very often, unlike Twitter, so that periodicity is enough to keep the database reasonably up to date). Given that, and seeing that SQLite is faster than MongoDB for selects as long as the number of tables stays below roughly 100, I decided to use SQLite3.
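As a rough illustration of that access pattern with Python's built-in sqlite3 module (the file and table names below are hypothetical; the project's real schema lives in the server code):

    import sqlite3

    conn = sqlite3.connect('reddit.db')  # hypothetical database file
    conn.execute('CREATE TABLE IF NOT EXISTS posts (title TEXT, score INTEGER, subreddit TEXT)')

    # Writes are rare: roughly one batch every ~6 hours, when a subreddit's top page is refreshed.
    conn.execute('INSERT INTO posts VALUES (?, ?, ?)', ('Example title', 1234, 'python'))
    conn.commit()

    # Reads can happen on every client request.
    for row in conn.execute('SELECT title, score FROM posts ORDER BY score DESC'):
        print row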
- Reddit library
As for the library used to retrieve data from Reddit, I chose PRAW. This was quite an easy decision: there are not many alternatives, and after doing some research I knew it had all the functionality I needed. Another possibility was to scrape and filter the data from the webpage myself, but after checking how it was structured, and knowing that I have close to no experience with web crawlers, it was clear to me that using PRAW was the best decision.
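For example, with PRAW 3.6 the hot submissions of a subreddit can be fetched in a few lines (the user agent and subreddit below are placeholders, not the values used by the project):

    import praw

    # PRAW 3.x only needs a descriptive user agent for read-only access.
    r = praw.Reddit(user_agent='RedditCrawler example by /u/yourusername')

    # Fetch the current hot submissions of a subreddit.
    for submission in r.get_subreddit('python').get_hot(limit=10):
        print submission.score, submission.title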
- Application Architecture
Another important decision was the location of the database. At first the architecture was quite different from the final one: the database was located independently of both the server and the client, and it could be accessed by both the server and the processing module. Although this would probably have led to better performance, it went against the philosophy of a RESTful API, so in the end I decided that the database should only be accessible through the server, as that is the logical way to design it. The crawler module was also located in the client at the beginning, but it makes more sense for it to live in the server, as that way it can connect directly to the subreddit webpage.
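In practice this means the client never touches the SQLite file directly; it only talks HTTP to the server, along these lines (the endpoint path is hypothetical; the real routes are defined in rest_api.py):

    import json
    import urllib2

    # Hypothetical endpoint; the actual routes are defined by the server.
    response = urllib2.urlopen('http://localhost:8080/subreddits/python/top')
    data = json.loads(response.read())
    print data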