Commit cd414353 authored by Zofia Baranczuk

Added Readme

parent 99540327
A tool to download and process articles from an articles database (one database at the moment; others will follow soon).
SysReviewer downloads the available .pdf files from the search results.
The .pdf files are then converted to .txt.
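One way to do this conversion is pdfminer's bundled command-line tool pdf2txt.py (the repository may call pdfminer's API directly instead; the file names here are only illustrative):

> pdf2txt.py -o paper.txt paper.pdf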
The RAKE algorithm is applied to each .txt file to find the most frequent phrases. A table of phrases and their frequencies per paper is returned.
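For illustration, a minimal sketch of RAKE-style phrase scoring; this is a simplified stand-in, not the repository's exact implementation, and the stopword list is truncated for brevity:

import re
from collections import defaultdict

STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "with", "on"}

def candidate_phrases(text):
    # split the text into candidate phrases at stopwords and punctuation
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    # RAKE scores a word by degree (co-occurrence weight) over frequency,
    # and a phrase by the sum of its word scores
    freq, degree = defaultdict(int), defaultdict(int)
    phrases = candidate_phrases(text)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    return {" ".join(p): sum(degree[w] / float(freq[w]) for w in p)
            for p in set(phrases)}

Sorting the returned dict by value gives a ranked phrase table; the repository's output additionally reports the phrase frequencies themselves.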
All outputs are saved in a directory:
The .pdf files are saved in:
The .txt files are saved in:
The key phrases are saved as:
Getting started:
Download the SysReviewer repository from GitHub and extract it in your home folder. In the ./python directory, run an example from the 'Usage' section.
Requirements: Python 2.7, nltk, csv, pdfminer, bs4, requests, urllib2
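Of these, csv and urllib2 ship with Python 2.7; the rest can be installed with pip, and nltk additionally needs its wordnet data. A possible setup, assuming pip is available:

> pip install nltk pdfminer bs4 requests
> python -m nltk.downloader wordnet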
Usage:
> python kwashiorkor
will download the papers that have 'kwashiorkor' in any field. The output will be saved in:
The .pdf files are saved in:
The .txt files are saved in:
The key phrases are saved as:
> python "Malawi AND HIV AND DHS"
will download the papers that have Malawi, HIV, and DHS in any of the fields.
The .pdf files are saved in:
The .txt files are saved in:
The key phrases are saved as:
At the moment, only searching in any field is available. Year, author, and keyword search, as well as other paper databases, will be added soon. Planned:
- more databases of articles
- search options as powerful as those in the database itself
- a report on downloaded and parsed papers
- cleaner keyphrases
- parameterisation of the keyphrases algorithm
Thanks for the useful manuals to:
to add
@@ -24,8 +24,6 @@ def process_page(my_url, output_dir):
    # parse the search-results page and pull the link out of each paper title
    soup_all = BeautifulSoup(r.content, "html.parser")
    titles = soup_all.find_all("h4", {"class": "paper-list-title"})
    for title in titles:
        new_t = title.find("a").get("href")
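The hunk stops at the scraped link. A hypothetical sketch of a download step built on it (base_url, the function name, and the use of requests for the fetch are assumptions, not the commit's actual code):

import os
import requests

def download_pdf(new_t, base_url, output_dir):
    # resolve a relative href against the site root and save the response body
    pdf_url = new_t if new_t.startswith("http") else base_url + new_t
    resp = requests.get(pdf_url)
    out_path = os.path.join(output_dir, os.path.basename(pdf_url))
    with open(out_path, "wb") as fh:
        fh.write(resp.content)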
@@ -70,6 +68,4 @@ def parse(query, pdf_dir):
#query = sys.argv[1] # "legionella"
@@ -8,17 +8,6 @@ import operator
import glob
import nltk
import itertools
from sklearn.cluster import DBSCAN
#from textblob import TextBlob as tb
from nltk.stem import WordNetLemmatizer
from nltk.stem import LancasterStemmer
#w = sys.argv[1]
lancaster_stemmer = LancasterStemmer()
#a = lancaster_stemmer.stem(w)
from nltk.corpus import wordnet as wn
wordnet_lemmatizer = WordNetLemmatizer()
def remove_non_ascii(text):
    # replace the UTF-8 bytes (octal \357\254\200) of the "ff" ligature U+FB00 with plain "ff"
    text = re.sub(r'[\357\254\200]+', 'ff', text)
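The octal bytes \357\254\200 are the UTF-8 encoding of the "ff" ligature (U+FB00) that pdfminer often emits for PDF fonts. A hypothetical extension of the same idea to the other common ligatures, so that RAKE does not see "eﬀect" and "effect" as different words:

# UTF-8 byte sequences (octal) of common ligatures, mapped to plain ASCII
LIGATURES = {
    '\357\254\200': 'ff',  # U+FB00
    '\357\254\201': 'fi',  # U+FB01
    '\357\254\202': 'fl',  # U+FB02
}

def normalize_ligatures(text):
    for seq, repl in LIGATURES.items():
        text = text.replace(seq, repl)
    return text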