Hello.
My name is CText. I am a C++ class. I can read a text
file – usually an HTML or XML file. I can extract a lot of useful information
from it. I can find a piece of text enclosed by given strings. I can look for
it in a particular line or in the entire file. I can parse tables. I can parse
lists. I have worked with various data sources, including Amazon, Christie's,
EPO, Factiva, and many more. I significantly speed up coding and
increase code reliability. Although I am probably not perfectly optimized, I am
doing my job well. I have to admit that I have been tested only with Visual C++
on Microsoft Windows, and I require an additional header file, anutil.h, to
function. But I still hope I can be useful. The following program
illustrates my capabilities. The program
- downloads the GDP data from the World Bank website,
- converts the GDP data and saves it to a CSV file,
- finds the first line of the table in the file and displays this line, and
- retrieves the list of the countries and displays it.
Download the source code or copy and paste it:
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <andatathresher.h>
using namespace std;
int main()
{
    cout << "Hello World!" << endl;
    // Download the World Bank GDP page to a local file
    download("http://data.worldbank.org/indicator/NY.GDP.MKTP.CD",
        "file.html");
    CText txt("file.html");
    // Parse the first HTML table and save it as CSV
    CCSV csv;
    csv.data = txt.parsetable(0);
    csv.savetofile("gdp.csv");
    // Show the first line that contains a <table tag
    int line = txt.findline("<table");
    cout << "The line containing the first \"table\" has the number "
        << line << " and its content is: " << endl << txt.line(line)
        << endl;
    // Collect the link text from every table row after "Country name"
    vector<string> ctrs = txt.selectiveharvest("Country name",
        "</table>", "<tr", "</tr>", "<a href", ">", "<");
    cout << "Here is the list of countries covered:" << endl;
    // i + 1 < size() avoids unsigned underflow when the list is empty
    for (size_t i = 0; i + 1 < ctrs.size(); i++)
        cout << ctrs[i] << "; ";
    if (ctrs.size()>0) cout << ctrs[ctrs.size() - 1] << endl;
    system("pause");
    return 0;
}
Class members
string content
A field with the content of the file.
vector<int> separators
A field that contains positions of new line characters in
the file.
CText::CText(const string &fname)
A class constructor. Loads a file and performs its preliminary
analysis. A constructor without parameters can be called as well and a file can
be loaded manually, but one has to be careful with functions that use lines.
bool CText::load(const string &filename)
This function loads a file with a given file name, stores
its content in the field content, and performs its preliminary analysis. Returns
true upon success and false otherwise.
void CText::update()
A method performing preliminary analysis of the file. It
updates the separators vector.
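As a rough illustration of what this preliminary analysis produces, here is a minimal stand-alone sketch. The function name newline_positions is hypothetical and not part of CText; it only mirrors the idea of recording every newline offset, followed by the text length as an end marker.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-alone helper (not part of CText): records the offset of
// every '\n' and then the length of the text as a final end marker.
std::vector<int> newline_positions(const std::string &content)
{
    std::vector<int> separators;
    for (std::size_t i = 0; i < content.length(); ++i)
        if (content[i] == '\n')
            separators.push_back(static_cast<int>(i));
    separators.push_back(static_cast<int>(content.length()));
    return separators;
}
```

For the text "ab\ncd\n" this gives the offsets 2, 5 and 6: line 0 ends at offset 2, line 1 ends at offset 5, and 6 marks the end of the text.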
int CText::findline(const string &text, int pos = 0)
This function returns the number of the first line that contains
the given string. It starts searching in the file from the offset pos. Returns
-1 if no line is found.
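The idea can be sketched without the class. The helper below is hypothetical (find_line is not a CText member) and, as a simplification, counts newline characters directly instead of consulting a precomputed separators vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the 0-based number of the first line that
// contains `text` at or after offset `pos`, or -1 if there is no such line.
int find_line(const std::string &content, const std::string &text,
              std::size_t pos = 0)
{
    std::size_t k = content.find(text, pos);
    if (k == std::string::npos) return -1;
    // The line number equals the number of '\n' characters before offset k.
    return static_cast<int>(std::count(content.begin(),
        content.begin() + static_cast<std::ptrdiff_t>(k), '\n'));
}
```

Counting newlines on every call is slower than a lookup in a stored vector, but it keeps the sketch short.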
int CText::findinline(string &res, int line, const string &before, const string &after)
This function extracts a string from the line with the line
number given by line. The string that lies directly after before and directly
before after is returned through the reference res. The function returns the
offset of the found string or -1 if it cannot be found.
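A minimal sketch of this kind of line-restricted extraction, assuming lines are separated by '\n' and numbered from 0. The free function find_in_line is hypothetical and walks over the newlines itself instead of using the separators field.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: extracts the text between `before` and `after`, but
// only if both markers lie within line number `line` (0-based).
// Returns the offset of the extracted text, or -1 if it cannot be found.
int find_in_line(const std::string &content, std::string &res, int line,
                 const std::string &before, const std::string &after)
{
    const std::size_t npos = std::string::npos;
    if (line < 0) return -1;
    // Locate the start of the requested line by skipping `line` newlines.
    std::size_t begin = 0;
    for (int i = 0; i < line; ++i) {
        std::size_t nl = content.find('\n', begin);
        if (nl == npos) return -1;   // the text has fewer lines than requested
        begin = nl + 1;
    }
    std::size_t end = content.find('\n', begin);
    if (end == npos) end = content.length();

    std::size_t k = content.find(before, begin);
    if (k == npos || k >= end) return -1;
    std::size_t start = k + before.length();
    std::size_t stop = content.find(after, start);
    // The closing marker must end before the line does.
    if (stop == npos || stop + after.length() > end) return -1;
    res = content.substr(start, stop - start);
    return static_cast<int>(start);
}
```

Called with line 1, before set to href=" and after set to a double quote, it would pull the first link target out of the second line of a page.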
int CText::findafter(string &res, const string &prefix, const string &before, const string &after, int pos = 0)
This function extracts a string from the file. It starts
searching from the offset pos. It looks for the first subsequent occurrence of
prefix and then for the first subsequent occurrence of before. The string that
lies directly after before and directly before after is returned through the
reference res. The function returns the offset of the found string or -1 if it
cannot be found.
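The three-step lookup (prefix, then before, then after) can be sketched as a free function on a plain std::string. The name find_after and the extra content parameter are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical free-function version of the lookup described above.
// Returns the offset of the extracted text, or -1 if any marker is missing.
int find_after(const std::string &content, std::string &res,
               const std::string &prefix, const std::string &before,
               const std::string &after, std::size_t pos = 0)
{
    const std::size_t npos = std::string::npos;
    std::size_t k1 = content.find(prefix, pos);
    if (k1 == npos) return -1;
    std::size_t k2 = content.find(before, k1 + prefix.length());
    if (k2 == npos) return -1;
    std::size_t start = k2 + before.length();
    std::size_t k3 = content.find(after, start);
    if (k3 == npos) return -1;
    res = content.substr(start, k3 - start);
    return static_cast<int>(start);
}
```

With the text "GDP: [123] and [456]", the prefix "and" skips the first bracketed number, so the markers "[" and "]" yield "456". Passing an empty prefix gives the behaviour described for findbetween and yields "123".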
int CText::findbetween(string &res, const string &before, const string &after, int pos = 0)
This function acts like findafter but with an empty string as
prefix.
int CText::geturl(string &res, const string &pattern)
This function looks for a line that contains the string pattern.
Then it searches for the first href HTML attribute it can find and returns its
content through the reference res. The function returns the offset of the found
string or -1 if it cannot be found.
string CText::line(int index)
This function returns the line from the file with the line
number given by index. Lines are numbered from 0.
vector<string> CText::harvest(const string &ldelimit, const string &udelimit, const string &prefix, const string &before, const string &after)
This function acts like selectiveharvest but with empty strings
as starter and stopper. That is, it operates on the entire file.
vector<string> CText::selectiveharvest(const string& starter, const string& stopper, const string &ldelimit, const string &udelimit, const string &prefix, const string &before, const string &after)
This function locates all strings that satisfy certain
criteria and returns them as a vector. It operates only on the fragment of the
file that lies after the first occurrence of starter and before the first
subsequent occurrence of stopper. Within this range, the function looks for
pieces of text enclosed by ldelimit and udelimit. Each such enclosure produces
one element of the resulting vector. Within each enclosure, the function
performs a procedure similar to findafter and adds the result as an element to
the resulting vector.
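The harvesting procedure can be sketched as a stand-alone function. The name selective_harvest and the content parameter are hypothetical, and the guard against an empty udelimit is an addition of this sketch, there only to guarantee that the loop advances.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical free-function sketch of the harvesting procedure.
std::vector<std::string> selective_harvest(const std::string &content,
    const std::string &starter, const std::string &stopper,
    const std::string &ldelimit, const std::string &udelimit,
    const std::string &prefix, const std::string &before,
    const std::string &after)
{
    const std::size_t npos = std::string::npos;
    std::vector<std::string> res;
    if (udelimit.empty()) return res;   // sketch-only guard: ensures progress
    // Restrict the search to the fragment between starter and stopper.
    std::size_t beginning = starter.empty() ? 0 : content.find(starter);
    if (beginning == npos) return res;
    std::size_t finish = stopper.empty() ? npos
                                         : content.find(stopper, beginning);
    if (finish == npos) finish = content.length();

    std::size_t start = content.find(ldelimit, beginning);
    while (start != npos && start < finish) {
        std::size_t stop = content.find(udelimit, start);
        if (stop == npos) break;
        // Inside the enclosure: prefix, then before, then text up to after.
        std::string item;
        std::size_t k1 = content.find(prefix, start);
        std::size_t k2 = (k1 == npos) ? npos
                         : content.find(before, k1 + prefix.length());
        if (k2 != npos) {
            std::size_t from = k2 + before.length();
            std::size_t to = content.find(after, from);
            // Keep the match only if it begins inside this enclosure.
            if (to != npos && from <= stop)
                item = content.substr(from, to - from);
        }
        res.push_back(item);   // one element per enclosure, possibly empty
        start = content.find(ldelimit, stop + udelimit.length());
    }
    return res;
}
```

With the arguments from the sample program ("Country name", "</table>", "<tr", "</tr>", "<a href", ">", "<"), each table row after the header contributes the text of its first link, and rows without a link contribute an empty string.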
int CText::occurencies(const string& text)
This function counts how many times the given string can be
found in the file. Overlapping occurrences are counted separately.
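A short sketch of such a counter, written as a hypothetical free function count_occurrences. It resumes the search one character after each hit, so overlapping matches are counted as well.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts occurrences of `text` in `content`,
// including overlapping ones. An empty `text` counts as zero occurrences.
int count_occurrences(const std::string &content, const std::string &text)
{
    if (text.empty()) return 0;
    int count = 0;
    for (std::size_t k = content.find(text); k != std::string::npos;
         k = content.find(text, k + 1))
        ++count;
    return count;
}
```

Counting "<tr" in a page is a quick way to estimate how many table rows it contains before parsing it.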
vector<TDataRec> CText::parsetable(int pos)
This function parses an HTML table and returns it as a
vector of vectors of strings. It looks for the first table after offset pos. It
will not work for tables inside tables. For the definition of the TDataRec
type, see the file anutil.h.
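The following is a simplified sketch of this kind of table parsing, not the class's own code. Row is an assumed stand-in for TDataRec (a vector of strings), strip_tags is a crude hypothetical replacement for whatever htmltostring in anutil.h does, and parse_table is a hypothetical free function. Like the original, it does not handle nested tables.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;   // assumed stand-in for TDataRec

// Crude stand-in for htmltostring: drops everything between '<' and '>'.
static std::string strip_tags(const std::string &html)
{
    std::string out;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') in_tag = true;
        else if (c == '>') in_tag = false;
        else if (!in_tag) out += c;
    }
    return out;
}

// Parses the first table found at or after offset pos into rows of cell texts.
std::vector<Row> parse_table(const std::string &content, std::size_t pos = 0)
{
    const std::size_t npos = std::string::npos;
    std::vector<Row> res;
    std::string up = content;   // upper-case copy: tag matching ignores case
    for (char &c : up)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    std::size_t t1 = up.find("<TABLE", pos);
    if (t1 == npos) return res;
    std::size_t t2 = up.find("</TABLE", t1);
    if (t2 == npos) return res;
    std::size_t r = up.find("<TR", t1);
    while (r != npos && r < t2) {
        std::size_t rend = up.find("</TR", r);
        if (rend == npos || rend > t2) break;   // unclosed row: stop parsing
        Row row;
        std::size_t c = r + 3;
        while (true) {
            // The next cell is the nearer of the next <TD and the next <TH.
            std::size_t cell = std::min(up.find("<TD", c), up.find("<TH", c));
            if (cell == npos || cell > rend) break;
            std::size_t open_end = up.find('>', cell);
            if (open_end == npos) break;
            const char *closing = (up[cell + 2] == 'D') ? "</TD" : "</TH";
            std::size_t close = up.find(closing, open_end);
            if (close == npos || close > rend) break;
            row.push_back(strip_tags(
                content.substr(open_end + 1, close - open_end - 1)));
            c = close;
        }
        res.push_back(row);
        r = up.find("<TR", rend);
    }
    return res;
}
```

Searching the upper-case copy while cutting cells out of the original text keeps the matching case-insensitive without changing the extracted data.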
Class code
Download the source code or copy and paste it:
#pragma once
#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include "anutil.h"
using namespace std;
/* INTERFACE */
class CText {
public:
    string content;
    vector<int> separators;
    CText() {};
    CText(const string &fname);
    void update();
    bool load(const string &filename);
    int findline(const string &text, int pos);
    int findinline(string &res, int line,
        const string &before, const string &after);
    int findafter(string &res, const string &prefix,
        const string &before, const string &after, int pos);
    int findbetween(string &res, const string &before,
        const string &after, int pos);
    int geturl(string &res, const string &pattern);
    vector<string> harvest(const string &ldelimit,
        const string &udelimit, const string &prefix,
        const string &before, const string &after);
    vector<string> selectiveharvest(const string &starter,
        const string &stopper, const string &ldelimit,
        const string &udelimit, const string &prefix,
        const string &before, const string &after);
    string line(int index);
    int occurencies(const string &text);
    vector<TDataRec> parsetable(int pos);
};
/* IMPLEMENTATION */
CText::CText(const string &fname)
{
    load(fname);
}
void CText::update()
{
    separators.clear();
    for (int i=0;i<content.length();i++)
        if (content[i]=='\n') separators.push_back(i);
    separators.push_back(content.length());
}
bool CText::load(const string &filename)
{
    ifstream infile;
    infile.open(filename,ios::binary);
    if (!infile.is_open())
    {
        content = "";
        separators.clear();
        return false;
    }
    stringstream cont;
    string line;
    // Test getline itself; looping on eof() appends a spurious empty line
    while (getline(infile,line))
        cont << line << '\n';
    content = cont.str();
    infile.close();
    update();
    return true;
}
int CText::findline(const string &text, int pos = 0)
{
    if (separators.size()==0) return -1;
    int k = content.find(text,pos);
    if (k<0) return -1;
    int i;
    for (i=separators.size()-1; i>=0; i--)
        if (separators[i]<k) break;
    return i+1;
}
int CText::findinline(string &res, int line,
    const string &before, const string &after)
{
    if ((line<0) || (line>=(int)separators.size())) return -1;
    // A line runs from just after the previous separator up to its own separator
    int begin = (line==0) ? 0 : separators[line-1]+1;
    int end = separators[line];
    int start = content.find(before,begin);
    if ((start<0) || (start>=end)) return -1;
    int pos = start + before.length();
    int stop = content.find(after,pos);
    if ((stop<0) || (stop>=end)) return -1;
    res = content.substr(pos,stop-pos);
    return pos;
}
int CText::findafter(string &res, const string &prefix,
    const string &before, const string &after, int pos = 0)
{
    int k1 = content.find(prefix,pos);
    if (k1<0) return -1;
    int k2 = content.find(before,k1+prefix.length());
    if (k2<0) return -1;
    int k3 = content.find(after,k2+before.length());
    if (k3<0) return -1;
    pos = k2 + before.length();
    res = content.substr(pos,k3-pos);
    return pos;
}
int CText::findbetween(string &res, const string &before,
    const string &after, int pos = 0)
{
    return findafter(res,"",before,after,pos);
}
int CText::geturl(string &res, const string &pattern)
{
    int line = findline(pattern);
    if (line<0) return -1;
    int pos = findinline(res,line,"href=\"","\"");
    return pos;
}
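string CText::line(int index)
{
    // Out-of-range line numbers (such as -1 from findline) yield an empty string
    if ((index<0) || (index>=(int)separators.size())) return "";
    int start;
    if (index==0) start = 0; else start = separators[index-1]+1;
    int stop = separators[index];
    return content.substr(start,stop-start);
}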
vector<string> CText::harvest(const string &ldelimit, const string &udelimit,
    const string &prefix, const string &before, const string &after)
{
    return selectiveharvest("","",ldelimit,udelimit,prefix,before,after);
}
vector<string> CText::selectiveharvest(const string& starter,
    const string& stopper, const string &ldelimit, const string &udelimit,
    const string &prefix, const string &before, const string &after)
{
    vector<string> res;
    int beginning;
    if (starter=="") beginning = 0; else beginning = content.find(starter);
    if (beginning<0) return res;
    int finish;
    if (stopper=="") finish = content.length();
    else finish = content.find(stopper,beginning);
    if (finish<0) finish = content.length();
    int start = content.find(ldelimit,beginning);
    if (start<0) return res;
    int stop = content.find(udelimit,start);
    while ((stop>=0) && (start<finish))
    {
        string item;
        int q = findafter(item,prefix,before,after,start);
        if (q>stop) item="";
        res.push_back(item);
        start = content.find(ldelimit,stop);
        if (start<0) return res;
        stop = content.find(udelimit,start);
    }
    return res;
}
int CText::occurencies(const string& text)
{
    if (text=="") return 0;
    int count = 0;
    int pos = 0;
    int k = content.find(text);
    while (k>=0)
    {
        count++;
        pos = k+1;
        k = content.find(text,pos);
    }
    return count;
}
vector<TDataRec> CText::parsetable(int pos)
{
    vector<TDataRec> res;
    string temp = content;
    // Upper-case copy so that tag matching is case-insensitive
    transform(temp.begin(),temp.end(),temp.begin(),::toupper);
    int p1 = temp.find("<TABLE",pos+1);
    if (p1<0) return res;
    int p2 = temp.find("</TABLE",p1+1);
    if (p2<0) return res;
    int p3 = temp.find("<TR",p1+1);
    while ((p3>=0) && (p3<p2))
    {
        int p4 = temp.find("</TR",p3+1);
        if (p4<0) break; // unclosed row: stop instead of searching again from offset 0
        TDataRec row;
        int p5 = temp.find("<T",p3+1);
        while ((p5>=0) && (p5<p4))
        {
            int p6 = temp.find(">",p5+1)+1;
            // Search from p6 (not p6+1) so that an empty cell ends at its own closing tag
            int p7 = temp.find("</",p6);
            if (p7<0) break;
            string cell = htmltostring(content.substr(p6,p7-p6));
            row.push_back(cell);
            p5 = temp.find("<T",p7+1);
        }
        res.push_back(row);
        p3 = temp.find("<TR",p4+1);
    }
    return res;
}
Let me know your experience!