Bash snippets to download data from PhoneArena and extract features.
curl -s "http://www.phonearena.com/phones/page/[1-208]" -o "page_#1.html"
sed -n 's|.*href="/phones/\([^"]\+\)".*|\1|p' page_*.html | sort -u > phones.txt
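The sed expression above is greedy, so it keeps at most one link per input line. A minimal sketch of an alternative that keeps every link on a line, assuming the listing pages use plain href="/phones/..." attributes; extract_slugs and the sample markup are invented for illustration and are not part of the original notes.

```shell
# Hypothetical helper: print every /phones/<slug> link target found on stdin,
# one per line, de-duplicated.
extract_slugs() {
  grep -o 'href="/phones/[^"][^"]*"' \
    | sed 's|^href="/phones/||; s|"$||' \
    | sort -u
}

# Invented sample input; real listing pages may look different.
sample='<a href="/phones/Acme-One_id1">One</a> <a href="/phones/Acme-Two_id2">Two</a>
<a href="/phones/Acme-One_id1">One again</a>'

printf '%s\n' "$sample" | extract_slugs
# prints:
# Acme-One_id1
# Acme-Two_id2
```

Run against the saved pages it would be: cat page_*.html | extract_slugs > phones.txt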
mkdir -p phones
for phone in $(cat phones.txt); do
  if ! [ -f "phones/${phone}.html" ]; then
    curl -s "http://www.phonearena.com/phones/$phone" -o "phones/${phone}.html"
  fi
done
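The skip-if-already-downloaded loop can also be wrapped in a function, which makes it reusable for other page types. This is only a sketch: fetch_missing and the FETCH override are hypothetical names, added here so the loop can be tried without network access.

```shell
# Hypothetical helper: download every slug listed in file $1 into directory $2,
# skipping pages that are already on disk.
# FETCH is an assumed override hook; by default the same "curl -s" call is used.
fetch_missing() {
  list=$1
  dir=$2
  mkdir -p "$dir"
  while IFS= read -r phone; do
    [ -n "$phone" ] || continue        # ignore blank lines
    out="$dir/$phone.html"
    [ -f "$out" ] && continue          # already downloaded
    ${FETCH:-curl -s} "http://www.phonearena.com/phones/$phone" -o "$out"
  done < "$list"
}
```

Usage would be: fetch_missing phones.txt phones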
Extract each phone's pros and cons lists to a pipe-separated file: the first field is the phone, the remaining fields are the sentences
for phone in `cat phones.txt`; do
echo -n $phone"|";
echo `sed -n '/<div class="proscons">/,/<\/div>/p' phones/${phone}.html | \
sed -n 's/.*<li>\([^<]*\)<.*/\1/p' | paste -sd "|" -`;
done > proscons.csv
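To see what one output row looks like, the same two sed passes can be run on a small hand-made page. The sample markup below is invented and only assumes that the list items sit one per line inside a div with class "proscons"; proscons_row is a hypothetical helper name.

```shell
# Hypothetical helper: print "<phone>|<sentence>|<sentence>..." for one saved page.
proscons_row() {
  phone=$1
  file=$2
  items=$(sed -n '/<div class="proscons">/,/<\/div>/p' "$file" \
    | sed -n 's/.*<li>\([^<]*\)<.*/\1/p' \
    | paste -sd '|' -)
  printf '%s|%s\n' "$phone" "$items"
}

# Invented sample page; the real markup may differ.
sample_page=$(mktemp)
cat > "$sample_page" <<'EOF'
<div class="proscons">
<ul>
<li>Great display</li>
<li>Short battery life</li>
</ul>
</div>
<ul><li>Unrelated item</li></ul>
EOF

proscons_row Acme-One "$sample_page"
# prints: Acme-One|Great display|Short battery life
```

The range address stops at the first closing div, so a page with a nested div inside the block would be cut short.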
There are only about 37 distinct sentences; list them trimmed and de-duplicated
cut -d'|' -f2- proscons.csv | tr "|" "\n" | sed -e 's/^[ \t]*//;s/[ \t]*$//' | sort -u > tagged.csv
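A small worked example of the split, trim and de-duplicate step, using invented rows. unique_sentences is a hypothetical helper; unlike the one-liner above it also drops empty lines, which is an added assumption about what the sentence list should contain.

```shell
# Hypothetical helper: read "<phone>|<sentence>|..." rows on stdin and print
# each distinct sentence once, with surrounding whitespace removed.
unique_sentences() {
  cut -d'|' -f2- \
    | tr '|' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
    | sort -u
}

# Invented sample rows; the third phone has no pros or cons.
printf '%s\n' \
  'Acme-One|Great display | Heavy' \
  'Acme-Two|Heavy|  Great display' \
  'Acme-Three|' \
  | unique_sentences
# prints:
# Great display
# Heavy
```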
Extract each phone's meta description to a pipe-separated file, with HTML entities stripped
for phone in `cat phones.txt`; do
echo -n $phone"|";
echo `sed -n 's| *<meta name="description" content="\([^"]*\)"/>|\1|p' phones/$phone.html | \
sed 's/[&#][^;]*;//g'`;
done > desc.csv
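The bracket expression [&#] in the entity filter matches either character, so a bare # followed later by a semicolon is deleted together with everything in between. A tighter variant is sketched below; strip_entities is a hypothetical name, and the pattern assumes entities look like &name; or &#1234;.

```shell
# Hypothetical helper: delete named (&amp;) and numeric (&#8211;) entities on stdin.
strip_entities() {
  sed 's/&[#A-Za-z0-9]\{1,\};//g'
}

# Invented sample description.
printf '%s\n' 'Ranked #1 among 5&quot; phones' | strip_entities
# prints: Ranked #1 among 5 phones
```

With the original expression the same input becomes "Ranked  phones", because the match starts at the # sign.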
Download the review pages for future reference; I did not have time to analyze them during this project
mkdir -p reviews
for phone in $(cat phones.txt); do
  if ! [ -f "reviews/${phone}.html" ]; then
    curl -s "http://www.phonearena.com/phones/$phone/reviews" -o "reviews/${phone}.html"
  fi
done