19. April 2012 · 2 comments · Categories: Uncategorized · Tags: ,

Both R and Python let you write a script that prompts the user for input. In Python 2.6, the main function for this task is raw_input() (in Python 3.0, it’s input()). In R, several functions can read input from the user, including readline() and scan(), with cat() commonly used to print the prompt. However, I find readline() to be the best fit for this task. In the following blocks of code, I show how to perform the same task using both R and Python.

## R CODE:

num = 0

while(num < 3 ){
  num = num + 1
  name <- readline("Hey dude, what's your name = ")
  if( name == "quit" ) { break }
  if( name == "Bieber" ) {
     cat( "Welcome ", name )
     break }
}

## PYTHON CODE:

num = 0

while num < 3:
    num = num + 1
    name = raw_input("Hey dude, what's your name = ")
    if name == "Bieber":
        print "Welcome ", name
        break
    if name == "quit":
        break
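
For readers on Python 3, the same loop can be sketched as follows. This is only an illustration, not a replacement for the code above: the function name greet and its read and max_tries parameters are my own, and the reader is passed in as an argument so the loop can be exercised with scripted answers instead of a live prompt.

```python
def greet(read=input, max_tries=3):
    """Ask for a name up to max_tries times.

    Returns the welcome message if the name is "Bieber",
    or None if the user types "quit" or runs out of tries.
    `read` defaults to the built-in input(), the Python 3
    counterpart of raw_input().
    """
    for _ in range(max_tries):
        name = read("Hey dude, what's your name = ")
        if name == "quit":
            return None
        if name == "Bieber":
            return "Welcome " + name
    return None


# Scripted answers stand in for keyboard input in this demonstration.
answers = iter(["Drake", "Bieber"])
result = greet(read=lambda prompt: next(answers))
print(result)
```

Calling greet() with no arguments prompts at the console, just like the Python 2.6 version.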
09. February 2012 · 2 comments · Categories: Uncategorized · Tags: ,

I recently began using the twitteR package in R to examine my tweeting patterns. One of my first projects was to identify each of my Twitter followers, where they were located, and how many tweets they had, and then plot their locations on a map using bubbles sized by their total number of tweets. Unfortunately, I found that I was unable to plot the data onto a spatial map because I did not have the coordinates for each of my followers. While I wasn’t able to successfully complete my project, I am posting my code for acquiring the data using the twitteR package.

library(twitteR)
me <- getUser("username")

follow = me$getFollowers()
df <- do.call("rbind", lapply(follow, as.data.frame))
str(df)
head(df, 3)

user = as.character(df$screenName)
name = as.character(df$name)
location = as.character(df$location)
followers = as.character(df$followersCount)
created = as.character(df$created)

mydf = data.frame(user=c(user), name=c(name), location=c(location),
                 followers=c(followers), created=c(created))
mydf

I also attempted to look at how often I favorite a tweet that contains an #rstats hashtag. I was able to identify the tweets with #rstats, when they were created, and whether I had marked each tweet as a favorite. After going through and marking various #rstats tweets as favorites, I ran the following code. Unfortunately, I found that every R-related tweet was returned as FALSE for whether I had marked it as a favorite. In any case, here is my R code for this task.

tweets <- searchTwitter('#rstats', n=300)
tweets

df <- do.call("rbind", lapply(tweets, as.data.frame))
str(df)

ndf <- data.frame(text=c(as.character(df$text)), created=c(df$created),
                    favorited=c(df$favorited))
head(ndf)

table(ndf$favorited)

This was my first real attempt at using the twitteR package, and I hope to dive further into this package over the next couple of weeks. I will work on some new projects and will post code which successfully completes a particular task.

I’ve recently been scouring the internet for a public opinion data set pertaining to job satisfaction. I was particularly interested in examining how gender, age, and socio-economic status influence how satisfied an individual is with their current employment situation. For example, existing research suggests that women and private-sector employees tend to have higher levels of job satisfaction. While I did not find a satisfactory data set on job satisfaction, I did find some intriguing information on regional variation in job satisfaction in England. I will admit that I know very little about England and cannot propose any plausible explanations for this variation. In any case, I’ve generated a Cleveland dot plot using ggplot2 and have included the image and my R code below. It would be intriguing to map this data onto a spatial map of England and I plan to work on that soon.

df = data.frame(Region=c("Southwest","West Midlands","Southeast",
        "Yorkshire","East Midlands","East England","London",
        "Northwest","Northeast"), Satisfaction=c(49,45,41,40,39,37,33,30,22),
        Num=c(1:9))

library(ggplot2)

# opts() and theme_text() have been removed from ggplot2;
# ggtitle(), theme() and element_text() are their replacements.
ggplot(df, aes(x=Satisfaction, y=reorder(Region,Num))) +
      geom_point(colour="red", size=2.5) +
      ggtitle("Job Satisfaction in England") +
      xlim(10,70) + xlab("") + ylab("") +
      theme(plot.title = element_text(face = "bold", size=12),
            axis.text.y = element_text(family = "sans",
                  face = "bold", size = 8),
            axis.text.x = element_text(family = "sans",
                  face = "bold", size = 8))

A/B testing is a method for comparing the effectiveness of different variations of a web page. For example, an online clothing retailer that specializes in men’s streetwear may want to examine whether a black or pink background results in more purchases from visitors to the site. Let’s say that our online store is just a single web page, and we run this experiment by randomly showing the variation (pink background) to half the visitors and the control (black background) to the other half. After running the experiment for one week, we find that the pink background resulted in a 40% purchase rate with 500 visitors while the black background resulted in a 30% purchase rate with 550 visitors. So which background is more effective at generating purchases from visitors to the online store? One way to examine this problem is by calculating confidence intervals of the conversion rates for each variation of the site. In the following R code, I construct a function which calculates the confidence interval for the purchase rate of each site at an 80% confidence level (z = 1.28), using the normal approximation rate ± z * sqrt(rate * (1 - rate) / visitors). In this example, the intervals (roughly 37.2% to 42.8% for pink and 27.5% to 32.5% for black) do not overlap, so the purchase rate for the pink background is significantly higher than the purchase rate for the black background.

site1 = c(.40, 500) # pink
site2 = c(.30, 550) # black

abtestfunc <- function(ad1, ad2){
      sterror1 = sqrt( ad1[1] * (1-ad1[1]) / ad1[2] )
      sterror2 = sqrt( ad2[1] * (1-ad2[1]) / ad2[2] )
      minmax1 = c((ad1[1] - 1.28*sterror1) * 100,
                            (ad1[1] + 1.28*sterror1) * 100)
      minmax2 = c((ad2[1] - 1.28*sterror2) * 100,
                            (ad2[1] + 1.28*sterror2) * 100)
      print( round(minmax1,2) )
      print( round(minmax2,2) )
}

abtestfunc(site1, site2)
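
The same calculation can be sketched in Python. This is a minimal illustration under the same assumptions as the R function (normal approximation, z = 1.28 for an 80% interval). The names conversion_interval and intervals_overlap are mine and are not part of any library.

```python
import math


def conversion_interval(rate, visitors, z=1.28):
    """Return the (lower, upper) confidence bounds for a conversion rate,
    expressed in percent and rounded to two decimals.

    Uses the normal approximation: rate +/- z * sqrt(rate * (1 - rate) / n).
    """
    st_error = math.sqrt(rate * (1 - rate) / visitors)
    lower = (rate - z * st_error) * 100
    upper = (rate + z * st_error) * 100
    return round(lower, 2), round(upper, 2)


pink = conversion_interval(0.40, 500)   # variation
black = conversion_interval(0.30, 550)  # control

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = pink[0] <= black[1] and black[0] <= pink[1]

print(pink, black, intervals_overlap)
```

Checking for overlap is a quick screen rather than a formal test. A two-sample test of proportions would be the more rigorous follow-up.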

In an earlier post, I talked about search engine marketing and showed how to programmatically generate keywords in R. In this post, I will provide a brief example of how to generate PPC keywords in Python. The itertools library in Python has a function, product(), which returns every combination of one element from each of several lists (the Cartesian product). In this example, I have produced three lists which contain the words that will compose my keywords. I construct two user-defined functions to generate the keywords. For producing a large number of keywords, this strategy would not be ideal because the user would have to write a new function for each number of lists. Instead, it may be wise to simply construct a class which contains these two functions as methods. I am far from being an expert Python user, so if you have suggestions for improving my code, please leave them in the comments section below.

import itertools

roots = ["car insurance", "auto insurance"]
prefix = ["cheap", "budget"]
suffix = ["quote", "quotes"]

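def func(one, two, three):
    # Every prefix/root/suffix combination, one keyword per line
    for x, y, z in itertools.product(one, two, three):
        print(x, y, z)  # print() call so the script also runs on Python 3

def func2(one, two):
    # Every root/suffix combination, one keyword per line
    for x, y in itertools.product(one, two):
        print(x, y)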

func(prefix, roots, suffix)
func2(roots, suffix)
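
One way to avoid writing a separate function for each number of lists is a single function that accepts any number of them. The following Python 3 sketch is only a suggestion. The name build_keywords is hypothetical, and the function returns the keywords as a list rather than printing them so they can be reused or written to a file.

```python
from itertools import product


def build_keywords(*word_lists):
    """Join one word from each list, in order, for every combination."""
    return [" ".join(words) for words in product(*word_lists)]


roots = ["car insurance", "auto insurance"]
prefix = ["cheap", "budget"]
suffix = ["quote", "quotes"]

three_part = build_keywords(prefix, roots, suffix)  # 2 * 2 * 2 = 8 keywords
two_part = build_keywords(roots, suffix)            # 2 * 2 = 4 keywords

for keyword in three_part + two_part:
    print(keyword)
```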
10. November 2011 · 11 comments · Categories: Uncategorized · Tags: , ,

The subset function is available in base R and can be used to return subsets of a vector, matrix, or data frame which meet a particular condition. In my three years of using R, I have repeatedly used the subset() function and believe that it is the most useful tool for selecting elements of a data structure. I assume that many of you are familiar with this function, so I will simply conclude this post by providing some brief examples of the subset function.

numvec = c(2,5,8,9,0,6,7,8,4,5,7,11)
charvec = c("David","James","Sara","Tim","Pierre",
        "Janice","Sara","Priya","Keith","Mark",
        "Apple","Sara")
gender = c("M","M","F","M","M","M","F","F","F","M","M","F")
state = c("CO","KS","CA","IA","MO","FL","CA","CO","FL","CA","WY","AZ")

subset(numvec, numvec > 7)
subset(numvec, numvec < 9 & numvec > 4)
subset(numvec, numvec < 3 |numvec > 9)

df = data.frame(var1=c(numvec), var2=c(charvec),
          gender=c(gender), state=c(state))

subset(df, var1 < 5)
subset(df, var2 == "Sara")
subset(df, var1==5, select=c(var2, state))
subset(df, var2 != "Sara" & gender == "F" & var1 > 5)
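
For comparison, here is a rough Python sketch of a few of the same selections using list comprehensions. It is not a full equivalent of subset(). The rows list of dictionaries is just a stand-in for the data frame, and the variable names are my own.

```python
numvec = [2, 5, 8, 9, 0, 6, 7, 8, 4, 5, 7, 11]
charvec = ["David", "James", "Sara", "Tim", "Pierre",
           "Janice", "Sara", "Priya", "Keith", "Mark",
           "Apple", "Sara"]
gender = ["M", "M", "F", "M", "M", "M", "F", "F", "F", "M", "M", "F"]
state = ["CO", "KS", "CA", "IA", "MO", "FL", "CA", "CO", "FL", "CA", "WY", "AZ"]

# Vector-style filters, like subset(numvec, condition)
above_seven = [n for n in numvec if n > 7]
between = [n for n in numvec if 4 < n < 9]
extremes = [n for n in numvec if n < 3 or n > 9]

# One dictionary per row stands in for the data frame
rows = [dict(var1=n, var2=c, gender=g, state=s)
        for n, c, g, s in zip(numvec, charvec, gender, state)]

# Like subset(df, var2 == "Sara")
saras = [r for r in rows if r["var2"] == "Sara"]

# Like subset(df, var1==5, select=c(var2, state))
fives = [(r["var2"], r["state"]) for r in rows if r["var1"] == 5]

print(above_seven, between, extremes, len(saras), fives)
```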
04. November 2011 · Write a comment · Categories: Uncategorized · Tags: ,

In a previous post, I discussed how to generate PPC keywords in R. In this post I will provide another example of how to perform this task. Let’s say that I am an auto insurance company that only operates in the state of Illinois. I’m planning on bidding on keywords in Bing and Google which have the name of each city in Illinois followed by ‘auto insurance.’ Using just the sprintf function and a data structure with the Illinois cities, I can generate the desired keywords using the following code. Some of you are probably wondering why an advertiser would bid on keywords that contain each city in Illinois. Well, there are very few instances in which it would make sense to actually bid on keywords that contain city names or zip codes. Just because you can doesn’t mean you should.

DF <- read.csv("http://www.census.gov/tiger/tms/gazetteer/zips.txt",
               header = FALSE)
str(DF)
head(DF)
DF[ ,c(1,5,6,7,8)] <- NA
head(DF)
df <- DF[,colSums(is.na(DF))<nrow(DF)]  # drops the columns that are entirely NA
names(df) <- c("zip", "state", "city")
head(df)
il = subset(df, state=="IL")
ilcity = unique(tolower(il$city))
ilcity = as.character(ilcity)
ilcity = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", ilcity, perl=TRUE)
ilcity
dt = c("%s auto insurance")
dat = sprintf(dt, ilcity)
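
The last few steps (de-duplicate the city names, capitalize each word, and fill in a template) can also be sketched in Python. The four-city sample list and the function names below are hypothetical placeholders. In practice the cities would come from the Illinois rows of the zip code file.

```python
def title_case(name):
    """Capitalize the first letter of each space-separated word."""
    return " ".join(word[:1].upper() + word[1:] for word in name.split(" "))


def city_keywords(cities, template="{} auto insurance"):
    """Build one keyword per unique city, keeping first-seen order."""
    unique_cities = list(dict.fromkeys(city.lower() for city in cities))
    return [template.format(title_case(city)) for city in unique_cities]


# Hypothetical sample; the file lists each city once per zip code,
# so duplicates are expected.
sample = ["CHICAGO", "SPRINGFIELD", "CHICAGO", "DE KALB"]
keywords = city_keywords(sample)

for keyword in keywords:
    print(keyword)
```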
02. November 2011 · 3 comments · Categories: Uncategorized · Tags: ,

Paid search marketing refers to the process of driving traffic to a website by purchasing ads on search engines. Advertisers bid on certain keywords that users might search for, and that determines when and where their ads appear. For example, an individual who owns an auto dealership would want to bid on keywords relating to automobiles that a reasonable person would search for on a search engine. In both Google and Bing, advertisers are able to specify which keywords they would like to bid for and at what amount. If the advertiser decides to bid on just a small number of keywords, they can type that information and specify a bid. However, what if you want to bid on a significant number of keywords? Instead of typing each and every keyword into the Google or Bing dashboard, you could programmatically generate the keywords in R.

Let’s say that I run an online retail establishment that sells men’s and women’s streetwear and I want to drive more traffic to my online store by placing ads on both Google and Bing. I want to bid on a number of keywords related to fashion and have created several lists of ‘root’ words that will comprise the majority of these keywords. To generate my desired keywords, I have written a function which pairs every word in one root list with every word in another.

root1 = c("fashion", "streetwear")
root2 = c("karmaloop", "crooks and castles", "swag")
root3 = c("urban clothing", "fitted hats", "snapbacks")
root4 = c("best", "authentic", "low cost") 

myfunc <- function(){
      lst <- list(root1=c(root1), root2=c(root2), root3=c(root3),
            root4=c(root4))
      myone <- function(x, y){
            m1 <- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
            mydf <- data.frame(keyword=c(m1))
      }
      mydf <- rbind(myone("root4","root1"), myone("root2","root1"))
      }

mydat <- myfunc()
mydat

write.table(mydat, "adppc.txt", quote=FALSE, row.names=FALSE)

This isn’t the prettiest code in the world, but it gets the job done. In fact, the keywords that pair root2 with root1 could have been produced using the following code, which is much more concise.

root5 = c("%s fashion")
root6 = c("%s streetwear")
adcam1 = sprintf(root5, root2)
adcam2 = sprintf(root6, root2)
df = data.frame(keywords=c(adcam1, adcam2))

write.table(df, "adppc.txt", quote=FALSE, row.names=FALSE)

If you have any suggestions for improving my R code, please mention them in the comment section below.

27. October 2011 · 1 comment · Categories: Uncategorized · Tags: , ,

This is the first in a series of blog posts in which I use the R package ggplot2 to examine real world data. In this post, I construct a line graph of U.S. shoe consumption from 1995 to 2007.

A recent survey conducted by Shop Smart magazine found that the average woman in the US owns approximately 17 pairs of shoes. Furthermore, a survey conducted in 2007 found that the average male in the US owns 3 shoes, with younger men owning more shoes than their older counterparts. It’s no wonder that the US shoe market is a $13 billion-per-year industry. In the following graph, you see the increase in US shoe consumption from 1995 to 2007. The one factor that has really contributed to the increase in shoe sales has been the growth in people using athletic shoes for casual wear. Prior to the 1980s, athletic shoes were not regularly worn as people’s primary shoes for everyday usage. In many ways, the emergence of hip hop music during the 1980s had a significant impact in making athletic shoes and sneakers a more prominent part of how people dressed. It would be very interesting to find data on sneaker sales in the U.S. from the mid-1970s to date, and it would likely show a huge increase in the consumption of those types of shoes. Unfortunately, I only have data on shoe consumption for this limited time period. In any case, the R code that I used to construct this plot can be found below.


library(ggplot2)

# `dat` is not defined in this post; the code expects a data frame
# with the columns `year` and `consumption`.
# opts() and theme_text() have been removed from ggplot2;
# ggtitle(), theme() and element_text() are their replacements.
ggplot(dat, aes(year))+
      geom_line(aes(y = consumption), size = 1) +
      ggtitle("Shoe Consumption in the US (1995-2007)") +
      ylim(c(0,3)) + xlab("") + ylab("Number of Shoes (in billions)") +
      theme(plot.title = element_text(size = 12, face = "bold"),
            axis.text.y = element_text(family = "sans",
                  face = "bold", size = 10),
            axis.text.x = element_text(family = "sans",
                  face = "bold", size = 10),
            plot.margin = grid::unit(c(0.3, 0.3, 0.3, 0.1), "lines"))

For the sneakerheads out there, do you prefer Nike SB's or Air Jordan's - Vote!
17. October 2011 · 1 comment · Categories: Uncategorized · Tags: ,

The sqldf package can be used to run SQL queries on R data frames. The user simply needs to specify a SQL statement enclosed by quotation marks within the sqldf() function. In the following R code, you see various ways of using the sqldf package to run SQL queries on R data frames. The SQL function COUNT() is used to find the total number of rows that meet a certain condition. Furthermore, the GROUP BY clause can be used to aggregate the data by the levels of a particular variable.

dte = seq(as.Date("2011-05-01"), as.Date("2011-05-20"), by=1)
persid = c(1013,1011,1014,1015,1023,1028,1012,1018,1019,1020,1027,
        1016,1022,1017,1021,1024,1030,1025,1026,1029)
v1 = round(rnorm(20), 2)
v2 = round(rnorm(20), 2)
first=c("David","Sara","Jon","Jennifer","Ken","Ralph","Chris","David",
      "David","Joe","Melanie","Debbie","Jessica","Ally","Amy","Ralph",
      "Sara","Jane","John","Lance")
last=c("Smith","Jones","Alberts","Hudson","Jennings","Masterson","Browm",
      "Felt","Spade","Montana","Keith","Hardson","Karson","Roberts","Smith",
      "Jennings","Denver","Hudson","Reynolds","Darder")
stat = c("CA","IA","NC","FL","GA","OH","NY","CA","TX","TX","CA","CA","AZ",
      "CO","OK","MI","WI","SC","VT","IL")

df1 <- data.frame(id=c(seq(1,20)), date=c(dte), var1=c(v1),
            var2=c(v2), personid=c(persid))

df2 <- data.frame(id=c(sort(persid)), firstname=c(first),
            lastname=c(last), state=c(stat))
library(sqldf)

sqldf("SELECT COUNT(*) FROM df2 WHERE state = 'CA'")

sqldf("SELECT df2.firstname, df2.lastname, df1.var1, df2.state FROM df1
INNER JOIN df2 ON df1.personid = df2.id WHERE df2.state = 'TX'")
sqldf("SELECT df2.state, COUNT(df1.var1) FROM df1
INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
GROUP BY df2.state")
sqldf("SELECT df2.firstname, df2.lastname, df1.var1, df2.state FROM df1
INNER JOIN df2 ON df1.personid = df2.id
WHERE df1.date BETWEEN '2011-05-03' AND '2011-05-11'")
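
For readers who want to try the same COUNT() and GROUP BY patterns outside of R, here is a small Python sketch using the sqlite3 module from the standard library. The two tables and their four rows are made up for illustration and are much smaller than the data frames above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE people (id INTEGER, firstname TEXT, state TEXT)")
conn.execute("CREATE TABLE scores (personid INTEGER, var1 REAL)")

# Hypothetical rows, loosely modeled on df2 and df1
conn.executemany("INSERT INTO people VALUES (?, ?, ?)",
                 [(1, "David", "CA"), (2, "Sara", "IA"),
                  (3, "Jon", "CA"), (4, "Joe", "TX")])
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [(1, 0.5), (2, -0.2), (3, 1.1), (4, 0.3)])

# COUNT(*) with a WHERE condition
ca_count = conn.execute(
    "SELECT COUNT(*) FROM people WHERE state = 'CA'").fetchone()[0]

# Join the two tables, keep positive scores, and count them per state
positive_by_state = conn.execute("""
    SELECT p.state, COUNT(s.var1)
    FROM scores AS s
    INNER JOIN people AS p ON s.personid = p.id
    WHERE s.var1 > 0
    GROUP BY p.state
    ORDER BY p.state
""").fetchall()

print(ca_count, positive_by_state)
conn.close()
```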