Wednesday, October 30, 2013

The Challenges of Using Big Data

By now you're probably aware of the huge amount of data generated and collected by individuals, businesses, and the government on a daily basis, and you've probably also become familiar with some of the ways businesses comb through this data to extract information useful to their endeavors. Huge data warehouses have sprung up, and programmers have created platforms like Hadoop to market to businesses looking to leverage big data to their advantage. In fact, data mining has become one of the main tools a business uses to gather information about customers, in some cases replacing traditional means of customer interaction like surveys, e-mail chains and focus groups. However, businesses often run into challenges when attempting to utilize big data, and it's important for them to understand the limitations of blindly relying on a platform or service to make decisions.
A saying commonly heard in the business world (and beyond) goes something like this, in one form or another: “you can't see the forest for the trees.” The idea of this metaphor is that it becomes difficult to see an overall trend or solution when faced with a seemingly insurmountable litany of information. However, with the advances of data mining and the use of big data this metaphor would perhaps be more accurately applied in its opposite form. As the “forest” is exhaustively analyzed, large-scale data on trends are created and used, sometimes at the expense of individual data points (or at least much smaller trends) that go largely unnoticed. Many important observations and ideas may come from only one source, and if a strict adherence to big data analytics is used these could potentially be overlooked. Not every outlier is unimportant, particularly when making a decision on a smaller or more localized scale.
Another concept many are likely familiar with is the idea of correlation versus causation. With the ability of advanced data mining techniques to identify trends and even predict outcomes, there is a tendency to simply act based on these trends without looking for the deeper reasons why these trends are occurring. An interesting example of this phenomenon occurred when Google attempted to research flu outbreaks by analyzing keyword searches. Compiling data on users who searched for “nearest drugstore” or “flu symptoms” provided Google with a seemingly pertinent metric for determining where and when outbreaks were occurring. However, does the fact that someone searched for the nearest pharmacy necessarily mean that they have the flu? Without going into too many details, Google found that a useful pattern emerged only after combining various search terms and comparing them with actual flu surveillance data from the Centers for Disease Control.
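The comparison described above, checking candidate search terms against surveillance data, can be sketched in a few lines of Python. This is only a plain Pearson correlation and not Google's actual method; the weekly figures are invented purely for illustration, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented weekly figures, for illustration only (not real data).
reported_flu_cases = [10, 20, 30, 40, 50]
searches = {
    "flu symptoms": [12, 19, 33, 41, 48],
    "nearest drugstore": [30, 28, 31, 29, 30],
}

# Compare each search term's weekly volume with the reported cases.
for term, counts in searches.items():
    print(term, round(pearson(counts, reported_flu_cases), 2))
```

With these made-up numbers, one term tracks the reported cases closely and the other barely at all. Even a strong correlation would only show that the two series move together, not why.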
As mentioned briefly earlier, Hadoop is a popular platform for analyzing big data that many companies use. As this is being written, there is a conference being held in New York City called the “Strata Conference + Hadoop World” that bills itself as the place “where big data's most influential decision makers, architects, developers, and analysts gather to shape the future of their businesses and technologies” (Strata 1). Many companies are represented at this conference, including Facebook, whose analytics chief Ken Rudin shared some interesting observations about how his company uses big data.
“The problem is that Hadoop is a technology, and big data isn't about technology. Big data is about business needs,” said Rudin (Kanaracus 1). A lot of companies spend a great deal of money to implement an expensive platform designed for the data mining of big data, knowing that it is a technological trend and has the potential to generate profits, without fully understanding it or how it can meet their specific needs. Rudin argues that, at least in Facebook's case, Hadoop is not a total solution to their extensive data mining needs (almost all of their revenue comes from targeted advertisements). A sophisticated relational database serves their needs more effectively, particularly when “drilling down” to a more detailed level. Knowledge of data analytics, and particularly the practical application of big data, also plays an important role in Facebook's hiring process. Instead of merely focusing on whether a candidate knows “how … we calculate this metric,” Rudin suggests that candidates be given a business case study and then be asked what metrics would be best applied to the situation.
What types of businesses is big data really an effective tool for? Do the potential drawbacks of using big data and data mining outweigh the perceived benefits? The answer to these questions certainly could vary from situation to situation, and indeed it appears the use of big data must be considered extensively and tailored to a specific business's needs.

Citations
“About Strata + Hadoop World.” Strata Conference + Hadoop World. (2013). O'Reilly Media Inc. Web. Date Accessed: 2013/10/30. http://strataconf.com/stratany2013/public/content/about

“Flu Trends.” Google.org. (2011). Web. Date Accessed: 2013/10/30. http://www.google.org/flutrends/about/how.html

Kanaracus, Chris. “Hadoop is not Enough for 'Big Data', says Facebook Analytics Chief.” IDG News Service. (October 29, 2013). ITWorld. Web. Date Accessed: 2013/10/30. http://www.itworld.com/software/380556/hadoop-not-enough-big-data-says-facebook-analytics-chief?page=0,0

Polsky, Matt and Sommer, Claire. “Dodging Big Data's Big Problems.” GreenBiz.com. (September 16, 2013). Web. Date Accessed: 2013/10/30. http://www.greenbiz.com/blog/2013/09/16/big-data-big-problems

Tuesday, October 22, 2013

Data Mining and Big Data

Marketing is a key business function directly affected by Data Mining. In the past, companies would have to rely on outdated demographic statistics and consumer surveys to make important decisions on marketing strategies. This posed a problem: since the world is constantly changing, these strategies could become outdated or unsuccessful before managers even had a chance to implement them. "Data-driven" is a term being used to describe today's economy, in which data comes not only from transaction records but also from new sources, such as social media, mobile devices, emails, and more (Johnson 2012). It is almost impossible for a person to make a decision without leaving a digital footprint, whether a Tweet, a receipt, or an email. It is because of this massive inflow of data that businesses are turning to Data Mining when developing their marketing strategy, since it is a new method of defining and measuring consumer demand. However, when using Data Mining, it is important to realize this method is only successful if the data input is accurate and relevant.

How do companies ensure that they are using relevant and accurate data? When it comes to relevant data, it depends on the situation or the problem addressed. It is important for companies to take the time to determine which variables are relevant before inputting data into a Data Mining algorithm. As for accuracy, a rule of thumb is ensuring that the data is considered "Big Data". Big Data refers to more than just large volume. In fact, there are three specific characteristics required for data to be distinguished as Big Data: Volume, Velocity, and Variety (Collett 2011).

Volume refers to the size of the data set. But just how "big" is Big Data? Generally, the volume must be between one terabyte and one petabyte. To put this in perspective, consider that every hour Wal-Mart stores handle over one million customer transactions, feeding databases estimated to hold over 2.5 petabytes (Johnson 2012).

Variety refers to the different forms of data and their sources. Big Data consists of internal and external data, as well as structured, semi-structured, and unstructured data. It also comes in several different formats and from many sources (IBM 2012).

Velocity refers to the increasing speed at which new data is being created or how current the data is. It is the time between when data is created and when it can be analyzed. The higher the velocity, the closer the data is to real-time, which allows businesses to make better informed decisions.
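As a rough illustration of velocity as described above, the gap between when a record is created and when it is analyzed can be measured directly. The sketch below uses Python; the function names and the five-minute "near real-time" threshold are assumptions made for the example, not a standard definition.

```python
from datetime import datetime, timedelta


def analysis_lag(created_at, analyzed_at):
    """Time elapsed between a record's creation and its analysis."""
    if analyzed_at < created_at:
        raise ValueError("record cannot be analyzed before it is created")
    return analyzed_at - created_at


def is_near_real_time(created_at, analyzed_at, threshold=timedelta(minutes=5)):
    """True if the lag is within the threshold (assumed cutoff: 5 minutes)."""
    return analysis_lag(created_at, analyzed_at) <= threshold


# Hypothetical timestamps for a single transaction record.
created = datetime(2013, 10, 22, 9, 0, 0)
analyzed = datetime(2013, 10, 22, 9, 2, 30)

print(analysis_lag(created, analyzed))
print(is_near_real_time(created, analyzed))
```

The shorter this lag, the higher the velocity in the sense used here.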

By using data that is current, large in amount, and gathered from a variety of sources, a business is able to use Data Mining more successfully. Data Mining allows companies to find common patterns, which they otherwise would not have noticed, and make decisions based on the conclusions drawn (TRA 2000). The first step to using Data Mining and Big Data successfully is clarifying a goal the company wishes to achieve. What is your opinion on Big Data and Data Mining? For what other key business functions could companies use Data Mining to make decisions? Do you see any issues or concerns surrounding this topic?


Sources:

Collett, S. (2011). Why Big Data is a big deal. Computerworld, 45(20), 18.

IBM. (2012). What is Big Data? Bringing Big Data to the Enterprise. Retrieved April 2, 2013 from http://www-01.ibm.com/software/data/bigdata/

Johnson, J. E. (2012). Big Data + Big Analytics = Big Opportunity. Financial Executive, 28(6), 50-53.

TRA, Inc. (September 2000). TRA, Inc. Awarded New Patent for Improvements in Using Big Data for Television Advertising Targeting. Business Wire (English).

Wednesday, October 16, 2013

Data Mining and Twitter


On October 6, 2013, the Wall Street Journal published a half-page article titled “Data mining thanks to Tweeter”. What really got my attention was how profitable and risky data mining can be.

As Twitter prepares for its IPO, financial statements are being released, showing investors the potential of making billions of dollars in data mining. Although data mining is a small percentage of the company’s business, it has already generated over $47 million year to date by selling user data, or tweets, to other companies that analyze social media for trends and events in real time.

The ways businesses are using Twitter data are limitless. Supply chain managers use Twitter user data to track inventory and product brands, politicians use it for their campaigns and to follow social trends, and stock price trends are being built and analyzed based on real-time events communicated via Twitter. According to the Wall Street Journal, the “social listening” business is very profitable, partially funded by millions of dollars and promising high returns to IPO investors.

Twitter stands on top of other giant social media companies such as Google, Facebook, and Yahoo. Research firms estimate that the entire “big data” market has grown seven times as quickly as the information technology sector as a whole, and may be valued at $17 billion in two years.

Once Twitter goes public, there will be a higher incentive to sell users' data, which makes me wonder if users understand the implications and meaning of every word they post. On the other hand, corporations are highly exposed to the inevitable risk of security breaches and malicious use of big data.

What is your opinion on this issue? Should social media companies be allowed to go public? There is a danger of a conflict of interest between maximizing shareholder profits and protecting customer privacy. What are some possible ways to resolve this conflict?




Source:

Dwoskin, Elizabeth. "Data Mining Thanks to Tweeter" The Wall Street Journal 6 October 2013: C1. Print.

Wednesday, October 9, 2013

Netflix: Data Mining as a Competitive Advantage

Netflix's original series, “House of Cards”, is a real-world example of how a company uses Data Mining as a competitive advantage. Today, “House of Cards” is the most streamed piece of content, and Netflix gives credit to Big Data and Data Mining, which they say allowed them to know the show would be successful before they even started filming it (Carr 2013).

The data Netflix analyzed to create “House of Cards” is distinguished as Big Data because of its volume, variety, and velocity. Netflix has over 33 million members streaming more than one billion hours of video a month. Netflix obtains large amounts of data from its consumers, even from things you would not expect. Netflix keeps a record of not only what videos a customer watches, but also where they are watching, when they are watching, what they are searching for, what device they are watching on, whether they pause, rewind, or fast-forward at certain points, what TV shows they complete, and what ratings they give (Leonard 2013).

This means that Netflix has access to a rapid direct inflow of information on what customers are watching, searching, and preferring at any given time. With Big Data and Data Mining research, they found the average rating of all videos, defined the television shows where viewers were most likely to complete the series, and used data mining to find common feedback/requests on social media sites, emails, and search inquiries. Netflix measured patterns with Data Mining tools to find the most common actors in successful videos, most common popular themes, and in general what the most popular videos had in common. In doing so, Netflix was successfully able to measure consumer demand for a new television series.
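The kind of pattern-finding described above, looking for what the most popular videos have in common, can be approximated with a simple frequency count. The Python sketch below is only an illustration under stated assumptions: the catalog records, placeholder names, and the 4.0 rating cutoff are all invented, and nothing here reflects Netflix's actual tools or data.

```python
from collections import Counter

# Hypothetical catalog; titles, names, and ratings are placeholders.
catalog = [
    {"title": "Show A", "actor": "Actor X", "director": "Director P",
     "genre": "political drama", "rating": 4.6},
    {"title": "Show B", "actor": "Actor X", "director": "Director Q",
     "genre": "political drama", "rating": 4.4},
    {"title": "Show C", "actor": "Actor Y", "director": "Director P",
     "genre": "comedy", "rating": 4.2},
    {"title": "Show D", "actor": "Actor X", "director": "Director P",
     "genre": "political drama", "rating": 4.5},
    {"title": "Show E", "actor": "Actor Y", "director": "Director Q",
     "genre": "comedy", "rating": 3.1},
]


def common_traits(records, field, min_rating=4.0, top=1):
    """Most frequent values of `field` among records rated >= min_rating."""
    successful = [r for r in records if r["rating"] >= min_rating]
    return Counter(r[field] for r in successful).most_common(top)


# Count the most common actor, director, and genre among well-rated titles.
for field in ("actor", "director", "genre"):
    print(field, common_traits(catalog, field))
```

A real analysis would weigh far more signals, such as completion rates and search queries, but the idea of counting shared traits among successful titles is the same.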

Netflix defined three specific factors of a television show that would attract the highest consumer demand. These determining factors were Kevin Spacey acting, David Fincher directing, and the political drama genre, which is what they used to create “House of Cards” and why they claim it is such a success (Leonard 2013). Given the success of the show, many companies are beginning to believe that Data Mining can give a competitive advantage by predicting consumer demand correctly. While there are privacy concerns surrounding Data Mining, there are plenty of gains to be made with it. Given this example of the benefits of Data Mining, what is your stance on using Data Mining?


SOURCES
Baron, P., Eskin, G., & Lloyd, C. (1975). Some Conceptual Issues in the Testing of the Economic Theory of Consumer Demand. Advances In Consumer Research, 2(1), 679.

Carr, David. (13 February 2013) Giving Viewers What They Want. The New York Times.

Leonard, Andrew. (2013). How Netflix is turning viewers into puppets. Salon Media Group. Retrieved April 13, 2013, from http://www.salon.com/2013/02/01/how_netflix_is_turning_viewers_into_puppets/

Wednesday, October 2, 2013

Data Mining and The Death of International Networking


With the National Security Agency (NSA) leaks in the news, data mining and privacy have been forced to the forefront of many discussions lately.  According to a global survey, 80% of respondents were concerned about their internet privacy.  But do people really want privacy? Consider how quick we are, for the sake of convenience, to share almost all of our personal information: not only what is stored on our smart phones, but also our texts, phone records, and social media postings.

The indifference some people display toward data collection for marketing purposes may seem harmless enough.  Who doesn’t want a 50% off coupon on an item they may purchase regularly?  Trading one’s personal data for convenience or discounts may seem like a healthy two-way relationship between the producer and consumer.

But what about the government peeking into your shopping habits, personal contacts, or messages?  While law-abiding citizens may not be too concerned, if at all, about government data mining since they claim that they are not doing anything illegal, they may want to reconsider the consequences of being snared in an investigation.  According to an annual watchdog report in the U.K., 979 people were erroneously caught up in data mining operations in 2012.  While most errors were realized before any action was taken, six were not.  Five of the six were erroneously arrested and charged with crimes.

Even after the falsely accused are cleared of any wrongdoing, the damage to one’s reputation could be damning. Richard Jewell was falsely accused of planting the bomb during the 1996 Olympics in Atlanta because he fit the FBI profile of a lone bomber.  He then had to spend years trying to rebuild his reputation even though he was quickly cleared of any wrongdoing.  Though he was actually credited with saving lives as the security guard who noticed and called in the suspicious package containing the pipe bomb, his name was splashed across the media throughout the world as a suspect, and he became the butt of many late-night jokes.

While Richard Jewell is an extreme case of being falsely accused, many companies require a background check for employment.  It isn’t too difficult to imagine how far the application would go if the results came back that you were arrested on national security issues.  A notation of “charge dismissed” on the report may not mean much.  In addition, many professional firms, such as the Big 4 firm where I’m currently employed, require an annual affidavit that, among other things, attests that I have not been part of any criminal investigation.  A positive answer could lead to reduced assignments or even termination.  So should I, as a law-abiding citizen, be concerned?

Among the revelations of the recent NSA leaks regarding phone records and email logs is that the NSA uses social media to map social connections and enhance its analysis capability.  The results could tell an extraordinary amount about you…or possibly make connections that may not be there.  While the intent is to discover and track connections from a foreign suspect to American citizens, this could be problematic at most, and invasive at least, for those with a large international social network such as myself.

I still keep in touch through social media with many friends I made while I was in the Navy, many of whom are still on active duty and now stationed throughout the world.  In addition, my current employer is a multinational firm that encourages international rotations, and I now have social media contacts among colleagues who have returned to Russia, Ukraine, Romania, Singapore and Nigeria.  Also, I recently returned from a summer semester abroad in Europe where I lived next to and took classes with students from all over the world with different political views.

Many are now included in my social media, such as a fellow student from Turkey who one night during a student party, while casually scrolling through his iPhone and wearing a Polo shirt, Levi's jeans and Nike tennis shoes, bluntly told me he was anti-American.  After I composed myself and pointed out the fallacy of his statement, we enjoyed many nights in spirited debate on economic and political ideology and are now friends on Facebook.  But now I have to be concerned about some bizarre six degrees of Kevin Bacon game if he or any of my social media friends, or possibly their friends, become a national security interest.
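To make the "six degrees" worry concrete, the number of hops between two people in a contact network can be found with a breadth-first search. The Python sketch below is a generic illustration only: the contact graph and its labels are hypothetical, and it is not a description of how any agency actually performs its analysis.

```python
from collections import deque


def degrees_of_separation(graph, start, target):
    """Fewest hops from start to target, or None if they are unconnected."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for contact in graph.get(person, ()):
            if contact == target:
                return hops + 1
            if contact not in seen:
                seen.add(contact)
                queue.append((contact, hops + 1))
    return None


# Hypothetical contact graph; every name is a placeholder.
contacts = {
    "me": ["classmate", "colleague"],
    "classmate": ["me", "friend_of_classmate"],
    "colleague": ["me"],
    "friend_of_classmate": ["classmate", "person_of_interest"],
    "person_of_interest": ["friend_of_classmate"],
}

print(degrees_of_separation(contacts, "me", "person_of_interest"))
```

Even in this tiny made-up network, someone I have never met sits only a few hops away, which is the heart of the concern.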

In the months following 9/11, the U.S. government was accused of failing to connect the dots leading up to the attack, which resulted in Congress passing the USA Patriot Act, which contains provisions authorizing many of the data mining techniques used by the NSA today.  According to President Obama, "You can't have 100% security and also then have 100% privacy and zero inconvenience.  We're going to have to make some choices as a society." (Cohen, 2013)

I guess one of those choices is to choose your friends wisely…networking be damned.

 

Sources:

Cohen, Tom, “Data mining revelation opens political Pandora's box”, CNN.com (June 9, 2013), http://www.cnn.com/2013/06/07/politics/data-mining-after-9-11/index.html, Accessed September 21, 2013

Moglen E., “The Tangled Web We Have Woven”, Communications Of The ACM February 2013;56(2):20-22. Available from: Business Source Complete, Ipswich, MA. Accessed September 20, 2013

Singel, Ryan, “Newly Declassified Files Detail Massive FBI Data-Mining Project”, Wired.com (September 2009).  http://www.wired.com/threatlevel/2009/09/fbi-nsac. Accessed September 21, 2013 

Spiekermann, Sarah, “The Challenges of Privacy by Design”, Communications of The ACM. July 2012;55(7):38-40

Travis, Alan, “Six people falsely accused of crimes after errors in internet data disclosure”, TheGuardian.com (July 18, 2013), http://www.theguardian.com/technology/2013/jul/18/people-falsely-accused-internet-data-disclosure, Accessed September 30, 2013