Normal Approximation Using R

Abstract

The purpose of this study is to determine when it is desirable to approximate a discrete distribution with a normal distribution. In particular, it is convenient to replace the binomial distribution with the normal when certain conditions are met; remember, though, that the binomial distribution is discrete while the normal distribution is continuous. The study also gives an overview of how the normal distribution can be used to approximate the Poisson distribution. The common justification for these approximations rests on the notion of a sampling distribution. I also show how binomial probabilities can be calculated with a straightforward formula involving the binomial coefficient. Unfortunately, the factorials in that formula quickly lead to computational difficulties. The normal approximation allows us to bypass these problems.

Introduction

The shape of the binomial distribution changes considerably according to its parameters, n and p. If the parameter p, the probability of "success" (or of a defective item, or of a failure) in a single trial, is sufficiently small (or if q = 1 − p is sufficiently small), the distribution is usually asymmetrical. Alternatively, if p is sufficiently close to 0.5 and n is sufficiently large, the binomial distribution can be approximated by the normal distribution; under these conditions it is approximately symmetrical and tends toward a bell shape. A binomial distribution with very small p (or p very close to 1) can also be approximated by a normal distribution if n is very large. If n is large enough, sometimes both the normal approximation and the Poisson approximation are applicable; in that case the normal approximation is generally preferable, since it allows easy calculation of cumulative probabilities using tables or other technology. When dealing with extremely large samples, it becomes very tedious to calculate certain probabilities exactly, and in such circumstances the normal distribution is used to approximate the exact probabilities of success that would otherwise require laborious computation. For n sufficiently large (say n > 20) and p not too close to zero or 1 (say 0.05 < p < 0.95), the distribution approximately follows the normal distribution.

To find binomial probabilities, the approximation can be used as follows:

If X ~ binomial(n, p) where n > 20 and 0.05 < p < 0.95, then X has approximately the normal distribution with mean E(X) = np and variance Var(X) = np(1 − p).

So Z = (X − np)/√(np(1 − p)) is approximately N(0, 1).

R will be used to calculate probabilities associated with the binomial, Poisson, and normal distributions. R code makes it possible to test the input and to display the output as a graph. The only system requirement for R is an operating system platform on which to perform the calculations.

Firstly, we consider the conditions under which the discrete distribution tends towards a normal distribution.
Secondly, we generate a set of values from the discrete distribution so that it tends towards a bell shape, or simply use R by specifying the size needed.
And lastly, we compare the generated distribution with the target normal distribution (a brief sketch of these steps in R follows below).
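
As a brief illustration of these three steps, here is a small R sketch; the sample size and the parameters n = 100 and p = 0.4 are illustrative choices, not requirements:

set.seed(1)                                    # for reproducibility
n <- 100; p <- 0.4                             # illustrative parameters
samp <- rbinom(10000, n, p)                    # generate the discrete (binomial) distribution
hist(samp, probability = TRUE, breaks = 30,
     main = "Binomial sample vs normal approximation", xlab = "x")
curve(dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
      add = TRUE, lwd = 2)                     # compare with the target normal distribution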

Normal approximation of binomial probabilities

Let X ~ BINOM(100, 0.4).

Using R to compute Q = P(35 < X ≤ 45) = P(35.5 < X ≤ 45.5):

> diff(pbinom(c(35, 45), 100, .4))

[1] 0.6894402

Whether for theoretical or practical purposes, it is more convenient to use the Central Limit Theorem to approximate the binomial probabilities when n is large and both np/q > 3 and nq/p > 3, where q = 1 − p.

The CLT states that, for situations where n is large,

Y ~ BINOM(n, p) is approximately NORM(μ = np, σ = [np(1 − p)]^(1/2)).

Hence, using the first expression Q = P(35 < X ≤ 45), the approximation gives

Φ((45 − 40)/√24) − Φ((35 − 40)/√24) = Φ(1.0206) − Φ(−1.0206) = 0.6926

A correction for continuity is applied when a continuous distribution is used to approximate a discrete one. Recall that a continuous random variable can take any real value within a range or interval, while a discrete random variable can take only specified values. With this adjustment, using the normal distribution to approximate the binomial yields more precise approximations of the probabilities.

After applying the continuity correction to Q = P(35.5 < X ≤ 45.5), this gives

Φ((45.5 − 40)/√24) − Φ((35.5 − 40)/√24) = Φ(1.1227) − Φ(−0.91856) = 0.6900

We can verify the calculation using R:

> pnorm(c(1.1227))-pnorm(c(-0.91856))

[1] 0.6900547

Below, alternative R code is used to compute and plot the normal approximation to the binomial.

Let X ~ BINOM(100, 0.4) and Q = P(35 < X ≤ 45).

> pbinom(45, 100, .4) - pbinom(35, 100, .4)

[1] 0.6894402

# Normal approximation
> pnorm(5/sqrt(24)) - pnorm(-5/sqrt(24))

[1] 0.6925658

# Applying continuity correction
> pnorm(5.5/sqrt(24)) - pnorm(-4.5/sqrt(24))

[1] 0.6900506

x1 = 36:45                        # values inside the interval 35 < x <= 45
x2 = c(25:35, 46:55)              # values outside the interval
x1x2 = seq(25, 55, by = .01)      # grid for the normal density curve

plot(x1x2, dnorm(x1x2, 40, sqrt(24)), type = "l",
     xlab = "x", ylab = "Binomial Probability")        # normal density, mean 40, sd sqrt(24)

lines(x2, dbinom(x2, 100, .4), type = "h", col = 2)    # binomial probabilities outside the interval
lines(x1, dbinom(x1, 100, .4), type = "h", lwd = 2)    # binomial probabilities inside the interval

Poisson approximation of binomial probabilities

For situations in which p is very small and n is large, the Poisson distribution can be used as an approximation to the binomial distribution; the larger the n and the smaller the p, the better the approximation. The Poisson model used to approximate the binomial probabilities is

P(X = x) ≈ e^(−np) (np)^x / x!

A Poisson approximation can be used when n is large (n > 50) and p is small (p < 0.1).

Then X ~ Po(np) approximately.

AN EXAMPLE

The probability that a person will develop an infection even after taking a vaccine that was supposed to prevent it is 0.03. In a simple random sample of 200 people in a community who get vaccinated, what is the probability that six or fewer people will be infected?

Solution:

Let X be the random variable for the number of people infected. X follows a binomial distribution with n = 200 and p = 0.03. The probability of six or fewer people getting infected is

P(X ≤ 6) = Σ (x = 0 to 6) C(200, x) (0.03)^x (0.97)^(200−x)

The probability is 0.6063. The calculation can be verified using R:

> sum(dbinom(0:6, 200, 0.03))

[1] 0.6063152

Or otherwise,

> pbinom(6, 200, .03)

[1] 0.6063152

To avoid such tedious calculation by hand, a Poisson distribution or a normal distribution can be used to approximate the binomial probability.

Poisson approximation to the binomial distribution

To use the Poisson distribution as an approximation to the binomial probabilities, we take the random variable X to follow a Poisson distribution with rate λ = np = (200)(0.03) = 6. We can then calculate the probability of having six or fewer infections as

P(X ≤ 6) = Σ (x = 0 to 6) e^(−6) 6^x / x! = 0.6063

The result turns out to be almost the same as the one obtained using the binomial distribution.

The calculation can be verified using R:

> ppois(6, lambda = 6)

[1] 0.6063028

It can be clearly seen that the Poisson approximation is very close to the exact probability.

The same probability can be calculated using the normal approximation. Since the binomial distribution is for a discrete random variable and the normal distribution for a continuous one, a continuity correction is needed when using the normal distribution as an approximation to a discrete distribution.

For large n with np > 5 and nq > 5, a binomial random variable X with X ~ Bin(n, p) can be approximated by a normal distribution with mean = np and variance = npq, i.e. here X ~ N(6, 5.82).

The probability that there will be six or fewer cases of infection is

P(X ≤ 6) = P(Z ≤ (6 − 6)/√5.82)

As was mentioned earlier, a correction for continuity is needed, so the above expression becomes

P(X ≤ 6) ≈ P(Z ≤ (6.5 − 6)/√5.82)

= P(Z ≤ 0.5/2.4125)

= P(Z ≤ 0.2072)

Using R, this probability of 0.5821 is obtained:

> pnorm(0.2072)

[1] 0.5820732

It can be noted that this approximation is close to the exact probability of 0.6063; however, the Poisson distribution gives the better approximation here. For larger sample sizes, where n is closer to 300, the normal approximation is as good as the Poisson approximation.
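
The three calculations for this example (exact binomial, Poisson approximation and continuity-corrected normal approximation) can be collected side by side in R; the following sketch simply repeats the computations already shown above:

n <- 200; p <- 0.03                                         # parameters from the example
c(exact   = pbinom(6, n, p),                                # exact binomial probability
  poisson = ppois(6, lambda = n * p),                       # Poisson approximation
  normal  = pnorm((6.5 - n * p) / sqrt(n * p * (1 - p))))   # normal approximation with continuity correction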

The normal approximation to the Poisson distribution

The normal distribution can also be used as an approximation to the Poisson distribution whenever the parameter λ is large.

When λ is large (say λ > 15), the normal distribution can be used as an approximation, with

X ~ N(λ, λ)

Here also a continuity correction is needed, since a continuous distribution is used to approximate a discrete one.

Example

Radioactive disintegration gives counts that follow a Poisson distribution with a mean count of 25 per second. Find the probability that in a one-second interval the count is between 23 and 27 inclusive.

Solution:

Let X be the radioactive count in a one-second interval, X ~ Po(25).

Using the normal approximation, X ~ N(25, 25):

P(23 ≤ X ≤ 27) = P(22.5 < X < 27.5)

= P((22.5 − 25)/5 < Z < (27.5 − 25)/5)

= P(−0.5 < Z < 0.5)

= 0.383 (3 d.p.)

Using R:

> pnorm(c(0.5))-pnorm(c(-0.5))

[1] 0.3829249
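
For comparison, the exact Poisson probability can be computed directly in R as well; the result should be close to the normal approximation of about 0.383 obtained above:

ppois(27, lambda = 25) - ppois(22, lambda = 25)   # exact P(23 <= X <= 27) for X ~ Po(25)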

In this study it has been shown that using the normal distribution to approximate the binomial distribution gives accurate approximations. Moreover, as n gets larger, the binomial distribution looks increasingly like the normal distribution; the normal approximation to the binomial is, in fact, a special case of a more general phenomenon. The importance of employing a correction for continuity has also been examined. Using R, accurate outcomes for these distributions are easily obtained, and a number of examples have been analysed to give a better perspective on the normal approximation.

Using the normal distribution as an approximation can be useful; however, if the conditions above are not met, the approximation may not be good at estimating the probabilities.

Models of Accounting Analysis

Historic Cost

In accounting, historic cost is the original monetary value of an economic item. Historic cost is based on the stable measuring unit assumption. In some circumstances, assets and liabilities may be shown at their historic cost, as if there had been no change in value since the date of acquisition. The balance sheet values of such items may therefore differ from their "true" value (WIKIPEDIA).

Principle

Historic cost is an accounting principle under which assets are recorded on the balance sheet at the value at which they were obtained, rather than at current market value. The historic cost standard is used to measure the capital expended to acquire an asset, and it is helpful for matching against changes in profits or expenses relating to the asset purchased, as well as for determining past opportunity costs (Business Dictionary).

Impacts

Under the historical cost basis of accounting, assets and liabilities are recorded at their values when first acquired. They are not then generally restated for changes in values. Costs recorded in the Income Statement are based on the historical cost of items sold or used, rather than their replacement costs (WIKIPEDIA).

Example

The main headquarters of a company, which includes the land and building, was bought for $100,000 in 1945, and its expected market value today is $30 million. The asset is still recorded on the balance sheet at $100,000 (INVESTOPEDIA).

Current Purchasing Power Accounting

Capital maintenance in units of constant purchasing power (CMUCPP) is the International Accounting Standards Board (IASB) basic accounting model originally authorized in IFRS in 1989 as an alternative to traditional historical cost accounting (WIKIPEDIA).

Principle

Current Purchasing Power Accounting (CPPA) involves the restatement of historical figures at current purchasing power. For this purpose, historical figures must be multiplied by conversion factors. The conversion factor is calculated as:

Conversion factor = Price Index at the date of Conversion/Price Index at the date of item arose

Conversion factor at the beginning = Price Index at the end/Price Index at the beginning

Conversion factor at an average = Price Index at the end/Average Price Index

Conversion factor at the end = Price Index at the end/Price Index at the end

Average Price Index = (Price Index at the beginning + Price Index at the end)/2

CPP Value = Historical value × Conversion factor (Accounting Management).

Impacts on Financial Statements

Financial statements are prepared on the basis of historical cost, and a supplementary statement is prepared showing the historical items in terms of present value on the basis of a general price index. The retail price index or the wholesale price index is taken as an appropriate index for converting historical cost items to show the change in the value of money. This method takes into consideration changes in the value of items as a result of the general price level, but it does not account for changes in the value of individual items (Accounting Management).

Example

XYZ Company had a closing balance of inventory at 30 June 2012 equal to $10,000. This inventory had been purchased in the last three months of the financial year. Assume the general price level index was 140 on 1 July 2011, 144 on 31 December 2011 and 150 on 30 June 2012, that the average for the year (July 2011 – June 2012) was 145, and that the average for April 2012 – June 2012 was 147. To show the updated inventory under CPPA, we use the following formula: book value of inventory × current general price index / average index of the three months = 10000 × 150/147 = $10,204 (Accounting Education).
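
As a small sketch in R (the language used for the calculations earlier in this document), the conversion in this example is a one-line computation using the figures above:

book_value    <- 10000       # closing inventory at 30 June 2012
index_current <- 150         # general price index at 30 June 2012
index_average <- 147         # average index for April-June 2012
book_value * index_current / index_average   # CPP value, approximately 10204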

Current Cost Accounting

Current cost accounting is an accounting method that attempts to value assets on the basis of their current replacement cost rather than the amount for which they were originally purchased (Ask).

Principles

It affects all the accounts and financial statements as well as their balancing items. A fundamental principle underlying the estimation of gross value added, and hence GDP, is that output and intermediate consumption must be valued at the prices current at the time the production takes place. This implies that goods withdrawn from inventories must be valued at the prices prevailing at the time they are withdrawn, and not at the prices at which they entered the inventories (Glossary Of Statistical Terms).

Impacts

Unlike the accounting systems that support the preparation of financial reports, cost accounting systems and reports are not subject to rules and standards such as the Generally Accepted Accounting Principles. As a result, there is huge variety in the cost accounting systems of different organizations, and sometimes even in different parts of the same organization (WIKIPEDIA).

Advantages Of Historic Cost

Historic cost provides a straightforward procedure. Gains are not recorded until they are recognised. The historical costing method is still widely used in accounting systems.

Dis-Advantages Of Historic Cost

Historic cost reflects only the acquisition cost of an asset and does not recognise its current market value. Historic cost is concerned only with the allocation of cost, not with the value of an asset. It neglects the current market value of the asset, which may be higher or lower than the recorded figure. It also shows its flaws in times of inflation (Study Mode).

Comments

The historical costing method is still used in accounting systems. It is a traditional method, but because it does not represent the market value of items, it is not always an appropriate method to adopt.

Advantages Of Current Purchasing Power Accounting

The current purchasing power method uses general purchasing power as the measuring unit.
It provides the means to calculate the gain or loss in purchasing power due to holding monetary items.
In this method, historical accounts continue to be maintained, because the CPP statements are prepared on a supplementary basis.
This method keeps intact the purchasing power of the capital contributed by shareholders, so it is important from the shareholders' point of view.
This method provides reliable financial information for management to formulate policies and plans.

Dis-Advantages Of Current Purchasing Power Accounting

This method only considers changes in general purchasing power; it does not consider changes in the value of individual items.
This method is based on a statistical index number which cannot be used at the level of an individual firm.
It is difficult to choose a suitable price index.
This method fails to remove all the defects of the historical cost accounting system (Accounting Management).

Comments

Purchasing power accounting is very useful for providing financial information to management, and it keeps intact the purchasing power of the capital contributed by shareholders. It is useful in times of inflation, so at present this method is very useful.

Advantages of Current Cost Accounting

This method uses the present value of assets instead of the original purchase price.
This type of accounting addresses the difference between the historical and current cost accounting systems.
This method assigns higher values to the assets owned by the business.
This method is also used during bankruptcy and liquidation procedures to find the total loss to the owner (Ask).

Comments

Cost accounting gives an accurate picture of the connection between specific costs and specific outputs because it traces resources as they move through the company. By adopting cost accounting for a business, we learn which resources are being wasted and which resources are most profitable (Chron).

Methods of Data Collection

1. INTRODUCTION

This report describes how data are collected and what methods are available for collecting data for research, whether to improve a piece of research or to learn more about the particular thing being analysed. It gives a brief study of the methods of collecting primary and secondary data, together with their classifications.

2. Methods of collecting primary data
OBSERVATION
QUESTIONNAIRE
SEMI-STRUCTURED AND IN-DEPTH INTERVIEW.

2.1 OBSERVATION

Observation involves finding out what people do, what they need, and so on. It combines the recording, description, analysis and interpretation of people's behaviour. There are two different types of observation:

PARTICIPANT OBSERVATION.

In participant observation the researcher becomes involved in the subjects' activities, living among them and being a member of the group; documentary films, for example, are of this kind.

The researcher roles within this type are:

Complete participant
Complete observer
Observer as participant
Participant as observer.

[Figure: graphical representation of the participant observation researcher roles (complete participant, complete observer, participant as observer, observer as participant).]

STRUCTURED OBSERVATION.

As the heading itself suggests, this is a structured way of collecting data which involves a high level of predetermined structure. It usually forms only one part of the data collection. Examples include a daily attendance sheet or a planning sheet, such as the recording sheet below.

Example of a structured observation recording sheet (one row per hour; for each hour the minutes taken are recorded, and each activity has ACT and ADEQ columns):

HOURS | MINUTES TAKEN | WASHING (ACT, ADEQ) | DRESSING (ACT, ADEQ) | EATING (ACT, ADEQ) | MOBILITY (ACT, ADEQ)
1 | | | | |
2 | | | | |
3 | | | | |
4 | | | | |

2.2 SEMI STRUCTURED AND IN-DEPTH INTERVIEWS

This involves interviewing a person or a group. Interviews are classified into structured, semi-structured and unstructured interviews. In structured interviews a fixed format of standard questions is followed to address particular criteria.

Semi-structured interviews give the respondent the option to select their preferred set of questions, whereas unstructured interviews involve in-depth discussion of a particular area of interest.

Interviews may be conducted face to face or in groups. A face-to-face interview can reveal an individual's behaviour, while a group interview shows how members of a group interact and how they differ from one another.

HOW THESE TYPES OF INTERVIEWS ARE USEFUL IN RESEARCH

Interview type | Exploratory | Descriptive | Explanatory
Structured | - | Frequent | Less frequent
Semi-structured | Less frequent | - | More frequent
In-depth | More frequent | - | -

2.3 QUESTIONNAIRE

A questionnaire is a general way of collecting data in which each person is asked to answer the same set of questions in a set order. It is an easy way to ask questions for a study or piece of research, and most research uses the questionnaire as its main tool for collecting information. Because it can be administered at the individual level, the sample size can also be larger. An interesting aspect of questionnaires is the different modes of responding to them:

Telephonic survey.
Mail (postal) survey.
E-mail survey.

[Figure: questionnaire selection chart.]

2.3.1 Telephonic survey

This is a common method used where the researcher and the respondent are unknown to each other, so only limited data can be collected. Because of these limitations the questionnaire format is restricted to a short one. Questions must be easy for the respondent to answer quickly and must not be so long that they consume too much time. A trained person must conduct the interview, and answers can be entered directly into an Excel sheet to save time.

2.3.2 Mail (postal) survey

This is a standard form of survey in which the respondent and the questioner do not meet and there is no direct interaction. The questioner should plan the design and structure of the questions in advance so that the respondent can answer them without skipping any question. Questions should be ordered from easy to average to difficult, which helps to obtain a valuable survey. Time is highly valued in such surveys.

2.3.3 E-mail survey

E-mail surveys are the most popular type of survey where people can be reached through the internet. They can be carried out in two ways: by e-mail or by using an online survey. Just as with post, an e-mail can be sent to the respondent for answering, but they may not reply to it for various reasons. Online surveys are better because respondents answer then and there, so data are collected faster than by mail. Today HTML pages are used to frame survey questions, and Google Forms is a convenient tool that helps researchers get the job done.

3. METHODS OF COLLECTING SECONDARY DATA

Collecting secondary data involves searching publications, project and research reports, ERP systems, data warehouses and data mining, and the internet/web for the research details you need.

3.1 PUBLICATIONS

Publications refer to printed media such as newspapers, textbooks, magazines, journals and reports. These are otherwise known as reference material, and they contain a wide range of data. Researchers often turn to secondary data before primary data because it leads them to a proper and more complete view of the research on their respective topics. As every publication has its own specified topic, researchers can easily find the source of a topic in a systematic manner. Proper guidelines are also required to search these publications.

3.2 ERP/DATAWAREHOUSES AND MINING

In every organization an ERP system is implemented to gather information about finance, commercial operations, accounts, production, marketing, R&D and so on.

ERP helps research because data are stored on a daily, monthly and yearly basis and can be consolidated. Researchers studying different phenomena can easily obtain this information, through an authorized person in the organization, for their data collection. ERP combines different sectors; for example, if a researcher from the financial sector comes to examine how the organization has developed in that particular sector, he or she can collect the information from the ERP system. Mostly these data are considered primary data.

Data warehouses hold secondary data, where large amounts of data are stored. These data cannot be analysed manually, so data mining software is used: it segregates all kinds of data and applies statistical techniques to analyse them. Some techniques used by this software are variance analysis, cluster analysis and factor analysis. It combines statistical and information technologies; some vendors of such software are XLMiner, SPSS, SAS and SYSTAT. Data mining is an automated process in which some features are selected by the user.

3.3 Internet/web

The most basic way of collecting secondary data is to search the web. The internet makes it easy and fast to search topics and related terms, and a surplus of data can be found on thousands of websites all over the world, including e-textbooks, journals and government reports. Search engines such as Google and Yahoo are available for this purpose; they return many sites, but one must choose data that are genuinely related to the research topic. A popular website for researchers collecting background data is Wikipedia, where notes on a particular topic are given with reference links for more detailed study of the research topic.

SOME OF THE IMPORTANT WEBSITES

Owners/Sponsored | Site address | Description
World Bank | www.worldbank.org | Data
Reserve Bank of India | www.rbi.org.in | Economic data, banking data
EBSCO | http://web.ebscohost.com | Research databases (paid)
ISI (Indian Statistical Institute) | www.isical.ac.in/-library/ | Web library

4. Conclusion

From the information given we know what primary and secondary data are and how to collect them from various sources. Research must be valuable, so data collection must be done thoroughly to produce a correct result from the analysis. Secondary data can be included in research reports, but there must also be some data that show your own involvement in the research process. Research is an endless process, because as time changes the strategy and content of reports also vary, since respondents are not the same in nature. Research on a topic gives an overview, details and explanation according to the research type. Finally, the collection of data is most important for research because it acts as the proof or evidence behind your reports.

Table of Contents
1. Introduction
2. Methods of collecting primary data
   2.1 Observation
   2.1.1 Participant observation
   2.1.2 Structured observation
   2.2 Semi-structured and in-depth interviews
   2.3 Questionnaire
   2.3.1 Telephonic survey
   2.3.2 Postal survey
   2.3.3 E-mail survey
3. Methods of collecting secondary data
   3.1 Publications
   3.2 ERP/data warehouses and mining
   3.3 Internet/web
4. Conclusion

REFERENCES
Saunders, M., Lewis, P. and Thornhill, A., Research Methods for Business Students, 3rd edition, Pearson Education.
Wilson, J., Essentials of Business Research, SAGE Publications.
Beri, G. C., Statistics for Management, Tata McGraw-Hill.

Measuring weak-form market efficiency

Abstract

This paper tests weak-form efficiency in the U.S. market. Both daily and monthly returns are employed for autocorrelation analysis, variance ratio tests and delay tests. Three conclusions are reached. Firstly, security returns are predictable to some extent. While individual stock returns are weakly negatively correlated and difficult to predict, market-wide indices with outstanding recent performance show positive autocorrelation and offer more predictable profit opportunities. Secondly, monthly returns follow a random walk more closely than daily returns and are thus more weak-form efficient. Finally, weak-form inefficiency is not necessarily bad: investors should be rewarded with a certain degree of predictability for bearing risks.

Efficient market hypothesis (EMH), also known as “information efficiency”, refers to the extent to which stock prices incorporate all available information. The notion is important in helping investors to understand security behaviour so as to make wise investment decisions. According to Fama (1970), there are three versions of market efficiency: the weak, semistrong, and strong form. They differ with respect to the information that is incorporated in the stock prices. The weak form efficiency assumes that stock prices already incorporate all past trading information. Therefore, technical analysis on past stock prices will not be helpful in gaining abnormal returns. The semistrong form efficiency extends the information set to all publicly available information including not only past trading information but also fundamental data on firm prospects. Therefore, neither technical analysis nor fundamental analysis will be able to produce abnormal returns. Strong form efficiency differs from the above two in stating that stock prices not only reflect publicly available information but also private inside information. However, this form of market efficiency is always rejected by empirical evidence.

If weak-form efficiency holds true, the information contained in past stock price will be completely and instantly reflected in the current price. Under such condition, no pattern can be observed in stock prices. In other words, stock prices tend to follow a random walk model. Therefore, the test of weak-form market efficiency is actually a test of random walk but not vice versa. The more efficient the market is, the more random are the stock prices, and efforts by fund managers to exploit past price history will not be profitable since future prices are completely unpredictable. Therefore, measuring weak-form efficiency is crucial not only in academic research but also in practice because it affects trading strategies.

This paper primarily tests the weak-form efficiency for three stocks-Faro Technologies Inc. (FARO), FEI Company (FEIC) and Fidelity Southern Corporation (LION) and two decile indices-the NYSE/AMEX/NASDAQ Index capitalisation based Deciles 1 and 10 (NAN D1 and NAN D10). Both daily and monthly data are employed here to detect any violation of the random walk hypothesis.

The remainder of the paper is structured in the following way. Section I provides a brief introduction of the three firms and two decile indices. Section II describes the data and discusses the methodology used. Section III presents descriptive statistics. Section IV is the result based on empirical analysis. Finally, section V concludes the paper.

I. The Companies[1]

A. Faro Technologies Inc (FARO)

FARO Technologies is an instrument company whose principal activities include the design and development of portable 3-D electronic systems for industrial applications in manufacturing. The company's principal products include the Faro Arm, Faro Scan Arm and Faro Gage articulated measuring devices. It mainly operates in the United States and Europe.

B. FEI Company (FEI)

FEI is a leading scientific instruments company which develops and manufactures diversified semiconductor equipments including electron microscopes and beam systems. It operates in four segments: NanoElectronics, NanoResearch and Industry, NanoBiology and Service and Components. With a 60-year history, it now has approximately 1800 employees and sells products to more than 50 countries around the world.

C. Fidelity Southern Corp. (LION)

Fidelity Southern Corp. is one of the largest community banks in metro Atlanta which provides a wide range of financial services including commercial and mortgage services to both corporate and personal customers. It also provides international trade services, trust services, credit card loans, and merchant services. The company provides financial products and services for business and retail customers primarily through branches and via internet.

D. NYSE/AMEX/NASDAQ Index

It is an index taken from the Center for Research in Security Prices (CRSP) which includes all common stocks listed on the NYSE, Amex, and NASDAQ National Market. The index is constructed by ranking all NYSE companies according to their market capitalization in the first place. They are then divided into 10 decile portfolios. Amex and NASDAQ stocks are then placed into the deciles based on NYSE breakpoints. The smallest and the largest firms based on market capitalization are placed into Decile 1 and Decile 10, respectively.

II. Data and Methodology

A. Data

Data for the three stocks and two decile indices in our study are all obtained from the Center for Research in Security Prices (CRSP) database on both a daily and a monthly basis from January 2000 to December 2005. Returns are then computed on both bases, generating a total of 1507 daily observations and 71 monthly observations. The NYSE/AMEX/NASDAQ Index is CRSP capitalisation-based, so that Deciles 1 and 10 represent the smallest and largest firms, respectively, based on market capitalisation. In addition, the Standard and Poor's 500 Index (S&P 500) is used as a proxy for the market index; it is a value-weighted index which incorporates the 500 largest stocks in the US market. For comparison purposes, both continuously compounded (log) returns and simple returns are reported, although the analysis is based on the former.

B. Methods

B.1. Autocorrelation Tests

One of the most intuitive and simple tests of a random walk is to test for serial dependence, i.e. autocorrelation. Autocorrelation is a time-series phenomenon: it is the correlation between values of a series separated by a given lag. The first-order autocorrelation, for instance, indicates to what extent neighbouring observations are correlated. The autocorrelation test is typically used to test RW3, a less restrictive version of the random walk model which allows dependent but uncorrelated increments in the return data. The autocorrelation at lag k is given by:

ρ(k) = Σ (t = k+1 to T) (r_t − r̄)(r_{t−k} − r̄) / Σ (t = 1 to T) (r_t − r̄)²    (1)

where ρ(k) is the autocorrelation at lag k; r_t is the log-return on the stock at time t; r_{t−k} is the log-return on the stock at time t − k; and r̄ is the sample mean log-return. A ρ(k) greater than zero indicates positive serial correlation, whereas a ρ(k) less than zero indicates negative serial correlation. Both positive and negative autocorrelation represent departures from the random walk model. If ρ(k) is significantly different from zero, the null hypothesis of a random walk is rejected.

The autocorrelation coefficients up to 5 lags for daily data and up to 3 lags for monthly data are reported in our test. Results of the Ljung-Box test for all lags up to those orders are also reported for both daily and monthly data. The Ljung-Box test is a more powerful test because it sums the squared autocorrelations, providing evidence on whether any departure from zero autocorrelation, in either direction, is observed at the lags up to a given order. The Q-statistic up to lag m is given by:

Q(m) = T(T + 2) Σ (k = 1 to m) ρ(k)² / (T − k)    (2)

where T is the number of observations.
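
In R, sample autocorrelations and the Ljung-Box statistic in (1) and (2) can be obtained with the base functions acf, pacf and Box.test. The following minimal sketch uses a simulated placeholder series rather than the paper's data:

set.seed(1)
ret <- rnorm(1507)                            # placeholder for a series of daily log-returns
acf(ret, lag.max = 5, plot = FALSE)           # autocorrelation coefficients up to lag 5
pacf(ret, lag.max = 5, plot = FALSE)          # partial autocorrelations up to lag 5
Box.test(ret, lag = 5, type = "Ljung-Box")    # joint Ljung-Box Q test up to lag 5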

B.2. Variance Ratio Tests

We follow Lo and MacKinlay's (1988) single variance ratio (VR) test in our study. The test is based on a very important property of the random walk: the variance of its increments is a linear function of the time interval. In other words, if the random walk holds, the variance of the qth-differenced series should be equal to q times the variance of the first-differenced series; for example, the variance of a two-period return should be equal to twice the variance of the one-period return. The variance ratio is accordingly defined as:

VR(q) = Var(r_t(q)) / (q · Var(r_t))    (3)

where q is any positive integer and r_t(q) = r_t + r_{t−1} + … + r_{t−q+1} is the q-period return. Under the null hypothesis of a random walk, VR(q) should be equal to one at all lags. If VR(q) is greater than one, there is positive serial correlation, which indicates persistence in prices, corresponding to the momentum effect. If VR(q) is less than one, there is negative serial correlation, which indicates reversals in prices, corresponding to a mean-reverting process.
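
A simple point estimate of VR(q) can be computed in R as shown below; this is only an illustrative, uncorrected sketch on placeholder data, and it omits the Lo and MacKinlay bias adjustments and the Z(q) and Z*(q) test statistics reported later:

vr <- function(ret, q) {
  rq <- stats::filter(ret, rep(1, q), sides = 1)   # overlapping q-period returns (rolling sums)
  var(rq, na.rm = TRUE) / (q * var(ret))           # ratio of variances; about 1 under a random walk
}
set.seed(1)
ret <- rnorm(1507)                                 # placeholder daily log-returns
sapply(c(2, 4, 8, 16), function(q) vr(ret, q))     # VR(q) at the lags used in Table VII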

Note that the above two tests are also tests of how stock prices react to publicly available information in the past. If market efficiency holds true, information from past prices should be immediately and fully reflected in the current stock price; therefore, the expected future price change conditioned on past prices should be equal to zero.

B.3. Griffin-Kelly-Nardari DELAY Tests

As defined by Griffin, Kelly and Nardari (2006), "delay is a measure of sensitivity of current returns to past market-wide information".[2] Put differently, delay measures how quickly stock returns react to market returns. The logic behind this is that a stock which is slow to incorporate market information is less efficient than a stock which responds quickly to market movements.

The S&P 500 index is employed in the delay test to examine the sensitivity of stock returns to market information. For each stock and decile index, both restricted and unrestricted models are estimated from January 2000 to December 2005. The unrestricted model is given by:

r_{i,t} = α_i + β_i R_{m,t} + Σ (n = 1 to N) δ_{i,n} R_{m,t−n} + ε_{i,t}    (4)

where r_{i,t} is the log-return on stock i at time t; R_{m,t} is the market log-return (the return on the S&P 500 index) at time t; R_{m,t−n} is the lagged market return; δ_{i,n} is the coefficient on the lagged market return; and the lag n runs from 1 to N = 4 for the daily data and from 1 to N = 3 for the monthly data. The restricted model sets all δ_{i,n} to zero:

r_{i,t} = α_i + β_i R_{m,t} + ε_{i,t}    (5)

Delay is then calculated from the adjusted R-squares of the above regressions as:

Delay_1 = 1 − adj.R²(restricted) / adj.R²(unrestricted)    (6)

An alternative scaled measure of delay, Delay_2, is also computed (equation (7)). Both measures are constructed so that the larger the calculated delay value, the more return variation is explained by lagged market returns and thus the more delayed the response to market information.
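
The Delay_1 measure in (6) can be reproduced from two regressions; the sketch below uses simulated placeholder series (not the paper's data) with four daily lags:

set.seed(1)
n_obs  <- 1507
rm_ret <- rnorm(n_obs)                                   # placeholder market (S&P 500) log-returns
lagm   <- function(x, k) c(rep(NA, k), head(x, -k))      # simple lag helper
ri     <- 0.6 * rm_ret + 0.3 * lagm(rm_ret, 1) +
          rnorm(n_obs, sd = 0.5)                         # placeholder stock returns with a delayed component
d <- data.frame(ri, rm_ret,
                l1 = lagm(rm_ret, 1), l2 = lagm(rm_ret, 2),
                l3 = lagm(rm_ret, 3), l4 = lagm(rm_ret, 4))
d <- d[complete.cases(d), ]                              # use the same sample for both models
unres <- lm(ri ~ rm_ret + l1 + l2 + l3 + l4, data = d)   # unrestricted model (4)
res   <- lm(ri ~ rm_ret, data = d)                       # restricted model (5)
1 - summary(res)$adj.r.squared / summary(unres)$adj.r.squared   # Delay_1 as in (6)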

III. Descriptive Statistics

A. Daily frequencies

Table I shows the summary statistics of daily returns for the three stocks and two decile indices. The highest mean return is for FARO (0.0012), whereas the lowest mean return is for NAN D10 (0.0000). In terms of median return, NAN D1 (0.0015) outperforms all the other stocks. Both the highest maximum return and the lowest minimum return (0.2998 and -0.2184, respectively) are for FARO, corresponding to its highest standard deviation (0.0485) among all, indicating that FARO is the most volatile in returns. On the other hand, both the lowest maximum return and the highest minimum return (0.0543 and -0.0675, respectively) are for NAN D10. However, NAN D10 is only the second least volatile, as the lowest standard deviation is for NAN D1 (0.0108). Figures 1 and 2 present the price levels of the most and least volatile index (stock). All the above observations remain true if we change from a log-return basis to a simple-return basis.

In terms of the degree of asymmetry of the return distributions, all stocks and indices are positively skewed, with the only exception of NAN D1. The positive skewness implies that more extreme values lie in the right tail of the distribution, i.e. stocks are more likely to have times when performance is extremely good. On the other hand, NAN D1 is slightly negatively skewed, which means that returns are more likely to be lower than what would be expected under a normal distribution. In measuring the "peakedness" of the return distributions, positive excess kurtosis is observed in all stocks and indices, also known as a leptokurtic distribution, which means that returns either cluster around the mean or disperse into the two tails of the distribution. All the above observations can be used to reject the null hypothesis that daily returns are normally distributed. What's more, results from the Jarque-Bera test provide supporting evidence for rejection of the normality hypothesis at all significance levels for all stocks and indices.
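
The skewness, kurtosis and Jarque-Bera statistic referred to above can be computed directly from a return series. A short sketch on placeholder data follows, using the standard Jarque-Bera formula JB = n/6 · [S² + (K − 3)²/4]:

set.seed(1)
ret <- rnorm(1507)                        # placeholder return series, not the paper's data
m <- mean(ret); s <- sd(ret); n <- length(ret)
S <- mean((ret - m)^3) / s^3              # sample skewness
K <- mean((ret - m)^4) / s^4              # sample kurtosis (3 for a normal distribution)
JB <- n / 6 * (S^2 + (K - 3)^2 / 4)       # Jarque-Bera statistic, approximately chi-squared(2)
pchisq(JB, df = 2, lower.tail = FALSE)    # p-value for the normality test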

B. Monthly frequencies

Descriptive statistics of monthly returns are likewise presented in Table II. Most of the above conclusions reached for daily returns are also valid in the context of monthly returns. In other words, what is the highest (lowest) value for daily returns is also the highest (lowest) for monthly returns in most cases. The only exceptions are for the highest value in median returns and the lowest value and standard deviation in minimum returns. In this situation, NAN D10 (0.0460) and FARO (0.1944) have the least and most dispersion according to their standard deviations, compared with NAN D1 and FARO in daily case. From above observation, we can see that decile indices are more stable than individual stocks in terms of returns. What’s more, monthly returns have larger magnitude in most values than daily returns.

Coming to the measurement of asymmetry and peakedness of return distributions, only NAN D10 (-0.4531) is negatively skewed. However, the degree of skewness is not far from 0. Other stocks and index are all positively skewed with both FEIC (0.0395) and LION (0.0320) having a skewness value very close to 0. Almost all stocks and index have a degree of kurtosis similar to that of normal distribution, except that NAN D1 (8.6623) is highly peaked. This is also consistent with the results of JB p-values, based on which we conclude that FEIC, LION and NAN D10 are approximately normal because we fail to reject the hypothesis that they are normally distributed at 5% or higher levels (see Figure 3 and 4 for reference). However when simple return basis is used, FEIC is no longer normally distributed even at the 1% significant level. Except this, using simple return produces similar results.

IV. Results

A. Autocorrelation Tests

A.1. Tests for Log-Returns

The results of autocorrelation tests for up to 5 lags of daily log-returns and up to 3 lags of monthly log-returns for three stocks and two decile indices from January 2000 to December 2005 are summarised in Table III. Both the autocorrelation (AC) and partial autocorrelation (PAC) are examined in our tests.

As is shown in Panel A, all 5 lags of FARO, FEIC and NAN D10 for both AC and PAC are insignificant at 5% level, except for the fourth-order PAC coefficient of FARO (-0.052), which is slightly negatively significant. On the contrary, NAN D1 has significant positive AC and PAC at almost all lags except in the fourth order, its PAC (0.050) is barely within the 5% significance level. The significant AC and PAC coefficients reject the null hypothesis of no serial correlation in NAN D1, thereby rejecting the weak-form efficiency. In terms of LION, significant negative autocorrelation coefficients are only observed in the first two orders and its higher-order coefficients are not statistically significant. Besides that, we find that all the stocks and indices have negative autocorrelation coefficients at most of their lags, with the only exception of NAN D1, whose coefficients are all positive. The strictly positive AC and PAC indicates persistence in returns, i.e. a momentum effect for NAN D1, which means that good or bad performances in the past tend to continue over time.

We also present the Ljung-Box (L-B) test statistic in order to see whether autocorrelation coefficients up to a specific lag are jointly significant. Since RW1 implies all autocorrelations are zero, the L-B test is more powerful because it tests the joint hypothesis. As is shown in the table, both LION and NAN D1 have significant Q values in all lags at all levels, while none of FARO, FEIC and NAN D10 has significant Q values.

Based on the above daily observations, we may conclude that the null hypothesis of no serial correlation is rejected at all levels for LION and NAN D1, but cannot be rejected at either the 5% or the 10% level for FARO, FEIC and NAN D10. This means that both LION and NAN D1 are weak-form inefficient. By looking at their past performance, we find that while NAN D1 outperformed the market in the sample period, LION performed badly in the same period. Therefore, it seems that stocks or indices with the best and worst recent performance have stronger autocorrelation. In particular, NAN D1 shows positive autocorrelation in returns, suggesting that market-wide indices with outstanding recent performance have momentum in returns over short periods, which offers predictable opportunities to investors.

When monthly returns are employed, no single stock or index has significant AC or PAC at any reported lag at the 5% level. This is in contrast with daily returns, and means that monthly returns follow a random walk better than daily returns. The more powerful L-B test confirms this conclusion by showing that the Q statistics for all stocks and indices are statistically insignificant at both the 5% and 10% levels. Therefore, the L-B null hypothesis of no autocorrelation cannot be rejected for any stock or index up to 3 lags. Compared with daily returns, monthly returns seem to follow a random walk better and are thus more weak-form efficient.

A.2. Tests for Squared Log-Returns

Even when returns are not correlated, their volatility may be correlated. Therefore, it is necessary for us to expand the study from returns to variances of returns. Squared log-returns and absolute value of log-returns are measures of variances and are thus useful in studying the serial dependence of return volatility. The results of autocorrelation analysis for daily squared log-returns for all three stocks and two decile indices are likewise reported in Table IV.

In contrast to the results for log-returns, coefficients for FEIC, LION, NAN D1 and NAN D10 are significantly different from zero, except for the fourth-order PAC coefficient (0.025) for FEIC, the fifth-order PAC coefficient for LION (-0.047) and the third- and fourth-order PAC coefficients for NAN D1 (-0.020 and -0.014, respectively). FARO has significant positive AC and PAC at the first lag and a significant AC at the third lag. The L-B test provides stronger evidence against the null hypothesis that the sum of the squared autocorrelations up to 5 lags is zero for all stocks and indices at all significance levels, based on which we confirm our result that squared log-returns do not follow a random walk. Another contrasting result with that of log-returns is that almost all the autocorrelation coefficients are positive, indicating a stronger positive serial dependence in squared log-returns.

In terms of monthly data, only FEIC and NAN D10 have significant positive third-order AC and PAC estimates. Other stocks and indices have coefficients not significantly different from zero. The result is supported by Ljung-Box test statistics showing that Q values are only statistically significant in the third lag for both FEIC and NAN D10. This is consistent with the result reached for log-returns above, which says that monthly returns appear to be more random than daily returns.

A.3. Tests for the Absolute Values of Log-Returns

Table V provides autocorrelation results for the absolute value of log-returns in similar manner. However, as will be discussed below, the results are even more contrasting than that in Table IV.

In Panel A, all the stocks and indices have significant positive serial correlation while insignificant PAC estimates are only displayed in lag 5 for both FARO and LION. Supporting above result, Q values provide evidence against the null hypothesis of no autocorrelation. Therefore, absolute value of daily log-returns exhibit stronger serial dependence than in Table III and IV, and autocorrelations are strictly positive for all stocks and indices. Coming to the absolute value of monthly log-returns, only FEIC displays significant individual and joint serial correlation. NAN D1 also displays a significant Q value in lag 2 at 5% level, but it is insignificant at 1% level.

Based on the above evidence, two consistent conclusions can be made at this point. First of all, by changing ingredients in our test from log-returns to squared log-returns and absolute value of log-returns, more positive serial correlation can be observed, especially in daily data. Therefore, return variances are more correlated. Secondly, monthly returns tend to follow a random walk model better than daily returns.

A.4. Correlation Matrix of Stocks and Indices

Table VI presents the correlation matrix for all stocks and indices. As is shown in Panel A for daily result, all of the correlations are positive, ranging from 0.0551 (LION-FARO) to 0.5299 (NAN D10-FEIC). Within individual stocks, correlation coefficients do not differ a lot. The highest correlation is between FEIC and FARO with only 0.1214, indicating a fairly weak relationship between individual stocks returns. However, in terms of stock-index relationships, they differ drastically from 0.0638 (NAN D10-FARO) to 0.5299 (NAN D10-FEIC). While the positive correlation implies that the three stocks follow the indices in the same direction, the extent to which they will move with the indices is quite different, indicating different levels of risk with regard to different stock. Finally, we find that the correlation between NAN D10 and NAN D1 is the second highest at 0.5052.

Panel B provides the correlation matrix for monthly data. Similar to results for daily data, negative correlation is not observed. The highest correlation attributes to that between NAN D10 and FEIC (0.7109) once again, but the lowest is between LION and FEIC (0.1146) this time. Compared with results in Panel A, correlation within individual stocks is slightly higher on average. The improvement in correlation is even more obvious between stocks and indices. It implies that stock prices can change dramatically from day to day, but they tend to follow the movement of indices in a longer horizon. Finally, the correlation between two indices is once again the second highest at 0.5116, following that between NAN D10 and FEIC. It is also found that the correlation between indices improves only marginally when daily data are replaced by monthly data, indicating a relatively stable relationship between indices.
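
Correlation matrices such as those in Table VI can be reproduced with R's cor function; the sketch below uses simulated placeholder return columns rather than the paper's data:

set.seed(1)
rets <- data.frame(FARO = rnorm(1507), FEIC = rnorm(1507), LION = rnorm(1507),
                   NAN_D1 = rnorm(1507), NAN_D10 = rnorm(1507))   # placeholder daily log-returns
round(cor(rets), 4)                                               # pairwise correlation matrix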

B. Variance Ratio Tests

The results of variance ratio tests are presented in Table VII for each of the three stocks and two decile indices. The test is designed to test for the null hypothesis of a random walk under both homoskedasticity and heteroskedasticity. Since the violation of a random walk can result either from changing variance, i.e. heteroskedasticity, or autocorrelation in returns, the test can help to discriminate reasons for deviation to some extent. The lag orders are 2, 4, 8 and 16. In Table VII, the variance ratio (VR(q)), the homoskedastic-consistent statistics (Z(q)) and the heteroskedastic-consistent statistics (Z*(q)) are presented for each lag.

As is pointed out by Lo and MacKinlay (1988), the variance ratio statistic VR(2) is equal to one plus the first-order correlation coefficient. Since all the autocorrelations are zero under RW1, VR(2) should equal one. The conclusion can be generalised further to state that for all q, VR(q) should equal one.

According to the first Panel in Table VII, of all stocks and indices, only LION and NAN D1 have variance ratios that are significantly different from one at all lags. Therefore, the null hypothesis of a random walk under both homoskedasticity and heteroskedasticity is rejected for LION and NAN D1, and thus they are not weak-form efficient because of autocorrelations. In terms of FARO, the null hypothesis of a homoskedastic random walk is rejected, while the hypothesis of a heteroskedastic random walk is not. This implies that the rejection of random walk under homoskedasticity could partly result from, if not entirely due to heteroskedasticity. On the other hand, both FEIC and NAN D10 follow random walk and turn out to be efficient in weak form, corresponding exactly to the autocorrelation results reached before in Table III.

Panel B shows that when monthly data are used, the null hypothesis under both forms of random walk can only be rejected for FARO. As for FEIC, the random walk null hypothesis is rejected under homoskedasticity but not under heteroskedasticity, indicating that the rejection may be driven by changing variances rather than by autocorrelation, since Z*(q) is heteroskedasticity-consistent.

As is shown in Panel A for daily data, all individual stocks have variance ratios less than one, implying negative autocorrelation. However, the autocorrelation for stocks is statistically insignificant except for LION. On the other hand, variance ratios for NAN D1 are greater than one and increasing in q. The above finding provides supplementary evidence to the results of autocorrelation tests. As Table III shows, NAN D1 has positive autocorrelation coefficients in all lags, suggesting a momentum effect in multiperiod returns. Both findings appear to be well supported by empirical evidence. While daily returns of individual stocks seem to be weakly negatively correlated (French and Roll (1986)), returns for best performing market indices such as NAN D1 show strong positive autocorrelation (Campbell, Lo, and MacKinlay (1997)). The fact that individual stocks have statistically insignificant autocorrelations is mainly due to the specific noise contained in company information, which makes individual security returns unpredictable. On the contrary, while the positive serial correlation for NAN D1 violates the random walk, such deviation provides investors with confidence to forecast future prices and reliability to make profits.

C. Griffin, Kelly and Nardari DELAY Tests

The results of delay test for the three stocks and two decile indices over the January 2000 to December 2005 period are summarised in Table VIII. We use lag 1, 2, 3, 4 for the daily data and 1, 2, 3 for the monthly data.

As is presented in Panel A for daily returns, the Delay_1 value for NAN D10 is close to zero and hence not significant, while NAN D1 has the highest delay among all stocks and indices. The ranking of delay within individual stocks suggests a positive relationship between size and delay, in that the delay of LION, the stock with the smallest market capitalization, is lowest, while the delay of FEIC, the stock with the largest market capitalization, is highest. This seems to contradict the Griffin, Kelly and Nardari (2006) study, which finds an inverse relationship between size and delay. One possible explanation is that delay calculated from daily data on individual firms is noisy.

The scaled measure Delay_2 produces a consistent conclusion but with higher magnitudes in value. Delay_2 values are very different from zero for FARO, FEIC, LION and NAN D1. The largest increase is seen in FARO, from 0.0067 for Delay_1 to 0.7901 for Delay_2. Therefore, the unscaled Delay_1 measure is preferable, because the scaled version can result in large values without economic significance.

As is displayed in Panel B, employing monthly data also leads to higher Delay_1 values, indicating that more variation of monthly returns are captured by lagged market returns and hence monthly returns are not as sensitive as daily returns to market-wide news. However, an inverse relationship is found this time between delay and market value of individual stocks. Therefore, monthly data provides consistent result to support Griffin, Kelly and Nardari (2006) result as one would normally expect larger stocks to be more efficient in responding to market. Similar to the result for daily data, scaled measure once again produces higher values than its alternative but it provides the same results.

V. Conclusion

The main objective of this paper is to test weak-form efficiency in the U.S. market. As is found by selected tests, NAN D10 and FEIC provide the most consistent evidence to show weak-form efficiency, while the deviation from random walk is suggested for other stocks and indices, especially for NAN D1 and LION. It indicates that security returns are predictable to some degree, especially for those having best and worst recent performance.

The three autocorrelation tests provide different results in terms of daily returns. While the null hypothesis of random walk is rejected for NAN D1 and LION based on log-returns, it is rejected for all stocks and indices based on both squared and absolute value of log-returns, indicating that return variances are more correlated. On the other hand, results in the context of monthly returns are consistent. Monthly returns follow a random walk much better than daily returns in all three tests. Most evidently, the autocorrelation test fails to reject the presence of random walk for all stocks and indices when monthly log-returns are employed.

The variance ratio tests provide supportive evidence for autocorrelation tests. Both tests find that in terms of daily return, NAN D1 and LION show a significant return dependence. In particular, variance ratios for NAN D1 are all above one, corresponding to its positive AC and PAC coefficients, thus implying positive autocorrelation in returns. What’s more, individual stocks have variance ratios less than one with FEIC and FARO both being insignificant. The above evidence conclusively suggest that while individual stock returns are weakly negatively related and difficult to predict, market-wide indices with outstanding recent performance such as NAN D1 tend to show a stronger positive serial correlation and thus offer predictable profit opportunities.

The evidence regarding delay tests is consistent with earlier findings to a large extent. NAN D1 has highest delay in both daily and monthly cases, implying an inefficient response to market news. In the context of monthly log-returns, delay values for individual stocks rank inversely based on market capitalisation with larger cap stocks having lower delay, suggesting that small stocks do not capture past public information quickly and are thus inefficient.

Finally, deviation from a random walk model, and thus weak-form inefficiency, is not necessarily bad. In fact, a certain degree of predictability may simply be the reward investors require for bearing risk. Future research could therefore incorporate risk explicitly into the model.

[1] Company information is mainly obtained from Thomson One Banker database.

[2] Griffin, John M., Patrick J. Kelly, and Federico Nardari, 2006, Measuring short-term international stock market efficiency, Working Paper

Multivariate Multilevel Modeling

Literature Review

This chapter draws together studies related to modeling responses multivariately in a multilevel framework. It begins by laying out the recent history of univariate techniques for analyzing categorical data in a multilevel context, then presents the literature on fitting multivariate multilevel models for categorical and continuous data. Finally, the chapter reviews the evidence on imputing missing values for partially observed multivariate multilevel data sets.

The Nature of Multivariate Multilevel models

A multivariate multilevel model can be regarded as a model for multiple dependent variables observed within a hierarchical structure. Although multivariate analysis increases the complexity of a multilevel model, it is an essential tool because it allows a single test of the joint effects of a set of explanatory variables on several dependent variables (Snijders & Bosker, 2000). These models also increase the construct validity of analyses of complex real-world concepts. Consider a study of school effectiveness measured on three output variables: math achievement, reading proficiency and well-being at school. These data are collected on students who are clustered within schools, implying a hierarchical structure. Although it is certainly possible to analyze the three outcomes separately, such an analysis cannot show the overall picture of school effectiveness. Multivariate analysis is therefore preferable in scenarios of this kind, since it decreases the Type I error rate and increases statistical power (Maeyer, Rymenans, Petegem and Bergh, draft).

The hierarchical structure of a multivariate model is not the same as that of a univariate response model. The example above may look like a two-level model, but in fact it has three levels: the measurements are the level-1 units, the students the level-2 units and the schools the level-3 units.
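The stacked-response formulation described above can be sketched in R with the lme4 package; the data, variable names and model below are purely illustrative and are not the MLwiN formulation used later in this thesis.

library(lme4)

# Hypothetical wide data: one row per student, three outcomes, a school id.
set.seed(2)
wide <- data.frame(
  student = 1:300,
  school  = rep(1:30, each = 10),
  math    = rnorm(300, 50, 10),
  reading = rnorm(300, 52, 9),
  wellbeing = rnorm(300, 48, 8)
)

# Stack the outcomes: the stacked rows form the extra "measurement" level
# (level 1 = measurements, level 2 = students, level 3 = schools).
long <- reshape(wide, direction = "long",
                varying = c("math", "reading", "wellbeing"),
                v.names = "score", timevar = "outcome",
                times = c("math", "reading", "wellbeing"))
long$outcome <- factor(long$outcome)
long$student <- factor(long$student)
long$school  <- factor(long$school)

# Separate fixed intercepts per outcome; a student-level random intercept
# induces correlation among a student's outcomes, and school effects are
# allowed to differ by outcome.  (With purely simulated noise the variance
# components will be near zero; real data would not be.)
fit <- lmer(score ~ 0 + outcome + (1 | student) + (0 + outcome | school),
            data = long)
summary(fit)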

Importance of Multivariate Multilevel Modeling

Multivariate multilevel data structures present additional complexity, because multilevel effects must be considered together with the multivariate context. Traditional statistical techniques handle such data poorly: ignoring the hierarchical structure reduces statistical efficiency and can produce overestimated standard errors, while violating the independence assumption can lead to underestimated standard errors of regression coefficients. Multivariate multilevel approaches address these problems by allowing variation at different levels to be estimated. Furthermore, Goldstein (1999) has shown that accounting for clustering yields accurate standard errors, confidence intervals and significance tests.

A considerable number of articles have been published on multilevel modeling in the single-response context, whereas the multivariate multilevel approach has entered the field of statistics only in recent years. When the aim is to identify the effect of a set of explanatory variables on a set of dependent variables, and the effects differ considerably across the response variables, the problem can only be handled properly by means of a multivariate analysis (Snijders & Bosker, 2000).

Software for Multivariate Multilevel Modeling

In past decades, because software for fitting multivariate multilevel models was unavailable, some researchers resorted to manual methods such as the EM algorithm (Kang et al., 1991). As the technical environment developed, software such as Stata, SAS and S-Plus emerged with facilities for handling multilevel data, but none of these packages could fit multivariate multilevel models. There is evidence in the literature that nonlinear multivariate multilevel models can be fitted using packages such as GLLAMM (Rabe-Hesketh, Pickles and Skrondal, 2001) and aML (Lillard and Panis, 2000), but this software was not flexible to work with.

The MLwiN software, which had been under development since the late 1980s, was therefore extended at the University of Bristol in the UK to meet this requirement. However, the use of MLwiN for fitting multivariate multilevel models was challenged by Goldstein, Carpenter and Browne (2014), who concluded that MLwiN was useful only when the model could be fitted without imputing missing values. The REALCOM software subsequently provided the flexibility to impute missing values within the MLwiN environment.

MLwiN is a successor to the command-driven DOS program MLn. It offers the flexibility to fit very large and complex models using both frequentist and Bayesian estimation, together with missing-value imputation, in a user-friendly interface, and it includes several advanced features that are not available in other packages.

Univariate Multilevel Modeling vs. Multivariate Multilevel Modeling

In practice, data are often collected on multiple correlated outcomes. A major issue that has dominated the field for many years is the practice of modeling the association between risk factors and each outcome in a separate model. This can be statistically inefficient, since it ignores correlations among the outcomes and common predictor effects (Oman, Kamal and Ambler, unpublished).

Many researchers therefore prefer to include all related outcomes in a single regression model within a multivariate outcome framework rather than fitting several univariate models. Recent comparisons between univariate and multivariate approaches have shown that a multivariate model is generally preferable to several univariate models.

Griffiths, Brown and Smith (2004) conducted a study comparing univariate and multivariate multilevel models for repeated measures of antenatal care use in Uttar Pradesh, India. They examined many factors that might be related to a mother's decision to use antenatal care services for a particular pregnancy, comparing univariate multilevel logistic regression models with a multivariate multilevel logistic regression model. Fitting the univariate models violated model assumptions and did not yield stable parameter estimates, so after performing the analysis they preferred the multivariate approach.

Generalized Cochran-Mantel-Haenszel Tests for Checking Association in Multilevel Categorical Data

The concepts underlying the generalized Cochran-Mantel-Haenszel test date back to the late 1950s. Cochran (1958) first introduced a test for the independence of multiple 2 × 2 tables by extending the chi-square test of independence for a single two-way table; each table is defined by one or two additional stratifying variables that capture the multilevel structure. His test statistic is based on the row totals of each table, under the assumption that the cell counts follow a binomial distribution.

Extending Cochran's work, Mantel and Haenszel (1959) conditioned on both row and column totals, assuming that the cell counts of each table follow a hypergeometric distribution. Because the Cochran-Mantel-Haenszel (CMH) statistic is limited to binary data, Landis et al. (1978) generalized the test to handle variables with more than two levels. A major drawback of the generalized Cochran-Mantel-Haenszel (GCMH) test remained, however: it cannot handle clustered, correlated categorical data. Liang (1985) proposed a test statistic to overcome this problem, but that statistic had serious problems of its own and did not come into use.

As the field developed, the need arose for a test statistic capable of handling correlated data and variables with more than two levels. Zhang and Boos (1995) introduced three test statistics, TEL, TP and TU, as a solution to these problems. Of the three, TP and TU are preferred to TEL, since they use individual subjects as the primary sampling units whereas TEL uses the strata as the primary sampling unit (De Silva and Sooriyarachchi, 2012).

Furthermore, in a simulation study TP shows better performance than TE, maintaining its error rates even when the strata are small, and it uses pooled variance estimators. This provides a rationale for selecting TP as the most suitable statistic for the present study. De Silva and Sooriyarachchi (2012) developed an R program to carry out this test.
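Base R provides the classical Cochran-Mantel-Haenszel test through mantelhaen.test(); the generalized statistics TP and TU for clustered data are not part of base R, so the sketch below, with hypothetical counts, shows only the classical test as a baseline.

# Classical CMH test on K strata of 2 x 2 tables (hypothetical counts:
# treatment by response, stratified by three centres).
tab <- array(c(20, 10, 15, 25,
               18, 12, 14, 26,
               22,  8, 16, 24),
             dim = c(2, 2, 3),
             dimnames = list(Treatment = c("A", "B"),
                             Response  = c("Yes", "No"),
                             Centre    = 1:3))

mantelhaen.test(tab)                    # test of conditional independence
mantelhaen.test(tab, correct = FALSE)   # without continuity correction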

Missing Value Imputation in Multivariate Multilevel Framework

Missing values arise frequently in real-world datasets, which typically contain little or no information about the missing data mechanism (MDM). Modeling incomplete data is therefore difficult and may produce biased results, so a proper mechanism is needed to characterize the missingness. Rubin (1976) classified the ways in which missingness can arise into three types: missing at random (MAR), missing completely at random (MCAR) and missing not at random (MNAR). According to Sterne et al. (2009), missing-value imputation is appropriate under the assumption of missing at random; it can also be carried out when data are missing completely at random. Nowadays most statistical packages can help identify the type of missingness.

Once the type of missingness has been identified, missing-value imputation can be carried out, and this requires suitable software. Imputation that respects a hierarchical structure is more advanced and cannot be done with the standard routines in packages such as SPSS, SAS or R, so Carpenter et al. (2009) developed the REALCOM software for this task. REALCOM did not at first handle multilevel data in a multivariate context, however, and macros for this case were recently developed by the University of Bristol team.
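A minimal sketch of multiple imputation under a MAR assumption, using the R package mice with hypothetical data; note that mice's default methods are single-level, so imputation that respects the multilevel structure (as in REALCOM) would require specialised methods.

library(mice)

# Hypothetical single-level data with values assumed missing at random.
set.seed(3)
df <- data.frame(x = rnorm(200), z = rnorm(200))
df$y <- 1 + 0.5 * df$x - 0.3 * df$z + rnorm(200)
df$y[sample(200, 40)] <- NA              # make 40 outcome values missing

md.pattern(df)                            # inspect the missing-data pattern

imp  <- mice(df, m = 5, seed = 3)         # five imputed data sets
fits <- with(imp, lm(y ~ x + z))          # analyse each completed data set
pool(fits)                                # combine estimates by Rubin's rules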

Estimation Procedure

Estimation procedures for multilevel modeling date from the late 1980s. For maximum likelihood estimation of the parameters, early researchers used an iterative procedure, the EM algorithm (Raudenbush, Rowan and Kang, 1991); the program HLM was later developed to implement it.

The most widely used procedures for estimating multivariate multilevel models with normal responses are Iterative Generalized Least Squares (IGLS), Reweighted IGLS (RIGLS) and Marginal Quasi-Likelihood (MQL), while for discrete responses the main options are MQL and Penalized Quasi-Likelihood (PQL). According to Rasbash, Steele, Browne and Goldstein (2004), all of these methods are implemented in MLwiN, using first- or second-order Taylor series expansions. However, being likelihood-based frequentist approximations, these methods tend to overestimate precision.

More recently, methods implemented in a Bayesian framework using Markov chain Monte Carlo (MCMC) (Brooks, 1998) have also been used for parameter estimation, with the added ability to use informative prior distributions. The MCMC estimates produced in MLwiN are consistent, although they require a large number of simulations to deal with highly correlated chains.
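As a rough R analogue of these estimation options (MLwiN's IGLS and MCMC routines are not reproduced here), the sketch below fits the same hypothetical two-level binary model by PQL (MASS::glmmPQL) and by Laplace-approximation maximum likelihood (lme4::glmer); the data are simulated for illustration only.

library(MASS)    # glmmPQL: penalised quasi-likelihood (PQL) estimation
library(lme4)    # glmer: Laplace-approximation maximum likelihood

# Hypothetical two-level binary data: pupils nested in schools.
set.seed(4)
n_school <- 40
n_pupil  <- 25
school <- rep(1:n_school, each = n_pupil)
u      <- rnorm(n_school, 0, 0.6)[school]              # school random effects
x      <- rnorm(n_school * n_pupil)
y      <- rbinom(n_school * n_pupil, 1, plogis(-0.5 + 0.8 * x + u))
dat    <- data.frame(y, x, school = factor(school))

# Quasi-likelihood estimation, in the spirit of the MQL/PQL methods above
fit_pql <- glmmPQL(y ~ x, random = ~ 1 | school, family = binomial, data = dat)

# Laplace maximum likelihood as a point of comparison
fit_ml <- glmer(y ~ x + (1 | school), family = binomial, data = dat)

summary(fit_pql)
summary(fit_ml)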

Previous research conducted using Univariate and Multivariate Multilevel Models

Univariate multilevel logit models

Before turning to the literature on multivariate multilevel analysis, it is also necessary to consider the literature on univariate multilevel analysis, since this thesis fits several univariate multilevel models before fitting multivariate multilevel models.

Over the past decades, many social scientists have applied multilevel models to binary data, so it is important to review how this work was carried out with the more limited technology of the time. Guo and Zhao (2000) reviewed the methodologies, hypothesis testing and hierarchical data structures in the earlier literature, and used two examples to support their conclusions. They first compared estimates obtained from the MQL and PQL methods implemented in MLn with the GLIMMIX method implemented in SAS. They showed that the differences between PQL-1 and PQL-2 are small when fitting binary logistic models, and that PQL-1, PQL-2 and GLIMMIX are likely to be satisfactory for most of the past studies undertaken in the social sciences.

Noortgate, Boeck and Meulders (2003) used multilevel binary logit models to analyze Item Response Theory (IRT) data, assessing nine achievement targets for reading comprehension among primary school students in Belgium. They performed the multilevel analyses using cross-classified logistic multilevel models, with the GLIMMIX macro in SAS as well as the MLwiN software. They encountered convergence problems with the PQL methods in MLwiN and therefore used SAS to carry out the analysis. They also showed that the cross-classified multilevel logistic model is very flexible for handling IRT data and that the parameters can still be estimated even with unbalanced data.
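A cross-classified logistic multilevel model of the kind used by Noortgate, Boeck and Meulders can be sketched in R with lme4; the item-response data below are simulated and the model is only a minimal Rasch-type illustration, not their exact specification.

library(lme4)

# Hypothetical item-response data: every pupil answers every item, so pupils
# and items are crossed rather than nested (a cross-classified structure).
set.seed(5)
n_pupil <- 200
n_item  <- 20
dat <- expand.grid(pupil = factor(1:n_pupil), item = factor(1:n_item))
ability    <- rnorm(n_pupil, 0, 1)
difficulty <- rnorm(n_item, 0, 0.8)
p <- plogis(ability[dat$pupil] - difficulty[dat$item])
dat$correct <- rbinom(nrow(dat), 1, p)

# Rasch-type model fitted as a cross-classified logistic multilevel model
fit <- glmer(correct ~ 1 + (1 | pupil) + (1 | item),
             family = binomial, data = dat)
summary(fit)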

Multivariate Multilevel Models

In the past two decades only a few studies have fitted multivariate multilevel models to real-world problems, and almost all of them focus on the educational and socio-economic sectors; none address medical applications. Given the lack of multivariate multilevel analysis in the health and medical sciences, this chapter therefore reviews the literature on multivariate multilevel models in other fields.

Among studies in education, Xin Ma (2001) examined the association between academic achievement and student background in Canada using three levels of interest, developing a three-level Hierarchical Linear Model (HLM) for this purpose. This work allowed him to conclude that both students and schools were differentially successful in different subject areas, and that this was more evident among students than among schools. The success of the study, however, rests on some strong assumptions about the priors for students' cognitive skills.

Outside the field of education, Raudenbush, Johnson and Sampson (2003) carried out a study in Chicago examining criminal behavior at both the person level and the neighborhood level with respect to a set of personal characteristics. For this purpose they used a Rasch model with random effects, assuming conditional independence and additivity.

Yang, Goldstein, Browne and Woodhouse (2002) developed a multivariate multilevel analysis of examination results via a series of models of increasing complexity. They used results from two mathematics examinations in England in 1997 and analyzed them at the individual and institutional levels with respect to a set of student characteristics. Starting from a simpler multivariate normal model without institutional random effects, they gradually increased the complexity by adding the institutional level alongside the multivariate responses. On closer inspection, their work shows that the choice of subject is strongly associated with performance.

As applications of multivariate multilevel models have grown, researchers have applied them in other fields such as forestry. Hall and Clutter (2004) presented a study modeling growth and yield in forestry based on slash pine in the U.S.A. They developed a methodology for fitting nonlinear mixed-effects models in a multivariate multilevel framework in order to identify the effects of several plot-level timber quantity characteristics on timber volume yield.

They also developed a methodology for producing predictions and prediction intervals from those models, and used it to predict timber growth and yield at the plot, individual and population levels.

Grilli and Rampichini (2003) modeled ordinal response variables using student rating data obtained from a survey of course quality carried out by the University of Florence in the 2000-2001 academic year. They developed an alternative specification of the multivariate multilevel probit model for ordinal responses, relying on the fact that the responses can be viewed as an additional dummy bottom-level variable. However, they had not yet assessed the efficiency of this method, since it had not been implemented in standard software.

Among more recent applications, Goldstein and Kounali (2009) conducted a study of childhood growth using a collection of growth measurements and adult characteristics. They extended the latent normal model for multilevel data with mixed response types to ordinal categorical responses with multiple categories. Since the data consisted of counts, they first developed a model assuming a Poisson distribution; because the data did not follow a Poisson distribution exactly, they then treated the counts as ordered categories to overcome the problem.

Frank, Cerda and Rendon (2007) studied whether residential location affects the health-risk behaviors of Latino immigrants, a population that is growing substantially every year. They used a multivariate multilevel Rasch model for data from the Los Angeles Family and Neighborhood Survey, based on two indices of health-risk behaviors covering drug use and participation in risky activities. They began by modeling the behavior of adolescents as a function of both individual and neighborhood characteristics, and found that increased health-risk behaviors are associated with above-average concentrations of Latinos and poverty, particularly for those born in the U.S.A.

Another application of multivariate multilevel models was carried out by Subramanian, Kim and Kawachi (2005) in the U.S.A. Their main aim was to identify the individual- and community-level factors associated with the health and happiness of individuals. They performed a multivariate multilevel regression analysis on data obtained from a survey conducted in 2000, and their findings indicate that poor health and unhappiness are strongly related to the individual-level covariates.

The available literature thus contains a number of studies in education and the social sciences in various countries, but none concerning the health and medical sciences. It is therefore worthwhile to analyze the mortality rates of some major fatal diseases of worldwide importance, in order to understand the risk factors and patterns associated with them and to provide better insights for the public and for the responsible policy makers.

Introduction to Simple Linear Regression: Article Review

Simple Linear Regression

Introduction to simple linear regression: Article review

Abstract

Linear regression is used to predict a trend in data, or to predict the value of a dependent variable from the value of an independent variable, by fitting a straight line through the data. Dallal (2000) examined how significant the linear regression equation is, how to use it to draw the best-fitting line on a scatter plot, and why the best-fitting line matters.

Introduction to simple linear regression: Article review

Linear regression is used to predict a trend in data, or to predict the value of a dependent variable from the value of an independent variable, by fitting a straight line through the data. Linear regression represents the connecting link between the independent (carrier) variable and the dependent (response) variable which, when graphed on X and Y coordinates, results in a straight line. The regression line is the straight line that best represents, or predicts, the value of the response variable, given the observed value of the carrier variable (Frey, 2006). This essay reviews the article "Introduction to simple linear regression" by Dallal (2000).

Problem statement

Dallal (2000) assumed a relationship between body mass (the independent, or carrier, variable) and muscle strength (the dependent, or response, variable): the greater the body mass, the greater the muscle strength. This relationship is not without exceptions, however, which is reflected in the scatter plot of the regression model. The author therefore posed the question of how to draw the straight line that most accurately portrays the data, or predicts the value of the response variable.

Research purpose statement

In the given example, many cases appear to fit the regression well. However, standardizing the procedure for fitting a straight line is necessary to provide better communication and common ground for analysts working on the same data. Further, from the example regression equation given (Strength = -13.971 + 3.016 LBM [Lean Body Mass]), one can draw two conclusions: first, the predicted muscle strength equals 3.016 times LBM minus 13.971; second, the difference in muscle strength between two individuals is estimated as 3.016 multiplied by the difference in their LBM.
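To illustrate, the reported equation can be used directly for prediction, and the same model form can be fitted with R's lm(); the data below are simulated for illustration only.

# Using the reported equation directly for prediction
predict_strength <- function(lbm) -13.971 + 3.016 * lbm
predict_strength(50)                           # predicted strength at LBM = 50
predict_strength(51) - predict_strength(50)    # difference per unit of LBM = 3.016

# Fitting the same model form with lm() on simulated data
set.seed(6)
lbm <- runif(40, 35, 75)
strength <- -13.971 + 3.016 * lbm + rnorm(40, 0, 8)
fit <- lm(strength ~ lbm)
coef(fit)                                      # estimated intercept and slope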

Research questions

Research question 1: Why do we need to fit a regression equation to a set of data?

It is clear from the previous example that there are two reasons for fitting a regression equation to a set of data: 1) to describe the data, and 2) to predict the dependent (response) variable from the independent (carrier) one.

Research question 2: What is the underlying principle of calculating a straight line?

If the data points in a scatter plot lie close to a line, the line represents, or gives a good fit to, the data. If not, then the line with the points closer to it than to any other line is the one that gives the best fit. Further, if the line is used to predict values, these predictions should be close to the observed ones; in other words, the residuals (observed values minus predicted values) should be small.

Research question 3: How is the linear regression (least squares) equation used to determine the best-fitting line?

The criterion used, as the name implies, is that the sum of squared residuals (observed minus predicted values) is minimal for the best-fitting line. This applies to a line fitted to a set of sample data in order to generalize to the population from which the sample was taken. For the population there is a slightly different linear regression equation, Yi = β0 + β1Xi + εi, which shows that the output (dependent) variable on the Y-axis is predicted from the input (independent) variable on the X-axis after adding a random error term (εi).
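A small sketch of the least-squares criterion on simulated data: the line fitted by lm() has a smaller sum of squared residuals than any perturbed line.

# The fitted line minimises the sum of squared residuals (observed - predicted).
set.seed(7)
x <- rnorm(30)
y <- 2 + 1.5 * x + rnorm(30)

fit <- lm(y ~ x)
ssr_fit <- sum(residuals(fit)^2)

# Any other line, e.g. one with a slightly different slope, does worse:
other <- coef(fit) + c(0, 0.3)
ssr_other <- sum((y - (other[1] + other[2] * x))^2)

c(least_squares = ssr_fit, perturbed_line = ssr_other)   # ssr_fit is smaller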

Research question 4: Is the sample regression equation an accurate estimate of the population regression equation?

There is a reservation about accepting this statement, which concerns the confidence bands around the regression line. They are analogous to the standard error of the mean (the standard deviation of the sampling distribution of the mean), with one difference: the uncertainty about the predicted mean of the dependent variable grows as the carrier variable moves further from its mean.
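A short illustration with simulated data: predict() with interval = "confidence" shows that the band is narrowest at the mean of the carrier variable and widens away from it.

# Confidence bands are narrowest at the mean of the carrier variable.
set.seed(8)
lbm <- runif(40, 35, 75)
strength <- -14 + 3 * lbm + rnorm(40, 0, 8)
fit <- lm(strength ~ lbm)

new <- data.frame(lbm = c(mean(lbm), mean(lbm) + 15))
predict(fit, newdata = new, interval = "confidence")
# The first interval (at the mean of lbm) is narrower than the second.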

Sources of data

The data Dallal (2000) used, described in the second part of his article (linked to the main article), are cross-sectional. Data of this type have the advantage of being usable when the sampling method is unweighted and/or unstratified, and can also be used if the researcher is concerned only with small probabilities. Longitudinal data yield more statistical power; however, in repeated cross-sectional analysis, the new subjects added in each round compensate for the inherently lower statistical power (Yee and Niemeier, 1996).

Data collection strategies and methods

A good data collection strategy should have two objectives. The first is having motivated respondents, which is affected by the time required, trust in statistics, the difficulty of the questionnaire and the perceived benefit. The second is obtaining high-quality data, which depends on the individuals sampled, the sampling method and good data collection instruments (Statistics Norway, 2007).

There are many methods of data collection, and the choice of a particular method depends on the available resources, reliability, the resources for analysis and reporting, and the skills and knowledge of the analyst. These methods include case studies, behavior observation checklists, attitude and opinion surveys, and questionnaires distributed by mail, e-mail or phone. Other methods include time series (evaluating one variable over a period of time, such as a week) and individual or group interviews (The Ohio State University Bulletin Extension, 2005).

Conclusions

Dallal (2000) concluded that simple linear regression allows us to predict a dependent variable from an independent one without starting from scratch each time new information is added. The regression line is important because it makes the estimation of the dependent variable more accurate and allows the response variable to be estimated for individuals whose carrier-variable values are not included in the data. The author also noted that a variable can be predicted either within the range of values of the independent variable in the given sample (interpolation) or outside this range (extrapolation). He recommended the first method, as it has the advantage of being safer, while raising concerns about how to demonstrate the linearity of the relationship between the two variables.

References

Dallal, G. (2000). Introduction to simple linear regression. Retrieved January 14, 2008, from http://www.tufts.edu/~gdallal/slr.htm.

Frey, B. (2006). Statistics Hacks. Sebastopol, CA: O’Reilly Media Inc.

Statistics Norway (2007). Strategy for data collection. Retrieved 04/07/2008, from http://www.ssb.no/vis/english/about_ssb/strategy/strategy_data_collection.pdf

The Ohio State University (2005). Bulletin Extension – Step Four: Methods of Data Collection. Retrieved 04/07/2008, from http://www.ohioline.ag.ohio-state.edu

Yee, J. L. and Niemeier, D. (1996). Advantages and Disadvantages: Longitudinal vs. Repeated Cross-Section Surveys - A Discussion Paper. Project Battelle, 94, 16-22.

Japanese traditional game

Introduction

Given the task of innovating on a Japanese traditional game, we decided to base our game on Two – Ten Jack and create our own, much simpler, version. It uses part of an Uno deck and a board with numbers on which bets are placed. To win prizes, a player must place a bet on either the same number or the same color as a card drawn from the deck in play. The original Two – Ten Jack is played without a dealer, with points added and deducted so that the player with the highest point balance at the end wins. The following pages present the manuals for both games, followed by a comparison showing how our game was developed from the original.

The Game Manual for the Two – Ten Jack
Preliminaries

The object of two-ten-jack is to get the most points by taking tricks containing positive point cards while avoiding tricks containing negative point cards.

Two players receive six cards each from a standard 52-card deck (ranking 0 1 2 3 4 5 6 7 8 9), and the remaining undealt cards are placed between the players to form the stock. The non-dealer leads the first trick, and the winner of each trick leads the next. Players replenish their hands between tricks by each drawing a card from the stock, the winner of the last trick drawing first. Play continues until all of the cards in the deck have been played. Points are then tallied before the deck is reshuffled and dealt anew.

Following, Trumping, and Speculation

In two-ten-jack a player may lead any card, and the other player must play a card of the same suit if able, or otherwise must play a trump card if able. If a player has no card in either the lead suit or trumps, any other card may be played. The highest trump card, or the highest card of the lead suit if no trumps were played, takes the trick.

In two-ten-jack hearts are always the trump suit, and the ace of spades is a special trump card known as speculation, ranking above all of the hearts. Rules for playing speculation are as follows:

If a trump (heart) is led, a player may follow with speculation and must play speculation if no other trumps are held in the hand.
If a spade is led, a player may follow with speculation and must likewise play speculation if no other spades are held in the hand.
If a club or diamond is led and the other player has neither of these, speculation may be played, and must be played if no other trumps are available.
A player leading speculation must declare it as either a spade or a trump.
Scoring and winning

Cards are worth the following point values:

2¦, 10¦ and J¦ are worth +5 each
2¦, 10¦ and J¦ are worth -5 each
2¦, 10¦, J¦ and A¦ are worth +1 each
6¦ is worth +1 point
Hence the total number of card points per deal is +5. The winner is the first player to reach 31 points.
Game Manual
The game requires from one up to a maximum of five players each round.
Start by placing a single bet.
Each bet is placed on a number between zero and nine and one of four different colors.
Each round, six cards are drawn from the deck.
Bets are counted in sweets.
Each sweet costs RM1.
Each player starts with one sweet.
A bet that matches the color of one of the six cards drawn wins its money back.
A bet that matches the number of one of the six cards drawn wins 5 sweets.
A bet that matches neither the color nor the number of any of the six cards loses 1 sweet.
A bet that matches both the color and the number of a drawn card wins RM50.
A bet that matches both the color and the number of a drawn card, and where another card of the same number but a different color is also among the six drawn, wins RM100 (rough win probabilities are sketched after this list).
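Assuming a 40-card deck of 4 colours by 10 numbers (as described in the manual and the comparison) and 6 cards drawn without replacement, rough win probabilities for a single bet can be computed in R as follows; the draw-without-replacement assumption is ours.

# Win probabilities for a single bet: 40-card deck (4 colours x 10 numbers),
# 6 cards drawn without replacement (assumed).
p_colour <- 1 - choose(30, 6) / choose(40, 6)   # at least one card of the bet's colour
p_number <- 1 - choose(36, 6) / choose(40, 6)   # at least one card of the bet's number
p_exact  <- 1 - choose(39, 6) / choose(40, 6)   # the exact colour-and-number card is drawn

round(c(colour = p_colour, number = p_number, exact = p_exact), 3)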
Game Rules
A player can place only one bet on a number and color per round.
No more than one player can bet on the same number and color in each round.
A player has to confirm his or her choice of bet before the six cards are revealed from the deck.
Comparison
The amount of cards used in Two – Ten Jack is 52 while the game we have created uses 40.
Also, the Two – Ten Jack is played between players, while the game we have created uses a dealer.
Besides that, the Two – Ten Jack is played with a system of adding and subtracting points, while our game replaces trick-taking with placing bets.
Furthermore, the game we have created adds small elements of Western card games such as 21.

Statistics Essay: Interpreting Social Data

Interpreting Social Data

The British Household Panel Survey of 1991 measured many opinions, among other things, of the UK population. One of the questions asked was whether the husband should be the primary breadwinner in the household, while the wife stayed at home. Answers to the question were provided on an ordinal scale, progressing in five ordinances from Strongly disagree to Strongly agree. Results for each ordinance were recorded from male respondents and female respondents. Of survey respondents, 96.8%, or N = 5325.162, answered this question out of a total survey population of N = 5500.829; 3.2%, or N = 175.667, of survey respondents did not answer the question. In lay terms, this means approximately 97% of the survey respondents answered the question, while 3% did not.

The study presents ordinal ranking, or ranking in a qualitative manner, of five sets of concordant pairs of variables: the male and female count for those who strongly agree the husband be the primary earner while the wife stays at home, the male and female count for those who agree, the male and female count for those who are neutral, the male and female count for those who disagree, and the male and female count for those who strongly disagree. The sex cross-tabulation presents numeric data for responses for each of the ten variables, arranged in five variable pairs with male and female responses for each variable pair. Data is presented in terms of number of responses for each of the ten variables.

The counts or number of responses for each variable are dependent variables in the data analysis. We know they are dependent variables because, first, they are presented on the y-axis in the chart graphically representing the data. Dependent variables are graphically represented on the y-axis, with independent variables presented on the x-axis. Causally it becomes more difficult to distinguish between dependent and independent variables at first glance. Dependent variables usually change as a result of independent variables. For example, if one were studying the effect of a certain medication on blood sugar in diabetics, the independent variable would be the amount of medication given to the patient. In a test group or cohort of patients, each would be given a set dosage and their blood sugar responses recorded. One patient may respond with a blood sugar reading of 110 when given 20mg of medicine. Another day the patient, again given 20mg of medicine, may respond with a blood sugar reading of 240. The amount of medicine provided to the patient is fixed, or the independent variable. The response of the patient is variable, and believed to be influenced by, or dependent on, the amount of medicine provided. The dependent variable would therefore be the responding blood sugar reading in each patient.

In this survey, independent variables are the five choices of answers available to the survey takers. These five possible responses are presented to each survey respondent, just as the medicine is provided to the patient in the example above. The respondent then chooses his or her reply to the five possible answers, or chooses not to answer the question at all. The amount of those choosing not to answer at all, 3.2%, is considered statistically irrelevant in the analysis of this data. Data related to non-response is not considered from either an independent variable or dependent variable standpoint.

The amount of responses or response count for a given independent variable in the survey is a dependent variable. The response count will change, at least slightly, from survey to survey. This could be due to a change in survey size, response rate or number of those choosing to respond to the statement, or possible minor fluctuation in percentage response for the five answer possibilities. Although the statistical results of the responses should be similar, given a large enough and representative sample for each survey attempt, some variance is likely to occur. The independent-dependent variable relationship in the "Husband should earn, wife should stay at home" analysis is trickier to get one's mind around than the medical example given above. In the medical example, it is easy to grasp how a medicine could affect blood sugar, and the resulting cause-effect relationship. In this survey, the creation of five answer groups causes the respondents to categorise their opinion into one of the groups, a much more difficult mental construction than more straightforward cause-result examples.

Four examples of dependent variables in these statistics are the number of men who agreed with the statement (525), the number of women who agreed with the statement (520), the number of men who disagreed with the statement (688), and the number of women who disagreed with the statement (997). As described above, we know these are dependent variables because they are caused by the independent variables, the five ordinal answer groups, in the survey.

Overall, empirical data for the results is skewed towards the Disagree / Strongly disagree end of the survey. Three of the independent variables are of particular note. Strongly agree is the lowest response for both men and women, with Disagree being the highest response for both men and women, although according to Gaussian predictions the Not agree/disagree variable should have the highest distribution.

In lay terms, the graphical representation of each of the five possible answers should have looked like a bell-shaped curve. The two independent variables on each end of the chart, Strongly agree and Strongly disagree, should have had a low but approximately equal response. The middle independent variable on the chart, Not agree / disagree, should have been the largest response. This should have produced dependent variables of approximately 935 each for both men and women for the Not agree / disagree variable. Instead, the response for men was 586, or 63% of typical distribution of answers. The response for women was 702, or 75% of the typically distributed answers. The mean, or average, of all responses in this survey is 1065.2, with the mean or average of male responses being 464.6 and the mean or average of female responses being 600.6. Were the responses distributed evenly amongst all five possible answers, these would be the anticipated response counts.

In examining this data, a hypothesis can be put forth that the correlation between the counts on two of the answer possibilities (two of the dependent variables) will be some value other than zero, at least in the population represented by the survey respondents. This hypothesis can be tested using the ordinal symmetric measures produced in the data analysis. As Pilcher describes, when data on two ordinal variables are grouped and given in categorical order, "we want to determine whether or not the relative positions of categories on two scales go together" (1990, 98). Three ordinal symmetric measures, Kendall's tau-b, Kendall's tau-c, and Gamma, were therefore calculated to determine if the order of categories on the amount of agreement to the question would help to predict the order of categories on the count or amount of those selecting each ordinal category. The most appropriate measures of association to evaluate this hypothesis are the two Kendall's tau measures; tau-b incorporates a correction for ties, while tau-c adjusts for tables whose numbers of rows and columns differ. The results of these measures, values of .083 and .102 with an approximate T_b of 6.75, indicate that there is neither a perfect positive nor a perfect negative correlation between the variables. The results do indicate a low level of prediction and approximation of the sampling distribution. The correlation between two of the dependent variables is indeed a value other than zero, supporting the hypothesis.
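A base-R sketch on hypothetical ordinal scores follows; cor.test() gives the basic Kendall rank correlation with a significance test, while the exact tau-b, tau-c and Gamma values reported above (as produced by SPSS) would require a dedicated package.

# Hypothetical ordinal scores on two five-point scales.
set.seed(9)
agreement    <- sample(1:5, 200, replace = TRUE)
second_scale <- pmin(pmax(agreement + sample(-1:1, 200, replace = TRUE), 1), 5)

cor.test(agreement, second_scale, method = "kendall")   # Kendall rank correlation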

Three nominal symmetric measures were also calculated. These showed a weak relationship between the category and count variables, with a value of only .096 for Phi, Cramer's V, and the Contingency Coefficient. These were not used in testing the above hypothesis.

Chebyshev's theorem is a distributional result: the standard deviation is larger when the data are spread out and smaller when the data are compact, and, whether or not the data follow the empirical (bell-shaped) rule, at least a defined percentage of the data will always lie within a given number of standard deviations of the mean (Pilcher 1990).
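Chebyshev's bound can be checked numerically; the sketch below uses simulated, deliberately skewed data.

# At least 1 - 1/k^2 of any data set lies within k standard deviations of the mean.
chebyshev_bound <- function(k) 1 - 1 / k^2
chebyshev_bound(c(2, 3))              # at least 75% within 2 SDs, about 89% within 3 SDs

set.seed(10)
x <- rexp(10000)                      # heavily skewed, certainly not bell-shaped
mean(abs(x - mean(x)) <= 2 * sd(x))   # observed proportion within 2 SDs (>= 0.75)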

In this example, data is compressed into five possible answer variables. The data does not present according to the empirical rule, but is skewed towards the disagreement end of the variable scale. However, Chebyshev's theorem does apply relating to the distribution of data according to standard deviation from the mean for nine of the ten dependent variables. The response count of women who Disagree with the statement the Husband should earn, the wife stay at home, was proportionately larger than would be indicated along a normal distribution. While the response count for men is also statistically high, it is not beyond the predictions of Chebyshev's theorem. If the survey had been conducted with fewer independent variables, say three ordinances instead of five, the resulting data would be more tightly compacted. If the survey had been conducted with ten ordinances, the data would have been more spread out.

REFERENCES

Pilcher, D., 1990. Data Analysis for the Helping Professions. Sage Publications, London.

Importance and application of data mining

Abstract

Today, businesses can increase their profits year by year if a consistent approach is applied. Performing the data mining process can support decision making within the organization. This paper elaborates the importance and the applications of data mining, which can be adopted in various fields depending on the objective, mission, goals and purpose of the study within the organization. Three main areas are taken as examples, namely the hotel, library and hospital sectors, to observe how data mining works in these fields.

Keywords: Data Mining, KDD Process, Decision Trees, Ant Colony Clustering Algorithm, Association Rules, Neural Network, Rough Set

1.0 Introduction

As we know, an organization that conducts business transactions keeps a massive amount of documents or data in a specific database for later retrieval. The data are combined from several departments that carry out different tasks, each of whose functions is aligned with the mission and vision of the organization. According to Imberman (2001), the number of fields in large databases can approach magnitudes of 10^2 to 10^3. It is therefore necessary to base decision making and strategic planning on the existing data, which play an important role in ensuring that any action taken does not harm, and in particular does not bring losses to, the organization. In addition, data quickly become obsolete and outdated as user requirements shift with factors such as trends, money, needs and so forth.

One way to analyze data is to use data mining techniques, which assist the organization through a series of steps that produce valuable output in a short period of time compared with traditional methods, which may involve several methodologies and take longer to complete an investigation of a portion of the data. In business, action must be taken quickly in order to compete with other competitors and to improve performance, both in providing services and in producing high-quality products. Moreover, interpreting the results involves a group of people who contribute creativity and synthesis, which can lead to solutions to the problem or tasks at hand.

Data mining thus assists in various fields, with different purposes depending on the objectives to be achieved. The rest of this paper is organized as follows. Section 2 defines data mining. Section 3 discusses the importance of data mining. Section 4 explains the application of data mining in various fields. Section 5 draws the conclusions.

2.0 Definition of Data Mining

A broad range of definitions has been given by researchers and academics according to their views and the studies they have done. These definitions help to give an idea of data mining before the technique is discussed in more depth.

Basically, the main purpose of data mining is to work with huge amounts of existing data stored in databases by determining suitable variables that contribute to the quality of the predictions used to solve a problem. As defined by Gargano & Raggad (1999):

“Data mining searches for hidden relationships, patterns, correlations, and interdependencies in large databases that traditional information gathering methods (e.g. report creation, pie and bar graph generation, user querying, decision support systems (DSSs), etc.) might overlook”.

Another author agrees that data mining seeks hidden patterns, orientations and trends. Palace (1996) adds:

“Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases”.

Data mining is also defined as a process of extracting knowledge or information, using an appropriate framework or model for the analysis, until an output is produced that helps fulfill the objective of the study. Imberman (2001) describes it:

“As knowledge extraction, information discovery, information harvesting, exploratory data analysis, data archeology, data pattern processing, and functional dependency analysis”.

The next definition agrees with the statement above and adds that the framework or model adopted should expose the real circumstances. As defined by Ma, Chou & Yen (2000):

“Data mining is the process of applying artificial intelligence techniques (such as advanced modeling and rule induction) to a large data set in order to determine patterns in the data”.

On the other hand, data mining proceeds through a number of steps during the analysis, and these steps depend on the methodology chosen; the methodologies do not differ greatly from one another. According to Forcht & Cochran (1999):

“Data mining is an interactive process that involves assembling the data into a format conducive to analysis. Once the data are configured, they must be cleaned by checking for obvious errors or flaws (such as an item that is an extreme outlier) and simply removing them”.

3.0 Importance of Data Mining

As discussed above, data mining benefits many parties at multiple levels of the organization, since the model or framework applied can reduce time and cost. The results then allow the responsible knowledge workers, by analyzing them critically, to translate the information into strategic value.

The process should be carried out carefully so that useful variables or algorithms are not removed or left out of the extraction of reliable data. Data mining techniques help select a portion of the data using appropriate tools to filter outliers and anomalies within the data set. According to Gargano & Raggad (1999), further benefits of data mining include:

· Facilitating the explication of previously hidden information, including the capability to discover rules, classify, partition, associate and optimize.

According to Goebel & Gruenwald (1999), in order to find patterns in the data, several methodologies are used to clarify vagueness and to identify relations between variables within the databases; the outcome guides decision making or forecasts the impact of actions under consideration. The choice of methodology should be determined properly to suit the rules and conditions of the data to be analyzed. The methodologies include:

Statistical Methods: focused mainly on testing of preconceived hypotheses and on fitting models to data.
Case-Based Reasoning (CBR): technology that tries to solve a given problem by making direct use of past experiences and solutions.
Neural Networks: formed from large numbers of simulated neurons, connected to each other in a manner similar to brain neurons which enables the network to “learn”.
Decision Trees: each non-terminal node represents a test or decision on the considered data item; a tree can also be interpreted as a special form of rule set, characterized by its hierarchical organization of rules (a minimal sketch in R follows this list).
Rule Induction: Rules state a statistical correlation between the occurrences of certain attributes in a data item, or between certain data items in a data set.
Bayesian Belief Networks: graphical representations of probability distributions derived from co-occurrence counts in the set of data items.
Genetic algorithms / Evolutionary Programming: formulate hypotheses about dependencies between variables, in the form of association rules or some other internal formalism.
Fuzzy Sets: constitute a powerful approach to deal not only with incomplete, noisy or imprecise data, but may also be helpful in developing uncertain models of the data that provide smarter and smoother performance than traditional systems.
Rough Sets: rough sets are a mathematical concept dealing with uncertainty in data and used as a stand-alone solution or combined with other methods such as rule induction, classification, or clustering methods
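As a minimal illustration of the decision-tree idea referenced in the list above, the sketch below uses R's rpart package on invented customer data; the variables and model are hypothetical and stand in for whatever attributes a real study would use.

library(rpart)

# Invented customer data: will a guest return, given a few profile attributes?
set.seed(11)
n <- 500
dat <- data.frame(
  stays_per_year = rpois(n, 3),
  spend          = rlnorm(n, 5, 0.5),
  business       = rbinom(n, 1, 0.4)
)
p <- plogis(-2 + 0.4 * dat$stays_per_year + 1.2 * dat$business)
dat$returns <- factor(rbinom(n, 1, p), labels = c("no", "yes"))

# Fit a classification tree; each terminal node reads like an if-then rule.
tree <- rpart(returns ~ stays_per_year + spend + business, data = dat,
              method = "class")
print(tree)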

· The ability to seamlessly automate and embed some of the mundane, repetitive, tedious decision steps that do not require continuous human intervention.

Several steps are taken in processing or analyzing the selected data; the process involves filtering, transforming, testing, modeling, visualizing and documenting the result, which is then stored in a database or data warehouse. Each step has a different function and responsibility, with the purpose of easing the process and producing high-quality conclusions by automating the handling of specific conditions. For example, the data warehouse also keeps previous analyses, which allows redundant output at certain steps to be eliminated. Ma, Chou & Yen (2000) stress the characteristics of data mining that help it reach the end of the analysis process. These comprise:

Data pattern determination: Data-access languages or data-manipulation languages (DMLs) identify the specific data that users want to pull into the program for processing or display. It also enables users to input query specifications. Therefore, users simply select the desired information from the menus, and the system builds the SQL command automatically.
Formatting capability: It generates raw data formats, tabular, spreadsheet form, multidimensional-display and visualization.
Content analysis capability: Data mining also has a strong content analysis capability that enables the user to process the specifications written by the end-users.
Synthesis capability: Data mining allows data synthesis to be timely executed.

· Simultaneously reducing cost and potential error encountered in the decision making process.

Basically, data mining can minimize forecasting errors if the steps of the selected methodology are followed properly, avoiding delays in decision making that would have a large impact in a business setting. The data must therefore be handled carefully throughout the steps involved, and the strategic plan should cover the objectives of the analysis, the amount of data, the variables, the relationships between variables, the tests adopted, and so forth. If there is a need to consult professionals about the study, that should also be included in the planning. In an organizational context, a unit or group of people is usually made responsible for discovering hidden patterns on behalf of other departments. Regular meetings should therefore be held between the professionals and the researchers to ensure that the end result fulfills their requirements and improves the performance of the workers, the department and the organization.

In terms of reducing cost, traditional research takes time to acquire data from respondents, depending on the methodologies used and the sample size. A questionnaire can be administered quickly and is less time-consuming, but if interviewing is adopted it certainly takes time, and the researcher may have to meet respondents more than once if there are ambiguities or the answers do not meet the requirements. For some studies the sample is drawn from different locations, requiring the researcher to travel to obtain genuine opinions, which costs a great deal in accommodation, food, flights and so forth. Data mining, by contrast, uses existing data (for example, customer transactions, student registrations, or records of patients undergoing operations) kept in a data warehouse, which greatly reduces the cost of acquiring data. Moreover, because previous studies are stored in the data warehouse, the researcher can first search it once the objective has been determined at the beginning of the study; if a matching study is found, several steps can be skipped or decided easily, showing that data mining reduces both cost and time. According to Gargano & Raggad (1999), data mining also yields long-term benefits that outweigh the costs incurred in developing, implementing and maintaining such systems by a wide margin.

4.0 The application of Data Mining

Nowadays, data mining is widely used, especially by consumer-oriented organizations such as retail, financial, communication and marketing organizations (Palace, 1996). The healthcare sector also benefits from applying data mining to its daily operations. These various fields show that each organization carries out different transactions, the details of which are kept in databases, enabling analysis for multiple purposes such as increasing revenue, gaining more customers and improving customer satisfaction. Moreover, as Palace (1996) notes, the existing data make it possible to determine relationships among internal factors, such as price, product positioning or staff skills, and external factors, such as economic indicators, competition and customer demographics.

Three examples of data mining applications are therefore given, from the hotel, library and hospital sectors, each aiming to reduce or eliminate weaknesses by using well-interpreted results to support decisions about the best solutions. The examples are as follows:

· A data mining approach to developing the profiles of hotel customers.

A study was conducted by Min, Min & Ahmed Emam (2002) with the objective of targeting valued customers for special treatment based on their anticipated future profitability to the hotel. The study addresses several questions regarding customer profiling:

Which customers are likely to return to the same hotel as repeat guests?
Which customers are at greatest risk of defecting to other competing hotels?
Which service attributes are more important to which customers?
How to segment the customer population into profitable or unprofitable customers?
Which segment of the customers’ best fits the current service capacities of the hotels?

From the broad range of data mining methodologies, the researchers adopted decision trees to analyze the data, because of their ability to generate appropriate rules, their ease of visualization and their simplicity. Three steps are followed in this process:

Data collection: selecting data that suit the objective from a previous survey, and removing unwanted data from the databases by filtering the Excel file.
Data formatting: converting all data in the spreadsheet to the Statistical Package for the Social Sciences (SPSS) for the purpose of classification accuracy.
Rule induction: selecting an algorithm for building decision trees, in this case C5.0, to generate sets of rules that give hotel managers important clues for further action.

As a result, the researchers found the "if-then" rules useful in formulating a customer retention strategy, with predictive accuracy ranging from 80.9 per cent to 93.7 per cent depending on the rule conditions.
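The study itself induced rules with the C5.0 algorithm; purely as an illustration of the rules-induction step, the following R sketch fits a small classification tree on entirely made-up guest records. The variables, the invented "repeat guest" rule and the use of the rpart package as a stand-in for C5.0 are all assumptions for the example, not details from the paper.

# Illustrative sketch only: rpart stands in for the C5.0 algorithm used in the study,
# and the guest records below are invented for the example.
library(rpart)
set.seed(42)
n <- 200
guests <- data.frame(
  nights_stayed = rpois(n, 3),
  room_rate     = round(runif(n, 80, 300)),
  used_loyalty  = sample(c("yes", "no"), n, replace = TRUE),
  complaints    = rpois(n, 0.5)
)
# Invented rule for who returns, just so the tree has something to learn
guests$repeat_guest <- factor(ifelse(guests$used_loyalty == "yes" & guests$complaints == 0,
                                     "yes", "no"))
tree <- rpart(repeat_guest ~ nights_stayed + room_rate + used_loyalty + complaints,
              data = guests, method = "class")
print(tree)   # each split reads as an "if-then" rule
# Training-set accuracy; a real study would report accuracy on held-out data
mean(predict(tree, guests, type = "class") == guests$repeat_guest)

Printing the fitted tree shows the splits in an if-then form comparable to the kind of rules the hotel study reports.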

· Using data mining technology to provide a recommendation service in the digital library.

A study conducted by Chen and Chen (2006) aimed to provide a recommendation system architecture to promote services in digital (electronic) libraries. Digital publications come in a broad range of formats, such as audio, video and pictures, which makes it difficult to analyze or define keywords and content in order to obtain information from users and improve digital library services.

In the methodology section, two data mining models were selected:

o Ant Colony Clustering Algorithm;

This model is capable of finding the shortest path, reducing the time needed to find the output that best fits the problem existing in the organization. Each step has a different function and reveals the relations among the variables. The algorithm takes a few steps, as follows (a small illustrative sketch follows the steps):

Step 0: Set the parameters and initialize the pheromone trails.

Step 1: Each ant constructs its solution.

Step 2: Calculate the scores of all solutions.

Step 3: Update the pheromone trails.

Step 4: If the best solution has not changed after a predefined number of iterations, terminate the algorithm; otherwise go to Step 1.
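The study applies ant colony ideas to clustering library records; purely as a loose illustration of the Step 0 to Step 4 loop above, the following R sketch runs a basic ant colony optimisation on a toy shortest-tour problem. The random coordinates, parameter values and stopping rule are all invented for illustration and are not taken from the paper.

# Toy ant colony optimisation for a small shortest-tour problem (illustrative only)
set.seed(1)
n_cities <- 6
coords   <- matrix(runif(n_cities * 2), ncol = 2)
d        <- as.matrix(dist(coords))                 # pairwise distances

tau   <- matrix(1, n_cities, n_cities)              # Step 0: initialize pheromone trails
alpha <- 1; beta <- 2; rho <- 0.5                   # assumed parameter values
n_ants <- 10; max_iter <- 100; patience <- 20

tour_length <- function(tour) sum(d[cbind(tour, c(tour[-1], tour[1]))])

best_len <- Inf; best_tour <- NULL; stagnant <- 0
for (iter in seq_len(max_iter)) {
  tours <- vector("list", n_ants)
  for (a in seq_len(n_ants)) {                      # Step 1: each ant constructs a solution
    tour <- sample(n_cities, 1)
    while (length(tour) < n_cities) {
      i    <- tour[length(tour)]
      cand <- setdiff(seq_len(n_cities), tour)
      w    <- (tau[i, cand]^alpha) * ((1 / d[i, cand])^beta)
      nxt  <- if (length(cand) == 1) cand else sample(cand, 1, prob = w / sum(w))
      tour <- c(tour, nxt)
    }
    tours[[a]] <- tour
  }
  lens <- vapply(tours, tour_length, numeric(1))    # Step 2: score all solutions
  if (min(lens) < best_len) {
    best_len <- min(lens); best_tour <- tours[[which.min(lens)]]; stagnant <- 0
  } else stagnant <- stagnant + 1
  tau <- (1 - rho) * tau                            # Step 3: evaporate and deposit pheromone
  for (a in seq_len(n_ants)) {
    edges <- cbind(tours[[a]], c(tours[[a]][-1], tours[[a]][1]))
    tau[edges] <- tau[edges] + 1 / lens[a]
    tau[edges[, 2:1]] <- tau[edges[, 2:1]] + 1 / lens[a]
  }
  if (stagnant >= patience) break                   # Step 4: stop if no recent improvement
}
print(best_tour)
print(best_len)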

o Association rules to discover hidden patterns.

This model finds items that are borrowed or purchased together and helps uncover relationships in the form of association rules. There are two main steps, as follows (a small illustrative sketch follows the steps):

Step 1: Find all large item sets.

Step 2: Use the large item sets generated in the first step to generate all the effective association rules.
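As a rough illustration of these two steps, the sketch below mines association rules in R with the arules package on a handful of made-up borrowing transactions. The package choice, the transactions and the support/confidence thresholds are assumptions for illustration, not details from the study.

# Minimal association-rule sketch on invented borrowing transactions (illustrative only)
library(arules)

txns <- list(
  c("novel", "poetry"),
  c("novel", "history"),
  c("novel", "poetry", "history"),
  c("history", "biography"),
  c("novel", "poetry")
)
trans <- as(txns, "transactions")

# Step 1 (large item sets) and Step 2 (rule generation) are both handled by apriori()
rules <- apriori(trans,
                 parameter = list(supp = 0.4, conf = 0.7))  # assumed thresholds
inspect(rules)   # e.g. {poetry} => {novel}, with its support and confidence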

As a result, the two models produced more than one solution and yielded many recommendations that can be applied to the various problems that arise in running digital libraries, as well as promoting usage among different levels of users through appropriate mechanisms and suitable services.

· Using a KDD process to forecast the duration of surgery.

A study conducted by Combas, Meskens and Vandamme (2007) aimed to identify classes of surgery likely to take different lengths of time according to the patient's profile, so that the use of the operating theatre could be better scheduled. Many issues in this field motivated the study. For example, an endoscopy unit shares endoscopy tubes across surgeries, but their availability is limited because each one takes 30-45 minutes to clean and sterilize. The scheduling of endoscopies (and all other operating theatre procedures) must therefore take the availability of these resources into account.

The researchers adopted the Knowledge Discovery in Databases (KDD) process to analyze the massive amount of data in the databases. The steps are as follows:

Step 1: Data preparation, in which the selected data must fulfil requirements including secondary diagnoses, "previous active history" and the system affected.
Step 2: Data cleaning, in which the data are filtered to keep only surgical procedures performed at least 40 times (at least 20 times for combinations involving both the surgery and a specific surgeon).
Step 3: Data mining, in which appropriate methods are chosen and tested on a portion of the data; here this involved rough sets and neural networks (a small illustrative sketch follows this list).
Step 4: Validation by comparison, an interpretation process in which the results from the two analysis methods are compared in order to observe the rate of good classification.
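The paper tests rough sets and neural networks; purely as an illustration of what the neural-network side of Step 3 might look like, the sketch below fits a small feed-forward network with R's nnet package on invented surgery records. The variables, the data-generating rule and the network size are assumptions for the example, not the study's actual setup.

# Illustrative only: predicting surgery duration (minutes) from made-up patient/procedure data
library(nnet)

set.seed(7)
n <- 300
surgeries <- data.frame(
  age       = sample(18:85, n, replace = TRUE),
  asa_score = sample(1:4, n, replace = TRUE),   # invented anaesthetic risk class
  procedure = factor(sample(c("endoscopy", "hernia", "appendectomy"), n, replace = TRUE))
)
base <- c(endoscopy = 35, hernia = 70, appendectomy = 55)
surgeries$duration <- base[as.character(surgeries$procedure)] +
  0.3 * surgeries$age + 10 * surgeries$asa_score + rnorm(n, sd = 10)

train <- surgeries[1:200, ]; test <- surgeries[201:300, ]

fit <- nnet(duration ~ age + asa_score + procedure, data = train,
            size = 5, linout = TRUE, decay = 0.01, maxit = 500, trace = FALSE)

pred <- predict(fit, test)
sqrt(mean((pred - test$duration)^2))   # out-of-sample RMSE in minutes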

The researchers then added another three steps to fit the proposed objective and produce the best forecasts of the duration of surgery. These consist of:

o Step 5: Measuring the impact of predicting the duration of surgery on planning. In this step, the durations supplied by the prediction models (empirical laws, rule-based laws, etc.), based on information stored in the database, are used to feed a series of algorithms and heuristics for planning purposes.
o Step 6: Simulation, which makes it possible to simulate the activity of the different theatre suites in terms of the operating sequence determined by the planning methods, under two scenarios: operating data and the patient's profile.
o Step 7: Validation and selection of the best model, in which the results supplied by the simulation model are used to assess the quality of scheduling on the basis of a series of performance indicators, such as the length of time for which the operating theatres are not in use, the number of potential additional hours, and errors in predicting the duration of surgery.

The results were not particularly satisfactory. The main problem seems to be the choice of variable grouping, which may have affected prediction quality.

5.0 Conclusion

In conclusion, data mining can be considered an effective and efficient way to discover, and make visible, information hidden in the data retrieved from databases, which can store huge amounts of data. With the right tools, it enables the content of the data to be analyzed, synthesized and manipulated for various purposes, which often depend on the main business being carried out and the target it defines.

From the discussion above, it can be seen that there are many advantages to performing data mining, especially in the business area. It allows an organization to predict trends, customer requirements, relationships and so forth, so that early preparations can be made and alternative courses of action identified, ensuring the organization can continue its daily operations even when it decides not to act on a particular result.

To produce an end result that satisfies the organization and minimizes errors when the information is used in business transactions, the key variables should be assigned carefully so that they suit the objective proposed for the study; otherwise the procedures have to be repeated when errors are found, and the decision-making process cannot be completed on schedule.

6.0 References

Chen, Chia-Chen & Chen, An-Pin. (2006). Using data mining technology to provide a recommendation service in the digital library. The Electronic Library. 25(6): 711-734.

Combas, C., Meskens, N & Vandamme, J. P. (2007). Using a KDD process to forecast the duration of surgery. International Journal of Production Economics. 112: 279-293.

Forcht, Karen A. & Cochran, Kevin. (1999). Using data mining and data warehousing techniques. Industrial Management & Data Systems. 99(5), 189-196.

Gargano, Michael L. & Raggad, Bel G. (1999). Data mining – a powerful information creating tool. OCLC Systems & Services. 15(2), 81-90.

Goebel, Michael & Gruenwald, Le. (1999). A survey of data mining and knowledge discovery software tools. ACM SIGKDD Explorations Newsletter. 1: 20 – 33.

Imberman, Susan P. (2001). Effective Use of the KDD Process and Data Mining for Computer Performance Professionals. In International Computer Measurement Group Conference. Anaheim, USA, 611-620.

Ma, Catherine, Chou, David C. & Yen, David C. (2000). Data warehousing, technology assessment and management. Industrial Management & Data Systems. 100(3), 125-135.

Min, Hokey, Min, Hyesung & Emam, Ahmed. (2002). A data mining approach to developing the profiles of hotel customers. International Journal of Contemporary Hospitality Management. 14(6): 274-285.

Palace, Bill. (1996, Spring). Data Mining: What is Data Mining? retrieved March 2, 2010, from: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm

Impact of Social Determinants on Health

Song et al (2011) studied the influence of social determinants of health on disease rates. They specified AIDS as the disease of concern and used data from the American Community Survey. They used correlation and partial correlation coefficients to quantify the effect of socioeconomic determinants on AIDS diagnosis rates in certain areas and found that the AIDS diagnosis rate was related to race, marital status and population density. Poverty, education level and unemployment also influence the occurrence of the disease in an individual.
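As a quick illustration of how correlation and partial correlation coefficients can be computed, here is a base-R sketch on invented area-level data (not the survey data the authors used):

# Illustrative only: correlation and partial correlation on made-up area-level data
set.seed(3)
n <- 100
poverty   <- runif(n, 5, 40)                      # invented % of residents below the poverty line
education <- 100 - poverty * 0.8 + rnorm(n, sd = 5)
aids_rate <- 2 + 0.15 * poverty + rnorm(n, sd = 1)

# Simple correlation between poverty and the diagnosis rate
cor(poverty, aids_rate)

# Partial correlation controlling for education:
# correlate the residuals after regressing each variable on education
cor(resid(lm(poverty ~ education)),
    resid(lm(aids_rate ~ education)))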

In both developed and developing countries, socioeconomic status proved to be an important determinant of cardiovascular disease. Survey studies showed that education was the most important socioeconomic determinant in relation to cardiovascular risk factors. Smoking was also a major cause of cardiovascular disease. Low socioeconomic status had a direct relationship with higher levels of cardiovascular risk factors (Yu et al, 2000; Reddy et al, 2002; Jeemon & Reddy, 2010; Thurston et al, 2005; Janati et al, 2011 and Lang et al, 2012).

Lantz et al (1998) investigated the impact of education, income and health behaviors on the risk of dying within the next 7.5 years using a longitudinal survey study. The results of cross-tabulation showed that mortality has a strong association with education and income.

Habib et al (2012) conducted a questionnaire-based survey to measure the social, economic, demographic and geographic influences on bronchial asthma in the Kashmir valley. After analysis in SPSS, they concluded that non-smokers, males working on farms and females working with animals have a high incidence of bronchial asthma. The study also showed a significant relationship between age and the disease.

Arif and Naheed (2012) used the Pakistan Social and Living Standard Measurement Survey 2004-05, conducted by the Federal Bureau of Statistics, to determine the socioeconomic, demographic, environmental and geographical factors of diarrhea morbidity among the sampled children. Their study found a relationship between diarrhea morbidity and economic factors, particularly ownership of land, livestock and housing conditions. The child's gender and age, the total number of children born, the mother's age and education, and the sources of drinking water also showed a significant effect on diarrhea morbidity among children.

Aranha et al (2011) conducted a survey in Sao Paulo, Brazil, to determine the association between children's respiratory diseases reported by parents, attendance at school, parents' educational level, family income and socioeconomic status. By applying the chi-square test they concluded that children's health is associated with parents' higher education, particularly the mother's. Family income, analyzed as per capita income, did not affect the number of reports of respiratory diseases from parents.

Deolalikar and Laxminarayan (2000) used data from the 1997 Cambodia Socioeconomic Survey to estimate the influence of socioeconomic variables on the extent of disease transmission within villages in Cambodia. They concluded that infectious diseases were the leading cause of morbidity in the country. Younger adults were less likely to be infected by others, but the likelihood increased with age. Income and the availability of a doctor had a significant effect on disease transmission.

Survey studies based on different countries showed a strong association between socioeconomic factors (income, education and occupational position) and obesity. The analyses showed that the consumption of low-quality food, driven by economic factors, had a significant effect on obesity. For men, both the highest level of occupational position and the general education completed were found to have a significant effect on obesity, while women in the lowest income group were three times as likely to be obese as women in the highest income group (Kuntz and Lampert, 2010; Akil and Ahmad, 2011 and Larsen et al, 2003).

Yin et al (2011) used data from the 2007 China Chronic Disease Risk Factor Surveillance of 49,363 Chinese men and women aged 15-69 years to examine the association between the prevalence of self-reported, physician-diagnosed Chronic Obstructive Pulmonary Disease (COPD) and socioeconomic status, defined by both educational level and annual household income. Multivariable logistic regression modeling was performed. Among non-smokers, low educational level and low household income were associated with a significantly higher prevalence of COPD.

Siponen et al (2011) studied the relationship between the health of Finnish children under 12 years of age and parental socioeconomic factors (educational level, household income and working status) by conducting a population-based survey. The analysis used Pearson's chi-square tests and logistic regression with 95% confidence intervals. The results showed that parental socioeconomic factors were not associated with the health of children aged under 12 years in Finland.
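For concreteness, the kind of analysis described here (a chi-square test of association plus a logistic regression for a binary health outcome) can be sketched in R on invented data; the variables and coding below are assumptions for illustration, not the study's actual dataset:

# Illustrative only: chi-square test and logistic regression on made-up child-health data
set.seed(11)
n <- 500
kids <- data.frame(
  parent_education = factor(sample(c("basic", "secondary", "higher"), n, replace = TRUE)),
  household_income = round(rnorm(n, 3000, 800)),
  good_health      = rbinom(n, 1, 0.8)            # 1 = parent reports good health
)

# Chi-square test of association between parental education and reported health
chisq.test(table(kids$parent_education, kids$good_health))

# Logistic regression with both socioeconomic factors
fit <- glm(good_health ~ parent_education + household_income,
           data = kids, family = binomial)
summary(fit)
exp(cbind(OR = coef(fit), confint.default(fit)))   # odds ratios with Wald 95% CIs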

Washington State Department of Health (2007) examined Washington adults and inferred that adults with lower incomes or less education were more likely to smoke, to be obese, or to eat fewer fruits and vegetables than adults with higher incomes and more education. In cultures where smoking was culturally unacceptable for women, women died less often from smoking-related diseases than women in groups where smoking was socially accepted. Lack of access to, or inadequate use of, medical services contributed to relatively poorer health among people in lower socioeconomic position groups, and the health care received by the poor was of inferior quality. People of higher socioeconomic position had larger networks of social support, and low levels of social capital were associated with higher mortality rates. People who experienced racism were more likely to have poor mental health and unhealthy lifestyles.

Hosseinpoor et al (2012) took self-reported data, stratified by sex and by low or middle income, from 232,056 adult participants in 48 countries, derived from the 2002-2004 World Health Survey. A Poisson regression model with a robust variance and cross-tabulations were used, yielding the following results: men reported a higher prevalence than women of current daily smoking and heavy episodic alcohol drinking, while women reported a higher prevalence of physical inactivity. In both sexes, low fruit and vegetable consumption was significantly more prevalent in lower socioeconomic groups.
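The "Poisson regression with a robust variance" mentioned here is a standard way to estimate prevalence ratios for a binary outcome. A minimal R sketch on invented survey data follows; the use of the sandwich and lmtest packages, the variables and the coding are all assumptions for illustration, not the authors' analysis:

# Illustrative only: Poisson regression with robust (sandwich) standard errors
# to estimate prevalence ratios for a binary risk factor
library(sandwich)
library(lmtest)

set.seed(5)
n <- 2000
svy <- data.frame(
  smoker = rbinom(n, 1, 0.25),
  sex    = factor(sample(c("male", "female"), n, replace = TRUE)),
  income = factor(sample(c("low", "middle"), n, replace = TRUE))
)

fit <- glm(smoker ~ sex + income, data = svy, family = poisson(link = "log"))

# Robust variance corrects the standard errors for using Poisson with a binary outcome
coeftest(fit, vcov. = vcovHC(fit, type = "HC0"))
exp(coef(fit))   # exponentiated coefficients read as prevalence ratios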

Braveman (2011) concluded that there was a strong relationship between income, education and health: health improved as income or education increased. Stressful events and circumstances followed a socioeconomic gradient, decreasing as income increased.

Lee (1997) examined the effects of age, nativity, population size of place of residence, occupation and household wealth on the disease and mortality experiences of Union army recruits while in service, using logistic regression. The patterns of mortality among recruits differed from those among civilian populations. Wealth had a significant effect only for diseases on which the influence of nutrition was definite. Migration spread communicable diseases and exposed newcomers to different disease environments, which increased morbidity and mortality rates.

Ghias et al (2012) studied HCV-positive patients living in the province of Punjab, Pakistan. Socio-demographic and risk factors were elicited using a questionnaire. Logistic regression and artificial neural network methods were applied, and the patient's education, liver disease history, family history of hepatitis C, migration, family size, history of blood transfusion, injection history, endoscopy, general surgery, dental surgery, tattooing and minor surgery by a barber were found to be the 12 main risk factors with a significant influence on HCV infection.

REFERENCES

Song, R. et al (2011) “Identifying The Impact Of Social Determinants Of Health On Disease Rates Using Correlation Analysis Of Area-Based Summary Information” Public Health Reports Supplement 3, Volume 126, 70-80.
Yu, Z. et al (2000) “Associations Between Socioeconomic Status And Cardiovascular Risk Factors In An Urban Population In China” Bulletin of the World Health Organization Volume 78, No. 11, 1296-1305.
Reddy, K. et al (2002) “Socioeconomic Status And The Prevalence Of Coronary Heart Disease Risk Factors” Asia Pacific J Clin Nutr Volume 11, No. 2, 98–103.
Jeemon, P. & Reddy, K. (2010) “Social Determinants Of Cardiovascular Disease Outcomes In Indians” Indian J Med Res Volume 132, 617-622.
Thurston, R. et al (2005) “Is The Association Between Socioeconomic Position And Coronary Heart Disease Stronger In Women Than In Men?” American Journal of Epidemiology Volume 162, No. 1, 57-65.
Janati, A. et al (2011) “Socioeconomic Status and Coronary Heart Disease” Health Promotion Perspectives Volume 1, No. 2, 105-110.
Lang, T. et al (2012) “Social Determinants Of Cardiovascular Diseases” Public Health Reviews Volume 33, No. 2, 601-622.
Lantz, P. et al (1998) “Socioeconomic Factors, Health Behaviors, and Mortality” JAMA Volume 279, No. 21, 1703-1708.
Habib, A. et al (2012) “Socioeconomic, Demographic and Geographic Influence on Disease Activity of Bronchial Asthma in Kashmir Valley” IOSR Journal of Dental and Medical Sciences (JDMS) ISSN: 2279-0853, ISBN: 2279-0861, Volume 2, No. 6, 04-07.
Arif, A. and Naheed, R. (2012) “Socio-Economic Determinants Of Diarrhoea Morbidity In Pakistan” Academic Research International ISSN-L: 2223-9553, ISSN: 2223-9944, Volume 2, No. 1, 490-518.
Aranha, M. et al (2011) “Relationship Between Respiratory Tract Diseases Declared By Parents And Socioeconomic And Cultural Factors” Rev Paul Pediatr Volume 29, No. 3, 352-356.
Deolalikar, A. and Laxminarayan, R. (2000) “Socioeconomic Determinants of Disease Transmission in Cambodia” Resources for the Future Discussion Paper, 00–32.
Kuntz, B. and Lampert, T. (2010) “Socioeconomic Factors and Obesity” Deutsches Arzteblatt International Volume 107, No. 30, 517-22.
Akil, L. and Ahmad, H. (2011) “Effects Of Socioeconomic Factors On Obesity Rates In Four Southern States And Colorado” Ethnicity & Disease Volume 21, 58-62.
Larsen, P. et al (2003) “The Relationship of Ethnicity, Socioeconomic Factors, and Overweight in U.S. Adolescents” Obesity Research Volume 11, No. 1, 121-129.
Yin, P. et al (2011) “Prevalence Of COPD And Its Association With Socioeconomic Status In China: Findings From China Chronic Disease Risk Factor Surveillance 2007” BMC Public Health Volume 11, 586-593.
Siponen, M. et al (2011) “Children’s Health And Parental Socioeconomic Factors: A Population-Based Survey In Finland” BMC Public Health Volume 11, 457-464.
Washington State Department of Health (2007) “Social and Economic Determinants of Health” The Health of Washington State Volume 1, No. 3, 01-07.
Hosseinpoor, A. et al (2012) “Socioeconomic inequalities in risk factors for noncommunicable diseases in low-income and middle income countries: results from the World Health Survey” BMC Public Health Volume 12, 912-924.
Braveman, P. (2011) “Accumulating Knowledge on the Social Determinants of Health and Infectious Disease” Public Health Reports Supplement 3, Volume 126, 28-30.
Lee, C. (1997) “Socioeconomic Background, Disease, and Mortality among Union Army Recruits: Implications for Economic and Demographic History” Explorations in Economic History Volume 34, 27-55.
Ghias, M. et al (2012) “Statistical Modelling and Analysis of Risk Factors for Hepatitis C Infection in Punjab, Pakistan” World Applied Sciences Journal Volume 20, No. 2, 241-252.