Classification of HTML Documents
- Authors: Xie, Wei
- Date: 2006
- Type: Text , Thesis , PhD
- Description: Text Classification is the task of mapping a document into one or more classes based on the presence or absence of words (or features) in the document. It is being studied intensively, and different classification techniques and algorithms have been developed. This thesis focuses on the classification of online documents, which has become more critical with the development of the World Wide Web. The WWW vastly increases the availability of online documents in digital format and has highlighted the need to classify them. Against this background, we have noted the emergence of “automatic Web Classification”. This work mainly concentrates on classifying HTML-like documents into classes or categories, not only by using the methods inherited from the traditional Text Classification process, but also by utilizing the extra information provided only by Web pages. Our work is based on the fact that Web documents contain not only ordinary features (words) but also extra information, such as meta-data and hyperlinks, that can be used to benefit the classification process. The aim of this research is to study various ways of using the extra information, in particular the hyperlink information provided by HTML documents (Web pages). The merit of the approach developed in this thesis is its simplicity compared with existing approaches. We present different approaches to using hyperlink information to improve the effectiveness of Web classification. Unlike other work in this area, we use only the mappings between linked documents and their own class or classes. In this case, we only need to add a few features, called linked-class features, to the datasets and then apply classifiers to them. In the numerical experiments we adopted two well-known Text Classification algorithms, Support Vector Machines and BoosTexter. The results show that classification accuracy can be improved by using mixtures of ordinary and linked-class features. Moreover, out-links usually work better than in-links in classification. We also analyse and discuss the reasons behind this improvement.
- Description: Master of Computing
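
The linked-class idea described in the abstract above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the features are encoded, so the per-class counts used here, the function names (`linked_class_features`, `build_features`), and the toy pages, vocabulary, and class labels are all assumptions made for illustration. The sketch builds a binary word-presence vector (the "ordinary" features) and appends one extra value per class that counts how many linked documents carry that class.

```python
from collections import Counter


def linked_class_features(doc_id, links, labels, classes):
    """Return one count per class: how many documents linked from doc_id
    are labelled with that class (hypothetical encoding, assumed here)."""
    counts = Counter()
    for target in links.get(doc_id, []):
        # A linked document may belong to several classes, or to none we know of.
        for cls in labels.get(target, []):
            counts[cls] += 1
    return [counts[cls] for cls in classes]


def build_features(doc_id, words, vocabulary, links, labels, classes):
    """Concatenate ordinary word-presence features with linked-class features."""
    present = set(words)
    ordinary = [1 if term in present else 0 for term in vocabulary]
    return ordinary + linked_class_features(doc_id, links, labels, classes)


# Toy data, invented purely for illustration.
classes = ["course", "faculty"]
vocabulary = ["homework", "lecture", "research"]
out_links = {"page_a": ["page_b", "page_c"], "page_b": []}
labels = {"page_b": ["course"], "page_c": ["course", "faculty"]}

row = build_features(
    "page_a", ["lecture", "homework", "lecture"],
    vocabulary, out_links, labels, classes,
)
print(row)  # [1, 1, 0, 2, 1]
```

The same function would serve for in-links by passing a mapping from each page to the pages that link to it instead of `out_links`. The resulting rows could then be handed to any standard classifier, such as the Support Vector Machines or BoosTexter mentioned in the abstract.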
.comUnity : A study on the adoption and diffusion of internet technologies in a regional tourism network
- Authors: Braun, Patrice
- Date: 2003
- Type: Text , Thesis , PhD
- Description: This thesis describes the initiation and evolution of an action research project, which investigates the adoption and diffusion of Internet technologies in a regional Australian tourism network. The research evolved out of a portal development consultancy. The aim of the study was two-fold: to investigate the nature of the change process when a collaborative network seeks to adopt e-commerce; and to determine how the change process differed in the face of incremental change (adding some e-commerce solutions to the network) or radical change (changing the overall business model). The purpose of the study was to gain a better understanding of the economic, strategic and social potential of regional business networks in the current techno-economic climate. The study builds on Rogers' (1995) seminal work on the diffusion of innovations and makes a unique contribution to existing diffusion studies by its focus on the nature of the network links as the unit of analysis, and by its application of an action-oriented methodology to untangle the effects of the embedded network structure on diffusion. The study suggests a strong relationship between diffusion and network positioning, both in terms of place (status and position in the network) and space (the geographic make-up of the network). Diffusion further hinged on network cohesion and on actors' trust in and engagement with the network. Adoption of e-commerce was obstructed by actors' worldview and by their lack of time, reflexive learning, and commitment to change. The incorporation in the study's diffusion framework of contextual moderators such as network position, worldview, trust, time and commitment considerably extends Rogers' traditional diffusion framework. Based on its emergent analysis framework, the study introduces a dynamic change model towards sustainable regional network development. It is suggested that both the diffusion framework and the regional innovation model developed in this study may, either jointly or separately, be applicable beyond the tourism and service sector.
- Description: Doctor of Philosophy